diff --git a/github-data/discussions/100 - New argument _ env variable for GGML_SCHED_MAX_COPIES_.md b/github-data/discussions/100 - New argument env variable for GGML_SCHED_MAX_COPIES.md
similarity index 67%
rename from github-data/discussions/100 - New argument _ env variable for GGML_SCHED_MAX_COPIES_.md
rename to github-data/discussions/100 - New argument env variable for GGML_SCHED_MAX_COPIES.md
index 920d88c09..2ded518e1 100644
--- a/github-data/discussions/100 - New argument _ env variable for GGML_SCHED_MAX_COPIES_.md
+++ b/github-data/discussions/100 - New argument env variable for GGML_SCHED_MAX_COPIES.md
@@ -1,26 +1,27 @@
-### 🗣️ [#100](https://github.com/ikawrakow/ik_llama.cpp/discussions/100) - New argument / env variable for GGML_SCHED_MAX_COPIES?
+## 🗣️ [Discussion #100](https://github.com/ikawrakow/ik_llama.cpp/discussions/100) - New argument / env variable for GGML_SCHED_MAX_COPIES?
| **Author** | `Nexesenex` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2024-10-21 |
| **Updated** | 2024-10-21 |
---
-#### Description
+## 📄 Description
@ikawrakow, could you set up a CLI argument (or at least an env variable, it's much simpler I guess but I'm failing to do it right) to determine GGML_SCHED_MAX_COPIES without recompiling? It impacts VRAM occupation and performances, and it'd be great to set that up conveniently for benching and customized use.
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2024-10-21** at **08:29:25**:
+👤 **ikawrakow** commented on **2024-10-21** at **08:29:25**
I haven't looked into this at all. What is it good for?
---
-👤 **Nexesenex** replied the **2024-10-21** at **09:36:22**:
+👤 **Nexesenex** commented on **2024-10-21** at **09:36:22**
It's supposed to go faster inference on multi-GPU I guess. Mainline sets it at 4, I set it at 1, because I didn't notice much improvement back in the days, but I noticed more vram consumption and gpu load.
\ No newline at end of file
diff --git a/github-data/discussions/104 - Convenience improvements for llama-quantize.md b/github-data/discussions/104 - Convenience improvements for llama-quantize.md
index c4867acd1..6d5a81bd1 100644
--- a/github-data/discussions/104 - Convenience improvements for llama-quantize.md
+++ b/github-data/discussions/104 - Convenience improvements for llama-quantize.md
@@ -1,13 +1,14 @@
-### 🗣️ [#104](https://github.com/ikawrakow/ik_llama.cpp/discussions/104) - Convenience improvements for llama-quantize
+## 🗣️ [Discussion #104](https://github.com/ikawrakow/ik_llama.cpp/discussions/104) - Convenience improvements for llama-quantize
| **Author** | `Nexesenex` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2024-10-23 |
| **Updated** | 2024-10-23 |
---
-#### Description
+## 📄 Description
Hey IK.
diff --git a/github-data/discussions/140 - Questions about weight_j_.md b/github-data/discussions/140 - Questions about weightj.md
similarity index 93%
rename from github-data/discussions/140 - Questions about weight_j_.md
rename to github-data/discussions/140 - Questions about weightj.md
index f34877a18..253466654 100644
--- a/github-data/discussions/140 - Questions about weight_j_.md
+++ b/github-data/discussions/140 - Questions about weightj.md
@@ -1,13 +1,14 @@
-### 🗣️ [#140](https://github.com/ikawrakow/ik_llama.cpp/discussions/140) - Questions about weight[j]
+## 🗣️ [Discussion #140](https://github.com/ikawrakow/ik_llama.cpp/discussions/140) - Questions about weight[j]
| **Author** | `DavidZyy` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2024-12-13 |
| **Updated** | 2025-02-11 |
---
-#### Description
+## 📄 Description
Hi @ikawrakow, your work on quantization is amazing and I really admire them. Recently, I am reading codes about this and have some questions.
For example, at funtion `quantize_row_q4_0_impl` and other places, `weight[j]` is:
@@ -21,9 +22,9 @@ weight[j] = qw[j]
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2024-12-14** at **08:13:19**:
+👤 **ikawrakow** commented on **2024-12-14** at **08:13:19**
Hi @DavidZyy,
@@ -40,7 +41,7 @@ Why the need for correcting the Hessian in the first place?
---
-👤 **DavidZyy** replied the **2024-12-14** at **13:58:43**:
+👤 **DavidZyy** commented on **2024-12-14** at **13:58:43**
Thanks for taking time to answer this question and share information, I learned a lot from your answers.
Yes, it's very interesting :)
@@ -48,7 +49,7 @@ Yes, it's very interesting :)
---
-👤 **jukofyork** replied the **2025-02-10** at **17:03:34**:
+👤 **jukofyork** commented on **2025-02-10** at **17:03:34**
Oh shit, I just realised I totally forgot to reply to this post! @ikawrakow Thanks for the explanation!
@@ -66,7 +67,7 @@ but I still suspect that for these new very-high-expert-MoEs it should really be
---
-👤 **ikawrakow** replied the **2025-02-10** at **18:07:55**:
+👤 **ikawrakow** commented on **2025-02-10** at **18:07:55**
@jukofyork So, I have used regularization in a variety of contexts. Sadly, having spent the better part of my career in Medical Device where everything is closed source, there aren't many examples of that in the open. [This repository](https://github.com/ikawrakow/mnist) uses Tikhonov regularization for the training of an SVM model to recognize hand written digits. I put it out there because I find it funny that with fewer lines of code I can beet the [ggml mnist example](https://github.com/ggml-org/ggml/tree/master/examples/mnist) by a huge margin (0.4% vs 2% error rate, so 5X lower). But having used ragularization techniques in deformable image registration, large scale optimization of radiation therapy treatments, real-time target and/or critical organ tracking on live MRI images, MR and PET image reconstruction, etc., I think I know quite well when regularization is required, and LLM quantization is not one of the cases where it is, at least not in the classical sense of adding penalty term(s) to the optimization objective. For instance, Tikhonov regularization that was being proposed in one of the discussions, is pretty much the last thing we want to do when quantizing because we definitely do not want to make the quantized values as small as possible, which is the goal of the Tikhonov regularization term. At some level, one can consider i-quants as using "regularization" via forcing groups of quants to fall on a finite set of grid points, the set being much smaller than all possible grid points for the given number of bits per quant. E.g., `IQ2_XXS` uses 256 out of 6561 points on the E8 lattice. This prevents overfitting, thus can be considered as "regularization".
@@ -74,7 +75,8 @@ The other thing I have learned is that theories are rarely useful in their pure
Just my 2 cents
-> 👤 **jukofyork** replied the **2025-02-10** at **19:26:00**:
+> 👤 **jukofyork** replied on **2025-02-10** at **19:26:00**
+>
> > For instance, Tikhonov regularization that was being proposed in one of the discussions, is pretty much the last thing we want to do when quantizing because we definitely do not want to make the quantized values as small as possible, which is the goal of the Tikhonov regularization term.
>
> I was late to that discussion, but it was possibly me who mentioned this.
@@ -138,8 +140,9 @@ Just my 2 cents
> I am certainly no "Bayesian purist" and will happily tune the prior to get the best observed results too!
>
> BUT: I strongly believe the effectiveness of the `imatrix` calculations could be vastly improved by adding some method of interpolation/regularisation/whatever to allow for informed tuning of the weighting factors! :smile:
+
+> 👤 **saood06** replied on **2025-02-10** at **20:23:18**
>
-> 👤 **saood06** replied the **2025-02-10** at **20:23:18**:
> > I still think this is an important area to consider (whatever the chosen regularization method is):
> > #### (A) I see people still using using bartowski's same ~250kb `calibration_datav3.txt` file on `Deepseek-V3` as on fully-dense models.
> >
@@ -159,8 +162,9 @@ Just my 2 cents
> From: https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3#6758d52499eea0c4b65d0475
>
> They do discuss the idea of needing more data because of MoE in that thread. I use their imatrix.dat files, and my ppl numbers I gave you are for IQ4_K_R4.
+
+> 👤 **ikawrakow** replied on **2025-02-11** at **06:01:32**
>
-> 👤 **ikawrakow** replied the **2025-02-11** at **06:01:32**:
> Is the inability to activate al experts observed just for layer 0 or for all layers?
>
> Are people aware of the fact that one can run the model with more active experts than specified by the meta data?
@@ -170,8 +174,9 @@ Just my 2 cents
> I think doing that will likely help activate more experts.
>
> I also don't understand why the entire experts tensor cannot be imatrix-quantized if just one expert is missing. If that's what we ended up with, it definitely needs fixing.
+
+> 👤 **saood06** replied on **2025-02-11** at **15:17:30**
>
-> 👤 **saood06** replied the **2025-02-11** at **15:17:30**:
> > Is the inability to activate al experts observed just for layer 0 or for all layers?
>
> Just layer 0.
@@ -201,19 +206,22 @@ Just my 2 cents
> They never reported that for any of the Deepseek models so I'm assuming they only encountered it with arctic and no matter what they did they were never able to activate that expert so I'm giving some credence to their theory that "There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate."
>
> Looking at the files in safetensors each expert is stored separately but with a GGUF that is not the case and they are all stored together.
+
+> 👤 **ikawrakow** replied on **2025-02-11** at **16:33:38**
>
-> 👤 **ikawrakow** replied the **2025-02-11** at **16:33:38**:
-> Thanks for making me aware of this situation. I prepared PR #202 to deal with it.
+> Thanks for making me aware of this situation. I prepared PR [#202](https://github.com/ikawrakow/ik_llama.cpp/issues/202) to deal with it.
+
+> 👤 **ikawrakow** replied on **2025-02-11** at **17:11:08**
>
-> 👤 **ikawrakow** replied the **2025-02-11** at **17:11:08**:
> > but are you sure that is recommended?
>
> I don't know if it is recommended. What I do know is that one can improve low bpw quantization by using a slightly higher number of active experts. E.g., for DeepSeek-Lite, 8 instead of 6 active experts is distinctly better for `IQ1_S` and `IQ1_M`. IIRC, 3 instead of 2 active experts did improve `IQ1_S` and `IQ1_M` quantized Mixtral8x7. As you increase the bpw the advantage goes away and eventually becomes counter productive. Using 3 instead of 2 experts for Mixtral8x7 was futile at 4+ bpw. But these new models have way more experts and more active experts, so activating additional experts is more forgiving. A quick check with DeepSeek-Lite (6 active experts as per meta data):
> * For 7 experts PPL is slightly lower (-0.2%)
> * For 8 and 9 experts it is about the same
> * For 10 experts PPL is ~0.3% higher.
+
+> 👤 **saood06** replied on **2025-02-11** at **17:27:49**
>
-> 👤 **saood06** replied the **2025-02-11** at **17:27:49**:
> With R1 I've come across a person saying "I tried with 10 and 12 experts and generating perplexity failed with NaNs." and this same person tested 2,3,4,6,8,16 of unsloth's IQ1_M. His results below.
>
> Experts | PPL
@@ -235,8 +243,9 @@ Just my 2 cents
> IQ3_XXS (exp=4) | 2.87 | 3.61 | 2.60 | 2.25 | 2.09 | 1.97 | 1.89 | 1.87
> IQ3_XXS (exp=6) | 2.67 | 3.53 | 2.53 | 2.13 | 1.94 | 1.80 | 1.71 | 1.65
> IQ3_XXS (def) | 2.69 | 3.53 | 2.51 | 2.11 | 1.91 | 1.78 | 1.69 | 1.62
+
+> 👤 **jukofyork** replied on **2025-02-11** at **19:22:47**
>
-> 👤 **jukofyork** replied the **2025-02-11** at **19:22:47**:
> > > but are you sure that is recommended?
> >
> > I don't know if it is recommended. What I do know is that one can improve low bpw quantization by using a slightly higher number of active experts. E.g., for DeepSeek-Lite, 8 instead of 6 active experts is distinctly better for `IQ1_S` and `IQ1_M`. IIRC, 3 instead of 2 active experts did improve `IQ1_S` and `IQ1_M` quantized Mixtral8x7. As you increase the bpw the advantage goes away and eventually becomes counter productive. Using 3 instead of 2 experts for Mixtral8x7 was futile at 4+ bpw. But these new models have way more experts and more active experts, so activating additional experts is more forgiving. A quick check with DeepSeek-Lite (6 active experts as per meta data):
@@ -248,8 +257,9 @@ Just my 2 cents
> > * For 10 experts PPL is ~0.3% higher.
>
> Yeah, I managed to do this with `dbrx` before the PR that fixes the divisors for the experts separately. IIRC, I actually activated all the experts for `dbrx` and it got a better resulting `imatrix` than the pre-PR code did, and was quite usable.
+
+> 👤 **jukofyork** replied on **2025-02-11** at **19:24:47**
>
-> 👤 **jukofyork** replied the **2025-02-11** at **19:24:47**:
> > With R1 I've come across a person saying "I tried with 10 and 12 experts and generating perplexity failed with NaNs." and this same person tested 2,3,4,6,8,16 of unsloth's IQ1_M. His results below.
>
> This could be because most previous MoEs use softmax to gate/weight with, so as you add more experts is scales down the weights, but `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger (you can probably also hack the weights and bias to counter this though).
@@ -260,13 +270,15 @@ Just my 2 cents
> INFO:hf-to-gguf:blk.11.exp_probs_b.bias, torch.float32 --> F32, shape = {256}
> INFO:hf-to-gguf:blk.11.ffn_gate_inp.weight, torch.bfloat16 --> F32, shape = {7168, 256}
> ```
+
+> 👤 **saood06** replied on **2025-02-11** at **20:24:39**
>
-> 👤 **saood06** replied the **2025-02-11** at **20:24:39**:
> > `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger
>
> Then why does 16 experts work, but not 10/12?
+
+> 👤 **jukofyork** replied on **2025-02-11** at **20:33:32**
>
-> 👤 **jukofyork** replied the **2025-02-11** at **20:33:32**:
> > > `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger
> >
> > Then why does 16 experts work, but not 10/12?
diff --git a/github-data/discussions/15 - Will LQER improve k- and i-quants_.md b/github-data/discussions/15 - Will LQER improve k- and i-quants.md
similarity index 95%
rename from github-data/discussions/15 - Will LQER improve k- and i-quants_.md
rename to github-data/discussions/15 - Will LQER improve k- and i-quants.md
index 852166d4d..589fb7751 100644
--- a/github-data/discussions/15 - Will LQER improve k- and i-quants_.md
+++ b/github-data/discussions/15 - Will LQER improve k- and i-quants.md
@@ -1,13 +1,14 @@
-### 🗣️ [#15](https://github.com/ikawrakow/ik_llama.cpp/discussions/15) - Will LQER improve k- and i-quants?
+## 🗣️ [Discussion #15](https://github.com/ikawrakow/ik_llama.cpp/discussions/15) - Will LQER improve k- and i-quants?
| **Author** | `ikawrakow` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2024-08-09 |
-| **Updated** | 2025-07-12 |
+| **Updated** | 2025-07-22 |
---
-#### Description
+## 📄 Description
[LQER/L²QER](https://arxiv.org/pdf/2402.02446) is the latest hype about LLM quantization. Promptly, there is an [issue](https://github.com/ggerganov/llama.cpp/discussions/8831) in `llama.cpp` to use that to improve the existing quantization methods because, you know, the gras is always greener on the other side of the road. But, unlike many earlier calls to improve quantization with the latest "SOTA" quantization advertisement, err, scientific paper, on arXiv, there are already efforts underway to actually implement this. E.g., [this PR](https://github.com/ggerganov/llama.cpp/pull/8939) adds Numpy dequantization so one can use Numpy to do the SVD of the difference between the full model and a quantized model.
@@ -51,9 +52,9 @@ Pinging @compilade who seems to be the main driving force behind implementing LQ
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **compilade** replied the **2024-08-09** at **15:12:32**:
+👤 **compilade** commented on **2024-08-09** at **15:12:32**
Thanks for pinging me, it's interesting to learn about your past attempts with SVD.
@@ -73,7 +74,7 @@ Also, I have not yet implemented Numpy dequantization for most of the `IQ` types
---
-👤 **ikawrakow** replied the **2024-08-09** at **16:01:22**:
+👤 **ikawrakow** commented on **2024-08-09** at **16:01:22**
> Also, I have not yet implemented Numpy dequantization for most of the IQ types, only IQ4_NL and IQ4_XS, because the grids for the others are a bit large. Ideally, they should be generated at runtime with a minimal amount of magic numbers. Is that possible?
@@ -93,11 +94,12 @@ But then again, I'm one of those people suffering from the NIH syndrome, so used
---
-👤 **ikawrakow** replied the **2024-08-27** at **15:11:01**:
+👤 **ikawrakow** commented on **2024-08-27** at **15:11:01**
Btw, on [this branch](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/try_svd) there is some exploration of using SVD before or after the quantization. I have misused the `quantize-stats` tool to look at how the root-mean-square-error (rmse) behaves as a function of the number of SVD components. One can do the SVD before or after quantization. Certainly not production quality, AVX2-only vectorization, very simple multi-threading, but still enough to see that SVD does not add any value to LLMs quantization when the quantization works reasonably well. I know it works because full SVD reduces rmse to zero.
-> 👤 **compilade** replied the **2024-08-27** at **16:59:19**:
+> 👤 **compilade** replied on **2024-08-27** at **16:59:19**
+>
> Thanks!
>
> I see that when `SVD_BEFORE` is `false`, the initial output fed into `try_svd` is non-zero, and SVD is [done on the subtraction of input and output](https://github.com/ikawrakow/ik_llama.cpp/blob/63fc8014a25e5192b618e0d8f869f8c507c99793/examples/quantize-stats/quantize-stats.cpp#L317), which means this does look similar to LQER (while also quantizing the low-rank tensor?) if I understand it correctly. Still feels like a good proof of concept, even though it doesn't test using SVD both before quantization (to remove low-rank components from the input) *and* after (to then correct both the additional low-rank error and the quantization error) at the same time. It's helpful to know that plain LQER is worse than better quantization.
@@ -108,13 +110,14 @@ Btw, on [this branch](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/try_svd)
---
-👤 **ikawrakow** replied the **2024-09-11** at **14:31:14**:
+👤 **ikawrakow** commented on **2024-09-11** at **14:31:14**
@compilade With your PR-9400 in `llama.cpp` I now have to write GGUF loading and link against `ggml` when I want to take a quick look at an imatrix? Instead of just copy/pasting the 20 LOC of imatrix structure definition and (de-)serialization into a `.cpp` file and being done in 5 minutes? Ouch. And no, HF tools will with 99.99% probability not help me with what I'm interested in. I mean, having a Python imatrix to GGUF converter is I guess great for those who want to look at imatrix files on HF, but changing the imatrix tool to output GGUFs is a bit too much afaik.
Oh well, I'll need to keep my own copy of the `imatrix` and `quantize` tools.
-> 👤 **ngxson** replied the **2024-09-11** at **15:17:56**:
+> 👤 **ngxson** replied on **2024-09-11** at **15:17:56**
+>
> Hi and sorry if this change disrupts your workflow.
>
> The main reason behind this change was that we want to unify file formats in llama.cpp. From the perspective of software engineering, is needed because it could help abstract out some parts of the implementation, thus provide a better code base for more features to come in the future.
@@ -126,8 +129,9 @@ Oh well, I'll need to keep my own copy of the `imatrix` and `quantize` tools.
> Another option would be have a CLI arg in imatrix to select the output file format, although this may make the code a bit harder to maintain.
>
> In anyway, I appreciate your work and would love to know if we can do anything to help you.
+
+> 👤 **ikawrakow** replied on **2024-09-11** at **16:01:09**
>
-> 👤 **ikawrakow** replied the **2024-09-11** at **16:01:09**:
> > In anyway, I appreciate your work and would love to know if we can do anything to help you.
>
> Not merge PR-9400? Or just merge the imatrix to GGUF Python conversion script?
@@ -138,16 +142,18 @@ Oh well, I'll need to keep my own copy of the `imatrix` and `quantize` tools.
> > Contrary to what you said (to have HF to visualize the GGUF file), in fact, this change does introduce a headache to HF backend,
>
> I see. We make a change that introduces headaches, triples or quadruples the code required to load/save such files thus magnifying the probability for bugs, and mandates linking against `libggml.so` for any tool that wants to operate with such files, to gain the benefit of "unifying file formats in llama.cpp"? Where the thing being unified is not some monstrous code with thousands of lines of code and massive maintenance burden but a 20 LOC thing that defines the format and implements (de-)serialization? Cool.
+
+> 👤 **ikawrakow** replied on **2024-09-11** at **16:19:12**
>
-> 👤 **ikawrakow** replied the **2024-09-11** at **16:19:12**:
> > From the perspective of software engineering, is needed because it could help abstract out some parts of the implementation, thus provide a better code base for more features to come in the future.
> ```
> ls -al ./ggml/src/libggml.so
> -rwxrwxr-x 1 iwan iwan 369408304 Sep 9 20:11 ./ggml/src/libggml.so
> ```
> Don't know about you, but having to link against a 370 MB `.so` to abstract 20 LoC does not add up afaik.
+
+> 👤 **ngxson** replied on **2024-09-11** at **16:57:26**
>
-> 👤 **ngxson** replied the **2024-09-11** at **16:57:26**:
> Regarding the merge decision, I can't determine whether it will be merged or not. My role is to provide clarity and explore options to help.
>
> The abstraction here isn't just about code length, but about creating a unified approach for tensor save/load operations within llama.cpp. In the future, this could also make it easier to add more parameters to imatrix.gguf file. It also allows more users to experiment with imatrix directly in the GGUF format, without needing conversions.
@@ -155,8 +161,9 @@ Oh well, I'll need to keep my own copy of the `imatrix` and `quantize` tools.
> I completely agree that linking against a 370 MB .so file is not desirable. However, it's worth noting that your `libggml.so` is likely built with CUDA support, which significantly increases its size. Also, the GGUF-related code is actually a small fraction of the whole ggml library.
>
> To address your specific workflow needs, I have a suggestion that might help: What if I provide you a header-only GGUF loader? This could potentially allow you to work with GGUF files without the need for linking against the full `libggml.so`. I've been considering this idea for a while, but couldn't find a valid usage for it.
+
+> 👤 **compilade** replied on **2024-09-12** at **02:48:39**
>
-> 👤 **compilade** replied the **2024-09-12** at **02:48:39**:
> @ikawrakow Thanks for expressing concern about the format change.
>
> The main reason for it is that there doesn't seem to be a backward-compatible way to make the non-GGUF-based `imatrix` format work with many ubatches per chunk, or many chunks per ubatches (in the simple format, ncalls is tied to the ubatch size but is also somehow used as the number of chunks). It's also impossible to get the chunk size used to make a non-GGUF `imatrix` file from its metadata. (The convert script assumes 512 was used, but that's not always true. This is mostly relevant when merging `imatrix` files with `--in-file`)
@@ -167,7 +174,7 @@ Oh well, I'll need to keep my own copy of the `imatrix` and `quantize` tools.
---
-👤 **ikawrakow** replied the **2024-09-12** at **13:16:15**:
+👤 **ikawrakow** commented on **2024-09-12** at **13:16:15**
@compilade Thank you for responding to my concerns.
@@ -204,7 +211,8 @@ void read_imatrix(std::istream in, ...) {
```
Voila, all existing imatrices continue to work, you can add whatever extensions you like (anywhere you like, not just at the end), we don't need to include `ggml/gguf` headers and link against a 370 MB `libggml.so`, etc.
-> 👤 **compilade** replied the **2024-09-13** at **01:56:41**:
+> 👤 **compilade** replied on **2024-09-13** at **01:56:41**
+>
> > I must admit I don't understand the concerns. The issue is that one cannot (correctly) combine imatrices computed with different `u_batch` sizes? (One can always combine them, but the files will not contribute to the combined imatrix with the correct weight). Why would one want to do that? AFAIK, not needing to worry about batch and u-batch sizes is a feature, not a bug.
>
> The sanest way to both not worry about batch sizes and correctly combine `imatrix` files is to store the number of tokens (or activations in this case) instead of the number of "chunks". This is what is done in the GGUF-based format. You're right that the chunk size in the metadata isn't really necessary. I *think* it would be possible to make it work that way in the simper format, but there would still be some weirdness with MoE tensors.
@@ -265,8 +273,9 @@ Voila, all existing imatrices continue to work, you can add whatever extensions
> - Can't make stand-alone programs for quantization experiments like before
> - Need to link to `libggml.so` to use GGUF-based `imatrix` files
> - Or need to include some `gguf.h` header-only library
+
+> 👤 **compilade** replied on **2025-07-12** at **14:18:22**
>
-> 👤 **compilade** replied the **2025-07-12** at **14:18:22**:
> @ikawrakow
>
> I made changes to since last time.
@@ -284,8 +293,9 @@ Voila, all existing imatrices continue to work, you can add whatever extensions
> I've had some complaints regarding using the filename extension to select the imatrix format. The alternative would be a format flag, but you would need to know about it (especially if the default isn't the format you're used to).
>
> It's still not completely clear to me what or how strict your requirements are. Is it closer to "GGUF imatrix files should not exist", "GGUF imatrix should only be used deliberately" (e.g. by using the `.gguf` suffix), or "a format flag for the previous format would be enough, even if the default is GGUF"?
+
+> 👤 **ikawrakow** replied on **2025-07-12** at **17:19:43**
>
-> 👤 **ikawrakow** replied the **2025-07-12** at **17:19:43**:
> @compilade
>
> Thank you for letting me know. I basically never use `llama.cpp` now, so the imatrix GG-ification is no longer relevant for my needs. The imatrix tool in mainline has been broken for MLA models for quite some time now, so I guess it is time to fix that by merging your PR.
diff --git a/github-data/discussions/164 - Latest CPU performance comparison with llama.cpp.md b/github-data/discussions/164 - Latest CPU performance comparison with llama.cpp.md
index 9d24171b4..ba73d79f3 100644
--- a/github-data/discussions/164 - Latest CPU performance comparison with llama.cpp.md
+++ b/github-data/discussions/164 - Latest CPU performance comparison with llama.cpp.md
@@ -1,13 +1,14 @@
-### 🗣️ [#164](https://github.com/ikawrakow/ik_llama.cpp/discussions/164) - Latest CPU performance comparison with llama.cpp
+## 🗣️ [Discussion #164](https://github.com/ikawrakow/ik_llama.cpp/discussions/164) - Latest CPU performance comparison with llama.cpp
| **Author** | `ikawrakow` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2024-12-24 |
-| **Updated** | 2025-04-28 |
+| **Updated** | 2025-07-22 |
---
-#### Description
+## 📄 Description
There has been quite a bit of development here and in mainline `llama.cpp` since the performance results on the front page were generated, so I decided to make a new CPU performance comparison.
@@ -111,9 +112,9 @@ The fastest way to do prompt processing with `ik_llama.cpp` is the new 8-bit, 8-
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **saood06** replied the **2025-01-10** at **23:34:54**:
+👤 **saood06** commented on **2025-01-10** at **23:34:54**
I ran some benchmarks on an AVX2 machine (Xeon E5-2683 v4, 32 core, quad channel broadwell) on an IQ4_XS of Midnight Miqu 70B v1.5 via batched bench ( with arguments -pps -fa -t 32 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 -c 32768 [context only needed to be set for llama.cpp as otherwise it would skip some tests but ik_llama.cpp defaulted to 32768] ), build 4404 for llama.cpp. No runtime repacking for ik_llama.cpp.
I was curious about batch performance since there is inference software like arrows or loom which would definitely benefit from it.
@@ -278,7 +279,7 @@ I only ask because I'm not sure if the 80 tensors going from q5_K to iq5_k is lo
---
-👤 **ikawrakow** replied the **2025-01-11** at **07:28:46**:
+👤 **ikawrakow** commented on **2025-01-11** at **07:28:46**
@saood06 Thanks for testing.
@@ -294,7 +295,8 @@ Sorry, the goal was to make the `_R4` quants use the same quantization mixes, bu
`IQ5_K` is normally quite a bit better than `Q5_K`, so most of the time I would expect this to perform better.
-> 👤 **saood06** replied the **2025-01-11** at **09:59:16**:
+> 👤 **saood06** replied on **2025-01-11** at **09:59:16**
+>
> >Sorry, the goal was to make the _R4 quants use the same quantization mixes, but apparently I have not quite succeeded. The function where the quantization type is selected is quite messy. But instead of re-quantizing to *_R4, you can use the -rtr command line option, which will make your model use the exact same mix of quantization types (but those where an _R4 variant is available will be repacked to that).
>
> No worries, I only made the quant to test (for actual use, I'd make an IQK quant) and I didn't realize batched-bench supported rtr. It also didn't matter for this machine and test, but I also wasn't sure how runtime repacking and NUMA would behave, if the runtime repacking would interfere with the benefits from POSIX_MADV_RANDOM.
@@ -308,13 +310,15 @@ Sorry, the goal was to make the `_R4` quants use the same quantization mixes, bu
> Once almost all the model is in system cache, it did Prompt processing at 11.5 t/s, and token generation at 2.75 t/s. I still couldn't get it to fully fault, but it did basically stop paging, and performance stopped improving, once it hit those numbers.
>
> I couldn't get it to run with an _R4 quant it hit the GGML_ASSERT(nrc_x%4 == 0), but even without that I'm still happy with the performance of it.
+
+> 👤 **ikawrakow** replied on **2025-01-11** at **10:38:23**
>
-> 👤 **ikawrakow** replied the **2025-01-11** at **10:38:23**:
> > I couldn't get it to run with an _R4 quant it hit the GGML_ASSERT(nrc_x%4 == 0), but even without that I'm still happy with the performance of it.
>
> Can you post the assert you see? I was hoping to have covered all places where one needs to check for divisibility by 4 before using `_R4` quants, but apparently I'm still missing checks somewhere. What are the tensor dimensions of this model?
+
+> 👤 **saood06** replied on **2025-01-11** at **11:03:54**
>
-> 👤 **saood06** replied the **2025-01-11** at **11:03:54**:
> >Can you post the assert you see?
>
> Here's the full error output I got when trying to run it. I put it in a detail's thing as it is long.
@@ -664,27 +668,31 @@ Sorry, the goal was to make the `_R4` quants use the same quantization mixes, bu
> https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q2_K_L?show_file_info=DeepSeek-V3-Q2_K_L%2FDeepSeek-V3-Q2_K_L-00001-of-00005.gguf
>
> That link should list them in a relatively nice format. You'll have to click through to view all 5 parts though.
+
+> 👤 **ikawrakow** replied on **2025-01-11** at **11:17:30**
>
-> 👤 **ikawrakow** replied the **2025-01-11** at **11:17:30**:
> Thanks! This explains it. It is a MoE model, so I must have forgotten to make sure the number of rows is a multiple of 4 when splitting work between threads in the MoE matrix multiplication implementation. I'll try to fix it.
+
+> 👤 **saood06** replied on **2025-01-12** at **18:08:54**
>
-> 👤 **saood06** replied the **2025-01-12** at **18:08:54**:
> >Thanks! This explains it.
>
> I'm glad you were able to figure out the issue.
>
> >I'll try to fix it.
>
-> I see you did with #170, now the _R4 works for Deepseek V3 but performance is different from what I was expecting. I am pleasantly surprised by token generation going from 2.75 t/s to 3.10 t/s. Prompt processing on the other hand dropped from 11.5 t/s to 9.8 t/s.
+> I see you did with [#170](https://github.com/ikawrakow/ik_llama.cpp/issues/170), now the _R4 works for Deepseek V3 but performance is different from what I was expecting. I am pleasantly surprised by token generation going from 2.75 t/s to 3.10 t/s. Prompt processing on the other hand dropped from 11.5 t/s to 9.8 t/s.
>
> Either way thanks for the quick fix. The bump in TG speeds is nice, even if PP speed went down for me.
+
+> 👤 **ikawrakow** replied on **2025-01-13** at **05:54:15**
>
-> 👤 **ikawrakow** replied the **2025-01-13** at **05:54:15**:
> > Prompt processing on the other hand dropped from 11.5 t/s to 9.8 t/s.
>
> This is strange. In my testing with Mixtral8x7B, after the fix `IQ4_XS_R4` is about 30% faster than `IQ4_XS` for prompt processing. Deepseek V3 is beyond my compute capabilities, so not able to investigate.
+
+> 👤 **saood06** replied on **2025-01-19** at **13:00:33**
>
-> 👤 **saood06** replied the **2025-01-19** at **13:00:33**:
> >after the fix IQ4_XS_R4 is about 30% faster than IQ4_XS for prompt processing
>
> I've been testing IQ4_K_R4 vs IQ4_K. but I will test both IQ4_XS some for Mixtral-8x22B as I plan to test that, and I'll give some numbers against llama.cpp.
@@ -695,7 +703,7 @@ Sorry, the goal was to make the `_R4` quants use the same quantization mixes, bu
---
-👤 **ikawrakow** replied the **2025-01-11** at **07:58:35**:
+👤 **ikawrakow** commented on **2025-01-11** at **07:58:35**
> > Performance is good, but I don't understand why odd batch sizes seem to perform better.
@@ -708,30 +716,34 @@ Clearly I'm doing something there that works better for odd number of queries. I
---
-👤 **saood06** replied the **2025-01-19** at **13:33:06**:
+👤 **saood06** commented on **2025-01-19** at **13:33:06**
>We see that the CPU performance gap has widened significantly since July when I made the comparison on the front page.
Do you plan to update the README.md with these numbers? The R4 quants are very impressive.
-> 👤 **ikawrakow** replied the **2025-01-19** at **15:30:36**:
+> 👤 **ikawrakow** replied on **2025-01-19** at **15:30:36**
+>
> I should, I know. It is just that I prefer to solve problems rather that write about how I solved the problem and what came out.
+
+> 👤 **saood06** replied on **2025-04-27** at **09:33:26**
>
-> 👤 **saood06** replied the **2025-04-27** at **09:33:26**:
> You made a good list of things [here](https://github.com/ikawrakow/ik_llama.cpp/discussions/256#discussioncomment-12496828), the "Why?" section can be updated with newer models like the official bitnet release, Deepseek, Llama-4. Updating the benchmarks though I know is a lot.
+
+> 👤 **ikawrakow** replied on **2025-04-28** at **14:29:33**
>
-> 👤 **ikawrakow** replied the **2025-04-28** at **14:29:33**:
-> Something like PR #352 ?
+> Something like PR [#352](https://github.com/ikawrakow/ik_llama.cpp/issues/352) ?
---
-👤 **bartowski1182** replied the **2025-01-23** at **02:58:19**:
+👤 **bartowski1182** commented on **2025-01-23** at **02:58:19**
Out of curiousity, do you intend to maintain this fork as an alternative to llama.cpp perpetually? or is it more of a testing grounds before upstreaming?
wondering if it's worth recommending people run this specifically for better performance or if it's more of a "bleeding edge" kind of project that people should just wait to get later when it's more ready
-> 👤 **ikawrakow** replied the **2025-01-23** at **08:18:58**:
+> 👤 **ikawrakow** replied on **2025-01-23** at **08:18:58**
+>
> > Out of curiousity, do you intend to maintain this fork as an alternative to llama.cpp perpetually? or is it more of a testing grounds before upstreaming?
>
> Nothing is perpetual in this world :smiley:
@@ -741,15 +753,16 @@ wondering if it's worth recommending people run this specifically for better per
> It is also a bit of a chicken and egg game: I'll only get a more significant number of users if people know (or at least expect) that I'm seriously committed to his project and the project gets advertised around social networks, but I can only know if I want to seriously commit to maintaining this project long term for a significant number of users if I already have many users and have dealt with the associated bug reports and feature requests :smiley:
>
> As it stands, this project is only useful for technical users who are not scared to build the project themself (no docker images and pre-build binaries), and are using one of the platforms I develop/test on (Linux and macOS, `AVX2` or `ARM_NEON` CPUs, newer Nvidia GPUs). It may or may not work on Windows/Android/etc, old Nvidia or AMD GPUs, etc. I absolutely don't have the bandwidth (or desire) to be supporting every operating system and computing platform under the sun, including 10+ year old CPUs and GPUs, and obscure platforms used by exactly 3 people in the worlds, as `llama.cpp` does.
+
+> 👤 **bartowski1182** replied on **2025-01-23** at **15:12:49**
>
-> 👤 **bartowski1182** replied the **2025-01-23** at **15:12:49**:
> yeah that makes sense! would be cool to see someone attempt to upstream some improvements but I understand your lack of desire considering it's probably quite the headache
>
> Good to know though you intend to keep this going for at least awhile
---
-👤 **saood06** replied the **2025-01-30** at **22:48:57**:
+👤 **saood06** commented on **2025-01-30** at **22:48:57**
I was curious due to Deepseek's design to test the MHA 35B c4ai-command-r-v01.Q8_0 on my Xeon E5-2683 v4. Ran as much context as I had RAM for. TG is set 5 not 32 as it was slow.
diff --git a/github-data/discussions/165 - Norm RMS Epsilon.md b/github-data/discussions/165 - Norm RMS Epsilon.md
index b9ee79cd8..7b37f5e94 100644
--- a/github-data/discussions/165 - Norm RMS Epsilon.md
+++ b/github-data/discussions/165 - Norm RMS Epsilon.md
@@ -1,13 +1,14 @@
-### 🗣️ [#165](https://github.com/ikawrakow/ik_llama.cpp/discussions/165) - Norm RMS Epsilon
+## 🗣️ [Discussion #165](https://github.com/ikawrakow/ik_llama.cpp/discussions/165) - Norm RMS Epsilon
| **Author** | `Nexesenex` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2024-12-25 |
| **Updated** | 2024-12-27 |
---
-#### Description
+## 📄 Description
While it crosses my mind..
@@ -19,9 +20,9 @@ And merry XMAS btw, if you celebrate it!
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2024-12-27** at **17:44:24**:
+👤 **ikawrakow** commented on **2024-12-27** at **17:44:24**
I'm travelling, so just quickly from the phone.
diff --git a/github-data/discussions/166 - Learning more LLM quantization.md b/github-data/discussions/166 - Learning more LLM quantization.md
index 4a9d4e3a2..2d19f5d41 100644
--- a/github-data/discussions/166 - Learning more LLM quantization.md
+++ b/github-data/discussions/166 - Learning more LLM quantization.md
@@ -1,13 +1,14 @@
-### 🗣️ [#166](https://github.com/ikawrakow/ik_llama.cpp/discussions/166) - Learning more LLM quantization
+## 🗣️ [Discussion #166](https://github.com/ikawrakow/ik_llama.cpp/discussions/166) - Learning more LLM quantization
| **Author** | `robinnarsinghranabhat` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-01-05 |
| **Updated** | 2025-03-13 |
---
-#### Description
+## 📄 Description
For beginners like me to ML, I wanted to learn what research papers guided the quantization implement in llama.
@@ -15,9 +16,9 @@ It might sound silly but we have separate tricks for quantization during trainin
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-01-05** at **10:37:28**:
+👤 **ikawrakow** commented on **2025-01-05** at **10:37:28**
> For beginners like me to ML, I wanted to learn what research papers guided the quantization implement in llama.
@@ -29,7 +30,7 @@ I developed all quantization types in `llama.cpp` apart from the legacy quants `
---
-👤 **robinnarsinghranabhat** replied the **2025-01-10** at **21:38:11**:
+👤 **robinnarsinghranabhat** commented on **2025-01-10** at **21:38:11**
Thank you for this humble response !
@@ -45,5 +46,6 @@ I want to be a programmar like you.
Sorry .. lots of questions all over the place :(
-> 👤 **arnfaldur** replied the **2025-03-13** at **02:10:31**:
+> 👤 **arnfaldur** replied on **2025-03-13** at **02:10:31**
+>
> Trying to understand this codebase isn't attacking the wall where it's lowest. You're probably best off finding some beginner/intermediate C++ courses online. I imagine that there are plenty available for free. You don't strictly need to understand all these fundamentals to understand what this project is doing, but you sound like you're in the *don't know what you don't know* phase and a general Computer Science course would likely get you the farthest at this point.
\ No newline at end of file
diff --git a/github-data/discussions/18 - CPU beating GPU in token generation speed.md b/github-data/discussions/18 - CPU beating GPU in token generation speed.md
index 38114f5cc..a3489a0eb 100644
--- a/github-data/discussions/18 - CPU beating GPU in token generation speed.md
+++ b/github-data/discussions/18 - CPU beating GPU in token generation speed.md
@@ -1,13 +1,14 @@
-### 🗣️ [#18](https://github.com/ikawrakow/ik_llama.cpp/discussions/18) - CPU beating GPU in token generation speed
+## 🗣️ [Discussion #18](https://github.com/ikawrakow/ik_llama.cpp/discussions/18) - CPU beating GPU in token generation speed
| **Author** | `ikawrakow` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2024-08-13 |
| **Updated** | 2025-04-03 |
---
-#### Description
+## 📄 Description
The [TriLM](https://huggingface.co/collections/SpectraSuite/trilms-unpacked-668d5f62afe0f4036925b1d2) ternary models are available in various sizes, so I was curious to look into prompt processing (PP) and token generation (TG) speed when the model is small enough to fit in the CPU cache. I have a Ryzen-7950X CPU with 64 MiB of L3 cache, and the 99M parameter TriLM model is 46 MiB when quantized with `IQ2_TN`. So, without further ado, lets look at a comparison between the Ryzen-7950X and an RTX-4080 in this case:
@@ -37,11 +38,11 @@ Also here the GPU is faster for PP (but just 5X faster), but the CPU wipes the f
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2024-09-02** at **13:20:54**:
+👤 **ikawrakow** commented on **2024-09-02** at **13:20:54**
-Now that we have efficient Flash Attention (FA) implementation on the CPU via PR #32, we can compare again performance between the CPU and GPU for this tiny 99M parameter model. We get
+Now that we have efficient Flash Attention (FA) implementation on the CPU via PR [#32](https://github.com/ikawrakow/ik_llama.cpp/issues/32), we can compare again performance between the CPU and GPU for this tiny 99M parameter model. We get
| model | size | params | backend | ngl | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ------------: | ---------------: |
@@ -54,15 +55,18 @@ TG speed is now about the same, which is still quite remarkable.
FA has improved CPU prompt processing speed by almost 50%, TG by 22%.
-> 👤 **saood06** replied the **2025-04-02** at **10:36:44**:
+> 👤 **saood06** replied on **2025-04-02** at **10:36:44**
+>
> Is there a chance SpargeAttn could be implemented here. Code [here](https://github.com/thu-ml/SpargeAttn), Paper [here](https://arxiv.org/abs/2502.18137).
>
> If it could would it benefit speed on CPU?
+
+> 👤 **ikawrakow** replied on **2025-04-02** at **13:44:09**
>
-> 👤 **ikawrakow** replied the **2025-04-02** at **13:44:09**:
> Other than the paper, is there any evidence that this works as advertised? If I did nothing else but implementing breakthroughs announced on arXiv, the day still wouldn't have enough hours.
+
+> 👤 **saood06** replied on **2025-04-03** at **00:24:39**
>
-> 👤 **saood06** replied the **2025-04-03** at **00:24:39**:
> >Other than the paper, is there any evidence that this works as advertised?
>
> Not really (there are multiple ComfyUI custom nodes that port support but not much on people using it), the paper looked interesting to me and the idea makes sense to me, but the implementation they have looks premature. The same group put out SageAttention/SageAttention2 which has been widely adopted (mostly for image/video models) and the performance matched the paper but SpargeAttn has gotten interest but not much adoption because of the state of the implmentation.
@@ -73,9 +77,9 @@ FA has improved CPU prompt processing speed by almost 50%, TG by 22%.
---
-👤 **ikawrakow** replied the **2024-09-08** at **07:16:59**:
+👤 **ikawrakow** commented on **2024-09-08** at **07:16:59**
-With PR #42 we get this
+With PR [#42](https://github.com/ikawrakow/ik_llama.cpp/issues/42) we get this
| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ------------: | ---------------: |
diff --git a/github-data/discussions/201 - What is the NUMA situation _.md b/github-data/discussions/201 - What is the NUMA situation.md
similarity index 93%
rename from github-data/discussions/201 - What is the NUMA situation _.md
rename to github-data/discussions/201 - What is the NUMA situation.md
index a03a4265b..5ad806721 100644
--- a/github-data/discussions/201 - What is the NUMA situation _.md
+++ b/github-data/discussions/201 - What is the NUMA situation.md
@@ -1,13 +1,14 @@
-### 🗣️ [#201](https://github.com/ikawrakow/ik_llama.cpp/discussions/201) - What is the NUMA situation ?
+## 🗣️ [Discussion #201](https://github.com/ikawrakow/ik_llama.cpp/discussions/201) - What is the NUMA situation ?
| **Author** | `bhugueney` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-02-11 |
| **Updated** | 2025-05-21 |
---
-#### Description
+## 📄 Description
It seems to me that output generation being memory bandwidth bounded and LLM requiring a lot of RAM , a cheap way to try increase both RAM amount and bandwidth is to go for NUMA.
For instance, a dual Epyc server can have 16 or 24 memory channels each CPU can also have up to 4 NUMA domains for best theoretical performance (also, on Gen 2 Epyc at least, L3 cache is shared only amongst cores on the same CCX).
@@ -23,37 +24,41 @@ Thx !
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-02-11** at **06:09:03**:
+👤 **ikawrakow** commented on **2025-02-11** at **06:09:03**
In `ik_llama.cpp`, being a fork of `llama.cpp`, the NUMA situation is the same as in `llama.cpp`.
Improving performance on NUMA systems is something I would be interested in looking into, but I don't have a dual socket system available (with enough memory bandwidth to make it interesting), and I'm just a lonely guy hacking here for fun without the resources to go and rent/buy such a system.
-> 👤 **bhugueney** replied the **2025-02-11** at **10:56:00**:
+> 👤 **bhugueney** replied on **2025-02-11** at **10:56:00**
+>
> Thx !
> I sure hope my message didn't come of as complaining : I've very grateful for what you already did !
> If you are interested I will try to provide you full access to my dual Epyc server with 16 × 64 GB of DDR4 @3200.
+
+> 👤 **ikawrakow** replied on **2025-02-11** at **14:47:10**
>
-> 👤 **ikawrakow** replied the **2025-02-11** at **14:47:10**:
> This would be of course great, but I'm hesitant to promise to tackle the NUMA issue right away.
>
> When you say "full access", you mean you are not going to be using the system while I'm using it? Which Epycs do you have?
+
+> 👤 **bhugueney** replied on **2025-02-11** at **23:17:06**
>
-> 👤 **bhugueney** replied the **2025-02-11** at **23:17:06**:
> I'm not expecting any promises, especially as I'm afraid llama.cpp cannot be patched to become NUMA efficient. My (very) limited understanding is that people ran llama.cpp CPU backend on NUMA and got bad performance because one thread was doing all the memory allocation (so in one NUMA domain) and they started trying to address that by patching the CPU backend. Unfortunately, such approach seems doomed to hit a wall as NUMA efficiency probably requires a different architecture more like a multi-GPU backend with tensor parallelism where each NUMA domain would be treated like a GPU wrt trying to minimize inter GPU communication and maximize parallelism. This is the vLLM approach for NUMA if I'm note mistaken.
>
> When I say "full access", I mean IPMI access while I'm not using it. But I have to figure things out first. Epycs would be 7R32 (same as AWS c5a instances).
+
+> 👤 **saood06** replied on **2025-02-11** at **23:58:26**
>
-> 👤 **saood06** replied the **2025-02-11** at **23:58:26**:
> So in regards to the current state of llama.cpp/ik_llama.cpp NUMA performance I don't think it's that bad. I've seen a few reports from a few users on more modern NUMA machines than mine report performance running multiple instances of llama.cpp on each NUMA domain isolated, vs running one larger instance on all NUMA domains and although there was gain to be had it wasn't that dramatic of a difference. My older NUMA machine also gets decent performance for it's bandwidth.
>
> I'm looking into expert parallelism for the Deepseek V3/R1 MoE model, which should benefit NUMA systems. The plan for that is port over the PR which allows you to specify what tensor is loaded onto what backend, change the tensor representation of this model to not consolidate the experts. At that point I'd test performance with that and each NUMA node on a separate RPC backend, since changing ik_llama.cpp to create a backend for each NUMA domain might require a lot more work, but I'd look into it once I get there.
---
-👤 **saood06** replied the **2025-03-13** at **05:53:54**:
+👤 **saood06** commented on **2025-03-13** at **05:53:54**
There is actually a good discussion on mainline: https://github.com/ggml-org/llama.cpp/discussions/12088
@@ -61,12 +66,14 @@ They did test ik_llama.cpp (but in only with a single NUMA Node on a single CPU
Also you can look at zts9989's comment [here](https://github.com/ggml-org/llama.cpp/pull/11397#issuecomment-2716225570) where he talks about NUMA and what llama.cpp could improve on after he found that "approximately 50% of CPU usage is spent on thread synchronization" when running Deepseek R1 with multiple numa nodes.
-> 👤 **ikawrakow** replied the **2025-03-13** at **07:27:34**:
+> 👤 **ikawrakow** replied on **2025-03-13** at **07:27:34**
+>
> > They did test ik_llama.cpp (but in only with a single NUMA Node on a single CPU at Q8_0) where it still outperformed mainline for CPU only.
>
> Where can I find the test results?
+
+> 👤 **saood06** replied on **2025-03-13** at **07:44:42**
>
-> 👤 **saood06** replied the **2025-03-13** at **07:44:42**:
> In the linked post the second table under 6980P Benchmarks has it, but pasting it here for reference:
>
> Quantization | Tokens/Second | NUMA Configuration
@@ -75,8 +82,9 @@ Also you can look at zts9989's comment [here](https://github.com/ggml-org/llama.
> Q8_0 | 6.2 | 1x NUMA Node on 1x CPU
>
> This is the only published result for ik_llama but they do state "Keep an eye on [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) fork which has interesting optimizations." so they may run more.
+
+> 👤 **saood06** replied on **2025-03-13** at **08:45:24**
>
-> 👤 **saood06** replied the **2025-03-13** at **08:45:24**:
> I forgot he had much more detailed results under Methodology and Notes, there is a section for ik_llama.cpp showing the command and bench numbers, interestingly ik_llama.cpp performance peaked at 128 threads for both PP and TG compared to peaking at 86 threads for TG and 128 threads for PP in mainline. He also shares PP numbers as well, where ik_llama again shows better performance than mainline. He does explicitly state TODO for testing ik_llama.cpp for 2x CPU Q8_0.
>
> Again pasting the segment of his post featuring ik_llama.cpp for reference:
@@ -110,7 +118,7 @@ Also you can look at zts9989's comment [here](https://github.com/ggml-org/llama.
---
-👤 **ikawrakow** replied the **2025-03-13** at **11:55:55**:
+👤 **ikawrakow** commented on **2025-03-13** at **11:55:55**
@saood06
@@ -141,7 +149,8 @@ I'm curious which `AVX512` extensions are supported by this CPU to understand if
Playing with some of the more advanced options that mainline `llama.cpp` does not have would be of course very interesting too.
-> 👤 **saood06** replied the **2025-03-13** at **21:20:04**:
+> 👤 **saood06** replied on **2025-03-13** at **21:20:04**
+>
> >I'm curious which AVX512 extensions are supported by this CPU to understand if vanilla AVX2 is being used, or the code optimized for the Zen4 core (requires AVX512F, AVX512VNNI, AVX512VL, AVX512BW, AVX512DQ).
>
> All of those extensions are supported (and also AVX512_fp16 which AMD does not support even on Zen 5), none of the normal sources I use for this have been updated to show Granite Rapids but I did find [this](https://www.phoronix.com/image-viewer.php?id=intel-xeon-6980p-performance&image=intel_xeon_6980p_2_lrg). Granite rapids was supposed to have support for Intel AVX10 (Version 1, or Intel AVX10.1) but that apparently did not happen.
@@ -149,8 +158,9 @@ Playing with some of the more advanced options that mainline `llama.cpp` does no
> >I have seen a higher than usual amount of stars added to my repository in the last few days, I guess this must be due to your post.
>
> I've also seen an uptick in organic mentions of ik_llama.cpp recently and have done my best to help people understand all the new features and benefits.
+
+> 👤 **ubergarm** replied on **2025-03-13** at **22:15:00**
>
-> 👤 **ubergarm** replied the **2025-03-13** at **22:15:00**:
> @ikawrakow
>
> > Very interesting results, thank you for posting and including my little LLM inference playground in the results.
@@ -222,8 +232,9 @@ Playing with some of the more advanced options that mainline `llama.cpp` does no
> $ lscpu | grep Flags
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
> ```
+
+> 👤 **saood06** replied on **2025-03-13** at **22:51:58**
>
-> 👤 **saood06** replied the **2025-03-13** at **22:51:58**:
> > > Playing with some of the more advanced options that mainline llama.cpp does not have would be of course very interesting too.
> >
> > Yes, I'm playing with [ktransformers](https://github.com/ubergarm/r1-ktransformers-guide/) as well, but it has a hard requirement on GPU. Unfortunately, this 6980P rig has no GPU so I'm limited to CPU only testing.
@@ -265,7 +276,7 @@ Playing with some of the more advanced options that mainline `llama.cpp` does no
---
-👤 **saood06** replied the **2025-03-25** at **03:29:01**:
+👤 **saood06** commented on **2025-03-25** at **03:29:01**
@ubergarm (thought you might also be interested in this).
@@ -281,7 +292,8 @@ The downside of duplicating the model is pretty heavy, but this approach obvious
Looking at the codebase, I think it currently only works for dual socket nodes, and I would have been more interested in testing it but none of my machines (even the very unstable one quad socket 1 TB memory node that I haven't turned on in a long time) would have enough RAM to replicate my preferred quant of R1, I'd have to use one under 192 GB (I do still have my IQ1_S_R4 V2 that is 129 GB).
-> 👤 **ubergarm** replied the **2025-03-25** at **15:58:04**:
+> 👤 **ubergarm** replied on **2025-03-25** at **15:58:04**
+>
> Super, I just fetched this fork and will take a peek.
>
> > The downside of duplicating the model is pretty heavy
@@ -293,21 +305,23 @@ Looking at the codebase, I think it currently only works for dual socket nodes,
> Also [mingfeima](https://github.com/mingfeima) left an [interesting comment](https://github.com/ggml-org/llama.cpp/issues/12003#issuecomment-2731572966) recently discussing some of the intel specific optimizations and work he's doing on sglang.
>
> Finally, I recently saw Wendell of [level1techs youtube channel do a video](https://www.youtube.com/watch?v=kOh04PhXqmY) about quad socket Intel Xeon. Seems like it could be configured into 8 individual NUMA nodes with 1TB each possibly? Talk about wasting RAM, but would be fun to try haha...
+
+> 👤 **saood06** replied on **2025-03-27** at **07:24:15**
>
-> 👤 **saood06** replied the **2025-03-27** at **07:24:15**:
> >Super, I just fetched this fork and will take a peek.
>
> Did you ever test it?
---
-👤 **ikawrakow** replied the **2025-03-25** at **16:06:42**:
+👤 **ikawrakow** commented on **2025-03-25** at **16:06:42**
> Ideally you would have the most number of individual NUMA nodes to maximize performance,
Why?
-> 👤 **ubergarm** replied the **2025-03-25** at **16:14:54**:
+> 👤 **ubergarm** replied on **2025-03-25** at **16:14:54**
+>
> Looking at Intel Memory Latency Checker `mlc` benchmarks suggest that the memory local to the compute on a specific NUMA node gives best bandwidth and latency.
>
> My thinking is that duplicating weights into each NUMA node and having local threads working with that RAM would maximize performance.
@@ -318,11 +332,12 @@ Why?
---
-👤 **ikawrakow** replied the **2025-03-25** at **16:24:17**:
+👤 **ikawrakow** commented on **2025-03-25** at **16:24:17**
Sure, that would be if you wanted to squeeze out the last bit of performance. But we are not at that stage. Instead, we are a factor of 2 or more away from what should be possible. Having 2 big NUMA nodes would make the distribution of weights much easier: simply change the weight loading to use two threads, each pinned to a specific NUMA node, and each loading half of the tensor data. During inference pin half the threads to run on the 1st NUMA node, and the other half to the second NUMA node. My thinking is that this should give a significant boost in performance without replicating the model on both NUMA nodes. It is of course possible to do stuff such as this with several NUMA nodes, but it makes things way more complicated. So, I'm thinking that the 1st step should be to get better performance with 2 NUMA nodes. But if you are telling me that this is very far from ideal, and that the only way to get better performance is to enable and utilize all NUMA nodes, then it is a waste of time to implement the simple approach described above.
-> 👤 **ubergarm** replied the **2025-03-25** at **16:36:46**:
+> 👤 **ubergarm** replied on **2025-03-25** at **16:36:46**
+>
> > that would be if you wanted to squeeze out the last bit of performance. But we are not at that stage.
>
> Yes, I agree on both points.
@@ -338,8 +353,9 @@ Sure, that would be if you wanted to squeeze out the last bit of performance. Bu
> No need to worry about rare brand new quad socket intel xeon boards or more smaller NUMA nodes currently imo.
>
> I'll try to find my `mlc` benchmarks and post here, as the bandwidth is still pretty good converting a single CPU into 1 NUMA node.
+
+> 👤 **ubergarm** replied on **2025-03-25** at **16:52:11**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **16:52:11**:
> #### intel `mlc`
>
> Configuring BIOS to `SNC=Disable` to collapse 3x NUMA nodes per CPU socket into a single NUMA node per 6980P socket gives similar enough RAM bandwidth/latency performance.
@@ -518,8 +534,9 @@ Sure, that would be if you wanted to squeeze out the last bit of performance. Bu
>
> ## References
> * [Additional Benchmarks and discussions on Phoronix](https://www.phoronix.com/review/xeon-6980p-snc3-hex)
+
+> 👤 **saood06** replied on **2025-03-25** at **18:09:30**
>
-> 👤 **saood06** replied the **2025-03-25** at **18:09:30**:
> > During inference pin half the threads to run on the 1st NUMA node, and the other half to the second NUMA node.
>
> The problem is not splitting the model, it is ensuring the work of any given thread is stored local to it's NUMA node.
@@ -527,16 +544,19 @@ Sure, that would be if you wanted to squeeze out the last bit of performance. Bu
> This PR: https://github.com/ggml-org/llama.cpp/pull/6915 made it difficult as mentioned here: https://github.com/ggml-org/llama.cpp/issues/1437#issuecomment-2095809308
>
> Maybe you could use [this](https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/2023-0/thread-affinity-interface.html#LOW_LEVEL_AFFINITY_API) so that each thread could change it's affinity to a random thread on the correct numa node (this would also work since I don't think this would otherwise be compatible with --numa interleave [but not sure has been a long time since I looked into that).
+
+> 👤 **ikawrakow** replied on **2025-03-25** at **18:17:01**
>
-> 👤 **ikawrakow** replied the **2025-03-25** at **18:17:01**:
> There is no dynamic thread scheduling here. No thread pools either.
>
> In my experience from the past, touching memory with on a NUMA node makes it automatically that the actual data is stored in a memory bank local to the node on which the thread is running. The difficulty will be more in fighting with the almighty `ggml` backend than anything else.
+
+> 👤 **ikawrakow** replied on **2025-03-25** at **18:26:08**
>
-> 👤 **ikawrakow** replied the **2025-03-25** at **18:26:08**:
> Dynamic thread scheduling does help for PP with big enough batch sizes. It would also help on systems with a mix of P/E cores (although, if mainline `llama.cpp` has that, I notice absolutely zero benefit on my M2-Max. Performance there is still best with 8 threads, not 12). But for TG with all same cores the overhead of thread synchronization for work stealing is typically too high to have benefit. Maybe it is different for a humongous model such as DeepSeek-R1? But then again, it has nearly 4X the number of nodes in the compute graph, so the work per node is not that much higher than DeepSeek-Lite.
+
+> 👤 **saood06** replied on **2025-03-25** at **18:36:09**
>
-> 👤 **saood06** replied the **2025-03-25** at **18:36:09**:
> > There is no dynamic thread scheduling here. No thread pools either.
>
> @bmtwl
@@ -551,7 +571,7 @@ Sure, that would be if you wanted to squeeze out the last bit of performance. Bu
---
-👤 **ubergarm** replied the **2025-03-30** at **17:25:05**:
+👤 **ubergarm** commented on **2025-03-30** at **17:25:05**
Oh I see a benchmark in the wild attempting to benchmark that [vproxy-tools/llama.cpp](https://github.com/vproxy-tools/llama.cpp) NUMA data parallel code against ik fork: https://github.com/ggml-org/llama.cpp/discussions/12289#discussioncomment-12668490
@@ -559,31 +579,34 @@ Oh I see a benchmark in the wild attempting to benchmark that [vproxy-tools/llam
Not sure the details of how they are running it though...
-> 👤 **saood06** replied the **2025-03-30** at **20:58:05**:
-> > Oh I see a benchmark in the wild attempting to benchmark that [vproxy-tools/llama.cpp](https://github.com/vproxy-tools/llama.cpp) NUMA data parallel code against ik fork: [ggml-org/llama.cpp#12289 (comment)](https://github.com/ggml-org/llama.cpp/discussions/12289#discussioncomment-12668490)
+> 👤 **saood06** replied on **2025-03-30** at **20:58:05**
+>
+> > Oh I see a benchmark in the wild attempting to benchmark that [vproxy-tools/llama.cpp](https://github.com/vproxy-tools/llama.cpp) NUMA data parallel code against ik fork: [ggml-org/llama.cpp[#12289](https://github.com/ikawrakow/ik_llama.cpp/issues/12289) (comment)](https://github.com/ggml-org/llama.cpp/discussions/12289#discussioncomment-12668490)
> >
> > Not sure the details of how they are running it though...
>
> Thanks for the link, I agree it would be nice if they included more details.
+
+> 👤 **ubergarm** replied on **2025-03-30** at **21:14:31**
>
-> 👤 **ubergarm** replied the **2025-03-30** at **21:14:31**:
> Yeah, I gave it a try and while it did run it wasn't allocating threads on both NUMA nodes so I gave up for now after posting my logs.
+
+> 👤 **saood06** replied on **2025-03-30** at **21:34:22**
>
-> 👤 **saood06** replied the **2025-03-30** at **21:34:22**:
> > Yeah, I gave it a try and while it did run it wasn't allocating threads on both NUMA nodes so I gave up for now after posting my logs.
>
> Did you try running it with numactl on just 2 NUMA nodes? There is also an issue tracker for [vproxy-tools/llama.cpp](https://github.com/vproxy-tools/llama.cpp/issues) where you could report that.
---
-👤 **bhugueney** replied the **2025-04-08** at **10:24:55**:
+👤 **bhugueney** commented on **2025-04-08** at **10:24:55**
I currently settle for running my DeepSeek v3 model on just one NUMA / socket of my dual socket system. However, while investigating the draft models situation, it occurred to me that if should be relatively easy to specify cores for the main model (on one socket) and specify other cores (in my case on the other socket/NUMA node) for the draft model as communication between the two should be minimal.
What do people think about it?
---
-👤 **saood06** replied the **2025-05-20** at **08:37:01**:
+👤 **saood06** commented on **2025-05-20** at **08:37:01**
On my dual socket machine using https://github.com/intel/pcm
@@ -605,7 +628,7 @@ And during TG:
---
-👤 **VinnyG9** replied the **2025-05-21** at **04:15:29**:
+👤 **VinnyG9** commented on **2025-05-21** at **04:15:29**
just sharing i tried all snoop modes on my x99 dual board and got 200-300% boost vs stock bios settings, this setting is also available on xeon scalable fwiw
@@ -632,7 +655,8 @@ just sharing i tried all snoop modes on my x99 dual board and got 200-300% boost
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | tg128 | 34.76 ± 1.54 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | tg256 | 35.70 ± 0.34 |
-> 👤 **ubergarm** replied the **2025-05-21** at **14:26:30**:
+> 👤 **ubergarm** replied on **2025-05-21** at **14:26:30**
+>
> Wow, big gains! I'd never heard of "snoop" mode, but don't have a lot of intel server experience:
>
> > DIR+OSB mode allows for low local memory latency, high local memory bandwidth and I/O directory cache to reduce directory update overheads for I/O accesses.
@@ -640,8 +664,9 @@ just sharing i tried all snoop modes on my x99 dual board and got 200-300% boost
> Are you running hybrid CPU+GPU CUDA offloading some layers? I forget your exact system specs and VRAM, but if you can offload the whole thing it can go quite faster psure. Also, if I'm running CPU/RAM *only* I generally recompile and disable CUDA backend fwiw.
>
> Glad you're having fun tweaking and tuning!
+
+> 👤 **VinnyG9** replied on **2025-05-21** at **18:07:27**
>
-> 👤 **VinnyG9** replied the **2025-05-21** at **18:07:27**:
> > Wow, big gains! I'd never heard of "snoop" mode, but don't have a lot of intel server experience:
> >
> > > DIR+OSB mode allows for low local memory latency, high local memory bandwidth and I/O directory cache to reduce directory update overheads for I/O accesses.
diff --git a/github-data/discussions/211 - help me create an importance matrix primer.md b/github-data/discussions/211 - help me create an importance matrix primer.md
index e16f55397..5d85ed89e 100644
--- a/github-data/discussions/211 - help me create an importance matrix primer.md
+++ b/github-data/discussions/211 - help me create an importance matrix primer.md
@@ -1,13 +1,14 @@
-### 🗣️ [#211](https://github.com/ikawrakow/ik_llama.cpp/discussions/211) - help me create an importance matrix primer
+## 🗣️ [Discussion #211](https://github.com/ikawrakow/ik_llama.cpp/discussions/211) - help me create an importance matrix primer
| **Author** | `robbiemu` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-02-19 |
| **Updated** | 2025-02-22 |
---
-#### Description
+## 📄 Description
this primer, if I am honest is mostly about the related main stream llama.cpp project, but the details are so general I think it generally applies. I was hoping @ikawrakow you might review this and help me to track down gaps and errors, before I release a final version. (I'm the [llama-gguf-optimize](https://github.com/robbiemu/llama-gguf-optimize) guy interested in language preservation, btw -- hello again! ).
@@ -217,9 +218,9 @@ This documentation introduces general approaches to quantization and then llama.
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-02-21** at **06:51:45**:
+👤 **ikawrakow** commented on **2025-02-21** at **06:51:45**
1. Many equations do not show in my Browsers (Firefox, Safari)
2. You are trying to describe the imatrix as used in llama.cpp. Hence, it would be better to use the mathematical foundation of that instead of the LeanQuants paper.
@@ -233,7 +234,8 @@ This documentation introduces general approaches to quantization and then llama.
Etc. Sorry @robbiemu, but this is just too far from representing the actual imatrix fundamentals and the imatrix use for guiding quantization.
-> 👤 **robbiemu** replied the **2025-02-21** at **11:55:47**:
+> 👤 **robbiemu** replied on **2025-02-21** at **11:55:47**
+>
> thank you for that :) Its a draft, of course there are things going to be wrong, its a big project that I've worked _with_ much more than _in_, and I need and appreciate the help identifying where I need to correct.
>
> especially things like simple errata like Github's markdown not rendering latex and my confusing at one point blocks of 32 for superblocks of 256 vis-a-vis AVX2 are little burden. But there were a couple of points that I dont feel confident how to process.
@@ -241,15 +243,16 @@ Etc. Sorry @robbiemu, but this is just too far from representing the actual imat
> At the beginning, I did transclude in sections from another document I have on LeanQuants specifically because in our conversation where I felt you were the one to equate the imatrix to the hessian approach. And they have a very natural way of expressing the relationship to quantization decisions so .. I took pains to show the approximate relationship. That and, if you search/read about llama.cpp importance matrices online now, you will often see this relationship indicated. In reading your PR comment I see that you don't even explicitly mention it, so maybe inclusion was misguided. Yet, you also don't directly ground quantization decisions to using an importance matrix. In other words, the "how did we get here" that this section currently provides .. I'll need to add that still. Do you prefer another formulation rather than what I used from LeanQuant? If I were to keep it: What is glossed over as essentially a given, that you can calculate only the diagonal, and the fact that you can treat a block-diagonal matrix here as a collection of smaller matrices (so you can break up the model's quantization row-wise, as is done in llama.cpp) -- those can be simplified or removed and replaced with the derivation you spell out in your PR.
>
> What really interests me is # 7. after generating your imatrix the next step, in practice, is to use the quantization tool. So it must be in the details it is incorrect. I got this from perplexity (I've not been working very much in the llama.cpp source code, except in regards YaRN). If it is not too much to ask, could I ask you to help correct that into a high level description. I'm trying to avoid an exact correspondence here (phase 1 also does not live up to that), I just want a simple conceptual description of the execution graph.
+
+> 👤 **robbiemu** replied on **2025-02-21** at **12:28:24**
>
-> 👤 **robbiemu** replied the **2025-02-21** at **12:28:24**:
> On one other point:
>
> "for sequential processing" -- this is just a lack of clarity, it I guess should be "to then be processed sequentially" maybe. I was never describing the reasoning, just the application, not getting into the details. Maybe I could add something about matching the max_positional_embeddings though, sure. batch and ubatch currently under the lens for change, there's a draft PR to make ubatch functionally different from batch in imatrix generation (ie computing multiple chunks per batch in https://github.com/ggml-org/llama.cpp/pull/9400 ) - as the nature and intent are perhaps changing, describing the intent is something I am not interested in adding to the document.
---
-👤 **ikawrakow** replied the **2025-02-21** at **16:20:18**:
+👤 **ikawrakow** commented on **2025-02-21** at **16:20:18**
If this was a draft that had the occasional mistake here or there, I would try to help you. But the content is so far away from reality that I wouldn't know where to begin (short of completely rewriting it).
@@ -271,7 +274,8 @@ Where did you even get this equation from? It certainly is not used anywhere in
No. All model weights in a tensor use the exact same amount of bits per weight.
-> 👤 **robbiemu** replied the **2025-02-21** at **19:03:42**:
+> 👤 **robbiemu** replied on **2025-02-21** at **19:03:42**
+>
> Ok hold on, please understand I'm just trying to essentially describe this, using tools to help me avoid reading the code was probably a mistake but, in my defense, its a big project that I am trying to elaborate. :) I'll apply the changes, this will get better. Maybe I should seek help from others instead... if so my apologies. I dont want to address the entire reply you gave me there just now, but something you said really gave me doubt.
>
> >> The quantization algorithm scales compression aggressiveness inversely with importance scores...
diff --git a/github-data/discussions/223 - Recent performance testing with DeepSeek R1.md b/github-data/discussions/223 - Recent performance testing with DeepSeek R1.md
index 3ed990bb2..768bbe5e0 100644
--- a/github-data/discussions/223 - Recent performance testing with DeepSeek R1.md
+++ b/github-data/discussions/223 - Recent performance testing with DeepSeek R1.md
@@ -1,13 +1,14 @@
-### 🗣️ [#223](https://github.com/ikawrakow/ik_llama.cpp/discussions/223) - Recent performance testing with DeepSeek R1
+## 🗣️ [Discussion #223](https://github.com/ikawrakow/ik_llama.cpp/discussions/223) - Recent performance testing with DeepSeek R1
| **Author** | `bitbottrap` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-02-22 |
| **Updated** | 2025-03-14 |
---
-#### Description
+## 📄 Description
I'm open to a more rigorous set of tests using accepted benchmark files. Just point me to them. I can run this periodically if it's scripted. Available are 2x24GB GPUs and 1TB of RAM on an Epyc CPU.
@@ -52,9 +53,9 @@ standard | | X | 8192 | | 126521 | 21.5 | 4.68 |
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **saood06** replied the **2025-02-23** at **01:03:00**:
+👤 **saood06** commented on **2025-02-23** at **01:03:00**
Thank you so much for these results.
@@ -73,14 +74,15 @@ in each ubatch-sized window. Only a single token sequence is used.
>The purpose of the benchmark is to visualize how the performance changes with
the context size without averaging the metrics values over the whole context.
-> 👤 **bitbottrap** replied the **2025-02-23** at **01:18:38**:
+> 👤 **bitbottrap** replied on **2025-02-23** at **01:18:38**
+>
> 500 token prompt, 300 token output.
>
> If it's scripted and the results get written to a log that I can easily post I can do this periodically while this project is relevant. I did this by hand and it was the wrong way of doing it. And I'm not sure what parameters would be most beneficial to change especially when new features are being developed / tested.
---
-👤 **saood06** replied the **2025-02-23** at **01:36:47**:
+👤 **saood06** commented on **2025-02-23** at **01:36:47**
The fairydreaming benchmark includes a script that contains a python script that generates a graph that would display multiple configurations against each other here are two examples of it's output from fairydreaming ( [1](https://preview.redd.it/o2uxzg63x3he1.png?width=989&format=png&auto=webp&s=dc2743353f3d5a86258aa51efc7e18853e3911a0) and [2](https://www.reddit.com/r/LocalLLaMA/comments/1igpwzl/paradigm_shift/mawmoq0/) )
@@ -88,13 +90,15 @@ We could tell you what configs to run and then you just pass all the jsonl outpu
Edit: Fixed image link to show PP instead of TG graph
-> 👤 **bitbottrap** replied the **2025-02-23** at **02:49:14**:
+> 👤 **bitbottrap** replied on **2025-02-23** at **02:49:14**
+>
> I'm primarily motivated by DeepSeek R1/V3 improvements right now. Being that the model is so large and the most value would probably be pushing limits of context tests take a while. I use this system during the day so I definitely can't afford to create such detailed graphs regularly. But if there were a smaller number of runs, say up to 30ish that's reasonable to run overnight by request.
+
+> 👤 **saood06** replied on **2025-02-23** at **04:59:50**
>
-> 👤 **saood06** replied the **2025-02-23** at **04:59:50**:
> >Being that the model is so large and the most value would probably be pushing limits of context tests take a while.
>
-> I understand my system is far weaker than yours (the highest PP I've seen is 11), and I've done overnight benchmarks so I do appreciate you doing this. I just created #225 for an easy to use but thorough benchmark, that will output nice graphs.
+> I understand my system is far weaker than yours (the highest PP I've seen is 11), and I've done overnight benchmarks so I do appreciate you doing this. I just created [#225](https://github.com/ikawrakow/ik_llama.cpp/issues/225) for an easy to use but thorough benchmark, that will output nice graphs.
>
> >But if there were a smaller number of runs, say up to 30ish that's reasonable to run overnight by request.
>
@@ -102,7 +106,7 @@ Edit: Fixed image link to show PP instead of TG graph
---
-👤 **ikawrakow** replied the **2025-02-23** at **05:57:41**:
+👤 **ikawrakow** commented on **2025-02-23** at **05:57:41**
Thank you for this!
@@ -120,7 +124,7 @@ Run time repacking seems to be adding 2-3 minutes to the load time. This is bett
---
-👤 **bitbottrap** replied the **2025-02-23** at **15:30:00**:
+👤 **bitbottrap** commented on **2025-02-23** at **15:30:00**
Epyc 7773X (64 cores, 128 threads), one socket, 8x128GB RAM
@@ -147,7 +151,7 @@ threads | std | flash | mla
---
-👤 **ikawrakow** replied the **2025-02-23** at **16:08:15**:
+👤 **ikawrakow** commented on **2025-02-23** at **16:08:15**
Thanks!
@@ -165,7 +169,8 @@ Run-time-repacking (rtr) does not change the mix of quantization types. `Q4_K_M`
OK, so this is Zen3, so using vanilla AVX2 implementation. If the information I find on the Internet is correct, it should have ~200 GB/s memory bandwidth. We have 37B active parameters at about 4.8 bpw for `Q4_K_M`, so about 22 GB of model weights are active, so we should be getting in the range of 8-9 t/s for TG. I wonder where is the bottleneck. I'm able to 100% saturate the memory bandwidth on a Ryzen-7950X (Zen4 core), Ryzen-5975WX (Zen3 core) and M2-Max with the models I can run.
-> 👤 **bitbottrap** replied the **2025-02-24** at **01:12:31**:
+> 👤 **bitbottrap** replied on **2025-02-24** at **01:12:31**
+>
> Good eye and thank you for challenging my assumptions. I had benchmarked mla and found that 63 threads was just fine. No large drop like flash attention. Here are the per-thread-count results for flash attention. Yes, there's a huge drop for 63:
>
> | Thread Count | Prompt Eval Time (tokens/s) | Eval Time (tokens/s) |
@@ -205,11 +210,12 @@ OK, so this is Zen3, so using vanilla AVX2 implementation. If the information I
---
-👤 **ikawrakow** replied the **2025-02-24** at **14:35:34**:
+👤 **ikawrakow** commented on **2025-02-24** at **14:35:34**
-Really curious to see what happens with PR #232.
+Really curious to see what happens with PR [#232](https://github.com/ikawrakow/ik_llama.cpp/issues/232).
-> 👤 **bitbottrap** replied the **2025-02-26** at **01:30:24**:
+> 👤 **bitbottrap** replied on **2025-02-26** at **01:30:24**
+>
> Well I see the PR is in main. If you've got a command line that works with 1 or 2 24GB GPUs I'll start it up. I'd like to fit maximum possible context in there.
>
> I see that mla with rtr is working together. Did a hand run and it sped things up. I also generated Q4_K_R4 and Q8_0_R8 quants and they also appear to speed things up. All working together too.
@@ -228,7 +234,7 @@ Really curious to see what happens with PR #232.
---
-👤 **ikawrakow** replied the **2025-02-26** at **13:08:06**:
+👤 **ikawrakow** commented on **2025-02-26** at **13:08:06**
> If you've got a command line that works with 1 or 2 24GB GPUs I'll start it up
@@ -242,15 +248,17 @@ On KV cache size: To match KTransformers, `ik_llama.cpp` must be able to handle
Of note: MLA is ~20% slower than standard attention for less than a few hundred tokens in the cache. It becomes competitive performance wise only beyond 16k tokens. With MLA there are two matrix multiplications that are extremely slow on CUDA. I'm trying to improve that but no luck so far.
-> 👤 **ikawrakow** replied the **2025-02-26** at **17:29:07**:
-> PR #234 does speed MLA, but only with a single GPU involved.
+> 👤 **ikawrakow** replied on **2025-02-26** at **17:29:07**
+>
+> PR [#234](https://github.com/ikawrakow/ik_llama.cpp/issues/234) does speed MLA, but only with a single GPU involved.
+
+> 👤 **ikawrakow** replied on **2025-02-26** at **17:33:19**
>
-> 👤 **ikawrakow** replied the **2025-02-26** at **17:33:19**:
> Oh, and adding `-fmoe` (or `-fmoe 1` with `llama-bench`) is useful too. This fuses the MoE matrix multiplications. Speedup is not dramatic, but we do get a few percent speedup for prefill and 1-2% for TG.
---
-👤 **bitbottrap** replied the **2025-03-14** at **14:54:37**:
+👤 **bitbottrap** commented on **2025-03-14** at **14:54:37**
So I was going to try and get a bunch of benchmarks with recent code and I encountered a problem using any GPU offloading. This was a feature that was working, but poorly, last time I did some hand testing.
diff --git a/github-data/discussions/242 - Switching from llama.cpp_ktransformers_ seeking advice_guidance.md b/github-data/discussions/242 - Switching from llama.cppktransformers seeking adviceguidance.md
similarity index 96%
rename from github-data/discussions/242 - Switching from llama.cpp_ktransformers_ seeking advice_guidance.md
rename to github-data/discussions/242 - Switching from llama.cppktransformers seeking adviceguidance.md
index 54e593072..e0af2bcd2 100644
--- a/github-data/discussions/242 - Switching from llama.cpp_ktransformers_ seeking advice_guidance.md
+++ b/github-data/discussions/242 - Switching from llama.cppktransformers seeking adviceguidance.md
@@ -1,13 +1,14 @@
-### 🗣️ [#242](https://github.com/ikawrakow/ik_llama.cpp/discussions/242) - Switching from llama.cpp/ktransformers, seeking advice/guidance
+## 🗣️ [Discussion #242](https://github.com/ikawrakow/ik_llama.cpp/discussions/242) - Switching from llama.cpp/ktransformers, seeking advice/guidance
| **Author** | `ThomasBaruzier` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-03-05 |
| **Updated** | 2025-03-15 |
---
-#### Description
+## 📄 Description
Hello,
@@ -27,28 +28,29 @@ Thank you very much
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-03-06** at **06:01:05**:
+👤 **ikawrakow** commented on **2025-03-06** at **06:01:05**
Is the 72 GB VRAM from 3 x 24 GB GPUs?
You setup is somewhat unusual as you "only" have 128 GB of RAM. If you want to use a ready model your only option would be the `IQ1_S` or `IQ1_M` models from Unsloth. The next step up is already too big for the 200 GB you have available.
-If you are willing to do your custom quantization, it will require a manual setup as there isn't an out-of-the-box mix to best take advantage of your amount of RAM+VRAM. I guess, I should add a similar functionality as the tensor overrides from #232 also to `llama-quantize` so people don't need to go and change the code to get the quantization mix they want.
+If you are willing to do your custom quantization, it will require a manual setup as there isn't an out-of-the-box mix to best take advantage of your amount of RAM+VRAM. I guess, I should add a similar functionality as the tensor overrides from [#232](https://github.com/ikawrakow/ik_llama.cpp/issues/232) also to `llama-quantize` so people don't need to go and change the code to get the quantization mix they want.
Once you have a model that you want to use, I think the best way to distribute the model weights between CPU RAM and GPU VRAM will be to use several `-ot` command line arguments. But to determine the regular expressions required one needs to know the quantization types (and hence sizes) of all tensors.
What is the CPU in this system?
-> 👤 **ThomasBaruzier** replied the **2025-03-06** at **14:02:48**:
+> 👤 **ThomasBaruzier** replied on **2025-03-06** at **14:02:48**
+>
> Yes, I have 3xRTX 3090 and a Ryzen 9 5950x.
>
> > If you want to use a ready model
>
> I don't mind making quants; that's why I wanted to try the 1bit R4 quants that are supposedly superior to unsloth's versions. Surprisingly, I got IQ2_XXS dynamic working with 4k context without mmap at around 3tok/s with llama.cpp thanks to efficient splitting and no GPU compute buffers by setting `-b 31` and `-ub 31`. This way, each GPU uses the exact same amount of VRAM, making use of 98-99% of the 24GB. So in theory, there is a bit of headroom to play with if I do custom quants.
>
-> > I guess, I should add a similar functionality as the tensor overrides from #232 also to llama-quantize so people don't need to go and change the code to get the quantization mix they want.
+> > I guess, I should add a similar functionality as the tensor overrides from [#232](https://github.com/ikawrakow/ik_llama.cpp/issues/232) also to llama-quantize so people don't need to go and change the code to get the quantization mix they want.
>
> This would be very useful. There was a PR on llama.cpp that accomplished this purpose but never got merged: https://github.com/ggml-org/llama.cpp/pull/6844#issuecomment-2423363813
>
@@ -58,20 +60,21 @@ What is the CPU in this system?
---
-👤 **ikawrakow** replied the **2025-03-07** at **12:00:58**:
+👤 **ikawrakow** commented on **2025-03-07** at **12:00:58**
-PR #244 has been merged, so hopefully this will help you with making your custom DeepSeekR1 quantization.
+PR [#244](https://github.com/ikawrakow/ik_llama.cpp/issues/244) has been merged, so hopefully this will help you with making your custom DeepSeekR1 quantization.
The `-b 31 -ub 31` option is a clever hack, but I expect prompt processing performance to be unacceptably low. So will be TG with any significant context (more than a few hundred tokens). Or not?
-> 👤 **ThomasBaruzier** replied the **2025-03-07** at **16:03:24**:
+> 👤 **ThomasBaruzier** replied on **2025-03-07** at **16:03:24**
+>
> This is very cool, thank you for this.
>
> I did not properly measure the performance impact of `-b 31 -ub 31`, it was a quick test. The logic was that the compute will be slower, but the model read access will be faster. Will report back.
---
-👤 **ikawrakow** replied the **2025-03-07** at **15:16:11**:
+👤 **ikawrakow** commented on **2025-03-07** at **15:16:11**
Could the following work in your 3x24 GiB VRAM + 128 GiB RAM:
@@ -83,7 +86,8 @@ Could the following work in your 3x24 GiB VRAM + 128 GiB RAM:
Oh, forgot. The tensors that go on the CPU should be quantized to the corresponding `_R4` variant. You can decide to not quantize to `*_R4` and then use run time repacking (`-rtr`) to repack to `_R4`, but this adds quite a bit of extra loading time (2-3 minutes on a 32-core EPYC).
-> 👤 **ThomasBaruzier** replied the **2025-03-07** at **17:26:56**:
+> 👤 **ThomasBaruzier** replied on **2025-03-07** at **17:26:56**
+>
> I couldn't be more grateful. I will try this custom quant as soon as the imatrix is done.
>
> Speaking of imatrix, I have some weird log outputs, am I doing something wrong?
@@ -662,7 +666,7 @@ Oh, forgot. The tensors that go on the CPU should be quantized to the correspond
---
-👤 **ikawrakow** replied the **2025-03-07** at **17:57:23**:
+👤 **ikawrakow** commented on **2025-03-07** at **17:57:23**
The NaNs are concerning. If we got NaN probabilities (logits) out of the forward pass, the imatrix will be useless (will likely have NaNs). Another way to get a NaN in the perplexity is if the predicted probability for the observed token is zero. You maybe better of getting an imatrix from somewhere else. Have you tried running the same calculation with mainline `llama.cpp`? Btw, if you want to create imatrix data yourself and have enough disk space, you can quantize to `Q8_0` (no imatrix required for that), and then use the quantized model for the imatrix calculation. You will fit 2X more layers on the GPUs, so it may be somewhat faster.
@@ -670,7 +674,8 @@ The messages about partial data are to be expected. Only 8 out of 256 experts ge
Concerning offloading specific experts: I haven't gathered statistics myself, so I don't know how useful that could be. I have seen claims around the Internet that one can gain that way (by offloading often used experts). On the other hand, this is such an obvious thing to do but has not become widely used, so my guess is that this may not be really true. The term "expert" is kind of misleading in the sense that it kind of implies that a given set of experts will be active when dealing with a given kind of context. But this is absolutely not true. If you process a paragraph of, say, 500 tokens on some specific topic, you will observe that basically all "experts" were active at least once.
-> 👤 **saood06** replied the **2025-03-09** at **03:39:15**:
+> 👤 **saood06** replied on **2025-03-09** at **03:39:15**
+>
> Slightly offtopic but, how does the imatrix command here handle the 3 attention tensors? Since there will always be one set of tensors not activated depending on how you set the mla argument and I'm not sure how the imatrix program would handle that without resorting to generating an imatrix with data for only one type of attention.
>
> > Concerning offloading specific experts: I haven't gathered statistics myself, so I don't know how useful that could be. I have seen claims around the Internet that one can gain that way (by offloading often used experts). On the other hand, this is such an obvious thing to do but has not become widely used, so my guess is that this may not be really true.
@@ -688,8 +693,9 @@ Concerning offloading specific experts: I haven't gathered statistics myself, so
> >The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.
> >[...]
> >[...] compared with the purely auxiliary-loss-based method, the auxiliary-loss-free strategy consistently achieves better model performance on most of the evaluation benchmarks
+
+> 👤 **ThomasBaruzier** replied on **2025-03-09** at **14:28:25**
>
-> 👤 **ThomasBaruzier** replied the **2025-03-09** at **14:28:25**:
> > You maybe better of getting an imatrix from somewhere else.
>
> I tried using one from [Bartowski's repo](https://huggingface.co/bartowski/DeepSeek-R1-GGUF/blob/main/DeepSeek-R1.imatrix) and [mradermacher's repo](https://huggingface.co/mradermacher/DeepSeek-R1-i1-GGUF/blob/main/imatrix.dat)
@@ -818,33 +824,40 @@ Concerning offloading specific experts: I haven't gathered statistics myself, so
> ---
>
> Finally, thanks for all the other precious explanations. I just started making the imatrix for R1 using mainline llama.cpp, brb.
+
+> 👤 **ikawrakow** replied on **2025-03-09** at **14:32:32**
>
-> 👤 **ikawrakow** replied the **2025-03-09** at **14:32:32**:
> Try adding `--ignore-imatrix-rules` to your `quantize` command.
+
+> 👤 **ThomasBaruzier** replied on **2025-03-09** at **14:46:11**
>
-> 👤 **ThomasBaruzier** replied the **2025-03-09** at **14:46:11**:
> So far so good, but the errors `did not find weights for blk.0.attn_k_b.weight` and `did not find weights for blk.0.attn_v_b.weight` are persisting across every layer quantized so far (0 though 7 for now). I don't know enough to tell, but wouldn't that mean that this is going to be equal to a non-imatrix quant?
+
+> 👤 **ikawrakow** replied on **2025-03-09** at **14:47:20**
>
-> 👤 **ikawrakow** replied the **2025-03-09** at **14:47:20**:
> Explanation: the imatrix you use has been computed with standard attention. For MLA one adds two additional tensors (` attn_v_b` and `attn_k_b`). As these were not present during the imatrix calculation, they never got data. In mainline you cannot quantize a low-bit model with such imatrix. Here you can do it by adding `--ignore-imatrix-rules` to the command.
+
+> 👤 **ikawrakow** replied on **2025-03-09** at **14:49:44**
>
-> 👤 **ikawrakow** replied the **2025-03-09** at **14:49:44**:
> > but wouldn't that mean that this is going to be equal to a non-imatrix quant
>
> Only these two tensors (in each layer) will be quantized without imatrix. I see in the log they are quantized with `Q5_0`. This is not ideal (`Q5_K` would have been better), but at 5 bits the gain from having an imatrix is quite modest.
+
+> 👤 **ikawrakow** replied on **2025-03-09** at **14:52:42**
>
-> 👤 **ikawrakow** replied the **2025-03-09** at **14:52:42**:
> If you are using the latest `ik_llama.cpp`, you can overwrite the `Q5_0` choice for these tensors by using
> ```
> --custom-q "\.attn_k_b\.weight=Q5_K,\.attn_v_b\.weight=Q5_K"
> ```
+
+> 👤 **ThomasBaruzier** replied on **2025-03-09** at **14:53:50**
>
-> 👤 **ThomasBaruzier** replied the **2025-03-09** at **14:53:50**:
> Wouldn't that mean I should be better off trying again making the imatrix myself with this repo for a higher quality result? Or, maybe, do these tensors not having any imatrix data have a negligible impact on the conversion?
>
> Edit: I guess negligible looking at your latest answers
+
+> 👤 **ThomasBaruzier** replied on **2025-03-09** at **15:27:39**
>
-> 👤 **ThomasBaruzier** replied the **2025-03-09** at **15:27:39**:
> There is an issue when adding the `custom-q` argument:
>
> `'./ik_llama.cpp/llama-quantize' --imatrix 'imatrix.dat' --token-embedding-type q8_0 --custom-q '\.attn_k_b\.weight=Q5_K,\.attn_v_b\.weight=Q5_K' --ignore-imatrix-rules 'DeepSeek-R1-F16.gguf' 'DeepSeek-R1-IQ1_S_R4.gguf' 'IQ1_S_R4' '32'`
@@ -854,16 +867,18 @@ Concerning offloading specific experts: I haven't gathered statistics myself, so
> ```
>
> Simplifying to commands like `--custom-q "\.attn_v_b\.weight=17"` or `--custom-q "test=Q4_0"` does not help. The error is thrown in .04s, before the model had a chance to be read.
+
+> 👤 **ikawrakow** replied on **2025-03-09** at **16:15:56**
>
-> 👤 **ikawrakow** replied the **2025-03-09** at **16:15:56**:
> Sorry, it is `q5_K`, to `Q5_K`. It needs to match the quantization name in `ggml.c`.
+
+> 👤 **ThomasBaruzier** replied on **2025-03-09** at **16:37:29**
>
-> 👤 **ThomasBaruzier** replied the **2025-03-09** at **16:37:29**:
> Seems to work, thanks!
---
-👤 **ikawrakow** replied the **2025-03-09** at **08:05:31**:
+👤 **ikawrakow** commented on **2025-03-09** at **08:05:31**
> Slightly offtopic but, how does the imatrix command here handle the 3 attention tensors?
@@ -871,14 +886,15 @@ You calculate the imatrix with MLA enabled (and no FA, because this skips one of
For imatrix data computed with standard attention, imatrix data for `wkv_b` apply to `wv_b` (see above). So, the only tensor left that does not have imatrix data is `wk_b`, which is the transposed version of the upper half of `wkv_b`. I don't think this is a big issue because one shouldn't be using low-bit quantization for `wk_b`, and once you go to `Q5_K` or above, there is barely any difference between quantization quality with and without imatrix.
-> 👤 **ikawrakow** replied the **2025-03-09** at **08:12:21**:
+> 👤 **ikawrakow** replied on **2025-03-09** at **08:12:21**
+>
> > It really depends on how the MoE is designed and then trained/[merged](https://github.com/arcee-ai/mergekit/blob/main/docs/moe.md). For Deepseek-V3/R1 the paper states:
>
> The paper can say many things when the day is long, but the only thing that is important is what happens in practice. What we observe in practice is that basically all experts participate in the processing of a batch containing tokens of the same topic. If that weren't true, we wouldn't be observing such a massive increase in PP performance as we increase batch and u-batch size.
---
-👤 **ThomasBaruzier** replied the **2025-03-10** at **18:19:24**:
+👤 **ThomasBaruzier** commented on **2025-03-10** at **18:19:24**
So here's what I came up with following your instructions:
@@ -1012,7 +1028,8 @@ Also, it seems that I can't use `-ot` with llama-perplexity (haven't tried with
Edit: Main GPU usage is at 25% and other cards are at 0% when generating. Is it because of the RAM speed limitations?
-> 👤 **ikawrakow** replied the **2025-03-11** at **06:33:54**:
+> 👤 **ikawrakow** replied on **2025-03-11** at **06:33:54**
+>
> I think these are very nice results!
>
> > Also, it seems that I can't use -ot with llama-perplexity (haven't tried with llama-bench)
@@ -1026,11 +1043,13 @@ Edit: Main GPU usage is at 25% and other cards are at 0% when generating. Is it
> > play with kv cache quants and optimizations (would you have any recommendations?)
>
> You are using `mla = 2`, so the only supported KV cache type is `fp16` when the computation is done on the GPU. I'm working on adding `Q8_0` to further reduce the KV cache size, but still having some issues with that. You can try adding `-fa` to see if this would increase your prompt processing speed (it shouldn't have major impact on token generation).
+
+> 👤 **ikawrakow** replied on **2025-03-11** at **06:43:37**
>
-> 👤 **ikawrakow** replied the **2025-03-11** at **06:43:37**:
> If you remove the `-fmoe`, does it still run everything on the main GPU?
+
+> 👤 **ThomasBaruzier** replied on **2025-03-11** at **16:30:22**
>
-> 👤 **ThomasBaruzier** replied the **2025-03-11** at **16:30:22**:
> Great! Thank you for all the advice, once again.
>
> It seems that I forgot a backslash, `llama-bench` and `llama-perplexity` correctly uses the `-ot` argument, oops.
@@ -1093,19 +1112,22 @@ Edit: Main GPU usage is at 25% and other cards are at 0% when generating. Is it
> When removing `-fmoe`, the GPU usage is still centralized on the main GPU, with 20-25% usage at 130-140w, while the other cards stay at 0% at ~100w.
>
> Finally, using `-fa` slows down the prompt ingestion speeds to 28tok/s. Generation seems to not be affected. I've already seen this behavior on mainline when using `fa` with CPU offloading.
+
+> 👤 **ikawrakow** replied on **2025-03-11** at **16:36:21**
>
-> 👤 **ikawrakow** replied the **2025-03-11** at **16:36:21**:
> You can add `-v` to `llama-bench` to see why it fails to load the model.
+
+> 👤 **ThomasBaruzier** replied on **2025-03-11** at **16:57:45**
>
-> 👤 **ThomasBaruzier** replied the **2025-03-11** at **16:57:45**:
> I get: `llama_model_load: error loading model: failed to allocate buffer`. Is it trying to allocate the full 128k context? There is no `-c` equivalent (other than values in `-p` and `-n`), it seems.
+
+> 👤 **ikawrakow** replied on **2025-03-11** at **18:04:04**
>
-> 👤 **ikawrakow** replied the **2025-03-11** at **18:04:04**:
> No, it should use a context given by the sum of `-p` and `-n`.
---
-👤 **ThomasBaruzier** replied the **2025-03-13** at **14:22:08**:
+👤 **ThomasBaruzier** commented on **2025-03-13** at **14:22:08**
Here are some early results for wiki.test:
IQ1_S unsloth (1.67 BPW): 5.5749 +/- 0.03545
@@ -1125,7 +1147,7 @@ Edit: I don't think it could apply here: "Slim attention is somewhat similar to
---
-👤 **ikawrakow** replied the **2025-03-13** at **15:15:04**:
+👤 **ikawrakow** commented on **2025-03-13** at **15:15:04**
> In the meantime, is there any reason why you didn't recommend your new SOTA quant types like IQ2_K, or IQ4_KSS?
@@ -1135,7 +1157,8 @@ Someone else was observing issues (NaNs) with `IQ4_KSS` and `IQ4_K` and I wasn't
Yes, I know about this paper. MLA=2 does the same thing, there is only K cache and the `V` tensor gets computed from that (in different ways, depending on context). The only difference is that with MLA one does not need to compute $W_K^{-1}$ matrix, the equivalent is provided by the DeepSeek $W_{KV}$ tensor. It sounds nice in theory, but there is the theory and than there is the practice. In practice one needs to also consider compute buffers as intermediate results need to go somewhere, and the fact that counting multiply-adds is just a very rough estimate of actual performance, which also depends on memory access patterns, matrix shapes and sizes, etc. IIRC, the main factor that made me reluctant to spend the time implementing something along these lines is the fact that the benefit mostly goes away for GQA, which most models use these days.
-> 👤 **ThomasBaruzier** replied the **2025-03-13** at **16:20:03**:
+> 👤 **ThomasBaruzier** replied on **2025-03-13** at **16:20:03**
+>
> > If you feel like experimenting with these, I would be curious to learn about their performance for DeepSeekR1
>
> I'd be happy to. I spend more time setting up my LLMs than using them anyway. Thanks for all the valuable info about the quants, this will save me hours.
@@ -1144,21 +1167,23 @@ Yes, I know about this paper. MLA=2 does the same thing, there is only K cache a
> > spend the time implementing something along these lines
>
> So what's the difference between MLA=2 and "something along these lines"?
+
+> 👤 **ikawrakow** replied on **2025-03-13** at **17:17:46**
>
-> 👤 **ikawrakow** replied the **2025-03-13** at **17:17:46**:
> > So what's the difference between MLA=2 and "something along these lines"?
>
> MLA=2 is specific to the DeepSeek attention mechanism. "Something along these lines" would be a generic implementation for any MHA model.
---
-👤 **ikawrakow** replied the **2025-03-15** at **09:31:42**:
+👤 **ikawrakow** commented on **2025-03-15** at **09:31:42**
> PPL for IQ2_XXS unsloth (size equivalent with your custom quant) and IQ1_S_R4/IQ1_M_R4 are still running.
Do you have the results now? I'm curious to know.
-> 👤 **ThomasBaruzier** replied the **2025-03-15** at **11:02:21**:
+> 👤 **ThomasBaruzier** replied on **2025-03-15** at **11:02:21**
+>
> | Quant | Size (MB) | PPL |
> |------------|-----------|-----|
> | DeepSeek-R1-UD-IQ1_S | 133,736 | 5.5749 |
diff --git a/github-data/discussions/25 - CPU prompt processing speed for large contexts.md b/github-data/discussions/25 - CPU prompt processing speed for large contexts.md
index d6cd7bb25..f669ee28b 100644
--- a/github-data/discussions/25 - CPU prompt processing speed for large contexts.md
+++ b/github-data/discussions/25 - CPU prompt processing speed for large contexts.md
@@ -1,13 +1,14 @@
-### 🗣️ [#25](https://github.com/ikawrakow/ik_llama.cpp/discussions/25) - CPU prompt processing speed for large contexts
+## 🗣️ [Discussion #25](https://github.com/ikawrakow/ik_llama.cpp/discussions/25) - CPU prompt processing speed for large contexts
| **Author** | `ikawrakow` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2024-08-22 |
| **Updated** | 2025-01-15 |
---
-#### Description
+## 📄 Description
Back in the day when open source / open weight LLMs had a very limited context window, one of the most desired features among LLM enthusiasts was a larger context window. People came up with all sorts of modifications to the RoPE operation, used (LoRA) fine tuning, etc., to increase the context window beyond the maximum context used during model training. Today we have open source / open weight models that can handle much longer contexts. E.g., LLaMA-3.1 goes up to 128k tokens, which is probably more than what one can handle with consumer grade hardware for "Inference at the Edge" (and I find it kind of funny to see the many issues opened in the `llama.cpp` repository because users did not limit the maximum context length when running `llama.cpp`, and correspondingly the model would not load because the KV-cache required for 128k tokens does not fit into their <= 24 GB VRAM).
@@ -87,9 +88,9 @@ Based on this, here are some angles of attack for improving the CPU performance
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **jart** replied the **2024-08-22** at **15:26:07**:
+👤 **jart** commented on **2024-08-22** at **15:26:07**
> ~5% spent on thread synchronization
@@ -154,7 +155,7 @@ Is this all something that'd interest you? I can easily send a PR adding it to y
---
-👤 **ikawrakow** replied the **2024-08-22** at **16:16:08**:
+👤 **ikawrakow** commented on **2024-08-22** at **16:16:08**
Hey @jart, thanks for the comments!
@@ -170,40 +171,45 @@ Ha, you had already done that! I didn't check `llamafile` and discovered this on
I don't care about MSVC, so sure. There is the MIT vs Apache-2.0 issue, but we can sort that out.
-> 👤 **jart** replied the **2024-08-22** at **18:02:15**:
+> 👤 **jart** replied on **2024-08-22** at **18:02:15**
+>
> Apple doesn't have OpenMP. So that's where my thread synchronization changes have the most impact. Right now in llama.cpp if I build it on my Apple M2 and run with `-ngl 0` for CPU mode it gets 134 tok/sec tops. But llamafile with `-ngl 0` on MacOS M2 generates text at anywhere from 150 tok/sec to 210 tok/sec depending on how much Netflix is interfering and how much I win the XNU scheduler lottery (I imagine things are consistently 200+ if Asahi Linux is used instead of XNU). On the other hand, if I use Metal GPU then it consistently generates text at 200 tok/sec.
>
> Yes, that's correct. I'm claiming that the changes you and I both made on llamafile have made M2 Ultra CPU go faster than its GPU sometimes when generating text with TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf. However if I use a larger model like Mistral 7b where the matmuls start to dominate a lot more than the sync barriers, then I can only generate 42 tok/sec and GPU does 72 tok/sec. So this is all a bit orthogonal to the goal here of huge context windows. I just wanted you to know that we did something most people would likely assume is not possible. I certainly wouldn't have, because when I started focusing on this in January I set out with the goal of making CPU at at least only 10x slower than GPU.
+
+> 👤 **jart** replied on **2024-08-22** at **18:13:48**
>
-> 👤 **jart** replied the **2024-08-22** at **18:13:48**:
> As for MIT vs. Apache 2.0 there's a lot of leeway from Mozilla to make my work available to other local AI projects under the MIT license if that's what you're using here. I'll roll up a pull request for you sometime in the next few days, that'll work smoothly on POSIX platforms.
+
+> 👤 **ikawrakow** replied on **2024-08-22** at **19:08:09**
>
-> 👤 **ikawrakow** replied the **2024-08-22** at **19:08:09**:
> > Apple doesn't have OpenMP
>
> I thought the currently recommended approach in `llama.cpp` is to `brew install libomp`, which then by default enables OpenMP? That's what I tried anyway after observing a horrible performance with the `ggml_barrier` implementation on my M2-Max laptop, but that didn't help much either, so I did end up putting in the inline assembly that fixed performance for me.
>
> But yes, for small models such as TinyLlama thread synchronization becomes really important, so I should try your barrier version.
+
+> 👤 **jart** replied on **2024-08-22** at **22:12:59**
>
-> 👤 **jart** replied the **2024-08-22** at **22:12:59**:
> I don't even know why OpenMP is there. It's a GPL-licensed library. We might as well be using Torch if we're going to link that. Goes against the very spirit of the project which is figuring these things out for ourselves.
+
+> 👤 **jart** replied on **2024-08-22** at **22:16:45**
>
-> 👤 **jart** replied the **2024-08-22** at **22:16:45**:
> Also if by libomp you mean LLVM libomp, sadly it's kind of an newer alternative and it's got none of the alpha of GNU's OpenMP runtime. Based on my own evaluation, LLVM libomp is about as fast as llama.cpp's old synchronization code, when it's applied for GGML speedups.
---
-👤 **ikawrakow** replied the **2024-08-27** at **06:31:49**:
+👤 **ikawrakow** commented on **2024-08-27** at **06:31:49**
I did try a few things on [this branch](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/kq_fused_softmax), but nothing is really working. The branch is just exploratory, absolutely not production ready, and `AVX512`-only. Given the unsatisfactory outcome, it will not get merged.
* I can get the CPU flash attention to run faster than the original (quite a bit faster for very large prompts), but it is still slower than no flash attention
* I can get a ~3% speedup for large prompts by optimizing for no-alibi and causal attention mask. But given the marginal improvement, increased complexity, and reduced generality, it does not seem worth adding.
-On the bright side, PR #27 merges "soft-capping" with soft-max. For large prompts, this leads to a significant performance boost for Gemma-2 models. At 32k tokens and Gemma-2-2b, the performance gap between GPU with flash attention and the Ryzen-7950X CPU is now "only" a factor of 45 (instead of the 53X in the above graph).
+On the bright side, PR [#27](https://github.com/ikawrakow/ik_llama.cpp/issues/27) merges "soft-capping" with soft-max. For large prompts, this leads to a significant performance boost for Gemma-2 models. At 32k tokens and Gemma-2-2b, the performance gap between GPU with flash attention and the Ryzen-7950X CPU is now "only" a factor of 45 (instead of the 53X in the above graph).
---
-👤 **ikawrakow** replied the **2024-08-30** at **15:25:30**:
+👤 **ikawrakow** commented on **2024-08-30** at **15:25:30**
OK, I have progress on [this branch](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/kq_fused_softmax). Extremely hacky and `AVX512`-only (or, more precisely, Zen4-only), totally not production ready. But I'm finally able to outperform no flash attention on my Ryzen-7950X CPU - by about 20% for context of 16k, 23% for 32k, with LLaMA-3.1-8B.
@@ -215,7 +221,7 @@ My guess is that there is still a bottleneck at 32k tokens. Based on the FA to n
---
-👤 **ikawrakow** replied the **2024-08-30** at **15:37:24**:
+👤 **ikawrakow** commented on **2024-08-30** at **15:37:24**
And here is how the raltive CPU vs GPU performance graph changes with the new CPU flash attention implementation. The FA curve is basically flat now beyond 1000 tokens, except at 32k where I suspect a bottleneck that I have not found.
@@ -224,16 +230,16 @@ And here is how the raltive CPU vs GPU performance graph changes with the new CP
---
-👤 **ikawrakow** replied the **2025-01-15** at **17:50:21**:
+👤 **ikawrakow** commented on **2025-01-15** at **17:50:21**
-There has been progress since I last wrote here, with PR #172 being the latest contribution to improving CPU prompt processing speed. The following graph is for LLaMA-3.1-8B-Instruct quantized to `IQ4_XS` (which seems a fairly popular quantization type). Tested on a Ryzen-7950X CPU. The mandatory current mainline `llama.cpp` results are for `build: 1d850433 (4488)`. The results for `ik_llama.cpp` are obtained using run-time-repacking to the corresponding 4-row interleaved variant.
+There has been progress since I last wrote here, with PR [#172](https://github.com/ikawrakow/ik_llama.cpp/issues/172) being the latest contribution to improving CPU prompt processing speed. The following graph is for LLaMA-3.1-8B-Instruct quantized to `IQ4_XS` (which seems a fairly popular quantization type). Tested on a Ryzen-7950X CPU. The mandatory current mainline `llama.cpp` results are for `build: 1d850433 (4488)`. The results for `ik_llama.cpp` are obtained using run-time-repacking to the corresponding 4-row interleaved variant.

* In mainline `llama.cpp` FA continues to be underwhelming, being handsomely outperformed by not using FA
* `ik_llama.cpp` now finally exceeds 100 t/s for a prompt of 32k tokens. I get 122 t/s (`BF16` KV-cache) and 113 t/s (`Q8_0` KV-cache). The best I could do with mainline is 37 t/s (`Q8_0` K-cache, no FA).
* I'm quite pleased that `Q8_0` KV-cache is now almost on par with `BF16`
-* `ik_llama.cpp` is almost 4 times faster than mainline at 256 tokens, and still 3.3 times faster at 32k tokens. For such large contexts the computation time is heavily dominated by the `K*Q` and `V*softmax(K*Q)` matrix multiplications, with these matrices by far exceeding L3 cache size, and hence the operation becoming memory bound. In fact, part of the improvement in PR #172 is due to reducing the number of memory loads from the `V`-cache in the FA computation.
+* `ik_llama.cpp` is almost 4 times faster than mainline at 256 tokens, and still 3.3 times faster at 32k tokens. For such large contexts the computation time is heavily dominated by the `K*Q` and `V*softmax(K*Q)` matrix multiplications, with these matrices by far exceeding L3 cache size, and hence the operation becoming memory bound. In fact, part of the improvement in PR [#172](https://github.com/ikawrakow/ik_llama.cpp/issues/172) is due to reducing the number of memory loads from the `V`-cache in the FA computation.
* If processing very long context is a significant use case, utilizing `Q8_K_R8` brings additional gains. We get 373 t/s for 512 tokens, 312 t/s at 4k, 268 t/s at 8k, 203 t/s at 16k, and 136 t/s at 32k tokens.
It is also interesting to look at the performance relative to a GPU. I'm using an RTX-4080 GPU with the same model and FA enabled. Compared to earlier plots in this thread, I have changed the plot to show the ratio of GPU to CPU prompt processing speed and have restricted the prompt length to $\ge 100$ tokens to reduce the range of the y-axis. The Ryzen-7950X now saturates at about 27.5X lower performance compared to the RTX-4080, which is not bad at all.
diff --git a/github-data/discussions/256 - Diverging from llama.cpp.md b/github-data/discussions/256 - Diverging from llama.cpp.md
index 1a7e0e011..49514f3c2 100644
--- a/github-data/discussions/256 - Diverging from llama.cpp.md
+++ b/github-data/discussions/256 - Diverging from llama.cpp.md
@@ -1,13 +1,14 @@
-### 🗣️ [#256](https://github.com/ikawrakow/ik_llama.cpp/discussions/256) - Diverging from llama.cpp
+## 🗣️ [Discussion #256](https://github.com/ikawrakow/ik_llama.cpp/discussions/256) - Diverging from llama.cpp
| **Author** | `arnfaldur` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-03-14 |
-| **Updated** | 2025-03-14 |
+| **Updated** | 2025-07-23 |
---
-#### Description
+## 📄 Description
I just discovered this fork yesterday and would like to understand the situation better. This message is addressed to @ikawrakow
@@ -37,9 +38,9 @@ I'm sorry if this is a bit much, but I think it's very important and I was hones
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-03-14** at **06:06:08**:
+👤 **ikawrakow** commented on **2025-03-14** at **06:06:08**
Hello @arnfaldur,
@@ -57,8 +58,58 @@ I'm hacking here to keep my brain utilized and to have some fun. Definitely not
---
-👤 **bitbottrap** replied the **2025-03-14** at **14:40:37**:
+👤 **bitbottrap** commented on **2025-03-14** at **14:40:37**
I completely agree that some of this stuff needs to get into llama.cpp. And I completely understand why ikawrakow does not want to be personally responsible for it.
-I'm not sure what the focus is over there in llama.cpp land but it's very active. I just don't see a lot of the core stuff being improved on like it is here.
\ No newline at end of file
+I'm not sure what the focus is over there in llama.cpp land but it's very active. I just don't see a lot of the core stuff being improved on like it is here.
+
+---
+
+👤 **Kreijstal** commented on **2025-07-23** at **05:47:51**
+
+There are 470 open pull requests on llama.cpp most which probably take months or will never be merged
+
+
+I understand upstreaming is a lot of effort, no guarantee that upstream will even like your coding standard or fix, in that case anyone is free to snoop around the code and upstream it to llama.cpp which will take months to review probably
+
+> 👤 **jeffzhou2000** replied on **2025-07-23** at **13:38:08**
+>
+> > There are 470 open pull requests on llama.cpp most which probably take months or will never be merged
+> >
+> > I understand upstreaming is a lot of effort, no guarantee that upstream will even like your coding standard or fix, in that case anyone is free to snoop around the code and upstream it to llama.cpp which will take months to review probably
+>
+> I think there might-be some problems in the upstream llama.cpp project. e.g.: lack of some necessary codes of conducts. Because the upstream llama.cpp project is a 82k+ Github starers project with developers and experts from all over the world and different backgrounds.
+
+> 👤 **ikawrakow** replied on **2025-07-23** at **13:43:46**
+>
+> @jeffzhou2000 This is not the right place to discuss your issues with the `llama.cpp` maintainers. Please stop.
+
+> 👤 **jeffzhou2000** replied on **2025-07-23** at **14:08:58**
+>
+> Thanks for your reminder. I see.
+
+---
+
+👤 **jeffzhou2000** commented on **2025-07-23** at **13:22:48**
+
+> I was very excited to discover that you were still innovating on quantizations but I'm confused as to why it's happening on a fork with little desire ([#133](https://github.com/ikawrakow/ik_llama.cpp/issues/133)) to upstream the developments. I researched the history of this fork and many of the discussions that lead to it's creation (like the curiosity about Justine's tinyBLAS doubts), but have still not found a satisfactory answer.
+
+This is a real good question and I also asked [similar question](https://github.com/ikawrakow/ik_llama.cpp/discussions/8#discussioncomment-13537984) to the author of ik_llama.cpp on June 21 2025.
+
+
+
+>
+> I would be surprised if that was the case however. Why share the work on this fork if not for others to use?
+
+I also suggested to the author several times to return to the upstream community because the author of ik_llama.cpp is a real AI expert(I even think he is the father of quantization tech in the llama.cpp project).
+
+
+>
+> As is likely evident, I think it is a big loss to the commons that these new quants and optimizations aren't available upstream.
+
+Can't agree more.
+
+> I still want to emphasize that I believe that there is a valid reason for the fork's creation and I would be very interested in hearing that reason.
+
+I also want to know the clear reason for the fork's creation.
\ No newline at end of file
diff --git a/github-data/discussions/258 - Quick-start Guide coming over from llama.cpp and ktransformers_.md b/github-data/discussions/258 - Quick-start Guide coming over from llama.cpp and ktransformers.md
similarity index 96%
rename from github-data/discussions/258 - Quick-start Guide coming over from llama.cpp and ktransformers_.md
rename to github-data/discussions/258 - Quick-start Guide coming over from llama.cpp and ktransformers.md
index 6450c8a88..ff38288a2 100644
--- a/github-data/discussions/258 - Quick-start Guide coming over from llama.cpp and ktransformers_.md
+++ b/github-data/discussions/258 - Quick-start Guide coming over from llama.cpp and ktransformers.md
@@ -1,13 +1,14 @@
-### 🗣️ [#258](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) - Quick-start Guide coming over from llama.cpp and ktransformers!
+## 🗣️ [Discussion #258](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) - Quick-start Guide coming over from llama.cpp and ktransformers!
| **Author** | `ubergarm` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-03-14 |
-| **Updated** | 2025-07-13 |
+| **Updated** | 2025-07-22 |
---
-#### Description
+## 📄 Description
`ik_llama.cpp`
===
@@ -987,9 +988,9 @@ CRASH
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ubergarm** replied the **2025-03-14** at **20:34:10**:
+👤 **ubergarm** commented on **2025-03-14** at **20:34:10**
@saood06
@@ -1001,7 +1002,8 @@ My initial impression is with the right settings it can get faster prompt proces
Looking forward to trying it with an MLA supported quant.
-> 👤 **saood06** replied the **2025-03-15** at **04:08:06**:
+> 👤 **saood06** replied on **2025-03-15** at **04:08:06**
+>
> > I trolled through some of the PRs you linked to me and pulled together this rough guide as my notes for getting started with `ik_llama.cpp`. Thanks for pointing me in the right direction.
>
> Glad I can be of help. I've seen a lot of people show interest in using ik_llama.cpp but the amount of options and the spread out documentation was a deterrent. This guide (even in it's current state) is a much better resource to give people than my explanations and links to PR's, so thank you for putting it together.
@@ -1022,16 +1024,19 @@ Looking forward to trying it with an MLA supported quant.
> I think ktransformers will outperform ik_llama.cpp without MLA for TG at higher context lengths as it uses MLA. The higher PP is nice, I wonder if the lead is still held with MLA.
>
> Also you may find https://github.com/ikawrakow/ik_llama.cpp/pull/225 useful for benchmarking.
+
+> 👤 **magikRUKKOLA** replied on **2025-07-13** at **22:39:43**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-13** at **22:39:43**:
> @saood06 please keep in mind that there is no such thing as comparing the performance of ik_llama.cpp with ktransformers. Simply because the ktransformers is using old fork of flashinfer (see 0.2.3). If simply put, you will get either crash in the sampler or the garbage output (or lost context). Yeah, I initially thought ik_llama.cpp suck because the decode speed is slower (esp. on a long context because they dont't use matrix absorption etrc.) .. but ... there is simply no way to run ktransformers with large context. ktransformers doesn't even have the --seed parameter implemented lol so each time the llm answers you you can't tell if its a right answer or its a garbage lol. ktransformers was written by script-kiddies (I looked at the code -- its awful). So please be serious.
+
+> 👤 **saood06** replied on **2025-07-13** at **22:52:02**
>
-> 👤 **saood06** replied the **2025-07-13** at **22:52:02**:
> > @saood06 please keep in mind that there is no such thing as comparing the performance of ik_llama.cpp with ktransformers. [...] So please be serious.
>
> Not sure why you are replying to old comments. I said in a later [comment](https://github.com/ikawrakow/ik_llama.cpp/discussions/258#discussioncomment-12786183) in this same discussion page, "Even then and still now I still see ktransformers as more of a performance demo because of how limited it is in what it supports both in hardware and the server/API they expose."
+
+> 👤 **magikRUKKOLA** replied on **2025-07-13** at **23:30:53**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-13** at **23:30:53**:
> > > @saood06 please keep in mind that there is no such thing as comparing the performance of ik_llama.cpp with ktransformers. [...] So please be serious.
> >
> > Not sure why you are replying to old comments. I said in a later [comment](https://github.com/ikawrakow/ik_llama.cpp/discussions/258#discussioncomment-12786183) in this same discussion page, "Even then and still now I still see ktransformers as more of a performance demo because of how limited it is in what it supports both in hardware and the server/API they expose."
@@ -1042,17 +1047,17 @@ Looking forward to trying it with an MLA supported quant.
---
-👤 **ikawrakow** replied the **2025-03-15** at **09:16:27**:
+👤 **ikawrakow** commented on **2025-03-15** at **09:16:27**
Thank you for these results.
> The biggest hurdle so far is needing a custom quant for MLA support
-#259 should remove this hurdle. With this PR models prepared with mainline `llama.cpp` can be used also with MLA enabled.
+[#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) should remove this hurdle. With this PR models prepared with mainline `llama.cpp` can be used also with MLA enabled.
---
-👤 **saood06** replied the **2025-03-16** at **03:37:18**:
+👤 **saood06** commented on **2025-03-16** at **03:37:18**
@ikawrakow
@@ -1061,7 +1066,8 @@ Thank you for these results.
Just thought you'd want to know this, manually notifying you as edit's don't trigger notifications.
-> 👤 **ubergarm** replied the **2025-03-16** at **03:58:21**:
+> 👤 **ubergarm** replied on **2025-03-16** at **03:58:21**
+>
> Yeah I managed to cobble together a quantize script and create my first quant `IQ2_K_R4` weighing in at `179G` and slightly higher perplexity that `UD-Q2_K_XL` at `212G` comparing across the first 10 perplexity data points. I saw a note about `nan` [over here too on this huggingface unsloth R1-GGUF discussion](https://huggingface.co/unsloth/DeepSeek-R1-GGUF/discussions/37#67bb416987172149b9baa34e) (can't compare against those charts as they use a custom txt file and not `wiki.test.raw`). The new quant at 32k context took an 8k prompt at ~63 tok/sec pp and gave ~11.3 tok/sec tg.
>
> Now that I see how it works better I'm rolling another one with more `q8_0`s for the less frequent layers and targeting under 256GB RAM system. At least I have enough perplexity data points to compare across these specific quants.
@@ -1069,8 +1075,9 @@ Just thought you'd want to know this, manually notifying you as edit's don't tri
> The other thing I need to dig into more is what combination of `-ctk` and `-ctv` work with what mla/amb/fmoe/fa settings. I noticed `-ctk q8_0 -ctv q8_0` works with `-mla 2 -fa -amb 2048 -fmoe` and allows 32k context to fit in 24GB VRAM comfortably. However, trying `q8_KV` and `iq4_nl` types segfaulted (didn't grab a backtrace, might be a known invalid combination).
>
> Made a lot of progress today! Hope to move on to making a CPU only optimized quant for the Intel 6980P to try (e.g. exps around `q6_k_r4` or whatever repacked quant types might be good combo of high quality and reasonably fast assuming plenty of RAM.
+
+> 👤 **saood06** replied on **2025-03-16** at **04:43:23**
>
-> 👤 **saood06** replied the **2025-03-16** at **04:43:23**:
> > Yeah I managed to cobble together a quantize script and create my first quant `IQ2_K_R4` weighing in at `179G` and slightly higher perplexity that `UD-Q2_K_XL` at `212G` comparing across the first 10 perplexity data points.
>
> I saw that and was about to write a separate comment to you, but wanted to alert ikawrakow about the NaNs first, so I'll just reply to you in this comment.
@@ -1102,8 +1109,9 @@ Just thought you'd want to know this, manually notifying you as edit's don't tri
> >is there a "dry-run" to calculate/show sizes of everything before actually doing it?
>
> There is not.
+
+> 👤 **ubergarm** replied on **2025-03-16** at **15:45:41**
>
-> 👤 **ubergarm** replied the **2025-03-16** at **15:45:41**:
> > I don't think there is any mention of this in llama.cpp issues/PR's
> Yeah, doing research in 2025 is a mind bending exercise in digging through subreddit comments, hugging face discussions, github PRs, and right I didn't even realize github "discussions" was a thing until a couple weeks ago lol.
>
@@ -1126,8 +1134,9 @@ Just thought you'd want to know this, manually notifying you as edit's don't tri
> Thanks for the info, the most recent quant I rolled last night took about 3.2 hours, so I guess it depends on the exact configuration. I don't know if the Dual Intel 6980P has enough disk space lol...
>
> I appreciate all your guidance and quality feedback!
+
+> 👤 **saood06** replied on **2025-03-17** at **03:26:23**
>
-> 👤 **saood06** replied the **2025-03-17** at **03:26:23**:
> > Yeah, doing research in 2025 is a mind bending exercise in digging through subreddit comments, hugging face discussions, github PRs, and right I didn't even realize github "discussions" was a thing until a couple weeks ago lol.
>
> I'm curious how you found out about ik_llama.cpp then. I wouldn't have mentioned it to you on the llama.cpp discussion if you hadn't (but probably still would have in your r1-ktransformers-guide as you mentioned other inference engines), but I agree the state of research (there apparently is stuff on twitter/x but I've never really touched that platform besides people referencing it on other platforms). There are also forums and other places as well that I used to check out but not really so much anymore.
@@ -1168,8 +1177,9 @@ Just thought you'd want to know this, manually notifying you as edit's don't tri
> > I appreciate all your guidance and quality feedback!
>
> I'm happy to do it, since I appreciate your guide and benchmarking.
+
+> 👤 **ubergarm** replied on **2025-03-17** at **20:36:43**
>
-> 👤 **ubergarm** replied the **2025-03-17** at **20:36:43**:
> @saood06
>
> > I'm curious how you found out about ik_llama.cpp then.
@@ -1186,15 +1196,16 @@ Just thought you'd want to know this, manually notifying you as edit's don't tri
>
> > A bit sad to see the full perplexity numbers gone from your guide, I think (not at all sure though) the values printed by the perplexity command are already a some sort of running average as I noticed the last value is always the same as the final estimate.
>
-> Oh sorry, I have that stuff in local logs but switched to the visual chart .png image to try to keep this "guide" less spammy haha... Yeah, its unclear to me if that total value it prints out (if no nans occur) is simply an average for each chunk or some other calculation, I didn't look that closely. I realize I was using `-mla 2` and `-ctk/ctv q8_0` for these calculations which is not a valid combination yet I just learned today. So take it with a grain of salt. --- I added another detail drop down with some full perplexity run logs if that is useful to you. Also just saw #261 to help with `nan` psure.
+> Oh sorry, I have that stuff in local logs but switched to the visual chart .png image to try to keep this "guide" less spammy haha... Yeah, its unclear to me if that total value it prints out (if no nans occur) is simply an average for each chunk or some other calculation, I didn't look that closely. I realize I was using `-mla 2` and `-ctk/ctv q8_0` for these calculations which is not a valid combination yet I just learned today. So take it with a grain of salt. --- I added another detail drop down with some full perplexity run logs if that is useful to you. Also just saw [#261](https://github.com/ikawrakow/ik_llama.cpp/issues/261) to help with `nan` psure.
>
>
>
> One other thing, I'm fussing a bit to see if it is possible to still use `mmap()` when using `-ot exps=CPU`? Just realized using tensor overrides disables `mmap()`. So I can't actually try my sweet new quant locally on the 9950X 96GB RAM. Somehow ktransformers `--optimize_config_path optimize_rules/DeepSeek-V3-Chat.yaml` regex seems to still allow `mmap()` for the non-GPU tensors.
>
> Finally, I'm still scratching my head a bit about the whole [CUDA graphs stuff](https://github.com/ikawrakow/ik_llama.cpp/pull/260#issuecomment-2730435639). I probably have to dig more into ktransformers code to see exactly what they are talking about there as using `ktransformers --no-use_cuda_graph` definitely slows it down about 50%...
+
+> 👤 **saood06** replied on **2025-03-17** at **22:13:36**
>
-> 👤 **saood06** replied the **2025-03-17** at **22:13:36**:
> > I'm checking to see if the three unsloth quants I have on the intel6980P CPU only rig throw `nan` with vanilla llama.cpp. If I can repo it there, then I'll check and possibly report. Though I'm out most of this tues/weds.
>
> Thanks, sorry for not wanting to make the issue myself even though I want the issue made.
@@ -1235,8 +1246,9 @@ Just thought you'd want to know this, manually notifying you as edit's don't tri
> >Also just saw https://github.com/ikawrakow/ik_llama.cpp/pull/261 to help with nan psure.
>
> That is for his custom quant types IQ_K quants (https://github.com/ikawrakow/ik_llama.cpp/discussions/8), the nans in unsloth's quant won't be helped by that.
+
+> 👤 **ubergarm** replied on **2025-03-19** at **22:59:19**
>
-> 👤 **ubergarm** replied the **2025-03-19** at **22:59:19**:
> > I'm curious about full PPL runs.
>
> Yeah, looking more I see the full run is more useful for easy comparisons than just the first N chunks.
@@ -1552,11 +1564,13 @@ Just thought you'd want to know this, manually notifying you as edit's don't tri
> > That was mentioned in the PR that implemented tensor override here
>
> Another recent PR allows for `mmap()` now so I got my quant running locally around 3 tok/sec. Get almost 4.5 when playing aroun with `-ser 5,1` - hope to do some perplexity testing with other `-ser` settings for comparison. More fun stuff!
+
+> 👤 **vaulter** replied on **2025-03-20** at **01:24:37**
>
-> 👤 **vaulter** replied the **2025-03-20** at **01:24:37**:
> Hi Guys, I've been struggling on my dual Xeon 8558 (48cores) with 768Gb RAM and Quad 3090 with Q8 (that is on lamma.cpp mainline, Q4_K_S gives me 6-7 tk/s in real world prompting) - gives me nan's, can you recommend and help to create custom quants for my situation? I would like to get best performance and ik_llama.cpp seems on the edge, I've been following this thread but might get lost in details calculating and applying custom quants logic...
+
+> 👤 **ubergarm** replied on **2025-03-20** at **03:06:51**
>
-> 👤 **ubergarm** replied the **2025-03-20** at **03:06:51**:
> @vaulter
>
> > I've been struggling on my dual Xeon 8558 (48cores) with 768Gb RAM and Quad 3090 with Q8
@@ -1602,18 +1616,20 @@ Just thought you'd want to know this, manually notifying you as edit's don't tri
> This is with `Q8_0` and vanilla `llama.cpp@main?`? When doing `llama-perplexity` or when do you see nan's?
>
> Okay, holler if you get stuck and looking forward to hearing your results! Also feel free to chat about how to make quants, I put some rough notes in this guide where I'm stumbling through the process myself haha...
+
+> 👤 **vaulter** replied on **2025-03-20** at **04:47:15**
>
-> 👤 **vaulter** replied the **2025-03-20** at **04:47:15**:
> Well, assuming nan is a token with a single D (basically the output is DDDDD...) - I'm using vannilla llama.cpp@main the same way as with Q4_K_S, it loads, and start outputting D's with out any errors, after I close session it give me tok/s stats, prompt eval is also low vs Q4_K_S at around 0,57 tok/s
> As for the ik_llama.cpp I'll try and report the results
> And I was following your other threads with Granit rapids testing - that was really helpful - so thanks for that work! @ubergarm
+
+> 👤 **vaulter** replied on **2025-03-23** at **14:04:43**
>
-> 👤 **vaulter** replied the **2025-03-23** at **14:04:43**:
> Ok here is a bit of testing - I was getting around 6-6.7 tok/s on vanilla llama and achieved 10.8 tok/s on ik_llama on 8192 context. That is Q4_K_S. I was getting assert errors so had to checkout the given branch. Currently I've done exact instructions besides I didnt isolate to 1 3090 but used all 4 - anyways its offloads whatever (not expert layers as these are CPU override) at around 11Gb on each GPUs - I'm looking in trying a single CPU 4677 motherboard with 2 dimms per channel - this will give me 768GB but for 1 NUMA node and I probably can try Q8 on it
---
-👤 **saood06** replied the **2025-03-20** at **01:47:18**:
+👤 **saood06** commented on **2025-03-20** at **01:47:18**
>Are you aware of other quants that throw nan on CPU backends?
@@ -1629,7 +1645,7 @@ Nice.
---
-👤 **saood06** replied the **2025-03-21** at **07:32:24**:
+👤 **saood06** commented on **2025-03-21** at **07:32:24**
>This is an experimental quant I rolled with q8_0 for all attention/shared experts/embeddings loaded on GPU. The rest of the MoE down exps are iq2_xs_r4 and gate/up exps are iq2_bn_r4. However, perplexity looks pretty bad. So I'll likely aim for larger sized model with higher quality quants and make-up speed/accuracy trade off exploring -ser instead of going very small quants.
@@ -1637,7 +1653,8 @@ I don't think it's the size that is the issue, iq2_bn_r4 is a bitnet quant. I br
If you are still experimenting with quant types, you might be able to improve on your Q2_K_R4 at around the same size by replacing the q2_k_r4, and q3_k_r4 which are k quants with similar sized i quants or iqk quants instead of using k quants, this PR https://github.com/ikawrakow/ik_llama.cpp/pull/85 has a really nice chart focusing on that quant range (caveat IQ3_KL is not a quant type, it is a quant recipe), and shows how the three different quant types (i, k and iqk) stack up.
-> 👤 **ubergarm** replied the **2025-03-21** at **15:38:10**:
+> 👤 **ubergarm** replied on **2025-03-21** at **15:38:10**
+>
> > iq2_bn_r4 is a bitnet quant
>
> I saw a few small bitnet quants and wanted to try it out. Okay so its not the size but the bitnet quants are not great *for non bit-net trained models*. Good to know!
@@ -1647,8 +1664,9 @@ If you are still experimenting with quant types, you might be able to improve on
> My first attempt was i quants, which are indeed quite small but seem be more CPU intensive on generation. I see, the `iqk` "non-linear" quants in the PR 85 are probably the best bang for the bit assuming I am patient enough to generate the quant. Yeah I'll do another iteration on my custom quant then with these!
>
> Thanks for taking the time to explain with references, really appreciate it!
+
+> 👤 **ubergarm** replied on **2025-03-21** at **16:39:43**
>
-> 👤 **ubergarm** replied the **2025-03-21** at **16:39:43**:
> Okie I'm cooking up one targeting a 256GB RAM + ~24GB VRAM system with `-ot exps=CPU`:
>
> #### CPU Optimized MoE Tensors
@@ -1664,13 +1682,15 @@ If you are still experimenting with quant types, you might be able to improve on
> I may try another one like this knocking the `gate/up` Tensors smaller to that `IQ1_M_R4` or even `IQ1_S_R4` to see how perplexity looks and speed on my local 9950X + 96GB RAM rig.
>
> Then I could compare against the bigger model with `-ser 6,1` perplexity and speed vs the smaller model. A lot of knobs to play with and optimize.
+
+> 👤 **saood06** replied on **2025-03-23** at **01:09:20**
>
-> 👤 **saood06** replied the **2025-03-23** at **01:09:20**:
> I see you made the IQ2_K_R4 quant, the ppl seems about the same, but performance is a bit confusing as the initial ETA is lower for IQ2_K_R4, but the Q2_K_R4 ETA was higher but it ended up finishing quicker than estimated making it faster.
>
> Any system load or anything that would cause that?
+
+> 👤 **ubergarm** replied on **2025-03-23** at **14:39:38**
>
-> 👤 **ubergarm** replied the **2025-03-23** at **14:39:38**:
> @saood06
>
> Wow, good eyes! I was wondering the same thing myself.
@@ -1690,11 +1710,13 @@ If you are still experimenting with quant types, you might be able to improve on
> Waiting for the Qwen to drop an MoE with MLA that an `iq4_k_r4` quant will fit into 96GB RAM + 24GB VRAM lmao... :crossed_fingers:
>
> Will keep you posted when I run some benchmarks!
+
+> 👤 **ikawrakow** replied on **2025-03-23** at **14:47:08**
>
-> 👤 **ikawrakow** replied the **2025-03-23** at **14:47:08**:
> PP performance is not really correlated with model size. The `IQX_K` quants are somewhat slower than k-quants for prompt processing (unpacking them to be ready for dot products is more involved). They are quite a bit faster than similarly sized i-quants (`IQ2_XXS`, `IQ2_XS`, `IQ3_S`, etc.) for PP and TG on the CPU. Here you are getting the same PPL as a model that is 5% larger, so that's pretty good.
+
+> 👤 **saood06** replied on **2025-03-23** at **14:51:46**
>
-> 👤 **saood06** replied the **2025-03-23** at **14:51:46**:
> > Waiting for the Qwen to drop an MoE with MLA that an `iq4_k_r4` quant will fit into 96GB RAM + 24GB VRAM lmao... 🤞
>
> Does WizardLM-2-8x22B or any other 8x22B interest you as that could fit, and someone tried it (albeit on llama.cpp) [here](https://github.com/ggml-org/llama.cpp/pull/11397#issuecomment-2661302167) and got good results.
@@ -1702,8 +1724,9 @@ If you are still experimenting with quant types, you might be able to improve on
> > Will keep you posted when I run some benchmarks!
>
> Thanks, I periodically check on this page as github doesn't notify on edits.
+
+> 👤 **ubergarm** replied on **2025-03-23** at **16:00:02**
>
-> 👤 **ubergarm** replied the **2025-03-23** at **16:00:02**:
> I ran a quick comparison between the `Q2_K_R4` and the `IQ2_K_R4` which do seem like the better choices for CPU inferencing over `IQ2_XS` and family.
>
> For this specific config seems like pp is slightly slower but tg is slightly faster! With basically the same perplexity and 5% smaller, these non-linear `IQ?_K_R4` do seem like a great choice for CPU inferencing.
@@ -1723,18 +1746,21 @@ If you are still experimenting with quant types, you might be able to improve on
> | IQ2_K_R4 | 226.00 GiB | tg64@pp512 | 10.32 ± 0.00 |
> | IQ2_K_R4 | 226.00 GiB | tg64@pp8192 | 9.16 ± 0.02 |
> | IQ2_K_R4 | 226.00 GiB | tg64@pp16384 | 8.10 ± 0.02 |
+
+> 👤 **saood06** replied on **2025-03-23** at **16:14:16**
>
-> 👤 **saood06** replied the **2025-03-23** at **16:14:16**:
> >With basically the same perplexity and 5% smaller, these non-linear IQ?_K_R4 do seem like a great choice for CPU inferencing.
>
> Yes, I basically always use IQK quants, and at higher bpw levels ( where I-quants do not exist) they are often a far better quality option at their size (see: the data in https://github.com/ikawrakow/ik_llama.cpp/pull/83 and https://github.com/ikawrakow/ik_llama.cpp/pull/89) which is why for models that I use in the 4.25-7 bpw range I make an IQK quant (with an imatrix).
+
+> 👤 **ikawrakow** replied on **2025-03-23** at **17:21:45**
>
-> 👤 **ikawrakow** replied the **2025-03-23** at **17:21:45**:
> > Does WizardLM-2-8x22B or any other 8x22B interest you as that could fit, and someone tried it (albeit on llama.cpp) https://github.com/ggml-org/llama.cpp/pull/11397#issuecomment-2661302167 and got good results.
>
> Quantized 8x22B is something I can run on my Ryzen-5975WX. I get `PP-512=61 t/s`, `TG-128 = 2.16 t/s` running CPU-only for the `Q4_K_M` model used in the linked post. They said that the difference between 100 t/s and 74 t/s wasn't that important, so based on that logic, I'm matching the performance of 3 GPUs for PP 😄
+
+> 👤 **ikawrakow** replied on **2025-03-23** at **18:31:20**
>
-> 👤 **ikawrakow** replied the **2025-03-23** at **18:31:20**:
> With my paltry 16 GB RTX-4080 that is in the Ryzen-7950WX box, I get `PP-512 = 80 t/s` and `TG-128 = 3.1 t/s` using
> ```
> -ot "blk\.[0-6]\.ffn=CUDA0,exps=CPU" -rtr -t 32 -ngl 100
@@ -1742,24 +1768,25 @@ If you are still experimenting with quant types, you might be able to improve on
---
-👤 **ikawrakow** replied the **2025-03-21** at **15:49:36**:
+👤 **ikawrakow** commented on **2025-03-21** at **15:49:36**
> Okay so its not the size but the bitnet quants are not currently great.
They are actually great. But they are Bitnet quants, so quants for a model that has been trained such that model weights take one of 3 possible values (-1, 0, 1). Hence, they absolutely cannot be used for normal models trained using actual floats. But that does not make them not great. The ternary quants in this repo (`IQ2_BN`, `IQ1_BN`) have, as far as I can tell, by far the fastest CPU implementation around.
-> 👤 **ubergarm** replied the **2025-03-21** at **15:51:44**:
+> 👤 **ubergarm** replied on **2025-03-21** at **15:51:44**
+>
> Okay gotchu. Yeah I picked them hoping they were fast, but given R1 was not trained as a bitnet they are not the right match for this specific case.
---
-👤 **ikawrakow** replied the **2025-03-21** at **17:26:50**:
+👤 **ikawrakow** commented on **2025-03-21** at **17:26:50**
The `iq3_k_r4/iq2_k_r4` MoE mix that you are cooking should work out to about 207 GiB for the experts (3.582 GiB per layer). It may be useful to have a few MoE layers quantized with more bits (e.g., `iq4_k_r4 for `ffn_down` and `iq3_k_r4` for `ffn_up/fate`). If you do the first 8 MoE layers like that, it will add about 11.2 GiB to the weights stored on the CPU.
---
-👤 **anikifoss** replied the **2025-04-08** at **16:39:03**:
+👤 **anikifoss** commented on **2025-04-08** at **16:39:03**
@ubergarm huge thanks for this guide! Any chance you could publish the DeepSeek-R1_Q2_K_R4 quant described here?
@@ -1791,7 +1818,7 @@ Would love to get my hands on the DeepSeek-R1_Q2_K_R4 quant!
---
-👤 **ubergarm** replied the **2025-04-08** at **17:07:44**:
+👤 **ubergarm** commented on **2025-04-08** at **17:07:44**
Heya @anikiforovopensource , I appreciate the feedback, its been great working with tools provided by the great developers to push the envelope! Glad you have found some of this useful
@@ -1818,7 +1845,7 @@ Cheers and good luck, sounds like you have a great rig to experiment!
---
-👤 **ikawrakow** replied the **2025-04-08** at **17:43:47**:
+👤 **ikawrakow** commented on **2025-04-08** at **17:43:47**
> Switching from CPU only inferencw with ollama to CPU+GPU inferece with ik_llama resulted in a 5x inference speedup.
@@ -1826,7 +1853,7 @@ Where are my 136k stars 😃
---
-👤 **fredlas** replied the **2025-04-08** at **18:50:04**:
+👤 **fredlas** commented on **2025-04-08** at **18:50:04**
Has something changed with how llama-quantize wants the `--custom-q` flag to be formatted? I'm trying to follow the example, but it won't accept most of the types there. As far as I can tell it only wants to accept "classic" types like q8_0, not q5_k.
@@ -1835,13 +1862,13 @@ Specifically, it gives me e.g.
---
-👤 **ikawrakow** replied the **2025-04-08** at **18:57:45**:
+👤 **ikawrakow** commented on **2025-04-08** at **18:57:45**
There have been no changes related to custom quants. Can you post your full command? `llama-quantize` error messages can be misleading sometimes.
---
-👤 **fredlas** replied the **2025-04-08** at **19:04:38**:
+👤 **fredlas** commented on **2025-04-08** at **19:04:38**
Sure! I arrived at:
```
@@ -1862,7 +1889,7 @@ It also doesn't like q6_k, but is ok with q4_0. I dug around a little, but `ggml
---
-👤 **ikawrakow** replied the **2025-04-08** at **19:10:47**:
+👤 **ikawrakow** commented on **2025-04-08** at **19:10:47**
Oh, this is Kawrakow-style usability at its best!
@@ -1870,12 +1897,13 @@ The "K" in k-quants need to be capitalized. So, `q5_K`, not `q5_k`.
This applies only to `q2_K, q3_K, q4_K, q5_K, q6_K`. In the other cases (`iq4_k`, etc.) it is small `k`.
-> 👤 **fredlas** replied the **2025-04-08** at **19:19:28**:
+> 👤 **fredlas** replied on **2025-04-08** at **19:19:28**
+>
> Oh man, thanks. I actually tried different capitalizations, but hadn't gone as far as mixing them!
---
-👤 **anikifoss** replied the **2025-04-08** at **22:32:55**:
+👤 **anikifoss** commented on **2025-04-08** at **22:32:55**
Ok, I run the benchmarks, results are below. System: 7975wx with FCLK=2100 , 768G RAM at 5600MHz, RTX 5090.
@@ -2240,10 +2268,12 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
-> 👤 **ikawrakow** replied the **2025-04-09** at **05:49:42**:
+> 👤 **ikawrakow** replied on **2025-04-09** at **05:49:42**
+>
> @saood06 You said somewhere that KTransformers was the fastest toolkit for DeepSeek inference. This is not faster?
+
+> 👤 **ubergarm** replied on **2025-04-09** at **17:03:08**
>
-> 👤 **ubergarm** replied the **2025-04-09** at **17:03:08**:
> @anikiforovopensource
>
> Oh great, thanks for the results! Double thanks for exact logs! That looks about right to me. Here are a few observations:
@@ -2284,13 +2314,15 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
> > @saood06 You said somewhere that KTransformers was the fastest toolkit for DeepSeek inference. This is not faster?
>
> I haven't used ktransformers in over a month since finding `ik_llama.cpp`, but my [last ktransformers benchmarks](https://github.com/ubergarm/r1-ktransformers-guide?tab=readme-ov-file#discussions) on very similar hardware suggest ik is potentially faster or at least on-par with ktransformers speed.
+
+> 👤 **ikawrakow** replied on **2025-04-09** at **17:25:51**
>
-> 👤 **ikawrakow** replied the **2025-04-09** at **17:25:51**:
> > ik is potentially faster or at least on-par with ktransformers speed.
>
> So, where are my 13k stars? One also has a longer context and better quantization options available...
+
+> 👤 **saood06** replied on **2025-04-10** at **03:54:09**
>
-> 👤 **saood06** replied the **2025-04-10** at **03:54:09**:
> > @saood06 You said somewhere that KTransformers was the fastest toolkit for DeepSeek inference. This is not faster?
>
> I said something to that tune on Feb 19, ik_llama.cpp has improved a lot since then. Even then and still now I still see ktransformers as more of a performance demo because of how limited it is in what it supports both in hardware and the server/API they expose.
@@ -2302,14 +2334,16 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
> >One also has a longer context and better quantization options available...
>
> I find this repo amazing, and it is full of options, but popularity and quality aren't linked. Your bitnet implementation is far better than the popular Microsoft one, but the Microsoft one (which also has 13k stars), is far better known.
+
+> 👤 **ikawrakow** replied on **2025-04-10** at **06:51:50**
>
-> 👤 **ikawrakow** replied the **2025-04-10** at **06:51:50**:
> > I felt like I could have posted about it and gotten strong reception but I never did because I wasn't sure if you wanted this project to be popular.
>
-> I'm not necessarily looking for popularity (as you say, the correlation between popularity and quality is not very strong), but KTransformers copying code from here without acknowledgement (see #319) does rub me the wrong way. You can for sure post about that. And I'm now thinking that if this repository was better known, perhaps they wouldn't do it so blatantly. They do acknowledge to have taken the CPU implementation from `llamafile`, but `llamafile` is not a competitor (doesn't even support DeepSeek models), while `ik_llama.cpp` definitely is.
+> I'm not necessarily looking for popularity (as you say, the correlation between popularity and quality is not very strong), but KTransformers copying code from here without acknowledgement (see [#319](https://github.com/ikawrakow/ik_llama.cpp/issues/319)) does rub me the wrong way. You can for sure post about that. And I'm now thinking that if this repository was better known, perhaps they wouldn't do it so blatantly. They do acknowledge to have taken the CPU implementation from `llamafile`, but `llamafile` is not a competitor (doesn't even support DeepSeek models), while `ik_llama.cpp` definitely is.
+
+> 👤 **saood06** replied on **2025-04-10** at **08:19:34**
>
-> 👤 **saood06** replied the **2025-04-10** at **08:19:34**:
-> > I'm not necessarily looking for popularity (as you say, the correlation between popularity and quality is not very strong), but KTransformers copying code from here without acknowledgement (see #319) does rub me the wrong way. You can for sure post about that.
+> > I'm not necessarily looking for popularity (as you say, the correlation between popularity and quality is not very strong), but KTransformers copying code from here without acknowledgement (see [#319](https://github.com/ikawrakow/ik_llama.cpp/issues/319)) does rub me the wrong way. You can for sure post about that.
>
> I saw that discussion, and I wasn't really happy with it either, but that isn't the sort of thing I would post about. My potential posts were more feature/performance highlights.
>
@@ -2320,15 +2354,16 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
> > but llamafile is not a competitor (doesn't even support DeepSeek models), while ik_llama.cpp definitely is.
>
> I really don't see the different inference engine as competitors, they just serve different niches.
+
+> 👤 **ubergarm** replied on **2025-04-10** at **21:51:50**
>
-> 👤 **ubergarm** replied the **2025-04-10** at **21:51:50**:
> @anikiforovopensource
>
-> One last quick tip, if you want to sacrifice some quality in exchange for extra speed add `-ser 6,1` to your command. Details on that feature are in [PR#239](https://github.com/ikawrakow/ik_llama.cpp/pull/239).
+> One last quick tip, if you want to sacrifice some quality in exchange for extra speed add `-ser 6,1` to your command. Details on that feature are in [PR[#239](https://github.com/ikawrakow/ik_llama.cpp/issues/239)](https://github.com/ikawrakow/ik_llama.cpp/pull/239).
---
-👤 **anikifoss** replied the **2025-04-11** at **15:36:54**:
+👤 **anikifoss** commented on **2025-04-11** at **15:36:54**
@ubergarm I incorporated some of your suggestions and re-run the benchmark.
@@ -2724,13 +2759,14 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
-> 👤 **ubergarm** replied the **2025-04-11** at **19:03:52**:
+> 👤 **ubergarm** replied on **2025-04-11** at **19:03:52**
+>
> @anikiforovopensource
>
> Hey very nice, I appreciate how thorough you are!
>
> 1. Interesting that `-ctk f16` is faster while only adding about 1GiB of VRAM @ 32k context as compared to `-ctk q8_0`. I'll keep that in mind for how I'm running, given I might prefer the extra speed over extra context in some configs.
-> 2. Aye, great job finding and offloading a few more layers into VRAM. This is exactly the right approch. I just learned some tips about which layers might be best to offload from @ikawrakow [here on Discussion #323](https://github.com/ikawrakow/ik_llama.cpp/discussions/323#discussioncomment-12802730).
+> 2. Aye, great job finding and offloading a few more layers into VRAM. This is exactly the right approch. I just learned some tips about which layers might be best to offload from @ikawrakow [here on Discussion [#323](https://github.com/ikawrakow/ik_llama.cpp/issues/323)](https://github.com/ikawrakow/ik_llama.cpp/discussions/323#discussioncomment-12802730).
> 3. You could collapse the override tensor command in your logs using regex e.g. either of these two I tested which are equivalent:
> ```
> # its okay if you have excessive stuff that doesn't match e.g. layer 61,62,63,...,69
@@ -2763,13 +2799,15 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
> If there is demand for it I might try to release a couple with slightly reduced shared experts / attention to fit longer context in 24GB VRAM. If things go well and I still have access to these remote rigs from https://level1techs.com, I def plan to hopefully release something assuming R2 is similar architecture.
>
> Thanks again!
+
+> 👤 **saood06** replied on **2025-04-12** at **04:17:40**
>
-> 👤 **saood06** replied the **2025-04-12** at **04:17:40**:
> >I prefer to run R1 instead of V3, so I currently don't have the quant to utilize more RAM.
>
> If you have the capability, I would recommend making your own quants, that way you can optimally make them exactly to your system specs.
+
+> 👤 **anikifoss** replied on **2025-04-12** at **21:09:33**
>
-> 👤 **anikifoss** replied the **2025-04-12** at **21:09:33**:
> I fixed some cooling issues with the system and re-run benmarks with `ser`. Also run perplexity.
>
> Perplexity for `unsloth/DeepSeek-R1-UD-Q2_K_XL` (not plotting, becuse `ser` failed, and the `ctk` results are indistinguishable when plotted):
@@ -2783,31 +2821,35 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
> Benchmark results (system: 7975wx with FCLK=2100 , RAM at 5600MHz, RTX 5090):
> 
> 
+
+> 👤 **anikifoss** replied on **2025-04-12** at **21:12:10**
>
-> 👤 **anikifoss** replied the **2025-04-12** at **21:12:10**:
> @saood06 thanks, I'll try making my own quant targetting 32G VRAM. I could use some tips on how to validate it :)
+
+> 👤 **anikifoss** replied on **2025-04-13** at **23:50:25**
>
-> 👤 **anikifoss** replied the **2025-04-13** at **23:50:25**:
> @ubergarm I tested `DeepSeek-R1-UD-IQ1_S` quant, and it turns out to be slower than `DeepSeek-R1-UD-Q2_K_XL`. It looks like the `IQ` quants are generally slower than the corresponding `Q` quants, and even slower than larger `Q` quants!
> 
+
+> 👤 **ikawrakow** replied on **2025-04-14** at **08:58:02**
>
-> 👤 **ikawrakow** replied the **2025-04-14** at **08:58:02**:
> i-quants tend to be slower than k-quants (the only exceptions being `IQ4_XS` and `IQ4_KS`). Their advantage is that they tend to achieve better quality for the same number of bits spent than k-quants. In the case where this leads to being able to fully fit the model on the GPU this results in a clear performance advantage. But when using partial GPU offload, then yes, k-quants will tend to give better performance.
---
-👤 **ikawrakow** replied the **2025-04-14** at **09:05:56**:
+👤 **ikawrakow** commented on **2025-04-14** at **09:05:56**
> Interesting that -ctk f16 is faster while only adding about 1GiB of VRAM @ 32k context as compared to -ctk q8_0. I'll keep that in mind for how I'm running, given I might prefer the extra speed over extra context in some configs.
This is only true when attention is computed on the GPU (on the GPU `fp16` is king). But for CPU-only inference, or for hybrid inference where for whatever reason the attention ops involving the KV cache are run on the CPU, `q8_0` KV-cache will outperform `fp16` by a significant margin.
-> 👤 **anikifoss** replied the **2025-04-14** at **15:19:35**:
+> 👤 **anikifoss** replied on **2025-04-14** at **15:19:35**
+>
> It's interesting to see how applying one optimization immediately moves the bottleneck somewhere else, running these models is pushing the hardware limits in different ways.
---
-👤 **Dampfinchen** replied the **2025-04-14** at **18:52:59**:
+👤 **Dampfinchen** commented on **2025-04-14** at **18:52:59**
Hello, I have a question. I'm using a laptop 2060 and I'm trying to speed up partial offloading for Gemma 3 12B.
@@ -2817,10 +2859,12 @@ I think its --override-tensor but I don't know the specific command. I tried f
What is the command for doing that? Thank you!
-> 👤 **ikawrakow** replied the **2025-04-14** at **19:16:16**:
+> 👤 **ikawrakow** replied on **2025-04-14** at **19:16:16**
+>
> Can you give more details? (quantization used, if any, commands used here and in mainline). It is hard to diagnose and give suggestions based on the provided information.
+
+> 👤 **Dampfinchen** replied on **2025-04-14** at **19:35:17**
>
-> 👤 **Dampfinchen** replied the **2025-04-14** at **19:35:17**:
> > Can you give more details? (quantization used, if any, commands used here and in mainline). It is hard to diagnose and give suggestions based on the provided information.
>
> Apologies, I was retesting it again and your build is indeed faster. Is this the expected speedup? I'm asking because I don't know if I'm putting the token embeddings on the GPU correctly. The commands below look MoE specific.
@@ -2843,15 +2887,17 @@ What is the command for doing that? Thank you!
> ```
>
> My hardware is Core i7 9750H, RTX 2060 6 GB, 32 GB RAM.
+
+> 👤 **Dampfinchen** replied on **2025-04-14** at **19:52:41**
>
-> 👤 **Dampfinchen** replied the **2025-04-14** at **19:52:41**:
> I've found the culprit of the slowdown of my previous test. It was Flash Attention. This is the performance with -fa (everything else is the same)
>
> `prompt eval time = 30858.00 ms / 10025 tokens ( 3.08 ms per token, 324.88 tokens per second) | print_timings] generation eval time = 100601.17 ms / 170 runs ( 591.77 ms per token, 1.69 tokens per second)`
>
> Token generation is significantly slower with -fa, PP is a bit faster.
+
+> 👤 **ikawrakow** replied on **2025-04-15** at **05:33:24**
>
-> 👤 **ikawrakow** replied the **2025-04-15** at **05:33:24**:
> There is now a Gemma3 12B MoE model? Or are you using [this one](https://huggingface.co/google/gemma-3-12b-it)? If the latter, the `--override-tensor "down_exps=CUDA0,gate_exps=CUDA0,up_exps=CUDA0"` command line option does nothing as there are no tensors in that model where their names match the regular expressions you have specified.
>
> On my computer (Ryzen-5975WX with RTX-4080) running the command you used for `llama.cpp` (i.e., 16 layers offloaded to the GPU, 6 CPU threads) gives me about 10 t/s.
@@ -2910,8 +2956,9 @@ What is the command for doing that? Thank you!
> llama_new_context_with_model: graph nodes = 1400
> llama_new_context_with_model: graph splits = 486
> ```
+
+> 👤 **Dampfinchen** replied on **2025-04-15** at **09:00:59**
>
-> 👤 **Dampfinchen** replied the **2025-04-15** at **09:00:59**:
> Hi! Thank you very much for your detailed reply and testing, I appreciate it a lot!
>
> With this command:
@@ -2928,11 +2975,12 @@ What is the command for doing that? Thank you!
> So I've tried all sorts of combinations and your commands of course, but I'm unable to get decent performance out of it. So far the best performance I've got is with koboldcpp (a llama.cpp wrapper). There with the same configuration and prompt I'm getting 3.2 token/s text gen and 350 token/s pp so I will be switching back to that. For some reason I can use FA there, too.
>
> My laptop is pretty old so that's the best it can do it appears. Still, thank you very much for your helpful replies.
+
+> 👤 **ikawrakow** replied on **2025-04-15** at **10:09:30**
>
-> 👤 **ikawrakow** replied the **2025-04-15** at **10:09:30**:
> If we exchange a few more messages, eventually I will know what your use case is 😃
>
-> I have pushed PR #330 to allow using `Q8_0` KV cache for Gemma models on CUDA.
+> I have pushed PR [#330](https://github.com/ikawrakow/ik_llama.cpp/issues/330) to allow using `Q8_0` KV cache for Gemma models on CUDA.
>
> If you pull that one, and then use
> ```
@@ -2969,8 +3017,9 @@ What is the command for doing that? Thank you!
> | 512 | 128 | 9728 | 0.457 | 1120.02 | 11.031 | 11.60 |
>
> `llama-sweep-bench` performs a series of prompt processing batches (size 512 in this case) followed by TG (128 tokens in this case). The KV cache is not cleared, so the `N_KV` columns tells you how many tokens were in the KV cache when the PP/TG was processed.
+
+> 👤 **ikawrakow** replied on **2025-04-15** at **10:11:39**
>
-> 👤 **ikawrakow** replied the **2025-04-15** at **10:11:39**:
> And here is what I get with more traditional `llama.cpp` style benchmarking:
>
> | model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s |
@@ -2982,8 +3031,9 @@ What is the command for doing that? Thank you!
> | gemma3 12B Q4_0 | 7.20 GiB | 12.77 B | CUDA | 100 | 6 | q8_0 | q8_0 | 1 | pp8192 | 1320.41 ± 4.75 |
> | gemma3 12B Q4_0 | 7.20 GiB | 12.77 B | CUDA | 100 | 6 | q8_0 | q8_0 | 1 | pp10240 | 1288.77 ± 4.29 |
> | gemma3 12B Q4_0 | 7.20 GiB | 12.77 B | CUDA | 100 | 6 | q8_0 | q8_0 | 1 | pp10000+tg240 | 355.33 ± 0.02 |
+
+> 👤 **Dampfinchen** replied on **2025-04-15** at **12:48:11**
>
-> 👤 **Dampfinchen** replied the **2025-04-15** at **12:48:11**:
> Hi! Of course, I will be glad. I'm sure it's exciting for you too to work with such a low spec consumer system! :) Thank you for reacting so fast!
>
> With your new PR, I get fast prompt processing speeds again at good VRAM usage. (I had to set one more layer to the CPU to not overspill into RAM) This is the result of your benchmark:
@@ -3018,21 +3068,25 @@ What is the command for doing that? Thank you!
>
> `ik_quantkv\ik_llama.cpp\ggml\src\ggml-cuda\rope.cu:370: GGML_ASSERT(src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16) failed
> `
+
+> 👤 **ikawrakow** replied on **2025-04-15** at **12:59:06**
>
-> 👤 **ikawrakow** replied the **2025-04-15** at **12:59:06**:
> Well, RoPE can indeed only take `f16` or `f32` tensors. The very same assert is present in mainline as well. Are there any shenanigans being played (such as undoing RoPE for context shifting)?
+
+> 👤 **Dampfinchen** replied on **2025-04-15** at **13:09:20**
>
-> 👤 **Dampfinchen** replied the **2025-04-15** at **13:09:20**:
> With mainline I'm not getting this error, but yes I'm pretty sure llama.cpp is using context shifting as a default. In ST there's also token padding.
>
> So llama.cpp is probably using ctx shift while your build uses RoPE, could that be it?
+
+> 👤 **ikawrakow** replied on **2025-04-15** at **13:47:32**
>
-> 👤 **ikawrakow** replied the **2025-04-15** at **13:47:32**:
> > With mainline I'm not getting this error
>
> But are you using quantized KV cache with mainline? It is very slow, no?
+
+> 👤 **Dampfinchen** replied on **2025-04-15** at **14:55:51**
>
-> 👤 **Dampfinchen** replied the **2025-04-15** at **14:55:51**:
> > > With mainline I'm not getting this error
> >
> > But are you using quantized KV cache with mainline? It is very slow, no?
@@ -3045,15 +3099,16 @@ What is the command for doing that? Thank you!
> srv update_slots: all slots are idle`
>
> As you can see, Quant KV Cache + FA and Gemma 3 is completely unsuable with mainline llama.cpp. However, it doesn't throw the error that I've mentioned above.
+
+> 👤 **ikawrakow** replied on **2025-04-15** at **15:00:53**
>
-> 👤 **ikawrakow** replied the **2025-04-15** at **15:00:53**:
> > However, it doesn't throw the error that I've mentioned above.
>
> This is interesting. I'll need to investigate. It is not that I couldn't implement RoPE for `Q8_0` quantized tensors, but something else has changed and I need to understand what (which is not easy as the two code bases have not much left in common).
---
-👤 **cmoncure** replied the **2025-05-13** at **01:48:14**:
+👤 **cmoncure** commented on **2025-05-13** at **01:48:14**
Alright. I want to put down some baseline numbers. I've built a system with EPYC 9175F and 768 GB @5600, with 2x RTX 6000 Ada Generation for 96 GB VRAM. Due to my dumb ass and inexperience with this kind of hardware, I'm running without GPUs and RAM is configured at 3600 for the time being.
@@ -3073,21 +3128,24 @@ With 8000 context PP drops to ~30t/s.
I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winking_eye:; my use case requires trawling through a lot of context. I'll check back in when I get GPU working and RAM at expected speed.
-> 👤 **saood06** replied the **2025-05-13** at **03:44:10**:
+> 👤 **saood06** replied on **2025-05-13** at **03:44:10**
+>
> >RTR seems to have a huge impact.
>
-> Yes this is because the quant you pulled is optimized for hybrid inference, see #272/#274 for ways to convert it to be CPU optimized (if you plan to keep using it CPU only), if you want to be able to avoid the load times of `-rtr`, but if you plan on using it with your GPU than the quant is already made for that and you just need to use the correct `--override-tensor` for it.
+> Yes this is because the quant you pulled is optimized for hybrid inference, see [#272](https://github.com/ikawrakow/ik_llama.cpp/issues/272)/[#274](https://github.com/ikawrakow/ik_llama.cpp/issues/274) for ways to convert it to be CPU optimized (if you plan to keep using it CPU only), if you want to be able to avoid the load times of `-rtr`, but if you plan on using it with your GPU than the quant is already made for that and you just need to use the correct `--override-tensor` for it.
>
> > my use case requires trawling through a lot of context.
>
> Just a reminder that parallel inference exists and can help get more overall throughput if your use case can allow for it.
+
+> 👤 **cmoncure** replied on **2025-05-13** at **12:17:53**
>
-> 👤 **cmoncure** replied the **2025-05-13** at **12:17:53**:
> Yes, I'm generating actionable intelligence by analyzing hundreds of documents in a batch. I have a huge demand for input tokens (i.e. PP) and not very high output tokens, probably a ratio of 1000 to 1. A typical run would look like 100,000 input tokens, 100 output tokens.
> I've never done parallel inference before. How would it work in this hardware/software configuration?
> In practice I think I'll end up pre-processing with a smaller model like Gemma to extract only the relevant tokens from the documents, but...
+
+> 👤 **ubergarm** replied on **2025-05-13** at **14:40:10**
>
-> 👤 **ubergarm** replied the **2025-05-13** at **14:40:10**:
> Thanks for the report and glad you're getting some better results already before optimizing your system.
>
> For this specific [ubergarm/DeepSeek-V3-0324-IQ4_K_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF) quant all of the attention tensors are Q8_0 (not repacked) and the rest is already repacked. Keep in mind if you put a repacked quant offloaded to your GPU is is just "expensive RAM" as the GPU is not processing it.
@@ -3106,8 +3164,9 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> Cheers and keep us posted!
>
> (NOTE: the `--parallel` works very well for smaller models so keep in mind if you're using one to pre-process faster to speed that up as well. You could probably some decent size models in full GPU offload with 96GB VRAM for max speed)
+
+> 👤 **cmoncure** replied on **2025-05-13** at **16:00:06**
>
-> 👤 **cmoncure** replied the **2025-05-13** at **16:00:06**:
> Now I just have more questions about how --parallel works.
>
> 1. The models AFAIK have a maximum context window size. Suppose a model has a window of 8192 tokens. Can I load it with --ctx-size 81920 and --parallel 10 and get ten slots of 8192, keeping each slot under the maximum window size, and everything would be fine?
@@ -3116,8 +3175,9 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> 4. I doubt I'm covering new ground with this question, but do we know anything about the utilization of the individual experts in e.g. DeepSeek V3? Are they routed equally or are some preferred over others, in which case we'd presumably want to offload the preferred experts to GPU? I suppose the stochastic training process would result in uniform routing but who knows??
>
> Thank you all very much for your attention!
+
+> 👤 **ubergarm** replied on **2025-05-13** at **16:49:14**
>
-> 👤 **ubergarm** replied the **2025-05-13** at **16:49:14**:
> > 1. ... Can I load it with --ctx-size 81920 and --parallel 10 and get ten slots of 8192, keeping each slot under the maximum window size, and everything would be fine?
>
> That is my understanding.
@@ -3142,26 +3202,31 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> I don't think it is possible to simply say "oh I'm doing coding, so I know those experts live on layer 23 so I'll offload that to GPU) no. It is not that simple. When I don't have enough RAM and am using mmap() I just let the linux kernel page cache handle keeping the "most hot" data in RAM, despite this it is constantly paging almost 6GB/s off my NVMe drive even for "all coding" example.
>
> Enjoy the ride! You have a sweet rig, have fun getting it dialed in for your use case!
+
+> 👤 **cmoncure** replied on **2025-05-14** at **00:25:32**
>
-> 👤 **cmoncure** replied the **2025-05-14** at **00:25:32**:
> Okay. Got my RAM configured at 4800 MT/s but this does not result in any improvement. PP still small.
> TG went from 7 t/s to 8.5 t/s in the same scenario.
> I'll have my GPUs online in the next couple of days.
+
+> 👤 **saood06** replied on **2025-05-14** at **01:49:43**
>
-> 👤 **saood06** replied the **2025-05-14** at **01:49:43**:
> > Okay. Got my RAM configured at 4800 MT/s but this does not result in any improvement. PP still small.
>
> PP is compute bound, TG is bandwidth bound (at a batch size of 1).
+
+> 👤 **cmoncure** replied on **2025-05-14** at **14:32:33**
>
-> 👤 **cmoncure** replied the **2025-05-14** at **14:32:33**:
> An expensive lesson to learn
+
+> 👤 **ubergarm** replied on **2025-05-14** at **14:52:46**
>
-> 👤 **ubergarm** replied the **2025-05-14** at **14:52:46**:
> @cmoncure
>
> Get those GPUs online, more of the iqX_k quants just got faster on CUDA: https://github.com/ikawrakow/ik_llama.cpp/pull/417 !!
+
+> 👤 **cmoncure** replied on **2025-05-14** at **20:23:32**
>
-> 👤 **cmoncure** replied the **2025-05-14** at **20:23:32**:
> OK so I've hit a roadblock. I got GPU 1 online.
> I'm running now with the following options:
>
@@ -3237,8 +3302,9 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> ```
> INFO [ print_timings] prompt eval time = 381183.86 ms / 25820 tokens ( 14.76 ms per token, 67.74 tokens per second) | tid="133195110408192" timestamp=1747256663 id_slot=0 id_task=478 t_prompt_processing=381183.863 n_prompt_tokens_processed=25820 t_token=14.76312405112316 n_tokens_second=67.73634066455746
> ```
+
+> 👤 **ubergarm** replied on **2025-05-14** at **21:33:55**
>
-> 👤 **ubergarm** replied the **2025-05-14** at **21:33:55**:
> Good job getting the next step going. Each GPU has 48GB VRAM right (i'm using the same two cards on a remote rig I have access to for now).
>
> ## tl;dr;
@@ -3267,8 +3333,9 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> -m ~/AIModels/textgen/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf
> ```
> I knocked it down to 8192 just so you can get a quick result to see how it works. Increase as desired given however much time you want to wait benchmarking.
+
+> 👤 **cmoncure** replied on **2025-05-15** at **00:39:16**
>
-> 👤 **cmoncure** replied the **2025-05-15** at **00:39:16**:
> Here's the result with many rows removed. Looks like this TG performance is competitive, matching the scores on the Q2 quant above even though it's running Q4 here.
>
> | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
@@ -3294,13 +3361,15 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> | 512 | 128 | 36864 | 10.533 | 48.61 | 10.374 | 12.34 |
> | 512 | 128 | 40960 | 11.020 | 46.46 | 10.505 | 12.19 |
> | 512 | 128 | 47616 | 11.734 | 43.63 | 10.709 | 11.95 |
+
+> 👤 **saood06** replied on **2025-05-15** at **01:39:57**
>
-> 👤 **saood06** replied the **2025-05-15** at **01:39:57**:
> > Here's the result with many rows removed.
>
> You can use the bundled python script for visualizations if you want. Also [llama-batched-bench](https://github.com/ikawrakow/ik_llama.cpp/tree/main/examples/batched-bench) exists (with many knobs) if you want to see how batched performance differs.
+
+> 👤 **ikawrakow** replied on **2025-05-15** at **05:13:33**
>
-> 👤 **ikawrakow** replied the **2025-05-15** at **05:13:33**:
> > In my 700 tokens scenario, I now reach 74 t/s PP and 14 t/s TG. However... during PP the GPU utilization is nearly zero as reported by nvidia-smi. During TG it's around 33%. It seems like something is misconfigured or the GPU is starved for work?
>
> The experts are computed on the CPU, hence the GPU sits idle while the CPU is computing. For PP this leads to nearly zero GPU utilization.
@@ -3322,9 +3391,10 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
>
> If your PCI-E is 30 GB/s, then u-batch=4096 PP will become `4096/(10.2 + 12) = 184.5 t/s`.
>
-> If your use case is such that you cannot use large batches, then as you can see from the above table it is better to not offload the experts computation to the GPU. This is accomplished either by using `*_R4` quants, or by adding `-op 26,0,27,0,29,0` to the command line (see #405, which adds the ability to explicitly control which operations are offloaded to the GPU).
+> If your use case is such that you cannot use large batches, then as you can see from the above table it is better to not offload the experts computation to the GPU. This is accomplished either by using `*_R4` quants, or by adding `-op 26,0,27,0,29,0` to the command line (see [#405](https://github.com/ikawrakow/ik_llama.cpp/issues/405), which adds the ability to explicitly control which operations are offloaded to the GPU).
+
+> 👤 **cmoncure** replied on **2025-05-15** at **14:15:02**
>
-> 👤 **cmoncure** replied the **2025-05-15** at **14:15:02**:
> Thanks for writing, and engaging with my very shaky mental model of how all this works.
>
> Continuing the napkin math, then with 2 GPUs, I have twice the PCI-E TX bandwidth. Can't we interleave experts- upload experts 0, 2... to GPU 0 and experts 1, 3... to GPU 1, cutting `time offload` in half? Overall, at small batch sizes, where `time offload` dominates, this should result in a <2x PP speedup, approaching 1.5x at 4096.
@@ -3332,13 +3402,15 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> I don't know how PP actually works, though. Do experts have to be consulted sequentially, or randomly so that the next expert is not known until the current computation is finished? Is there state that gets acted on by each consecutive expert, or can all computation results be concatenated at the end? I should look at the code.
>
> I'm getting more storage installed so I can ~play with~ experiment on different quants and make my own.
+
+> 👤 **ikawrakow** replied on **2025-05-15** at **15:08:02**
>
-> 👤 **ikawrakow** replied the **2025-05-15** at **15:08:02**:
> Which experts are needed and have to be uploaded is not known until the experts are needed (the very last op before they are needed determines which experts are active). But in a batch, each token in the batch activates different experts. So, at the end, basically all experts are needed and one needs to upload them all.
>
> There is also a heuristic to not offload experts to the GPU if the batch size is less than 32 - this is important for TG (where batch size is 1). So, when generating tokens one-by-one, the experts are running on the CPU.
+
+> 👤 **cmoncure** replied on **2025-05-15** at **18:21:55**
>
-> 👤 **cmoncure** replied the **2025-05-15** at **18:21:55**:
> Okay so how TF do the big boys do it? Last I checked they don't have GPUs with 600 GB of VRAM either. Does it all just come down to PCI-E vs. SXM bandwidth? They can just shove the experts in and out of the GPUs faster than we can and that's it??
>
> I don't understand how batching works. Can you validate my axioms here?
@@ -3355,8 +3427,9 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> 8. Expert Es[j+1] cannot be met by a token before expert Es[j] is met by that token.
>
> How does batching work, then? When you say "batching" in regards to prompt processing, are you referring to behavior that is controlled in the code by the `n_batch` and `n_ubatch` parameters?
+
+> 👤 **cmoncure** replied on **2025-05-16** at **19:03:57**
>
-> 👤 **cmoncure** replied the **2025-05-16** at **19:03:57**:
> I'm going to assume that token and expert processing during PP is fully parallelizable, i.e. tokens do not have to be processed in order and tokens do not have to meet experts in any order.
>
> Is a quant where row-interleaved layers are duplicated with non-row-interleaved layers possible? Does row-interleaving change the calculations?
@@ -3377,27 +3450,31 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> The idea would be to equalize the time spent in CPU compute with GPU upload + GPU compute so each finishes in the same time. The 3/1/1 split is just a guess. Per-unit GPU utilization can be increased by, ironically, adding more GPUs since I/O and VRAM are the limiting factor.
>
> Or should I just buy big boy hardware :vomiting_face:
+
+> 👤 **cmoncure** replied on **2025-05-17** at **01:28:04**
>
-> 👤 **cmoncure** replied the **2025-05-17** at **01:28:04**:
> Please bear with me as I learn LLMs 101 in public. Grok informs me that the results of expert calculations are combined as a weighted sum which as we all know is commutative, validating that the tokens can meet the experts in any order. Hopefully Grok is not misinformed on this point.
>
> It occurs to me that if we have enough VRAM per GPU to store TWO sets of the necessary buffers for expert calculation, then we can _pipeline_ and eliminate the GPU compute time term from the processing time estimate. Since TX and RX are symmetric on PCI-E, uploading experts and downloading results won't interfere with one another, and with two buffers we can compute an expert simultaneously with uploading the next one.
>
> We ought to be able to achieve an optimization somewhere between 3x CPU-only performance, and 2x I/O-limited GPU-only performance. Right???
+
+> 👤 **cmoncure** replied on **2025-05-17** at **02:34:50**
>
-> 👤 **cmoncure** replied the **2025-05-17** at **02:34:50**:
> In fact. Forget about CPU for PP. PCI-E 4.0 x16 is supposed to be 32 GB/s symmetric. So let's say 30 GB/s following the above scenario. It would therefore require 6 seconds per GPU to offload each half of the experts, and 5.1 seconds to do each half of the compute. Doesn't that mean with two such GPUs and pipelining offload and compute we can consume the entire model's worth of layers in 6 seconds per batch of 4096 tokens?
> Surely that has to be a more ideal way to run a huge model like DeepSeek on (kinda-)commodity hardware.
> I'd gladly take 6 seconds as a lower bound on prompt processing if it meant prefilling 30,000 tokens in 48 seconds instead of 480.
>
> I guess the only question is whether a hybrid model could then permit us to do TG at the current rate on CPU.
+
+> 👤 **ikawrakow** replied on **2025-05-17** at **04:50:17**
>
-> 👤 **ikawrakow** replied the **2025-05-17** at **04:50:17**:
> > It would therefore require 6 seconds per GPU to offload each half of the experts, and 5.1 seconds to do each half of the compute. Doesn't that mean with two such GPUs and pipelining offload and compute we can consume the entire model's worth of layers in 6 seconds per batch of 4096 tokens?
>
> This is known as tensor parallelism (TP) or, in the `llama.cpp` world, as split mode (SM) "row" (as opposed to SM "layer"). Unfortunately SM "row" does not work for MoE models. Not here and also not in mainline `llama.cpp`. There are LLM inference frameworks that support TP (e.g., [vLLM](https://github.com/vllm-project/vllm), [sglang](https://github.com/sgl-project/sglang)), but I'm not sure if/how well they support your use case with partial GPU offload. Somebody compared `ik_llama.cpp` to vLLM on a 16 x 3090 system with a model that fully fits in VRAM, and vLLM was only about 20% faster than `ik_llama.cpp` despite using 8-way TP.
+
+> 👤 **cmoncure** replied on **2025-05-17** at **19:59:53**
>
-> 👤 **cmoncure** replied the **2025-05-17** at **19:59:53**:
> Thank you very much for your comment. I must be confused about something. There is an inherent difficulty to speak accurately about these things when there are really three competing vocabularies- the mathematical vocabulary of the model architecture, that of the code implementation of llama.cpp and GGUF, and the flawed, simplistic abstractions in my mind that I approach the topic with. (I think "blk" is roughly equivalent to "layer"?)
>
> I will try to describe some real and some hypothetical execution models for prompt processing, incrementally increasing the level of parallelism, and will you please note at which case execution becomes impossible and why?
@@ -3487,18 +3564,21 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> 7. Goto 2 until PP done.
>
> Question: A single batch can be processed in parallel between devices, with layers/blk/experts split between devices? This must be possible, if "layers" are "experts", and if "tokens can meet experts in any order". If it is not possible, there must be some constraint or entanglement that is beyond my shallow understanding of the model architecture or its implementation, or there is slippage in the vocabulary I'm using to describe the entities in the domain.
+
+> 👤 **cmoncure** replied on **2025-05-20** at **01:12:34**
>
-> 👤 **cmoncure** replied the **2025-05-20** at **01:12:34**:
> I brought GPU0 and GPU1 online and tried to split layers among them and it was dog slow. Forget.
> Adding `--numa isolate` to the commandline gave about a 10% performance boost (my CPU has 1 core per CCD).
> Now 82 PP/13.5 TG.
>
> Just answer me this- if I shell out for the 48 core version of my (16 core) CPU, will PP scale to roughly 3x?
+
+> 👤 **ikawrakow** replied on **2025-05-20** at **04:24:24**
>
-> 👤 **ikawrakow** replied the **2025-05-20** at **04:24:24**:
> Can you share your command line that resulted in dog slow performance with 2 GPUs? With that I can give you a more informed answer to your question about expected performance increase with a 48-core CPU.
+
+> 👤 **ubergarm** replied on **2025-05-20** at **14:44:57**
>
-> 👤 **ubergarm** replied the **2025-05-20** at **14:44:57**:
> @cmoncure
>
> Sorry I didn't comprehend all the "Case A, B, C...F" stuff above as it was too dense to read.
@@ -3510,14 +3590,16 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> > the 16-core EPYC 9175F uses 16 CPU dies, each with one core per die active. This results in 32 MB L3 cache per core.
>
> If I didn't already mention it, can you configure your BIOS to `NPS1` to present a single NUMA node for all 768GB RAM? Having 16 NUMA nodes (one for each CCD / CORE) would probably be bad for performance. In general if I *must* run across multiple NUMA nodes I generally use `numactl --interleave=all llama-server --numa distribute ...`
+
+> 👤 **cmoncure** replied on **2025-05-22** at **20:22:55**
>
-> 👤 **cmoncure** replied the **2025-05-22** at **20:22:55**:
> [Hybrid LLM execution models.pdf](https://github.com/user-attachments/files/20400023/Hybrid.LLM.execution.models.pdf)
>
> Okay, I illustrated it. Hope it makes things more clear.
> And yes I did NPS1. Thanks!
+
+> 👤 **ubergarm** replied on **2025-05-23** at **15:02:22**
>
-> 👤 **ubergarm** replied the **2025-05-23** at **15:02:22**:
> @cmoncure
>
> > (I think "blk" is roughly equivalent to "layer"?)
@@ -3536,8 +3618,9 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
> Sorry, I appreciate the image but I don't understand what you're asking? Are you asking "what is the best way to run a particular LLM on my specific hardware with ik_llama.cpp right now?" ?
>
> In general just try some things out and A/B test with llama-sweep-bench to see what is faster and keep iterating. See what commands other folks are using and what they say is faster/better. Sorry I don't have more motivation for this big question.
+
+> 👤 **cmoncure** replied on **2025-05-23** at **17:06:40**
>
-> 👤 **cmoncure** replied the **2025-05-23** at **17:06:40**:
> > what you're asking?
>
> I'll restate the thread of discussion from the beginning.
@@ -3564,14 +3647,15 @@ I'm actually okay with this TG, but I gotta get my PP up :stuck_out_tongue_winki
---
-👤 **VinnyG9** replied the **2025-05-13** at **19:02:29**:
+👤 **VinnyG9** commented on **2025-05-13** at **19:02:29**
can you please add to the guide: llama-sweep-bench
where it came from?
where does it live?
what does it feed on?
-> 👤 **ubergarm** replied the **2025-05-13** at **19:26:14**:
+> 👤 **ubergarm** replied on **2025-05-13** at **19:26:14**
+>
> The guide is missing a lot of things as this fork has been moving pretty quickly. Your best bet in general is to search closed PRs for more details.
>
> Regarding llama-sweep-bench:
@@ -3596,7 +3680,7 @@ what does it feed on?
---
-👤 **bart2** replied the **2025-05-20** at **06:11:45**:
+👤 **bart2** commented on **2025-05-20** at **06:11:45**
Thanks for putting this guide together! I have to say ik_llama.cpp has been a great experience so far for me:
- much faster than llama.cpp on a hybrid CPU+GPU setup
@@ -3640,10 +3724,12 @@ Is there any way to squeeze a larger context size out of this hardware, while ma
Thanks for any help and for working on this!
-> 👤 **ikawrakow** replied the **2025-05-20** at **06:16:57**:
+> 👤 **ikawrakow** replied on **2025-05-20** at **06:16:57**
+>
> Can you post the part of the log where it tells you what the CUDA buffer sizes are?
+
+> 👤 **bart2** replied on **2025-05-20** at **06:23:01**
>
-> 👤 **bart2** replied the **2025-05-20** at **06:23:01**:
> I saw two sections of the log mentioning CUDA buffer sizes (with different values):
> ```
> llm_load_tensors: offloading 61 repeating layers to GPU
@@ -3665,11 +3751,13 @@ Thanks for any help and for working on this!
> llama_new_context_with_model: CUDA1 compute buffer size = 16985.55 MiB
> llama_new_context_with_model: CUDA_Host compute buffer size = 3829.80 MiB
> ```
+
+> 👤 **ikawrakow** replied on **2025-05-20** at **07:11:05**
>
-> 👤 **ikawrakow** replied the **2025-05-20** at **07:11:05**:
> The CUDA compute buffers are unexpectedly large for this command line. Can you replace `-mla 3` with `-mla 1` and post the compute buffer sizes with that? The TG speed should be about the same. The PP performance will decrease (with the performance degradation increasing with number of tokens in the KV cache), but just to see what happens.
+
+> 👤 **bart2** replied on **2025-05-20** at **07:17:54**
>
-> 👤 **bart2** replied the **2025-05-20** at **07:17:54**:
> CUDA buffer sizes with `-mla 1`:
> ```
> llm_load_tensors: CPU buffer size = 205716.00 MiB
@@ -3688,8 +3776,9 @@ Thanks for any help and for working on this!
> llama_new_context_with_model: CUDA1 compute buffer size = 13635.05 MiB
> llama_new_context_with_model: CUDA_Host compute buffer size = 3829.80 MiB
> ```
+
+> 👤 **bart2** replied on **2025-05-20** at **07:30:07**
>
-> 👤 **bart2** replied the **2025-05-20** at **07:30:07**:
> PP, TG timings with `-mla 1`:
> ```
> INFO [ print_timings] prompt eval time = 22153.91 ms / 1800 tokens ( 12.31 ms per token, 81.25 tokens per second) | tid="135099310661632" timestamp=1747725975 id_slot=0 id_task=0 t_prompt_processing=22153.908 n_prompt_tokens_processed=1800 t_token=12.307726666666666 n_tokens_second=81.24977317771655
@@ -3698,25 +3787,31 @@ Thanks for any help and for working on this!
> ```
>
> Prompt processing speed degradation is not too bad. I'll try to find the new maximum context size now.
+
+> 👤 **bart2** replied on **2025-05-20** at **07:57:03**
>
-> 👤 **bart2** replied the **2025-05-20** at **07:57:03**:
> `DeepSeek-R1-UD-Q2_K_XL` now seems to load fine with `--ctx-size 131072` :) I wonder if RoPE scaling can work here as well... :)
+
+> 👤 **saood06** replied on **2025-05-20** at **08:00:51**
>
-> 👤 **saood06** replied the **2025-05-20** at **08:00:51**:
> Try adding `DGGML_SCHED_MAX_COPIES=1` to your build process.
+
+> 👤 **bart2** replied on **2025-05-20** at **08:03:37**
>
-> 👤 **bart2** replied the **2025-05-20** at **08:03:37**:
> @saood06, what kind of improvement can I expect to see after building with that option?
+
+> 👤 **saood06** replied on **2025-05-20** at **08:10:27**
>
-> 👤 **saood06** replied the **2025-05-20** at **08:10:27**:
> See https://github.com/ggml-org/llama.cpp/pull/11397#issuecomment-2645971721 but it may lower memory.
>
> I can see `pipeline parallelism enabled (n_copies=4)` in your output.
+
+> 👤 **ikawrakow** replied on **2025-05-20** at **08:12:57**
>
-> 👤 **ikawrakow** replied the **2025-05-20** at **08:12:57**:
> I don't understand the massive CUDA compute buffer size. Can someone running a similar setup chime in?
+
+> 👤 **bart2** replied on **2025-05-20** at **08:17:28**
>
-> 👤 **bart2** replied the **2025-05-20** at **08:17:28**:
> wow, building with `-DGGML_SCHED_MAX_COPIES=1` really reduced VRAM usage:
> ```
> llama_kv_cache_init: CUDA0 KV buffer size = 2448.02 MiB
@@ -3731,13 +3826,15 @@ Thanks for any help and for working on this!
> That's with `--ctx-size 131072`.
>
> Testing the model performance now.
+
+> 👤 **saood06** replied on **2025-05-20** at **08:22:50**
>
-> 👤 **saood06** replied the **2025-05-20** at **08:22:50**:
> >wow, building with -DGGML_SCHED_MAX_COPIES=1 really reduced VRAM usage:
>
> Glad to hear it helped you.
+
+> 👤 **bart2** replied on **2025-05-20** at **08:27:48**
>
-> 👤 **bart2** replied the **2025-05-20** at **08:27:48**:
> > > wow, building with -DGGML_SCHED_MAX_COPIES=1 really reduced VRAM usage:
> >
> > Glad to hear it helped you.
@@ -3745,8 +3842,9 @@ Thanks for any help and for working on this!
> Thanks for pointing it out :) `-mla 1` from @ikawrakow also helped a lot!
>
> Now with all this available VRAM, is there any way to go beyond 128k context size with Deepseek R1?
+
+> 👤 **saood06** replied on **2025-05-20** at **08:30:20**
>
-> 👤 **saood06** replied the **2025-05-20** at **08:30:20**:
> > > > wow, building with -DGGML_SCHED_MAX_COPIES=1 really reduced VRAM usage:
> > >
> > >
@@ -3755,11 +3853,13 @@ Thanks for any help and for working on this!
> > Thanks for pointing it out :) `-mla 1` from @ikawrakow also helped a lot!
>
> You might be able to go back to `-mla 3` now and get back the PP performance?
+
+> 👤 **ikawrakow** replied on **2025-05-20** at **08:30:44**
>
-> 👤 **ikawrakow** replied the **2025-05-20** at **08:30:44**:
> You can now go back to `-mla 3` and see the compute buffer sizes. Then you know how much VRAM you have left. Most likely you can go to the claimed max. context size of 163k tokens. There may be even some space left for offloading some of the experts to the GPUs.
+
+> 👤 **ubergarm** replied on **2025-05-20** at **14:51:54**
>
-> 👤 **ubergarm** replied the **2025-05-20** at **14:51:54**:
> In addition to above recommendations, if you have configured BIOS to set each socket as a single NUMA node e.g. `SNC=Disable` (on recent intel systems), you could also try adding numactl and using more threads for PP than TG like so:
>
> ```
@@ -3767,8 +3867,9 @@ Thanks for any help and for working on this!
> ```
>
> On intel Xeon in my limited experience the optimal number of threads for PP is larger than for TG.
+
+> 👤 **bart2** replied on **2025-05-21** at **02:26:30**
>
-> 👤 **bart2** replied the **2025-05-21** at **02:26:30**:
> @ubergarm thanks. I did disable NUMA in BIOS. With the options you suggested I'm getting ~10% faster PP:
> ```
> INFO [ print_timings] prompt eval time = 18652.78 ms / 1800 tokens ( 10.36 ms per token, 96.50 tokens per second) | tid="135194909810688" timestamp=1747793997 id_slot=0 id_task=0 t_prompt_processing=18652.778 n_prompt_tokens_processed=1800 t_token=10.362654444444443 n_tokens_second=96.50037115114972
@@ -3777,8 +3878,9 @@ Thanks for any help and for working on this!
> ```
>
> That's with `--ctx-size 163840`.
+
+> 👤 **ubergarm** replied on **2025-05-21** at **14:34:29**
>
-> 👤 **ubergarm** replied the **2025-05-21** at **14:34:29**:
> @bart2
>
> > That's with `--ctx-size 163840`.
@@ -3788,8 +3890,9 @@ Thanks for any help and for working on this!
> I'm not sure on sapphire rapids intel xeon, but your BIOS may also have some kind of `Opportunistic Snoop Broadcast (OSB)` mode which reportedly can give better performance for CPU/RAM inferencing: https://github.com/ikawrakow/ik_llama.cpp/discussions/201#discussioncomment-13214852
>
> Finally, while `-ser 5,1` improves speed, have you found any noticible loss in generation quality? Just curious.
+
+> 👤 **bart2** replied on **2025-05-22** at **05:30:42**
>
-> 👤 **bart2** replied the **2025-05-22** at **05:30:42**:
> @ubergarm, thanks for those pointers!
>
> As for `-ser 5,1`, I did see some quality degradation, while the speed improvement wasn't very substantial, so I decided to stop using it.
@@ -3827,8 +3930,9 @@ Thanks for any help and for working on this!
> ```
>
> Does my `-ot` regex look reasonable? Is there anything else I could try to speed up token generation?
+
+> 👤 **ubergarm** replied on **2025-05-22** at **13:49:02**
>
-> 👤 **ubergarm** replied the **2025-05-22** at **13:49:02**:
> @bart2
>
> I'm not 100% on the best `-ot` options for DeepSeek, but you will want to put those lines with CUDAx *before* the one with CPU as the regex are applied in order. So maybe something like:
@@ -3848,7 +3952,7 @@ Thanks for any help and for working on this!
---
-👤 **cfelicio** replied the **2025-05-25** at **02:35:57**:
+👤 **cfelicio** commented on **2025-05-25** at **02:35:57**
Hi Everyone,
@@ -3874,7 +3978,8 @@ G:\ik_llama>llama-bench.exe --model "G:\Qwen3-235B-A22B-128K-Q8_0-00001-of-00006
Any suggestions are appreciated! :-)
-> 👤 **ubergarm** replied the **2025-05-25** at **15:50:57**:
+> 👤 **ubergarm** replied on **2025-05-25** at **15:50:57**
+>
> Hey glad you got it going on your system. Thanks a lot for the detailed explanation of the BIOS settings as I don't have access to intel xeon BIOS. I had never heard of "node interleaving" option and just assumed that dual socket intel had no equivalent of AMD `NPS0` to present a single numa node for *both* sockets.
>
> Right, I watched a good deep dive on AMD Epyc server BIOS on level1techs youtube recently and the AMD engineers basically said "don't use NPS0 unless your workload is not optimized at all" and that is basically the case for all CPU inferencing engines so even though aggregate RAM bandwidth goes down it will likely be the fastest for now.
@@ -3891,8 +3996,9 @@ Any suggestions are appreciated! :-)
> 7. Finally, you might consider going with a Q4 model or rolling your own iq4_ks model as having smaller weights will likely speed up TG with similar PP (or slightly slower depending on exact quant). I know you have enough RAM to hold the big models, but it might be worth it for you to get a little more speed given you have no GPU at all.
>
> Have fun tweaking!
+
+> 👤 **cfelicio** replied on **2025-05-28** at **17:55:56**
>
-> 👤 **cfelicio** replied the **2025-05-28** at **17:55:56**:
> Thanks for providing such a detailed reply, this has been super helpful! I ended up spending some more time on this, and wanted to share my results:
>
> 1 - Windows turned out to be a big limitation, as it is not possible to control NUMA behavior the same way as you can in Linux. I also tried Proxmox, but could not figure out how to reach the maximum bandwidth in a Linux VM. I ended up installing Debian on bare metal, and easily got close to 200gb / s doing the Intel MLC test, with 2 numa nodes
@@ -3918,11 +4024,12 @@ Any suggestions are appreciated! :-)
---
-👤 **cmoncure** replied the **2025-06-01** at **19:34:56**:
+👤 **cmoncure** commented on **2025-06-01** at **19:34:56**
What's the easiest method to produce a file that simply applies the --runtime-repack transformation to an existing GGUF? I can run DeepSeek at Q_8 but the startup time is a killer.
-> 👤 **ubergarm** replied the **2025-06-01** at **19:47:13**:
+> 👤 **ubergarm** replied on **2025-06-01** at **19:47:13**
+>
> > What's the easiest method to produce a file that simply applies the --runtime-repack transformation to an existing GGUF?
>
> I ran it once a few months ago but lost my logs and my rigs are tied up at the moment. Someone was asking me on reddit too: https://www.reddit.com/r/LocalLLaMA/comments/1kb97ys/comment/mvg837s/
@@ -3930,11 +4037,13 @@ What's the easiest method to produce a file that simply applies the --runtime-re
> If you want to repack *everything* for CPU inferencing, it is basically `./build/bin/llama-quantize --repack inputmodel outputmodel` but I haven't tested so let me know once u figure it out and I'll try to update the guide/model card with a reference and let that guy on reddit know.
>
> There is an option for regex matching if you only want to repack some tensors, check out `./build/bin/llama-quantize --help` or the code for more deets.
+
+> 👤 **saood06** replied on **2025-06-02** at **00:49:12**
>
-> 👤 **saood06** replied the **2025-06-02** at **00:49:12**:
-> #274 and #272 are where you can find more details about this.
+> [#274](https://github.com/ikawrakow/ik_llama.cpp/issues/274) and [#272](https://github.com/ikawrakow/ik_llama.cpp/issues/272) are where you can find more details about this.
+
+> 👤 **ubergarm** replied on **2025-06-02** at **14:33:27**
>
-> 👤 **ubergarm** replied the **2025-06-02** at **14:33:27**:
> Thanks @saood06 I couldn't find my old logs for this but apparently I'd buried a command in a detail fold over two months ago. So @cmoncure probably something like this would work if you want to repack all the attn/shexp layers to optimize for running *without any GPU*:
>
> ```
@@ -3946,13 +4055,14 @@ What's the easiest method to produce a file that simply applies the --runtime-re
> ```
>
> Then you should be able to start up with mmap() and no longer need to wait for `-rtr`. Let me know if that works for you!
+
+> 👤 **ciprianveg** replied on **2025-06-02** at **14:53:10**
>
-> 👤 **ciprianveg** replied the **2025-06-02** at **14:53:10**:
> Thank you, I will try it this evening and let you know. Much appreciated.
---
-👤 **sousekd** replied the **2025-06-24** at **13:48:04**:
+👤 **sousekd** commented on **2025-06-24** at **13:48:04**
Hi everyone,
@@ -5438,10 +5548,12 @@ I have NPS0 set in BIOS, and "LLC as NUMA domain (ACPI SRAT L3 Cache as NUMA dom
Anyway, just wanted to say "thanks" and share my excitement 💯.
Any tips, insights or discussion would be welcome.
-> 👤 **cmoncure** replied the **2025-06-25** at **22:33:44**:
+> 👤 **cmoncure** replied on **2025-06-25** at **22:33:44**
+>
> Great post. Your perf results track with my similar system (EPYC 9175F), with your PP about 1.3x bigger than mine at low context, I guess due to having 32 cores to my 16. All your remarks about command line flags impact on performance track with my observations. I don't know how to make it run faster so I will just recommend that applying a permanent repack to the quant is fairly easy and straightforward so consider it when you're bored of waiting for -rtr.
+
+> 👤 **sousekd** replied on **2025-06-27** at **23:10:41**
>
-> 👤 **sousekd** replied the **2025-06-27** at **23:10:41**:
> Okay, so after spending several hours benchmarking and trying various stuff to little effect, I managed to squeeze out slightly better results. Here's what I did:
>
> 1. Disabled **"LLC as NUMA domain (ACPI SRAT L3 Cache as NUMA domain)"** in the BIOS.
@@ -7176,26 +7288,30 @@ Any tips, insights or discussion would be welcome.
> The numbers look great! That said, looking around, I feel like I should be able to get slightly better results with 12 channels of DDR5-6400 😄. OCCT reports RAM bandwidth at **598 GB/s read**, **427 GB/s write**, and **136.82 ns latency**.
>
> I’d love to hear what more experienced people here think - @ubergarm?
+
+> 👤 **saood06** replied on **2025-06-28** at **00:34:39**
>
-> 👤 **saood06** replied the **2025-06-28** at **00:34:39**:
> >Switched the build to clang-cl
> >-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON
>
> Do you mind telling me how much these two changes matter?
+
+> 👤 **sousekd** replied on **2025-06-28** at **00:51:40**
>
-> 👤 **sousekd** replied the **2025-06-28** at **00:51:40**:
> > Do you mind telling me how much these two changes matter?
>
> Close to *not-at-all*, at least in my testing. AI insisited to use LVVM/clang instead of MSVC and as a good citizen, I obliged. The same applies to `DCMAKE_INTERPROCEDURAL_OPTIMIZATION`. I think most of the improvements were caused simply by playing with `-b` and `-ub`. I did not manage to get @ubergarm's models play well with `-ub` higher then default (without OOM on my system), but even change in `-b` made some difference.
+
+> 👤 **saood06** replied on **2025-06-28** at **01:19:59**
>
-> 👤 **saood06** replied the **2025-06-28** at **01:19:59**:
> > > Do you mind telling me how much these two changes matter?
> >
> > Close to _not-at-all_, at least in my testing. AI insisited to use LVVM/clang instead of MSVC and as a good citizen, I obliged.
>
> Thanks for confirming, that aligns with my previous testing. I also experimented with GGML_LTO on Windows (MSVC) and found that it caused issues, hadn't tried it with the other compilers (clang, gcc).
+
+> 👤 **ubergarm** replied on **2025-06-28** at **16:20:27**
>
-> 👤 **ubergarm** replied the **2025-06-28** at **16:20:27**:
> @sousekd
>
> Thanks for the detailed report and many iterations to search out the best performance for your rig. Yes, my models can be a bit slower than mainline quants given I tend to use bigger tensors for the GPU offload portion which leads to a little better perplexity and KLD scores for a given GiB size class.
@@ -7229,8 +7345,9 @@ Any tips, insights or discussion would be welcome.
> If you can get this running, check amount of VRAM used with `nvidia-smi` etc and then you could possibly increase `-amb 256` or add a little more context back to max it out.
>
> Good luck!
+
+> 👤 **sousekd** replied on **2025-06-28** at **17:53:29**
>
-> 👤 **sousekd** replied the **2025-06-28** at **17:53:29**:
> Thank you @ubergarm for the great tips to try, and for helping people here and around the web :). I’ll give it a try once I’m back from my holiday.
>
> Do you find my pp/tg numbers as expected, or do you think the machine should be able to do better? I think I saw your Threadripper PRO 7965WX numbers somewhere and thought the higher memory bandwidth of EPYC should help achieve even better results.
@@ -7238,13 +7355,15 @@ Any tips, insights or discussion would be welcome.
> I’m perfectly happy with these numbers and grateful to @ikawrakow and other contributors to ik_llama, but improving pp speed would unlock even more use cases.
>
> I have another 4090 and a 5090 in my other PC, and one of them will be moved to this server to get more VRAM. I’m also considering buying an RTX 6000, but I’m not at all sure how much it would actually help with these huge models not fitting in VRAM anyway. Could you elaborate based on your knowledge and experience, please? Thank you very much!
+
+> 👤 **saood06** replied on **2025-06-28** at **23:43:05**
>
-> 👤 **saood06** replied the **2025-06-28** at **23:43:05**:
> >If you can get this running, check amount of VRAM used with nvidia-smi etc
>
> From my experience for watching usage (split by CUDA, 3d, video decode etc.) and memory usage (shared and dedicated) task manager is pretty good on windows.
+
+> 👤 **sousekd** replied on **2025-07-09** at **07:12:18**
>
-> 👤 **sousekd** replied the **2025-07-09** at **07:12:18**:
> Back from holiday, I added another GPU to the server, expecting the extra VRAM would only help. Turns out I was totally wrong - using both GPUs actually *hurt* performance. Clearly, I've got a lot more to learn 🙂. PCIe bandwidth and latency seem to matter a lot, and I need to experiment more with batch sizes and which parts of the model to offload, as it can have a significant impact.
>
> Anyway, sticking to a single RTX 5090 for now, playing with batch sizes and offloading one, two, or no experts, I managed to improve speeds a bit:
@@ -9725,8 +9844,9 @@ Any tips, insights or discussion would be welcome.
>
>
> @ubergarm's IQ2_K_R4 PP speed doubled with `-ub 4096`. I would love to discover a similar miracle switch for the larger models 🙂.
+
+> 👤 **ubergarm** replied on **2025-07-09** at **22:23:05**
>
-> 👤 **ubergarm** replied the **2025-07-09** at **22:23:05**:
> @sousekd
>
> Thanks for the update, and huh I would have thought adding another GPU would give a slight increase to TG. I'd have to see the full command you were using for multi-GPU setup. I was just talking with @Panchovix about it over on my latest model https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/discussions/2#686eea805532fabe4bf9bce5
@@ -9734,20 +9854,23 @@ Any tips, insights or discussion would be welcome.
> and trying to figure out if it is possible to put all the attn/shexp/first 3 dense ffn layers onto a single GPU and offload only routed experts onto the other GPUs and CPU. Not sure if there is a switch or method to put kv-cache on a single GPU as well, or if that would even help e.g. keep it with the attn tensors with the theory being to avoid PCIe bus between GPUs.
>
> Try out the new TNG Chimera model as it is *not* `_R4` type so might benefit more from `-ub 4096 -b 4096` now.
+
+> 👤 **sousekd** replied on **2025-07-10** at **09:17:29**
>
-> 👤 **sousekd** replied the **2025-07-10** at **09:17:29**:
> Thank you @ubergarm. I'll read and experiment more with the multi-GPU setup. Naturally, I would also think the second GPU should help, but at the same time I can understand that PCIE bandwidth has its limits - and it might become a bottleneck if data travels over it frequently, effectively negating any gains of faster memory and/or processing. Is there even anybody with multiple GPUs achieving significantly better speeds using ik_llama? Any thoughts on the topic @ikawrakow?
>
> I originally planned to buy two CPUs and spread memory across two sockets (to get 24 channels to RAM), but then reading about NUMA issues I realized it might not help much - quite the opposite. Even cross-CCDs memory access has a negative effect, so I can see why PCIE transfers should be avoided as much as possible.
+
+> 👤 **ikawrakow** replied on **2025-07-10** at **09:33:30**
>
-> 👤 **ikawrakow** replied the **2025-07-10** at **09:33:30**:
> @sousekd Your `sweep-bench` results look pretty good. IIRC, someone got up to 350 t/s prompt processing speed using `-b 16384 -ub 16384` with 96 GB VRAM (all routed experts left on the CPU), but you need to go and pock around in the issues/discussions to find the setup and the model used (I'm not very well organized in keeping track of all the discussions). Also, I think it is better to remind us of your hardware (CPU, GPUs) instead of us having to go and search where they were posted.
>
> While I can see that competition for PCI-E bandwidth/latency may hinder PP improvements, I'm not sure I understand why one cannot get TG speed improvement by having additional routed experts offloaded to the second GPU. No tensor data is copied from RAM to VRAM when generating tokens, so PCI-E shouldn't be a bottleneck, so I expect to see at least some TG speed improvement.
>
> I'm quite interested in improving the speed further if possible, so I think it would be useful for you to post what you have tried and the results. You may want to start a new discussion for that as this one is getting difficult to follow all comments.
+
+> 👤 **sousekd** replied on **2025-07-10** at **11:01:17**
>
-> 👤 **sousekd** replied the **2025-07-10** at **11:01:17**:
> Thank you, @ikawrakow for your thoughts.
>
> The system is an EPYC 9355 (32 cores) with 12x DDR5-6400, and the latest results above are from a single RTX 5090 on PCIe 5.0 x16. Previous results were from a single RTX 4090 on PCIe 4.0 x16. Combined - without much tuning of the parameters - both PP t/s and TG t/s were significantly lower than on a single GPU. Oh, and it's currently running on Windows Server - only temporarily.
@@ -9759,8 +9882,9 @@ Any tips, insights or discussion would be welcome.
> Yes, I will play with params and benchmark more and once I have some results, I will open a new discussion. The reason I post these results (and params) are meant to help other people. When I was deciding on what hardware to buy for running these huge models the lack of available information and real results on larger contexts was putting me off. All I was able to find is "MacBook Pro can run DeepSeek", but no information about how the performance is degrading with growing context... and k-transformers for AMX.
>
> Anyway, it is quite possible I am doing something wrong, or Windows. Thank you very much - the numbers are great as they are, but obviously one can always try to improve, and the fact the second GPU did not help surprised me.
+
+> 👤 **ubergarm** replied on **2025-07-10** at **15:19:09**
>
-> 👤 **ubergarm** replied the **2025-07-10** at **15:19:09**:
> @sousekd
>
> Just helped some of the multi-gpu crew tune up their commands. Feel free to take a look on how they are achieving over 300 tok/sec PP and almost 20 tok/sec TG on my newest quants (using very fast IQ2_KS and the new IQ3_KS): https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/discussions/2
@@ -9772,13 +9896,15 @@ Any tips, insights or discussion would be welcome.
> Yeah give your BIOS configuration as well e.g. if you have dual socket are you running `NPS0` (normally not a good idea, but for this workload probably best if you can't fit the model in a single socket's worth of RAM in NPS1) etc...
>
> I believe if you use dual GPU and are offloading efficiently TG should definitely be like ~1 tok/sec faster or so probably as a 4090 with ~1TB/sec VRAM bandwidth bets almost any CPU RAM speeds.
+
+> 👤 **ikawrakow** replied on **2025-07-10** at **15:33:00**
>
-> 👤 **ikawrakow** replied the **2025-07-10** at **15:33:00**:
> @ubergarm
>
> Btw, the other day I randomly came across a discussion in the KTransformers repository where 2 guys were thinking that `ik_llama.cpp` requires a "different format" (and they didn't like that). Apparently they came to that conclusion because of your `ik_llama.cpp` specific quants on HF. See [this comment](https://github.com/kvcache-ai/ktransformers/issues/1417#issuecomment-3045026282) (and you may want to read the response to my comment). So, perhaps it would be a good idea to actually add a clarification to your HF repos that `ik_llama.cpp` also works with "standard" GGUFs, so people don't need to download these giant models just to try `ik_llama.cpp`.
+
+> 👤 **ubergarm** replied on **2025-07-10** at **16:54:58**
>
-> 👤 **ubergarm** replied the **2025-07-10** at **16:54:58**:
> I attempted to address it there also: https://github.com/kvcache-ai/ktransformers/issues/1417#issuecomment-3058222619
>
> I'll spend some time updating my huggingface model cards so hopefully people don't make this mistake and accidentally spread more misinformation.
@@ -9796,7 +9922,7 @@ Any tips, insights or discussion would be welcome.
---
-👤 **ikawrakow** replied the **2025-06-24** at **14:16:26**:
+👤 **ikawrakow** commented on **2025-06-24** at **14:16:26**
@sousekd
@@ -9808,14 +9934,15 @@ Please post the compilation errors you get with `AVX512_BF16`. It is supposed to
There are places where I have added GEMM/GEMV implementations optimized for `AVX512` extensions that I have available on my Ryzen-7950X CPU (Zen4 core). To be effective, one needs to enable `AVX512, AVX512_VNNI, AVX512VL, AVX512BW` and `AVX512DQ`. I don't think these are all available via `GGML_something` cmake definitions. When building on Linux they all get enabled with `GGML_NATIVE`, but on Windows you most likely need to work with `-DGGML_ARCH_FLAGS=add_necessary_compiler_flags`. TG performance is memory bound, so there will not be much impact there, but for PP you may get some additional performance increases if your CPU supports all of these.
-> 👤 **sousekd** replied the **2025-06-24** at **15:26:15**:
+> 👤 **sousekd** replied on **2025-06-24** at **15:26:15**
+>
> > Please post the compilation errors you get with `AVX512_BF16`. It is supposed to work, ...
>
> Oh, you are 100% correct and I am an idiot. **ik_llama.cpp** builds perfectly fine with `-DGGML_AVX512_BF16=ON` using MSVC - it was (and is) **llama.cpp** which does not build. I was experimenting with both and got confused :). Thank you!
---
-👤 **createthis** replied the **2025-07-10** at **16:13:24**:
+👤 **createthis** commented on **2025-07-10** at **16:13:24**
I have a dual EPYC 9355 system which normally has 768gb of RAM across 24 channels and scores roughly 720gb/s memory bandwidth on the stream triad test. At the moment, I had a RDIMM failure, so I'm down a stick and I only have 23 channels and 736gb of system RAM. I also have a blackwell 6000 pro on this system.
@@ -9887,7 +10014,8 @@ I'm just curious: Why is generation tok/s so much lower in `ik_llama.cpp` vs `ll
Thanks!
-> 👤 **ubergarm** replied the **2025-07-10** at **17:33:21**:
+> 👤 **ubergarm** replied on **2025-07-10** at **17:33:21**
+>
> Hey thanks for taking some time to try this out. I too started using ktransformers but have since moved over to ik's for given he is the author on pretty much all the quants after the original `q8_0` types.
>
> > I run with NPS4 set in the system BIOS, so I have 8 numa domains.
@@ -9937,8 +10065,9 @@ Thanks!
> Once you've dialed in the command you can then just switch out the executable back to `llama-server` and add back in alias/host/port and remove `--warmup-batch`.
>
> Okay, let me know if u have any questions, you have a very nice rig!
+
+> 👤 **sousekd** replied on **2025-07-10** at **18:29:53**
>
-> 👤 **sousekd** replied the **2025-07-10** at **18:29:53**:
> Hi @createthis, I was able to achieve the following on (single) Epyc 9355 and RTX 5090:
>
> | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
@@ -9953,7 +10082,7 @@ Thanks!
---
-👤 **ikawrakow** replied the **2025-07-10** at **16:46:17**:
+👤 **ikawrakow** commented on **2025-07-10** at **16:46:17**
@createthis
@@ -9973,12 +10102,13 @@ I think in `llama.cpp` they have added the `--depth` argument to `llama-bench` t
Another comment related to the NUMA situation: I don't have access to a NUMA system myself, but people report that, sadly, on dual socket systems they get the best performance by disabling NUMA in the BIOS and running on a single CPU. @ubergarm has done quite a few experiments in that regard. I haven't followed what is happening in `llama.cpp` land on that front, so maybe they have improved in the meantime (but hadn't only 2-3 months ago).
-> 👤 **ikawrakow** replied the **2025-07-10** at **16:48:34**:
+> 👤 **ikawrakow** replied on **2025-07-10** at **16:48:34**
+>
> But apart from everything else, worth pointing out that `ik_llama.cpp` needs only half the total time for PP+TG compared to `llama.cpp`.
---
-👤 **Panchovix** replied the **2025-07-10** at **20:39:17**:
+👤 **Panchovix** commented on **2025-07-10** at **20:39:17**
Just to let you know guys, did some benchmarks on iklcpp on my setup (192GB RAM + 208GB VRAM) on DeepSeek V3/R1/Chimera of Q2_K_XL, IQ3_XXS, IQ3_KS, Q3_K_XL and IQ4_XS on reddit, if you want to take a look!
@@ -9988,7 +10118,7 @@ Performance of ikllamacpp for these kind of setups, is really impressive!
---
-👤 **createthis** replied the **2025-07-10** at **21:35:54**:
+👤 **createthis** commented on **2025-07-10** at **21:35:54**
@ikawrakow here it is with NPS0:
@@ -10099,7 +10229,8 @@ PP speed does continue to rise past 32 threads though, which is suprising:
| 512 | 128 | 4096 | 3.413 | 150.02 | 13.577 | 9.43 |
```
-> 👤 **ubergarm** replied the **2025-07-10** at **23:13:21**:
+> 👤 **ubergarm** replied on **2025-07-10** at **23:13:21**
+>
> @createthis
>
> > ./build/bin/llama-batched-bench
@@ -10135,8 +10266,9 @@ PP speed does continue to rise past 32 threads though, which is suprising:
> 3. try with and without `-rtr` as benefits can vary with batch size
>
> If it OOMs on VRAM already, just back off how many offload layers e.g. `-ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \`
+
+> 👤 **saood06** replied on **2025-07-10** at **23:23:43**
>
-> 👤 **saood06** replied the **2025-07-10** at **23:23:43**:
> > > ./build/bin/llama-batched-bench
> >
> > I've never used `llama-batched-bench` but @saood06 has mentioned it before. Is that why you're seeing more TG tok/sec there? It might be comparing something different than `llama-sweep-bench` ? I know using `llama-server --parallel 4` for example gives higher aggregate throughput at a cost to individual request speeds.
@@ -10144,8 +10276,9 @@ PP speed does continue to rise past 32 threads though, which is suprising:
> He is using it with a batch size of 1, so no aggregating performance, and it is at 0 depth so it should be comparable to the first line of a `sweep-bench` or even standard `bench` result.
>
> `llama-batched-bench` is a really nice tool for evaluating performance, but I tend to use it to performance for specific scenarios by providing specific parameters, unlike `llama-sweep-bench` where I mostly just choose how long/deep I want to test.
+
+> 👤 **createthis** replied on **2025-07-11** at **02:37:27**
>
-> 👤 **createthis** replied the **2025-07-11** at **02:37:27**:
> @ubergarm
> > ```shell
> > ./build/bin/llama-sweep-bench \
@@ -10224,16 +10357,19 @@ PP speed does continue to rise past 32 threads though, which is suprising:
> ```
>
>
+
+> 👤 **sousekd** replied on **2025-07-11** at **06:06:26**
>
-> 👤 **sousekd** replied the **2025-07-11** at **06:06:26**:
> Great numbers @createthis! Would the model fit to only half of your RAM? I would be very interested to see the numbers when using only one socket, to avoid slower 4x16 xGMI3 link between CPUs.
>
> I have very similar system to yours (Epyc 9355 on MZ73-LM2), but with only one CPU populated (and still waiting for RTX 6000 to arrive).
+
+> 👤 **createthis** replied on **2025-07-11** at **13:06:10**
>
-> 👤 **createthis** replied the **2025-07-11** at **13:06:10**:
> @sousekd It's using about 300gb of system ram and nearly the entire 96gb of VRAM. I'm not sure if that's sustainable at full context length as my current work project doesn't require agentic loads at the moment, but I'll stress test it as soon as I get a chance. I suspect single socket performance will be lower, but I'm not sure. Please report back and let us know.
+
+> 👤 **createthis** replied on **2025-07-11** at **14:00:50**
>
-> 👤 **createthis** replied the **2025-07-11** at **14:00:50**:
> Here are the `llama.cpp` numbers with the same settings (and NPS0):
>
> ```bash
@@ -10288,8 +10424,9 @@ PP speed does continue to rise past 32 threads though, which is suprising:
> ```
>
>
+
+> 👤 **ubergarm** replied on **2025-07-11** at **15:24:50**
>
-> 👤 **ubergarm** replied the **2025-07-11** at **15:24:50**:
> @createthis
>
> Great job tuning and reporting your findings, much appreciated! Hope your rig is holding up under the stress and heat haha...
@@ -10320,8 +10457,9 @@ PP speed does continue to rise past 32 threads though, which is suprising:
> ```
>
> Appreciate you sharing all your results!
+
+> 👤 **createthis** replied on **2025-07-11** at **16:47:25**
>
-> 👤 **createthis** replied the **2025-07-11** at **16:47:25**:
> Another sort of interesting result: This is NPS4 with `llama.cpp`:
>
> ```bash
@@ -10382,8 +10520,9 @@ PP speed does continue to rise past 32 threads though, which is suprising:
> @ubergarm I'm interested in trying out your llama.cpp sweep benchmark. I need to get some work done on a paid project at the moment, but I'll try to take a look later this weekend and report my findings. I'll also report higher context real world results as they come in. I don't have an agentic workload at the moment, so I'm not sure when that will be, but maybe I can fabricate one this weekend if nothing pops up today.
>
> Thanks for all the feedback and help thus far!
+
+> 👤 **createthis** replied on **2025-07-11** at **20:56:50**
>
-> 👤 **createthis** replied the **2025-07-11** at **20:56:50**:
> This is still NPS4 with `llama.cpp`, just because I've been too lazy to reboot into NPS0.
>
> I'm never 100% sure I'm reading these correctly, but I think this is performance at `47k` context:
@@ -10416,8 +10555,9 @@ PP speed does continue to rise past 32 threads though, which is suprising:
> Not too shabby performance.
>
> EDIT: updated to be the same prompt as the below 47k context "real world" examples for an apples to apples comparison
+
+> 👤 **createthis** replied on **2025-07-11** at **22:06:31**
>
-> 👤 **createthis** replied the **2025-07-11** at **22:06:31**:
> "real world" NPS0 with `llama.cpp` and 47k context (same prompt as last one, I just hit regenerate):
>
> ```bash
@@ -10445,8 +10585,9 @@ PP speed does continue to rise past 32 threads though, which is suprising:
>
>
> This is in-line with my original findings. `llama.cpp` seems to prefer NPS4 for some reason.
+
+> 👤 **createthis** replied on **2025-07-11** at **22:25:43**
>
-> 👤 **createthis** replied the **2025-07-11** at **22:25:43**:
> "real world" NPS0 `ik_llama.cpp` 47k context. I just replayed the last prompt.
>
> ```bash
@@ -10476,8 +10617,9 @@ PP speed does continue to rise past 32 threads though, which is suprising:
>
>
> This performance is quite good. PP is slightly better than NPS4 `llama.cpp`. Gen is a fair bit lower though. Based on these numbers alone, I would probably opt for `llama.cpp` with NPS4, but I'm not convinced the verdict is out yet. I plan to run them both agentically for a while and see which one I like better.
+
+> 👤 **magikRUKKOLA** replied on **2025-07-11** at **22:50:35**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-11** at **22:50:35**:
> @createthis as related to the comparison of ik_llama.cpp and llama.cpp. The following likely unrelated to your case, but I will mention it just in case someone else would have the issue. Today I was installing the ik_llama.cpp and was unable to [do] it. It was falling out with:
>
> ```
@@ -10488,7 +10630,7 @@ PP speed does continue to rise past 32 threads though, which is suprising:
---
-👤 **magikRUKKOLA** replied the **2025-07-10** at **23:24:47**:
+👤 **magikRUKKOLA** commented on **2025-07-10** at **23:24:47**
transferring from https://github.com/kvcache-ai/ktransformers/issues/1417
@@ -10600,12 +10742,13 @@ https://github.com/turboderp-org/exllamav3
The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flash **infer** is not available in flashattn hence the for the full context in ik_llama.cpp its required to have at least 48 GB VRAM which is not ideal.
```
-```
-> 👤 **ubergarm** replied the **2025-07-10** at **23:42:50**:
+> 👤 **ubergarm** replied on **2025-07-10** at **23:42:50**
+>
> Sorry not sure which of these is the real one, I replied over here: https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13726306
+
+> 👤 **ubergarm** replied on **2025-07-10** at **23:51:29**
>
-> 👤 **ubergarm** replied the **2025-07-10** at **23:51:29**:
> @magikRUKKOLA
>
> So let's assume you have a thread ripper configured in NPS1 so all your RAM is in a single NUMA node and 3x CUDA devices, give this a try:
@@ -10646,8 +10789,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> -ot "blk\.(9|10|11)\.ffn_.*=CUDA0" \
> -ot exps=CPU \
> ```
+
+> 👤 **magikRUKKOLA** replied on **2025-07-11** at **13:11:37**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-11** at **13:11:37**:
> @ubergarm
>
> I decided to install the additional fans for the ECC ram so I haven't tried yet the config with three GPU. But I decided to try it out with two GPUs on my test rig with Threadripper PRO 3[9]45wx (only 12 cores) with 96k context.
@@ -10736,8 +10880,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> ```
>
> I am downloading various quants to try out with various configs.
+
+> 👤 **magikRUKKOLA** replied on **2025-07-11** at **15:21:56**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-11** at **15:21:56**:
> Tried the IQ2_K_R4 quant:
>
> ```
@@ -10802,8 +10947,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> ```
>
> Whoa! 120 tps prefill! Intriguing!
+
+> 👤 **magikRUKKOLA** replied on **2025-07-11** at **16:19:40**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-11** at **16:19:40**:
> Uh oh! Apparently the -ot etc. doesn't really do much.
>
> 96k context:
@@ -10818,8 +10964,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> ```
>
> So the whole VRAM just goes to the KV-cache computation, right? So not a single layer can be put onto the GPU. But the KV-cache is distributed okay.
+
+> 👤 **ubergarm** replied on **2025-07-11** at **16:24:23**
>
-> 👤 **ubergarm** replied the **2025-07-11** at **16:24:23**:
> @magikRUKKOLA
>
> > Possibly you may mean it like:
@@ -10906,8 +11053,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
>
>
> `nvitop` shown for dual RTX A6000s but made sure they are not loaded past 24GB VRAM each.
+
+> 👤 **ubergarm** replied on **2025-07-11** at **16:26:32**
>
-> 👤 **ubergarm** replied the **2025-07-11** at **16:26:32**:
> I was replying at the same time hah
>
> > Uh oh! Apparently the -ot etc. doesn't really do much.
@@ -10917,8 +11065,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> > So the whole VRAM just goes to the KV-cache computation, right? So not a single layer can be put onto the GPU. But the KV-cache is distributed okay.
>
> Not quite, it is still offloading all the attn/shexp/first 3 dense layers onto GPU. Since you want full 160k context on only 48GB VRAM you cannot offload any additional routed exps though.
+
+> 👤 **magikRUKKOLA** replied on **2025-07-11** at **16:31:20**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-11** at **16:31:20**:
> > The KV-cache is split almost equally across both GPUs, so not sure what is going on here unless you didn't compile with `-DGGML_SCHED_MAX_COPIES=1` which causes bloated VRAM usage.
>
> Well, let me see...
@@ -10937,8 +11086,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> version: 3795 (c53cb652)
> built with cc (Debian 14.2.0-19) 14.2.0 for x86_64-linux-gnu
> ```
+
+> 👤 **magikRUKKOLA** replied on **2025-07-11** at **16:33:53**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-11** at **16:33:53**:
> > Also I have some new non-`_R4` quants like this [ubergarm/DeepSeek-R1-0528-GGUF/IQ3_KS](https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_KS) (and also the TNG-R1T2-Chimera version) using my latest recipes and the newest IQ3_KS type that might benefit more from `-ub 4096 -b 4096` than the `_R4` quants.
>
> Yeah, I know. I am downloading it.
@@ -10946,8 +11096,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> > Keep in mind the order of -ot matters so put the -ot exps=CPU last after the -ot ...=CUDAX stuff so I'm not sure you were actually offloading more routed exps layers like intended in your command above. I'll use my convention of ngl then ot CUDAs then ot exps=CPU:
>
> Uh oh.. May bad. :)
+
+> 👤 **magikRUKKOLA** replied on **2025-07-11** at **17:06:06**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-11** at **17:06:06**:
> > benchmark
>
> test-rig (12 core CPU) setup benchmark:
@@ -11459,13 +11610,15 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> | 1024 | 256 | 16384 | 24.325 | 42.10 | 50.327 | 5.09 |
> | 1024 | 256 | 17408 | 24.821 | 41.26 | 50.353 | 5.08 |
> ```
+
+> 👤 **ubergarm** replied on **2025-07-11** at **17:58:44**
>
-> 👤 **ubergarm** replied the **2025-07-11** at **17:58:44**:
> Great, now you have a baseline command you can adjust to dial in for any given quant. You can see how it is distributing the kv-cache across both GPUs fairly equally. You can tinker adding or removing the `-ot ...=CUDA0` routed expert layer offloads or increasing batch sizes or trying with a different quant. You can also modify the command a bit to use on mainline llama.cpp for the most apples-apples comparison of which I know. (just remove `-mla 3 -amb 512 -fmoe --warmup-batch` first as those don't exist on mainline.).
>
> Have fun and keep us posted!
+
+> 👤 **magikRUKKOLA** replied on **2025-07-11** at **21:40:57**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-11** at **21:40:57**:
> Tried with three GPU and 2933 MT/sec 8-channel RAM 256GB and 64C CPU.
>
> 150k context with -b 4096 -ub 4096 is achieved!
@@ -11634,13 +11787,15 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> | 4096 | 1024 | 65536 | 62.658 | 65.37 | 156.614 | 6.54 |
> | 4096 | 1024 | 69632 | 63.486 | 64.52 | 159.997 | 6.40 |
> ```
+
+> 👤 **magikRUKKOLA** replied on **2025-07-11** at **21:56:58**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-11** at **21:56:58**:
> Ha! The current results are pretty promising. The prefill of 200 tps on a small context is great! And the ability to go as much as 150k tokens is great too! Amazing that nothing is crashing and the --seed and the powerful benchmarking is implemented too!
>
> What a great job, guys! Congrads!
+
+> 👤 **ubergarm** replied on **2025-07-12** at **05:06:25**
>
-> 👤 **ubergarm** replied the **2025-07-12** at **05:06:25**:
> > 150k context with -b 4096 -ub 4096 is achieved!
>
> Sweeet! You got it going and have a variety of models to choose trading off speed and accuracy as desired. Really interesting to see the benchmarks, and cool to see the `IQ4_KS_R4` speed quite comparable with the more traditional quant types used in `UD-Q4_K_XL`!
@@ -11648,8 +11803,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> > the nvme drives under the gpus are getting hot
>
> These are some interesting workloads to run for sure! :fire: Once again great job getting your hardware together, figuring out how to adjust all the command arguments, and doing the tuning to share these great results!
+
+> 👤 **magikRUKKOLA** replied on **2025-07-12** at **06:28:45**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-12** at **06:28:45**:
> I could not find the perplexity for the UD-Q4_K_XL at the graphs so I am posting it here:
>
> ```
@@ -11665,8 +11821,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> UD_Q2_K_XL:
> Final estimate: PPL = 3.5278 +/- 0.01920
> ```
+
+> 👤 **Panchovix** replied on **2025-07-12** at **06:30:57**
>
-> 👤 **Panchovix** replied the **2025-07-12** at **06:30:57**:
> > I could not find the perplexity for the UD-Q4_K_XL at the graphs so I am posting it here:
> >
> > ```
@@ -11677,8 +11834,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> > So the IQ4_KS_R4 is better in terms of perplexity.
>
> Hello there! Wondering, what was your command to test PPL? I want to try with some models I have but I get just "nan" for some reason, so maybe it's an issue on my end (highly factible). And these models work perfectly on normal usage.
+
+> 👤 **magikRUKKOLA** replied on **2025-07-12** at **06:50:36**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-12** at **06:50:36**:
> You just get what?
>
> The docs on Perplexity is in this current thread (see above). quote:
@@ -11711,11 +11869,13 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> > --override-tensor exps=CPU \
> > --threads 24
> > ```
+
+> 👤 **ikawrakow** replied on **2025-07-12** at **10:02:21**
>
-> 👤 **ikawrakow** replied the **2025-07-12** at **10:02:21**:
> The quoted comments about NaNs and `-mla 2` are hopelessly outdated.
+
+> 👤 **ubergarm** replied on **2025-07-12** at **15:57:40**
>
-> 👤 **ubergarm** replied the **2025-07-12** at **15:57:40**:
> Thanks for the result on that perplexity score @magikRUKKOLA it lines up with my own estimates of the smaller quants. That guide is indeed hopelessly outdated already haha.. Using q8_0 quantized cache will drop the score just a tiny bit, and mla 3 is pretty much always the way to go now.
>
> Here is an example of what I've been using lately for smaller models and two CUDA GPUs:
@@ -11734,8 +11894,9 @@ The same thing goes for ik_llama.cpp etc. -- the matrix absorption trick in flas
> -ot exps=CPU \
> --threads 24
> ```
+
+> 👤 **Panchovix** replied on **2025-07-12** at **18:53:08**
>
-> 👤 **Panchovix** replied the **2025-07-12** at **18:53:08**:
> Many thanks to all! I did re test and finally worked, after months haha.
>
> Finally could test R1 0525 IQ4_XS, from unsloth.
diff --git a/github-data/discussions/266 - Benchmarking DeepSeek R1 - 16x3090.md b/github-data/discussions/266 - Benchmarking DeepSeek R1 - 16x3090.md
index 652ab635b..cc17ad67a 100644
--- a/github-data/discussions/266 - Benchmarking DeepSeek R1 - 16x3090.md
+++ b/github-data/discussions/266 - Benchmarking DeepSeek R1 - 16x3090.md
@@ -1,13 +1,14 @@
-### 🗣️ [#266](https://github.com/ikawrakow/ik_llama.cpp/discussions/266) - Benchmarking DeepSeek R1 - 16x3090
+## 🗣️ [Discussion #266](https://github.com/ikawrakow/ik_llama.cpp/discussions/266) - Benchmarking DeepSeek R1 - 16x3090
| **Author** | `davidsyoung` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-03-18 |
| **Updated** | 2025-03-21 |
---
-#### Description
+## 📄 Description
Wanted to create a resource for anyone looking to optimise `-b -ub -amb` with `-mla 2 -fa -fmoe` with offloading DeepSeek R1 fully on CUDA with ik_llama.cpp @ https://github.com/ikawrakow/ik_llama.cpp/commit/dcdfad29f7d2b831f1c84751f00bda14cc359a84.
@@ -387,9 +388,9 @@ _TG shows no notable difference._
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **davidsyoung** replied the **2025-03-18** at **09:37:29**:
+👤 **davidsyoung** commented on **2025-03-18** at **09:37:29**
### Mixed quant of `Q8` for attn, `Q5 down / IQ4_XS up|gate` for layers 3-8, and `IQ4_XS down / IQ3_S up|gate`.
@@ -421,47 +422,51 @@ perplexity: 11.69 seconds per pass - ETA 27.32 minutes
Final estimate: PPL = 3.4178 +/- 0.01891
```
-> 👤 **fredlas** replied the **2025-03-19** at **15:49:40**:
+> 👤 **fredlas** replied on **2025-03-19** at **15:49:40**
+>
> Were you thinking of uploading this to huggingface, by any chance? I can reproduce and upload it myself if necessary, but I haven't downloaded the full R1 weights yet, and would be happy to continue avoiding that if possible!
+
+> 👤 **ubergarm** replied on **2025-03-19** at **22:37:04**
>
-> 👤 **ubergarm** replied the **2025-03-19** at **22:37:04**:
> @fredlas do you have any specific hardware configuration in mind? e.g. how much system RAM, and GPUs / VRAM? I put together rough notes on making your own custom quant in [this quick-start guide discussion](https://github.com/ikawrakow/ik_llama.cpp/discussions/258). I believe @davidsyoung has tailored the quant specific to his 16x3090 = 384 GB VRAM setup.
>
> I've made a couple quants now and have one okay one for 256GB RAM + 24GB VRAM single GPU configuration with better perplexity than unsloth `UD-Q2_K_XL` but just a little bit slower. I'm still experimenting to see how the various types effect generation speed vs perplexity while fitting inside the envelope of my current hardware.
>
> You can get started with `ik_llama.cpp` including `-mla 2` and repacked quants now with an existing unsloth quant or whatever you have probably. (sorry if you already know this, I'm still new here!) Cheers!
+
+> 👤 **davidsyoung** replied on **2025-03-19** at **23:18:56**
>
-> 👤 **davidsyoung** replied the **2025-03-19** at **23:18:56**:
> I might be able to upload if you give me enough time, however, I actually recommend getting used to quanting as there’s _a lot_ tweaking you may want to do.
>
> For example, I don’t actually think this quant suits my setup best yet, and I’m actually underutilising one GPU. I just haven’t found a way to split the layers that well yet.
+
+> 👤 **fredlas** replied on **2025-03-21** at **02:37:16**
>
-> 👤 **fredlas** replied the **2025-03-21** at **02:37:16**:
> @ubergarm 307GiB happens to be right around the size I'm thinking of. 72GiB VRAM + 256GiB RAM, for queuing up jobs to run overnight with 16k context - should just fit in there, I think. Funny coincidence for an extremely different configuration! Thanks for that guide - I made my own quants of Wizard2 8x22B a while back, but long enough that I was probably going to have to basically relearn it.
>
> @davidsyoung I'd say don't upload them just for my sake if you weren't already planning to - I just thought I'd check in case I could stay lazy. Plus this size range is probably pretty niche anyways; might not really be worth it in terms of helping people.
---
-👤 **ikawrakow** replied the **2025-03-18** at **09:44:15**:
+👤 **ikawrakow** commented on **2025-03-18** at **09:44:15**
Thank you for this. I think it can be really useful for people.
---
-👤 **saood06** replied the **2025-03-18** at **20:14:25**:
+👤 **saood06** commented on **2025-03-18** at **20:14:25**
@ikawrakow Can I convert this to a discussion?
---
-👤 **davidsyoung** replied the **2025-03-18** at **20:19:37**:
+👤 **davidsyoung** commented on **2025-03-18** at **20:19:37**
All good with me @saood06
---
-👤 **ikawrakow** replied the **2025-03-18** at **20:29:32**:
+👤 **ikawrakow** commented on **2025-03-18** at **20:29:32**
> @ikawrakow Can I convert this to a discussion?
diff --git a/github-data/discussions/286 - Testing _deepseek-ai_DeepSeek-V3-0324_ model support..md b/github-data/discussions/286 - Testing deepseek-aiDeepSeek-V3-0324 model support.md
similarity index 97%
rename from github-data/discussions/286 - Testing _deepseek-ai_DeepSeek-V3-0324_ model support..md
rename to github-data/discussions/286 - Testing deepseek-aiDeepSeek-V3-0324 model support.md
index 505716b15..6e492cbca 100644
--- a/github-data/discussions/286 - Testing _deepseek-ai_DeepSeek-V3-0324_ model support..md
+++ b/github-data/discussions/286 - Testing deepseek-aiDeepSeek-V3-0324 model support.md
@@ -1,13 +1,14 @@
-### 🗣️ [#286](https://github.com/ikawrakow/ik_llama.cpp/discussions/286) - Testing `deepseek-ai/DeepSeek-V3-0324` model support.
+## 🗣️ [Discussion #286](https://github.com/ikawrakow/ik_llama.cpp/discussions/286) - Testing `deepseek-ai/DeepSeek-V3-0324` model support.
| **Author** | `ubergarm` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-03-24 |
| **Updated** | 2025-04-02 |
---
-#### Description
+## 📄 Description
I saw today a new model [deepseek-ai/DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) that may run on this fork?
@@ -35,9 +36,9 @@ Curious if anyone else has any luck and if this new model is "better" at coding
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **saood06** replied the **2025-03-24** at **22:03:22**:
+👤 **saood06** commented on **2025-03-24** at **22:03:22**
> I saw today a new model [deepseek-ai/DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) that may run on this fork?
>[...]
@@ -64,7 +65,8 @@ The second point, those weights were present in the other releases such as V3, V
I'm curious, and will have to make room for it on my server. I know this is slightly off topic but I'd be curious to hear your experience with this (and any of the other Deepseek models you've tried).
-> 👤 **ubergarm** replied the **2025-03-25** at **00:00:24**:
+> 👤 **ubergarm** replied on **2025-03-25** at **00:00:24**
+>
> > This is just another finetune.
>
> Great, might have a chance at getting it to work!
@@ -82,8 +84,9 @@ I'm curious, and will have to make room for it on my server. I know this is slig
> Yeah will keep you posted with new V3. I'm only now experimenting with using longer context ~30-40k by copy pasting in code, man pages, documentation, etc. Using R1 at `Q4` today I was trying to understand how to potentially have `llm_load_tensors()` allocate N copies of ctx_buffs (one on each N NUMA nodes). It helped me understand a bit more the relationship between `src/llama.cpp` and `ggml/src/ggml-backend.c`, but didn't give magic working code haha... It did help me update `CMakeLists.txt` to get it building linking with libnuma library. I've also had some luck with it refactoring python code especially creating uniform style comments and adding static typing. Even QwQ-32B could write a decent 1-shot flappy bird when given a detailed prompt to follow haha...
>
> One supposed success story is about [airbnb refactoring javascript test code](https://medium.com/airbnb-engineering/accelerating-large-scale-test-migration-with-llms-9565c208023b) to use a different library. Hard to say how much "tech debt" was incurred if any, but I too am curious to hear of any successful uses of ai for actually useful coding.
+
+> 👤 **saood06** replied on **2025-03-25** at **02:39:47**
>
-> 👤 **saood06** replied the **2025-03-25** at **02:39:47**:
> > > This is just another finetune.
> >
> > Great, might have a chance at getting it to work!
@@ -127,8 +130,9 @@ I'm curious, and will have to make room for it on my server. I know this is slig
> > One supposed success story is about [airbnb refactoring javascript test code](https://medium.com/airbnb-engineering/accelerating-large-scale-test-migration-with-llms-9565c208023b) to use a different library. Hard to say how much "tech debt" was incurred if any, but I too am curious to hear of any successful uses of ai for actually useful coding.
>
> Thank you for the linked article, was a good read, another success story I know of is here: https://github.com/ggml-org/llama.cpp/pull/11453, "Surprisingly, 99% of the code in this PR is written by DeekSeek-R1."
+
+> 👤 **ubergarm** replied on **2025-03-25** at **05:02:40**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **05:02:40**:
> I'm half asleep and didn't see this reply it pretty late. I appreciate the encouragement and pointers to good existing discussions!
>
> I got the new `V3-0324` bf16 cranked out pretty quickly, but it didn't sink in that `bin/llama-imatrix` would have to run the full ~1.34TB model lmao... Of course the 256GB + 96GB VRAM system OOMd almost immedeately. So I copied everything to the 1.5TB RAM dual xeon 6980P and am giving that a go while I sleep.
@@ -191,8 +195,9 @@ I'm curious, and will have to make room for it on my server. I know this is slig
> Huh it still seems to be reading mmap off of this slower disk into cache and is barely hitting 20% total CPU utilization so hopefully it speeds up a bit more haha...
>
> Okie, gotta sleep, exciting times!
+
+> 👤 **saood06** replied on **2025-03-25** at **05:35:18**
>
-> 👤 **saood06** replied the **2025-03-25** at **05:35:18**:
> > I'm half asleep and didn't see this reply it pretty late. I appreciate the encouragement and pointers to good existing discussions!
> >
> > I got the new `V3-0324` bf16 cranked out pretty quickly, but it didn't sink in that `bin/llama-imatrix` would have to run the full ~1.34TB model lmao...
@@ -232,8 +237,9 @@ I'm curious, and will have to make room for it on my server. I know this is slig
> I agree, I have a lot of theories about what they will do with Deepseek-R2. I really like the model, but reading their papers they have done an amazing job at optimizations when it comes to get the most out of the hardware and on the choices of model architecture (MLA, MoE with a good amount of experts [I can't say it's a lot when [this](https://arxiv.org/abs/2407.04153) exists], a shared expert [qwen 3 seems to be dropping this for their MoE which is interesting], etc.) , but the actual RL tuning seems like there are a LOT of low hanging fruit and obvious and large improvements that can be done.
>
> Edit: Corrected mistake about imatrix
+
+> 👤 **ubergarm** replied on **2025-03-25** at **14:51:52**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **14:51:52**:
> > You can just quantize to Q8_0 statically, and then use that for imatrix
>
> Ahh, that is good news, running across both CPU sockets NUMA nodes is not performant to fit the whole bf16 haha... You asked in another thread about how it went. I had to quickly restart it due to forgetting to set directory permissions to write the imatrix.dat file, and that second time it estimated 11 hours. I killed it before finishing though after reading more of these notes.
@@ -303,8 +309,9 @@ I'm curious, and will have to make room for it on my server. I know this is slig
> Hah, yeah, I'm too am wondering which of the non MoE layers I can shrink down from `q8_0` a bit to free up enough space to fit 64k context in under 24GB VRAM along with them all using `-ot exps=CPU`. Yes, if I can get a valid imatrix.dat I'm happy to upload it onto huggingface along with all details to re-create including what fork/git sha/data file used etc.
>
> Will see how much I can get through today, and I am out of office next couple days. Could leave imatrix running probably if there is a special llama fork to use as you referenced or if the input file is not enough chunks to give the ~1GiB dat file (tbh I'm just learning how it even works so just winging it lol).
+
+> 👤 **saood06** replied on **2025-03-25** at **15:04:57**
>
-> 👤 **saood06** replied the **2025-03-25** at **15:04:57**:
> > > You can just quantize to Q8_0 statically, and then use that for imatrix
> >
> > ETA 11 hours 53.50 minutes
@@ -324,8 +331,9 @@ I'm curious, and will have to make room for it on my server. I know this is slig
> > Could leave imatrix running probably if there is a special llama fork to use as you referenced.
>
> I recommend you to stick to this repo, the team mradermacher have very specialized needs and thus need to track llama.cpp's bleeding edge religiously, they took a fix ikawrakow wrote to fix an issue they were seeing, and just ported that over to llama.cpp alongside an extra example that allows you to calculate exact footprint required so they can do automated job scheduler that is resource aware.
+
+> 👤 **bartowski1182** replied on **2025-04-01** at **23:17:46**
>
-> 👤 **bartowski1182** replied the **2025-04-01** at **23:17:46**:
> @ubergarm I wouldn't give too much thought to the imatrix dataset, there have been a lot of people recently who have tried iterating and experimenting on the one that I use, in particular related to different languages, and found shockingly minimal (if any) impact on the results of a target language by including that language in the dataset.
>
> it seems clear that, as Kalomaze suggested way way back, the randomness/diversity of the data is much more important than the quality, because if ANYTHING was going to be altered by using a different imatrix set, surely it would be completely different languages.
@@ -333,8 +341,9 @@ I'm curious, and will have to make room for it on my server. I know this is slig
> for models the size of DeepSeek you can probably even go all the way down to Q4_K_M, I know mradermacher mentions going down to Q4_K_S, IQ3_XS or even Q2_K, and that was there before these monster models existed
>
> that said, all this discussion about people with their massive xeon clusters and multiple servers RPCed together really tells me i need to find a sponsor.. 😂
+
+> 👤 **saood06** replied on **2025-04-02** at **00:23:04**
>
-> 👤 **saood06** replied the **2025-04-02** at **00:23:04**:
> > @ubergarm I wouldn't give too much thought to the imatrix dataset, there have been a lot of people recently who have tried iterating and experimenting on the one that I use, in particular related to different languages, and found shockingly minimal (if any) impact on the results of a target language by including that language in the dataset.
>
> This paper also confirms that https://arxiv.org/abs/2503.03592
@@ -347,7 +356,7 @@ I'm curious, and will have to make room for it on my server. I know this is slig
---
-👤 **ikawrakow** replied the **2025-03-25** at **06:32:51**:
+👤 **ikawrakow** commented on **2025-03-25** at **06:32:51**
> [!IMPORTANT]
> To calculate the imatrix, please do not use any of the `mla, fa, fmoe` or `amb` options. With these, some of the tensors will not get imatrix data collected.
@@ -357,12 +366,14 @@ As @saood06 pointed out, `Q8_0` is good enough to collect imatrix data.
> Also this https://github.com/ikawrakow/ik_llama.cpp/pull/250 if you haven't seen it is obviously relevant to you,
-This has been superseded by #259. The additional 2 tensors needed for MLA (`attn_k_b` and `attn_v_b`) are computed on the fly from `attn_kv_b` when loading the model (if missing). So, the best strategy is to use standard attention for imatrix calculations, which will give imatrix data to `attn_kv_b`, so this tensor will get a better quantization. `attn_k_b` is a transposed version of half of `attn_kv_b`. It gets computed by converting `attn_kv_b` to `fp32`, transposing that, and then quantizing to `Q8_0`, so (nearly) lossless. `attn_v_b` is just a view of the other half of `attn_kv_b`, so it uses the `attn_kv_b` data directly.
+This has been superseded by [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259). The additional 2 tensors needed for MLA (`attn_k_b` and `attn_v_b`) are computed on the fly from `attn_kv_b` when loading the model (if missing). So, the best strategy is to use standard attention for imatrix calculations, which will give imatrix data to `attn_kv_b`, so this tensor will get a better quantization. `attn_k_b` is a transposed version of half of `attn_kv_b`. It gets computed by converting `attn_kv_b` to `fp32`, transposing that, and then quantizing to `Q8_0`, so (nearly) lossless. `attn_v_b` is just a view of the other half of `attn_kv_b`, so it uses the `attn_kv_b` data directly.
-> 👤 **saood06** replied the **2025-03-25** at **07:01:42**:
+> 👤 **saood06** replied on **2025-03-25** at **07:01:42**
+>
> Sorry I forgot about the implications of that PR, updated my comment to reflect it.
+
+> 👤 **ubergarm** replied on **2025-03-25** at **14:58:40**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **14:58:40**:
> Great, thanks for the help and pro-tips!
>
> Copying over the V3-0324 `q8_0_r8` to the xeon 6980P now and will leave this running and hope to get an imatrix.dat for further smaller quants. I've removed `mla, fa, fmoe, amb` options and am unsure on `-ctk q8_0` so will just remove it too.
@@ -380,8 +391,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> --numa numactl \
> --threads 128
> ```
+
+> 👤 **ubergarm** replied on **2025-03-25** at **17:00:15**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **17:00:15**:
> Oof, starting getting NaN's computing imatrix on the `q8_0_r8`... Gonna pause rushing on this and go back and look at [Issue 285](https://github.com/ikawrakow/ik_llama.cpp/issues/285#issuecomment-2750335421) which I assume may be related.
>
>
@@ -550,11 +562,13 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> ```
>
>
+
+> 👤 **ikawrakow** replied on **2025-03-25** at **17:12:23**
>
-> 👤 **ikawrakow** replied the **2025-03-25** at **17:12:23**:
> So, this is unfortunate, but also helpful as it excludes the `fmoe` optimization as a cause. Oops, not actually helpful as now I'm completely at a loss what could be causing the NaNs.
+
+> 👤 **ubergarm** replied on **2025-03-25** at **17:31:08**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **17:31:08**:
> @ikawrakow
>
> Thanks for looking. I'm running the perplexity again as per 285 currently. Will update that one as soon as data starts coming in.
@@ -600,13 +614,15 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> ```
>
>
+
+> 👤 **saood06** replied on **2025-03-25** at **17:41:06**
>
-> 👤 **saood06** replied the **2025-03-25** at **17:41:06**:
> > Oof, starting getting NaN's computing imatrix on the `q8_0_r8`... Gonna pause rushing on this and go back and look at [Issue 285](https://github.com/ikawrakow/ik_llama.cpp/issues/285#issuecomment-2750335421) which I assume may be related.
>
> Are you going to go back to the BF16, or use llama.cpp with the Q8_0 to generate an imatrix?
+
+> 👤 **ubergarm** replied on **2025-03-25** at **17:45:20**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **17:45:20**:
> @saood06
>
> Well, mainline llama.cpp will *not* work with my mixed `q8_0`/`q8_0_r8` quant. So either:
@@ -615,26 +631,31 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> * Option B: whip out another more simple `q8_0` everything and copy it over and use mainline llama.cpp...
>
> I've started down Option B for now and with some luck I can get the imatrix.dat uploaded by tomorrow morning before I head out.
+
+> 👤 **ikawrakow** replied on **2025-03-25** at **18:11:31**
>
-> 👤 **ikawrakow** replied the **2025-03-25** at **18:11:31**:
> If you do option B and make simple `Q8_0`, then it would be useful to 1st try `ik_llama.cpp` with that. That will help narrow down the problem. If you don't get NaNs, it is somehow related to `Q8_0_R8`, and you can keep going with `ik_llama.cpp`. If you do get NaNs, you can stop it and use mainline.
>
> Btw, on a CPU with native `bf16` support, running `imatrix` with a `bf16` model should be only marginally slower than `Q8_0`.
+
+> 👤 **saood06** replied on **2025-03-25** at **18:24:50**
>
-> 👤 **saood06** replied the **2025-03-25** at **18:24:50**:
> > Btw, on a CPU with native `bf16` support, running `imatrix` with a `bf16` model should be only marginally slower than `Q8_0`.
>
> Under normal conditions yes, but going to bf16 forces him onto both numa sockets, I'm interested to know what speed llama.cpp would give though compared to this since he's going down that path now.
+
+> 👤 **ikawrakow** replied on **2025-03-25** at **18:30:46**
>
-> 👤 **ikawrakow** replied the **2025-03-25** at **18:30:46**:
> > Under normal conditions yes, but going to bf16 forces him onto both numa sockets
>
> And why would 2 sockets be bad for performance? It is PP, not TG, memory bandwidth and latency should be mostly irrelevant. With batches of 512, each piece of data that gets fetched from memory gets used 512 times for computations.
+
+> 👤 **ikawrakow** replied on **2025-03-25** at **18:34:44**
>
-> 👤 **ikawrakow** replied the **2025-03-25** at **18:34:44**:
> Ah, it is a MoE model with 256 experts. Batches of 512 result in many experts doing multiplication with just a handful of rows. So, I guess, there will be a larger penalty due to memory access patterns. Still, I don't expect it to be slower than 1 socket. Or?
+
+> 👤 **ubergarm** replied on **2025-03-25** at **20:24:25**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **20:24:25**:
> > simple Q8_0, then it would be useful to 1st try ik_llama.cpp with that
>
> Ooh I almost thought we had it... Was about to update and just got first `nan`:
@@ -726,9 +747,10 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
>
>
>
-> So gonna stop and try mainline for now. Can keep tracking this over in #285 as it may be related.
+> So gonna stop and try mainline for now. Can keep tracking this over in [#285](https://github.com/ikawrakow/ik_llama.cpp/issues/285) as it may be related.
+
+> 👤 **ubergarm** replied on **2025-03-25** at **20:52:22**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **20:52:22**:
> Double oof mainline is complaining despite it all being `q8_0`...
>
> ```
@@ -752,34 +774,39 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> I went ahead and used mainline to make the `q8_0` without any *custom* stuff in it and am copying that over. Gotta get that sweet sweet imatrix.dat lol...
>
> *EDIT* Huh [bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF/tree/main) has had an imatrix.dat there since yesterday... lol... okay... well... i'll still give this a go for fun and report back...
+
+> 👤 **saood06** replied on **2025-03-25** at **20:56:26**
>
-> 👤 **saood06** replied the **2025-03-25** at **20:56:26**:
> > Double oof mainline is complaining despite it all being `q8_0`...
> >
>
> That is expected.
>
-> #259 says:
+> [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) says:
> >In principle we could remove the preparation of wk_v and wk_b from convert_hf_to_gguf.py, but I decided have some more thorough testing in the wild before doing so.
>
> Those extra tensors support the MLA branch of llama.cpp (since we derived MLA support from that originally), so maybe try that one.
+
+> 👤 **saood06** replied on **2025-03-25** at **21:03:52**
>
-> 👤 **saood06** replied the **2025-03-25** at **21:03:52**:
> Mentioned in the original port PR.
>
> https://github.com/ikawrakow/ik_llama.cpp/pull/180#issuecomment-2621112020
>
> It is still a choice of what our converter outputs, should we be compliant with the MLA PR as that allows you to compare feature performance across both, or to support the main branch of llama.cpp even though they have a PR with that feature.
+
+> 👤 **ubergarm** replied on **2025-03-25** at **21:49:13**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **21:49:13**:
> Oooh right right the fairydreaming fork PR! I never tried that as I found this fork before learning how to roll my own MLA quant... Thanks, I'll try that quick while also copying over another mainline `q8_0` for insurance haha... Also finally rolling my new usual `q8_0` on GPU and MoEs on CPU with `iq3_k_r4/iq2_k_r4` quant with bartowski's imatrix just to compare perpelxity if i get the itch haha...
+
+> 👤 **saood06** replied on **2025-03-25** at **22:29:21**
>
-> 👤 **saood06** replied the **2025-03-25** at **22:29:21**:
> >Also finally rolling my new usual q8_0 on GPU and MoEs on CPU with iq3_k_r4/iq2_k_r4 quant with bartowski's imatrix just to compare perpelxity if i get the itch haha...
>
> I will let my IQ4_K_R4 quantize overnight, I grabbed everything I need for V3 0324 ( bartowski's imatrix and a Q8_0).
+
+> 👤 **ubergarm** replied on **2025-03-25** at **22:52:07**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **22:52:07**:
> > I will let my IQ4_K_R4 quantize overnight, I grabbed everything I need for V3 0324 ( bartowski's imatrix and a Q8_0).
>
> Nice, happy cooking!
@@ -3142,8 +3169,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> llama_perf_context_print: total time = 7322634.52 ms / 109057 tokens
> ```
>
+
+> 👤 **saood06** replied on **2025-03-25** at **23:00:04**
>
-> 👤 **saood06** replied the **2025-03-25** at **23:00:04**:
> >The output is different and it seems to be skipping 10-15% of the tensors due to partial 99.9% data...
>
> This is to be expected as long as this `storing only 605 out of 659 entries` number keeps trending up to 659 you should be good.
@@ -3155,8 +3183,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> The real concern though is the perplexity numbers, they seem way too high, even though I've never made an imatrix that still looks concerning.
>
> Edit: Actually maybe this will prove a clue to what is wrong as this implementation also seems unstable.
+
+> 👤 **ubergarm** replied on **2025-03-25** at **23:46:34**
>
-> 👤 **ubergarm** replied the **2025-03-25** at **23:46:34**:
> > The real concern though is the perplexity numbers, they seem way too high
>
> Yeah I was wondering why they are so much higher than on this fork. They did seem to get trend smaller quickly at first though hah... Still running, I'll paste the rest of the logs when it is done
@@ -3213,8 +3242,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> Running at 64k context it is using `26732MiB`... I wonder what the least damaging `q8_0`s to knock down in the GPU layers to fit this in 24GB VRAM. Would need to shave off just over 2GiB of tensors out of a total of ~17.33 so like maybe dense layers to q6 might do it... probably need a spreadsheet lol...
>
> Looks anecdotally like around 95 tok/sec pp on a <~4k prompt and 11 tok/sec generation. Generation seems a bit slower while copying markdown table logs haha... Initial impression is I don't miss `` as it gets right to the point haha... I'll test to see if it can make any graphs of my log data! Oh right and set `temperature=0.3`.
+
+> 👤 **saood06** replied on **2025-03-26** at **01:18:42**
>
-> 👤 **saood06** replied the **2025-03-26** at **01:18:42**:
> > Right, if it isn't working either, something is odd. I wonder how bartowski made his?
>
> Using main llama.cpp, it seems that the MLA attention is causing problems.
@@ -3248,8 +3278,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> >Oh right and set `temperature=0.3`.
>
> What have you been running at, and did 0.3 feel appropriate? (also anything else in the sampler chain, like top p/k, min p, mirostat etc.)
+
+> 👤 **ubergarm** replied on **2025-03-26** at **02:23:57**
>
-> 👤 **ubergarm** replied the **2025-03-26** at **02:23:57**:
> > it seems that the MLA attention is causing problems
>
> Yeah, good point, using mainline without MLA is probably fine. I got the files copied over, but didn't try running it as I just went with bartowski's without MLA for now then. Makes sense after you explain it.
@@ -3267,8 +3298,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> I use a small custom python chat client that uses `litellm` to hit the OpenAI API chat endpoint. The first time i forgot and left it at R1 default of `0.6` which possibly had some funky code generation or my terminal got borked. I set it to `0.3` and re-ran while not resizing my terminal and things looks good. The only things I ever specify are `top_p=0.95` and `temperature` as mentioned above. I generally keep it simple for coding generations.
>
> In the past I have played with samplers more, especially when trying to reduce slop and increase creativity in writing. I would increase temperature, adjust `top_p`, `min_p`, `top_k`, and even played around a bit with the more specialized samplers like [xtc](https://github.com/ggml-org/llama.cpp/blob/master/examples/main/README.md#xtc-sampling). Anymore I haven't fussed with it much, and spend more time adding variance into the prompt like example clips etc.
+
+> 👤 **ubergarm** replied on **2025-03-26** at **02:28:19**
>
-> 👤 **ubergarm** replied the **2025-03-26** at **02:28:19**:
> @saood06
>
> I got a perplexity run for the `DeepSeek-V3-0324-IQ2_K_R4-bartowski-imat.gguf`.
@@ -3693,8 +3725,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> ```
>
>
+
+> 👤 **saood06** replied on **2025-03-26** at **02:45:43**
>
-> 👤 **saood06** replied the **2025-03-26** at **02:45:43**:
> > > The code might be a bit spread out, but it is very easy to understand, and I'm sure it will help you find the 2GiB you need to cut.
> >
> > Ahh okay, I had seen that unsloth fork before, but now having quantized the model enough times here, I can understand what is happening now. And right looks like `q6_k` for `ffn_down.weight` in the first 3 dense layers and `ffn_down_shexp.weight` shared experts is a good place to start trimming a bit.
@@ -3716,8 +3749,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> > In the past I have played with samplers more, especially when trying to reduce slop and increase creativity in writing. I would increase temperature, adjust `top_p`, `min_p`, `top_k`, and even played around a bit with the more specialized samplers like [xtc](https://github.com/ggml-org/llama.cpp/blob/master/examples/main/README.md#xtc-sampling). Anymore I haven't fussed with it much, and spend more time adding variance into the prompt like example clips etc.
>
> I never played around with samplers much, as I never really liked what increasing temperature did, and too low wasn't nearly as bad but made the model too stiff, and so I would have to put more effort into steering it.
+
+> 👤 **saood06** replied on **2025-03-26** at **04:18:34**
>
-> 👤 **saood06** replied the **2025-03-26** at **04:18:34**:
> > Initial impression is I don't miss `` as it gets right to the point
>
> Ya it does take time to do, also did you also follow the recommendation of removing them after the round like this:
@@ -3725,8 +3759,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> 
>
> Removing the thinking as recommended for multi round causes a lot of prompt reprocessing which takes time on my machine. All the more reason I'm looking forward to DeepSeek-V3-0324
+
+> 👤 **ubergarm** replied on **2025-03-26** at **14:48:48**
>
-> 👤 **ubergarm** replied the **2025-03-26** at **14:48:48**:
> > Interesting, I use [mikupad](https://github.com/lmg-anon/mikupad) which is really nice, but ...
>
> Oh nice, a single html sounds cool. I want to re-write my little `dchat.py` app to remove litellm dependency and simply use async http directly as it is such a thin layer and I would prefer to have more transparency. It uses a simple status bar `enlighten` and `deepseek-tokenizer` to dynamically update tok/sec estimate on the client using async streaming response. I'd like to add [primp](https://github.com/deedy5/primp) directly to it, which I use for my "agentic" stuff like web search and scraping - it delivers fairly clean markdown ready to feed to LLMs.
@@ -3755,8 +3790,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> ```
>
> Gotta head out for a night or two, hope to leave a test running and possibly check in via laptop to track updates. Cheers and curious to hear how your iq4 works out!
+
+> 👤 **saood06** replied on **2025-03-27** at **04:00:05**
>
-> 👤 **saood06** replied the **2025-03-27** at **04:00:05**:
> > > Interesting, I use [mikupad](https://github.com/lmg-anon/mikupad) which is really nice, but ...
> >
> > Oh nice, a single html sounds cool.
@@ -3802,7 +3838,7 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> Thanks, I'll let you know my experience with it.
>
>
-> Edit: Performance is lower for this mix vs my first (and fastest) R1 mix, I do think it is almost certainly because I did make this mix a bit bigger, but looking into if the runtime computed tensors in #259 may be loaded in a way that is not ideal for my system, I could maybe try loading them into my mmap buffer type from #290.
+> Edit: Performance is lower for this mix vs my first (and fastest) R1 mix, I do think it is almost certainly because I did make this mix a bit bigger, but looking into if the runtime computed tensors in [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) may be loaded in a way that is not ideal for my system, I could maybe try loading them into my mmap buffer type from [#290](https://github.com/ikawrakow/ik_llama.cpp/issues/290).
>
> First mix of V3_0324:
> (
@@ -3856,8 +3892,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> Edit 3: Made a pure IQ4_K_R4 mix using the team mradermacher imatrix. It is not functional (but it was fast).
>
> Overall first impressions though, I do think R1 is better, but the performance benefits of not having thinking tokens, and not having to reprocess the prompt so often due to removing the thinking tokens, means I actually think the new V3 is useful to me. The same can't be said about the old V3 even though it also has those performance benefits.
+
+> 👤 **ubergarm** replied on **2025-03-30** at **04:12:33**
>
-> 👤 **ubergarm** replied the **2025-03-30** at **04:12:33**:
> > You may want to look at how mikupad leverages the llama-server's tokenizer and detokinizer endpoints
>
> Oh that is a nice feature, I didn't realize that endpoint existed! Good to know there may be some differences in the API endpoint as well. I'm happy to share the `dchat.py` after I get it to a place I'm happy enough to release it.
@@ -3877,8 +3914,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> > the new V3 is useful to me
>
> Yeah, agreed it is nice to just get the answer without all that thinking latency hah.. :crossed_fingers: Fingers crossed that R2 is magically better with the same architecture if they drop that soon hah...
+
+> 👤 **saood06** replied on **2025-03-30** at **05:10:16**
>
-> 👤 **saood06** replied the **2025-03-30** at **05:10:16**:
> >I'm happy to share the `dchat.py` after I get it to a place I'm happy enough to release it.
>
> Thank you, let me know whenever that is.
@@ -3916,8 +3954,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> > Yeah, agreed it is nice to just get the answer without all that thinking latency hah.. 🤞 Fingers crossed that R2 is magically better with the same architecture if they drop that soon hah...
>
> It is, but if R2 is good enough I know I'll go back to dealing with the latency.
+
+> 👤 **ubergarm** replied on **2025-03-30** at **16:49:03**
>
-> 👤 **ubergarm** replied the **2025-03-30** at **16:49:03**:
> @saood06
>
> > First mix of V3_0324:
@@ -3944,8 +3983,9 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> Also I noticed that `python gguf-py/scripts/gguf_dump.py --markdown /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ4_K_R4.gguf` doesn't have support for the new quant types so it barfs. I'll keep that in the back of my head for a rainy day to possibly try to update it. More of a convenience than anything else.
>
> Thanks for sharing all your quant cooking experience and tips!
+
+> 👤 **saood06** replied on **2025-03-30** at **19:34:21**
>
-> 👤 **saood06** replied the **2025-03-30** at **19:34:21**:
> > Hrmm, I see you used `llama-sweep-bench` on your "first mix", but did you ever check perplexity or try to inference with it?
>
> Assuming you mean the V3_0324, I have not checked perplexity (and I haven't for any other V3_0324 mix), but I do use it for inference as it is my only quant of V3_0324 that functions for inference.
@@ -3957,7 +3997,7 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
> >
> > Reason I'm asking is that I made a quant overnight using `iq5_k_r4` and checking perplexity this morning it is very high (not NaN but possibly numerical instability) and also it doesn't inference correctly and just replies with `AlrightAlrightAlrightAlright` hah...
> >
-> > I've opened an issue about it to track relevant information easier, feel free to chime in if you have any thoughts. #296
+> > I've opened an issue about it to track relevant information easier, feel free to chime in if you have any thoughts. [#296](https://github.com/ikawrakow/ik_llama.cpp/issues/296)
>
> I will reply over there.
>
@@ -3982,30 +4022,33 @@ This has been superseded by #259. The additional 2 tensors needed for MLA (`attn
---
-👤 **saood06** replied the **2025-03-25** at **15:51:30**:
+👤 **saood06** commented on **2025-03-25** at **15:51:30**
@ubergarm
Just saw this "In our web and application environments, the temperature parameter $T_{model}$ is set to 0.3. " and they even go as far to encourage users to use that by "Thus, if you call V3 via API, temperature 1.0 equals to the model temperature 0.3.", so I think you might want to experiment with that temperature.
-> 👤 **ubergarm** replied the **2025-03-25** at **16:03:34**:
+> 👤 **ubergarm** replied on **2025-03-25** at **16:03:34**
+>
> Ahh, interesting, yeah R1 suggested default was 0.6 or somthing iirc.
>
> Does specifying temperature matter for making the imatrix? Guessing it does not, so will continue trying to make imatrix with default command above.
>
> But when I go to actually test a final quant, thanks for this important detail to set `temp=0.3`!
+
+> 👤 **saood06** replied on **2025-03-25** at **16:54:05**
>
-> 👤 **saood06** replied the **2025-03-25** at **16:54:05**:
> > But when I go to actually test a final quant, thanks for this important detail to set `temp=0.3`!
>
> Ya I'm in the middle of downloading. This model seems interesting to try out.
+
+> 👤 **saood06** replied on **2025-03-25** at **20:34:02**
>
-> 👤 **saood06** replied the **2025-03-25** at **20:34:02**:
> On this topic what are your preferred samplers? I use just temp, and min_p but this https://github.com/ggml-org/llama.cpp/pull/11223 has caught my eye a bit (seems like it might be a slight improvement over min_p)
---
-👤 **saood06** replied the **2025-03-25** at **19:07:02**:
+👤 **saood06** commented on **2025-03-25** at **19:07:02**
> 14B of the Multi-Token Prediction (MTP) Module weights
@@ -4015,7 +4058,8 @@ Is this something you have looked into? I think even a basic implementation shou
There is also jukofyork who is making draft model's (see [here](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-GGUF)) that can be used with llama.cpp's already existing generic drafting implementation, I'm watching that to see how much performance uplift people end up reporting on that.
-> 👤 **ikawrakow** replied the **2025-03-26** at **05:05:55**:
+> 👤 **ikawrakow** replied on **2025-03-26** at **05:05:55**
+>
> > > 14B of the Multi-Token Prediction (MTP) Module weights
> >
> > @ikawrakow
@@ -4025,8 +4069,9 @@ There is also jukofyork who is making draft model's (see [here](https://huggingf
> > There is also jukofyork who is making draft model's (see [here](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-GGUF)) that can be used with llama.cpp's already existing generic drafting implementation, I'm watching that to see how much performance uplift people end up reporting on that.
>
> No, I haven't looked into how it works. I'm surprised MPT has not been implemented in mainline.
+
+> 👤 **jukofyork** replied on **2025-03-31** at **22:05:13**
>
-> 👤 **jukofyork** replied the **2025-03-31** at **22:05:13**:
> > There is also jukofyork who is making draft model's (see [here](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.5B-GGUF)) that can be used with llama.cpp's already existing generic drafting implementation, I'm watching that to see how much performance uplift people end up reporting on that.
>
> @saood06 I haven't released anything yet as wasn't really happy with the results, but somebody linked me this paper:
@@ -4038,54 +4083,61 @@ There is also jukofyork who is making draft model's (see [here](https://huggingf
> 
>
> With 30% raw code data in the mix now.
+
+> 👤 **saood06** replied on **2025-04-01** at **00:10:00**
>
-> 👤 **saood06** replied the **2025-04-01** at **00:10:00**:
> @jukofyork
>
> Thanks for the update.
---
-👤 **ikawrakow** replied the **2025-03-26** at **05:03:12**:
+👤 **ikawrakow** commented on **2025-03-26** at **05:03:12**
> [210]6447980.5077,[211]6475482.7036,[212]6484583.7694,[213]6476309.6415,
The imatrix computation that gave these final perplexity values is useless. It means mainline is not working with `Q8_0` either for DeepSeek-V3 (the difference between a NaN PPL and a PPL of 6 million is marginal, if any).
-> 👤 **saood06** replied the **2025-03-26** at **05:08:32**:
+> 👤 **saood06** replied on **2025-03-26** at **05:08:32**
+>
> > It means mainline is not working with `Q8_0` either for DeepSeek-V3 (the difference between a NaN PPL and a PPL of 6 million is marginal, if any).
>
> That's the MLA PR on llama.cpp that is not working, llama.cpp main works as it has been used a lot to do imatrix for the large Deepseek V3/R1 models.
+
+> 👤 **ikawrakow** replied on **2025-03-26** at **06:01:14**
>
-> 👤 **ikawrakow** replied the **2025-03-26** at **06:01:14**:
> It looked like this is @ubergarm's imatrix run? It ran to completion with 213 chunks.
+
+> 👤 **saood06** replied on **2025-03-26** at **06:19:31**
>
-> 👤 **saood06** replied the **2025-03-26** at **06:19:31**:
> > It looked like this is @ubergarm's imatrix run? It ran to completion with 213 chunks.
>
> Yes and that run was on the dairy dreaming PR see below:
>
> > So I managed to build that [fairydreaming/deepseek2-mla-exp@76543311](https://github.com/fairydreaming/llama.cpp/tree/deepseek2-mla-exp) and have `llama-perplexity` running on the plain `q8_0` I made with `ik_llama.cpp`.
+
+> 👤 **ubergarm** replied on **2025-03-26** at **21:23:34**
>
-> 👤 **ubergarm** replied the **2025-03-26** at **21:23:34**:
-> Okay, using PR#291 I was able to compute an importance matrix on a `V3-0324` static `q8_0` quant. I made the `bf16` GGUF using [evshiron/llama.cpp](https://github.com/evshiron/llama.cpp) as outlined in my notes from the original deepseek-ai `fp8`.
+> Okay, using PR[#291](https://github.com/ikawrakow/ik_llama.cpp/issues/291) I was able to compute an importance matrix on a `V3-0324` static `q8_0` quant. I made the `bf16` GGUF using [evshiron/llama.cpp](https://github.com/evshiron/llama.cpp) as outlined in my notes from the original deepseek-ai `fp8`.
>
> I'm not clear if this computes imatrix for the MLA tensors as well? If so, then would this be better to use than the bartowski imatrix computed on mainline?
>
> Anyway, @saood06 if you are interested, I haven't had time to test it yet, but just uploaded it to [ubergarm/DeepSeek-V3-0324-GGUF](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF) hf repo. I hope to eventually upload a quant or two that I like for this fork to that repo.
>
-> Perplexty value and partial logs from computing imatrix on [PR#291 here](https://github.com/ikawrakow/ik_llama.cpp/pull/291#issuecomment-2755540202)
+> Perplexty value and partial logs from computing imatrix on [PR[#291](https://github.com/ikawrakow/ik_llama.cpp/issues/291) here](https://github.com/ikawrakow/ik_llama.cpp/pull/291#issuecomment-2755540202)
>
> Cheers!
+
+> 👤 **saood06** replied on **2025-03-27** at **03:32:08**
>
-> 👤 **saood06** replied the **2025-03-27** at **03:32:08**:
> > Anyway, @saood06 if you are interested, I haven't had time to test it yet, but just uploaded it to [ubergarm/DeepSeek-V3-0324-GGUF](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF) hf repo. I hope to eventually upload a quant or two that I like for this fork to that repo.
>
> Thanks, I would have used your imatrix over bartowski as I think your dataset is better, but I just finished up the quant and don't feel like making another. Once team mradermacher uploads one I may end up making additional quants using both theirs and yours.
>
> Also the forum link on your huggingface readme from L1T caught my eye, I used to hang around there a good amount, haven't in a while, I should go back.
+
+> 👤 **ubergarm** replied on **2025-03-29** at **18:43:50**
>
-> 👤 **ubergarm** replied the **2025-03-29** at **18:43:50**:
> > Thanks, I would have used your imatrix over bartowski as I think your dataset is better, but I just finished up the quant and don't feel like making another. Once team mradermacher uploads one I may end up making additional quants using both theirs and yours.
>
> So I did manage to do a comparison against both imatrix datasets by making two otherwise identical quants and comparing perplexity against `wiki.text.raw`: [here](https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c?permalink_comment_id=5519433#gistcomment-5519433)
@@ -4095,8 +4147,9 @@ The imatrix computation that gave these final perplexity values is useless. It m
> Also, I finished and uploaded my `V3-0324` quant and did a comparison across top quant cookers recipes over in [this discussion](https://github.com/ikawrakow/ik_llama.cpp/discussions/288#discussioncomment-12663525)
>
> The other tip I saw was by [unsloth in r/LocalLLama post](https://www.reddit.com/r/LocalLLaMA/comments/1jk0qjs/178bit_deepseekv30324_230gb_unsloth_dynamic_gguf/) suggesting turn down temp to 0 and min-p to 0.01 when generating code or math. I've seen folks anecdotally suggesting `V3-0324` hallucinates more but might just be the default temps are too high, not sure.
+
+> 👤 **saood06** replied on **2025-03-30** at **01:22:27**
>
-> 👤 **saood06** replied the **2025-03-30** at **01:22:27**:
> > So I did manage to do a comparison against both imatrix datasets by making two otherwise identical quants and comparing perplexity against `wiki.text.raw`: [here](https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c?permalink_comment_id=5519433#gistcomment-5519433)
>
> Nice, thanks for the additional data point on imatrix dataset quality.
diff --git a/github-data/discussions/288 - On _compilade_s PR 12557 and _jukofyork_s quantization ideas.md b/github-data/discussions/288 - On compilades PR 12557 and jukofyorks quantization ideas.md
similarity index 92%
rename from github-data/discussions/288 - On _compilade_s PR 12557 and _jukofyork_s quantization ideas.md
rename to github-data/discussions/288 - On compilades PR 12557 and jukofyorks quantization ideas.md
index 9f6d20eb4..f6bf678d1 100644
--- a/github-data/discussions/288 - On _compilade_s PR 12557 and _jukofyork_s quantization ideas.md
+++ b/github-data/discussions/288 - On compilades PR 12557 and jukofyorks quantization ideas.md
@@ -1,13 +1,14 @@
-### 🗣️ [#288](https://github.com/ikawrakow/ik_llama.cpp/discussions/288) - On @compilade's PR 12557 and @jukofyork's quantization ideas
+## 🗣️ [Discussion #288](https://github.com/ikawrakow/ik_llama.cpp/discussions/288) - On @compilade's PR 12557 and @jukofyork's quantization ideas
| **Author** | `ikawrakow` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-03-25 |
| **Updated** | 2025-04-11 |
---
-#### Description
+## 📄 Description
@compilade has submitted an [interesting PR](https://github.com/ggml-org/llama.cpp/pull/12557) in the mainline `llama.cpp` repository. As it is often the case, @jukofyork has improvement ideas. As both pinged me, and as I no longer hang around in the `llama.cpp` project, I'll address the pings here.
@@ -44,9 +45,9 @@ ___
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **jukofyork** replied the **2025-03-25** at **12:48:44**:
+👤 **jukofyork** commented on **2025-03-25** at **12:48:44**
> @compilade has submitted an [interesting PR](https://github.com/ggml-org/llama.cpp/pull/12557) in the mainline `llama.cpp` repository. As it is often the case, @jukofyork has improvement ideas. As both pinged me, and as I no longer hang around in the `llama.cpp` project, I'll address the pings here.
@@ -62,7 +63,7 @@ I'm sorry if I've come across badly as this isn't my intention - I've nothing to
---
-👤 **ikawrakow** replied the **2025-03-25** at **14:28:09**:
+👤 **ikawrakow** commented on **2025-03-25** at **14:28:09**
@jukofyork Sorry if I have come across a bit harsh. But it is interesting stuff indeed, so we all can get passionate about it.
@@ -86,7 +87,7 @@ to generate the data in the graph (a negative sample size will cause the program
---
-👤 **ikawrakow** replied the **2025-03-25** at **15:01:41**:
+👤 **ikawrakow** commented on **2025-03-25** at **15:01:41**
Here is another very simple C++ program:
* Pick $N$ random values
@@ -102,7 +103,7 @@ With this, we get this graph. It looks very similar to what one gets by doing an
---
-👤 **compilade** replied the **2025-03-25** at **16:25:49**:
+👤 **compilade** commented on **2025-03-25** at **16:25:49**
@ikawrakow
@@ -159,7 +160,7 @@ Yes, totally agree! And technically I already got what I wanted out of these alg
---
-👤 **ikawrakow** replied the **2025-03-25** at **16:53:43**:
+👤 **ikawrakow** commented on **2025-03-25** at **16:53:43**
> Aside: is there a generally better solution for the default importance weights (without imatrix)? (It seems the heuristics between quant types disagree: some use x[i] * x[i], others fabsf(x[i]), and others sqrtf(sum_x2/N) + fabsf(x[I])
@@ -173,16 +174,19 @@ Go back to the basics. Start with LLaMA-v1-7B. I know, nobody uses that today. B
Oh, I used `ik_llama.cpp` to compare. It is possible that has become much faster than mainline (I haven't used mainline for quite some time). I started testing with DeepSeek-Lite, and almost gave up (your `IQ4_NL` quantization took 302.5 seconds with imatrix). `ik_llama.cpp` does it in 54.5 seconds.
-> 👤 **bartowski1182** replied the **2025-03-26** at **17:42:29**:
+> 👤 **bartowski1182** replied on **2025-03-26** at **17:42:29**
+>
> Re: quantization speed
>
> Do you have any loose thoughts on where your crazy speedup may be coming from? Not asking you to do a thorough investigation, but curious if you have an initial place to point me
+
+> 👤 **ikawrakow** replied on **2025-03-26** at **18:16:32**
>
-> 👤 **ikawrakow** replied the **2025-03-26** at **18:16:32**:
> IIRC:
> At some point I was annoyed by the slow quantization speed of quantization types with non-linear grids (`IQ4_XS, IQ4_NL` in mainline, here also `IQ2_KS, IQ2_K, IQ3_K, IQ4_K, IQ5_K, IQ6_K`). The major bottleneck turned out to be finding the bin in which a value falls after scaling. E.g., [this function](https://github.com/ggml-org/llama.cpp/blob/2447ad8a981253a2b8e9f4b31cc8e7fdff83423e/ggml/src/ggml-quants.c#L4562) in mainline, which does a binary search to find the bin. So, I replaced that with functions such as [this one](https://github.com/ikawrakow/ik_llama.cpp/blob/a22250df93fd833a6cb7f310b159ad1b54e4d582/ggml/src/ggml-quants.c#L14528). I think that was the major part. I don't remember if I did additional optimizations and what they were, if any. I would have to go through the old PRs to find out.
+
+> 👤 **compilade** replied on **2025-03-26** at **18:24:02**
>
-> 👤 **compilade** replied the **2025-03-26** at **18:24:02**:
> @bartowski1182
>
> (EDIT: sorry, I did not see ikawrakow's answer before commenting)
@@ -197,11 +201,13 @@ Oh, I used `ik_llama.cpp` to compare. It is possible that has become much faster
>
> I will check if (and how) `best_index_iq4nl` affects the equirectangular projection of `IQ4_NL`, since that seems relevant.
> (EDIT: it doesn't seem to change anything at a cursory glance. So it is pretty much equivalent.)
+
+> 👤 **ikawrakow** replied on **2025-03-26** at **18:40:39**
>
-> 👤 **ikawrakow** replied the **2025-03-26** at **18:40:39**:
> Here some napkin math: @compilade said that their approach is only 2X slower than the master branch in mainline. If I use the DeepSeek-Lint values, it means mainline will quantize it in 150 seconds instead of 300 seconds. If you add this optimization, it will become 50 seconds (using round values to make it easier to follow). You then add 150 seconds for the heap search, and it becomes 200 seconds. So, 4X slower than `ik_llama.cpp`, but only ~30% slower than the current state of mainline.
+
+> 👤 **compilade** replied on **2025-03-26** at **19:26:28**
>
-> 👤 **compilade** replied the **2025-03-26** at **19:26:28**:
> @ikawrakow My implementation (with the cumulative search) unfortunately cannot use this optimization, because it doesn't use `best_index_int8` anyway. The reason my implementation is slow is because it's too exhaustive. It calculates `sumqx` and `sumq2` for *all* scales which would result in a distinct quantization, and it tests both signs. That is `(32*(7+8))+1 = 481` distinct scales compared per block of 32, compared to the `(2*7+1)+1 = 16` scales compared by the implementations which use either `best_index_int8` or `best_index_iq4nl`.
>
> It's nice that it's not `481/16 = 30` times slower, though 6× does seem too slow, I agree.
@@ -209,23 +215,24 @@ Oh, I used `ik_llama.cpp` to compare. It is possible that has become much faster
> The only ways to make the cumulative search faster is to reduce how many scales it searches (which for linear quants is easier because more of them are equivalent and can be skipped), or to make the cumulative step faster.
>
> (It might be possible to mix both approaches to search for more than 16 scales at 1× speed (or faster))
+
+> 👤 **bartowski1182** replied on **2025-03-26** at **19:35:38**
>
-> 👤 **bartowski1182** replied the **2025-03-26** at **19:35:38**:
> Appreciate the insights, thanks!
---
-👤 **ikawrakow** replied the **2025-03-28** at **09:36:09**:
+👤 **ikawrakow** commented on **2025-03-28** at **09:36:09**
@compilade @bartowski1182
-You may be interested in PR #295
+You may be interested in PR [#295](https://github.com/ikawrakow/ik_llama.cpp/issues/295)
---
-👤 **ubergarm** replied the **2025-03-29** at **17:57:59**:
+👤 **ubergarm** commented on **2025-03-29** at **17:57:59**
-While not directly related to the quants specific to #295 , I did just release what may be one of the best quants (for generation quality) in its size class for `V3-0324` on huggingface [ubergarm/DeepSeek-V3-0324-GGUF](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF) cooking with `ik_llama.cpp`. It also still fits 32k context in under 24GB VRAM and can hit over 4 tok/sec tg mmap'ing on my 9950x 96GB + 3090TI 24GB VRAM rig using `-ser 6,1` sacrificing minimal perplexity.
+While not directly related to the quants specific to [#295](https://github.com/ikawrakow/ik_llama.cpp/issues/295) , I did just release what may be one of the best quants (for generation quality) in its size class for `V3-0324` on huggingface [ubergarm/DeepSeek-V3-0324-GGUF](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF) cooking with `ik_llama.cpp`. It also still fits 32k context in under 24GB VRAM and can hit over 4 tok/sec tg mmap'ing on my 9950x 96GB + 3090TI 24GB VRAM rig using `-ser 6,1` sacrificing minimal perplexity.
It only works with `ik_llama.cpp` as even with experimental mainline PRs [fairydreaming:deepseek2-mla-exp](https://github.com/ggml-org/llama.cpp/pull/11446) and [sl/custom-tensor-offload](https://github.com/ggml-org/llama.cpp/pull/11397) you still need support for `IQ3_K_R4`/`IQ2_K_R4` which is only available here.
@@ -290,10 +297,12 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
:point_up:
-> 👤 **ikawrakow** replied the **2025-03-29** at **18:18:55**:
+> 👤 **ikawrakow** replied on **2025-03-29** at **18:18:55**
+>
> I would be really curious to see the PPL values of the other quant cookers.
+
+> 👤 **bartowski1182** replied on **2025-03-29** at **18:42:51**
>
-> 👤 **bartowski1182** replied the **2025-03-29** at **18:42:51**:
> How many chunks of wiki test raw are you using for PPL? If you give your exact command I can get you the PPL for my own quant
>
> It's very intriguing. I know that most likely the unsloth one will be better than my own since he went out of his way to optimize the tensor types for that model which is just not something I have the throughput to handle 😅
@@ -301,8 +310,9 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
> Also don't really want to make the same ones as him and release them since it would just be ripping off his work 🤷♂️
>
> Interesting stuff overall though
+
+> 👤 **ubergarm** replied on **2025-03-29** at **19:06:34**
>
-> 👤 **ubergarm** replied the **2025-03-29** at **19:06:34**:
> Yeah I'm curious too! Bartowski you do use imatrix though, which I don't think unsloth does. So so not sure how that would make up for the smaller tensor types.
>
> I just ran the `Q8_0` for baseline comparison and got this result:
@@ -372,16 +382,18 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
> Finally, I'm not sure what imatrix text mradermacher uses to make imatrix, but I did a [quick comparison](https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c?permalink_comment_id=5519433#gistcomment-5519433) of two otherwise identical quantizations using bartowski's imatrix and a slightly updated input text. They give similar perplexity against wiki.text.raw, for whatever that is worth hah...
>
> Anyway, yeah thanks for all your effort! I dunno how y'all keep up with the torrent of near weekly big model releases lately! Cheers!
+
+> 👤 **ikawrakow** replied on **2025-03-29** at **19:06:35**
>
-> 👤 **ikawrakow** replied the **2025-03-29** at **19:06:35**:
> I think @ubergarm can do the full PPL in less than an hour with their Xeon server. I don't know what kind of hardware you have.
>
> > ... since he went out of his way to optimize the tensor types for that model
> > Also don't really want to make the same ones as him and release them since it would just be ripping off his work
>
> I'm sure you are aware that quantization mixes have been in `llama.cpp` since the release of k-quants. All of those use more bits for the first few `ffn_down` layers. Also all of them use more bits for the attention tensors in MoE models. If you look at the Unsloth's so called "dynamic" quants, it is easy to see that with a small change of the function that determines the quantization type to handle the different names of the DeepSeek tensors (and the presence of shared experts), you will get basically what they used. Did they mention that? Of course not. So now the entire industry knows that Unsloth invented "dynamic" quants.
+
+> 👤 **bartowski1182** replied on **2025-03-29** at **20:14:48**
>
-> 👤 **bartowski1182** replied the **2025-03-29** at **20:14:48**:
> Yeah I did browse through his repo to check the changes he made, I do understand the overall nature of the quantization mixes and his adjustments made, and I know I could either pull his fork or make similar changes of my own to get the same results but just out of principle don't want to rehost if I'm not actually adding anything to the process
>
> I've got myself an EPYC server so things run pretty okay on my end as well, I'm just lacking on the GPU front for some things :)
@@ -393,8 +405,9 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
> > All of those use more bits for the first few ffn_down layers. Also all of them use more bits for the attention tensors in MoE models
>
> This part however I was not explicitly aware of, but still in terms of raw bits per weight, unsloth's mix seems superior (at least in the tests he has ran, PPL, KLD, and additional tests would be good to see if it's genuinely big improvements or if it's actually similar overall)
+
+> 👤 **saood06** replied on **2025-03-30** at **01:51:10**
>
-> 👤 **saood06** replied the **2025-03-30** at **01:51:10**:
> Since mradermacher doesn't use gguf split you may have to use [gguf-py/scripts/gguf_dump.py](https://github.com/ikawrakow/ik_llama.cpp/blob/main/gguf-py/scripts/gguf_dump.py) to get the metadata.
>
> > 👇
@@ -402,8 +415,9 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
> > ☝️
>
> You can probably remove tensor_count doesn't matter, as it changes based on split size and kv_count also doesn't really mean much it's just the number of entries of metadata from your table.
+
+> 👤 **ikawrakow** replied on **2025-03-30** at **05:44:14**
>
-> 👤 **ikawrakow** replied the **2025-03-30** at **05:44:14**:
> > This part however I was not explicitly aware of, but still in terms of raw bits per weight, unsloth's mix seems superior
>
> Superior compared to what? To unmaintained `llama.cpp`? Where @compilade's PR 12557 is the first noteworthy thing related to quantization that has happened since I left the project more than a year ago?
@@ -417,8 +431,9 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
> When the quantization mix strategies for MoE were written, experts were in separate tensors named `blk.X.ffn_up/gate/down.Y.weight` (where `X` was the layer index and `Y` the expert index). Then somebody decided to combine the experts into a single tensor named `blk.X.ffn_up/down/gate_exps.weight`, but did not change the code that decides on the quantization mix. Voila, you have the `QX_K_M` "dynamic" quants not working as intended.
>
> Take a look at the code block that follows `} else if (name.find("ffn_down") != std::string::npos) {`. Several of the quantization type modifications use more bits for the first `1/8` of the layers. Which is 7 for DeepSeek-V3/R1. In how many layers do Unsloth use more bits for `ffn_down` in their "carefully tuned dynamic" quants?
+
+> 👤 **bartowski1182** replied on **2025-03-30** at **15:33:58**
>
-> 👤 **bartowski1182** replied the **2025-03-30** at **15:33:58**:
> > Superior compared to what? To unmaintained llama.cpp? Where @compilade's PR 12557 is the first noteworthy thing related to quantization that has happened since I left the project more than a year ago?
>
> I mean yeah I did mention that I wouldn't be surprised if this branch has superior performance over even what he did 🤷♂️ I do recognize the stale state llama.cpp has been left in with regards to SOTA quantization performance
@@ -434,26 +449,31 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
> I also recognize the fact that since you left quantization itself has definitely gone to the backburner, I'm very thankful to compilade for his efforts but yeah, not quite the same since
>
> I'm also surprised no one has come around and attempted to upstream some of your changes, several seem like just free performance gains, others are understandably more complex but there's certainly a few low hanging fruit that are just being ignored (and yes I recognize the irony of not doing it myself while complaining others aren't doing it)
+
+> 👤 **ikawrakow** replied on **2025-03-30** at **17:03:32**
>
-> 👤 **ikawrakow** replied the **2025-03-30** at **17:03:32**:
> The only reason I started this discussion was that you wrote above "... it would just be ripping off his work". And the point I was trying to make was that it would be perfectly fine to rip off their work as this is exactly what they did.
+
+> 👤 **bartowski1182** replied on **2025-03-30** at **17:26:34**
>
-> 👤 **bartowski1182** replied the **2025-03-30** at **17:26:34**:
> Oh I mean, fair haha. I guess I meant I don't want to strictly 1:1 copy his repo and release identical quants
>
> But you're definitely right that his work is basically just a bandage solution that happens to be the proper way to handle MoE models in general
>
> I do highly appreciate the insight though for the record, I don't mean to come off as argumentative or dismissive! I'll be looking into what you suggested for sure
+
+> 👤 **bartowski1182** replied on **2025-03-30** at **19:24:25**
>
-> 👤 **bartowski1182** replied the **2025-03-30** at **19:24:25**:
> @ikawrakow would you mind if I took inspiration from your changes to https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp for some upstream work on llama_tensor_get_type? "inspiration" in this case would likely mean just straight up copying any changes that, to my untrained eye, seem strictly better and without risk of negatives (since I wouldn't discount the possibility some may be negative without other appropriate changes throughout the system)
+
+> 👤 **ikawrakow** replied on **2025-03-31** at **06:01:25**
>
-> 👤 **ikawrakow** replied the **2025-03-31** at **06:01:25**:
> Sure, go ahead. I see I haven't actually changed all occurrences of `n_expert == 8` to `n_expert >= 8`, so you may want find/replace all when making the change.
>
> Here people now use custom rules for making quants, so you may want to explore this as well. If you stick to quants available in mainline `llama.cpp`, you can "cook" the quants you publish with `ik_llama.cpp`.
+
+> 👤 **bartowski1182** replied on **2025-04-01** at **23:20:00**
>
-> 👤 **bartowski1182** replied the **2025-04-01** at **23:20:00**:
> @ubergarm I finished PPL of my original Q2_K upload and a new one I've added with changes from here and also just copying a bit of other work in the area
>
> llama.cpp main: 3.9012
@@ -461,8 +481,9 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
> my fork: 3.6868
>
> considering the size only increased by 1%, i'm pretty stoked with that PPL improvement, and while yours is clearly still better, llama.cpp main is missing lots of ikawrakow's magic so it's not bad!
+
+> 👤 **saood06** replied on **2025-04-02** at **00:19:01**
>
-> 👤 **saood06** replied the **2025-04-02** at **00:19:01**:
> > I finished PPL of my original Q2_K upload and a new one I've added with changes from here and also just copying a bit of other work in the area
> >
> > llama.cpp main: 3.9012
@@ -472,8 +493,9 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
> > considering the size only increased by 1%, i'm pretty stoked with that PPL improvement, and while yours is clearly still better, llama.cpp main is missing lots of ikawrakow's magic so it's not bad!
>
> I'm not ubergarm, but thank you for this, I'm always curious to see PPL numbers and this is interesting.
+
+> 👤 **ubergarm** replied on **2025-04-02** at **19:26:29**
>
-> 👤 **ubergarm** replied the **2025-04-02** at **19:26:29**:
> > @ubergarm I finished PPL of my original Q2_K upload and a new one I've added with changes from here and also just copying a bit of other work in the area
> >
> > llama.cpp main: 3.9012
@@ -487,22 +509,25 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
> Also interesting that [suddenly today mainline llama.cpp merged in `-ot` support!](https://github.com/ggml-org/llama.cpp/pull/11397). Curious what they will do with [MLA support](https://github.com/ggml-org/llama.cpp/pull/11446).
>
> Cheers!
+
+> 👤 **bartowski1182** replied on **2025-04-03** at **03:10:18**
>
-> 👤 **bartowski1182** replied the **2025-04-03** at **03:10:18**:
> Opened the PR here:
>
> https://github.com/ggml-org/llama.cpp/pull/12727
>
> that Q2_K_L-V2 will be replaced with a SLIIIIGHTLY better one probably tomorrow, but it's basically the same overall, just a few small bumps for another couple hundred mb
+
+> 👤 **danielhanchen** replied on **2025-04-03** at **03:41:53**
>
-> 👤 **danielhanchen** replied the **2025-04-03** at **03:41:53**:
> Oh hi! I didn't expect to be tagged - @bartowski1182 you're more than welcome to use the llama.cpp fork I have :)
>
> @ikawrakow Much apologies if people are mis-representing I "invented" dynamic quants, which is far from the truth. Appreciate the work you do, and keep it up - and ignore all the haters - your code is great!
>
> @ubergarm Great work on the quant as well! I was planning to do imatrix for all quants from now on, but I'm still trying to get the calibration dataset done specifically for instruct models - reasoning models are also a bit more complex.
+
+> 👤 **danielhanchen** replied on **2025-04-03** at **03:45:49**
>
-> 👤 **danielhanchen** replied the **2025-04-03** at **03:45:49**:
> It was actually pure coincidence on making the dynamic quants for DeepSeek R1, V3, since unfortunately as @ikawrakow mentioned, `llama.cpp` also quantizes the shared experts and dense layers the same as the rest of the model - my changes are at https://github.com/unslothai/llama.cpp/
>
> But the main motivation for "dynamic quants" was due to bitsandbytes and vLLM for finetuning, not actually llama.cpp as @bartowski1182 mentioned. For eg in Gemma 3, I did both activation and weight error analysis to see which parts to quantize / not quantize:
@@ -510,7 +535,7 @@ Big thanks to y'all doing so much inspirational work and making this stuff more
---
-👤 **saood06** replied the **2025-04-11** at **03:06:19**:
+👤 **saood06** commented on **2025-04-11** at **03:06:19**
@danielhanchen
@@ -520,4 +545,4 @@ For Maverick you reported hitting this over protectiveness issue in llama.cpp
>We tried adding more uncommon languages to our calibration dataset, and tried using more tokens (1 million) vs Scout's 250K tokens for calibration
-That issue has been addressed here in #202 but you may need to adjust it to allow 10% missing to get the blk.1 tensors as well (but block 45 is below 50% which seems very odd).
\ No newline at end of file
+That issue has been addressed here in [#202](https://github.com/ikawrakow/ik_llama.cpp/issues/202) but you may need to adjust it to allow 10% missing to get the blk.1 tensors as well (but block 45 is below 50% which seems very odd).
\ No newline at end of file
diff --git a/github-data/discussions/316 - Mainline is now copying stuff from ik_llama.cpp.md b/github-data/discussions/316 - Mainline is now copying stuff from ik_llama.cpp.md
index ed02ac0c9..f4c009192 100644
--- a/github-data/discussions/316 - Mainline is now copying stuff from ik_llama.cpp.md
+++ b/github-data/discussions/316 - Mainline is now copying stuff from ik_llama.cpp.md
@@ -1,13 +1,14 @@
-### 🗣️ [#316](https://github.com/ikawrakow/ik_llama.cpp/discussions/316) - Mainline is now copying stuff from ik_llama.cpp
+## 🗣️ [Discussion #316](https://github.com/ikawrakow/ik_llama.cpp/discussions/316) - Mainline is now copying stuff from ik_llama.cpp
| **Author** | `ikawrakow` |
| :--- | :--- |
+| **State** | ❌ **Closed** |
| **Created** | 2025-04-06 |
| **Updated** | 2025-04-29 |
---
-#### Description
+## 📄 Description
We have [this merged PR](https://github.com/ggml-org/ggml/pull/1174) and [this pending PR](https://github.com/ggml-org/ggml/pull/1179) in the [ggml repository](https://github.com/ggml-org/ggml) copying code from `ik_llama.cpp`. It is an interesting choice of venue. [ggml](https://github.com/ggml-org/ggml) is well known, but much lower profile than [llama.cpp](https://github.com/ggml-org/llama.cpp). We know that changes added to `ggml` quietly make their way into `llama.cpp` with "sync: ggml" PRs such as [this one](https://github.com/ggml-org/llama.cpp/pull/12670).
@@ -29,9 +30,9 @@ But, hey, IANAL, so it is maybe better to focus on the moral side of things. Whe
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **CISC** replied the **2025-04-06** at **13:12:04**:
+👤 **CISC** commented on **2025-04-06** at **13:12:04**
Uh, I was not aware of any wish for your work to be removed, in fact, I made the PRs solely based on your comment here: https://github.com/ikawrakow/ik_llama.cpp/discussions/256#discussioncomment-12496828
@@ -39,7 +40,7 @@ I chose to submit these to `ggml` not for some nefarious reason, but simply beca
---
-👤 **CISC** replied the **2025-04-06** at **13:33:21**:
+👤 **CISC** commented on **2025-04-06** at **13:33:21**
> Hmm. The PRs are definitely not a copy of `ik_llama.cpp`, but are they a "substantial portion" of it? How is "substantial" being measured? By LOCs? By utility? By some other measure?
@@ -51,18 +52,19 @@ Please don't blame anyone else than me, I do not represent `ggml` nor `llama.cpp
---
-👤 **ikawrakow** replied the **2025-04-06** at **13:50:50**:
+👤 **ikawrakow** commented on **2025-04-06** at **13:50:50**
@CISC
I'm sorry if this came across as a critique/attack on you. That was not the intent, and it has nothing to do with you. It is between ggerganov and me. Given the history, and there is 15 years of it even before `llama.cpp` came to be, I would have expected a different reaction from ggerganov to your PRs.
-> 👤 **JohannesGaessler** replied the **2025-04-06** at **14:06:02**:
+> 👤 **JohannesGaessler** replied on **2025-04-06** at **14:06:02**
+>
> In the end I am the one who is responsible for reviewing and merging the PR in question. I had interpreted [this post](https://github.com/ikawrakow/ik_llama.cpp/discussions/256#discussioncomment-12496828) as permission to do so without preconditions. I'm sorry for acting against your wishes.
---
-👤 **CISC** replied the **2025-04-06** at **14:08:38**:
+👤 **CISC** commented on **2025-04-06** at **14:08:38**
This puts me in a bind though, my intention was to upstream what I could (with the hardware I have available to test) as it seemed you were suggesting that this should be done (but not willing to do yourself).
@@ -70,16 +72,17 @@ You have made a great number of awesome contributions here, and I still wish for
---
-👤 **ikawrakow** replied the **2025-04-06** at **14:37:07**:
+👤 **ikawrakow** commented on **2025-04-06** at **14:37:07**
-@CISC @JohannesGaessler As you both refer to what I wrote in #256, here it is:
+@CISC @JohannesGaessler As you both refer to what I wrote in [#256](https://github.com/ikawrakow/ik_llama.cpp/issues/256), here it is:
> upstream is free to take from here whatever they find useful
Meaning there is nothing I can do to prevent that from happening as I'm publishing under a MIT license. I don't think I said that I do not expect upstream to abide by the terms of the license.
-> 👤 **CISC** replied the **2025-04-06** at **14:38:40**:
-> > @CISC @JohannesGaessler As you both refer to what I wrote in #256, here it is:
+> 👤 **CISC** replied on **2025-04-06** at **14:38:40**
+>
+> > @CISC @JohannesGaessler As you both refer to what I wrote in [#256](https://github.com/ikawrakow/ik_llama.cpp/issues/256), here it is:
> >
> > > upstream is free to take from here whatever they find useful
> >
@@ -89,7 +92,7 @@ Meaning there is nothing I can do to prevent that from happening as I'm publishi
---
-👤 **ikawrakow** replied the **2025-04-07** at **06:30:56**:
+👤 **ikawrakow** commented on **2025-04-07** at **06:30:56**
So, this is becoming interesting. Here is what @ggerganov has to say about my copyright notice being included in the file(s) where stuff was copied from my work:
@@ -103,11 +106,12 @@ The [discussion 6934](https://github.com/ggml-org/llama.cpp/discussions/6394) wa
---
-👤 **JohannesGaessler** replied the **2025-04-07** at **07:59:15**:
+👤 **JohannesGaessler** commented on **2025-04-07** at **07:59:15**
For the record: Do you find it acceptable for people to read your code and to then submit a PR to llama.cpp/ggml with the same functionality?
-> 👤 **ikawrakow** replied the **2025-04-07** at **09:10:21**:
+> 👤 **ikawrakow** replied on **2025-04-07** at **09:10:21**
+>
> > For the record: Do you find it acceptable for people to read your code and to then submit a PR to llama.cpp/ggml with the same functionality?
>
> I addressed that above. But here it is again my perhaps wrong concept of how it should be:
@@ -115,11 +119,13 @@ For the record: Do you find it acceptable for people to read your code and to th
> * If you reimplement what I have done here in your own way, you don't need to mention me or this repository. But if you were nice, you would still mention the original source/idea. Just like in many places in the ggml/llama.cpp code there are references to papers and/or other repositories.
>
> Now, also for the record, it isn't so that there aren't copyright notices in `ggml` "sprinkled around the code" as @ggerganov puts it. See for instance [this](https://github.com/ggml-org/ggml/blob/ab9ed73d40965d7e4b25a4adf2230b9a19bffbf9/src/ggml-cpu/ops.cpp#L4996) (and same notices in all other backends). I have this line in my fork as well in a completely [different place](https://github.com/ikawrakow/ik_llama.cpp/blob/a051f08b8f059fa10dd089d231b975291c122e9d/ggml/src/ggml.c#L16726), so it has been preserved over multiple code reorganizations (so, maintaining copyright notices in the source code as things are moved around is not quite as painful as claimed). You don't wonder why a Kawrakow copyright notice is so different from a Jeffrey Quesnelle and Bowen Peng copyright notice?
+
+> 👤 **JohannesGaessler** replied on **2025-04-07** at **10:41:05**
>
-> 👤 **JohannesGaessler** replied the **2025-04-07** at **10:41:05**:
> Thank you for your input. My perspective is that I don't have the ability to resolve a conflict between you and Georgi especially because I'm ignorant of your prior history. My previous policy was that I would simply not look at any of your code and that is what I will go back to.
+
+> 👤 **bartowski1182** replied on **2025-04-13** at **15:47:29**
>
-> 👤 **bartowski1182** replied the **2025-04-13** at **15:47:29**:
> As another outsider without a horse in this race (besides wanting everyone to benefit as much as possible by all the best work), I don't think a simple code comment referencing either the original PR from this repo, or lacking the ability to find one simply, a quick mention of this repo, world detract much if anything from the overall code experience
>
> In fact, recently when making changes, I've seen code with a comment referencing a PR from other repos, or from llamacpp itself, and these help immensely for tracking down motivations and any potential discussions that went on at the time
@@ -130,7 +136,7 @@ For the record: Do you find it acceptable for people to read your code and to th
---
-👤 **ikawrakow** replied the **2025-04-07** at **11:07:50**:
+👤 **ikawrakow** commented on **2025-04-07** at **11:07:50**
> My previous policy was that I would simply not look at any of your code and that is what I will go back to.
@@ -138,40 +144,48 @@ Yes, of course, as predicted.
---
-👤 **jano403** replied the **2025-04-07** at **11:16:19**:
+👤 **jano403** commented on **2025-04-07** at **11:16:19**
A based thing to do would be to license your repository under AGPL3.0, solves all problems.
-> 👤 **ikawrakow** replied the **2025-04-07** at **11:23:15**:
+> 👤 **ikawrakow** replied on **2025-04-07** at **11:23:15**
+>
> > A based thing to do would be to license your repository under AGPL3.0, solves all problems.
>
> Yes, I agree, it would have been better. But I didn't feel like juggling two different licenses, so just went with the original MIT license.
>
> On the other hand, the final outcome would not have been any different. Mainline will independently discover and implement the improvement I have made here without looking at my changes, not even once. I think this was made very clear by @JohannesGaessler's last comment.
+
+> 👤 **jano403** replied on **2025-04-07** at **11:29:07**
>
-> 👤 **jano403** replied the **2025-04-07** at **11:29:07**:
> Never too late to change it if You ever feel like it.
> Btw, appreciate all the hard work You're doing for quants and speed improvements!
+
+> 👤 **ikawrakow** replied on **2025-04-07** at **11:40:33**
>
-> 👤 **ikawrakow** replied the **2025-04-07** at **11:40:33**:
> I would need to read up on what is the correct way of mixing MIT licensed code with (A)GPL licensed code. Or can you point me to a simple to follow set of instructions?
+
+> 👤 **CISC** replied on **2025-04-07** at **12:00:19**
>
-> 👤 **CISC** replied the **2025-04-07** at **12:00:19**:
> I'm not sure what "problems" that is supposed to fix though? Was the license really the problem?
+
+> 👤 **ikawrakow** replied on **2025-04-07** at **12:06:07**
>
-> 👤 **ikawrakow** replied the **2025-04-07** at **12:06:07**:
> It would have avoided ggerganov talking about the Berne Convention and implying that no copyright notices are required, or putting contributors such as yourself into the difficult position of having to choose between doing the right thing or following his rules.
+
+> 👤 **CISC** replied on **2025-04-07** at **12:15:28**
>
-> 👤 **CISC** replied the **2025-04-07** at **12:15:28**:
> It would have avoided me even considering upstreaming, that's all, the rest is unrelated fallout.
+
+> 👤 **jano403** replied on **2025-04-07** at **12:34:09**
>
-> 👤 **jano403** replied the **2025-04-07** at **12:34:09**:
> > I would need to read up on what is the correct way of mixing MIT licensed code with (A)GPL licensed code. Or can you point me to a simple to follow set of instructions?
>
> I believe the MIT license is compatible with GPL/AGPL, take a look at https://github.com/LostRuins/koboldcpp for example. The original code would still be MIT licensed but the project as a whole, including Your modifications would be GPL/AGPL licensed.
> 
+
+> 👤 **jano403** replied on **2025-04-07** at **12:35:47**
>
-> 👤 **jano403** replied the **2025-04-07** at **12:35:47**:
> https://www.gnu.org/licenses/license-list.en.html#GPLCompatibleLicenses
> 
> 
@@ -184,21 +198,23 @@ A based thing to do would be to license your repository under AGPL3.0, solves al
> //
> ```
> or similar when You make new changes.
+
+> 👤 **ikawrakow** replied on **2025-04-07** at **12:48:51**
>
-> 👤 **ikawrakow** replied the **2025-04-07** at **12:48:51**:
> > It would have avoided me even considering upstreaming, that's all, the rest is unrelated fallout.
>
> Well, also that. Which have resulted in you having a much less interesting weekend 😄
---
-👤 **ikawrakow** replied the **2025-04-07** at **11:24:52**:
+👤 **ikawrakow** commented on **2025-04-07** at **11:24:52**
@CISC
I'm sorry you ended up in the middle of this. I hope this has not damaged your relation with, and your ability to contribute to, the `ggml` and `llama.cpp` projects.
-> 👤 **CISC** replied the **2025-04-07** at **11:58:00**:
+> 👤 **CISC** replied on **2025-04-07** at **11:58:00**
+>
> > I'm sorry you ended up in the middle of this. I hope this has not damaged your relation with, and your ability to contribute to, the `ggml` and `llama.cpp` projects.
>
> Let's just say this weekend was more interesting than I would have liked. :(
\ No newline at end of file
diff --git a/github-data/discussions/319 - KTransformers copying ik_llama.cpp.md b/github-data/discussions/319 - KTransformers copying ik_llama.cpp.md
index cfbcfaf7c..2d8858ca3 100644
--- a/github-data/discussions/319 - KTransformers copying ik_llama.cpp.md
+++ b/github-data/discussions/319 - KTransformers copying ik_llama.cpp.md
@@ -1,13 +1,14 @@
-### 🗣️ [#319](https://github.com/ikawrakow/ik_llama.cpp/discussions/319) - KTransformers copying ik_llama.cpp
+## 🗣️ [Discussion #319](https://github.com/ikawrakow/ik_llama.cpp/discussions/319) - KTransformers copying ik_llama.cpp
| **Author** | `ikawrakow` |
| :--- | :--- |
+| **State** | ❌ **Closed** |
| **Created** | 2025-04-08 |
| **Updated** | 2025-04-13 |
---
-#### Description
+## 📄 Description
[This PR](https://github.com/kvcache-ai/ktransformers/pull/754) is a direct copy from [this file](https://github.com/ikawrakow/ik_llama.cpp/blob/main/ggml/src/iqk/iqk_mul_mat.cpp) in `ik_llama.cpp`. It never acknowledges the source of the changes, and the KTransformers maintainers did not respond to [my comment](https://github.com/kvcache-ai/ktransformers/pull/754#issuecomment-2781515478) I left in the PR.
@@ -15,15 +16,15 @@ The PR is being sold as `IQ1_S` implementation, but it copies not just the `IQ1_
For those who don't know, KTRansformers uses the quantized GEMM/GEMV implementation that I contributed to [llamafile](https://github.com/Mozilla-Ocho/llamafile). `llamafile` uses the Apache-2.0 license, so I contributed the code under that license. KTransformers have kept the [copyright notice](https://github.com/kvcache-ai/ktransformers/blob/f4ae7c85edd66d6acf3ef253eeaf0143eb3358ab/third_party/llamafile/iqk_mul_mat.inc#L3) in the file, but did not update after merging PR 754, which contains a copy of MIT licensed code.
-KTransformers PR 754 is interesting anyway. Github user @godrosev entered issue #209 on February 19 asking for `IQ1_S` support in `llamafile`. There was already implementation for the row-interleaved variant `IQ1_S_R4` in `ik_llama.cpp`, so I wasn't planning to also have support for `IQ1_S`, and suggested to them to use that instead. But after some back-and-fort, I decided to add `IQ1_S`, which I did in PR #212 on Feb 20. The KTransformers PR 754 is on March 3 and comes from Github user @moonshadow-25. There are 5 commits in the PR, and the first 2 come from @godrosev. @godrosev and @moonshadow-25 both have no Github activity other the PR (and Issue #209).
+KTransformers PR 754 is interesting anyway. Github user @godrosev entered issue [#209](https://github.com/ikawrakow/ik_llama.cpp/issues/209) on February 19 asking for `IQ1_S` support in `llamafile`. There was already implementation for the row-interleaved variant `IQ1_S_R4` in `ik_llama.cpp`, so I wasn't planning to also have support for `IQ1_S`, and suggested to them to use that instead. But after some back-and-fort, I decided to add `IQ1_S`, which I did in PR [#212](https://github.com/ikawrakow/ik_llama.cpp/issues/212) on Feb 20. The KTransformers PR 754 is on March 3 and comes from Github user @moonshadow-25. There are 5 commits in the PR, and the first 2 come from @godrosev. @godrosev and @moonshadow-25 both have no Github activity other the PR (and Issue [#209](https://github.com/ikawrakow/ik_llama.cpp/issues/209)).
So now the question is, what do I do about that. Opinions?
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **moonshadow-25** replied the **2025-04-08** at **08:50:43**:
+👤 **moonshadow-25** commented on **2025-04-08** at **08:50:43**
hi ikawrakow, I am not an official developer of KT,@godrosv he is my colleague, and I am very sorry about this matter. After he gave me the code, I started the porting work without asking the source, but I noticed that the author in the file is also the same module's author as Llamafile, which is you. Afterwards, I completed all the porting work but did not modify any author information, because from the beginning KT kept mentioning that they used llamaflile as the core optimization, and I only filled in the complete functionality.
@@ -31,7 +32,7 @@ I have always felt that the CPU optimization in Llamafile is the best part done.
---
-👤 **ikawrakow** replied the **2025-04-08** at **09:29:53**:
+👤 **ikawrakow** commented on **2025-04-08** at **09:29:53**
> and I am very sorry about this matter
@@ -39,22 +40,24 @@ Are you planning to correct it? The 1800 lines added in your PR are not a "port"
---
-👤 **moonshadow-25** replied the **2025-04-08** at **10:06:25**:
+👤 **moonshadow-25** commented on **2025-04-08** at **10:06:25**
Yes, I have always believed that both the early content and the “ported” parts of Llamafile originated from your work. And what I did more was porting and testing, so I never intended to modify (except for necessary interface adjustments) your work. I think this is your contribution!
I hope we can have more communication in the future
-> 👤 **ikawrakow** replied the **2025-04-08** at **11:19:06**:
+> 👤 **ikawrakow** replied on **2025-04-08** at **11:19:06**
+>
> Sorry, @moonshadow-25, but there are no "ported” parts of Llamafile in your PR. There are 1800 lines of code copied from here. They do not exist in Llamafile to be "ported" (i.e., copied) from there.
>
> You have created a bit of a mess with your PR. KTransformers and Llamafile are both Apache-2.0 licensed. But the code here is published under a MIT License. Now, Apache-2.0 and MIT are both very permissive licenses, so it is easy to bundle code published under these license together, as explained for instance [here](https://infra.apache.org/licensing-howto.html). You could have even asked me if I would be willing to relicense the portions you copied to Apache-2.0 so it makes things easier for KTransformers (after all, I did change the MIT License of the code I contributed to Llamafile to Apache-2.0 to make it easier for them). But as permissive as these licenses are, it does not mean you can just ignore what they ask you to do.
+
+> 👤 **moonshadow-25** replied on **2025-04-08** at **11:41:27**
>
-> 👤 **moonshadow-25** replied the **2025-04-08** at **11:41:27**:
> Indeed, I am very sorry that I only realized the difference now. They look too similar, and both authors are you. So I subjectively assumed it was the same license.
> I must make some remedies as soon as possible, and I hope to hear your advice
---
-👤 **ikawrakow** replied the **2025-04-13** at **15:56:21**:
+👤 **ikawrakow** commented on **2025-04-13** at **15:56:21**
The KTransformers devs have now merged [this PR](https://github.com/kvcache-ai/ktransformers/pull/1116), which addresses the concern raised in this discussion => closing.
\ No newline at end of file
diff --git a/github-data/discussions/323 - Is there an easy way to repack an existing GGUF so it could be used wit.md b/github-data/discussions/323 - Is there an easy way to repack an existing GGUF so it could be used without --ru.md
similarity index 91%
rename from github-data/discussions/323 - Is there an easy way to repack an existing GGUF so it could be used wit.md
rename to github-data/discussions/323 - Is there an easy way to repack an existing GGUF so it could be used without --ru.md
index 53834b34f..26d875d55 100644
--- a/github-data/discussions/323 - Is there an easy way to repack an existing GGUF so it could be used wit.md
+++ b/github-data/discussions/323 - Is there an easy way to repack an existing GGUF so it could be used without --ru.md
@@ -1,13 +1,14 @@
-### 🗣️ [#323](https://github.com/ikawrakow/ik_llama.cpp/discussions/323) - Is there an easy way to repack an existing GGUF so it could be used without --run-time-repack (thus enabling mmap)
+## 🗣️ [Discussion #323](https://github.com/ikawrakow/ik_llama.cpp/discussions/323) - Is there an easy way to repack an existing GGUF so it could be used without --run-time-repack (thus enabling mmap)
| **Author** | `Lissanro` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-04-10 |
| **Updated** | 2025-05-21 |
---
-#### Description
+## 📄 Description
DeepSeek-V3-0324-GGUF-UD-Q4_K_XL works great for me when I load it using --run-time-repack, I get more than 7 tokens/s with EPYC 7763 and 1TB of 3200MHz RAM + 4x3090 GPUs. But this unfortunately disables mmap and requires a lot of compute on each reload - and if I need to switch models often in some tasks (for example, a separate model to process input images and describe them, then continue with DeepSeek V3), it slows things down.
@@ -30,9 +31,9 @@ This command utilizes about 20GB of VRAM on each 24GB GPU. The main issue is tha
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-04-10** at **15:31:47**:
+👤 **ikawrakow** commented on **2025-04-10** at **15:31:47**
You can use
```
@@ -55,13 +56,15 @@ More generally, you can use `--repack-pattern` in the `llama-quantize` command b
```
is equivalent.
-> 👤 **ikawrakow** replied the **2025-04-10** at **15:36:25**:
+> 👤 **ikawrakow** replied on **2025-04-10** at **15:36:25**
+>
> I have never repacked (or quantized) a multi-part GGUF, so I don't know if `llama-quantize` does the right thing to load all parts. In case it does not, you may need to concatenate the parts into a single file
> ```
> cat file1 file2 ... fileN >>combined_file
> ```
+
+> 👤 **saood06** replied on **2025-04-10** at **23:00:39**
>
-> 👤 **saood06** replied the **2025-04-10** at **23:00:39**:
> >In case it does not, you may need to concatenate the parts into a single file
> >
> > ```
@@ -72,13 +75,13 @@ is equivalent.
---
-👤 **ubergarm** replied the **2025-04-10** at **22:05:30**:
+👤 **ubergarm** commented on **2025-04-10** at **22:05:30**
> I noticed that DeepSeek-V3-0324-GGUF-IQ4_K_R4 for example gives me 4-5 tokens/s at most, my guess because it quantized very differently, even though it has about the same size.
A few thoughts here:
-1. My quant was designed to be a bit heavy in the non-routed experts to give better quality output. You can trade-off some quality for extra speed by adding `-ser 6,1` as detailed in [PR#239](https://github.com/ikawrakow/ik_llama.cpp/pull/239).
+1. My quant was designed to be a bit heavy in the non-routed experts to give better quality output. You can trade-off some quality for extra speed by adding `-ser 6,1` as detailed in [PR[#239](https://github.com/ikawrakow/ik_llama.cpp/issues/239)](https://github.com/ikawrakow/ik_llama.cpp/pull/239).
2. My quant is designed to offload just over 17GiB weights to VRAM plus context cache. However, it looks like you have 96 GB VRAM (4x GPUs?). Using `-ot exps=CPU` shouldn't fill up 20GB VRAM on 4x cards (80GB)?. Designing a quant specific to multiple-gpu setups like yours is more tricky as you want to offload some of the routed `exps` layers which need to be quantized in a way suited for GPU inferencing.
So yeah, like ik mentions, you will want to use `./bin/llama-quantize --repack --repack-pattern "ffn_down_exps,ffn_up_exps,gate_exps" etc.` and figure out ahead of time the size of the tensors/layers you want to offload onto GPU (and don't repack those), and only repack the remaining routed experts `exps` layers going into RAM for CPU inferencing. In other words the repacked `q4_k_r4` is for running on CPU RAM. Don't repack the tensors/layers you're running on GPU.
@@ -89,7 +92,7 @@ Cheers!
---
-👤 **Lissanro** replied the **2025-04-11** at **10:49:26**:
+👤 **Lissanro** commented on **2025-04-11** at **10:49:26**
@ikawrakow
Thank you, I was able to convert based on the suggested command, but the issue is, performance of the converted quant is very low, so I cannot really use it yet. I would appreciate any help to figure out how to convert it in the same way like -rtr option does, but to a file permanently, so I can use mmap and load without -rtr option.
@@ -128,7 +131,7 @@ INFO [ print_timings] generation eval time = 76603.74 ms / 303 run
---
-👤 **Lissanro** replied the **2025-04-11** at **10:52:18**:
+👤 **Lissanro** commented on **2025-04-11** at **10:52:18**
@saood06
It seems my own quant converted from the Unsloth one also loses a lot of performance, so it may not be something specific to your quant. I am not sure what the issue is yet. It is worth mentioning that my EPYC 7763 64-core CPU is under full load during inference with either quant, so my guess something in the converted quants hits CPU bottleneck, which is not present when using Unsloth quant with -rtr option.
@@ -141,7 +144,7 @@ With my workflow that involves loading 72B vision model in VRAM, processing imag
---
-👤 **ikawrakow** replied the **2025-04-11** at **10:58:58**:
+👤 **ikawrakow** commented on **2025-04-11** at **10:58:58**
The offline repacking command should produce a result that is 100% equivalent to what happens with online repacking.
@@ -154,7 +157,7 @@ echo 3 | sudo tee /proc/sys/vm/drop_caches
---
-👤 **ikawrakow** replied the **2025-04-11** at **11:10:46**:
+👤 **ikawrakow** commented on **2025-04-11** at **11:10:46**
> Maybe I could put some additional ffn_down_exps, ffn_up_exps or ffn_gate_exps on each GPU, but not sure which of them is more beneficial to put in VRAM yet. I already experimented with blk.3.ffn_gate_exps=CUDA0, ... and so on, but since I cannot put too many of them due to having not that much VRAM free, I did not notice difference in performance. I did not try with non-gate ones yet.
@@ -166,7 +169,7 @@ to have all attention and shared experts tensors plus the first 20 layers of `ff
---
-👤 **Lissanro** replied the **2025-04-11** at **11:35:48**:
+👤 **Lissanro** commented on **2025-04-11** at **11:35:48**
First, I load the repacked model with -rtr option - obviously should be unnecessary, but I was curious if it makes a difference, and to my surprise, it did, I got good performance again (full log: https://pastebin.com/5d6R2GDG):
@@ -196,7 +199,8 @@ Please let me know if there are some kind of performance profiling or additional
As of putting more ffn_up_exps and ffn_gate_exps on GPU, I will try that with as much layers as I can, thank you very much for the suggestion.
-> 👤 **ubergarm** replied the **2025-04-11** at **14:20:23**:
+> 👤 **ubergarm** replied on **2025-04-11** at **14:20:23**
+>
> @Lissanro
>
> > --no-mmap option, performance was back to normal. So, it seems something about mmap that drastically reduces performance. Nothing wrong with the quant file then.
@@ -208,15 +212,16 @@ As of putting more ffn_up_exps and ffn_gate_exps on GPU, I will try that with as
> Keep us posted once you come up with a multi-gpu command line to override `ffn_up_exps` and `ffn_gate_exps` tensors onto each GPU as ik mentions above. I wanted to document that somewhere to help others as many of the questions I see are how to use more VRAM correctly when using `-ot`.
>
> Thanks!
+
+> 👤 **ubergarm** replied on **2025-04-11** at **19:08:55**
>
-> 👤 **ubergarm** replied the **2025-04-11** at **19:08:55**:
> @Lissanro
>
-> Also, using the above examples I'm slowly learning how to better use `-ot` myself. I have a few examples now on [discussion #258](https://github.com/ikawrakow/ik_llama.cpp/discussions/258#discussioncomment-12807746) which you could use to target `CUDA0` `CUDA1` etc to craft the best command for your rig.
+> Also, using the above examples I'm slowly learning how to better use `-ot` myself. I have a few examples now on [discussion [#258](https://github.com/ikawrakow/ik_llama.cpp/issues/258)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258#discussioncomment-12807746) which you could use to target `CUDA0` `CUDA1` etc to craft the best command for your rig.
---
-👤 **Lissanro** replied the **2025-04-13** at **03:57:01**:
+👤 **Lissanro** commented on **2025-04-13** at **03:57:01**
I was able to achieve similar speed with mmap after resetting my BIOS, and changing only absolutely necessary settings. Before that, no matter what I did, it ran at 30%-50% reduced speed. Not sure exactly what setting was messing up results, maybe performance tuning settings for memory throughput.
@@ -264,21 +269,23 @@ Thank you so very much, @ikawrakow and @ubergarm , for helping me to figure this
---
-👤 **Ph0rk0z** replied the **2025-05-17** at **18:57:32**:
+👤 **Ph0rk0z** commented on **2025-05-17** at **18:57:32**
So to repack I do inverse of my cuda regex? Can quant type also be converted? Or does it just become same_R4? MMAP or not, the entire model gets cached on my system, at least for qwen 235b sizes.
---
-👤 **Lissanro** replied the **2025-05-21** at **05:27:22**:
+👤 **Lissanro** commented on **2025-05-21** at **05:27:22**
@Ph0rk0z
You need to craft a regex for R4 repacking happen in way that covers all tensors you plan to keep on CPU, but does not affect tensors that you plan running on GPU (GPU tensors need to be kept non-R4). You can refer to regexes in my previous message to see how repack regex differs.
-> 👤 **Ph0rk0z** replied the **2025-05-21** at **11:25:07**:
+> 👤 **Ph0rk0z** replied on **2025-05-21** at **11:25:07**
+>
> Yea I assume it's just see which layers are on GPU and then exclude them. So if you pick 1,2,3,4 make a not 1,2,3,4 regex. Funny enough we have AI for this. But I have IQ4_XS, so what does that become? IQ4_XS_R4? Or can it repack to something else?
+
+> 👤 **ikawrakow** replied on **2025-05-21** at **11:29:29**
>
-> 👤 **ikawrakow** replied the **2025-05-21** at **11:29:29**:
> > Or can it repack to something else?
>
> No. The repacking is only to the corresponding row-interleaved type. Repacking to something else would result in quality loss.
\ No newline at end of file
diff --git a/github-data/discussions/334 - _iq4_ks_ performs great on gemma-3-27b-it-qat-q4_0-unquantized.md b/github-data/discussions/334 - iq4_ks performs great on gemma-3-27b-it-qat-q4_0-unquantized.md
similarity index 95%
rename from github-data/discussions/334 - _iq4_ks_ performs great on gemma-3-27b-it-qat-q4_0-unquantized.md
rename to github-data/discussions/334 - iq4_ks performs great on gemma-3-27b-it-qat-q4_0-unquantized.md
index 7e83410ff..978eb1593 100644
--- a/github-data/discussions/334 - _iq4_ks_ performs great on gemma-3-27b-it-qat-q4_0-unquantized.md
+++ b/github-data/discussions/334 - iq4_ks performs great on gemma-3-27b-it-qat-q4_0-unquantized.md
@@ -1,13 +1,14 @@
-### 🗣️ [#334](https://github.com/ikawrakow/ik_llama.cpp/discussions/334) - `iq4_ks` performs great on gemma-3-27b-it-qat-q4_0-unquantized
+## 🗣️ [Discussion #334](https://github.com/ikawrakow/ik_llama.cpp/discussions/334) - `iq4_ks` performs great on gemma-3-27b-it-qat-q4_0-unquantized
| **Author** | `ubergarm` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-04-18 |
-| **Updated** | 2025-07-07 |
+| **Updated** | 2025-07-22 |
---
-#### Description
+## 📄 Description
*EDIT*: Just uploaded the `ik_llama.cpp` exclusive quants for best quality in minimum VRAM to huggingface [ubergarm/gemma-3-27b-it-qat-GGUF](https://huggingface.co/ubergarm/gemma-3-27b-it-qat-GGUF).
@@ -655,9 +656,9 @@ Cheers!
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **saood06** replied the **2025-04-18** at **22:57:25**:
+👤 **saood06** commented on **2025-04-18** at **22:57:25**
> I saw google released their [google/gemma-3-27b-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-unquantized) original `.safetensors` unquantized model. It is supposedly designed for `q4_0` quantization which was released earlier in gguf format.
>
@@ -668,7 +669,8 @@ This is QAT but unlike previous QAT models I have seen this was done with an add

-> 👤 **bartowski1182** replied the **2025-04-19** at **00:55:33**:
+> 👤 **bartowski1182** replied on **2025-04-19** at **00:55:33**
+>
> > unlike previous QAT
>
> Which ones have you seen, and what did they do if not additional fine tuning? 🤔
@@ -676,8 +678,9 @@ This is QAT but unlike previous QAT models I have seen this was done with an add
> But yes I theorized it was possible that they did some fine tuning for the quant awareness with wiki text itself, maybe unlikely but certainly not impossible
>
> I think it could be valuable to use a random additional well formatted English corpus for more PPL numbers, that might start giving a more full image
+
+> 👤 **saood06** replied on **2025-04-19** at **01:32:31**
>
-> 👤 **saood06** replied the **2025-04-19** at **01:32:31**:
> > Which ones have you seen, and what did they do if not additional fine tuning? 🤔
>
> Not any that I remember being released, but just in papers/blogs/demos one example being [this](https://pytorch.org/blog/quantization-aware-training/): where for example they do "Llama3-8B fine-tuned on the C4 dataset (en subset) with and without QAT" which allows you to see the difference between QAT and just finetuning.
@@ -687,13 +690,14 @@ This is QAT but unlike previous QAT models I have seen this was done with an add
> > But yes I theorized it was possible that they did some fine tuning for the quant awareness with wiki text itself, maybe unlikely but certainly not impossible
>
> My point isn't really specific to any data they did finetune with (my guess is they just did one or a partial epoch of the last dataset used for the Instruction tuned model, as people have reported modern LLM's can get very sensitive to the diversity of their training data [reported since Llama 3 and why people may have struggled fine tuning that for a while] ), just that the QAT model was trained more.
+
+> 👤 **bartowski1182** replied on **2025-04-19** at **02:50:22**
>
-> 👤 **bartowski1182** replied the **2025-04-19** at **02:50:22**:
> Oh hmm I suppose that's possible as well, would definitely be very interesting to see the full details
---
-👤 **saood06** replied the **2025-04-19** at **05:00:05**:
+👤 **saood06** commented on **2025-04-19** at **05:00:05**
@ubergarm
@@ -701,11 +705,12 @@ Have you seen these versions, [27B](https://huggingface.co/stduhpf/google-gemma-
---
-👤 **ikawrakow** replied the **2025-04-19** at **08:38:23**:
+👤 **ikawrakow** commented on **2025-04-19** at **08:38:23**
In my quick experiments with Gemma3-12B, the `Q4_0` quantized version has a significantly lower Wiki2 perplexity than the `bf16` model, or any other quantization. Which means that whatever they have done, they have massively overfit that specific dataset, specifically with `Q4_0` quantization. Which means that one cannot use Wiki2 for evaluation (PPL, but also KLD or any other quantization quality measure). On my book (but you may differ from me in that regard), it also means that one cannot take this model seriously.
-> 👤 **saood06** replied the **2025-04-19** at **08:56:08**:
+> 👤 **saood06** replied on **2025-04-19** at **08:56:08**
+>
> > On my book (but you may differ from me in that regard), it also means that one cannot take this model seriously.
>
> I haven't touched Gemma 3 myself yet (I want to see if it beats QwQ for my GPU only use cases), but I've heard a lot of positive feedback on the QAT version of Gemma 3. I agree that it does make them hard to directly compare since they differ so much, but whatever they did people seem generally happy with it.
@@ -713,22 +718,25 @@ In my quick experiments with Gemma3-12B, the `Q4_0` quantized version has a sign
> > but also KLD or any other quantization quality measure
>
> Do you think knowing the KLD between the two BF16 versions versions would be insightful (not that I could run it in a reasonable amount of time for the 27B, the 12B might be possible though)?
+
+> 👤 **bartowski1182** replied on **2025-04-20** at **18:19:36**
>
-> 👤 **bartowski1182** replied the **2025-04-20** at **18:19:36**:
> > they have massively overfit that specific dataset
>
> that was my theory as well, that they may have used wikitext as a calibration dataset for the QAT portion
>
> I don't know if it completely invalidates the model, but rather just makes wikitext useless/misleading, similar to using wikitext for imatrix and then checking PPL against wikitext, it's totally fine to use it, but need to use something different for PPL after
+
+> 👤 **saood06** replied on **2025-04-20** at **18:29:03**
>
-> 👤 **saood06** replied the **2025-04-20** at **18:29:03**:
> > that was my theory as well, that they may have used wikitext as a calibration dataset for the QAT portion
>
> You could test the QAT against the old one with different datasets if you want to test that hypothesis.
>
> I don't think they used a low diversity dataset to train on, my theory is they may have updated the model they distilled from and that might be the extra bump on top of the bump from more training on more tokens.
+
+> 👤 **ubergarm** replied on **2025-04-20** at **22:51:23**
>
-> 👤 **ubergarm** replied the **2025-04-20** at **22:51:23**:
> I threw together a quick-n-dirty perplexity test corpus mostly english language with a little chinese and XML. Very possible the model was trained on this stuff already, given it is available online. Might be able to generate some "novel" synthetic text using output from a few various LLMs to mix it up, but at least here are a few more data points with something other than `wiki.test.raw` shown below.
>
> ## Observations
@@ -778,15 +786,16 @@ In my quick experiments with Gemma3-12B, the `Q4_0` quantized version has a sign
> ```
>
>
+
+> 👤 **ubergarm** replied on **2025-04-21** at **14:52:18**
>
-> 👤 **ubergarm** replied the **2025-04-21** at **14:52:18**:
> A redditor [mentioned their post](https://www.reddit.com/r/LocalLLaMA/comments/1jqnnfp/comment/ml8nuof/) measuring PPL and KLD with `wiki.test.raw` and a private corpus for some of the gemma-3-27b QAT models with an interesting writeup.
>
> Also amusing that [a redditor quoted ik](https://www.reddit.com/r/LocalLLaMA/comments/1k3jal4/comment/mo707ni/) on this thread hah... My impression is folks with <= 16GB VRAM are interested in the gemma-3-27b-it-qat ~4 bits as while it isn't as good as R1/V3-0324/QwQ-32B imo, it is a newer model that just barely fits with enough context to play around with decent speed.
---
-👤 **ikawrakow** replied the **2025-04-19** at **09:08:49**:
+👤 **ikawrakow** commented on **2025-04-19** at **09:08:49**
> Do you think knowing the KLD between the two BF16 versions versions would be insightful
@@ -796,7 +805,8 @@ Good question. If the `Q4_0` model outperforms the `bf16` model (at least it doe
Is it so because it really is good, or is it more because the sentiment towards Google has shifted lately (at least when it comes to "AI"). My impression is that the Internet believes that the latest Gemini models are currently the best (and so, by extension, Gemma3 must be among the best open weight). But the few things that I asked Gemma3-12B where I have good knowledge of the subject matter, the answers were complete BS.
-> 👤 **saood06** replied the **2025-04-19** at **09:33:57**:
+> 👤 **saood06** replied on **2025-04-19** at **09:33:57**
+>
> > Good question. If the `Q4_0` model outperforms the `bf16` model (at least it does for a set of quality metrics), do we now compare the `Q4_0` model against `bf16`, or do we compare the other way around?
>
> There are two different BF16's the original post shows PPL of both (pasted below for convenience).
@@ -822,13 +832,15 @@ Is it so because it really is good, or is it more because the sentiment towards
> Did both QAT and non QAT do that? I'd assume they'd both fail your test.
>
> Gemma3 may not be a good fit for you, but I am curious what models you have used and liked.
+
+> 👤 **ikawrakow** replied on **2025-04-20** at **07:02:01**
>
-> 👤 **ikawrakow** replied the **2025-04-20** at **07:02:01**:
> > Gemma3 may not be a good fit for you, but I am curious what models you have used and liked.
>
> From the models I can run locally, none really passes the smell test. I would be hard pressed to say which one I like the best.
+
+> 👤 **saood06** replied on **2025-04-20** at **09:51:32**
>
-> 👤 **saood06** replied the **2025-04-20** at **09:51:32**:
> >From the models I can run locally
>
> Have you used any non locally? You brought up gemini, those models have had a huge advantage on long context benchmarks for a long time. The gemma models are nothing similar. Deepseek-r1 is the best local model but even that pales in comparison to gemini (from benchmarks and user testimonials, I've used it a bit over lmarena but not enough to remark on it).
@@ -839,7 +851,7 @@ Is it so because it really is good, or is it more because the sentiment towards
---
-👤 **ubergarm** replied the **2025-04-27** at **03:08:43**:
+👤 **ubergarm** commented on **2025-04-27** at **03:08:43**
*EDIT* My compile script was messed up and putting me into DEBUG mode...
@@ -1026,7 +1038,7 @@ ggml_backend_cuda_graph_compute: disabling CUDA graphs due to batch size > 1 [sa
---
-👤 **ikawrakow** replied the **2025-04-27** at **07:06:13**:
+👤 **ikawrakow** commented on **2025-04-27** at **07:06:13**
~CUDA graphs get disabled for MoE models in `ik_llama.cpp`, this is why you see the warning. It was the same in mainline until very recently, their PR 12970 enables CUDA graphs for TG (and apparently hides the warning when disabling graphs for PP). Also very recently, Johannes Gaessler completely independently discovered batched processing for TG with MoE models in PR 13014. He really discovered it by himself, [without ever looking at ik_llama.cpp, not even once](https://github.com/ikawrakow/ik_llama.cpp/pull/283/files/7f6980fa5166d029ad04cef395d2993ddc8da307#r2029830357) /s~
@@ -1036,7 +1048,7 @@ I was clearly confused. This is Gemma3.
---
-👤 **ikawrakow** replied the **2025-04-27** at **07:49:45**:
+👤 **ikawrakow** commented on **2025-04-27** at **07:49:45**
@ubergarm
@@ -1047,7 +1059,8 @@ The PP performance difference between mainline and `ik_llama.cpp` did not look p
Are you sure your `sweep-bench` adaptation for mainline is working correctly? Gemma3 KV cache size relative to model size is quite high, so just a ~40% drop in PP performance at 32k tokens seen for mainline seems relatively unlikely.
-> 👤 **saood06** replied the **2025-04-27** at **08:10:16**:
+> 👤 **saood06** replied on **2025-04-27** at **08:10:16**
+>
> > @ubergarm
> >
> > Are you sure your `sweep-bench` adaptation for mainline is working correctly? Gemma3 KV cache size relative to model size is quite high, so just a ~40% drop in PP performance at 32k tokens seen for mainline seems relatively unlikely.
@@ -1055,8 +1068,9 @@ Are you sure your `sweep-bench` adaptation for mainline is working correctly? Ge
> ~~He is missing a llama_synchronize call, could that account for it?~~
>
> Edit: Nevermind
+
+> 👤 **ubergarm** replied on **2025-04-27** at **17:14:57**
>
-> 👤 **ubergarm** replied the **2025-04-27** at **17:14:57**:
> *EDIT* My compile script was messed up and putting me into DEBUG mode...
>
> Thanks for taking a look, I too am doubting my `sweep-bench` adaptation for mainline as I just quickly got it compiling without looking too closely.
@@ -1066,8 +1080,9 @@ Are you sure your `sweep-bench` adaptation for mainline is working correctly? Ge
> - [x] Look more closely at the `sweep-bench` adaptation as it could be inflating numbers (though for non FA and CPU cases with GLM-4 it looked more like I expected). Thanks @saood06 for the `llama_synchronize` call, I'll try to figure out if there is something I'm missing.
> 3. Possibly repeat with `Gemma3-12B` `Q4_0` to reproduce graphs like ik just gave above.
> - [x] Try good old `llama-bench` for a sanity test across a smaller range of values.
+
+> 👤 **ubergarm** replied on **2025-04-27** at **17:39:39**
>
-> 👤 **ubergarm** replied the **2025-04-27** at **17:39:39**:
> > He is missing a llama_synchronize call, could that account for it?
>
> Hrmm, I only see one `llama_synchronize(ctx);` call in the [ik_llama.cpp/examples/sweep-bench/sweep-bench.cpp](https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples/sweep-bench/sweep-bench.cpp#L90) code which also appears in [my adaptation](https://github.com/ubergarm/llama.cpp/blob/ug/port-sweep-bench/examples/sweep-bench/sweep-bench.cpp#L86)?
@@ -1075,15 +1090,17 @@ Are you sure your `sweep-bench` adaptation for mainline is working correctly? Ge
> Its possible somehow I'm using the wrong number for `n_batch` etc as I don't really understand what batches and n_batches are. Also maybe some function arguments changed beyond just the names for stuff like `llama_model_params_from_gpt_params(params);` to `common_init_from_params(params);` etc...
>
> I'll dig around some more as I'd be quite surprised if mainline FA CUDA implementation for dense models like gemma-3 and glm-4 was suddenly this good.
+
+> 👤 **ubergarm** replied on **2025-04-27** at **18:02:40**
>
-> 👤 **ubergarm** replied the **2025-04-27** at **18:02:40**:
> *EDIT* My compile script was messed up and putting me into DEBUG mode...
>
> Using `-t 1` does seem slightly but consistently faster than `-t 16` in this one comparison. `ik_llama.cpp` for both runs using same bartowski quant:
>
> 
+
+> 👤 **saood06** replied on **2025-04-27** at **19:28:28**
>
-> 👤 **saood06** replied the **2025-04-27** at **19:28:28**:
> >
> > * [x] Look more closely at the `sweep-bench` adaptation as it could be inflating numbers (though for non FA and CPU cases with GLM-4 it looked more like I expected). Thanks @saood06 for the `llama_synchronize` call, I'll try to figure out if there is something I'm missing.
> >
@@ -1091,8 +1108,9 @@ Are you sure your `sweep-bench` adaptation for mainline is working correctly? Ge
> >Hrmm, I only see one llama_synchronize(ctx); call in the [ik_llama.cpp/examples/sweep-bench/sweep-bench.cpp](https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples/sweep-bench/sweep-bench.cpp#L90) code which also appears in [my adaptation](https://github.com/ubergarm/llama.cpp/blob/ug/port-sweep-bench/examples/sweep-bench/sweep-bench.cpp#L86)?
>
> Sorry, I was looking at [this](https://github.com/ubergarm/llama.cpp/commit/e59a5f1eb92b5b99d6a6d386b4620f89f9dad5ec) and I didn't fully expand the file. Ignore what I said.
+
+> 👤 **ubergarm** replied on **2025-04-27** at **19:48:37**
>
-> 👤 **ubergarm** replied the **2025-04-27** at **19:48:37**:
> *EDIT* My compile script was messed up and putting me into DEBUG mode...
>
> I ran a plain `llama-bench` for PP only to compare and sanity check if my adaptation of `llama-sweep-bench` is accurate at least for PP. It looks like in general `llama-sweep-bench` shows lower scores than `llama-bench` assuming the x-axis is the describing the "same thing" for both e.g. PP context length is similar enough to `N_KV`?
@@ -1169,13 +1187,14 @@ Are you sure your `sweep-bench` adaptation for mainline is working correctly? Ge
> ```
>
> I'll poke at the server a bit to see if anything changed and also maybe roll back my `ik_llama.cpp` git repo a couple weeks in case something odd changed.
+
+> 👤 **ubergarm** replied on **2025-04-27** at **21:25:01**
>
-> 👤 **ubergarm** replied the **2025-04-27** at **21:25:01**:
> Okay, yeah, the remote server compile script was in `Debug` mode... I recompiled for `Release` and performance improved for both TG and PP and is more in line with what I would expect. Sorry for the fire drill...
---
-👤 **ikawrakow** replied the **2025-04-28** at **06:08:20**:
+👤 **ikawrakow** commented on **2025-04-28** at **06:08:20**
> I ran a plain llama-bench for PP only to compare and sanity check if my adaptation of llama-sweep-bench is accurate at least for PP. It looks like in general llama-sweep-bench shows lower scores than llama-bench assuming the x-axis is the describing the "same thing" for both e.g. PP context length is similar enough to N_KV?
@@ -1185,7 +1204,8 @@ It is related, but not really the same. With `llama-sweep-bench` you have `N_KV`
If you see such messages, you are running in debug mode.
-> 👤 **saood06** replied the **2025-04-28** at **07:34:44**:
+> 👤 **saood06** replied on **2025-04-28** at **07:34:44**
+>
> > > I ran a plain llama-bench for PP only to compare and sanity check if my adaptation of llama-sweep-bench is accurate at least for PP. It looks like in general llama-sweep-bench shows lower scores than llama-bench assuming the x-axis is the describing the "same thing" for both e.g. PP context length is similar enough to N_KV?
> >
> > It is related, but not really the same.
@@ -1196,7 +1216,7 @@ If you see such messages, you are running in debug mode.
---
-👤 **Nexesenex** replied the **2025-05-31** at **10:58:14**:
+👤 **Nexesenex** commented on **2025-05-31** at **10:58:14**
@ubergarm Thanks for this iq4_ks quant, it works super.
By the way, I tested the perplexity of a q8_0 and your qat iq4_ks in Serbian on an extract of a dataset named Sveznanje.
@@ -1206,7 +1226,7 @@ PPL Gemma 27b qat iq4_ks : Final estimate: PPL = 12.7006 +/- 0.12469
---
-👤 **Nexesenex** replied the **2025-06-04** at **15:56:03**:
+👤 **Nexesenex** commented on **2025-06-04** at **15:56:03**
More.
@@ -1252,17 +1272,20 @@ For IQ4_KS (imat) with embed/output q6_0 and attn_q / attn_k / attn_o / ffn_gate
For IQ4_KS (imat) with embed/output q6_0 and attn_q / attn_o / ffn_gate / ffn_up in new_iq2_kt (9.529 GiB (3.031 BPW)): : PPL = 9.0063 +/- 0.06917
For IQ4_KS (imat) with embed/output q6_0 and attn_q / ffn_gate / ffn_up in new_iq2_kt (9.867 GiB (3.138 BPW)): PPL = 9.0923 +/- 0.07124
-> 👤 **ubergarm** replied the **2025-06-04** at **16:54:07**:
+> 👤 **ubergarm** replied on **2025-06-04** at **16:54:07**
+>
> Yeah these QATs are wild where the 4bpw "beats" the bf16!?! And for some reason the `iq4_ks` 32 block quants seem to do very well. psure the `iq4_ks` is a strict upgrade of the `iq4_xs` as I understand it is the same bpw with better PPL.
>
> Thanks again for sharing your results! Definitely check out the new hottness `iqN_kt` which are basically [QTIP](https://arxiv.org/html/2406.11235v3) / exl3 style trellis quants. So far I'd say they are like a smaller version of `iqN_k` with similar perplexity, but I need to do more testing as the implementation isn't fully baked yet.
+
+> 👤 **Nexesenex** replied on **2025-06-05** at **02:42:48**
>
-> 👤 **Nexesenex** replied the **2025-06-05** at **02:42:48**:
> Well, it seems your iq4_ks was optimal. I'm satisfied with a q6_0 embed/ouput instead of Q8_0, but that's it.
> Generally, ofc iq4_ks is better, but on small models, I guess some tensors are so small that rules can be a bit different, as seen on the 4b.
> On my side, I had the first Trellis Cuda implementation made by IK working on Croco (6 months ago, maybe?) but I have yet to make work the second, it gives me gibberish for now. Probably missed a part of code somewhere.
+
+> 👤 **ubergarm** replied on **2025-06-08** at **22:51:42**
>
-> 👤 **ubergarm** replied the **2025-06-08** at **22:51:42**:
> > Well, it seems your iq4_ks was optimal.
>
> I went back and looked at my recipe and oddly enough I think the `attn_k_b` are at `q4_0` and not actually `iq4_ks` for some reason. Maybe a mistake on my part or confusion about tensor dimensions vs quant limitations.
@@ -1272,36 +1295,43 @@ For IQ4_KS (imat) with embed/output q6_0 and attn_q / ffn_gate / ffn_up in new_i
> > On my side, I had the first Trellis Cuda implementation made by IK working on Croco (6 months ago, maybe?) but I have yet to make work the second, it gives me gibberish for now. Probably missed a part of code somewhere.
>
> Oh wow, I see you've had them a while! I'm just catching up lol. Yeah I believe the exact implementation is still in flux, so I haven't released any quants with it just yet.
+
+> 👤 **Thireus** replied on **2025-07-04** at **22:15:18**
>
-> 👤 **Thireus** replied the **2025-07-04** at **22:15:18**:
> I'm trying to spot the difference between iq4_ks and iq4_xs. They seem to have the same bpw, the same perfs and the same PPL. Am I mistaken?
+
+> 👤 **ikawrakow** replied on **2025-07-05** at **09:25:02**
>
-> 👤 **ikawrakow** replied the **2025-07-05** at **09:25:02**:
> Maybe this is not the case for the model you are looking at, but typically `IQ4_KS` will have a slightly better PPL than `IQ4_XS`. Otherwise, yes, they are very similar. The differences are
> * `IQ4_XS` uses an `fp16` scale per super-block of 256. `IQ4_KS` has a single scale per tensor row.
> * The 16 bits per 256 weights saved that way give 2 extra bits per block of 32. One is spent on extra precision for the block scale, one is spent to select between 2 non-linear lookup tables. Being able to choose between two lookup tables results in a slightly lower difference to the model weights being quantized.
>
> As `IQ4_KS` does not need super-blocks of 256, theoretically one could remove the requirement for tensor row size being a multiple of 256. I haven't done that yet, but will if models with row sizes that are not a multiple of 256 become more common.
+
+> 👤 **Thireus** replied on **2025-07-05** at **13:48:21**
>
-> 👤 **Thireus** replied the **2025-07-05** at **13:48:21**:
> Thank you for the clarification! Indeed on DeepSeek-R1-0528 I haven't noticed a difference.
+
+> 👤 **saood06** replied on **2025-07-07** at **04:47:41**
>
-> 👤 **saood06** replied the **2025-07-07** at **04:47:41**:
> > As `IQ4_KS` does not need super-blocks of 256, theoretically one could remove the requirement for tensor row size being a multiple of 256. I haven't done that yet, but will if models with row sizes that are not a multiple of 256 become more common.
>
> Does that mean `IQ4_KS` and `IQ5_KS` could be used for the KV cache, or is there some other limitation?
+
+> 👤 **ikawrakow** replied on **2025-07-07** at **05:07:40**
>
-> 👤 **ikawrakow** replied the **2025-07-07** at **05:07:40**:
> > Does that mean IQ4_KS and IQ5_KS could be used for the KV cache, or is there some other limitation?
>
> Theoretically yes, but with caveats:
> * The quantization needs to use a simpler and less accurate algorithm, else storing data in the cache will be too slow. This is already done for `IQ4_NL`, and `IQ4_KS/IQ5_KS` will be similar
> * One runs into the same issue as with `Q8_KV` for DeepSeek, where the V cache is just a view into the K cache, so misses the row scale
+
+> 👤 **saood06** replied on **2025-07-07** at **05:40:24**
>
-> 👤 **saood06** replied the **2025-07-07** at **05:40:24**:
> >The quantization needs to use a simpler and less accurate algorithm, else storing data in the cache will be too slow. This is already done for IQ4_NL, and IQ4_KS/IQ5_KS will be similar
>
> I didn't know that, so based on that do you think they would offer quality/size benefits over the existing types?
+
+> 👤 **ikawrakow** replied on **2025-07-07** at **07:59:21**
>
-> 👤 **ikawrakow** replied the **2025-07-07** at **07:59:21**:
> `IQ4_NL` halves the PPL difference between `Q4_0` KV-cache and `fp16` KV cache, but is somewhat higher than `Q5_0` KV cache. My guess is that `IQ4_KS` will perform similar to `IQ4_NL`. Not sure if/how much better `IQ5_KS` will be compared to `Q5_0` for KV cache quantization.
\ No newline at end of file
diff --git a/github-data/discussions/350 - Maverick slow prompt with gpu.md b/github-data/discussions/350 - Maverick slow prompt with gpu.md
index 827602938..429004aad 100644
--- a/github-data/discussions/350 - Maverick slow prompt with gpu.md
+++ b/github-data/discussions/350 - Maverick slow prompt with gpu.md
@@ -1,13 +1,14 @@
-### 🗣️ [#350](https://github.com/ikawrakow/ik_llama.cpp/discussions/350) - Maverick slow prompt with gpu
+## 🗣️ [Discussion #350](https://github.com/ikawrakow/ik_llama.cpp/discussions/350) - Maverick slow prompt with gpu
| **Author** | `justinjja` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-04-27 |
| **Updated** | 2025-04-27 |
---
-#### Description
+## 📄 Description
Any idea what the deal is with prompt speeds on Maverick?
@@ -23,15 +24,15 @@ Is it possible to leave prompt processing on the CPU and still use the GPU for g
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **saood06** replied the **2025-04-27** at **04:22:52**:
+👤 **saood06** commented on **2025-04-27** at **04:22:52**
Do you mind providing the exact commands used to get those numbers (and any details about the quant used)?
---
-👤 **ikawrakow** replied the **2025-04-27** at **06:45:38**:
+👤 **ikawrakow** commented on **2025-04-27** at **06:45:38**
Please tell us your command line parameters.
@@ -57,7 +58,7 @@ The above command puts all attention tensors, shared experts, and the first 10 l
---
-👤 **justinjja** replied the **2025-04-27** at **16:51:53**:
+👤 **justinjja** commented on **2025-04-27** at **16:51:53**
Nice, thank you!
My command must have been bad.
diff --git a/github-data/discussions/354 - Not all MLAs are born equal.md b/github-data/discussions/354 - Not all MLAs are born equal.md
index 08e3200cf..da662a721 100644
--- a/github-data/discussions/354 - Not all MLAs are born equal.md
+++ b/github-data/discussions/354 - Not all MLAs are born equal.md
@@ -1,13 +1,14 @@
-### 🗣️ [#354](https://github.com/ikawrakow/ik_llama.cpp/discussions/354) - Not all MLAs are born equal
+## 🗣️ [Discussion #354](https://github.com/ikawrakow/ik_llama.cpp/discussions/354) - Not all MLAs are born equal
| **Author** | `ikawrakow` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-04-29 |
-| **Updated** | 2025-05-13 |
+| **Updated** | 2025-07-22 |
---
-#### Description
+## 📄 Description
## Intro
@@ -246,25 +247,27 @@ The next graph shows PP performance as a function of `N_KV`. Here the performanc
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **JohannesGaessler** replied the **2025-04-29** at **07:29:26**:
+👤 **JohannesGaessler** commented on **2025-04-29** at **07:29:26**
Since you are tagging me: I did look at the more general implementation for mapping MoE to regular matrix multiplications in the PR where I commented but I did not look at any MoE-specific CUDA code for matrix vector multiplication, nor was I aware that this repository had such an optimization. It's just the natural way of writing a fused kernel.
-> 👤 **ikawrakow** replied the **2025-04-29** at **14:39:31**:
+> 👤 **ikawrakow** replied on **2025-04-29** at **14:39:31**
+>
> > It's just the natural way of writing a fused kernel.
>
> Sure, a kernel that did not get written for a very long time, despite the well known fact that `llama.cpp` CUDA performance for MoE models is really bad. Which indicates that the understanding how badly the fused kernel was needed was missing. It is not very often that one has a PR that [improves performance up to 4X](https://github.com/ggml-org/llama.cpp/pull/13014#issuecomment-2816637977).
>
> But if it is so as you say, then sorry.
+
+> 👤 **JohannesGaessler** replied on **2025-04-29** at **15:33:40**
>
-> 👤 **JohannesGaessler** replied the **2025-04-29** at **15:33:40**:
> Apology accepted. My top priority was and still is good performance for dense GEMM/GEMV because that is the most fundamental operation. MoE optimizations have now simply reached the front of the priority queue.
---
-👤 **cmoncure** replied the **2025-05-06** at **15:50:00**:
+👤 **cmoncure** commented on **2025-05-06** at **15:50:00**
I read this and the warning on the README.md about incompatible GGUFs is quite unfortunate. I don't mind spending the time to create my own quants for this fork in the pursuit of maximum performance. I am a total noob to creating quants, however.
@@ -274,7 +277,8 @@ Do you plan to support the incompatible mainline GGUF files? Can I assume that G
Thank you for creating this work and making it available. You are a true wizard.
-> 👤 **ikawrakow** replied the **2025-05-06** at **16:16:34**:
+> 👤 **ikawrakow** replied on **2025-05-06** at **16:16:34**
+>
> > Can I assume that GGUFs created before mid-April or so will be compatible? (Downloading these larger models represents a considerable cost.)
>
> I think so. But to make sure, if you are downloading from HF, you can check the content of the GGUF. To be compatible, it needs to have tensors ` blk.X.attn_kv_b.weight` (where `X` is the layer index, so 0,1,...). If it does, it will work with this fork. If instead it has separate tensors `blk.X.attn_k_b.weight` and `blk.X.attn_v_b.weight`, it is most likely not compatible.
@@ -282,8 +286,9 @@ Thank you for creating this work and making it available. You are a true wizard.
> > Do you plan to support the incompatible mainline GGUF files?
>
> No, not really. There are implications beyond compatibility. The change impacts quantization of the attention tensors, and I think there are now some reports from users about reduced model quality after the change was made and the quantized models compatible with that change started coming out.
+
+> 👤 **saood06** replied on **2025-05-06** at **20:24:09**
>
-> 👤 **saood06** replied the **2025-05-06** at **20:24:09**:
> > I think so. But to make sure, if you are downloading from HF, you can check the content of the GGUF. To be compatible, it needs to have tensors ` blk.X.attn_kv_b.weight` (where `X` is the layer index, so 0,1,...). If it does, it will work with this fork. If instead it has separate tensors `blk.X.attn_k_b.weight` and `blk.X.attn_v_b.weight`, it is most likely not compatible.
>
> Just to be more clear after looking at one converted with the compatible version of MLA that works [here](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF/tree/main/DeepSeek-V3-0324-IQ2_K_R4?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) , it has `attn_k_b.weight`, `attn_v_b.weight` and `attn_kv_b.weight`.
@@ -294,33 +299,38 @@ Thank you for creating this work and making it available. You are a true wizard.
>
> So in conclusion if the model has all three `attn_k_b.weight`, `attn_v_b.weight` and `attn_kv_b.weight` or just `attn_kv_b.weight` it will work here, but if it has `attn_k_b.weight` and `attn_v_b.weight` but no `attn_kv_b.weight` it will not work here.
>
-> Edit: The above is outdated, see #394 and #409
+> Edit: The above is outdated, see [#394](https://github.com/ikawrakow/ik_llama.cpp/issues/394) and [#409](https://github.com/ikawrakow/ik_llama.cpp/issues/409)
+
+> 👤 **ubergarm** replied on **2025-05-12** at **15:39:39**
>
-> 👤 **ubergarm** replied the **2025-05-12** at **15:39:39**:
> Sorry for late reply @cmoncure , I have a rough outline of the process of going from fp8 to GGUF for ik's fork [buried in a fold in my quickstart guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) under the "Custom Quants" section.
>
> Its a bit dated already, but the basic procedures are described there. I'd suggest making your own imatrix and take [this new PR411 into consideration ](https://github.com/ikawrakow/ik_llama.cpp/pull/411) for that step as well.
+
+> 👤 **saood06** replied on **2025-05-13** at **00:23:49**
>
-> 👤 **saood06** replied the **2025-05-13** at **00:23:49**:
> > Sorry for late reply @cmoncure , I have a rough outline of the process of going from fp8 to GGUF for ik's fork [buried in a fold in my quickstart guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) under the "Custom Quants" section.
> >
> > Its a bit dated already, but the basic procedures are described there. I'd suggest making your own imatrix and take [this new PR411 into consideration ](https://github.com/ikawrakow/ik_llama.cpp/pull/411) for that step as well.
>
> The dequant method in your guide (that I had recommended) may need more precise instructions to work now. For more info see [this](https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2865306085) and the following comments.
+
+> 👤 **ubergarm** replied on **2025-05-13** at **20:13:04**
>
-> 👤 **ubergarm** replied the **2025-05-13** at **20:13:04**:
> Thanks @saood06 , I managed to `git apply saood06.patch` copy/pasting your comment and that fixes up building `triton-cpu`. I tested with `uv venv ./venv --python 3.12 --python-preference=only-managed` for my venv and updated a couple lines of the quick start guide.
>
> Hopefully enough bread crumbs our future selves can figure it out.
+
+> 👤 **saood06** replied on **2025-05-13** at **21:09:54**
>
-> 👤 **saood06** replied the **2025-05-13** at **21:09:54**:
> > Thanks @saood06 , I managed to `git apply saood06.patch` copy/pasting your comment and that fixes up building `triton-cpu`.
>
> Mind telling me the exact version/commit hash of `triton-cpu` you built?
>
> I noticed mine is 3.2.0 and they seem to be on 3.3.0 (and thus I hoped the bug would be fixed upstream)
+
+> 👤 **ubergarm** replied on **2025-05-13** at **21:21:58**
>
-> 👤 **ubergarm** replied the **2025-05-13** at **21:21:58**:
> > > Thanks @saood06 , I managed to `git apply saood06.patch` copy/pasting your comment and that fixes up building `triton-cpu`.
> >
> > Mind telling me the exact version/commit hash of `triton-cpu` you built?
@@ -330,18 +340,21 @@ Thank you for creating this work and making it available. You are a true wizard.
> I added your patch to `main@0625715c` `Artlesbol` `[MathToVecLib] Add support for setting bit-widths for AVX512...` `Apr 26 12:24:21 2025 +0800`
>
> I originally tried to use the same git sha I used the first time, but it doesn't exist anymore, so I guess they force pushed main or something somewhere along the way between now and March 13, 2025 maybe?
+
+> 👤 **saood06** replied on **2025-05-13** at **21:45:22**
>
-> 👤 **saood06** replied the **2025-05-13** at **21:45:22**:
> > I originally tried to use the same git sha I used the first time, but it doesn't exist anymore, so I guess they force pushed main or something somewhere along the way between now and March 13, 2025 maybe?
>
> I noticed similar things when trying to look into the history of the repo. Whatever they are doing it makes tracing down the source of changes in their repo very tedious and annoying.
>
> Thanks for confirming the issue still exists in their latest commit, I don't currently plan on creating a better fix for them so I made an issue https://github.com/triton-lang/triton-cpu/issues/237 and hopefully they fix it.
+
+> 👤 **saood06** replied on **2025-05-13** at **22:33:34**
>
-> 👤 **saood06** replied the **2025-05-13** at **22:33:34**:
> @ubergarm if you still have the build errors that my patch solves do you mind sharing them in the issue I made. I don't have them, and they are requesting them in the issue I opened.
+
+> 👤 **ubergarm** replied on **2025-05-13** at **23:10:18**
>
-> 👤 **ubergarm** replied the **2025-05-13** at **23:10:18**:
> > @ubergarm if you still have the build errors that my patch solves do you mind sharing them in the issue I made. I don't have them, and they are requesting them in the issue I opened.
>
> Its a goofy browser ssh client for this specific rig, i tried to scroll my tmux back but its gone...
diff --git a/github-data/discussions/357 - Qwen3 - early performance comparisons.md b/github-data/discussions/357 - Qwen3 - early performance comparisons.md
index 035187d48..595056f54 100644
--- a/github-data/discussions/357 - Qwen3 - early performance comparisons.md
+++ b/github-data/discussions/357 - Qwen3 - early performance comparisons.md
@@ -1,15 +1,16 @@
-### 🗣️ [#357](https://github.com/ikawrakow/ik_llama.cpp/discussions/357) - Qwen3 - early performance comparisons
+## 🗣️ [Discussion #357](https://github.com/ikawrakow/ik_llama.cpp/discussions/357) - Qwen3 - early performance comparisons
| **Author** | `ikawrakow` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-04-29 |
| **Updated** | 2025-05-19 |
---
-#### Description
+## 📄 Description
-The Qwen3 models were [officially released](https://qwenlm.github.io/blog/qwen3/), and support was added in `ik_llama.cpp` in PR #355, so I was curious to run some performance benchmarks. As much as I would like to try the flagship model, I don't have enough horse power for that, so I experimented with [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B), the 30B total, 3B active parameter MoE model.
+The Qwen3 models were [officially released](https://qwenlm.github.io/blog/qwen3/), and support was added in `ik_llama.cpp` in PR [#355](https://github.com/ikawrakow/ik_llama.cpp/issues/355), so I was curious to run some performance benchmarks. As much as I would like to try the flagship model, I don't have enough horse power for that, so I experimented with [Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B), the 30B total, 3B active parameter MoE model.
This time I'm using a custom quantization where all experts are quantized with `IQ4_XS`, all attention tensors with `Q5_K`, and the output tensor is `Q6_K`. PPL for this model is only 1.25% above the PPL of the `bf16` model, so it is a pretty decent quality quantization. Benchmarks are run on a Ryzen-7950X system with an RTX-4080 GPU. Compared are the latest `ik_kllama.cpp` and `llama.cpp` versions as of this morning (April 29 2025).
@@ -310,13 +311,14 @@ The next graph shows PP performance as a function of `N_KV`. Also here the perfo
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-04-29** at **13:57:33**:
+👤 **ikawrakow** commented on **2025-04-29** at **13:57:33**
Anyone who has the horse power to run Qwen3-235B-A22B, please feel free to add your results to this discussion.
-> 👤 **ubergarm** replied the **2025-04-29** at **16:30:10**:
+> 👤 **ubergarm** replied on **2025-04-29** at **16:30:10**
+>
> I'm away from home but frantically trying to remote into a server I just got access too again and cook up a good Qwen3-235B-A22B mix for my home 3090TI 24GB VRAM + 96GB RAM system which is about the limit of common AM5 gaming rigs (with the faster and more supported 2x DIMM configuration).
>
> Any particular reason you chose `IQ4_XS` for the experts over `IQ4_K` (possibly GPU inference speed?).
@@ -381,20 +383,22 @@ Anyone who has the horse power to run Qwen3-235B-A22B, please feel free to add y
>
>
> Did you bother to make an imatrix for your quant, and if so, were you able to activate enough experts with your imatrix corpus text? Thanks again, exciting times with Qwen3 MoE out and wondering if R2 is around the corner haha...
+
+> 👤 **ikawrakow** replied on **2025-04-29** at **16:34:39**
>
-> 👤 **ikawrakow** replied the **2025-04-29** at **16:34:39**:
> > Any particular reason you chose IQ4_XS for the experts over IQ4_K (possibly GPU inference speed?).
>
> I wanted to have a quantized model that I can run with `ik_llama.cpp` and with `llama.cpp` so we have a fair performance comparison.
>
> I'm playing with some quantization recipes for Qwen3-30B-A3B. I'll post the results tomorrow, maybe that can be useful to you for "cooking" the Qwen3-235B-A22B quants.
+
+> 👤 **Gaolingx** replied on **2025-05-06** at **13:15:17**
>
-> 👤 **Gaolingx** replied the **2025-05-06** at **13:15:17**:
-> I run Qwen3-235B-A22B on my pc(#385 ), but the performance not better, might the memory performance of RAM is too slow...
+> I run Qwen3-235B-A22B on my pc([#385](https://github.com/ikawrakow/ik_llama.cpp/issues/385) ), but the performance not better, might the memory performance of RAM is too slow...
---
-👤 **ubergarm** replied the **2025-04-30** at **04:45:24**:
+👤 **ubergarm** commented on **2025-04-30** at **04:45:24**
Just "cooked" my first `ik_llama.cpp` exclusive experimental quant and uploaded to [huggingface ubergarm/Qwen3-235B-A22B-GGUF](https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF). Just tried a benchmark on my local gaming rig as it just finished downloading. Hybrid GPU+CPU inferencing with about 12 ffn layers on GPU and the rest repacked on CPU. *Barely* fits in VRAM+RAM (had to close my browser haha).
@@ -1210,12 +1214,14 @@ KReclaimable 633.56 633.56
Interestingly I could hear my fans spin up and down periodically every 15 seconds or so as the CPU ramped up and the GPU dropped down a bit. I noticed this more on the Q8_0 test visually with `btop` as the CPU would drop to almost 0 and the GPU would ramp up and oscillate slowly back and forth.
-> 👤 **ikawrakow** replied the **2025-04-30** at **06:07:34**:
+> 👤 **ikawrakow** replied on **2025-04-30** at **06:07:34**
+>
> > Note that for some reason ik_llama.cpp could offload one additional ffn layer than mainline llama.cpp in this test
>
> This is because the `ik_llama.cpp` CUDA compute buffer is smaller. This is most likely due to the fused `ffn_up+ffn_gate` op that you get with `-fmoe`. In any case, having 80 instead of 81 MoE experts competed on the CPU will not make a significant difference in performance.
+
+> 👤 **ubergarm** replied on **2025-04-30** at **17:46:53**
>
-> 👤 **ubergarm** replied the **2025-04-30** at **17:46:53**:
> I don't have access to enough RAM+VRAM currently to run the full `bf16`, so I'm using the `Q8_0` as the baseline for my imatrix data and PPL/KLD.
>
>
@@ -1311,13 +1317,14 @@ Interestingly I could hear my fans spin up and down periodically every 15 second
---
-👤 **ikawrakow** replied the **2025-04-30** at **05:57:12**:
+👤 **ikawrakow** commented on **2025-04-30** at **05:57:12**
@ubergarm Can you try the attached `sweep_bench.cpp` adaptation for `llama.cpp` instead of your adaptation? Thanks!
[sweep-bench.cpp.gz](https://github.com/user-attachments/files/19971777/sweep-bench.cpp.gz)
-> 👤 **ubergarm** replied the **2025-04-30** at **17:21:50**:
+> 👤 **ubergarm** replied on **2025-04-30** at **17:21:50**
+>
> I compared your `sweep-bench.cpp` adaptation to mainline llama.cpp with [my adaptation](https://github.com/ubergarm/llama.cpp/blob/ug/port-sweep-bench/examples/sweep-bench/sweep-bench.cpp) of @saood06 's code. A couple quick results suggest they are pretty similar for two benchmarks I had run:
>
> ## bartowski/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf GQA FA
@@ -1339,7 +1346,7 @@ Interestingly I could hear my fans spin up and down periodically every 15 second
---
-👤 **ikawrakow** replied the **2025-04-30** at **14:04:50**:
+👤 **ikawrakow** commented on **2025-04-30** at **14:04:50**
OK, after thinking more about this, I can see why mainline has a better large context TG performance on CUDA for Qwen3-235B-A22B (and previously noted for LLaMA-4): these models have a quite large GQA factor, and I'm still using the old CUDA FA implementation that did not take advantage of that. Improved GQA FA performance was added in [this mainline PR](https://github.com/ggml-org/llama.cpp/pull/12014).
@@ -1347,7 +1354,8 @@ OK, after thinking more about this, I can see why mainline has a better large co
* Pickup the mainline PR (but heavy adaptation will be required as things have diverged a lot, and mainline FA does not support different K and V head sizes as required for DeepSeek models)
* Finally sit down and write my own CUDA FA implementation
-> 👤 **ubergarm** replied the **2025-04-30** at **15:02:22**:
+> 👤 **ubergarm** replied on **2025-04-30** at **15:02:22**
+>
> Interesting, yes, I first noticed this with GLM-4 (which uses GQA) in the [CUDA + Flash Attention case](https://github.com/ikawrakow/ik_llama.cpp/pull/344#issuecomment-2832581799) benchmark.
>
> I still have the dream of converting an existing GQA architecture model to MLA but the additional fine-tuning required even with a fraction of the original training data seems daunting:
@@ -1360,15 +1368,16 @@ OK, after thinking more about this, I can see why mainline has a better large co
> In the mean-time I'll re-run a couple `llama-sweep-bench` comparisons with your mainline `sweep-bench.cpp` adaptation to confirm or reject my prior benchmarks!
>
> Thanks!
+
+> 👤 **ikawrakow** replied on **2025-04-30** at **16:14:03**
>
-> 👤 **ikawrakow** replied the **2025-04-30** at **16:14:03**:
> > I still have the dream of converting an existing GQA architecture model to MLA but the additional fine-tuning required even with a fraction of the original training data seems daunting:
>
-> But MLA is not all roses either. It took quite a bit of experimentation to arrive at a meaningful compromise between TG and PP performance. Mainline has a long way to go there (see #354). And then we have this much smaller KV cache, but then we need giant compute buffers to get meaningful performance, so we need to compute self attention in chunks to keep compute memory usage at a reasonable level, so suddenly the compute graph building becomes this huge pile of complications instead of being just a few tens of lines of simple code as ii is for the other models. And then seeing the massive drop in performance with large contexts in your DeepSeek-V3/R1 benchmarks, my guess is that it is still far from optimum.
+> But MLA is not all roses either. It took quite a bit of experimentation to arrive at a meaningful compromise between TG and PP performance. Mainline has a long way to go there (see [#354](https://github.com/ikawrakow/ik_llama.cpp/issues/354)). And then we have this much smaller KV cache, but then we need giant compute buffers to get meaningful performance, so we need to compute self attention in chunks to keep compute memory usage at a reasonable level, so suddenly the compute graph building becomes this huge pile of complications instead of being just a few tens of lines of simple code as ii is for the other models. And then seeing the massive drop in performance with large contexts in your DeepSeek-V3/R1 benchmarks, my guess is that it is still far from optimum.
---
-👤 **AesSedai** replied the **2025-05-03** at **00:37:46**:
+👤 **AesSedai** commented on **2025-05-03** at **00:37:46**
Hello, @artus-dev and @ubergarm asked me to run some sweeps for Qwen3-235B-A22B. My homelab has a substantial server with a VM in it that has the following allocation:
```
@@ -2635,7 +2644,8 @@ CPU performance TG comparison:
GPU performance TG comparison:

-> 👤 **AesSedai** replied the **2025-05-03** at **05:29:15**:
+> 👤 **AesSedai** replied on **2025-05-03** at **05:29:15**
+>
> One more test, I disabled pipeline parallelism (setting it to 1) and re-built ik_llama.cpp:
> ```
> cmake -DBLAS_INCLUDE_DIRS=/usr/include/openblas -B build -DGGML_CUDA=ON -DGGML_RPC=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_SCHED_MAX_COPIES=1
@@ -3510,7 +3520,7 @@ GPU performance TG comparison:
---
-👤 **ikawrakow** replied the **2025-05-03** at **05:47:58**:
+👤 **ikawrakow** commented on **2025-05-03** at **05:47:58**
Thank you for these results!
@@ -3520,15 +3530,17 @@ Prompt processing speed on CUDA will also benefit from larger u-batches (e.g., `
The CUDA TG results are somewhat surprising (sharp performance drop with context length for `ik_llama.cpp`, performance basically the same as CPU-only for long context, performance decreasing with more layers offloaded to a second GPU).
-> 👤 **AesSedai** replied the **2025-05-03** at **06:07:58**:
+> 👤 **AesSedai** replied on **2025-05-03** at **06:07:58**
+>
> I just re-ran the above with 2x GPU for llama.cpp as well and edited the comment / graph. I was already re-running ik_llama w/o BLAS, I'll have the results of that shortly.
+
+> 👤 **AesSedai** replied on **2025-05-03** at **06:19:29**
>
-> 👤 **AesSedai** replied the **2025-05-03** at **06:19:29**:
> Posted!
---
-👤 **AesSedai** replied the **2025-05-03** at **06:18:41**:
+👤 **AesSedai** commented on **2025-05-03** at **06:18:41**
Some more data, this time compiled w/ no BLAS:
```
@@ -4372,15 +4384,17 @@ ik_llama.cpp BLAS vs NO BLAS PP comparison:
ik_llama.cpp BLAS vs NO BLAS TG comparison:

-> 👤 **ikawrakow** replied the **2025-05-03** at **06:24:36**:
+> 👤 **ikawrakow** replied on **2025-05-03** at **06:24:36**
+>
> Oh, for CPU-only inference you want to build **without CUDA**. The almighty `ggml` back-end scheduler that is very difficult to work around takes all sorts of funny decisions where to run stuff when one has more than one back-end enabled.
+
+> 👤 **AesSedai** replied on **2025-05-03** at **06:25:03**
>
-> 👤 **AesSedai** replied the **2025-05-03** at **06:25:03**:
> D'oh, okay. I can redo it :)
---
-👤 **AesSedai** replied the **2025-05-03** at **07:04:12**:
+👤 **AesSedai** commented on **2025-05-03** at **07:04:12**
ik_llama.cpp, no cuda, no blas:
```
@@ -4561,7 +4575,7 @@ main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
---
-👤 **ikawrakow** replied the **2025-05-03** at **07:21:24**:
+👤 **ikawrakow** commented on **2025-05-03** at **07:21:24**
Thanks!
@@ -4569,17 +4583,20 @@ So, CPU PP is much better now and more inline with what I would have expected. L
But I see that the Epyc 9355 has 32 cores, so we are using hyper-threading?
-> 👤 **AesSedai** replied the **2025-05-03** at **07:23:30**:
+> 👤 **AesSedai** replied on **2025-05-03** at **07:23:30**
+>
> That's good news!
>
> Yes, this is with hyperthreading. Out of the 64 threads on the system, 56 are passed through to the virtual machine and I have it configured to use 48 of those during the sweep.
>
> Is there a particular `-t` count (or thread passthrough count) you would like me to try?
+
+> 👤 **ikawrakow** replied on **2025-05-03** at **07:27:14**
>
-> 👤 **ikawrakow** replied the **2025-05-03** at **07:27:14**:
> On bare metal one achieves the best performance by setting the number of threads to the physical core count. But I have no idea how a VM will behave. You can try `-t 32`, but that would be only better if you get 32 cores involved, and not e.g. 16 cores with 2 threads per core.
+
+> 👤 **AesSedai** replied on **2025-05-03** at **07:58:15**
>
-> 👤 **AesSedai** replied the **2025-05-03** at **07:58:15**:
> Yes, I think it's about a ~10% performance loss because it's in a VM. The system is a hypervisor though and used for other homelab things, so I'm fine taking that loss. I was able to run `likwid-bench` inside the VM before and achieve ~500GB/s memory bandwidth for reference, theoretical maximum is ~576GB/s.
>
> For completeness sake, I've disabled SMT on the host:
@@ -4761,16 +4778,19 @@ But I see that the Epyc 9355 has 32 cores, so we are using hyper-threading?
> 
>
> 
+
+> 👤 **ikawrakow** replied on **2025-05-03** at **08:08:26**
>
-> 👤 **ikawrakow** replied the **2025-05-03** at **08:08:26**:
> So, ~30% better for PP, but not much difference for TG. I need to understand the cause of the sharp drop in TG performance for the first ~2k tokens. I'll investigate.
>
> Thanks a lot for these benchmarks!
+
+> 👤 **AesSedai** replied on **2025-05-03** at **08:10:21**
>
-> 👤 **AesSedai** replied the **2025-05-03** at **08:10:21**:
> You're welcome, let me know if you want me to re-run any of these benchmarks at some point in the future and I can pull / rebuild / re-test. Excited to see what shakes out!
+
+> 👤 **VinnyG9** replied on **2025-05-19** at **14:08:09**
>
-> 👤 **VinnyG9** replied the **2025-05-19** at **14:08:09**:
> > That's good news!
> >
> > Yes, this is with hyperthreading. Out of the 64 threads on the system, 56 are passed through to the virtual machine and I have it configured to use 48 of those during the sweep.
diff --git a/github-data/discussions/359 - Qwen3 quantization experiments.md b/github-data/discussions/359 - Qwen3 quantization experiments.md
index d7829a1b2..49f369fbd 100644
--- a/github-data/discussions/359 - Qwen3 quantization experiments.md
+++ b/github-data/discussions/359 - Qwen3 quantization experiments.md
@@ -1,13 +1,14 @@
-### 🗣️ [#359](https://github.com/ikawrakow/ik_llama.cpp/discussions/359) - Qwen3 quantization experiments
+## 🗣️ [Discussion #359](https://github.com/ikawrakow/ik_llama.cpp/discussions/359) - Qwen3 quantization experiments
| **Author** | `ikawrakow` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-04-30 |
-| **Updated** | 2025-06-11 |
+| **Updated** | 2025-07-22 |
---
-#### Description
+## 📄 Description
I did some experimentation with Qwen3 quantization. As I don't have the horsepower to run the flagship model, I experimented with the Qwen3-30B-A3B MoE model. I'm reporting the results here, hopefully this could be useful also for Qwen3-235B-A22B.
@@ -80,23 +81,25 @@ I.e., all tensors (except attention, output and embeddings) are `IQ4_KS`. The qu
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-04-30** at **09:11:48**:
+👤 **ikawrakow** commented on **2025-04-30** at **09:11:48**
Has there been any QAT going on with the Qwen3 models? I didn't see anything mentioned in the [linked blog post](https://qwenlm.github.io/blog/qwen3/), but there are indications that QAT may have been involved. Does somebody know?
-> 👤 **saood06** replied the **2025-04-30** at **09:32:22**:
+> 👤 **saood06** replied on **2025-04-30** at **09:32:22**
+>
> >but there are indications that QAT may have been involved.
>
> What indications are you referring to?
+
+> 👤 **ikawrakow** replied on **2025-04-30** at **09:34:07**
>
-> 👤 **ikawrakow** replied the **2025-04-30** at **09:34:07**:
> I'm putting together the results and will post in a bit. In the meantime I was curious if somebody knew if QAT was used for Qwen3.
---
-👤 **ikawrakow** replied the **2025-04-30** at **12:25:59**:
+👤 **ikawrakow** commented on **2025-04-30** at **12:25:59**
# QAT used in Qwen3 training?
@@ -148,7 +151,8 @@ Looking at this graph, it seems plausible that if `fp4` QAT was used with blocks
Just in case, I also checked PPL for `Q4_0`. Without imatrix we get `PPL = 9.3017`, so no, unlike Google with Gemma3, the Qwen3 creators have not been overfitting to `Q4_0`.
-> 👤 **saood06** replied the **2025-04-30** at **23:41:35**:
+> 👤 **saood06** replied on **2025-04-30** at **23:41:35**
+>
> Very interesting will definitely take this into account when making my own mixes of Qwen-3.
>
> > OK, it must be my imatrix. People have filled whole libraries writing about how the imatrix calibration data needs to be random, diverse, whatnot. OK, let's grab the [Unsloth imatrix](https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF/blob/main/imatrix_unsloth.dat). Quantize, run `llama-perplexity` for recipe IQK-6 . Result: `PPL = 8.8787`.
@@ -166,16 +170,19 @@ Just in case, I also checked PPL for `Q4_0`. Without imatrix we get `PPL = 9.301
> >Though we have focused on accuracy so far, our observation that the difference between two models’ output token values cancel out leaving the average metric result unchanged, is applicable to perplexity as well. In particular, since perplexity may be interpreted as the inverse of the geometric mean of token probabilities, lower probabilities for some tokens in the test dataset may be cancelled by higher probabilities of other tokens. This indicates that perplexity alone is also inadequate in evaluating model compression schemes. Therefore, we argue that along with perplexity, KL-Divergence between the distributions generated by the baseline and optimized models should also be reported.
> >
> >Figure 9 in Appendix plots the log-likelihood difference between the 16-bit and quantized model for each of the tokens in the wiki-2 dataset Merity et al. (2016) for four different quantization schemes. From the figure, it appears that the log-likelihoods of the quantized model is just the log-likelihood of baseline model with some symmetric noise added. Now, since perplexity is e−avg(logprobabilities), adding any amount of symmetric noise leaves it unchanged. For example, addition of Gaussian noise to the log-probability outputs of the model should maintain the perplexity, while the quality of generation will degrade as the standard deviation of the noise increases (see Table 19). This analysis demonstrates one key weakness with the perplexity metric when used for evaluating compression techniques. While it is not clear if adding Gaussian noise to the log-likelihoods is an accurate representation of the behavior of compression schemes, it appears to be a good analogy. As we shall see in Section 6, as quantization increases, there is steady degradation in the quality of the text generated by the model that are visible only by examining them closely.
+
+> 👤 **ikawrakow** replied on **2025-05-01** at **06:15:01**
>
-> 👤 **ikawrakow** replied the **2025-05-01** at **06:15:01**:
> They critique PPL. Do you want me to critique the paper for you?
+
+> 👤 **saood06** replied on **2025-05-01** at **06:57:34**
>
-> 👤 **saood06** replied the **2025-05-01** at **06:57:34**:
> > They critique PPL. Do you want me to critique the paper for you?
>
> I'm not asking for a critique and I don't really care for the paper as they heavily imply there is an objective measure of performance of an LLM, but in my view there isn't one and it is all dependent on one's use case and use of the LLM (prompting, sampling, etc.), it's just they state, "While it is not clear if adding Gaussian noise to the log-likelihoods is an accurate representation of the behavior of compression schemes, it appears to be a good analogy. ", and I don't have any intuition of whether or not that statement is correct, but thought you might have a take on that if you don't mind sharing.
+
+> 👤 **ikawrakow** replied on **2025-05-01** at **18:02:40**
>
-> 👤 **ikawrakow** replied the **2025-05-01** at **18:02:40**:
> > it's just they state, "While it is not clear if adding Gaussian noise to the log-likelihoods is an accurate representation of the behavior of compression schemes, it appears to be a good analogy. ", and I don't have any intuition of whether or not that statement is correct, but thought you might have a take on that if you don't mind sharing.
>
> This is indeed one critique of their argument, and they deal with it in a very hand wavy way. Apart from the overall quality of the paper, if I just focus on their Table 19, which is the crux of their argument against PPL, here are several other points:
@@ -190,8 +197,9 @@ Just in case, I also checked PPL for `Q4_0`. Without imatrix we get `PPL = 9.301
> [Here](https://huggingface.co/blog/bartowski/llama4-scout-off#67f7beac7500c1c63d048419) are graphs that shows KLD vs PPL and correct top token probability vs PPL for the models studied in the blog post. The correlation coefficient for the straight line fits are 99% and 98%, respectively. I'm a physicist, and as part of my physics education I studied statistics. Physics experiments require a lot of effort, so they thought us that it is important to understand that it does not make sense to measure quantities that are highly correlated. When correlation is as high as 98-99%, measuring one lets you predict the other. This is how it is with PPL and KLD, and with PPL and correct top token.
>
> But if you still have doubts, open a discussion and let's discuss it there. This discussion is about Qwen3 quantization.
+
+> 👤 **bartowski1182** replied on **2025-05-02** at **02:05:54**
>
-> 👤 **bartowski1182** replied the **2025-05-02** at **02:05:54**:
> so strange to see decreasing PPL when quantizing 🤔
>
> I suppose one theory could be that by quantizing reduces some of the noise that's correlated with thinking or other stranger text, and so it's more likely to produce wiki text style generation? that wouldn't be absurd
@@ -208,11 +216,13 @@ Just in case, I also checked PPL for `Q4_0`. Without imatrix we get `PPL = 9.301
> Would a QAT for int4 also help Q4_K with its scaling factors? and nf4 with its different format? or would it need to be specifically the same target quant format?
>
> just thinking out loud
+
+> 👤 **ubergarm** replied on **2025-05-02** at **04:23:54**
>
-> 👤 **ubergarm** replied the **2025-05-02** at **04:23:54**:
> I don't find any references to QAT for this Qwen3 release either, but the paper itself is not yet linked. I did find some official recommendations on quantizing by the Qwen team including GGUF format, some of the documentation maybe is recycled from previous Qwen2.5 release: https://github.com/QwenLM/Qwen3/tree/main/docs/source/quantization
+
+> 👤 **ikawrakow** replied on **2025-05-02** at **06:18:07**
>
-> 👤 **ikawrakow** replied the **2025-05-02** at **06:18:07**:
> @saood06
>
> > Are you sure about the other two imatrix overfitting? Do you have any data showing they perform worse when testing things other than wiki.test.raw?
@@ -220,8 +230,9 @@ Just in case, I also checked PPL for `Q4_0`. Without imatrix we get `PPL = 9.301
> It is hard to prove one model is working better than another with just subjective feelings about the quality of the responses. But if we assume that QAT was not involved in the training, and we observe that the quantized model arrives at a lower PPL for a given test corpus than the `bf16` model, than this must be due to overfitting to the specific type of test data. The only way the overfitting can happen is via the imatrix. Hence, one imatrix resulting in a lower PPL than another imatrix can only mean that the first imatrix has been computed with calibration data that is more similar to the test corpus than the calibration data of the second imatrix.
>
> You see it differently?
+
+> 👤 **bartowski1182** replied on **2025-05-02** at **14:37:59**
>
-> 👤 **bartowski1182** replied the **2025-05-02** at **14:37:59**:
> Also I should note, specifically for the 30B (because I was having issues with experts not being activated) I generated ~100k more tokens of noise from the model which seemed to positively affect the results, there was a bunch of English and Chinese as well as a few other languages I noticed fly by, and a ton of emojis
>
> But yeah with my usual dataset I couldn't make iq2_xs and smaller from lack of data, after augmenting it I had no issues
@@ -230,7 +241,7 @@ Just in case, I also checked PPL for `Q4_0`. Without imatrix we get `PPL = 9.301
---
-👤 **ubergarm** replied the **2025-05-02** at **02:09:33**:
+👤 **ubergarm** commented on **2025-05-02** at **02:09:33**
Oh man I just released [ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf](https://www.reddit.com/r/LocalLLaMA/comments/1kcp34g/ubergarmqwen330ba3bgguf_1600_toksec_pp_105_toksec/) just *before* finding and reading this discussion!!! ooops!
@@ -256,7 +267,8 @@ So sounds like if Qwen was using QAT targeted at fp4, there it may be possible t
If I'm following here, sounds like the goal to get as low as possible without going *below* the bf16 PPL? So using `No imatrix, w = 1` would be better than over fitting with a poor imatrix corpus?
-> 👤 **saood06** replied the **2025-05-02** at **03:54:37**:
+> 👤 **saood06** replied on **2025-05-02** at **03:54:37**
+>
> Thank you for this, it is interesting to see the differences, do you mind doing more tests with the same quant mix but with different imatrix datasets.
>
> I was experimenting with some things and happened to run a PPL run
@@ -273,15 +285,17 @@ If I'm following here, sounds like the goal to get as low as possible without go
> [1]5.9247,[2]8.1743,[3]7.8086,[4]7.4665,[5]7.5827,[6]7.9146,[7]7.9256,[8]8.3800,[9]8.7338,[10]9.2358,[11]9.2796,[12]9.4751,[13]9.9094,[14]9.5276,[15]9.4315,[16]9.6651,[17]9.1359,[18]9.2973,[19]9.2420,[20]9.2027,[21]8.8980,[22]8.8728,[23]8.5166,[24]8.0805,[25]7.8595,[26]7.6511,[27]7.4495,[28]7.3339,[29]7.3661,[30]7.3024,[31]7.2536,[32]7.2892,[33]7.1866,[34]7.2554,[35]7.3744,[36]7.4953,[37]7.6588,[38]7.7084,[39]7.7045,[40]7.7682,[41]7.7699,[42]7.7425,[43]7.8004,[44]7.8073,[45]7.8050,[46]7.8210,[47]7.9946,[48]8.0825,[49]8.0659,[50]8.1399,[51]8.1844,[52]8.2167,[53]8.2830,[54]8.3550,[55]8.3549,[56]8.3934,[57]8.3696,[58]8.4045,[59]8.4742,[60]8.5250,[61]8.5489,[62]8.5950,[63]8.6515,[64]8.7124,[65]8.7947,[66]8.8633,[67]8.9378,[68]8.9229,[69]8.9382,[70]8.9302,[71]8.9595,[72]9.0276,[73]9.0641,[74]9.0806,[75]9.0334,[76]9.0309,[77]9.0764,[78]9.1284,[79]9.0602,[80]9.0377,[81]8.9987,[82]9.0382,[83]9.0019,[84]8.9840,[85]9.0000,[86]9.0918,[87]9.1256,[88]9.1180,[89]9.1208,[90]9.0922,[91]9.1438,[92]9.1152,[93]9.1593,[94]9.1677,[95]9.1424,[96]9.1272,[97]9.1001,[98]9.1115,[99]9.0908,[100]9.1484,[101]9.1762,[102]9.1625,[103]9.1782,[104]9.1421,[105]9.1415,[106]9.1300,[107]9.1597,[108]9.1980,[109]9.2245,[110]9.2765,[111]9.3936,[112]9.3875,[113]9.3438,[114]9.4089,[115]9.4185,[116]9.3804,[117]9.3632,[118]9.3410,[119]9.2961,[120]9.3076,[121]9.2886,[122]9.2788,[123]9.2384,[124]9.1841,[125]9.1533,[126]9.1287,[127]9.0717,[128]9.0405,[129]9.0034,[130]8.9708,[131]8.9196,[132]8.8810,[133]8.8661,[134]8.8615,[135]8.8584,[136]8.8559,[137]8.8246,[138]8.7906,[139]8.7961,[140]8.7798,[141]8.7714,[142]8.7767,[143]8.7843,[144]8.8132,[145]8.7829,[146]8.7426,[147]8.7048,[148]8.6669,[149]8.6472,[150]8.6109,[151]8.5820,[152]8.5675,[153]8.5704,[154]8.5259,[155]8.5278,[156]8.4865,[157]8.4598,[158]8.4230,[159]8.3891,[160]8.3530,[161]8.3328,[162]8.3172,[163]8.2974,[164]8.2960,[165]8.2776,[166]8.2697,[167]8.2623,[168]8.2869,[169]8.2879,[170]8.3189,[171]8.3428,[172]8.3957,[173]8.4403,[174]8.4601,[175]8.5162,[176]8.5446,[177]8.5919,[178]8.6328,[179]8.6446,[180]8.6443,[181]8.6652,[182]8.6882,[183]8.6726,[184]8.6863,[185]8.6957,[186]8.6995,[187]8.7099,[188]8.7095,[189]8.7234,[190]8.7530,[191]8.7610,[192]8.7754,[193]8.7698,[194]8.7940,[195]8.8140,[196]8.8248,[197]8.8314,[198]8.8092,[199]8.8022,[200]8.7916,[201]8.7998,[202]8.8119,[203]8.8319,[204]8.8476,[205]8.8637,[206]8.8524,[207]8.8779,[208]8.8596,[209]8.8616,[210]8.8582,[211]8.8598,[212]8.8597,[213]8.8587,[214]8.8400,[215]8.8232,[216]8.8181,[217]8.8283,[218]8.8254,[219]8.7966,[220]8.7662,[221]8.7545,[222]8.7441,[223]8.7394,[224]8.7550,[225]8.7343,[226]8.7351,[227]8.7257,[228]8.7006,[229]8.6707,[230]8.6484,[231]8.6321,[232]8.6174,[233]8.6163,[234]8.6245,[235]8.6190,[236]8.6054,[237]8.5927,[238]8.5727,[239]8.5641,[240]8.5666,[241]8.5683,[242]8.5787,[243]8.5781,[244]8.5955,[245]8.5972,[246]8.6165,[247]8.6211,[248]8.6247,[249]8.6325,[250]8.6405,[251]8.6578,[252]8.6728,[253]8.7001,[254]8.7186,[255]8.7225,[256]8.7368,[257]8.7484,[258]8.7335,[259]8.7119,[260]8.6908,[261]8.6662,[262]8.6495,[263]8.6428,[264]8.6444,[265]8.6552,[266]8.6581,[267]8.6575,[268]8.6485,[269]8.6505,[270]8.6479,[271]8.6384,[272]8.6374,[273]8.6343,[274]8.6304,[275]8.6300,[276]8.6186,[277]8.6173,[278]8.6225,[279]8.6199,[280]8.6151,[281]8.6080,[282]8.6152,[283]8.5876,[284]8.5549,[285]8.5602,[286]8.5455,[287]8.5271,[288]8.5236,[289]8.5263,[290]8.5485,[291]8.5523,[292]8.5515,[293]8.5514,[294]8.5683,[295]8.5875,[296]8.6017,[297]8.6254,[298]8.6212,[299]8.6067,[300]8.6080,[301]8.6052,[302]8.6009,[303]8.5933,[304]8.6112,[305]8.6107,[306]8.6070,[307]8.6090,[308]8.6063,[309]8.6045,[310]8.6092,[311]8.6138,[312]8.6037,[313]8.5963,[314]8.6001,[315]8.5870,[316]8.5918,[317]8.6133,[318]8.6186,[319]8.6134,[320]8.6160,[321]8.6021,[322]8.6136,[323]8.6285,[324]8.6449,[325]8.6652,[326]8.6675,[327]8.6592,[328]8.6594,[329]8.6444,[330]8.6352,[331]8.6262,[332]8.6289,[333]8.6290,[334]8.6211,[335]8.6103,[336]8.6040,[337]8.6102,[338]8.6196,[339]8.6126,[340]8.6043,[341]8.5928,[342]8.5905,[343]8.5851,[344]8.5939,[345]8.5976,[346]8.5932,[347]8.5814,[348]8.5820,[349]8.5750,[350]8.5689,[351]8.5691,[352]8.5742,[353]8.5740,[354]8.5611,[355]8.5775,[356]8.5859,[357]8.5915,[358]8.5794,[359]8.5805,[360]8.5783,[361]8.5832,[362]8.5786,[363]8.5745,[364]8.5847,[365]8.6017,[366]8.6275,[367]8.6443,[368]8.6732,[369]8.6930,[370]8.7077,[371]8.7285,[372]8.7517,[373]8.7593,[374]8.7694,[375]8.7897,[376]8.8037,[377]8.8148,[378]8.8284,[379]8.8390,[380]8.8577,[381]8.8731,[382]8.8873,[383]8.8997,[384]8.9121,[385]8.9398,[386]8.9572,[387]8.9592,[388]8.9606,[389]8.9685,[390]8.9916,[391]9.0118,[392]9.0092,[393]9.0068,[394]8.9998,[395]9.0002,[396]9.0093,[397]9.0136,[398]9.0151,[399]9.0200,[400]9.0358,[401]9.0392,[402]9.0389,[403]9.0288,[404]9.0197,[405]9.0086,[406]9.0037,[407]9.0078,[408]9.0125,[409]9.0091,[410]9.0078,[411]9.0194,[412]9.0216,[413]9.0181,[414]9.0109,[415]9.0000,[416]8.9862,[417]8.9881,[418]8.9888,[419]8.9870,[420]8.9827,[421]8.9848,[422]8.9703,[423]8.9702,[424]8.9664,[425]8.9640,[426]8.9660,[427]8.9734,[428]8.9854,[429]8.9910,[430]8.9852,[431]8.9782,[432]8.9832,[433]8.9821,[434]8.9803,[435]8.9902,[436]8.9772,[437]8.9787,[438]8.9782,[439]8.9701,[440]8.9795,[441]8.9758,[442]8.9677,[443]8.9611,[444]8.9618,[445]8.9508,[446]8.9544,[447]8.9521,[448]8.9426,[449]8.9333,[450]8.9348,[451]8.9310,[452]8.9178,[453]8.9090,[454]8.9044,[455]8.9046,[456]8.9018,[457]8.9067,[458]8.9228,[459]8.9191,[460]8.9195,[461]8.9148,[462]8.9139,[463]8.9248,[464]8.9240,[465]8.9264,[466]8.9284,[467]8.9341,[468]8.9408,[469]8.9442,[470]8.9504,[471]8.9408,[472]8.9479,[473]8.9368,[474]8.9379,[475]8.9455,[476]8.9491,[477]8.9436,[478]8.9316,[479]8.9338,[480]8.9431,[481]8.9505,[482]8.9400,[483]8.9486,[484]8.9555,[485]8.9582,[486]8.9558,[487]8.9607,[488]8.9523,[489]8.9391,[490]8.9381,[491]8.9294,[492]8.9257,[493]8.9140,[494]8.9104,[495]8.9025,[496]8.9010,[497]8.9099,[498]8.9157,[499]8.9101,[500]8.9104,[501]8.9127,[502]8.9097,[503]8.9239,[504]8.9318,[505]8.9347,[506]8.9338,[507]8.9276,[508]8.9323,[509]8.9259,[510]8.9275,[511]8.9330,[512]8.9285,[513]8.9306,[514]8.9337,[515]8.9353,[516]8.9378,[517]8.9438,[518]8.9422,[519]8.9414,[520]8.9405,[521]8.9424,[522]8.9330,[523]8.9372,[524]8.9372,[525]8.9398,[526]8.9454,[527]8.9458,[528]8.9462,[529]8.9425,[530]8.9378,[531]8.9411,[532]8.9373,[533]8.9376,[534]8.9375,[535]8.9406,[536]8.9342,[537]8.9426,[538]8.9535,[539]8.9526,[540]8.9681,[541]8.9682,[542]8.9586,[543]8.9606,[544]8.9674,[545]8.9644,[546]8.9625,[547]8.9558,[548]8.9420,[549]8.9398,[550]8.9259,[551]8.9152,[552]8.9047,[553]8.8751,[554]8.8725,[555]8.8753,[556]8.8766,[557]8.8767,[558]8.8752,[559]8.8801,[560]8.8846,[561]8.8912,[562]8.9031,[563]8.9110,[564]8.9083,[565]8.9173,[566]8.9215,[567]8.9116,[568]8.9038,[569]8.8974,[570]8.8955,[571]8.8936,[572]8.9040,[573]8.9069,[574]8.9099,[575]8.9105,[576]8.9181,[577]8.9149,[578]8.9194,[579]8.9272,[580]8.9403,[581]8.9417,[582]8.9526,[583]8.9375,[584]8.9344,
> Final estimate: PPL = 8.9344 +/- 0.06857
> ```
+
+> 👤 **ikawrakow** replied on **2025-05-02** at **06:28:45**
>
-> 👤 **ikawrakow** replied the **2025-05-02** at **06:28:45**:
> So, this is a `0.0557` difference to the PPL I computed with Unsloth's imatrix, so about 0.6% higher. This is way too much to be explained by it being computed on different hardware (typically differences due to floating point operations non-associativity are in the 0.001 range for Wiki2 PPL). This would indicate
> * There is some level of numerical instability resulting in larger than usual differences between results computed on different hardware
> * And/Or `Q8_0` quantization of the KV cache is not accurate enough for this model (I used `fp16` KV cache).
>
> If you have the ability and time to run with `fp16` KV cache, it would be interesting to have that result as well.
+
+> 👤 **saood06** replied on **2025-05-02** at **07:15:15**
>
-> 👤 **saood06** replied the **2025-05-02** at **07:15:15**:
> > If you have the ability and time to run with `fp16` KV cache, it would be interesting to have that result as well.
>
> Here you go:
@@ -292,29 +306,34 @@ If I'm following here, sounds like the goal to get as low as possible without go
> [1]5.9736,[2]8.2473,[3]7.8248,[4]7.5090,[5]7.6181,[6]7.9293,[7]7.9364,[8]8.3848,[9]8.7403,[10]9.2418,[11]9.2909,[12]9.5013,[13]9.9446,[14]9.5539,[15]9.4583,[16]9.7112,[17]9.1785,[18]9.3274,[19]9.2756,[20]9.2376,[21]8.9346,[22]8.9130,[23]8.5527,[24]8.1115,[25]7.8918,[26]7.6819,[27]7.4769,[28]7.3594,[29]7.3887,[30]7.3210,[31]7.2760,[32]7.3085,[33]7.1979,[34]7.2693,[35]7.3798,[36]7.4989,[37]7.6636,[38]7.7104,[39]7.7082,[40]7.7709,[41]7.7718,[42]7.7430,[43]7.7996,[44]7.8064,[45]7.8028,[46]7.8195,[47]7.9951,[48]8.0858,[49]8.0708,[50]8.1428,[51]8.1885,[52]8.2210,[53]8.2908,[54]8.3622,[55]8.3631,[56]8.4006,[57]8.3778,[58]8.4120,[59]8.4784,[60]8.5297,[61]8.5523,[62]8.5996,[63]8.6570,[64]8.7165,[65]8.8000,[66]8.8702,[67]8.9450,[68]8.9290,[69]8.9443,[70]8.9367,[71]8.9676,[72]9.0349,[73]9.0725,[74]9.0898,[75]9.0438,[76]9.0420,[77]9.0893,[78]9.1437,[79]9.0755,[80]9.0532,[81]9.0161,[82]9.0547,[83]9.0174,[84]8.9989,[85]9.0142,[86]9.1062,[87]9.1386,[88]9.1295,[89]9.1316,[90]9.1027,[91]9.1535,[92]9.1270,[93]9.1707,[94]9.1783,[95]9.1520,[96]9.1371,[97]9.1107,[98]9.1225,[99]9.1011,[100]9.1586,[101]9.1854,[102]9.1710,[103]9.1861,[104]9.1507,[105]9.1506,[106]9.1385,[107]9.1680,[108]9.2054,[109]9.2309,[110]9.2824,[111]9.4013,[112]9.3940,[113]9.3502,[114]9.4151,[115]9.4236,[116]9.3861,[117]9.3689,[118]9.3468,[119]9.3020,[120]9.3153,[121]9.2962,[122]9.2866,[123]9.2463,[124]9.1913,[125]9.1611,[126]9.1364,[127]9.0790,[128]9.0479,[129]9.0107,[130]8.9785,[131]8.9274,[132]8.8878,[133]8.8715,[134]8.8671,[135]8.8644,[136]8.8624,[137]8.8311,[138]8.7971,[139]8.8032,[140]8.7863,[141]8.7775,[142]8.7831,[143]8.7891,[144]8.8182,[145]8.7880,[146]8.7468,[147]8.7090,[148]8.6709,[149]8.6509,[150]8.6152,[151]8.5858,[152]8.5716,[153]8.5746,[154]8.5300,[155]8.5322,[156]8.4904,[157]8.4639,[158]8.4276,[159]8.3930,[160]8.3574,[161]8.3378,[162]8.3230,[163]8.3033,[164]8.3020,[165]8.2830,[166]8.2750,[167]8.2675,[168]8.2916,[169]8.2929,[170]8.3244,[171]8.3492,[172]8.4016,[173]8.4460,[174]8.4650,[175]8.5220,[176]8.5503,[177]8.5972,[178]8.6381,[179]8.6490,[180]8.6490,[181]8.6707,[182]8.6941,[183]8.6781,[184]8.6918,[185]8.7008,[186]8.7041,[187]8.7146,[188]8.7149,[189]8.7290,[190]8.7586,[191]8.7662,[192]8.7813,[193]8.7756,[194]8.8000,[195]8.8202,[196]8.8308,[197]8.8377,[198]8.8159,[199]8.8094,[200]8.7984,[201]8.8067,[202]8.8194,[203]8.8394,[204]8.8558,[205]8.8719,[206]8.8608,[207]8.8861,[208]8.8674,[209]8.8694,[210]8.8649,[211]8.8664,[212]8.8664,[213]8.8654,[214]8.8467,[215]8.8299,[216]8.8243,[217]8.8339,[218]8.8314,[219]8.8026,[220]8.7723,[221]8.7606,[222]8.7501,[223]8.7457,[224]8.7614,[225]8.7408,[226]8.7417,[227]8.7322,[228]8.7067,[229]8.6769,[230]8.6544,[231]8.6385,[232]8.6238,[233]8.6229,[234]8.6317,[235]8.6266,[236]8.6119,[237]8.5987,[238]8.5786,[239]8.5695,[240]8.5717,[241]8.5735,[242]8.5840,[243]8.5833,[244]8.6012,[245]8.6028,[246]8.6218,[247]8.6262,[248]8.6298,[249]8.6374,[250]8.6450,[251]8.6627,[252]8.6781,[253]8.7056,[254]8.7244,[255]8.7280,[256]8.7426,[257]8.7539,[258]8.7389,[259]8.7176,[260]8.6965,[261]8.6721,[262]8.6554,[263]8.6484,[264]8.6496,[265]8.6603,[266]8.6635,[267]8.6630,[268]8.6536,[269]8.6558,[270]8.6532,[271]8.6436,[272]8.6434,[273]8.6404,[274]8.6366,[275]8.6370,[276]8.6256,[277]8.6240,[278]8.6297,[279]8.6269,[280]8.6228,[281]8.6157,[282]8.6225,[283]8.5953,[284]8.5628,[285]8.5682,[286]8.5529,[287]8.5352,[288]8.5319,[289]8.5352,[290]8.5574,[291]8.5611,[292]8.5600,[293]8.5597,[294]8.5771,[295]8.5966,[296]8.6104,[297]8.6343,[298]8.6301,[299]8.6159,[300]8.6174,[301]8.6142,[302]8.6097,[303]8.6022,[304]8.6197,[305]8.6192,[306]8.6158,[307]8.6179,[308]8.6149,[309]8.6137,[310]8.6182,[311]8.6222,[312]8.6118,[313]8.6043,[314]8.6079,[315]8.5949,[316]8.5993,[317]8.6204,[318]8.6258,[319]8.6203,[320]8.6228,[321]8.6086,[322]8.6199,[323]8.6346,[324]8.6507,[325]8.6710,[326]8.6732,[327]8.6655,[328]8.6653,[329]8.6499,[330]8.6404,[331]8.6312,[332]8.6335,[333]8.6336,[334]8.6258,[335]8.6146,[336]8.6087,[337]8.6148,[338]8.6240,[339]8.6169,[340]8.6086,[341]8.5971,[342]8.5949,[343]8.5896,[344]8.5983,[345]8.6018,[346]8.5975,[347]8.5856,[348]8.5863,[349]8.5795,[350]8.5734,[351]8.5733,[352]8.5784,[353]8.5782,[354]8.5653,[355]8.5821,[356]8.5903,[357]8.5962,[358]8.5844,[359]8.5856,[360]8.5831,[361]8.5881,[362]8.5832,[363]8.5793,[364]8.5895,[365]8.6065,[366]8.6323,[367]8.6493,[368]8.6782,[369]8.6979,[370]8.7129,[371]8.7341,[372]8.7573,[373]8.7651,[374]8.7751,[375]8.7954,[376]8.8094,[377]8.8205,[378]8.8340,[379]8.8444,[380]8.8630,[381]8.8783,[382]8.8923,[383]8.9046,[384]8.9177,[385]8.9451,[386]8.9627,[387]8.9649,[388]8.9664,[389]8.9747,[390]8.9977,[391]9.0179,[392]9.0151,[393]9.0123,[394]9.0053,[395]9.0057,[396]9.0149,[397]9.0193,[398]9.0209,[399]9.0254,[400]9.0412,[401]9.0448,[402]9.0445,[403]9.0342,[404]9.0250,[405]9.0143,[406]9.0092,[407]9.0131,[408]9.0179,[409]9.0147,[410]9.0133,[411]9.0250,[412]9.0270,[413]9.0237,[414]9.0164,[415]9.0056,[416]8.9918,[417]8.9939,[418]8.9943,[419]8.9925,[420]8.9881,[421]8.9901,[422]8.9757,[423]8.9752,[424]8.9716,[425]8.9691,[426]8.9713,[427]8.9781,[428]8.9900,[429]8.9958,[430]8.9898,[431]8.9829,[432]8.9878,[433]8.9866,[434]8.9847,[435]8.9947,[436]8.9817,[437]8.9833,[438]8.9826,[439]8.9745,[440]8.9837,[441]8.9798,[442]8.9722,[443]8.9657,[444]8.9664,[445]8.9557,[446]8.9592,[447]8.9568,[448]8.9472,[449]8.9380,[450]8.9395,[451]8.9356,[452]8.9220,[453]8.9135,[454]8.9089,[455]8.9092,[456]8.9065,[457]8.9113,[458]8.9274,[459]8.9238,[460]8.9241,[461]8.9196,[462]8.9185,[463]8.9295,[464]8.9291,[465]8.9318,[466]8.9338,[467]8.9392,[468]8.9456,[469]8.9488,[470]8.9550,[471]8.9455,[472]8.9530,[473]8.9420,[474]8.9434,[475]8.9509,[476]8.9546,[477]8.9489,[478]8.9368,[479]8.9392,[480]8.9484,[481]8.9561,[482]8.9454,[483]8.9540,[484]8.9609,[485]8.9638,[486]8.9614,[487]8.9661,[488]8.9577,[489]8.9444,[490]8.9436,[491]8.9348,[492]8.9310,[493]8.9193,[494]8.9158,[495]8.9076,[496]8.9063,[497]8.9151,[498]8.9211,[499]8.9155,[500]8.9159,[501]8.9183,[502]8.9154,[503]8.9297,[504]8.9373,[505]8.9398,[506]8.9389,[507]8.9328,[508]8.9376,[509]8.9313,[510]8.9331,[511]8.9384,[512]8.9338,[513]8.9362,[514]8.9392,[515]8.9409,[516]8.9433,[517]8.9492,[518]8.9474,[519]8.9465,[520]8.9458,[521]8.9477,[522]8.9383,[523]8.9423,[524]8.9424,[525]8.9450,[526]8.9508,[527]8.9511,[528]8.9515,[529]8.9478,[530]8.9430,[531]8.9463,[532]8.9421,[533]8.9426,[534]8.9426,[535]8.9459,[536]8.9394,[537]8.9478,[538]8.9587,[539]8.9576,[540]8.9731,[541]8.9730,[542]8.9633,[543]8.9653,[544]8.9722,[545]8.9691,[546]8.9674,[547]8.9609,[548]8.9473,[549]8.9452,[550]8.9316,[551]8.9211,[552]8.9108,[553]8.8812,[554]8.8786,[555]8.8814,[556]8.8827,[557]8.8827,[558]8.8813,[559]8.8863,[560]8.8909,[561]8.8975,[562]8.9095,[563]8.9175,[564]8.9143,[565]8.9233,[566]8.9277,[567]8.9180,[568]8.9102,[569]8.9038,[570]8.9022,[571]8.9006,[572]8.9107,[573]8.9135,[574]8.9165,[575]8.9171,[576]8.9246,[577]8.9213,[578]8.9259,[579]8.9338,[580]8.9469,[581]8.9482,[582]8.9594,[583]8.9442,[584]8.9408,
> Final estimate: PPL = 8.9408 +/- 0.06868
> ```
+
+> 👤 **ikawrakow** replied on **2025-05-02** at **08:08:40**
>
-> 👤 **ikawrakow** replied the **2025-05-02** at **08:08:40**:
> Thanks.
>
> This discards the second option and points more towards the first, given the `0.0065` difference between `Q8_0` and `fp16` KV cache on the *same hardware*. But there is also a 3rd option that I missed above:
> * There is (also) numerical instability is in the quantization process
>
> I'm leaving for the airport shortly and will be traveling for the better part of the day. But tomorrow I'll post my `IQ4_KS` models quantized with the 3 different imatrix datasets on HF.
+
+> 👤 **danielhanchen** replied on **2025-05-02** at **08:12:08**
>
-> 👤 **danielhanchen** replied the **2025-05-02** at **08:12:08**:
> @ikawrakow I think you're using the 128K imatrix which has YaRN enabled hence the discrepancy maybe. Also @ubergarm's results on Wiki show Q4_K_XL does pretty ok on Wiki.test.raw (Ubergarm's own quants look very impressive indeed), but higher on Ub's own calibration dataset. Notice I use Qwen's chat template directly, and add thinking traces so it might be worse on generic text data.
+
+> 👤 **saood06** replied on **2025-05-02** at **08:15:52**
>
-> 👤 **saood06** replied the **2025-05-02** at **08:15:52**:
> > This discards the second option and points more towards the first, given the `0.0065` difference between `Q8_0` and `fp16` KV cache on the _same hardware_.
>
> Do you want me to run PPL on that model on the CPU in my server, at FP16 and Q8_0? The model is fast enough for me to do that without it taking forever.
+
+> 👤 **ikawrakow** replied on **2025-05-02** at **08:19:31**
>
-> 👤 **ikawrakow** replied the **2025-05-02** at **08:19:31**:
> > Do you want me to run PPL on that model on the CPU in my server, at FP16 and Q8_0?
>
> That could be a useful datapoint. @ubergarm's `bf16` value differs from mine by more than I have historically found as a difference between different systems.
+
+> 👤 **ikawrakow** replied on **2025-05-02** at **08:37:24**
>
-> 👤 **ikawrakow** replied the **2025-05-02** at **08:37:24**:
> @danielhanchen
>
> > I think you're using the 128K imatrix which has YaRN enabled hence...
@@ -324,8 +343,9 @@ If I'm following here, sounds like the goal to get as low as possible without go
> > Also @ubergarm's results on Wiki show Q4_K_XL does pretty ok on Wiki.test.raw
>
> This depends on the way you look at it. `PPL = 9.1688` is 1.1% higher than `bf16`, so pretty much a run-of-the-mill `Q4_K` quantization, especially for a MoE model (sometimes one needs `Q4_K_M` to get to the 1% range, but often `Q4_K_S` is enough). Your `IQ4_XS` quantization is actually better, arriving at essentially the same PPL (`9.1704`) with 1.3 GB smaller model size.
+
+> 👤 **danielhanchen** replied on **2025-05-02** at **08:48:36**
>
-> 👤 **danielhanchen** replied the **2025-05-02** at **08:48:36**:
> @ikawrakow Oh my calibration dataset is like 12K or longer for thinking models, so there might be some discrepancies for 128K long context imatrices. https://blog.eleuther.ai/yarn/ for eg does show YaRN enabled does increase PPL for shorter context lengths.
>
> Oh this one: https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/imatrix_unsloth.dat (normal 40960 context length)
@@ -333,8 +353,9 @@ If I'm following here, sounds like the goal to get as low as possible without go
> I used BF16 to get the imatrix. I actually tried Q8_0, and it actually failed to create some IQ1_S / IQ1_M quants, so I instead used BF16 - I think Qwen released FP8 versions, so I first thought using Q8_0 was fine for imatrix, since they might have trained with FP8, but I'm not sure anymore - the FP8 might just be post quantization.
>
> I was actually thinking of adopting ik_llama.cpp @ikawrakow :) For the next release of models, I could provide quants also compatible with ik_llama.cpp if that's interesting, especially since @ubergarm's results always wow me (Deepseek, Scout, etc) :)
+
+> 👤 **saood06** replied on **2025-05-02** at **09:00:40**
>
-> 👤 **saood06** replied the **2025-05-02** at **09:00:40**:
> >But apart from this, why would YaRN enabled or not change anything when we are running a 512 tokens context
>
> The official model card says:
@@ -364,16 +385,19 @@ If I'm following here, sounds like the goal to get as low as possible without go
> ```
>
> It gave me the same result both times (my commit and main) so I'm not sure if my change did anything at all.
+
+> 👤 **ikawrakow** replied on **2025-05-03** at **04:41:26**
>
-> 👤 **ikawrakow** replied the **2025-05-03** at **04:41:26**:
> > For the next release of models, I could provide quants also compatible with ik_llama.cpp if that's interesting,
>
> This would be great!
+
+> 👤 **ikawrakow** replied on **2025-05-03** at **06:44:59**
>
-> 👤 **ikawrakow** replied the **2025-05-03** at **06:44:59**:
> @saood06 Where did you get the RoPE factor change from?
+
+> 👤 **saood06** replied on **2025-05-03** at **06:49:45**
>
-> 👤 **saood06** replied the **2025-05-03** at **06:49:45**:
> Sorry for the delay but here is the same model (and it's repacked variant) running PPL on my CPU instead of my GPU with both F16 and Q8_0 cache.
>
> ` ./bin/llama-perplexity -m /mnt/sda/Qwen3/30BA3B/BF16/ggml-model-IQ4_KS.gguf -t 48 --numa distribute -fa -fmoe -f /mnt/sda/wikitext-2-raw/wiki.test.raw`
@@ -407,8 +431,9 @@ If I'm following here, sounds like the goal to get as low as possible without go
> [1]5.9016,[2]8.2000,[3]7.8140,[4]7.4725,[5]7.5629,[6]7.8727,[7]7.9035,[8]8.3263,[9]8.6767,[10]9.1767,[11]9.2245,[12]9.4083,[13]9.8413,[14]9.4664,[15]9.3727,[16]9.6102,[17]9.0849,[18]9.2388,[19]9.1735,[20]9.1432,[21]8.8470,[22]8.8168,[23]8.4654,[24]8.0293,[25]7.8182,[26]7.6125,[27]7.4136,[28]7.2990,[29]7.3292,[30]7.2632,[31]7.2147,[32]7.2425,[33]7.1391,[34]7.2118,[35]7.3291,[36]7.4505,[37]7.6066,[38]7.6595,[39]7.6571,[40]7.7174,[41]7.7206,[42]7.6952,[43]7.7506,[44]7.7565,[45]7.7513,[46]7.7658,[47]7.9396,[48]8.0275,[49]8.0138,[50]8.0853,[51]8.1307,[52]8.1643,[53]8.2358,[54]8.3060,[55]8.3053,[56]8.3430,[57]8.3225,[58]8.3581,[59]8.4309,[60]8.4806,[61]8.5032,[62]8.5485,[63]8.6048,[64]8.6630,[65]8.7482,[66]8.8175,[67]8.8924,[68]8.8787,[69]8.8915,[70]8.8836,[71]8.9148,[72]8.9826,[73]9.0235,[74]9.0416,[75]8.9940,[76]8.9918,[77]9.0371,[78]9.0880,[79]9.0183,[80]8.9964,[81]8.9587,[82]8.9944,[83]8.9610,[84]8.9402,[85]8.9564,[86]9.0486,[87]9.0821,[88]9.0735,[89]9.0747,[90]9.0454,[91]9.0955,[92]9.0694,[93]9.1150,[94]9.1229,[95]9.0985,[96]9.0840,[97]9.0566,[98]9.0688,[99]9.0485,[100]9.1061,[101]9.1346,[102]9.1200,[103]9.1360,[104]9.1011,[105]9.1018,[106]9.0895,[107]9.1188,[108]9.1568,[109]9.1825,[110]9.2341,[111]9.3521,[112]9.3452,[113]9.3028,[114]9.3675,[115]9.3770,[116]9.3409,[117]9.3241,[118]9.3013,[119]9.2564,[120]9.2678,[121]9.2488,[122]9.2393,[123]9.2002,[124]9.1469,[125]9.1159,[126]9.0924,[127]9.0347,[128]9.0034,[129]8.9666,[130]8.9353,[131]8.8850,[132]8.8451,[133]8.8305,[134]8.8248,[135]8.8234,[136]8.8209,[137]8.7910,[138]8.7567,[139]8.7631,[140]8.7471,[141]8.7386,[142]8.7445,[143]8.7522,[144]8.7807,[145]8.7505,[146]8.7112,[147]8.6732,[148]8.6360,[149]8.6164,[150]8.5807,[151]8.5516,[152]8.5375,[153]8.5405,[154]8.4962,[155]8.4992,[156]8.4582,[157]8.4322,[158]8.3958,[159]8.3613,[160]8.3245,[161]8.3042,[162]8.2882,[163]8.2678,[164]8.2657,[165]8.2456,[166]8.2368,[167]8.2301,[168]8.2538,[169]8.2544,[170]8.2859,[171]8.3102,[172]8.3630,[173]8.4064,[174]8.4245,[175]8.4801,[176]8.5084,[177]8.5554,[178]8.5952,[179]8.6057,[180]8.6065,[181]8.6286,[182]8.6513,[183]8.6357,[184]8.6486,[185]8.6579,[186]8.6620,[187]8.6730,[188]8.6738,[189]8.6871,[190]8.7167,[191]8.7243,[192]8.7397,[193]8.7343,[194]8.7577,[195]8.7769,[196]8.7888,[197]8.7953,[198]8.7731,[199]8.7664,[200]8.7559,[201]8.7647,[202]8.7775,[203]8.7980,[204]8.8137,[205]8.8302,[206]8.8196,[207]8.8455,[208]8.8264,[209]8.8284,[210]8.8242,[211]8.8247,[212]8.8245,[213]8.8242,[214]8.8063,[215]8.7894,[216]8.7850,[217]8.7952,[218]8.7927,[219]8.7639,[220]8.7339,[221]8.7223,[222]8.7116,[223]8.7064,[224]8.7211,[225]8.7003,[226]8.7008,[227]8.6912,[228]8.6655,[229]8.6352,[230]8.6134,[231]8.5975,[232]8.5829,[233]8.5819,[234]8.5907,[235]8.5855,[236]8.5715,[237]8.5586,[238]8.5385,[239]8.5304,[240]8.5332,[241]8.5353,[242]8.5460,[243]8.5450,[244]8.5628,[245]8.5646,[246]8.5836,[247]8.5875,[248]8.5915,[249]8.5992,[250]8.6072,[251]8.6246,[252]8.6396,[253]8.6664,[254]8.6849,[255]8.6886,[256]8.7028,[257]8.7138,[258]8.6980,[259]8.6768,[260]8.6553,[261]8.6303,[262]8.6140,[263]8.6072,[264]8.6084,[265]8.6184,[266]8.6213,[267]8.6203,[268]8.6109,[269]8.6128,[270]8.6113,[271]8.6013,[272]8.6002,[273]8.5974,[274]8.5942,[275]8.5939,[276]8.5820,[277]8.5805,[278]8.5852,[279]8.5820,[280]8.5774,[281]8.5707,[282]8.5787,[283]8.5512,[284]8.5188,[285]8.5241,[286]8.5089,[287]8.4909,[288]8.4869,[289]8.4895,[290]8.5114,[291]8.5149,[292]8.5140,[293]8.5142,[294]8.5316,[295]8.5506,[296]8.5642,[297]8.5877,[298]8.5841,[299]8.5692,[300]8.5706,[301]8.5680,[302]8.5639,[303]8.5569,[304]8.5745,[305]8.5746,[306]8.5707,[307]8.5734,[308]8.5716,[309]8.5707,[310]8.5753,[311]8.5795,[312]8.5694,[313]8.5621,[314]8.5663,[315]8.5533,[316]8.5577,[317]8.5790,[318]8.5843,[319]8.5790,[320]8.5820,[321]8.5679,[322]8.5800,[323]8.5938,[324]8.6105,[325]8.6308,[326]8.6332,[327]8.6251,[328]8.6260,[329]8.6108,[330]8.6016,[331]8.5926,[332]8.5956,[333]8.5957,[334]8.5876,[335]8.5771,[336]8.5711,[337]8.5770,[338]8.5861,[339]8.5789,[340]8.5703,[341]8.5586,[342]8.5562,[343]8.5509,[344]8.5587,[345]8.5623,[346]8.5576,[347]8.5456,[348]8.5460,[349]8.5389,[350]8.5331,[351]8.5330,[352]8.5379,[353]8.5370,[354]8.5242,[355]8.5401,[356]8.5479,[357]8.5531,[358]8.5408,[359]8.5419,[360]8.5397,[361]8.5451,[362]8.5403,[363]8.5366,[364]8.5463,[365]8.5628,[366]8.5886,[367]8.6058,[368]8.6349,[369]8.6541,[370]8.6687,[371]8.6894,[372]8.7124,[373]8.7204,[374]8.7299,[375]8.7500,[376]8.7637,[377]8.7737,[378]8.7876,[379]8.7978,[380]8.8164,[381]8.8318,[382]8.8456,[383]8.8580,[384]8.8710,[385]8.8986,[386]8.9157,[387]8.9177,[388]8.9189,[389]8.9269,[390]8.9499,[391]8.9701,[392]8.9670,[393]8.9642,[394]8.9573,[395]8.9575,[396]8.9663,[397]8.9706,[398]8.9717,[399]8.9767,[400]8.9924,[401]8.9957,[402]8.9955,[403]8.9853,[404]8.9763,[405]8.9653,[406]8.9603,[407]8.9638,[408]8.9684,[409]8.9650,[410]8.9634,[411]8.9757,[412]8.9778,[413]8.9744,[414]8.9673,[415]8.9566,[416]8.9433,[417]8.9448,[418]8.9459,[419]8.9444,[420]8.9400,[421]8.9419,[422]8.9279,[423]8.9277,[424]8.9237,[425]8.9214,[426]8.9228,[427]8.9294,[428]8.9414,[429]8.9471,[430]8.9413,[431]8.9345,[432]8.9398,[433]8.9388,[434]8.9367,[435]8.9467,[436]8.9339,[437]8.9353,[438]8.9351,[439]8.9267,[440]8.9365,[441]8.9333,[442]8.9257,[443]8.9191,[444]8.9199,[445]8.9096,[446]8.9130,[447]8.9104,[448]8.9008,[449]8.8919,[450]8.8932,[451]8.8895,[452]8.8761,[453]8.8675,[454]8.8629,[455]8.8631,[456]8.8609,[457]8.8660,[458]8.8819,[459]8.8785,[460]8.8785,[461]8.8742,[462]8.8731,[463]8.8837,[464]8.8831,[465]8.8852,[466]8.8872,[467]8.8925,[468]8.8989,[469]8.9025,[470]8.9087,[471]8.8997,[472]8.9071,[473]8.8964,[474]8.8979,[475]8.9053,[476]8.9089,[477]8.9035,[478]8.8919,[479]8.8941,[480]8.9030,[481]8.9105,[482]8.9003,[483]8.9091,[484]8.9163,[485]8.9196,[486]8.9168,[487]8.9217,[488]8.9133,[489]8.9002,[490]8.8997,[491]8.8910,[492]8.8872,[493]8.8755,[494]8.8719,[495]8.8641,[496]8.8628,[497]8.8714,[498]8.8780,[499]8.8725,[500]8.8728,[501]8.8752,[502]8.8721,[503]8.8863,[504]8.8940,[505]8.8969,[506]8.8959,[507]8.8901,[508]8.8949,[509]8.8884,[510]8.8903,[511]8.8956,[512]8.8914,[513]8.8939,[514]8.8972,[515]8.8990,[516]8.9014,[517]8.9072,[518]8.9054,[519]8.9049,[520]8.9039,[521]8.9059,[522]8.8965,[523]8.9003,[524]8.9001,[525]8.9028,[526]8.9086,[527]8.9088,[528]8.9095,[529]8.9053,[530]8.9005,[531]8.9036,[532]8.8999,[533]8.9003,[534]8.9000,[535]8.9031,[536]8.8967,[537]8.9052,[538]8.9158,[539]8.9149,[540]8.9305,[541]8.9301,[542]8.9205,[543]8.9225,[544]8.9292,[545]8.9262,[546]8.9243,[547]8.9178,[548]8.9041,[549]8.9023,[550]8.8888,[551]8.8781,[552]8.8676,[553]8.8381,[554]8.8353,[555]8.8383,[556]8.8396,[557]8.8397,[558]8.8384,[559]8.8437,[560]8.8485,[561]8.8554,[562]8.8673,[563]8.8752,[564]8.8724,[565]8.8813,[566]8.8853,[567]8.8753,[568]8.8676,[569]8.8609,[570]8.8589,[571]8.8573,[572]8.8675,[573]8.8705,[574]8.8733,[575]8.8736,[576]8.8814,[577]8.8783,[578]8.8829,[579]8.8908,[580]8.9042,[581]8.9057,[582]8.9168,[583]8.9019,[584]8.8990,
> Final estimate: PPL = 8.8990 +/- 0.06799
> ```
+
+> 👤 **ubergarm** replied on **2025-05-03** at **16:38:27**
>
-> 👤 **ubergarm** replied the **2025-05-03** at **16:38:27**:
> > > For the next release of models, I could provide quants also compatible with ik_llama.cpp if that's interesting,
> >
> > This would be great!
@@ -418,19 +443,21 @@ If I'm following here, sounds like the goal to get as low as possible without go
> @danielhanchen fwiw myself and a few others have started adding the tag `ik_llama.cpp` to the `iqN_k` quants uploaded on the huggingface README.md model cards which makes it easier to find e.g. https://huggingface.co/models?other=ik_llama.cpp
>
> Appreciate all your time and thoughtfulness lately with all the excitement haha... Cheers!
+
+> 👤 **saood06** replied on **2025-05-04** at **05:02:36**
>
-> 👤 **saood06** replied the **2025-05-04** at **05:02:36**:
> > @saood06 Where did you get the RoPE factor change from?
>
> Nowhere, I was just experimenting after I saw that statement in the model card, but I didn't get very far.
---
-👤 **ikawrakow** replied the **2025-05-03** at **06:12:51**:
+👤 **ikawrakow** commented on **2025-05-03** at **06:12:51**
I have posted 3 `IQ4_KS` models quantized with the 3 different imatrix datasets discussed above [here](https://huggingface.co/ikawrakow/Qwen3-30B-A3B)
-> 👤 **ubergarm** replied the **2025-05-04** at **04:21:54**:
+> 👤 **ubergarm** replied on **2025-05-04** at **04:21:54**
+>
> I attempted to make a "visual diff" of three imatrix files. I didn't find yours @ikawrakow on the hf repo, so used mine, unsloths non-128k version, and bartowski's.
>
> https://gist.github.com/ubergarm/2aa9327f7b98a9b16fef62b4941c7e76
@@ -440,16 +467,18 @@ I have posted 3 `IQ4_KS` models quantized with the 3 different imatrix datasets
> I'm not sure how to read the tea leaves or if this is just an amusing distraction and excuse to test vibe coding with `Qwen3-30B-A3B`. To make matters a bit more confusing, unsloth gave some details of their methodology which seems to include generating imatrix with larger context than the `-c 512` I use (and I assume is typical and default?). Its a useful comment in an otherwise odd discussion: https://huggingface.co/unsloth/Phi-4-reasoning-plus-GGUF/discussions/1#68152ae82c118dc537ae3667
>
> Haven't had a chance to grab PPL and KLD stats on your 3 quants yet, but might be able to get that on Sunday and update my table above.
+
+> 👤 **ubergarm** replied on **2025-05-08** at **18:43:48**
>
-> 👤 **ubergarm** replied the **2025-05-08** at **18:43:48**:
> Just posted a quant roundup and trying to run some benchmarks against your Qwen3 `IQ4_KS` on hf. Already posted some PPL and KLD stats here: https://www.reddit.com/r/LocalLLaMA/comments/1khwxal/the_great_quant_wars_of_2025/
+
+> 👤 **l15y** replied on **2025-05-17** at **08:47:42**
>
-> 👤 **l15y** replied the **2025-05-17** at **08:47:42**:
> Please upload the IQK-3 version, which is very useful for users with 16G VRAM.
---
-👤 **ikawrakow** replied the **2025-05-08** at **19:30:19**:
+👤 **ikawrakow** commented on **2025-05-08** at **19:30:19**
@ubergarm Great write up!
@@ -464,7 +493,8 @@ My only comment: when there is no doubt that the `bf16` model is best, then KLD
In that scenario, the larger the difference between the `bf16` model and the (better) quantized model, the higher the values of KLD, etc. So that, if we went by these metrics, we would be thinking that the quantized model is not good, while in reality it is better.
-> 👤 **saood06** replied the **2025-05-08** at **22:56:01**:
+> 👤 **saood06** replied on **2025-05-08** at **22:56:01**
+>
> > @ubergarm Great write up!
> >
> > The fact that the ikawrakow/IQ4_KS_Unsloth model gets a lower PPL than `bf16` on your private evaluation dataset is another indication that something is not quite right.
@@ -479,9 +509,10 @@ In that scenario, the larger the difference between the `bf16` model and the (be
---
-👤 **afsara-ben** replied the **2025-06-10** at **22:24:55**:
+👤 **afsara-ben** commented on **2025-06-10** at **22:24:55**
@ikawrakow i am having a hard time understanding how the iqx_k quants came from? is there an explanation somewhere other than the code
-> 👤 **saood06** replied the **2025-06-11** at **02:58:40**:
-> #8 has the info you are looking for.
\ No newline at end of file
+> 👤 **saood06** replied on **2025-06-11** at **02:58:40**
+>
+> [#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) has the info you are looking for.
\ No newline at end of file
diff --git a/github-data/discussions/372 - multy gpu.md b/github-data/discussions/372 - multy gpu.md
index 0a340aa37..115abc5b9 100644
--- a/github-data/discussions/372 - multy gpu.md
+++ b/github-data/discussions/372 - multy gpu.md
@@ -1,20 +1,21 @@
-### 🗣️ [#372](https://github.com/ikawrakow/ik_llama.cpp/discussions/372) - multy gpu
+## 🗣️ [Discussion #372](https://github.com/ikawrakow/ik_llama.cpp/discussions/372) - multy gpu
| **Author** | `airnsk` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-03 |
| **Updated** | 2025-05-06 |
---
-#### Description
+## 📄 Description
I have 2 cmp90 10 gb GPUs on a computer with 512gb ram. Is it possible to run qwen3-235B?
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-05-06** at **16:39:53**:
+👤 **ikawrakow** commented on **2025-05-06** at **16:39:53**
I think so. But what kind of performance you will get also depends on the CPU you have as a large part of the calculations will be done on the CPU.
\ No newline at end of file
diff --git a/github-data/discussions/384 - ik_llama.cpp issues on an old workstation.md b/github-data/discussions/384 - ik_llama.cpp issues on an old workstation.md
index 3a89b7285..9227ad1b8 100644
--- a/github-data/discussions/384 - ik_llama.cpp issues on an old workstation.md
+++ b/github-data/discussions/384 - ik_llama.cpp issues on an old workstation.md
@@ -1,13 +1,14 @@
-### 🗣️ [#384](https://github.com/ikawrakow/ik_llama.cpp/discussions/384) - ik_llama.cpp issues on an old workstation
+## 🗣️ [Discussion #384](https://github.com/ikawrakow/ik_llama.cpp/discussions/384) - ik_llama.cpp issues on an old workstation
| **Author** | `matt23654` |
| :--- | :--- |
+| **State** | ✅ **Answered** |
| **Created** | 2025-05-06 |
| **Updated** | 2025-05-06 |
---
-#### Description
+## 📄 Description
Hi! So I have managed to get ubergarm's 235B quant to work on a 6 year old workstation with 2*2080TI's, 64GB RAM and a pretty fast (new) SSD.
@@ -50,9 +51,9 @@ Am I just doing something wrong or is there some genuine bug here?
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-05-06** at **11:31:27**:
+👤 **ikawrakow** commented on **2025-05-06** at **11:31:27**
Split mode "row" does not work for MoE models (and I'm not sure if it works for dense models as I don't have access to a multi-GPU system, so have not tested since forking). I'm pretty sure split mode "row" does not work for MoE models in mainline `llama.cpp` either.
@@ -70,7 +71,8 @@ Note that the tensor overrides are processed in the order they were defined on t
If the GPUs are different, then it may be better to just manually define with `-ot` which tensors go where.
-> 👤 **matt23654** replied the **2025-05-06** at **13:54:09**:
+> 👤 **matt23654** replied on **2025-05-06** at **13:54:09**
+>
> Hi @ikawrakow !
>
> No matter what I do ``-sm layer`` just doesnt seem to work with 2 devices. A variation of your first command segfaults:
@@ -100,8 +102,9 @@ If the GPUs are different, then it may be better to just manually define with `-
> ```
>
> I don't know why it wants to allocate such a huge amount of memory. It doesn't do that with one device or with ``-sm row`` (as mentioned row doesn't work if I try to put any MoE expert tensors on the GPUs).
+
+> 👤 **ubergarm** replied on **2025-05-06** at **13:57:01**
>
-> 👤 **ubergarm** replied the **2025-05-06** at **13:57:01**:
> @matt23654
>
> First I'm not sure where this came from but a lot of folks keep using `-ot "^blk\.[3-9]\.ffn_.*_exps\.=CPU"` which misses some other ffn layers without the `exps` as the naming convention on Qwen3 is a bit different than DeepSeek for example.
@@ -112,21 +115,25 @@ If the GPUs are different, then it may be better to just manually define with `-
> Look here for more discussions and examples: https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF/discussions/1#681642d4a383b2fb9aa3bd8c
>
> Keep us posted how you get along, as some others have reported success with multi-gpu once they get the arguments just right for their specific systems!
+
+> 👤 **matt23654** replied on **2025-05-06** at **15:19:56**
>
-> 👤 **matt23654** replied the **2025-05-06** at **15:19:56**:
> Thanks @ubergarm ! For some reason ``-DGGML_SCHED_MAX_COPIES=1`` works and it no longer tries allocating 170GB of VRAM. I'm getting ~15 tok/s PP and ~6 tok/s generation. Not too bad really for a very old computer offloading from SSD! Specs: i9-9940X, 64GB quad channel ram, 2*2080Ti. I also offloaded all the ffn tensors as suggested.
>
> I'm guessing that I can't really expect to get a lot of PP speed with SSD offloading and an old CPU (i9-9940X)?
+
+> 👤 **ikawrakow** replied on **2025-05-06** at **16:32:43**
>
-> 👤 **ikawrakow** replied the **2025-05-06** at **16:32:43**:
> @matt23654 I'm curious what happens if you add `-rtr` to your command line. Model loading will take longer, but possibly this will improve your PP performance (PP being only 2.5 times faster than TG does not sound right).
+
+> 👤 **matt23654** replied on **2025-05-06** at **19:59:06**
>
-> 👤 **matt23654** replied the **2025-05-06** at **19:59:06**:
> @ikawrakow So there definitely looks to be something a bit wierd going on, maybe because of the SSD, but ``-rtr`` didn't really change PP speed. I've also tried compiling with OpenBLAS, but that somehow seems to have made it slower (yay!).
>
> The CPU is less active during PP than during regular inference, so I can only assume that somehow the SSD is bottlenecking it. The SSD bandwidth on its own should only be about 0.5tok/s peak, I think the reason generation is so fast is that Qwen isn't choosing experts uniformly and so the kernel caching is making it far closer to the quad-channel ram speed instead. That's my theory, anyway.
+
+> 👤 **ubergarm** replied on **2025-05-06** at **20:44:40**
>
-> 👤 **ubergarm** replied the **2025-05-06** at **20:44:40**:
> You might be able to get some more out of it, not sure your what your final command was, but give this a try:
> ```
> # do *not* use BLAS and set -DGGML_SCHED_MAX_COPIES=1
diff --git a/github-data/discussions/385 - Qwen3 235B performance on Intel Xeon Scalable processor.md b/github-data/discussions/385 - Qwen3 235B performance on Intel Xeon Scalable processor.md
index c1cf5ef80..558568b6d 100644
--- a/github-data/discussions/385 - Qwen3 235B performance on Intel Xeon Scalable processor.md
+++ b/github-data/discussions/385 - Qwen3 235B performance on Intel Xeon Scalable processor.md
@@ -1,13 +1,14 @@
-### 🗣️ [#385](https://github.com/ikawrakow/ik_llama.cpp/discussions/385) - Qwen3 235B performance on Intel Xeon Scalable processor
+## 🗣️ [Discussion #385](https://github.com/ikawrakow/ik_llama.cpp/discussions/385) - Qwen3 235B performance on Intel Xeon Scalable processor
| **Author** | `Gaolingx` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-06 |
| **Updated** | 2025-05-27 |
---
-#### Description
+## 📄 Description
## Introduction
@@ -23,7 +24,7 @@ The Qwen3 models were officially released on 29th, April, 2025. This is a mixtur
- Number of Activated Experts: 8
- Context Length: 32,768 natively and 131,072 tokens with YaRN.
-The qwen3moe had supported in in PR #355, I tried to run the biggest model [Qwen3-235B-A22B-128K-GGUF](https://hf-mirror.com/unsloth/Qwen3-235B-A22B-128K-GGUF) with ik_llama.cpp on my Workstation, I need better generation quality, an my system has sufficient memory(Total 512G RAM), so I chose the relatively higher quality quantization `Q8_0`.
+The qwen3moe had supported in in PR [#355](https://github.com/ikawrakow/ik_llama.cpp/issues/355), I tried to run the biggest model [Qwen3-235B-A22B-128K-GGUF](https://hf-mirror.com/unsloth/Qwen3-235B-A22B-128K-GGUF) with ik_llama.cpp on my Workstation, I need better generation quality, an my system has sufficient memory(Total 512G RAM), so I chose the relatively higher quality quantization `Q8_0`.
## System Info
@@ -158,9 +159,9 @@ I also use `Intel VTune Profiler 2025.0.1` capture some interesting data when ru
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-05-06** at **13:11:51**:
+👤 **ikawrakow** commented on **2025-05-06** at **13:11:51**
Thank you for these results. Quite amazing that it works reasonably well on an almost 8 years old CPU!
@@ -172,7 +173,8 @@ This shouldn't take very long, even for the 235B model.
Another note: at least on the CPUs that I have available, one gets better performance using `q8_0` KV cache (add `-ctk q8_0 -ctv q8_0` to the command line). Not so much for short contexts, but quite noticeable for long contexts.
-> 👤 **saood06** replied the **2025-05-06** at **20:29:54**:
+> 👤 **saood06** replied on **2025-05-06** at **20:29:54**
+>
> > Another note: at least on the CPUs that I have available, one gets better performance using `q8_0` KV cache (add `-ctk q8_0 -ctv q8_0` to the command line). Not so much for short contexts, but quite noticeable for long contexts.
>
> I have seen this https://www.reddit.com/r/LocalLLaMA/comments/1kewkno/qwen_30b_a3b_performance_degradation_with_kv/ where they report using `q8_0` KV cache causes the model to not able to solve a problem with a comment saying:
@@ -180,13 +182,14 @@ Another note: at least on the CPUs that I have available, one gets better perfor
> KV cache q8_0: 0/5
> KV cache f16: 2/2
> ```
+
+> 👤 **Gaolingx** replied on **2025-05-07** at **07:16:13**
>
-> 👤 **Gaolingx** replied the **2025-05-07** at **07:16:13**:
> Ok, Thanks for the info. I found that the memory bandwidth was not filled when I use vtune profiler analysis the memory access, Maybe numa system works in Linux better, I will try to use `numactl` changes the memory policy ([https://github.com/ggml-org/llama.cpp/issues/1437](https://github.com/ggml-org/llama.cpp/issues/1437)), and repack the model with `q8_0_r8`. I will see if I can do better yet however.
---
-👤 **Gaolingx** replied the **2025-05-07** at **18:42:39**:
+👤 **Gaolingx** commented on **2025-05-07** at **18:42:39**
Note: when I run llama-server with `-fa` and `-rtr` parameter, the speed is a little faster than only use `-fa`, the prefill and decode are increased, That is a good beginning!
@@ -206,7 +209,7 @@ INFO [ print_timings] total time = 35864.19 ms | tid="4682
---
-👤 **ikawrakow** replied the **2025-05-08** at **12:59:17**:
+👤 **ikawrakow** commented on **2025-05-08** at **12:59:17**
@saood06
@@ -230,7 +233,8 @@ This grabbed my attention as I have never seen any significant difference betwee
Hence, I think that the outcome is largely determined by the quality of the quantized model and by some luck. We know that in a random process (as we have here) slight differences in the computed token probabilities can make the model go on a very different path, even if the same seed was used.
-> 👤 **saood06** replied the **2025-05-08** at **22:40:13**:
+> 👤 **saood06** replied on **2025-05-08** at **22:40:13**
+>
> >So, being someone who does not take thinks for granted, I tried it myself.
>
> Thank you. Do you mind saying what sampler settings you used?
@@ -247,7 +251,7 @@ Hence, I think that the outcome is largely determined by the quality of the quan
---
-👤 **Gaolingx** replied the **2025-05-13** at **00:52:27**:
+👤 **Gaolingx** commented on **2025-05-13** at **00:52:27**
Note: qwen3moe uses 8 experts by default. I found that we can speed up token generation(2.7 token/s->3.2 token/s) by reducing some experts used (from Top-8 to Top-6), without a significant drop in quality.
@@ -260,15 +264,17 @@ INFO [ print_timings] generation eval time = 15317.10 ms / 50 run
INFO [ print_timings] total time = 25677.19 ms | tid="71476" timestamp=1747096864 id_slot=0 id_task=9696 t_prompt_processing=10360.092 t_token_generation=15317.103 t_total=25677.195
```
-> 👤 **saood06** replied the **2025-05-13** at **01:03:32**:
+> 👤 **saood06** replied on **2025-05-13** at **01:03:32**
+>
> > Note: qwen3moe uses 8 experts by default. I found that we can speed up token generation(2.7 token/s->3.2 token/s) by reducing some experts used (from Top-8 to Top-6), without a significant drop in quality.
>
> There is this feature: https://github.com/ikawrakow/ik_llama.cpp/pull/239 I personally haven't had much success using it (for Deepseek V3/R1) , but it may work for you on Qwen.
+
+> 👤 **Gaolingx** replied on **2025-05-13** at **01:45:22**
>
-> 👤 **Gaolingx** replied the **2025-05-13** at **01:45:22**:
> > > Note: qwen3moe uses 8 experts by default. I found that we can speed up token generation(2.7 token/s->3.2 token/s) by reducing some experts used (from Top-8 to Top-6), without a significant drop in quality.
> >
-> > There is this feature: #239 I personally haven't had much success using it (for Deepseek V3/R1) , but it may work for you on Qwen.
+> > There is this feature: [#239](https://github.com/ikawrakow/ik_llama.cpp/issues/239) I personally haven't had much success using it (for Deepseek V3/R1) , but it may work for you on Qwen.
>
> All right, it seems that `--smart-expert-reduction` not works well on qwen3moe, there are a lot of garbled characters appeared and continuous output appeared.
>
@@ -277,14 +283,16 @@ INFO [ print_timings] total time = 25677.19 ms | tid="7147
>
> `--flash-attn --run-time-repack --smart-expert-reduction 7,1`
> 
+
+> 👤 **ikawrakow** replied on **2025-05-13** at **12:35:23**
>
-> 👤 **ikawrakow** replied the **2025-05-13** at **12:35:23**:
-> Can you both try PR #415 and let me know if it now works? Thanks!
+> Can you both try PR [#415](https://github.com/ikawrakow/ik_llama.cpp/issues/415) and let me know if it now works? Thanks!
+
+> 👤 **Gaolingx** replied on **2025-05-14** at **01:42:24**
>
-> 👤 **Gaolingx** replied the **2025-05-14** at **01:42:24**:
-> > Can you both try PR #415 and let me know if it now works? Thanks!
+> > Can you both try PR [#415](https://github.com/ikawrakow/ik_llama.cpp/issues/415) and let me know if it now works? Thanks!
>
-> yes, I pulled PR(#415 ), The smart expert reduction works very well on cpu backend, thank you fix it.
+> yes, I pulled PR([#415](https://github.com/ikawrakow/ik_llama.cpp/issues/415) ), The smart expert reduction works very well on cpu backend, thank you fix it.
> 
>
> `--flash-attn --run-time-repack --smart-expert-reduction 6,1`
@@ -297,7 +305,7 @@ INFO [ print_timings] total time = 25677.19 ms | tid="7147
---
-👤 **VinnyG9** replied the **2025-05-19** at **15:30:30**:
+👤 **VinnyG9** commented on **2025-05-19** at **15:30:30**
you forgot to set -nkvo?
what snoop mode you're using for numa?
@@ -318,11 +326,12 @@ here's some numbers on the xeon v4 @Q2KL
---
-👤 **ikawrakow** replied the **2025-05-19** at **15:38:58**:
+👤 **ikawrakow** commented on **2025-05-19** at **15:38:58**
You cannot compare `Q2_K` to `Q8_0` for TG, there is going to be a factor in the range of 3X difference. Her PP is for a short prompt, and we don't know if it was a single prompt of 165 tokens or 10 prompts with 16 tokens each.
-> 👤 **VinnyG9** replied the **2025-05-19** at **15:48:34**:
+> 👤 **VinnyG9** replied on **2025-05-19** at **15:48:34**
+>
> > You cannot compare `Q2_K` to `Q8_0` for TG, there is going to be a factor in the range of 3X difference. Her PP is for a short prompt, and we don't know if it was a single prompt of 165 tokens or 10 prompts with 16 tokens each.
>
> or 2.5x going by model size :)
@@ -333,7 +342,7 @@ You cannot compare `Q2_K` to `Q8_0` for TG, there is going to be a factor in th
---
-👤 **Gaolingx** replied the **2025-05-27** at **13:06:54**:
+👤 **Gaolingx** commented on **2025-05-27** at **13:06:54**
Well, I use `-ser 4,1` parameter to improve token generation(TG) performance, now we can get ~4.1 token/s TG(< 4k context size), and the
quality not declined too much. all right, I admit this is just my opinion. Others can offer their own opinions on this point...We don't know what will happen in complex tasks...
diff --git a/github-data/discussions/393 - Creating quantized models.md b/github-data/discussions/393 - Creating quantized models.md
index 7fe17341d..ec750d19b 100644
--- a/github-data/discussions/393 - Creating quantized models.md
+++ b/github-data/discussions/393 - Creating quantized models.md
@@ -1,13 +1,14 @@
-### 🗣️ [#393](https://github.com/ikawrakow/ik_llama.cpp/discussions/393) - Creating quantized models
+## 🗣️ [Discussion #393](https://github.com/ikawrakow/ik_llama.cpp/discussions/393) - Creating quantized models
| **Author** | `nux` |
| :--- | :--- |
+| **State** | ❌ **Closed** |
| **Created** | 2025-05-07 |
| **Updated** | 2025-05-29 |
---
-#### Description
+## 📄 Description
Hello,
diff --git a/github-data/discussions/395 - Why does imatrix not tokenize special tokens_.md b/github-data/discussions/395 - Why does imatrix not tokenize special tokens.md
similarity index 87%
rename from github-data/discussions/395 - Why does imatrix not tokenize special tokens_.md
rename to github-data/discussions/395 - Why does imatrix not tokenize special tokens.md
index 66ebacc30..2d6c72cd9 100644
--- a/github-data/discussions/395 - Why does imatrix not tokenize special tokens_.md
+++ b/github-data/discussions/395 - Why does imatrix not tokenize special tokens.md
@@ -1,13 +1,14 @@
-### 🗣️ [#395](https://github.com/ikawrakow/ik_llama.cpp/discussions/395) - Why does imatrix not tokenize special tokens?
+## 🗣️ [Discussion #395](https://github.com/ikawrakow/ik_llama.cpp/discussions/395) - Why does imatrix not tokenize special tokens?
| **Author** | `bartowski1182` |
| :--- | :--- |
+| **State** | ❌ **Closed** |
| **Created** | 2025-05-07 |
| **Updated** | 2025-05-09 |
---
-#### Description
+## 📄 Description
Recently there's been some discussion (and I've also experimented slightly) around adding chat tokens to the imatrix dataset and tokenizing them, a change from the default behaviour, so I was curious why the original implementation avoided tokenizing them
@@ -15,9 +16,9 @@ Was it just an arbitrary decision or was there a reason at the time?
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-05-08** at **05:21:04**:
+👤 **ikawrakow** commented on **2025-05-08** at **05:21:04**
When the `imatrix` tool was written handling of chat, special tokens, etc., was extremely immature/non-existent in `llama.cpp` . If you look at the `llama_tokenize` function in `common` that is being used by the `imatrix` tool to tokenize the calibration data, you will see that the `parse_special` argument was added well after the `imatrix` tool was merged. It was added with a default value of `false`, so that defined the `imatrix` tool behavior with special tokens as this argument is missing in the `imatrix` call to `::lama_tokenize`. By the time `llama_tokenize` got the ability to parse special tokens I had left the `llama.cpp` project, so somebody else needed to notice, investigate, and possibly change.
@@ -27,30 +28,34 @@ In any case, it would be interesting to see if including special tokens, using n
---
-👤 **ikawrakow** replied the **2025-05-09** at **08:46:05**:
+👤 **ikawrakow** commented on **2025-05-09** at **08:46:05**
@bartowski1182 I see you submitted [this PR](https://github.com/ggml-org/llama.cpp/pull/13389) in mainline.
You are welcome.
-> 👤 **bartowski1182** replied the **2025-05-09** at **12:33:00**:
+> 👤 **bartowski1182** replied on **2025-05-09** at **12:33:00**
+>
> Ah did I not send that reply here first? Sorry, I had one typed up
>
> That makes perfect sense though! Do you think you'd want the same thing here? Was planning to open one up in each assuming it made sense, it seems like a nice idea for A/B testing anyways, but figured I'd double check with the original architect that there wasn't something glaringly obvious I was missing
>
> Thanks again for the input!
+
+> 👤 **bartowski1182** replied on **2025-05-09** at **12:42:35**
>
-> 👤 **bartowski1182** replied the **2025-05-09** at **12:42:35**:
> Truly did not mean to just grab knowledge and run, that's a terrible look, hence I meant to ask if I could contribute the same here so that it wouldn't just be a one-sided deal (not that it's a complex change from me, but just the principle of it, it's not in good taste to open a discussion, get your insight, and run to mainline without saying anything, that isn't my style but it's exactly what I did in this case)
+
+> 👤 **ikawrakow** replied on **2025-05-09** at **12:42:53**
>
-> 👤 **ikawrakow** replied the **2025-05-09** at **12:42:53**:
> > Do you think you'd want the same thing here?
>
> Most people are using mainline `llama.cpp` to compute imatrix data, so it is not critical to have this here.
>
> I'm waiting to see if the mainline developers will independently discover what's wrong with the imatrix calculation after their change to support MLA. After they have independently discovered it, or when enough time has passed, I'll make the change here, and at that point I can also put in the ability to use special tokens. Do you hear complains from users about reduced model quality after the MLA change?
+
+> 👤 **bartowski1182** replied on **2025-05-09** at **12:47:29**
>
-> 👤 **bartowski1182** replied the **2025-05-09** at **12:47:29**:
> > Do you hear complains from users about reduced model quality after the MLA change
>
> No I didn't hear anything about that yet, but MLA has its own can of worms with speed so I had personally been avoiding remaking those models that have MLA since, hoping for a resolution...
@@ -60,20 +65,23 @@ You are welcome.
> Without looking directly at your commit history I doubt anyone in mainline will figure it out, but who knows
>
> I do know that I like your algorithm for some semi incomplete experts, seems reasonable to have some wiggle room there, especially if after 200k tokens of imatrix it's still not being activated quite enough
+
+> 👤 **ikawrakow** replied on **2025-05-09** at **12:48:22**
>
-> 👤 **ikawrakow** replied the **2025-05-09** at **12:48:22**:
> > Truly did not mean to just grab knowledge and run, that's a terrible look, hence I meant to ask if I could contribute the same here so that it wouldn't just be a one-sided deal (not that it's a complex change from me, but just the principle of it, it's not in good taste to open a discussion, get your insight, and run to mainline without saying anything, that isn't my style but it's exactly what I did in this case)
>
> No worries. I know you are not free to mention my name in the mainline repository, else your PR will have the same fate as [that one](https://github.com/ggml-org/llama.cpp/pull/12727)
+
+> 👤 **bartowski1182** replied on **2025-05-09** at **12:55:14**
>
-> 👤 **bartowski1182** replied the **2025-05-09** at **12:55:14**:
> > else your PR will have the same fate as that one
>
> I'd *like* to think that's not the reason, but rather the annoying complexity level of that function in general and excitement for a new feature (though the feature does miss out on an important part, counting discrete layers ahead of time and applying variable quantization automatically..)
>
> But who knows, it's not my drama to unpack, so much as I wish we could all get along in a nice Kumbaya circle and contribute to the open world together, I know I'm naive ;)
+
+> 👤 **ikawrakow** replied on **2025-05-09** at **13:03:17**
>
-> 👤 **ikawrakow** replied the **2025-05-09** at **13:03:17**:
> It has never been the style of the `llama.cpp` project to wait for the perfect solution before merging a useful change.
>
> Your PR is immensely helpful to anyone using mainline `llama.cpp` and making their own quantized MoE models.
diff --git a/github-data/discussions/396 - Best settings for Maverick - Dual CPU Xeon 8480_ - RTX 3090.md b/github-data/discussions/396 - Best settings for Maverick - Dual CPU Xeon 8480 - RTX 3090.md
similarity index 79%
rename from github-data/discussions/396 - Best settings for Maverick - Dual CPU Xeon 8480_ - RTX 3090.md
rename to github-data/discussions/396 - Best settings for Maverick - Dual CPU Xeon 8480 - RTX 3090.md
index 319091797..dc815fbb7 100644
--- a/github-data/discussions/396 - Best settings for Maverick - Dual CPU Xeon 8480_ - RTX 3090.md
+++ b/github-data/discussions/396 - Best settings for Maverick - Dual CPU Xeon 8480 - RTX 3090.md
@@ -1,13 +1,14 @@
-### 🗣️ [#396](https://github.com/ikawrakow/ik_llama.cpp/discussions/396) - Best settings for Maverick - Dual CPU Xeon 8480+ - RTX 3090
+## 🗣️ [Discussion #396](https://github.com/ikawrakow/ik_llama.cpp/discussions/396) - Best settings for Maverick - Dual CPU Xeon 8480+ - RTX 3090
| **Author** | `justinjja` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-07 |
| **Updated** | 2025-05-08 |
---
-#### Description
+## 📄 Description
With a single 8480+ and a 3090 I get excellent speeds ~40 T/s on Maverick
After installing a second cpu and another 8 sticks of ram I cant get good speeds.
@@ -24,9 +25,9 @@ llama-server -m Maverick-UD-IQ4_XS.gguf -c 32000 -fa -fmoe -amb 512 -rtr -ctk q
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **justinjja** replied the **2025-05-08** at **01:11:10**:
+👤 **justinjja** commented on **2025-05-08** at **01:11:10**
Small update,
@@ -40,7 +41,7 @@ Still no luck finding settings that actually both cpus.
---
-👤 **ikawrakow** replied the **2025-05-08** at **08:26:39**:
+👤 **ikawrakow** commented on **2025-05-08** at **08:26:39**
There have been a lot of discussions around the Internet about `llama.cpp` performance on dual-socket systems, and the conclusion appears to be that the best one can do is to just use one physical CPU.
diff --git a/github-data/discussions/397 - KV split while using _-sm row_.md b/github-data/discussions/397 - KV split while using -sm row.md
similarity index 95%
rename from github-data/discussions/397 - KV split while using _-sm row_.md
rename to github-data/discussions/397 - KV split while using -sm row.md
index c95c7d49b..79aacfc89 100644
--- a/github-data/discussions/397 - KV split while using _-sm row_.md
+++ b/github-data/discussions/397 - KV split while using -sm row.md
@@ -1,13 +1,14 @@
-### 🗣️ [#397](https://github.com/ikawrakow/ik_llama.cpp/discussions/397) - KV split while using `-sm row`
+## 🗣️ [Discussion #397](https://github.com/ikawrakow/ik_llama.cpp/discussions/397) - KV split while using `-sm row`
| **Author** | `pt13762104` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-08 |
| **Updated** | 2025-05-08 |
---
-#### Description
+## 📄 Description
I have found that ik_llama.cpp does NOT support kv-split while using `-sm row`, which is a limitation compared to llama.cpp. Is there any way to do this or it's just not implemented yet?
Example output:
@@ -147,13 +148,14 @@ INFO [ update_slots] all slots are idle | tid="137884088823808" times
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-05-08** at **08:08:16**:
+👤 **ikawrakow** commented on **2025-05-08** at **08:08:16**
I have never looked into splitting the KV cache when using `-sm row`, so the behavior is whatever the behavior of `llama.cpp` was when I forked last year.
Out of curiosity: does `-sm row` give you a better performance compared to `-sm layer` ?
-> 👤 **pt13762104** replied the **2025-05-08** at **08:36:42**:
+> 👤 **pt13762104** replied on **2025-05-08** at **08:36:42**
+>
> Yes. About 1.5x better
\ No newline at end of file
diff --git a/github-data/discussions/399 - Qwen 30b.A3b IK_LCPP comparisons on lowspec machine.md b/github-data/discussions/399 - Qwen 30b.A3b IKLCPP comparisons on lowspec machine.md
similarity index 95%
rename from github-data/discussions/399 - Qwen 30b.A3b IK_LCPP comparisons on lowspec machine.md
rename to github-data/discussions/399 - Qwen 30b.A3b IKLCPP comparisons on lowspec machine.md
index 072e8c075..069121a32 100644
--- a/github-data/discussions/399 - Qwen 30b.A3b IK_LCPP comparisons on lowspec machine.md
+++ b/github-data/discussions/399 - Qwen 30b.A3b IKLCPP comparisons on lowspec machine.md
@@ -1,13 +1,14 @@
-### 🗣️ [#399](https://github.com/ikawrakow/ik_llama.cpp/discussions/399) - Qwen 30b.A3b IK/LCPP comparisons on lowspec machine
+## 🗣️ [Discussion #399](https://github.com/ikawrakow/ik_llama.cpp/discussions/399) - Qwen 30b.A3b IK/LCPP comparisons on lowspec machine
| **Author** | `fizzAI` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-09 |
| **Updated** | 2025-05-14 |
---
-#### Description
+## 📄 Description
Hi! Recently (as in, I finished 5 minutes ago) I got curious as-to how fast my shitbox (for AI use anyways) can run.
Honestly, pretty fast! But the main thing here is the comparison between LCPP and IK_LCPP, and (un)surprisingly mainline LCPP gets pretty hosed.
@@ -95,15 +96,16 @@ I still need to try dense models, CPU without offload, etc etc for this to be a
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **VinnyG9** replied the **2025-05-14** at **12:05:43**:
+👤 **VinnyG9** commented on **2025-05-14** at **12:05:43**
> * I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with [Blis](https://github.com/flame/blis).
if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu, but not relevant unless you're using -nkvo ?
-> 👤 **ikawrakow** replied the **2025-05-14** at **12:29:26**:
+> 👤 **ikawrakow** replied on **2025-05-14** at **12:29:26**
+>
> > if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu.
>
> No, it does not. This is `ik_llama.cpp` not `llama.cpp`. I wrote the matrix multiplication implementation for almost all quants in `llamafile` and for all quants here, so I know that what I have here is faster than llamafile.
\ No newline at end of file
diff --git a/github-data/discussions/401 - install bitnet _or other cpu models_ on a fresh termux aarch64.md b/github-data/discussions/401 - install bitnet or other cpu models on a fresh termux aarch64.md
similarity index 82%
rename from github-data/discussions/401 - install bitnet _or other cpu models_ on a fresh termux aarch64.md
rename to github-data/discussions/401 - install bitnet or other cpu models on a fresh termux aarch64.md
index af2cfcbe0..1779adb0f 100644
--- a/github-data/discussions/401 - install bitnet _or other cpu models_ on a fresh termux aarch64.md
+++ b/github-data/discussions/401 - install bitnet or other cpu models on a fresh termux aarch64.md
@@ -1,13 +1,14 @@
-### 🗣️ [#401](https://github.com/ikawrakow/ik_llama.cpp/discussions/401) - install bitnet (or other cpu models) on a fresh termux aarch64
+## 🗣️ [Discussion #401](https://github.com/ikawrakow/ik_llama.cpp/discussions/401) - install bitnet (or other cpu models) on a fresh termux aarch64
| **Author** | `Benjamin-Wegener` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-09 |
-| **Updated** | 2025-06-21 |
+| **Updated** | 2025-07-22 |
---
-#### Description
+## 📄 Description
just for convenience all subsequential commands to install bitnet (or other cpu models) on a fresh termux aarch64:
```bash
@@ -33,20 +34,21 @@ reverted to old prompt template
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **VinnyG9** replied the **2025-05-14** at **12:07:00**:
+👤 **VinnyG9** commented on **2025-05-14** at **12:07:00**
what is a termux?
-> 👤 **saood06** replied the **2025-05-14** at **12:25:00**:
+> 👤 **saood06** replied on **2025-05-14** at **12:25:00**
+>
> > what is a termux?
>
> Android terminal emulator: https://termux.dev/en/
---
-👤 **Benjamin-Wegener** replied the **2025-05-15** at **14:23:33**:
+👤 **Benjamin-Wegener** commented on **2025-05-15** at **14:23:33**
using the built in llama-server standard and pasting that in prompt template field to get correct chat format
<|begin_of_text|>{{prompt}}<|eot_id|>
@@ -54,7 +56,8 @@ using the built in llama-server standard and pasting that in prompt template fie
{{history}}
{{char}}:
-> 👤 **saood06** replied the **2025-05-16** at **06:01:00**:
+> 👤 **saood06** replied on **2025-05-16** at **06:01:00**
+>
> Just to be clear the proper template is:
>
> <|begin_of_text|>System: {system_message}<|eot_id|>
@@ -64,8 +67,9 @@ using the built in llama-server standard and pasting that in prompt template fie
> Assistant: {assistant_message_2}<|eot_id|>
>
> It's been a while since I've used the server's template field but my testing using an alternative front-end following this was successful.
+
+> 👤 **saood06** replied on **2025-05-18** at **12:42:54**
>
-> 👤 **saood06** replied the **2025-05-18** at **12:42:54**:
> @Benjamin-Wegener
>
> The template above is grabbed from the paper. It isn't what is meant to actually go into the template field under the server's built in front-end.
@@ -75,8 +79,9 @@ using the built in llama-server standard and pasting that in prompt template fie
> Even when I used the bundled front-end I still basically never used the "Chat" section where those fields existed. I used the completions section where I would manually conform to a template, but I can see why on a mobile device the Chat endpoint would be far more convenient.
>
> Also I have uploaded already converted models [here](https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF) which might be useful if space is limited (the actual time to convert is minor for this model so unlike other models that benefit doesn't exist for it).
+
+> 👤 **RobertAgee** replied on **2025-05-18** at **12:59:53**
>
-> 👤 **RobertAgee** replied the **2025-05-18** at **12:59:53**:
> FWIW, once i got the server running, I was able to confirm it was working with this curl request. Alternatively, you could send this like a regular JSON webhook of course:
>
> ```
@@ -93,8 +98,9 @@ using the built in llama-server standard and pasting that in prompt template fie
> Also, I was able to connect [ChatterUI's](https://github.com/Vali-98/ChatterUI) (free and oss) mobile app to my termux server with a config file and now I have a superfast, local, AI with TTS, chat interface, and convo history.
>
> Setting up the connection took me awhile to figure out, so if anyone's interested, I'll share the config file and settings. But yeah, all things said Bitnet is rough but shows promise. Would love to try out an abliterated version and Falcon 3 to see if either of those would help it have a little more conversational flow.
+
+> 👤 **Benjamin-Wegener** replied on **2025-05-18** at **13:44:35**
>
-> 👤 **Benjamin-Wegener** replied the **2025-05-18** at **13:44:35**:
> so we revert that back to what i posted earlier for the server? what do you think?
>
> ```
@@ -107,31 +113,37 @@ using the built in llama-server standard and pasting that in prompt template fie
---
-👤 **RobertAgee** replied the **2025-05-16** at **05:26:44**:
+👤 **RobertAgee** commented on **2025-05-16** at **05:26:44**
Didn't work for me in my case. Stayed hung up at compilation forever

-> 👤 **ikawrakow** replied the **2025-05-16** at **05:30:51**:
-> You have to be patient. The file is 18k LOC of heavily templated C++ code. It takes a while to compile even on a fast desktop CPU. I know it needs to get refactored into multiple files (#183), but I haven't come around to do it.
+> 👤 **ikawrakow** replied on **2025-05-16** at **05:30:51**
+>
+> You have to be patient. The file is 18k LOC of heavily templated C++ code. It takes a while to compile even on a fast desktop CPU. I know it needs to get refactored into multiple files ([#183](https://github.com/ikawrakow/ik_llama.cpp/issues/183)), but I haven't come around to do it.
+
+> 👤 **ikawrakow** replied on **2025-05-16** at **06:21:47**
>
-> 👤 **ikawrakow** replied the **2025-05-16** at **06:21:47**:
> Just measured: it takes 2 minutes on my M2-Max CPU to compile this file. Based on this, my guess is that it is in the 5-10 minutes range on a phone.
+
+> 👤 **saood06** replied on **2025-05-16** at **06:26:21**
>
-> 👤 **saood06** replied the **2025-05-16** at **06:26:21**:
> > Just measured: it takes 2 minutes on my M2-Max CPU to compile this file. Based on this, my guess is that it is in the 5-10 minutes range on a phone.
>
> I feel like it took longer when I tested it, and the person reporting the clashing .so files reported around half an hour, but yes the solution is to just be patient.
+
+> 👤 **RobertAgee** replied on **2025-05-16** at **06:27:06**
>
-> 👤 **RobertAgee** replied the **2025-05-16** at **06:27:06**:
> I waited more than 10 minutes, without competing processes open. in htop, no rw was happening so there's something causing it to hang idk
+
+> 👤 **saood06** replied on **2025-05-16** at **06:29:17**
>
-> 👤 **saood06** replied the **2025-05-16** at **06:29:17**:
> > I waited more than 10 minutes, without competing processes open. in htop, no rw was happening so there's something causing it to hang idk
>
> But was there still CPU usage? Also if you don't mind sharing what device it was on it would help estimate how long it would take. ( I may be able to time a compile on the device I use to test Android on but that may be a while as I have to borrow that device).
+
+> 👤 **RobertAgee** replied on **2025-05-17** at **14:17:34**
>
-> 👤 **RobertAgee** replied the **2025-05-17** at **14:17:34**:
> Hi @saood06 I appreciate your patience and willingness to help. I have a Samsung a71 5g
>
> ```
@@ -143,8 +155,9 @@ Didn't work for me in my case. Stayed hung up at compilation forever
> ```
>
> I did get it to compile and successfully run with the new FA kernels OFF flag at the compilation step.
+
+> 👤 **saood06** replied on **2025-05-18** at **02:49:19**
>
-> 👤 **saood06** replied the **2025-05-18** at **02:49:19**:
> >Hi @saood06 I appreciate your patience and willingness to help
> >I did get it to compile and successfully run with the new FA kernels OFF flag at the compilation step.
>
@@ -152,13 +165,14 @@ Didn't work for me in my case. Stayed hung up at compilation forever
---
-👤 **ikawrakow** replied the **2025-05-17** at **08:24:16**:
+👤 **ikawrakow** commented on **2025-05-17** at **08:24:16**
You can now disable building the templated flash attention (FA) kernels. Disabling FA should massively improve build times.
-See PR #429
+See PR [#429](https://github.com/ikawrakow/ik_llama.cpp/issues/429)
-> 👤 **RobertAgee** replied the **2025-05-17** at **10:00:36**:
+> 👤 **RobertAgee** replied on **2025-05-17** at **10:00:36**
+>
> Thanks @ikawrakow for the fast PR! I was able to successfully get it running and make a call to get a response! :)
>
> For anyone in my situation, it did have a few what looked like errors in the console during the build process, but it was successful, as I said, so no worries. Here's the list of commands with the speed up (disabling flash attention kernels):
@@ -191,42 +205,46 @@ See PR #429
> "n_predict": 128,
> "stop": ["<|im_end|>"]
> }'
-> ```
---
-👤 **ikawrakow** replied the **2025-05-20** at **09:48:56**:
+👤 **ikawrakow** commented on **2025-05-20** at **09:48:56**
-There is now PR #435 that significantly reduces build time. I cannot test on Android myself, so would appreciate if someone did and reported
+There is now PR [#435](https://github.com/ikawrakow/ik_llama.cpp/issues/435) that significantly reduces build time. I cannot test on Android myself, so would appreciate if someone did and reported
* New vs old build time (with CPU model)
* Does it still work correctly?
* Is the inference performance affected?
-> 👤 **aezendc** replied the **2025-06-02** at **15:30:06**:
-> > There is now PR #435 that significantly reduces build time. I cannot test on Android myself, so would appreciate if someone did and reported
+> 👤 **aezendc** replied on **2025-06-02** at **15:30:06**
+>
+> > There is now PR [#435](https://github.com/ikawrakow/ik_llama.cpp/issues/435) that significantly reduces build time. I cannot test on Android myself, so would appreciate if someone did and reported
> >
> > * New vs old build time (with CPU model)
> > * Does it still work correctly?
> > * Is the inference performance affected?
>
> HI ikawrakow do we have a step by step running microsoft/bitnet-b1.58-2B-4T-gguf in windows?
+
+> 👤 **ikawrakow** replied on **2025-06-02** at **15:36:51**
>
-> 👤 **ikawrakow** replied the **2025-06-02** at **15:36:51**:
> There are no prebuild packages, so you need to follow the [above instructions](https://github.com/ikawrakow/ik_llama.cpp/discussions/401#discussioncomment-13178115) and build yourself. They don't work (with small adjustments)?
+
+> 👤 **aezendc** replied on **2025-06-02** at **15:45:42**
>
-> 👤 **aezendc** replied the **2025-06-02** at **15:45:42**:
> > There are no prebuild packages, so you need to follow the [above instructions](https://github.com/ikawrakow/ik_llama.cpp/discussions/401#discussioncomment-13178115) and build yourself. They don't work (with small adjustments)?
>
> I made it work I use [saood06](https://github.com/saood06) converted model https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF. I will create a basic commands
+
+> 👤 **saood06** replied on **2025-06-03** at **00:51:30**
>
-> 👤 **saood06** replied the **2025-06-03** at **00:51:30**:
> > do we have a step by step running microsoft/bitnet-b1.58-2B-4T-gguf in windows?
>
> There are build instructions with a lot more details for Windows [here](https://github.com/ikawrakow/ik_llama.cpp/blob/main/docs/build.md). Once it is built you can just grab the model either pre-converted one like [this](https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF) or convert one yourself and just launch server. Which is covered in the above instructions.
>
> It seems like you have already figured it out, but just wanted to link the Windows build instructions in case anyone else finds this and wants to follow along.
+
+> 👤 **aezendc** replied on **2025-06-03** at **03:34:32**
>
-> 👤 **aezendc** replied the **2025-06-03** at **03:34:32**:
> > > do we have a step by step running microsoft/bitnet-b1.58-2B-4T-gguf in windows?
> >
> > There are build instructions with a lot more details for Windows [here](https://github.com/ikawrakow/ik_llama.cpp/blob/main/docs/build.md). Once it is built you can just grab the model either pre-converted one like [this](https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF) or convert one yourself and just launch server. Which is covered in the above instructions.
@@ -234,27 +252,31 @@ There is now PR #435 that significantly reduces build time. I cannot test on And
> > It seems like you have already figured it out, but just wanted to link the Windows build instructions in case anyone else finds this and wants to follow along.
>
> Thanks for this @saood06 very helpful and a very detailed one. One thing I have a problem accessing the llama-server ui and its just keep loading.
+
+> 👤 **saood06** replied on **2025-06-03** at **07:11:46**
>
-> 👤 **saood06** replied the **2025-06-03** at **07:11:46**:
> > Thanks for this @saood06 very helpful and a very detailed one. One thing I have a problem accessing the llama-server ui and its just keep loading.
>
> Just to be sure, are you making sure to access the server using the port passed in when launching (or 8080 if not set as that is the default), and are you setting the host address (if needed) since it defaults to 127.0.0.1 (AKA localhost) which is only accessible on that machine.
+
+> 👤 **aezendc** replied on **2025-06-03** at **12:28:17**
>
-> 👤 **aezendc** replied the **2025-06-03** at **12:28:17**:
> > > Thanks for this @saood06 very helpful and a very detailed one. One thing I have a problem accessing the llama-server ui and its just keep loading.
> >
> > Just to be sure, are you making sure to access the server using the port passed in when launching (or 8080 if not set as that is the default), and are you setting the host address (if needed) since it defaults to 127.0.0.1 (AKA localhost) which is only accessible on that machine.
>
> i am using the default http://127.0.0.1:8080/ but somehow it works now. Thanks for the info
+
+> 👤 **aezendc** replied on **2025-06-04** at **14:40:21**
>
-> 👤 **aezendc** replied the **2025-06-04** at **14:40:21**:
> > > Thanks for this @saood06 very helpful and a very detailed one. One thing I have a problem accessing the llama-server ui and its just keep loading.
> >
> > Just to be sure, are you making sure to access the server using the port passed in when launching (or 8080 if not set as that is the default), and are you setting the host address (if needed) since it defaults to 127.0.0.1 (AKA localhost) which is only accessible on that machine.
>
> How you do make the the model to respond longer?
+
+> 👤 **saood06** replied on **2025-06-21** at **16:33:44**
>
-> 👤 **saood06** replied the **2025-06-21** at **16:33:44**:
> >How you do make the the model to respond longer?
>
> I don't have much specific advice for using this model. Beyond benchmarking and minor curiosity of the ability of a model this small, I haven't used it much.
@@ -266,6 +288,7 @@ There is now PR #435 that significantly reduces build time. I cannot test on And
> * add context specific details or changes to the prompt given
> * break the task apart and only allow it to respond to a fraction at a time
> * manually steer the model to avoid skipping or missing out on details (often is easier with a thinking model as you often only have to steer during thinking tokens).
+
+> 👤 **aezendc** replied on **2025-06-21** at **16:46:12**
>
-> 👤 **aezendc** replied the **2025-06-21** at **16:46:12**:
> I fix it now. The only problem of mine is the libomp.so build and I do not have a file of it. I set it the openmp off because libggml.so needs the libomp.so an when I build llama-server using windows and transfer the binaries to my android phone and the model is hallucinating.
\ No newline at end of file
diff --git a/github-data/discussions/403 - Tool Calling and Structured Response _Json Mode_ support.md b/github-data/discussions/403 - Tool Calling and Structured Response Json Mode support.md
similarity index 80%
rename from github-data/discussions/403 - Tool Calling and Structured Response _Json Mode_ support.md
rename to github-data/discussions/403 - Tool Calling and Structured Response Json Mode support.md
index 18765c1e1..f19cf7222 100644
--- a/github-data/discussions/403 - Tool Calling and Structured Response _Json Mode_ support.md
+++ b/github-data/discussions/403 - Tool Calling and Structured Response Json Mode support.md
@@ -1,13 +1,14 @@
-### 🗣️ [#403](https://github.com/ikawrakow/ik_llama.cpp/discussions/403) - Tool Calling and Structured Response (Json Mode) support
+## 🗣️ [Discussion #403](https://github.com/ikawrakow/ik_llama.cpp/discussions/403) - Tool Calling and Structured Response (Json Mode) support
| **Author** | `mtcl` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-10 |
| **Updated** | 2025-05-30 |
---
-#### Description
+## 📄 Description
Hey Team,
@@ -21,9 +22,9 @@ Thank you in advance!
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-05-10** at **08:30:16**:
+👤 **ikawrakow** commented on **2025-05-10** at **08:30:16**
Hey @mtcl,
@@ -31,14 +32,16 @@ we are a very small team, so cannot do everything that `llama.cpp` does. Hence,
Please enter a feature request in the Issues. I'll label it with "help wanted" and we will see what happens.
-> 👤 **mtcl** replied the **2025-05-10** at **08:33:02**:
+> 👤 **mtcl** replied on **2025-05-10** at **08:33:02**
+>
> No worries my friend. I have a workaround here that I've written.
>
> https://github.com/Teachings/FastAgentAPI
>
> It acts as a wrapper and get me by. Thank you for your hard work!
+
+> 👤 **cmoncure** replied on **2025-05-30** at **19:58:13**
>
-> 👤 **cmoncure** replied the **2025-05-30** at **19:58:13**:
> Before I try and get this running, can you educate me on the mechanics of tool calling within the LLM response? I understand that the LLM may request a call as part of its TG phase, and then the call runner injects the result into the LLM response. Is this correct?
>
> I have some questions about this. Suppose I want to ask the LLM a question about a long document.
@@ -51,6 +54,6 @@ Please enter a feature request in the Issues. I'll label it with "help wanted" a
---
-👤 **KCS-Mack** replied the **2025-05-18** at **22:28:59**:
+👤 **KCS-Mack** commented on **2025-05-18** at **22:28:59**
This is great, will give it a try!
\ No newline at end of file
diff --git a/github-data/discussions/434 - Quant Cookers Basic Guide.md b/github-data/discussions/434 - Quant Cookers Basic Guide.md
index b2c6081ed..9fb99fcd7 100644
--- a/github-data/discussions/434 - Quant Cookers Basic Guide.md
+++ b/github-data/discussions/434 - Quant Cookers Basic Guide.md
@@ -1,13 +1,14 @@
-### 🗣️ [#434](https://github.com/ikawrakow/ik_llama.cpp/discussions/434) - Quant Cookers Basic Guide
+## 🗣️ [Discussion #434](https://github.com/ikawrakow/ik_llama.cpp/discussions/434) - Quant Cookers Basic Guide
| **Author** | `ubergarm` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-18 |
| **Updated** | 2025-05-21 |
---
-#### Description
+## 📄 Description
Quant Cooking Basic Guide
===
@@ -318,15 +319,16 @@ model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **VinnyG9** replied the **2025-05-19** at **14:48:32**:
+👤 **VinnyG9** commented on **2025-05-19** at **14:48:32**
thanks for this, can you point me where can i read a description of:
-DGGML_RPC=OFF
--seed 1337
-> 👤 **ubergarm** replied the **2025-05-19** at **15:07:31**:
+> 👤 **ubergarm** replied on **2025-05-19** at **15:07:31**
+>
> > -DGGML_RPC=OFF
> > --seed 1337
>
@@ -335,8 +337,9 @@ thanks for this, can you point me where can i read a description of:
> > --seed 1337
>
> I set the same random seed, just for fun, across all of my measurements in a hopeful attempt to reduce differences due to entropy. Not sure if it really matters. [1337](https://www.urbandictionary.com/define.php?term=1337) is leet speek for [leet](https://www.urbandictionary.com/define.php?term=leet).
+
+> 👤 **VinnyG9** replied on **2025-05-21** at **03:42:57**
>
-> 👤 **VinnyG9** replied the **2025-05-21** at **03:42:57**:
> > > -DGGML_RPC=OFF
> > > --seed 1337
> >
diff --git a/github-data/discussions/451 - Context reuse _ context shift for long prompts.md b/github-data/discussions/451 - Context reuse context shift for long prompts.md
similarity index 85%
rename from github-data/discussions/451 - Context reuse _ context shift for long prompts.md
rename to github-data/discussions/451 - Context reuse context shift for long prompts.md
index 4dedbb984..7ae324549 100644
--- a/github-data/discussions/451 - Context reuse _ context shift for long prompts.md
+++ b/github-data/discussions/451 - Context reuse context shift for long prompts.md
@@ -1,19 +1,20 @@
-### 🗣️ [#451](https://github.com/ikawrakow/ik_llama.cpp/discussions/451) - Context reuse / context shift for long prompts
+## 🗣️ [Discussion #451](https://github.com/ikawrakow/ik_llama.cpp/discussions/451) - Context reuse / context shift for long prompts
| **Author** | `SamuelOliveirads` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-23 |
-| **Updated** | 2025-06-10 |
+| **Updated** | 2025-07-22 |
---
-#### Description
+## 📄 Description
Hi! — I'm coming from koboldcpp, and I've been testing this fork due to its optimizations.
One feature I found very useful in koboldcpp was the context shift functionality, which helps when working with very long context windows.
-I noticed that `llama.cpp` implemented something similar in [PR #9866](https://github.com/ggml-org/llama.cpp/pull/9866), which allows for reusing the prompt cache more efficiently instead of regenerating the entire prompt every time the context overflows.
+I noticed that `llama.cpp` implemented something similar in [PR [#9866](https://github.com/ikawrakow/ik_llama.cpp/issues/9866)](https://github.com/ggml-org/llama.cpp/pull/9866), which allows for reusing the prompt cache more efficiently instead of regenerating the entire prompt every time the context overflows.
I searched through this repo but couldn’t find an equivalent implementation.
@@ -31,46 +32,50 @@ Thanks!
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **mtcl** replied the **2025-05-30** at **16:47:09**:
+👤 **mtcl** commented on **2025-05-30** at **16:47:09**
This is a very useful usecase because of which I have been switching back and forth between ik_llama.cpp and llama.cpp. This works seamlessly with llama.cpp i have noticed. I always thought I am doing something wrong here and it is my user error, but apparantly it is not! Thank you for mentioning it here.
---
-👤 **cmoncure** replied the **2025-05-30** at **19:51:44**:
+👤 **cmoncure** commented on **2025-05-30** at **19:51:44**
This would be a massive win for me. Currently PP is the millstone around the neck (for which you have had to endure many of my ignorant comments in support of a solution).
KV Cache reuse and tool calling would open up whole new worlds.
-> 👤 **mtcl** replied the **2025-06-05** at **02:26:48**:
+> 👤 **mtcl** replied on **2025-06-05** at **02:26:48**
+>
> I agree 100% with you. Given that I built my own tool calling solution for ik_llama.cpp, at this point of time kv cache reuse would mean an instant switch for me to this!
---
-👤 **SamuelOliveirads** replied the **2025-06-03** at **21:52:10**:
+👤 **SamuelOliveirads** commented on **2025-06-03** at **21:52:10**
Glad to see that others are also interested in this feature! I was about to open an issue myself, but I noticed that @saood06 is already looking into something similar [here](https://github.com/ikawrakow/ik_llama.cpp/issues/455#issuecomment-2917718499) — so now it’s just a matter of waiting.
By the way, @saood06, if you need any help with testing, I’d be happy to assist.
-> 👤 **saood06** replied the **2025-06-06** at **09:16:14**:
+> 👤 **saood06** replied on **2025-06-06** at **09:16:14**
+>
> Since there does seem to be demand, and people waiting, I'll provide an update which explains what my plan is (and the benefits, but also the limitations), and the current status.
>
> The goal is to create a new mechanism where if enabled a [trie](https://en.wikipedia.org/wiki/Trie) of all processed tokens is kept that can be saved and restored to a file. This should allow you to keep every explored branch of a session (or multiple if you share a large initial prompt between sessions) with the least amount of space and no quality loss.
>
> This may only be viable on MLA models as they are extremely light for KV cache, and this method does not degrade quality like chunking or shifting, but for that reason this does not handle the common case of shifting the cache when you want to remove the thought tokens without having to reprocess as there is no way to do that without losing (at least some) quality.
>
-> I was stalled because of #436 but now that saving and loading works I am now unblocked, but this still seems like a large undertaking and may take some time.
+> I was stalled because of [#436](https://github.com/ikawrakow/ik_llama.cpp/issues/436) but now that saving and loading works I am now unblocked, but this still seems like a large undertaking and may take some time.
>
> I may end up porting the chunk/shift method (or @cmoncure is welcome to do it) anyway (even before I finish), since as I said they have different tradeoffs, but integrating the two fully as nice as it sounds (which would let you be able to chunk and shift from the trie) seems way too difficult.
+
+> 👤 **cmoncure** replied on **2025-06-06** at **15:16:33**
>
-> 👤 **cmoncure** replied the **2025-06-06** at **15:16:33**:
> Do you have any insight into the nature or mechanism behind the quality loss with chunking?
+
+> 👤 **ikawrakow** replied on **2025-06-06** at **15:29:13**
>
-> 👤 **ikawrakow** replied the **2025-06-06** at **15:29:13**:
> Are we talking about the `llama.cpp` feature (taken from kobold.cpp) where if I have
> ```
> aaaaccccbbbb
@@ -90,15 +95,17 @@ By the way, @saood06, if you need any help with testing, I’d be happy to assis
> The existing KV cache, despite context shifting and all that, will be heavily biased towards "brilliant", "amazing" and such.
>
> Do you see the problem? You cannot undo the impact of the skipped tokens by just changing the position encoding via RoPE.
+
+> 👤 **saood06** replied on **2025-06-06** at **15:41:47**
>
-> 👤 **saood06** replied the **2025-06-06** at **15:41:47**:
> > Are we talking about the `llama.cpp` feature (taken from kobold.cpp) where if I have
>
> Yes that is what we are talking about. Thank you for the very clear example (so much better than what I was typing out).
>
> I'm not sure this is from kobold.cpp. I know they offer a much better context shift where they effectively keep the context full at all times once you hit the limit unlike llama.cpp and here where the context shift unnecessarily removes far more tokens than is needed (I think half) and thus shifts are less frequent. Kobold.cpp on the other hand shifts every token which keeps the maximum information allowed at all times.
+
+> 👤 **cmoncure** replied on **2025-06-06** at **19:40:13**
>
-> 👤 **cmoncure** replied the **2025-06-06** at **19:40:13**:
> >You cannot undo the impact of the skipped tokens by just changing the position encoding via RoPE.
>
> So...
@@ -112,8 +119,9 @@ By the way, @saood06, if you need any help with testing, I’d be happy to assis
>
> 1. Is the effect of tokens on the KV cache _additive_ or _multiplicative_ (or something else)? If additive, can the effect of tokens removed from the prompt be recalculated and their effect subtracted?
> 2. If the presence of token PP computation in the KV cache poisons it forever, then doesn't that imply that tokens outside the context window can continue to affect generation? That would contradict my mental model of how all this is supposed to work. Edit: I suppose that's why the whole thing must be scrapped each time when the context window fills up. It makes sense.
+
+> 👤 **saood06** replied on **2025-06-07** at **06:17:39**
>
-> 👤 **saood06** replied the **2025-06-07** at **06:17:39**:
> > 4. Once a token has acted on the KV cache, its effect poisons the KV cache indelibly.
> >
> >
@@ -130,8 +138,9 @@ By the way, @saood06, if you need any help with testing, I’d be happy to assis
> If you shift "The main actor was" then you will see the influence of the removed tokens (but it will be much faster as you are not recomputing those tokens).
>
> If you do recompute the tokens "The main actor was" and do not shift then it will be slower (as you have to actually compute the tokens again) but you will not experience the lingering impact of "I absolutely enjoyed it."
+
+> 👤 **cmoncure** replied on **2025-06-10** at **02:35:21**
>
-> 👤 **cmoncure** replied the **2025-06-10** at **02:35:21**:
> >If you do recompute the tokens "The main actor was" and do not shift then it will be slower (as you have to actually compute the tokens again) but you will not experience the lingering impact of "I absolutely enjoyed it."
>
> Forgive me if I've misunderstood. Suppose we have the following prompt:
@@ -169,6 +178,6 @@ By the way, @saood06, if you need any help with testing, I’d be happy to assis
---
-👤 **cmoncure** replied the **2025-06-05** at **18:35:28**:
+👤 **cmoncure** commented on **2025-06-05** at **18:35:28**
Might have to do it myself.
\ No newline at end of file
diff --git a/github-data/discussions/459 - qwen3 metrics on ancient hardware _2x xeon Vs 2x P100_.md b/github-data/discussions/459 - qwen3 metrics on ancient hardware 2x xeon Vs 2x P100.md
similarity index 94%
rename from github-data/discussions/459 - qwen3 metrics on ancient hardware _2x xeon Vs 2x P100_.md
rename to github-data/discussions/459 - qwen3 metrics on ancient hardware 2x xeon Vs 2x P100.md
index d1572dfb7..529a42e4a 100644
--- a/github-data/discussions/459 - qwen3 metrics on ancient hardware _2x xeon Vs 2x P100_.md
+++ b/github-data/discussions/459 - qwen3 metrics on ancient hardware 2x xeon Vs 2x P100.md
@@ -1,13 +1,14 @@
-### 🗣️ [#459](https://github.com/ikawrakow/ik_llama.cpp/discussions/459) - qwen3 metrics on ancient hardware (2x xeon Vs 2x P100)
+## 🗣️ [Discussion #459](https://github.com/ikawrakow/ik_llama.cpp/discussions/459) - qwen3 metrics on ancient hardware (2x xeon Vs 2x P100)
| **Author** | `VinnyG9` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-15 |
| **Updated** | 2025-05-28 |
---
-#### Description
+## 📄 Description
so i set a snoop mode in bios which does some kind of speculative decoding called Home dir w/ OSB+, and it gives a big boost with numa enabled
all tests with HT off
@@ -106,15 +107,15 @@ WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to i
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **ikawrakow** replied the **2025-05-15** at **04:26:42**:
+👤 **ikawrakow** commented on **2025-05-15** at **04:26:42**
You regex is incorrect, so everything goes to the GPU. Try `-ot exps=CPU` instead. When that works and you see how much VRAM you have left on each GPU, you can offload some of the experts to the GPU using additional regular expressions for that that precede the `exps=CPU` expression.
---
-👤 **VinnyG9** replied the **2025-05-15** at **14:08:28**:
+👤 **VinnyG9** commented on **2025-05-15** at **14:08:28**
> You regex is incorrect, so everything goes to the GPU. Try `-ot exps=CPU` instead. When that works and you see how much VRAM you have left on each GPU, you can offload some of the experts to the GPU using additional regular expressions for that that precede the `exps=CPU` expression.
@@ -136,7 +137,7 @@ https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html
---
-👤 **ikawrakow** replied the **2025-05-15** at **14:13:55**:
+👤 **ikawrakow** commented on **2025-05-15** at **14:13:55**
The attention tensors are on the GPU, so you don't really want to use `-nkvo` (unless extremely desperate to save more VRAM).
@@ -144,7 +145,7 @@ What is the quantization type you are using? Full log, including command line ar
---
-👤 **VinnyG9** replied the **2025-05-15** at **17:31:23**:
+👤 **VinnyG9** commented on **2025-05-15** at **17:31:23**
when i do "exps\.=CPU" only 6GB total are offloaded to the GPUs is that normal?
in contrast if i offload 95 instead of 94 layers it triggers the 300GB alloc bug again:
@@ -168,13 +169,13 @@ log> https://pastebin.com/1VEd7tuD
---
-👤 **VinnyG9** replied the **2025-05-15** at **18:31:10**:
+👤 **VinnyG9** commented on **2025-05-15** at **18:31:10**
this tensor override thing makes no sense, i'm testing the Q2K quant it's using 40% of vram and if i set only one more tensor-layer the cuda malloc explodes
---
-👤 **Ph0rk0z** replied the **2025-05-15** at **21:23:16**:
+👤 **Ph0rk0z** commented on **2025-05-15** at **21:23:16**
>in contrast if i offload 95 instead of 94 layers it triggers the 300GB alloc bug again:
@@ -184,7 +185,7 @@ I had best luck with numa distribute. Maybe you should do a benchmark of your ra
---
-👤 **ubergarm** replied the **2025-05-16** at **21:30:59**:
+👤 **ubergarm** commented on **2025-05-16** at **21:30:59**
@Fuckingnameless
@@ -202,7 +203,7 @@ have fun!
---
-👤 **VinnyG9** replied the **2025-05-17** at **01:18:44**:
+👤 **VinnyG9** commented on **2025-05-17** at **01:18:44**
> > in contrast if i offload 95 instead of 94 layers it triggers the 300GB alloc bug again:
>
@@ -222,7 +223,7 @@ numa is not working right for me i need to fiddle with snoop modes is my guess
---
-👤 **VinnyG9** replied the **2025-05-17** at **01:25:58**:
+👤 **VinnyG9** commented on **2025-05-17** at **01:25:58**
> [@Fuckingnameless](https://github.com/Fuckingnameless)
>
@@ -249,7 +250,7 @@ i thought that was default, also read somewhere that doing 2 copies aka data par
---
-👤 **ubergarm** replied the **2025-05-17** at **14:41:33**:
+👤 **ubergarm** commented on **2025-05-17** at **14:41:33**
@Fuckingnameless
@@ -269,7 +270,7 @@ I have a [whole discussion on the NUMA stuff here](https://github.com/ggml-org/l
---
-👤 **Ph0rk0z** replied the **2025-05-17** at **15:03:48**:
+👤 **Ph0rk0z** commented on **2025-05-17** at **15:03:48**
>Also as @Ph0rk0z you might want to try compiling with -DGGML_SCHED_MAX_COPIES=1
@@ -287,7 +288,7 @@ If you do it sequentially and just fill as many layers before OOM, you'll have a
---
-👤 **VinnyG9** replied the **2025-05-18** at **02:01:19**:
+👤 **VinnyG9** commented on **2025-05-18** at **02:01:19**
> > Also as [@Ph0rk0z](https://github.com/Ph0rk0z) you might want to try compiling with -DGGML_SCHED_MAX_COPIES=1
>
@@ -309,13 +310,13 @@ I updated the OP with benchmarks
---
-👤 **Ph0rk0z** replied the **2025-05-18** at **11:33:22**:
+👤 **Ph0rk0z** commented on **2025-05-18** at **11:33:22**
Try some different regex for CPU. In the benchmark command line above its missing the wildcard.
---
-👤 **VinnyG9** replied the **2025-05-20** at **14:49:53**:
+👤 **VinnyG9** commented on **2025-05-20** at **14:49:53**
$ CUDA_VISIBLE_DEVICES=0,1 bin/llama-bench -t 31 -p 64,128,256 -n 32,64,128 -m moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ngl 94 -ot "blk.([0-9]|[1][0-3]).ffn_.*=CUDA1","output.=CUDA1","blk.([0-3][0-9]|4[0-6]).ffn_norm.=CUDA1" -ot "blk.(4[7-9]|[5-9][0-9]).ffn_norm.=CUDA0" -ot "blk.([3][1-9]|[4-9][0-9]).ffn_.*=CPU" -fa 1 -fmoe 1 -rtr 1 --numa distribute
@@ -361,7 +362,7 @@ norm layers split 1/1, output layers on last gpu
---
-👤 **saood06** replied the **2025-05-25** at **05:08:13**:
+👤 **saood06** commented on **2025-05-25** at **05:08:13**
> ̶E̶d̶i̶t̶;̶ ̶f̶i̶x̶e̶d̶ ̶b̶y̶ ̶d̶i̶s̶a̶b̶l̶i̶n̶g̶ ̶c̶u̶b̶l̶a̶s̶
@@ -369,12 +370,13 @@ norm layers split 1/1, output layers on last gpu
Edit: a discussion makes a lot more sense. Thanks @ikawrakow
-> 👤 **ikawrakow** replied the **2025-05-25** at **07:36:49**:
+> 👤 **ikawrakow** replied on **2025-05-25** at **07:36:49**
+>
> Yes, I thought this could be useful info for some people.
---
-👤 **VinnyG9** replied the **2025-05-25** at **12:51:11**:
+👤 **VinnyG9** commented on **2025-05-25** at **12:51:11**
trying to figure out why I was seeing a performance drop with numa-cpu inference on debian, tried xanmod 6.12/6.14 kernel, upgraded to debian-testing, tried cuda 12-8/12-9, one change at a time, best i could get was 32t/s on qwen3 30B
also memory mapping doesn't work
@@ -385,20 +387,24 @@ booted back on linux mint vanilla
I'm now a distrohopper
-> 👤 **Ph0rk0z** replied the **2025-05-25** at **18:14:19**:
+> 👤 **Ph0rk0z** replied on **2025-05-25** at **18:14:19**
+>
> I've been using xanmod-v3 with mint. Since my CPUs identify as skylake-x, I might try the V4 version and see if there is some difference.
+
+> 👤 **VinnyG9** replied on **2025-05-26** at **15:27:17**
>
-> 👤 **VinnyG9** replied the **2025-05-26** at **15:27:17**:
> > I've been using xanmod-v3 with mint. Since my CPUs identify as skylake-x, I might try the V4 version and see if there is some difference.
>
> on mint i had no luck with xanmodv3 either it was like 15% slower
+
+> 👤 **Ph0rk0z** replied on **2025-05-27** at **14:35:27**
>
-> 👤 **Ph0rk0z** replied the **2025-05-27** at **14:35:27**:
> going to have to try and compare a regular kernel of the same version. V4 xanmod seems behind for ubuntu 22.04 based distros, there was no 6.12 even. V3 has been serving me well for more than a year so I'm curious if I get higher memory b/w or other difference that would change t/s.
>
> I'm having a crazy time with GGML_SCHED_MAX_COPIES. I'm not sure what's being offloaded when you set it to 1 and do all model layers. CUDA host compute buffer is smaller but whatever ends up on my other cards forces me to remove 3 gate layers. In theory TG is better but not PP. Maybe I can make up for it. Also means I have to test qwen again because this is deepseek. I'm going to keep juicing the turnip just like you.
+
+> 👤 **VinnyG9** replied on **2025-05-28** at **20:13:36**
>
-> 👤 **VinnyG9** replied the **2025-05-28** at **20:13:36**:
> > going to have to try and compare a regular kernel of the same version. V4 xanmod seems behind for ubuntu 22.04 based distros, there was no 6.12 even. V3 has been serving me well for more than a year so I'm curious if I get higher memory b/w or other difference that would change t/s.
> >
> > I'm having a crazy time with GGML_SCHED_MAX_COPIES. I'm not sure what's being offloaded when you set it to 1 and do all model layers. CUDA host compute buffer is smaller but whatever ends up on my other cards forces me to remove 3 gate layers. In theory TG is better but not PP. Maybe I can make up for it. Also means I have to test qwen again because this is deepseek. I'm going to keep juicing the turnip just like you.
@@ -407,7 +413,7 @@ I'm now a distrohopper
---
-👤 **VinnyG9** replied the **2025-05-25** at **13:22:48**:
+👤 **VinnyG9** commented on **2025-05-25** at **13:22:48**
235B Q2 not so bad?
diff --git a/github-data/discussions/466 - A curiosity..md b/github-data/discussions/466 - A curiosity.md
similarity index 76%
rename from github-data/discussions/466 - A curiosity..md
rename to github-data/discussions/466 - A curiosity.md
index 26eb977eb..d1b752168 100644
--- a/github-data/discussions/466 - A curiosity..md
+++ b/github-data/discussions/466 - A curiosity.md
@@ -1,13 +1,14 @@
-### 🗣️ [#466](https://github.com/ikawrakow/ik_llama.cpp/discussions/466) - A curiosity.
+## 🗣️ [Discussion #466](https://github.com/ikawrakow/ik_llama.cpp/discussions/466) - A curiosity.
| **Author** | `Nexesenex` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-28 |
| **Updated** | 2025-06-08 |
---
-#### Description
+## 📄 Description
I made a little fork of Llama.cpp mainline, integrating some commits of IK_Llama, and able to quantize (for now) in q6_0, IQ3_K, IQ4_K, IQ5_K and IQ6_K.
It's based on b5474 for now, and now I can use the wonderful q6_0 and IQ6_K for any model supported by mainline.
@@ -19,12 +20,13 @@ Edit 2 : https://github.com/Nexesenex/croco.cpp/releases/tag/v1.93040_b5600_RMv1
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **VinnyG9** replied the **2025-05-28** at **20:14:51**:
+👤 **VinnyG9** commented on **2025-05-28** at **20:14:51**
any performance numberos?
-> 👤 **Nexesenex** replied the **2025-05-29** at **07:05:33**:
+> 👤 **Nexesenex** replied on **2025-05-29** at **07:05:33**
+>
> None, it barely works for a part of its purpose, which is to quantize models with some IQ quants within the mainline framework.
> PPL test work also, as well as Cuda inference for Gemma 3 in 0.04. And that's it for now. ^^
\ No newline at end of file
diff --git a/github-data/discussions/477 - DeepSeek-R1-0528 ik quants_.md b/github-data/discussions/477 - DeepSeek-R1-0528 ik quants.md
similarity index 72%
rename from github-data/discussions/477 - DeepSeek-R1-0528 ik quants_.md
rename to github-data/discussions/477 - DeepSeek-R1-0528 ik quants.md
index 4a9943ff8..c082b517b 100644
--- a/github-data/discussions/477 - DeepSeek-R1-0528 ik quants_.md
+++ b/github-data/discussions/477 - DeepSeek-R1-0528 ik quants.md
@@ -1,13 +1,14 @@
-### 🗣️ [#477](https://github.com/ikawrakow/ik_llama.cpp/discussions/477) - DeepSeek-R1-0528 ik quants!
+## 🗣️ [Discussion #477](https://github.com/ikawrakow/ik_llama.cpp/discussions/477) - DeepSeek-R1-0528 ik quants!
| **Author** | `ubergarm` |
| :--- | :--- |
+| **State** | ✅ **Open** |
| **Created** | 2025-05-30 |
-| **Updated** | 2025-07-19 |
+| **Updated** | 2025-07-26 |
---
-#### Description
+## 📄 Description
## What
Starting this "show and tell" discussion about the updated DeepSeek-R1-0528 model and various quants beginning to emerge.
@@ -51,20 +52,22 @@ Thanks and let me know if you try these out or have questions or comments. Feel
---
-#### 🗣️ Discussion
+## 💬 Discussion
-👤 **randoentity** replied the **2025-05-31** at **05:56:18**:
+👤 **randoentity** commented on **2025-05-31** at **05:56:18**
Thanks for these quants and the rest of your work you publish. Could you do one that fits in 128GB RAM and 72GB VRAM with 32K context? I tried the unsloth IQ1_S and got about 2.7 t/s generation on mainline and 2.15 t/s on ik. It was coherent and delivered surprisingly good responses to real world coding tasks. Oh but the R4 variants don't support Q1 yet, right?
-> 👤 **ubergarm** replied the **2025-06-01** at **17:54:28**:
+> 👤 **ubergarm** replied on **2025-06-01** at **17:54:28**
+>
> Yeah getting that small becomes tricky. I've been noodling on it and want to try out some experiments.. the iq2_kt quants might be interesting but will take a long time to quantize. they will get us down to 2.125 BPW but likely not performant given a lot of CPU inferencing.
>
> I could look into the IQ1 stuff but haven't ever messed with those really... but yes there are no `_r4` repacked versions of the smaller sub ~4bpw guys yet.
>
> If you have a good PCIe Gen5 NVMe e.g. the T705 or similar you might actually get faster going with my `IQ2_KS` which is 220GiB and using the default mmap() to let some of it "hang off" into page cache. Hoping to try that soon and expect 3-5 tok/sec on my gaming rig (96GB RAM +24GB VRAM) but it does heat up the SSD (though no write level wear as it is read only).
+
+> 👤 **ubergarm** replied on **2025-06-02** at **04:43:27**
>
-> 👤 **ubergarm** replied the **2025-06-02** at **04:43:27**:
> @randoentity
>
> So I'm about to upload a `IQ1_S_R4` 1.664 BPW (131GiB) that might actually fit in 128GB RAM + 24GB VRAM and has lower perplexity than Qwen3-235B-A22B-Q8_0 haha... Not sure if it is "better" though, but kind of surprising.
@@ -72,29 +75,33 @@ Thanks for these quants and the rest of your work you publish. Could you do one
> If you have enough RAM+VRAM to fully fit a larger model I'd recommend that over this tiny one, and you probably won't be able to run the these repacked quants on CUDA yet to take advantage of offloading extra layers. Though you can up your `-b 4096 -ub 4096` or possibly higher and use the full 160k context with all your extra VRAM.
>
> It should be finished uploading by monday morning NYC Eastern Time.
+
+> 👤 **randoentity** replied on **2025-06-02** at **17:18:21**
>
-> 👤 **randoentity** replied the **2025-06-02** at **17:18:21**:
> I'm only getting 0.05 TG, probably because it isn't running on CUDA. Higher batch did improve TG on mainline.
+
+> 👤 **ubergarm** replied on **2025-06-02** at **19:45:52**
>
-> 👤 **ubergarm** replied the **2025-06-02** at **19:45:52**:
> @randoentity
>
> > I'm only getting 0.05 TG, probably because it isn't running on CUDA.
>
> What are you trying to do? Test out the IQ1_S_R4 quant? Provide your full command here and we can workshop it as 0.05 tok/sec TG (assuming that is what you mean?) sounds low for a 128GB RAM + 72GB VRAM system. Also provide what mix of GPUs you have e.g. a 2x 3090s and whatever.
+
+> 👤 **ThomasBaruzier** replied on **2025-06-02** at **22:20:58**
>
-> 👤 **ThomasBaruzier** replied the **2025-06-02** at **22:20:58**:
> https://github.com/ikawrakow/ik_llama.cpp/discussions/242#discussioncomment-12452986
> @randoentity I have the same setup as you and managed 7tok/s TG and 40 tok/s PP
>
> Edit: the setup described in the link probably needs updating with all the new PRs, like mla3, but I haven't tested yet
+
+> 👤 **randoentity** replied on **2025-06-03** at **19:00:26**
>
-> 👤 **randoentity** replied the **2025-06-03** at **19:00:26**:
> @ThomasBaruzier thanks! Unfortunately your example didn't help me. I had already tried that and other combinations.
---
-👤 **Ph0rk0z** replied the **2025-06-01** at **13:19:59**:
+👤 **Ph0rk0z** commented on **2025-06-01** at **13:19:59**
Will -rtr fix the R4 quants so they don't have to use the BF16 path?
@@ -103,20 +110,21 @@ different in that regard. Granted, I can use full 32k context now and maintain s
Smaller AMB than 512 often lets you fit a couple more pieces due to the reduced buffer. Every little bit on GPU helps when CPU/Memory isn't that strong.
-> 👤 **ubergarm** replied the **2025-06-01** at **17:57:01**:
+> 👤 **ubergarm** replied on **2025-06-01** at **17:57:01**
+>
> > Will -rtr fix the R4 quants so they don't have to use the BF16 path?
>
> `-rtr` will try to make non `_r4` quants into `_r4` quants so I believe the answer is no. Though some folks are reporting `-DGGML_CUDA_IQK_FORCE_BF16=1` is giving them a slight speed *boost* probably depending on what model GPU you have.
---
-👤 **ubergarm** replied the **2025-06-01** at **15:20:15**:
+👤 **ubergarm** commented on **2025-06-01** at **15:20:15**
I had an [interesting report from huggingface.co/ciprianv](https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/discussions/2#683b68b8df33990a5ac0a1f7) that compiling with `-DGGML_CUDA_IQK_FORCE_BF16=1` was giving a speed *boost* on these quants which is not what I expected.
I tried it out myself and confirmed with `llama-sweep` bench. This also is showing some small speed-ups by offloading additional layers onto GPU. I didn't have the patience to finish running one of them but you get the jist.
-Interestingly it does suggest that for some hardware configurations it may be beneficial to PP to compile with `-DGGML_CUDA_IQK_FORCE_BF16=1` which surprised me given discussion in [PR#461](https://github.com/ikawrakow/ik_llama.cpp/pull/461#issue-3091345746)
+Interestingly it does suggest that for some hardware configurations it may be beneficial to PP to compile with `-DGGML_CUDA_IQK_FORCE_BF16=1` which surprised me given discussion in [PR[#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461)](https://github.com/ikawrakow/ik_llama.cpp/pull/461#issue-3091345746)

@@ -251,26 +259,32 @@ llama_model_loader: - type iq3_k_r4: 58 tensors
-> 👤 **ikawrakow** replied the **2025-06-01** at **15:30:25**:
+> 👤 **ikawrakow** replied on **2025-06-01** at **15:30:25**
+>
> Ha, this is interesting. On my RTX-4080 `bf16` is ~10-20% slower than `fp16`.
+
+> 👤 **ikawrakow** replied on **2025-06-01** at **15:40:55**
>
-> 👤 **ikawrakow** replied the **2025-06-01** at **15:40:55**:
> Btw, if you have space VRAM, try `-b 4096 -ub 4096`. This should give you a very significant boost in PP performance.
+
+> 👤 **ubergarm** replied on **2025-06-01** at **16:27:29**
>
-> 👤 **ubergarm** replied the **2025-06-01** at **16:27:29**:
> Holy Ravioli, Batman!
>
> 
+
+> 👤 **ciprianveg** replied on **2025-06-01** at **17:06:40**
>
-> 👤 **ciprianveg** replied the **2025-06-01** at **17:06:40**:
> Exactly, you can go to 6144, if vram permits, an even further bump in pp speed..
+
+> 👤 **Ph0rk0z** replied on **2025-06-01** at **17:53:51**
>
-> 👤 **Ph0rk0z** replied the **2025-06-01** at **17:53:51**:
> >-b 4096 -ub 4096
>
> This gives me a bump from 90 to 127 but the buffer sizes mean I have to offload less layers. Offloading the wrong things can cause PCIE related gpu bottleneck too.
+
+> 👤 **RodriMora** replied on **2025-06-02** at **09:15:30**
>
-> 👤 **RodriMora** replied the **2025-06-02** at **09:15:30**:
> results with and without -b 4096 -ub 4096
>
> 
@@ -346,32 +360,36 @@ llama_model_loader: - type iq3_k_r4: 58 tensors
> cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
> cmake --build build --config Release -j$(nproc)
> ```
+
+> 👤 **cmoncure** replied on **2025-06-02** at **14:05:02**
>
-> 👤 **cmoncure** replied the **2025-06-02** at **14:05:02**:
> > Offloading the wrong things can cause PCIE related gpu bottleneck too.
>
> Tell me more. Isn't -ot just a static offload of tensors, and if you put too many, the process blows up when it runs out of VRAM? Where does PCI-E come into play?
+
+> 👤 **Ph0rk0z** replied on **2025-06-02** at **15:23:34**
>
-> 👤 **Ph0rk0z** replied the **2025-06-02** at **15:23:34**:
> If you split a layer across cards you can have a situation where GPU usage is high and they transfer a lot of data back and forth. Like place a gate on one and down on another. The CPU usage then craters to half or less and your overall speed is cooked. Especially evident for RTR. Remember a forward pass goes through these weights and I think passes states along.
+
+> 👤 **ubergarm** replied on **2025-06-03** at **20:20:29**
>
-> 👤 **ubergarm** replied the **2025-06-03** at **20:20:29**:
> @RodriMora
>
> Thanks for the graphs. I thought I recognized that combination of GPUs from reddit lmao... Cheers at stitching together a sweet vibe coding rig haha
---
-👤 **anikifoss** replied the **2025-06-01** at **16:12:44**:
+👤 **anikifoss** commented on **2025-06-01** at **16:12:44**
I uploaded the custom quant I use for coding [here](https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4) with some of the infromation how I arrived there and relevant benchmarks. I added some teasers on command line arguments to experiment with, as this branch is moving quickly and small performance improvements can add up over time.
-> 👤 **ubergarm** replied the **2025-06-04** at **21:08:29**:
+> 👤 **ubergarm** replied on **2025-06-04** at **21:08:29**
+>
> Thanks again for your quant, pretty sure it is the biggest boi of them all so a great choice for anyone with a big rig that wants the more BPW than my quants!
---
-👤 **ubergarm** replied the **2025-06-01** at **18:28:15**:
+👤 **ubergarm** commented on **2025-06-01** at **18:28:15**
Qantization Effects of `attn`/`shexp` on Perplexity
===
@@ -379,7 +397,7 @@ Qantization Effects of `attn`/`shexp` on Perplexity
## Motivation
> I would be curious to see how much degradation in quality there is from using 6- or 5-bit quants for the attention tensors and shared experts. @ikawrakow
-This research grew out of [PR#411 discussions](https://github.com/ikawrakow/ik_llama.cpp/pull/411#issuecomment-2922464774). I've expanded on ik's example bash script to create 10 test quants each about \~355GiB in size. All the quants hold constant `q4_0` for `ffn.*` and `token_embd` while varying `attn.*` and `shexp` using all quants between 4~6bpw.
+This research grew out of [PR[#411](https://github.com/ikawrakow/ik_llama.cpp/issues/411) discussions](https://github.com/ikawrakow/ik_llama.cpp/pull/411#issuecomment-2922464774). I've expanded on ik's example bash script to create 10 test quants each about \~355GiB in size. All the quants hold constant `q4_0` for `ffn.*` and `token_embd` while varying `attn.*` and `shexp` using all quants between 4~6bpw.
If anyone wants to publish this, just hit me up and just cite myself and the project here appropriately.
@@ -750,19 +768,22 @@ My personal observations and thoughts are:
3. The 32 block size [_ks](https://github.com/ikawrakow/ik_llama.cpp/pull/83#issue-2575352790) quants are looking really strong here especially given recent CUDA speed-ups. I'm eyeing that `iq5_ks` for future recipes and glad I already used them my released `IQ2_K_R4`
4. The error bars crack me up.
-> 👤 **ubergarm** replied the **2025-06-02** at **04:46:51**:
+> 👤 **ubergarm** replied on **2025-06-02** at **04:46:51**
+>
> 
>
> Just ran some perplexity numbers for all of the quants I've released to huggingface. Running a few KLD on a very short "novel" test corpus also mainly to compare against quants from other cookers using different imatrix test corpus and methodologies and confirm if the PPL compares between us all okay or what.
>
> Interestingly the small `IQ1_S_R4` has a perplexity lower than `Qwen3-235B-A22B-Q8_0`=`Final estimate: PPL = 5.3141 +/- 0.03321` 232.769 GiB though that doesn't necessarily mean it is "better" but possibly more trained against wiki.test.raw?
+
+> 👤 **ikawrakow** replied on **2025-06-02** at **05:36:13**
>
-> 👤 **ikawrakow** replied the **2025-06-02** at **05:36:13**:
> So, `iq5_ks` looks like the winning option for attention tensors.
>
> Concerning `IQ1_S` lower PPL: these are two different models, so PPL cannot be used to compare. PPL is useful for measuring quality degradation with different quantization types applied to the **same model**. My guess is that the PPL difference between `f16` (or `Q8_0`) Qwen3-235B-A22B and DeepSeek-R1 is quite large.
+
+> 👤 **ubergarm** replied on **2025-06-02** at **14:14:22**
>
-> 👤 **ubergarm** replied the **2025-06-02** at **14:14:22**:
> > So, iq5_ks looks like the winning option for attention tensors.
>
> Yes, just for fun I ran a very short kld test corpus against them as well. The graph is kind of gnarly but is attempting to show `RMS Δp`, `99.0% Δp`, and `Maximum Δp` percentage for each of the experimental attn/shexp quants. Seems to still point towards `iq5_ks` as it has a surprisingly tight Δp relative the to pure q8_0 everything ~666GiB baseline.
@@ -896,33 +917,36 @@ My personal observations and thoughts are:
> > PPL is useful for measuring quality degradation with different quantization types applied to the same model.
>
> Thanks, that makes sense. I'm wondering if it is okay to use PPL to measure relative quality of the same model quantized with different imatrix corpus / methodologies? I don't know how much stock to put into my PPL comparisons of R1-0528 quants done by myself, unsloth, bartowski, given somewhat varying imatrix methodologies.
+
+> 👤 **saood06** replied on **2025-06-04** at **04:32:52**
>
-> 👤 **saood06** replied the **2025-06-04** at **04:32:52**:
> > Yes, just for fun I ran a very short kld test corpus against them as well. The graph is kind of gnarly but is attempting to show `RMS Δp`, `99.0% Δp`, and `Maximum Δp` percentage for each of the experimental attn/shexp quants. Seems to still point towards `iq5_ks` as it has a surprisingly tight Δp relative the to pure q8_0 everything ~666GiB baseline.
>
> If you find it fun/interesting can you see what quants you have pass the maze test. As mentioned here https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2882600098, I found it quite interesting the difference in pass rate between IQ4_K_R4 and IQ4_KS_R4.
>
> If you don't find it fun/interesting then don't bother.
+
+> 👤 **randoentity** replied on **2025-06-04** at **20:24:31**
>
-> 👤 **randoentity** replied the **2025-06-04** at **20:24:31**:
> I tried one pass and the IQ1_S succeeded, but it took 19 minutes of thinking (at 4.7 t/s).
>
> Edit: 3/3 so far, quasi-random maze (I skipped ones that required fewer than 3 steps).
---
-👤 **Ph0rk0z** replied the **2025-06-02** at **11:12:38**:
+👤 **Ph0rk0z** commented on **2025-06-02** at **11:12:38**
So here is a new surprise, since I'm eying that IQ1 quant you're publishing. On a lark I turned off the -rtr switch and in unsloth's quant, it was cutting my prompt processing by half. It did buff textgen to over 11t/s though. The mind wobbles. Will try reloading the larger quant of V3 to check results. On Qwens it sped things up 100%
On another note, I tried to test mainline llama and that sweep bench segfaults with deepseek and does not recognize the -FA parameter. I was able to load on llama-server and get a blazing fast 6t/s PP, 6t/s TG. So much for that.
-> 👤 **ubergarm** replied the **2025-06-04** at **21:11:14**:
+> 👤 **ubergarm** replied on **2025-06-04** at **21:11:14**
+>
> Check out this [PR492](https://github.com/ikawrakow/ik_llama.cpp/pull/492), given one cannot simply repack IQ1_S to IQ1_S_R4 is possibly related to the mind wobbles. haha..
---
-👤 **cmoncure** replied the **2025-06-03** at **00:58:43**:
+👤 **cmoncure** commented on **2025-06-03** at **00:58:43**
Still struggling to understand some things.
@@ -942,13 +966,16 @@ With Q4_K_M `-ngl 8 -sm layer -b 4096` it's 180-200 PP but less ideal 6-8 TG. C
Either way I have a whole GPU worth of compute just sitting idle. There has to be a way to utilize it. Can I not have the `-ngl 8 -sm layer` approach during PP on CUDA0, and then the `-rtr -sm none` approach during TG on CUDA1? Can I produce a quant that gets me the best of both worlds?
-> 👤 **Ph0rk0z** replied the **2025-06-03** at **02:39:39**:
+> 👤 **Ph0rk0z** replied on **2025-06-03** at **02:39:39**
+>
> Trial and error :( Helps to print the sizes on mainline and then see what you can fit. Generally on deepseek, only EXP layers help. All the little small ones don't do much.
+
+> 👤 **cmoncure** replied on **2025-06-03** at **15:24:54**
>
-> 👤 **cmoncure** replied the **2025-06-03** at **15:24:54**:
> Why can I seemingly split any combination of tensors between CPU and GPU0, but as soon as I try putting one tensor on to GPU1 this is suddenly impossible?
+
+> 👤 **ubergarm** replied on **2025-06-03** at **16:19:17**
>
-> 👤 **ubergarm** replied the **2025-06-03** at **16:19:17**:
> Its hard for me to understand what you're doing without a full command. A few quick thoughts:
> 1. Order matters, always put `-ot exps=CPU` *last* and any kind of offload to CUDA0 *before* it.
> 2. What is `GPU0`? Does that work? I've only used `CUDA0` but maybe you have non nvidia? i dunno...
@@ -965,8 +992,9 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
> ```
>
> Have fun!
+
+> 👤 **Ph0rk0z** replied on **2025-06-03** at **17:22:42**
>
-> 👤 **Ph0rk0z** replied the **2025-06-03** at **17:22:42**:
> SM row and layer is the pseudo tensor parallel switch, mainly for GPU inference only. If we had real TP I bet our t/s go up by a third. Does TS even do anything here when you curate what layers to offload?
>
> I could put NGL 3 (maybe not that low, it segfaults) and just OT the layers I want to GPU. NGL only seems to stuff some unmentioned piddly layers on there and determine if pipeline parallel enables or not which affects the buffer size.
@@ -976,16 +1004,19 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
> Super easy to get started, you just rack ffn or ffn_exp onto each GPU until it reaches a point where it doesn't OOM after the buffer is added. Can lower the buffer with AMB or smaller batch/ubatch. Ideally you have 4096, 2048, 1024 batches for context and then lower that to gain more t/s. It really is a balance of what you want.
>
> Likely with Q4KM the layers are large too. Going to have to pick and choose. Sincerely hope that only 2 layers aren't fitting because that's nothing.
+
+> 👤 **randoentity** replied on **2025-06-03** at **18:58:13**
>
-> 👤 **randoentity** replied the **2025-06-03** at **18:58:13**:
> I've tried the example commands and a ton of combinations, but I can't get the IQ1_ik generate faster than the unsloth IQ1_S. The fastest I can get is about 2.8 t/s and that's with **only** `--override-tensor exps=CPU,attn_kv_b=CPU`. As soon as I add more ffn layers (as per example) to CUDA (4@16x) it slows down. I've played with batch sizes, fa+ctv, bf16 enabled or not (it is a bit faster with it on!), and also the unsloth -ot examples. I (again) must have missed something obvious, like ik_llama.cpp requiring AVX512 or more than 6 cores.
+
+> 👤 **Thireus** replied on **2025-06-03** at **19:05:09**
>
-> 👤 **Thireus** replied the **2025-06-03** at **19:05:09**:
> > I've tried the example commands and a ton of combinations, but I can't get the IQ1_ik generate faster than the unsloth IQ1_S. The fastest I can get is about 2.8 t/s and that's with **only** `--override-tensor exps=CPU,attn_kv_b=CPU`. As soon as I add more ffn layers (as per example) to CUDA (4@16x) it slows down. I've played with batch sizes, fa+ctv, bf16 enabled or not (it is a bit faster with it on!), and also the unsloth -ot examples. I (again) must have missed something obvious, like ik_llama.cpp requiring AVX512 or more than 6 cores.
>
> I'm observing the same behaviour and I'm suspecting it has to do with memory/pcie bandwidth being saturated. Which CPU are you using?
+
+> 👤 **ubergarm** replied on **2025-06-03** at **20:04:14**
>
-> 👤 **ubergarm** replied the **2025-06-03** at **20:04:14**:
> Heya all, I have another thread going to help people specifically related to my smol boi 131GiB `IQ1_S_R4` ik_llama.cpp quant with some more example commands and discussion here: https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/discussions/6#683e4c6ede3f6dd9c43ad4ad
>
> If you want some help always give your CPU, RAM size, and list GPUs with VRAM each/total as well as the *full* current command you're trying. That will help me diagnose and optimize your command.
@@ -1006,8 +1037,9 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
> Oh hello from reddit! I'm voidalchemy haha... Hopefully we can get a good command ironed out as folks start learning ik_llama.cpp which can be a little different than mainline llama.cpp.
>
> So yeah post your current command and system specs and hopefully can get you a few more tok/sec.
+
+> 👤 **randoentity** replied on **2025-06-03** at **20:27:15**
>
-> 👤 **randoentity** replied the **2025-06-03** at **20:27:15**:
> ```sh
> ./build_bf16/bin/llama-server \
> --model /mnt/x/models/ubergarm/DeepSeek-R1-0528- IQ1_S_R4-00001-of-00003.gguf \
@@ -1029,8 +1061,9 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
> The kv_b example I got from https://huggingface.co/anikifoss/DeepSeek-R1-0528-DQ4_K_R4 (see above). I just added it to show that I've tried a ton of things.
>
> I do use a headless system and I don't have any swap allocated. The 172GiB one fits just fine and I can run it with --no-mmap.
+
+> 👤 **Thireus** replied on **2025-06-03** at **20:33:12**
>
-> 👤 **Thireus** replied the **2025-06-03** at **20:33:12**:
> 👋 @ubergarm - thank you for all your posts, I've been digging them all and tried various combinations with ik_llama.cpp on Windows.
>
> I kept note of my progress (but not of everything I've tried) here: https://thireus.com/GITHUB/ik_llama_Thireus_bench_01.txt (Firefox: View -> Repair Text Encoding), please let me know if you have suggestions that might help.
@@ -1050,18 +1083,21 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
>
> Using CUDA 12.8 (and Blackwell compatible) + -DGGML_AVX512=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
> See https://github.com/Thireus/ik_llama.cpp/blob/main/.github/workflows/release.yml#L448-L450
+
+> 👤 **Thireus** replied on **2025-06-03** at **21:06:54**
>
-> 👤 **Thireus** replied the **2025-06-03** at **21:06:54**:
> @cmoncure
>
> > Why can I seemingly split any combination of tensors between CPU and GPU0, but as soon as I try putting one tensor on to GPU1 this is suddenly impossible?
>
> That happened to me when not using `--flash-attn` or `-mla 3`.
+
+> 👤 **anikifoss** replied on **2025-06-03** at **22:51:59**
>
-> 👤 **anikifoss** replied the **2025-06-03** at **22:51:59**:
> The `attn_kv_b=CPU` flag can save up to 1GB VRAM without losing any speed, which is huge when you're trying to squeeze more context out of a 24GB card!
+
+> 👤 **ubergarm** replied on **2025-06-03** at **22:53:33**
>
-> 👤 **ubergarm** replied the **2025-06-03** at **22:53:33**:
> @randoentity
>
> > Isn't the problem just that IQ1_R4 isn't implemented (https://github.com/ikawrakow/ik_llama.cpp/pull/461)? Because the more I offload to CUDA the slower it gets. I.e. -ot exps=CPU alone is faster than adding more ffn blocks to CUDA (also tested single or multiple devices; same result).
@@ -1091,8 +1127,9 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
> I'll look at your commands and come up with an example one to run the larger IQ2_K_R4 and reply later.
>
> Seems like I should roll an unpacked version as 128GB RAM does not seem like enough without using GPU offload and GPU offload doesn't speed anything up so not great. Got it!
+
+> 👤 **ubergarm** replied on **2025-06-03** at **23:13:54**
>
-> 👤 **ubergarm** replied the **2025-06-03** at **23:13:54**:
> @Thireus
>
> #### Assumptions
@@ -1148,18 +1185,21 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
> > wish I had the same results with some full layers loaded to the GPU
>
> Okay, hope that helps! Thanks for helping me figure out why folks are having issue with the IQ1_S_R4 which cannot run any additional layers on GPU!
+
+> 👤 **ubergarm** replied on **2025-06-04** at **04:19:07**
>
-> 👤 **ubergarm** replied the **2025-06-04** at **04:19:07**:
> Okay, uploading the `IQ1_S` now that supports offloading more layers onto GPU. Ideally you would run it with `-rtr` too which takes a little time but should now fit in 128GiB RAM + 24GB VRAM rigs in my testing. Updating model card with two working examples.
+
+> 👤 **Thireus** replied on **2025-06-04** at **07:16:06**
>
-> 👤 **Thireus** replied the **2025-06-04** at **07:16:06**:
> @ubergarm, thank you for the tips, I'm downloading IQ2_K_R4 and IQ1_S. Will report back.
>
> I believe `-f` meant `-fa` from your commands, and `--ot` should be `-ot`.
>
> On Intel, matching the number of threads to the number of CPU threads gives it a 25% boost. Unfortunately I'm still capped at PP 21t/s no matter the -b -ub combination... See results: https://thireus.com/GITHUB/ik_llama_Thireus_bench_02.txt (Firefox: View -> Repair Text Encoding)
+
+> 👤 **Thireus** replied on **2025-06-04** at **08:31:00**
>
-> 👤 **Thireus** replied the **2025-06-04** at **08:31:00**:
> @ubergarm, I need to do more testing but happy days! `IQ1_S` gives me 246t/s PP 🏎️💨
> The trick was indeed NOT TO USE `IQ1_S_R4` for now until support is added for CUDA - https://github.com/ikawrakow/ik_llama.cpp/pull/461
>
@@ -1213,23 +1253,28 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
> ```
>
> Loading more layers onto GPU VRAMs finally gets me higher speeds with `IQ1_S`!
+
+> 👤 **randoentity** replied on **2025-06-04** at **10:48:48**
>
-> 👤 **randoentity** replied the **2025-06-04** at **10:48:48**:
> Happy day! It works and I get above TG 4 t/s.
> @Thireus what is CUDA_DEVICE_ORDER=PCI_BUS_ID for? More consistency when rearranging devices with CUDA_VISIBLE_DEVICES as you don't rely on the heuristics which could change between CUDA versions and potentially hardware conditions?
+
+> 👤 **Thireus** replied on **2025-06-04** at **10:51:46**
>
-> 👤 **Thireus** replied the **2025-06-04** at **10:51:46**:
> @randoentity yep exactly this, it ensures to directly rely on the PCIe order, so I know exactly which card is which.
+
+> 👤 **randoentity** replied on **2025-06-04** at **10:59:44**
>
-> 👤 **randoentity** replied the **2025-06-04** at **10:59:44**:
> Ohh and does anyone know if the --main-gpu setting uses the cuda ordering? So if I do CUDA_VISIBLE_DEVICES=2,0,1 will doing -mg=0 select the first device in aforementioned list (I.e. the one that appears as device 2 in nvtop/nvidia-smi)? I've tried playing with this but empiricism ran away from me at some point.
+
+> 👤 **RodriMora** replied on **2025-06-04** at **11:04:07**
>
-> 👤 **RodriMora** replied the **2025-06-04** at **11:04:07**:
> > Ohh and does anyone know if the --main-gpu setting uses the cuda ordering? So if I do CUDA_VISIBLE_DEVICES=2,0,1 will doing -mg=0 select the first device in aforementioned list (I.e. the one that appears as device 2 in nvtop/nvidia-smi)? I've tried playing with this but empiricism ran away from me at some point.
>
> I believe when you do CUDA_VISIBLE_DEVICES=2,0,1, for ik_llama.cpp now cuda0 is the real cuda2
+
+> 👤 **randoentity** replied on **2025-06-04** at **12:00:53**
>
-> 👤 **randoentity** replied the **2025-06-04** at **12:00:53**:
> Same command as Thireus but with 7 layers in CUDA0 and only 6 cores, which seems to massively cripple PP, but it could be something else. I'll run some more tests, but that this runs and is not outputting gibberish is absolutely astonishing!
>
> ```
@@ -1244,8 +1289,9 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
> ```
>
> **Edit:** S_PP t/s in the 160 range with `--threads-batch = 12`!
+
+> 👤 **Thireus** replied on **2025-06-04** at **12:38:47**
>
-> 👤 **Thireus** replied the **2025-06-04** at **12:38:47**:
> Nice! I haven't played with --threads-batch yet, but will do.
>
> I've cranked the b and ub values to `-b 16384 -ub 8192`, which give much higher PP speeds now. But doesn't leave much room for context size.
@@ -1273,27 +1319,32 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
> | 8192 | 2048 | 8192 | 31.843 | 257.26 | 404.438 | 5.06 |
> ---
> ```
+
+> 👤 **Ph0rk0z** replied on **2025-06-04** at **16:14:17**
>
-> 👤 **Ph0rk0z** replied the **2025-06-04** at **16:14:17**:
> Heh.. from the tests I run yesterday/today.. it seems pointless to download other people's R4 quants unless you have the exact same configuration as they do else you get massive speed hits. https://github.com/ikawrakow/ik_llama.cpp/discussions/491
>
> If I didn't do something wrong, it's more ideal to just use RTR if you want higher tg at the expense of prompt processing. There is a sweet spot for the tradeoff, imo. My CPU is xeon scalable without vnni.. perhaps another codepath or single CPU doesn't have the problem.
+
+> 👤 **ubergarm** replied on **2025-06-04** at **21:12:39**
>
-> 👤 **ubergarm** replied the **2025-06-04** at **21:12:39**:
> @Thireus @randoentity and all,
>
> More good news, ik took a crack at getting `IQ1_S_R4` CUDA implementation going with [PR492](https://github.com/ikawrakow/ik_llama.cpp/pull/492). Feel free to build that branch and compare speeds as it will likely increase your TG numbers.
+
+> 👤 **randoentity** replied on **2025-06-05** at **04:27:36**
>
-> 👤 **randoentity** replied the **2025-06-05** at **04:27:36**:
> Thanks @ubergarm . It looks like a 10% speedup in TG, but slower PP as a tradeoff. However, more space for context might be nice, especially for those with only 24GB VRAM. I'll do some more of those maze tests if you decide to release a pure IQ1_S_R4 (as you mention in the PR, the IQ1_S_R4 you uploaded on HF doesn't work). It might be worth it to make another post on LocalLlama for that.
+
+> 👤 **ubergarm** replied on **2025-06-05** at **15:04:07**
>
-> 👤 **ubergarm** replied the **2025-06-05** at **15:04:07**:
> Yeah I did make and test that `IQ1_S_R4-smol` i call it with iq5_ks for all attn/shexp/token_embd then IQ1_S_R4 for all ffn_up/down/gate_exps but as ik mentioned it is indeed a little bit more dumb despite being just a little bit smaller.
> `Final estimate: PPL = 5.0048 +/- 0.02978`
>
> I decided to not be so brash and just wait a little bit as sounds like ik is interested in also adding `IQ1_M_R4` cuda support in which case that first model I released would be good to go. Oh yes I'll go test [PR494](https://github.com/ikawrakow/ik_llama.cpp/pull/494) now!
+
+> 👤 **randoentity** replied on **2025-06-05** at **21:12:37**
>
-> 👤 **randoentity** replied the **2025-06-05** at **21:12:37**:
> About 20% faster TG and PP didn't take a hit! I think I could even squeeze in another layer or two. Now let's see if this smolly can solve mazes.
> ```sh
> CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=2,0,1 ./
@@ -1331,43 +1382,50 @@ Either way I have a whole GPU worth of compute just sitting idle. There has to
> | 4096 | 1024 | 8192 | 31.199 | 131.29 | 187.610 | 5.46 |
> | 4096 | 1024 | 12288 | 35.090 | 116.73 | 190.219 | 5.38 |
> ```
+
+> 👤 **Thireus** replied on **2025-06-05** at **21:34:55**
>
-> 👤 **Thireus** replied the **2025-06-05** at **21:34:55**:
> Sorry if this is a silly question, but aren't unsloth's quants supported on ik_llama? I can see they load but fatal error occurs on inference.
+
+> 👤 **randoentity** replied on **2025-06-05** at **22:01:02**
>
-> 👤 **randoentity** replied the **2025-06-05** at **22:01:02**:
> @Thireus ah yeah, try disabling fmoe.
+
+> 👤 **ikawrakow** replied on **2025-06-06** at **06:09:51**
>
-> 👤 **ikawrakow** replied the **2025-06-06** at **06:09:51**:
-> Does #495 solve the `-fmoe` issue with Unsloth's model?
+> Does [#495](https://github.com/ikawrakow/ik_llama.cpp/issues/495) solve the `-fmoe` issue with Unsloth's model?
+
+> 👤 **randoentity** replied on **2025-06-06** at **12:53:56**
>
-> 👤 **randoentity** replied the **2025-06-06** at **12:53:56**:
> For those with multi-GPU having uneven bandwidth (i.e. different number of lanes or PCIe generation): try playing with `--tensor-split`. I got from 175 PP 5.6 TG to 200 PP 6.0 TG by setting it to 1,0,0. Having fewer full layers on the fastest GPU, but more tensors overall seems to give a modest boost.
>
> I also found that `-amb` doesn't do much for speeds, so setting it to 64 frees up some memory (lower doesn't work).
>
> Finally, the bf16 compilation option prevents use of ctk q8_0, and I have to double check this still, but the speed boost doesn't seem significant on the R4 quant.
+
+> 👤 **ikawrakow** replied on **2025-06-06** at **13:30:09**
>
-> 👤 **ikawrakow** replied the **2025-06-06** at **13:30:09**:
> > Finally, the bf16 compilation option prevents use of ctk q8_0
>
> This would be news to me.
>
> > I also found that -amb doesn't do much for speeds, so setting it to 64 frees up some memory (lower doesn't work).
>
-> For your specific system, with the specific model you are using. The `-amb` option was added in PR #260, which has an explanation what it does. Please don't recommend `-amb 64` as a general truth to others.
+> For your specific system, with the specific model you are using. The `-amb` option was added in PR [#260](https://github.com/ikawrakow/ik_llama.cpp/issues/260), which has an explanation what it does. Please don't recommend `-amb 64` as a general truth to others.
+
+> 👤 **randoentity** replied on **2025-06-06** at **14:51:22**
>
-> 👤 **randoentity** replied the **2025-06-06** at **14:51:22**:
-> I've created #499 for the error.
+> I've created [#499](https://github.com/ikawrakow/ik_llama.cpp/issues/499) for the error.
>
> Thanks for the link to the explanation for `-amb`! I didn't mean to spread misinformation, sorry. It was meant in the context of multi-GPU, this model, and this quant.
+
+> 👤 **Ph0rk0z** replied on **2025-06-06** at **15:32:34**
>
-> 👤 **Ph0rk0z** replied the **2025-06-06** at **15:32:34**:
> I have set BF16 and almost always use Q8 cache with different AMB, including 64. It shrinks the compute buffer so you can fit another piece of a layer or layer itself. For me it also didn't do much for speeds on it's own. Best to benchmark. Has worked both with deepseek and qwen including the IQ1 unsloth.
---
-👤 **cmoncure** replied the **2025-06-03** at **21:25:22**:
+👤 **cmoncure** commented on **2025-06-03** at **21:25:22**
Can anyone explain to me in simple terms. When considering tensor offload configurations, what exactly is the nature of the stickiness or entanglement between tensors? What tensors MUST go together as an indivisible unit?
@@ -1387,10 +1445,12 @@ But, you want to do something like this?
Are these **impossible** for REASONS or just "not supported" i.e. go learn the domain and write the code myself?
-> 👤 **Thireus** replied the **2025-06-03** at **21:32:54**:
+> 👤 **Thireus** replied on **2025-06-03** at **21:32:54**
+>
> I'm reading this answer - https://chatgpt.com/share/683f69cc-bff8-800f-8610-55aa4de145ed
+
+> 👤 **ubergarm** replied on **2025-06-03** at **23:25:38**
>
-> 👤 **ubergarm** replied the **2025-06-03** at **23:25:38**:
> @cmoncure
>
> Zero offense intended, and just being a mirror, but for some reason I have a hard time understanding your writing for some reason. Perhaps you're just asking broad questions beyond my level of understanding as my brain is usually in the weeds ignoring the forest to mix my metaphores haha... Are you maybe copy pasting ai generated stuff as I never type unicode checks and x's. Anyway, just working on my communication, thanks.
@@ -1406,11 +1466,13 @@ Are these **impossible** for REASONS or just "not supported" i.e. go learn the d
> > Are these impossible for REASONS or just "not supported" i.e. go learn the domain and write the code myself?
>
> mu
+
+> 👤 **cmoncure** replied on **2025-06-04** at **00:12:27**
>
-> 👤 **cmoncure** replied the **2025-06-04** at **00:12:27**:
> "go learn the domain and write the code yourself" then, got it.
+
+> 👤 **cmoncure** replied on **2025-06-04** at **00:23:01**
>
-> 👤 **cmoncure** replied the **2025-06-04** at **00:23:01**:
> > attn=CUDA0, blk.3=CUDA1, exps=CPU
>
> > If “blk.3” means “all of layer 3 (attention + feed‑forward)” goes to CUDA:1, but you also try to put “attention” itself (the subcomponent of layer 3) on CUDA:0, you’ve overlapped. The “attention” sub‐block lives partly on CUDA:0 (its matmuls → exps) and partly on CUDA:1 (the rest of the layer 3). As soon as you compute context = softmax(scores) @ V, you need Q/K/V and the output projection to be together. If some of those weights/activations are on CUDA:1 and some on CUDA:0, you’d have to copy intermediates back and forth in the middle of that attention forward. In practice, no mainstream codebase will (a) know how to break attention in exactly two devices at the same time, or (b) optimize all of those back‑and‑forth copies.
@@ -1418,15 +1480,16 @@ Are these **impossible** for REASONS or just "not supported" i.e. go learn the d
> Well, let's look at this helpful and reasonable explanation from ChatGPT. All is well and good here! No codebase can handle this scenario where the whole of layer 3 (attention + feed forward) goes to CUDA1, but attention remains on CUDA0, because the activations get split between CUDA0 and CUDA1. Totally makes sense.
>
> Okay well, how then does this work when I do `-ot attn=CUDA0 exps=CPU`? Now attention is on CUDA0 and feed forward is on CPU... they are split! IMPOSSIBLE! ... impossible, right ChatGPT? :face_exhaling:
+
+> 👤 **Ph0rk0z** replied on **2025-06-04** at **11:33:47**
>
-> 👤 **Ph0rk0z** replied the **2025-06-04** at **11:33:47**:
> >ffn_(up|gate) computation was optimized in such a way that I'd recommend not putting those on different devices for a given layer.
>
> So that explains why that causes being GPU bound. It seems I can put individual ups or gates on GPU vs CPU but I can't put one up or gate from the same layer on different GPUs. Both up/gate on the same GPU speeds things up though.
---
-👤 **cmoncure** replied the **2025-06-06** at **15:04:26**:
+👤 **cmoncure** commented on **2025-06-06** at **15:04:26**
Day 4 of chasing performance with bespoke repacking and the delicate and mercurial (i.e. broken) configuration args. I'm ready to give up. I tried so many blends of tensor offload parameters and statically repacking my head is spinning. Nothing I tried can reach the high water marks of:
16 TG t/s with `--rtr -ot attn=CUDA0` _(but bad PP)_
@@ -1436,19 +1499,22 @@ I made a repacked quant that converts only the exps tensors running on CPU to _r
The domain may seem like black magic but at the end of the day all we're doing here is matrix multiplication. My instinct is screaming at me that there's huge amounts of performance left on the table. The wild and frankly shocking comment that "high gpu utilization is actually a bad thing" notwithstanding, the goal is to get the most math done per unit time as possible. It's very telling that seemingly no one can give an explanation that holds water of what operations must be tied to one another on a compute device, or why the tensors can be split in one way between CPU and CUDA0 but as soon as you extend the split to involve CUDA1 the performance bombs. We want to run big models on commodity hardware and that means finding the way of distributing the computation among multiple relatively-low-capacity compute units that maximizes the contribution of all the units.
-> 👤 **Thireus** replied the **2025-06-06** at **15:08:15**:
+> 👤 **Thireus** replied on **2025-06-06** at **15:08:15**
+>
> Don't give up so soon! I'm in the same boat and I need motivation. 😂
>
> Which model/quant and ik_llama build are you using?
+
+> 👤 **cmoncure** replied on **2025-06-06** at **15:48:32**
>
-> 👤 **cmoncure** replied the **2025-06-06** at **15:48:32**:
> version: 3722 (7a8abe29)
>
> bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF and my various repackings of it.
> `./ik_llama.cpp/build/bin/llama-quantize --repack --repack-pattern "blk.(11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60).ffn_gate_exps","blk.(11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60).ffn_down_exps","blk.(11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57|58|59|60).ffn_up_exps" ~/AIModels/textgen/deepseek-ai_DeepSeek-V3-0324-Q4_K_M-V2-00001-of-00011.gguf ~/AIModels/textgen/repacked5.gguf COPY
> `
+
+> 👤 **VinnyG9** replied on **2025-06-11** at **03:31:14**
>
-> 👤 **VinnyG9** replied the **2025-06-11** at **03:31:14**:
> > Day 4 of chasing performance with bespoke repacking and the delicate and mercurial (i.e. broken) configuration args. I'm ready to give up. I tried so many blends of tensor offload parameters and statically repacking my head is spinning. Nothing I tried can reach the high water marks of: 16 TG t/s with `--rtr -ot attn=CUDA0` _(but bad PP)_ 200 PP t/s with no repacking and `-sm layer -ngl 8` _(but bad TG)_
> >
> > I made a repacked quant that converts only the exps tensors running on CPU to _r4 (exps 11...60) and run everything else on CUDA0 and CUDA1 with --sm layer. It should be the best of both worlds, but it's the worst of both worlds: PP 71 and TG 9.
@@ -1456,25 +1522,28 @@ The domain may seem like black magic but at the end of the day all we're doing h
> > The domain may seem like black magic but at the end of the day all we're doing here is matrix multiplication. My instinct is screaming at me that there's huge amounts of performance left on the table. The wild and frankly shocking comment that "high gpu utilization is actually a bad thing" notwithstanding, the goal is to get the most math done per unit time as possible. It's very telling that seemingly no one can give an explanation that holds water of what operations must be tied to one another on a compute device, or why the tensors can be split in one way between CPU and CUDA0 but as soon as you extend the split to involve CUDA1 the performance bombs. We want to run big models on commodity hardware and that means finding the way of distributing the computation among multiple relatively-low-capacity compute units that maximizes the contribution of all the units.
>
> here fellow OCD, see if [this](https://www.reddit.com/r/LocalLLaMA/comments/1kpe33n/comment/msxzv0s/) helps
+
+> 👤 **cmoncure** replied on **2025-06-11** at **19:21:38**
>
-> 👤 **cmoncure** replied the **2025-06-11** at **19:21:38**:
> I can't use this approach at all because as soon as I try to involve CUDA1 with `-sm none` and `-mg` the code attempts to allocate 1.5 trillion bytes of memory on the GPU (four times the size of the entire model tensors)
+
+> 👤 **saood06** replied on **2025-06-12** at **01:38:20**
>
-> 👤 **saood06** replied the **2025-06-12** at **01:38:20**:
> @cmoncure
>
> Are you building with `-DGGML_SCHED_MAX_COPIES=1`?
>
> That may be needed for now to avoid that issue, see https://github.com/ikawrakow/ik_llama.cpp/issues/437#issuecomment-2954768207
+
+> 👤 **VinnyG9** replied on **2025-06-13** at **18:06:09**
>
-> 👤 **VinnyG9** replied the **2025-06-13** at **18:06:09**:
> > I can't use this approach at all because as soon as I try to involve CUDA1 with `-sm none` and `-mg` the code attempts to allocate 1.5 trillion bytes of memory on the GPU (four times the size of the entire model tensors)
>
> set ngl to all minus 1 layer
---
-👤 **Gaolingx** replied the **2025-06-06** at **18:13:10**:
+👤 **Gaolingx** commented on **2025-06-06** at **18:13:10**
I am running on 1x epyc 9334qs + 12x ddr5 6400mhz(works on 4800mhz) 48g + 3070 16g, **~10.3t/s TG, ~78t/s PP**, it works well, but the VRAM has used about 12GB, I am not sure how large a context window(`--ctx-size`) I can open.
@@ -1503,17 +1572,18 @@ parameter:
---
-👤 **ciprianveg** replied the **2025-06-06** at **18:18:44**:
+👤 **ciprianveg** commented on **2025-06-06** at **18:18:44**
Add -b 4096 -ub 4096 and you will have 3x your pp speed
-> 👤 **zts9989** replied the **2025-06-26** at **01:36:14**:
+> 👤 **zts9989** replied on **2025-06-26** at **01:36:14**
+>
> https://github.com/ggml-org/llama.cpp/issues/14325
> Thanks.
---
-👤 **saood06** replied the **2025-06-11** at **15:05:50**:
+👤 **saood06** commented on **2025-06-11** at **15:05:50**
So I finally cooked a quant after sitting on the BF16 for so long.
@@ -1526,7 +1596,7 @@ Running sweep right now but early impressions are good enough that I may end up
---
-👤 **zts9989** replied the **2025-06-26** at **01:44:18**:
+👤 **zts9989** commented on **2025-06-26** at **01:44:18**
Thank you for the discussion. Sharing my experimental results for your reference.
@@ -1535,17 +1605,20 @@ Thank you for the discussion. Sharing my experimental results for your reference
https://github.com/ggml-org/llama.cpp/issues/14325
-> 👤 **saood06** replied the **2025-06-26** at **01:58:27**:
+> 👤 **saood06** replied on **2025-06-26** at **01:58:27**
+>
> You said in the linked post:
>
> >I tested ik llamacpp and found some performance improvements, but the stability was insufficient (there also seem to be other issues with usability and stability)
>
> Can you make issues for the usability and stability problems you mentioned.
+
+> 👤 **zts9989** replied on **2025-06-26** at **02:03:56**
>
-> 👤 **zts9989** replied the **2025-06-26** at **02:03:56**:
> Absolutely. I can provide that shortly. Please excuse the informal nature of my issue description—it's based more on observational feel than quantitative metrics or official specifications. Much of the feedback I provide within the llama.cpp community tends to reflect practical usage experiences rather than technical documentation standards.
+
+> 👤 **saood06** replied on **2025-06-26** at **02:09:12**
>
-> 👤 **saood06** replied the **2025-06-26** at **02:09:12**:
> > Absolutely. I can provide that shortly.
>
> Thanks.
@@ -1553,22 +1626,25 @@ https://github.com/ggml-org/llama.cpp/issues/14325
> >Please excuse the informal nature of my issue description—it's based more on observational feel than quantitative metrics or official specifications. Much of the feedback I provide within the llama.cpp community tends to reflect practical usage experiences rather than technical documentation standards.
>
> No worries, I've seen your feedback to llama.cpp (especially your NUMA stuff) and in my view it is very useful.
+
+> 👤 **zts9989** replied on **2025-06-26** at **03:38:50**
>
-> 👤 **zts9989** replied the **2025-06-26** at **03:38:50**:
> My sincere apologies, I retract what I said (Please forgive me for trying to use ik llama.cpp the same way I use the standard llama.cpp, which led to unexpected results. For example, with llama-cli, I didn't add the -cnv switch, so the model went off the rails and generated output I didn't expect).
>
> ik llama.cpp does offer a performance improvement over standard llama.cpp. Speed increased from 17.4 t/s (llama.cpp) to 18.xx t/s (ik).
>
> **Apologies again. (I'm really sorry.)**
+
+> 👤 **ikawrakow** replied on **2025-06-26** at **06:48:24**
>
-> 👤 **ikawrakow** replied the **2025-06-26** at **06:48:24**:
> The recommended batch/u-batch size for `ik_llama.cpp` **with MoE models** is 4096 tokens (if you have enough RAM/VRAM; derfault u-batch is perfectly fine for dense models). Performance gains beyond 4096 are quite minor and do not justify the massive increase of compute buffer sizes. Some users go up to 6144. A batch/u-batch size of 16384 is really pushing it.
>
> You are reporting a few percent performance benefit for TG with `ik_llama.cpp` vs `llama.cpp`. The difference in PP should be quite a bit larger, no? Interesting you are not looking at that, considering that the whole thread is about batch/u-batch size, which only matters for PP.
>
> Having to add `-cnv` in `ik_llama.cpp` is my personal preference. This is how `llama.cpp` used to behave as well, and I'm annoyed each time I want to use `llama-cli` in `llama.cpp` for a quick performance/coherence check when it starts in conversation mode rather than completing my prompt. And because I don't use mainline very often, each time I need to go and check if it was `--no-conv` or `-no-conv` to disable the conversation mode. Extremely annoying.
+
+> 👤 **zts9989** replied on **2025-06-26** at **08:17:43**
>
-> 👤 **zts9989** replied the **2025-06-26** at **08:17:43**:
> PP (Prompt Processing) speed in ik_llama.cpp is significantly faster than in standard llama.cpp.
> At a batch size of 8192, llama.cpp achieves 170 tokens/s while ik_llama.cpp reaches 200 tokens/s (I will provide screenshot evidence later).
> At a batch size of 16384, llama.cpp achieves 270 tokens/s, but ik_llama.cpp enters an infinite loop and generates irrelevant outputs. This prevented further performance testing (my screenshot evidence here is insufficient since terminating the process via Ctrl+C doesn’t log PP/TG metrics).
@@ -1587,7 +1663,7 @@ https://github.com/ggml-org/llama.cpp/issues/14325
---
-👤 **zts9989** replied the **2025-06-26** at **08:21:32**:
+👤 **zts9989** commented on **2025-06-26** at **08:21:32**
PP (Prompt Processing) speed in ik_llama.cpp is significantly faster than in standard llama.cpp.
At a batch size of 8192, llama.cpp achieves 170 tokens/s while ik_llama.cpp reaches 200 tokens/s (I will provide screenshot evidence later).
@@ -1624,13 +1700,16 @@ Screenshot evidence will be attached as noted.


-> 👤 **ikawrakow** replied the **2025-06-26** at **09:04:13**:
+> 👤 **ikawrakow** replied on **2025-06-26** at **09:04:13**
+>
> I suggest you try `-mla 3 -fmoe`. If you run out of VRAM, add `-amb 512`. For the 36k tokens you are processing you should get a very significant performance boost in PP performance.
+
+> 👤 **Thireus** replied on **2025-06-26** at **09:14:12**
>
-> 👤 **Thireus** replied the **2025-06-26** at **09:14:12**:
> @zts9989 - Yep, similar observations here https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13367713 ;)
+
+> 👤 **zts9989** replied on **2025-06-26** at **09:17:36**
>
-> 👤 **zts9989** replied the **2025-06-26** at **09:17:36**:
> > I suggest you try `-mla 3 -fmoe`. If you run out of VRAM, add `-amb 512`. For the 36k tokens you are processing you should get a very significant performance boost in PP performance.
>
> 
@@ -1643,7 +1722,7 @@ Screenshot evidence will be attached as noted.
---
-👤 **zts9989** replied the **2025-06-26** at **09:56:07**:
+👤 **zts9989** commented on **2025-06-26** at **09:56:07**
Turns out I was using ik llama.cpp incorrectly all along.
Coming full circle, I'm back to square one:
@@ -1653,7 +1732,7 @@ Thanks!
---
-👤 **ikawrakow** replied the **2025-06-26** at **10:38:51**:
+👤 **ikawrakow** commented on **2025-06-26** at **10:38:51**
> Please optimize the ggml_cuda_cpy function to support copying tensors larger than 2GB.
@@ -1661,17 +1740,20 @@ I can see what I can do, but I don't feel particularly motivated to engage in hu
But based on your performance numbers, I estimate you have a 30 GB/s PCI-E, so it takes about 13 seconds to upload all experts stored in RAM to the GPU(s). For u-batch size of 16k tokens you are getting 347 t/s, so the u-batch takes about 47 seconds, so computation is about 34 seconds (and it is easy to verify that this napkin math works for u-batches of 8k and 4k). If you would go to u-batch size of 32k tokens, computation for the batch will at least double, offload time will stay the same, so it will be taking about 81 seconds, so performance will be in the range of 390 t/s. In reality when batch sizes become very large, computing performance goes down due to limited caches, etc, so I'm guessing you will saturate around 350-360 t/s. If I look at the 8k u-batch size, I estimate you have in the range of 30 GB of unused VRAM. Hence, you could have uploaded 5 or 6 layers of experts to the GPU. That would slightly increase your PP performance, and will also boost your TG performance by about 10%.
-> 👤 **zts9989** replied the **2025-06-26** at **13:02:20**:
+> 👤 **zts9989** replied on **2025-06-26** at **13:02:20**
+>
> I just gave it a try.
> My GPU is connected via PCIe 4.0 x16, so the bandwidth is around 30 GB/s. 347 t/s really seems to be the current limit for my setup. I experimented with a batch size of 32,768 tokens, but performance actually decreased. I also tried pre-loading experts into the available GPU VRAM – the gain was minimal (just from 17.3 to 17.5 t/s).
>
> Thanks for the suggestions though. I've now secured a runtime environment with higher-performance PP.
+
+> 👤 **ikawrakow** replied on **2025-06-26** at **17:37:09**
>
-> 👤 **ikawrakow** replied the **2025-06-26** at **17:37:09**:
-> Does PR #560 let you compute the context that fails on the main branch with batch/u-batch of 16k tokens?
+> Does PR [#560](https://github.com/ikawrakow/ik_llama.cpp/issues/560) let you compute the context that fails on the main branch with batch/u-batch of 16k tokens?
+
+> 👤 **zts9989** replied on **2025-06-27** at **02:46:28**
>
-> 👤 **zts9989** replied the **2025-06-27** at **02:46:28**:
-> > Does PR #560 let you compute the context that fails on the main branch with batch/u-batch of 16k tokens?
+> > Does PR [#560](https://github.com/ikawrakow/ik_llama.cpp/issues/560) let you compute the context that fails on the main branch with batch/u-batch of 16k tokens?
>
> I tried this version, and it still crashed after 131,072. This time it wasn't an error in the cuda cpy, but in the cuda compute. It might really be exceeding the limit.
>
@@ -1680,7 +1762,7 @@ But based on your performance numbers, I estimate you have a 30 GB/s PCI-E, so i
---
-👤 **eous** replied the **2025-07-10** at **21:20:45**:
+👤 **eous** commented on **2025-07-10** at **21:20:45**
Just a couple benchmark dumps.
@@ -1792,10 +1874,12 @@ Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0
```
-> 👤 **ikawrakow** replied the **2025-07-11** at **04:57:13**:
+> 👤 **ikawrakow** replied on **2025-07-11** at **04:57:13**
+>
> What is the model in these benchmarks?
+
+> 👤 **ubergarm** replied on **2025-07-11** at **06:22:06**
>
-> 👤 **ubergarm** replied the **2025-07-11** at **06:22:06**:
> @ikawrakow
>
> I believe it is [ubergarm/DeepSeek-TNG-R1T2-Chimera/IQ1_S at 132.915 GiB (1.699 BPW) quant](https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF#-iq1_s-132915-gib-1699-bpw)
@@ -1824,15 +1908,17 @@ Build cuda_12.9.r12.9/compiler.36037853_0
> It is a blend of two of the smallest yet faster CUDA inferencing quants, IQ2_KS and slightly smaller IQ2_XXS for the routed exps. The perplexity is better too at around ~4.0 so should be a little "smarter" than the smaller IQ1_S.
>
> Uploading now, should be live within a couple hours!
+
+> 👤 **ikawrakow** replied on **2025-07-11** at **07:15:12**
>
-> 👤 **ikawrakow** replied the **2025-07-11** at **07:15:12**:
> Oh, I see. That's why it is fully loaded in VRAM. Very impressive.
>
> Can one get 800 t/s PP and 40+ t/s TG with any of llama.cpp, KTransformers, vLLM, sglang, ... with this setup?
>
> @ubergarm If you are targeting a fully offloaded setup, isn't `IQ2_KT` the best option? It beets `IQ2_XXS` and `IQ2_KS` in terms of PPL and GPU performance.
+
+> 👤 **ubergarm** replied on **2025-07-11** at **14:55:34**
>
-> 👤 **ubergarm** replied the **2025-07-11** at **14:55:34**:
> @ikawrakow
>
> > Can one get 800 t/s PP and 40+ t/s TG with any of llama.cpp, KTransformers, vLLM, sglang, ... with this setup?
@@ -1848,8 +1934,9 @@ Build cuda_12.9.r12.9/compiler.36037853_0
> It will take a bit longer to cook and calculate perplexity as I can't offload it all, but I'm too curious now not to try! Thanks!
>
> PS. I'm still not sure the best way to handle that odd shaped `attn_k_b.*=q4_0`... It could go to `iq4_nl` but I'm honestly not even sure if it is actually used or if the corresponding versions of that tensor are used.
+
+> 👤 **ikawrakow** replied on **2025-07-11** at **15:21:02**
>
-> 👤 **ikawrakow** replied the **2025-07-11** at **15:21:02**:
> `IQ2_KT` TG performance on CUDA is pretty good, at least on my RTX-4080. It is in the same ballpark as `IQ2_XXS/IQ2_KS`.
>
> The `attn_k_b` and `attn_v_b` tensors get used for TG. The `attn_kv_b` tensors that `ik_llama.cpp` creates on-the-fly are used for PP (when MLA = 2, 3). To avoid potential accuracy loss due to re-quantization, the `attn_kv_b` tensors get created as `Q8_0`.
@@ -1857,8 +1944,9 @@ Build cuda_12.9.r12.9/compiler.36037853_0
> Surprised to see `llama.cpp` pulling ahead for TG. I guess one needs to see the exact compositions of these models as theirs may be larger on disk, but use fewer bits during inference.
>
> What about KTransformers? They for sure can do `IQ1_S` after copy/pasting it from here.
+
+> 👤 **ubergarm** replied on **2025-07-11** at **21:21:31**
>
-> 👤 **ubergarm** replied the **2025-07-11** at **21:21:31**:
> @ikawrakow
>
> > The attn_k_b and attn_v_b tensors get used for TG. The attn_kv_b tensors that ik_llama.cpp creates on-the-fly are used for PP (when MLA = 2, 3). To avoid potential accuracy loss due to re-quantization, the attn_kv_b tensors get created as Q8_0.
@@ -1915,11 +2003,12 @@ Build cuda_12.9.r12.9/compiler.36037853_0
---
-👤 **magikRUKKOLA** replied the **2025-07-10** at **21:31:25**:
+👤 **magikRUKKOLA** commented on **2025-07-10** at **21:31:25**
MOVED: https://github.com/ikawrakow/ik_llama.cpp/discussions/258#discussioncomment-13726226
-> 👤 **ubergarm** replied the **2025-07-10** at **23:41:21**:
+> 👤 **ubergarm** replied on **2025-07-10** at **23:41:21**
+>
> @magikRUKKOLA
>
> Thanks for bringing the discussion over here, explaining your goal of running as much context as possible up to 160k (model max) on the least VRAM possible, and showing your hardware setup.
@@ -1982,8 +2071,9 @@ MOVED: https://github.com/ikawrakow/ik_llama.cpp/discussions/258#discussioncomme
>
>
> So you have 3x 3090s and how much RAM? You can easily achieve full 160k context while offloading additional layers for max PP and TG speeds.
+
+> 👤 **magikRUKKOLA** replied on **2025-07-10** at **23:48:40**
>
-> 👤 **magikRUKKOLA** replied the **2025-07-10** at **23:48:40**:
> @ubergarm
>
> > I'm not sure how you came to this conclusion?
@@ -1993,15 +2083,16 @@ MOVED: https://github.com/ikawrakow/ik_llama.cpp/discussions/258#discussioncomme
> > So you have 3x 3090s and how much RAM?
>
> 512 GB RAM
+
+> 👤 **ubergarm** replied on **2025-07-10** at **23:52:42**
>
-> 👤 **ubergarm** replied the **2025-07-10** at **23:52:42**:
> lmao so sorry, I realized after refreshing that it was moved over *there* so replied there for the next step! xD
>
> yeah you have plenty of ram and VRAM, we can get u going 160k context no problemo
---
-👤 **magikRUKKOLA** replied the **2025-07-14** at **15:07:55**:
+👤 **magikRUKKOLA** commented on **2025-07-14** at **15:07:55**
Lets update the perplexity vs llm size graph. I suggest we use svg.
@@ -2015,6 +2106,7 @@ with qr codes [following[ to the huggingface['s short-version domain name hf.co
[INSTRUCTIONS TO GENETATE SVG]
* the colours for the figures are generated deterministically, via the name of the quant in the config.
+* the trendline goes via the pareto-optimal quants.
To generate the svg graph perplexity vs llm size keep the data in config.json:
@@ -2045,262 +2137,265 @@ To generate the svg graph perplexity vs llm size keep the data in config.json:
and use the make.sh script ( ./make.sh --logscale config.json > ppl-log.svg ) to generate the svg file:
```bash
-#!/bin/bash
-
-# Usage: ./generate_chart.sh [--logscale] config.json > output.svg
+#!/usr/bin/env bash
+set -euo pipefail
+# ------------------------------------------------------------------
+# 1. CLI
+# ------------------------------------------------------------------
logscale=0
-if [ "$1" = "--logscale" ]; then
- logscale=1
- shift
-fi
-
-if [ $# -ne 1 ]; then
- echo "Usage: $0 [--logscale] " >&2
- exit 1
-fi
-
-config_file="$1"
-
-# Verify config file exists
-[ ! -f "$config_file" ] && echo "Error: Config file not found" >&2 && exit 1
-
-# QR code directory
+[[ $# -ge 1 && $1 == "--logscale" ]] && { logscale=1; shift; }
+[[ $# -eq 1 ]] || { echo "Usage: $0 [--logscale] config.json" >&2; exit 1; }
+config=$1
+[[ -f $config ]] || { echo "Config file not found" >&2; exit 1; }
+
+# ------------------------------------------------------------------
+# 2. QR-codes (never touch stdin)
+# ------------------------------------------------------------------
qr_dir="qrcodes"
mkdir -p "$qr_dir"
-
-# Collect URLs and generate QR codes
-jq -r '.data[] | select(.url) | .url' "$config_file" | while read -r url; do
- # Shorten URL
- short_url=$(sed 's|https://huggingface.co/|hf.co/|' <<< "$url")
- # Generate hash for filename
- hash=$(echo -n "$short_url" | md5sum | awk '{print $1}')
- qr_file="$qr_dir/$hash.svg"
-
- # Only generate if doesn't exist
- if [ ! -f "$qr_file" ]; then
- tmpfile="$qr_dir/${hash}_tmp.svg"
- qrencode --inline -t svg -l L -s 1 -m 0 "$short_url" -o "$tmpfile"
- svgo --multipass -q "$tmpfile" -o "$qr_file" 2>/dev/null
- rm -f "$tmpfile"
+while IFS= read -r url; do
+ [[ -z $url ]] && continue
+ short=${url//https:\/\/huggingface.co\//hf.co/}
+ hash=$(printf '%s' "$short" | md5sum | awk '{print $1}')
+ file="$qr_dir/$hash.svg"
+ [[ -f $file ]] && continue
+ tmp="$qr_dir/${hash}_tmp.svg"
+ qrencode --inline -t svg -l L -s 1 -m 0 "$short" -o "$tmp"
+ svgo --multipass -q "$tmp" -o "$file" 2>/dev/null
+ rm -f "$tmp"
+done < <(jq -r '.data[] | select(.url) | .url' "$config")
+
+# ------------------------------------------------------------------
+# 3. Pre-compute .size and limits
+# ------------------------------------------------------------------
+mp=$(jq -r '.model_parameters' "$config")
+min_ppl=$(jq -r '.data | min_by(.ppl).ppl' "$config")
+max_ppl=$(jq -r '.data | max_by(.ppl).ppl' "$config")
+
+sizes=()
+while IFS= read -r size; do
+ sizes+=("$size")
+done < <(jq -r --arg mp "$mp" '.data[] | .bpw * ($mp|tonumber) / 8 / 1024 / 1024 / 1024 | round * 1.0' "$config")
+
+max_sz=0
+for size in "${sizes[@]}"; do
+ if (( $(echo "$size > $max_sz" | bc -l) )); then
+ max_sz=$size
fi
done
-
-# Extract model parameters and data
-title=$(jq -r '.title // "Quantization Analysis"' "$config_file")
-subtitle=$(jq -r '.subtitle // "Lower perplexity = Better performance"' "$config_file")
-model_params=$(jq -r '.model_parameters' "$config_file")
-if [ $logscale -eq 1 ]; then
- subtitle+=" (Y-axis: Log-Difference Scale)"
-fi
-
-# Calculate model sizes in GB: (bpw * parameters) / 8 / 1024^3
-data=$(jq --arg model_params "$model_params" '
- .data |= map(.size = (.bpw * ($model_params | tonumber) / 8 / (1024*1024*1024) | round * 1.0))
- | .data
-' "$config_file")
-
-num_points=$(jq -r 'length' <<< "$data")
-[ "$num_points" -eq 0 ] && echo "Error: No data points" >&2 && exit 1
-
-# Extract min/max perplexity and max size
-min_ppl=$(jq -r 'min_by(.ppl) | .ppl' <<< "$data")
-max_ppl=$(jq -r 'max_by(.ppl) | .ppl' <<< "$data")
-max_size=$(jq -r 'max_by(.size) | .size' <<< "$data")
-
-# Calculate rounded max size (next multiple of 64)
-max_size_rounded=$(awk -v max="$max_size" 'BEGIN { rounded = int((max + 63) / 64) * 64; if (rounded < 64) rounded=64; print rounded }')
-
-# Pre-calculate logarithmic values if needed
-if [ $logscale -eq 1 ]; then
- # Calculate range and epsilon (1% of range for smoothing)
- range=$(awk -v min="$min_ppl" -v max="$max_ppl" 'BEGIN { print max - min }')
- epsilon=$(awk -v range="$range" 'BEGIN { print range / 100.0 }')
-
- # Calculate transformed min/max values
- t_min=$(awk -v epsilon="$epsilon" 'BEGIN { print log(epsilon)/log(10) }')
- t_max=$(awk -v range="$range" -v epsilon="$epsilon" 'BEGIN { print log(range + epsilon) / log(10) }')
- t_range=$(awk -v t_min="$t_min" -v t_max="$t_max" 'BEGIN { print t_max - t_min }')
+max_round=$(awk -v m="$max_sz" 'BEGIN{r=int((m+63)/64)*64; print (r<64?64:r)}')
+
+title=$(jq -r '.title // "Quantization Analysis"' "$config")
+subtitle=$(jq -r '.subtitle // "Lower perplexity = better"' "$config")
+[[ $logscale -eq 1 ]] && subtitle+=" (log-difference scale)"
+
+if [[ $logscale -eq 1 ]]; then
+ rng=$(awk -v min="$min_ppl" -v max="$max_ppl" 'BEGIN{print max-min}')
+ eps=$(awk -v r="$rng" 'BEGIN{print r/100}')
+ t_min=$(awk -v e="$eps" 'BEGIN{print log(e)/log(10)}')
+ t_range=$(awk -v min="$min_ppl" -v max="$max_ppl" -v e="$eps" \
+ 'BEGIN{tmax=log(max-min+e)/log(10); print tmax-log(e)/log(10)}')
else
- ppl_range=$(awk -v min="$min_ppl" -v max="$max_ppl" 'BEGIN { print max - min }')
+ ppl_rng=$(awk -v min="$min_ppl" -v max="$max_ppl" 'BEGIN{print max-min}')
fi
-# Dimensions
-top_margin=100
-chart_height=400
-gap=50
-legend_height=$((50 + num_points * 40))
-bottom_margin=5
-total_height=$((top_margin + chart_height + gap + legend_height + bottom_margin))
-
-# Color functions
-generate_color() {
- echo -n "$1" | md5sum | awk '{print "#" substr($1,1,6)}'
-}
+# ------------------------------------------------------------------
+# 4. Pareto indices
+# ------------------------------------------------------------------
+pareto_i=()
+item_count=$(jq '.data | length' "$config")
+
+for ((i=0; i
+
-This mix functions (albeit a bit slow for my liking) and we know q8 functions as it was tested before closing #285.
+This mix functions (albeit a bit slow for my liking) and we know q8 functions as it was tested before closing [#285](https://github.com/ikawrakow/ik_llama.cpp/issues/285).
I have had two non functional mixes so far as mentioned in https://github.com/ikawrakow/ik_llama.cpp/pull/295#issuecomment-2762814972 and the comments that follow.
@@ -1922,3566 +1922,1117 @@ Command used to make this fourth quant
---
-👤 **saood06** commented the **2025-03-30** at **20:23:35**:
+👤 **ubergarm** commented on **2025-03-30** at **21:11:28**
-My working mix:
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 246 tensors
-llama_model_loader: - type iq4_k_r4: 357 tensors
-llama_model_loader: - type iq5_k_r4: 61 tensors
+> This mix functions (albeit a bit slow for my liking) and we know q8 functions as it was tested before closing https://github.com/ikawrakow/ik_llama.cpp/issues/285.
-Full quant log below:
+Okay, thanks for confirming success with those tensor types. I'll re-cooking again just changing `q8_0_r8` to `q8_0` to see if there is any effect. Plus it would allow use on GPU.
-
+> The second broken mix (where I was going to test setting output.weight to iq6_k), I ended up realizing after I tested it I messed up the custom quant rule and it actually ended up being q6_k_r4 for both blk.X.attn_output.weight and output.weight
-Log
+> ...4th mix going back to bartowski's and it is also not functional....
-./bin/llama-quantize --allow-requantize --imatrix /mnt/sda/deepseek-ai_DeepSeek-V3-0324.imatrix --token-embedding-type q8_0 --output-tensor-type q8_0 /mnt/sda/deepseek-ai_DeepSeek-V3-0324-Q8_0/deepseek-a i_DeepSeek-V3-0324-Q8_0-00001-of-00020.gguf /mnt/sda/DeepSeek-V3-0324-IQ4_K_R4.gguf IQ4_K_R4 48
-load_imatrix: imatrix dataset='/workspace/calibration_datav3.txt'
-load_imatrix: loaded 720 importance matrix entries from /mnt/sda/deepseek-ai_DeepSeek-V3-0324.imatrix computed on 124 chunks
-prepare_imatrix: have 720 importance matrix entries
-main: build = 3617 (f31aca2d)
-main: built with gcc (Clear Linux OS for Intel Architecture) 14.2.1 20241210 releases/gcc-14.2.0-551-g21a09f0507 for x86_64-generic-linux
-main: quantizing '/mnt/sda/deepseek-ai_DeepSeek-V3-0324-Q8_0/deepseek-ai_DeepSeek-V3-0324-Q8_0-00001-of-00020.gguf' to '/mnt/sda/DeepSeek-V3-0324-IQ4_K_R4.gguf' as IQ4_K_R4 using 48 threads
-llama_model_loader: additional 19 GGUFs metadata loaded.
-llama_model_loader: loaded meta data with 47 key-value pairs and 1025 tensors from /mnt/sda/deepseek-ai_DeepSeek-V3-0324-Q8_0/deepseek-ai_DeepSeek-V3-0324-Q8_0-00001-of-00020.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = deepseek2
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324 Bf16
-llama_model_loader: - kv 3: general.size_label str = 256x20B
-llama_model_loader: - kv 4: general.license str = mit
-llama_model_loader: - kv 5: deepseek2.block_count u32 = 61
-llama_model_loader: - kv 6: deepseek2.context_length u32 = 163840
-llama_model_loader: - kv 7: deepseek2.embedding_length u32 = 7168
-llama_model_loader: - kv 8: deepseek2.feed_forward_length u32 = 18432
-llama_model_loader: - kv 9: deepseek2.attention.head_count u32 = 128
-llama_model_loader: - kv 10: deepseek2.attention.head_count_kv u32 = 128
-llama_model_loader: - kv 11: deepseek2.rope.freq_base f32 = 10000.000000
-llama_model_loader: - kv 12: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 13: deepseek2.expert_used_count u32 = 8
-llama_model_loader: - kv 14: deepseek2.leading_dense_block_count u32 = 3
-llama_model_loader: - kv 15: deepseek2.vocab_size u32 = 129280
-llama_model_loader: - kv 16: deepseek2.attention.q_lora_rank u32 = 1536
-llama_model_loader: - kv 17: deepseek2.attention.kv_lora_rank u32 = 512
-llama_model_loader: - kv 18: deepseek2.attention.key_length u32 = 192
-llama_model_loader: - kv 19: deepseek2.attention.value_length u32 = 128
-llama_model_loader: - kv 20: deepseek2.expert_feed_forward_length u32 = 2048
-llama_model_loader: - kv 21: deepseek2.expert_count u32 = 256
-llama_model_loader: - kv 22: deepseek2.expert_shared_count u32 = 1
-llama_model_loader: - kv 23: deepseek2.expert_weights_scale f32 = 2.500000
-llama_model_loader: - kv 24: deepseek2.expert_weights_norm bool = true
-llama_model_loader: - kv 25: deepseek2.expert_gating_func u32 = 2
-llama_model_loader: - kv 26: deepseek2.rope.dimension_count u32 = 64
-llama_model_loader: - kv 27: deepseek2.rope.scaling.type str = yarn
-llama_model_loader: - kv 28: deepseek2.rope.scaling.factor f32 = 40.000000
-llama_model_loader: - kv 29: deepseek2.rope.scaling.original_context_length u32 = 4096
-llama_model_loader: - kv 30: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
-llama_model_loader: - kv 31: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 32: tokenizer.ggml.pre str = deepseek-v3
-llama_model_loader: - kv 33: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<▒...
-llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
-llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
-llama_model_loader: - kv 36: tokenizer.ggml.bos_token_id u32 = 0
-llama_model_loader: - kv 37: tokenizer.ggml.eos_token_id u32 = 1
-llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 1
-llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 40: tokenizer.ggml.add_eos_token bool = false
-llama_model_loader: - kv 41: tokenizer.chat_template str = {% if not add_generation_prompt is de...
-llama_model_loader: - kv 42: general.quantization_version u32 = 2
-llama_model_loader: - kv 43: general.file_type u32 = 7
-llama_model_loader: - kv 44: split.no u16 = 0
-llama_model_loader: - kv 45: split.tensors.count i32 = 1025
-llama_model_loader: - kv 46: split.count u16 = 20
+Hrmm, I'm wondering if this has something to do with setting `token_embd.weight` weight to repacked quant types? I'm speculating wildly, hopefully my above test will give another datapoint though.
+
+I recall when I used the offline-repack tool with a `Q8_0` it converted everything to `q8_0_r8` except for one tensor, which stuck out to me but I didn't think much of it at the time:
+```
+[1/1025] token_embd.weight - [ 7168, 129280,1,1], type = q8_0, size = 938.984 MB, type = q8_0
+```
+
+> I just finished testing a 4th mix going back to bartowski's and it is also not functional. It seems to babble vaguely related tokens to ones that make sense before it turns to Alright spam (although the probability of Alright is not actually 100% so it will deviate).
+
+I see, yeah a lot of variables in play with multiple imatrix files and all. Interesting it also babbles `Alright` sometimes.
+
+Anyway, I'll keep you posted if I get one cooked up that seems to be working better and narrow down if it is anything odd going on or just not all quants play well together on this model.
+
+---
+
+👤 **saood06** commented on **2025-03-30** at **21:30:18**
+
+> > This mix functions (albeit a bit slow for my liking) and we know q8 functions as it was tested before closing [[#285](https://github.com/ikawrakow/ik_llama.cpp/issues/285)](https://github.com/ikawrakow/ik_llama.cpp/issues/285).
+
+> Okay, thanks for confirming success with those tensor types. I'll re-cooking again just changing `q8_0_r8` to `q8_0` to see if there is any effect. Plus it would allow use on GPU.
+
+Thanks, I don't have any mix cooking right, but I could do one overnight to test another mix if that would be helpful.
+
+> Hrmm, I'm wondering if this has something to do with setting `token_embd.weight` weight to repacked quant types? I'm speculating wildly, hopefully my above test will give another datapoint though.
+>
+> I recall when I used the offline-repack tool with a `Q8_0` it converted everything to `q8_0_r8` except for one tensor, which stuck out to me but I didn't think much of it at the time:
+>
+> ```
+> [1/1025] token_embd.weight - [ 7168, 129280,1,1], type = q8_0, size = 938.984 MB, type = q8_0
+> ```
+>
+
+I don't think it is a wild speculation.
+
+It might be the reason, see [this](https://github.com/ikawrakow/ik_llama.cpp/pull/272/files#diff-b74fdb6e796b36d230cafcbff50ebd34cf27bd55b6b4ca0ad5a2ff8191b1066bR6784-R6786) and [this](https://github.com/ikawrakow/ik_llama.cpp/blob/4819257ce66a680608cf9c7871156041d00eb7da/src/llama.cpp#L16920).
+
+Also now that you do mention it I do think something about this was brought up at some point but I can't remember where (so no reference).
+
+> > I just finished testing a 4th mix going back to bartowski's and it is also not functional. It seems to babble vaguely related tokens to ones that make sense before it turns to Alright spam (although the probability of Alright is not actually 100% so it will deviate).
+>
+> I see, yeah a lot of variables in play with multiple imatrix files and all. Interesting it also babbles `Alright` sometimes.
+>
+> Anyway, I'll keep you posted if I get one cooked up that seems to be working better and narrow down if it is anything odd going on or just not all quants play well together on this model.
+
+Thanks, I'll do the same.
+
+---
+
+👤 **ubergarm** commented on **2025-03-31** at **00:42:54**
+
+> I don't think it is a wild speculation.
+>
+> It might be the reason, see [this](https://github.com/ikawrakow/ik_llama.cpp/pull/272/files#diff-b74fdb6e796b36d230cafcbff50ebd34cf27bd55b6b4ca0ad5a2ff8191b1066bR6784-R6786) and [this](https://github.com/ikawrakow/ik_llama.cpp/blob/4819257ce66a680608cf9c7871156041d00eb7da/src/llama.cpp#L16920).
+>
+> Also now that you do mention it I do think something about this was brought up at some point but I can't remember where (so no reference).
+
+Wow thanks, you are really good with keeping track of so much disperate information and links haha...
+
+Seems like logic for `token_embd.weight` is `if (new_type == GGML_TYPE_Q8_0_R8) { new_type = GGML_TYPE_Q8_0; }`
+
+And I am currently testing perplexity on my experiment above using `Q8_0` instead of `Q8_0_R8` quant and its looking just fine:
+
+```
llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 664 tensors
-================================ Have weights data with 720 entries
-[ 1/1025] output.weight - [ 7168, 129280, 1, 1], type = q8_0, size = 938.984 MB
-[ 2/1025] output_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 3/1025] token_embd.weight - [ 7168, 129280, 1, 1], type = q8_0, size = 938.984 MB
-[ 4/1025] blk.0.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 5/1025] blk.0.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 6/1025] blk.0.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 7/1025] blk.0.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 8/1025] blk.0.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 9/1025] blk.0.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 10/1025] blk.0.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 11/1025] blk.0.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 12/1025] blk.0.ffn_down.weight - [18432, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 133.88 MiB -> 70.88 MiB
-[ 13/1025] blk.0.ffn_gate.weight - [ 7168, 18432, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 133.88 MiB -> 70.88 MiB
-[ 14/1025] blk.0.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 15/1025] blk.0.ffn_up.weight - [ 7168, 18432, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 133.88 MiB -> 70.88 MiB
-[ 16/1025] blk.1.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 17/1025] blk.1.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 18/1025] blk.1.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 19/1025] blk.1.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 20/1025] blk.1.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 21/1025] blk.1.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 22/1025] blk.1.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 23/1025] blk.1.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 24/1025] blk.1.ffn_down.weight - [18432, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 133.88 MiB -> 70.88 MiB
-[ 25/1025] blk.1.ffn_gate.weight - [ 7168, 18432, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 133.88 MiB -> 70.88 MiB
-[ 26/1025] blk.1.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 27/1025] blk.1.ffn_up.weight - [ 7168, 18432, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 133.88 MiB -> 70.88 MiB
-[ 28/1025] blk.2.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 29/1025] blk.2.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 30/1025] blk.2.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 31/1025] blk.2.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 32/1025] blk.2.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 33/1025] blk.2.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 34/1025] blk.2.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 35/1025] blk.2.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 36/1025] blk.2.ffn_down.weight - [18432, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 133.88 MiB -> 70.88 MiB
-[ 37/1025] blk.2.ffn_gate.weight - [ 7168, 18432, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 133.88 MiB -> 70.88 MiB
-[ 38/1025] blk.2.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 39/1025] blk.2.ffn_up.weight - [ 7168, 18432, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 133.88 MiB -> 70.88 MiB
-[ 40/1025] blk.3.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 41/1025] blk.3.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 42/1025] blk.3.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 43/1025] blk.3.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 44/1025] blk.3.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 45/1025] blk.3.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 46/1025] blk.3.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 47/1025] blk.3.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 48/1025] blk.3.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 49/1025] blk.3.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 50/1025] blk.3.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 51/1025] blk.3.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 52/1025] blk.3.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 53/1025] blk.3.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 54/1025] blk.3.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 55/1025] blk.3.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 56/1025] blk.3.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 57/1025] blk.4.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 58/1025] blk.4.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 59/1025] blk.4.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 60/1025] blk.4.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 61/1025] blk.4.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 62/1025] blk.4.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 63/1025] blk.4.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 64/1025] blk.4.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 65/1025] blk.4.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 66/1025] blk.4.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 67/1025] blk.4.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 68/1025] blk.4.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 69/1025] blk.4.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 70/1025] blk.4.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 71/1025] blk.4.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 72/1025] blk.4.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 73/1025] blk.4.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 74/1025] blk.5.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 75/1025] blk.5.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 76/1025] blk.5.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 77/1025] blk.5.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 78/1025] blk.5.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 79/1025] blk.5.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 80/1025] blk.5.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 81/1025] blk.5.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 82/1025] blk.5.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 83/1025] blk.5.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 84/1025] blk.5.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 85/1025] blk.5.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 86/1025] blk.5.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 87/1025] blk.5.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 88/1025] blk.5.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 89/1025] blk.5.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 90/1025] blk.5.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 91/1025] blk.6.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 92/1025] blk.6.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 93/1025] blk.6.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 94/1025] blk.6.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 95/1025] blk.6.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 96/1025] blk.6.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 97/1025] blk.6.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 98/1025] blk.6.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 99/1025] blk.6.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 100/1025] blk.6.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 101/1025] blk.6.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 102/1025] blk.6.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 103/1025] blk.6.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 104/1025] blk.6.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 105/1025] blk.6.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 106/1025] blk.6.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 107/1025] blk.6.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 108/1025] blk.7.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 109/1025] blk.7.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 110/1025] blk.7.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 111/1025] blk.7.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 112/1025] blk.7.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 113/1025] blk.7.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 114/1025] blk.7.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 115/1025] blk.7.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 116/1025] blk.7.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 117/1025] blk.7.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 118/1025] blk.7.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 119/1025] blk.7.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 120/1025] blk.7.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 121/1025] blk.7.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 122/1025] blk.7.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 123/1025] blk.7.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 124/1025] blk.7.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 125/1025] blk.8.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 126/1025] blk.8.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 127/1025] blk.8.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 128/1025] blk.8.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 129/1025] blk.8.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 130/1025] blk.8.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 131/1025] blk.8.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 132/1025] blk.8.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 133/1025] blk.8.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 134/1025] blk.8.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 135/1025] blk.8.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 136/1025] blk.8.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 137/1025] blk.8.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 138/1025] blk.8.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 139/1025] blk.8.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 140/1025] blk.8.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 141/1025] blk.8.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 142/1025] blk.9.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 143/1025] blk.9.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 144/1025] blk.9.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 145/1025] blk.9.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 146/1025] blk.9.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 147/1025] blk.9.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 148/1025] blk.9.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 149/1025] blk.9.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 150/1025] blk.9.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 151/1025] blk.9.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 152/1025] blk.9.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 153/1025] blk.9.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 154/1025] blk.9.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 155/1025] blk.9.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 156/1025] blk.9.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 157/1025] blk.9.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 158/1025] blk.9.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 159/1025] blk.10.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 160/1025] blk.10.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 161/1025] blk.10.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 162/1025] blk.10.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 163/1025] blk.10.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 164/1025] blk.10.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 165/1025] blk.10.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 166/1025] blk.10.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 167/1025] blk.10.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 168/1025] blk.10.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 169/1025] blk.10.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 170/1025] blk.10.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 171/1025] blk.10.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 172/1025] blk.10.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 173/1025] blk.10.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 174/1025] blk.10.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 175/1025] blk.10.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 176/1025] blk.11.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 177/1025] blk.11.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 178/1025] blk.11.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 179/1025] blk.11.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 180/1025] blk.11.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 181/1025] blk.11.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 182/1025] blk.11.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 183/1025] blk.11.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 184/1025] blk.11.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 185/1025] blk.11.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 186/1025] blk.11.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 187/1025] blk.11.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 188/1025] blk.11.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 189/1025] blk.11.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 190/1025] blk.11.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 191/1025] blk.11.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 192/1025] blk.11.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 193/1025] blk.12.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 194/1025] blk.12.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 195/1025] blk.12.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 196/1025] blk.12.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 197/1025] blk.12.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 198/1025] blk.12.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 199/1025] blk.12.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 200/1025] blk.12.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 201/1025] blk.12.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 202/1025] blk.12.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 203/1025] blk.12.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 204/1025] blk.12.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 205/1025] blk.12.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 206/1025] blk.12.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 207/1025] blk.12.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 208/1025] blk.12.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 209/1025] blk.12.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 210/1025] blk.13.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 211/1025] blk.13.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 212/1025] blk.13.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 213/1025] blk.13.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 214/1025] blk.13.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 215/1025] blk.13.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 216/1025] blk.13.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 217/1025] blk.13.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 218/1025] blk.13.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 219/1025] blk.13.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 220/1025] blk.13.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 221/1025] blk.13.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 222/1025] blk.13.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 223/1025] blk.13.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 224/1025] blk.13.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 225/1025] blk.13.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 226/1025] blk.13.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 227/1025] blk.14.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 228/1025] blk.14.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 229/1025] blk.14.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 230/1025] blk.14.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 231/1025] blk.14.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 232/1025] blk.14.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 233/1025] blk.14.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 234/1025] blk.14.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 235/1025] blk.14.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 236/1025] blk.14.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 237/1025] blk.14.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 238/1025] blk.14.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 239/1025] blk.14.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 240/1025] blk.14.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 241/1025] blk.14.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 242/1025] blk.14.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 243/1025] blk.14.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 244/1025] blk.15.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 245/1025] blk.15.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 246/1025] blk.15.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 247/1025] blk.15.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 248/1025] blk.15.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 249/1025] blk.15.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 250/1025] blk.15.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 251/1025] blk.15.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 252/1025] blk.15.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 253/1025] blk.15.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 254/1025] blk.15.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 255/1025] blk.15.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 256/1025] blk.15.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 257/1025] blk.15.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 258/1025] blk.15.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 259/1025] blk.15.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 260/1025] blk.15.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 261/1025] blk.16.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 262/1025] blk.16.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 263/1025] blk.16.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 264/1025] blk.16.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 265/1025] blk.16.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 266/1025] blk.16.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 267/1025] blk.16.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 268/1025] blk.16.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 269/1025] blk.16.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 270/1025] blk.16.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 271/1025] blk.16.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 272/1025] blk.16.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 273/1025] blk.16.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 274/1025] blk.16.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 275/1025] blk.16.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 276/1025] blk.16.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 277/1025] blk.16.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 278/1025] blk.17.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 279/1025] blk.17.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 280/1025] blk.17.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 281/1025] blk.17.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 282/1025] blk.17.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 283/1025] blk.17.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 284/1025] blk.17.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 285/1025] blk.17.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 286/1025] blk.17.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 287/1025] blk.17.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 288/1025] blk.17.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 289/1025] blk.17.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 290/1025] blk.17.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 291/1025] blk.17.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 292/1025] blk.17.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 293/1025] blk.17.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 294/1025] blk.17.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 295/1025] blk.18.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 296/1025] blk.18.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 297/1025] blk.18.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 298/1025] blk.18.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 299/1025] blk.18.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 300/1025] blk.18.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 301/1025] blk.18.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 302/1025] blk.18.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 303/1025] blk.18.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 304/1025] blk.18.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 305/1025] blk.18.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 306/1025] blk.18.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 307/1025] blk.18.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 308/1025] blk.18.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 309/1025] blk.18.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 310/1025] blk.18.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 311/1025] blk.18.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 312/1025] blk.19.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 313/1025] blk.19.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 314/1025] blk.19.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 315/1025] blk.19.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 316/1025] blk.19.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 317/1025] blk.19.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 318/1025] blk.19.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 319/1025] blk.19.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 320/1025] blk.19.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 321/1025] blk.19.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 322/1025] blk.19.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 323/1025] blk.19.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 324/1025] blk.19.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 325/1025] blk.19.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 326/1025] blk.19.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 327/1025] blk.19.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 328/1025] blk.19.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 329/1025] blk.20.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 330/1025] blk.20.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 331/1025] blk.20.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 332/1025] blk.20.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 333/1025] blk.20.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 334/1025] blk.20.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 335/1025] blk.20.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 336/1025] blk.20.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 337/1025] blk.20.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 338/1025] blk.20.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 339/1025] blk.20.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 340/1025] blk.20.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 341/1025] blk.20.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 342/1025] blk.20.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 343/1025] blk.20.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 344/1025] blk.20.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 345/1025] blk.20.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 346/1025] blk.21.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 347/1025] blk.21.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 348/1025] blk.21.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 349/1025] blk.21.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 350/1025] blk.21.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 351/1025] blk.21.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 352/1025] blk.21.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 353/1025] blk.21.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 354/1025] blk.21.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 355/1025] blk.21.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 356/1025] blk.21.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 357/1025] blk.21.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 358/1025] blk.21.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 359/1025] blk.21.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 360/1025] blk.21.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 361/1025] blk.21.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 362/1025] blk.21.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 363/1025] blk.22.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 364/1025] blk.22.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 365/1025] blk.22.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 366/1025] blk.22.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 367/1025] blk.22.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 368/1025] blk.22.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 369/1025] blk.22.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 370/1025] blk.22.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 371/1025] blk.22.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 372/1025] blk.22.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 373/1025] blk.22.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 374/1025] blk.22.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 375/1025] blk.22.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 376/1025] blk.22.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 377/1025] blk.22.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 378/1025] blk.22.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 379/1025] blk.22.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 380/1025] blk.23.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 381/1025] blk.23.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 382/1025] blk.23.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 383/1025] blk.23.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 384/1025] blk.23.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 385/1025] blk.23.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 386/1025] blk.23.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 387/1025] blk.23.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 388/1025] blk.23.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 389/1025] blk.23.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 390/1025] blk.23.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 391/1025] blk.23.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 392/1025] blk.23.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 393/1025] blk.23.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 394/1025] blk.23.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 395/1025] blk.23.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 396/1025] blk.23.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 397/1025] blk.24.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 398/1025] blk.24.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 399/1025] blk.24.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 400/1025] blk.24.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 401/1025] blk.24.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 402/1025] blk.24.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 403/1025] blk.24.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 404/1025] blk.24.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 405/1025] blk.24.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 406/1025] blk.24.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 407/1025] blk.24.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 408/1025] blk.24.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 409/1025] blk.24.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 410/1025] blk.24.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 411/1025] blk.24.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 412/1025] blk.24.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 413/1025] blk.24.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 414/1025] blk.25.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 415/1025] blk.25.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 416/1025] blk.25.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 417/1025] blk.25.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 418/1025] blk.25.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 419/1025] blk.25.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 420/1025] blk.25.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 421/1025] blk.25.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 422/1025] blk.25.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 423/1025] blk.25.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 424/1025] blk.25.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 425/1025] blk.25.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 426/1025] blk.25.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 427/1025] blk.25.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 428/1025] blk.25.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 429/1025] blk.25.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 430/1025] blk.25.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 431/1025] blk.26.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 432/1025] blk.26.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 433/1025] blk.26.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 434/1025] blk.26.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 435/1025] blk.26.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 436/1025] blk.26.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 437/1025] blk.26.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 438/1025] blk.26.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 439/1025] blk.26.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 440/1025] blk.26.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 441/1025] blk.26.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 442/1025] blk.26.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 443/1025] blk.26.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 444/1025] blk.26.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 445/1025] blk.26.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 446/1025] blk.26.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 447/1025] blk.26.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 448/1025] blk.27.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 449/1025] blk.27.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 450/1025] blk.27.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 451/1025] blk.27.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 452/1025] blk.27.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 453/1025] blk.27.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 454/1025] blk.27.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 455/1025] blk.27.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 456/1025] blk.27.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 457/1025] blk.27.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 458/1025] blk.27.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 459/1025] blk.27.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 460/1025] blk.27.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 461/1025] blk.27.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 462/1025] blk.27.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 463/1025] blk.27.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 464/1025] blk.27.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 465/1025] blk.28.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 466/1025] blk.28.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 467/1025] blk.28.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 468/1025] blk.28.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 469/1025] blk.28.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 470/1025] blk.28.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 471/1025] blk.28.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 472/1025] blk.28.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 473/1025] blk.28.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 474/1025] blk.28.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 475/1025] blk.28.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 476/1025] blk.28.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 477/1025] blk.28.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 478/1025] blk.28.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 479/1025] blk.28.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 480/1025] blk.28.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 481/1025] blk.28.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 482/1025] blk.29.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 483/1025] blk.29.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 484/1025] blk.29.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 485/1025] blk.29.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 486/1025] blk.29.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 487/1025] blk.29.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 488/1025] blk.29.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 489/1025] blk.29.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 490/1025] blk.29.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 491/1025] blk.29.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 492/1025] blk.29.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 493/1025] blk.29.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 494/1025] blk.29.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 495/1025] blk.29.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 496/1025] blk.29.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 497/1025] blk.29.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 498/1025] blk.29.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 499/1025] blk.30.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 500/1025] blk.30.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 501/1025] blk.30.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 502/1025] blk.30.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 503/1025] blk.30.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 504/1025] blk.30.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 505/1025] blk.30.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 506/1025] blk.30.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 507/1025] blk.30.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 508/1025] blk.30.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 509/1025] blk.30.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 510/1025] blk.30.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 511/1025] blk.30.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 512/1025] blk.30.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 513/1025] blk.30.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 514/1025] blk.30.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 515/1025] blk.30.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 516/1025] blk.31.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 517/1025] blk.31.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 518/1025] blk.31.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 519/1025] blk.31.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 520/1025] blk.31.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 521/1025] blk.31.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 522/1025] blk.31.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 523/1025] blk.31.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 524/1025] blk.31.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 525/1025] blk.31.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 526/1025] blk.31.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 527/1025] blk.31.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 528/1025] blk.31.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 529/1025] blk.31.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 530/1025] blk.31.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 531/1025] blk.31.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 532/1025] blk.31.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 533/1025] blk.32.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 534/1025] blk.32.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 535/1025] blk.32.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 536/1025] blk.32.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 537/1025] blk.32.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 538/1025] blk.32.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 539/1025] blk.32.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 540/1025] blk.32.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 541/1025] blk.32.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 542/1025] blk.32.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 543/1025] blk.32.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 544/1025] blk.32.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 545/1025] blk.32.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 546/1025] blk.32.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 547/1025] blk.32.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 548/1025] blk.32.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 549/1025] blk.32.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 550/1025] blk.33.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 551/1025] blk.33.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 552/1025] blk.33.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 553/1025] blk.33.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 554/1025] blk.33.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 555/1025] blk.33.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 556/1025] blk.33.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 557/1025] blk.33.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 558/1025] blk.33.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 559/1025] blk.33.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 560/1025] blk.33.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 561/1025] blk.33.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 562/1025] blk.33.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 563/1025] blk.33.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 564/1025] blk.33.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 565/1025] blk.33.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 566/1025] blk.33.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 567/1025] blk.34.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 568/1025] blk.34.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 569/1025] blk.34.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 570/1025] blk.34.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 571/1025] blk.34.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 572/1025] blk.34.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 573/1025] blk.34.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 574/1025] blk.34.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 575/1025] blk.34.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 576/1025] blk.34.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 577/1025] blk.34.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 578/1025] blk.34.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 579/1025] blk.34.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 580/1025] blk.34.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 581/1025] blk.34.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 582/1025] blk.34.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 583/1025] blk.34.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 584/1025] blk.35.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 585/1025] blk.35.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 586/1025] blk.35.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 587/1025] blk.35.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 588/1025] blk.35.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 589/1025] blk.35.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 590/1025] blk.35.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 591/1025] blk.35.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 592/1025] blk.35.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 593/1025] blk.35.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 594/1025] blk.35.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 595/1025] blk.35.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 596/1025] blk.35.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 597/1025] blk.35.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 598/1025] blk.35.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 599/1025] blk.35.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 600/1025] blk.35.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 601/1025] blk.36.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 602/1025] blk.36.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 603/1025] blk.36.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 604/1025] blk.36.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 605/1025] blk.36.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 606/1025] blk.36.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 607/1025] blk.36.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 608/1025] blk.36.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 609/1025] blk.36.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 610/1025] blk.36.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 611/1025] blk.36.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 612/1025] blk.36.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 613/1025] blk.36.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 614/1025] blk.36.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 615/1025] blk.36.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 616/1025] blk.36.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 617/1025] blk.36.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 618/1025] blk.37.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 619/1025] blk.37.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 620/1025] blk.37.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 621/1025] blk.37.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 622/1025] blk.37.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 623/1025] blk.37.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 624/1025] blk.37.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 625/1025] blk.37.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 626/1025] blk.37.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 627/1025] blk.37.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 628/1025] blk.37.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 629/1025] blk.37.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 630/1025] blk.37.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 631/1025] blk.37.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 632/1025] blk.37.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 633/1025] blk.37.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 634/1025] blk.37.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 635/1025] blk.38.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 636/1025] blk.38.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 637/1025] blk.38.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 638/1025] blk.38.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 639/1025] blk.38.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 640/1025] blk.38.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 641/1025] blk.38.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 642/1025] blk.38.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 643/1025] blk.38.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 644/1025] blk.38.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 645/1025] blk.38.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 646/1025] blk.38.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 647/1025] blk.38.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 648/1025] blk.38.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 649/1025] blk.38.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 650/1025] blk.38.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 651/1025] blk.38.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 652/1025] blk.39.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 653/1025] blk.39.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 654/1025] blk.39.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 655/1025] blk.39.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 656/1025] blk.39.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 657/1025] blk.39.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 658/1025] blk.39.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 659/1025] blk.39.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 660/1025] blk.39.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 661/1025] blk.39.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 662/1025] blk.39.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 663/1025] blk.39.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 664/1025] blk.39.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 665/1025] blk.39.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 666/1025] blk.39.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 667/1025] blk.39.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 668/1025] blk.39.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 669/1025] blk.40.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 670/1025] blk.40.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 671/1025] blk.40.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 672/1025] blk.40.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 673/1025] blk.40.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 674/1025] blk.40.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 675/1025] blk.40.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 676/1025] blk.40.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 677/1025] blk.40.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 678/1025] blk.40.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 679/1025] blk.40.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 680/1025] blk.40.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 681/1025] blk.40.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 682/1025] blk.40.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 683/1025] blk.40.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 684/1025] blk.40.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 685/1025] blk.40.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 686/1025] blk.41.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 687/1025] blk.41.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 688/1025] blk.41.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 689/1025] blk.41.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 690/1025] blk.41.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 691/1025] blk.41.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 692/1025] blk.41.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 693/1025] blk.41.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 694/1025] blk.41.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 695/1025] blk.41.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 696/1025] blk.41.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 697/1025] blk.41.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 698/1025] blk.41.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 699/1025] blk.41.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 700/1025] blk.41.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 701/1025] blk.41.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 702/1025] blk.41.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 703/1025] blk.42.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 704/1025] blk.42.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 705/1025] blk.42.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 706/1025] blk.42.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 707/1025] blk.42.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 708/1025] blk.42.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 709/1025] blk.42.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 710/1025] blk.42.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 711/1025] blk.42.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 712/1025] blk.42.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 713/1025] blk.42.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 714/1025] blk.42.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 715/1025] blk.42.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 716/1025] blk.42.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 717/1025] blk.42.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 718/1025] blk.42.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 719/1025] blk.42.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 720/1025] blk.43.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 721/1025] blk.43.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 722/1025] blk.43.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 723/1025] blk.43.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 724/1025] blk.43.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 725/1025] blk.43.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 726/1025] blk.43.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 727/1025] blk.43.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 728/1025] blk.43.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 729/1025] blk.43.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 730/1025] blk.43.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 731/1025] blk.43.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 732/1025] blk.43.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 733/1025] blk.43.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 734/1025] blk.43.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 735/1025] blk.43.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 736/1025] blk.43.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 737/1025] blk.44.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 738/1025] blk.44.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 739/1025] blk.44.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 740/1025] blk.44.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 741/1025] blk.44.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 742/1025] blk.44.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 743/1025] blk.44.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 744/1025] blk.44.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 745/1025] blk.44.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 746/1025] blk.44.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 747/1025] blk.44.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 748/1025] blk.44.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 749/1025] blk.44.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 750/1025] blk.44.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 751/1025] blk.44.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 752/1025] blk.44.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 753/1025] blk.44.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 754/1025] blk.45.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 755/1025] blk.45.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 756/1025] blk.45.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 757/1025] blk.45.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 758/1025] blk.45.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 759/1025] blk.45.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 760/1025] blk.45.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 761/1025] blk.45.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 762/1025] blk.45.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 763/1025] blk.45.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 764/1025] blk.45.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 765/1025] blk.45.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 766/1025] blk.45.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 767/1025] blk.45.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 768/1025] blk.45.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 769/1025] blk.45.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 770/1025] blk.45.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 771/1025] blk.46.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 772/1025] blk.46.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 773/1025] blk.46.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 774/1025] blk.46.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 775/1025] blk.46.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 776/1025] blk.46.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 777/1025] blk.46.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 778/1025] blk.46.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 779/1025] blk.46.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 780/1025] blk.46.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 781/1025] blk.46.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 782/1025] blk.46.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 783/1025] blk.46.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 784/1025] blk.46.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 785/1025] blk.46.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 786/1025] blk.46.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 787/1025] blk.46.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 788/1025] blk.47.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 789/1025] blk.47.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 790/1025] blk.47.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 791/1025] blk.47.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 792/1025] blk.47.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 793/1025] blk.47.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 794/1025] blk.47.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 795/1025] blk.47.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 796/1025] blk.47.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 797/1025] blk.47.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 798/1025] blk.47.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 799/1025] blk.47.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 800/1025] blk.47.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 801/1025] blk.47.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 802/1025] blk.47.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 803/1025] blk.47.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 804/1025] blk.47.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 805/1025] blk.48.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 806/1025] blk.48.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 807/1025] blk.48.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 808/1025] blk.48.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 809/1025] blk.48.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 810/1025] blk.48.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 811/1025] blk.48.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 812/1025] blk.48.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 813/1025] blk.48.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 814/1025] blk.48.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 815/1025] blk.48.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 816/1025] blk.48.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 817/1025] blk.48.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 818/1025] blk.48.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 819/1025] blk.48.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 820/1025] blk.48.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 821/1025] blk.48.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 822/1025] blk.49.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 823/1025] blk.49.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 824/1025] blk.49.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 825/1025] blk.49.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 826/1025] blk.49.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 827/1025] blk.49.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 828/1025] blk.49.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 829/1025] blk.49.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 830/1025] blk.49.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 831/1025] blk.49.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 832/1025] blk.49.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 833/1025] blk.49.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 834/1025] blk.49.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 835/1025] blk.49.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 836/1025] blk.49.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 837/1025] blk.49.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 838/1025] blk.49.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 839/1025] blk.50.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 840/1025] blk.50.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 841/1025] blk.50.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 842/1025] blk.50.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 843/1025] blk.50.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 844/1025] blk.50.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 845/1025] blk.50.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 846/1025] blk.50.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 847/1025] blk.50.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 848/1025] blk.50.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 849/1025] blk.50.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 850/1025] blk.50.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 851/1025] blk.50.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 852/1025] blk.50.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 853/1025] blk.50.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 854/1025] blk.50.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 855/1025] blk.50.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 856/1025] blk.51.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 857/1025] blk.51.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 858/1025] blk.51.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 859/1025] blk.51.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 860/1025] blk.51.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 861/1025] blk.51.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 862/1025] blk.51.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 863/1025] blk.51.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 864/1025] blk.51.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 865/1025] blk.51.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 866/1025] blk.51.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 867/1025] blk.51.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 868/1025] blk.51.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 869/1025] blk.51.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 870/1025] blk.51.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 871/1025] blk.51.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 872/1025] blk.51.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 873/1025] blk.52.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 874/1025] blk.52.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 875/1025] blk.52.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 876/1025] blk.52.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 877/1025] blk.52.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 878/1025] blk.52.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 879/1025] blk.52.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 880/1025] blk.52.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 881/1025] blk.52.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 882/1025] blk.52.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 883/1025] blk.52.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 884/1025] blk.52.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 885/1025] blk.52.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 886/1025] blk.52.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 887/1025] blk.52.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 888/1025] blk.52.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 889/1025] blk.52.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 890/1025] blk.53.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 891/1025] blk.53.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 892/1025] blk.53.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 893/1025] blk.53.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 894/1025] blk.53.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 895/1025] blk.53.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 896/1025] blk.53.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 897/1025] blk.53.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 898/1025] blk.53.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 899/1025] blk.53.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 900/1025] blk.53.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 901/1025] blk.53.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 902/1025] blk.53.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 903/1025] blk.53.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 904/1025] blk.53.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 905/1025] blk.53.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 906/1025] blk.53.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 907/1025] blk.54.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 908/1025] blk.54.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 909/1025] blk.54.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 910/1025] blk.54.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 911/1025] blk.54.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 912/1025] blk.54.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 913/1025] blk.54.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 914/1025] blk.54.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 915/1025] blk.54.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 916/1025] blk.54.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 917/1025] blk.54.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 918/1025] blk.54.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 919/1025] blk.54.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 920/1025] blk.54.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 921/1025] blk.54.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 922/1025] blk.54.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 923/1025] blk.54.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 924/1025] blk.55.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 925/1025] blk.55.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 926/1025] blk.55.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 927/1025] blk.55.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 928/1025] blk.55.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 929/1025] blk.55.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 930/1025] blk.55.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 931/1025] blk.55.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 932/1025] blk.55.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 933/1025] blk.55.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 934/1025] blk.55.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 935/1025] blk.55.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 936/1025] blk.55.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 937/1025] blk.55.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 938/1025] blk.55.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 939/1025] blk.55.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 940/1025] blk.55.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 941/1025] blk.56.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 942/1025] blk.56.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 943/1025] blk.56.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 944/1025] blk.56.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 945/1025] blk.56.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 946/1025] blk.56.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 947/1025] blk.56.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 948/1025] blk.56.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 949/1025] blk.56.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 950/1025] blk.56.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 951/1025] blk.56.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 952/1025] blk.56.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 953/1025] blk.56.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 954/1025] blk.56.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 955/1025] blk.56.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 956/1025] blk.56.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 957/1025] blk.56.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 958/1025] blk.57.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 959/1025] blk.57.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 960/1025] blk.57.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 961/1025] blk.57.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 962/1025] blk.57.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 963/1025] blk.57.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 964/1025] blk.57.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 965/1025] blk.57.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 966/1025] blk.57.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 967/1025] blk.57.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 968/1025] blk.57.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 969/1025] blk.57.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 970/1025] blk.57.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 971/1025] blk.57.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 972/1025] blk.57.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 973/1025] blk.57.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 974/1025] blk.57.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 975/1025] blk.58.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 976/1025] blk.58.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 977/1025] blk.58.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 978/1025] blk.58.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 979/1025] blk.58.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 980/1025] blk.58.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 981/1025] blk.58.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 982/1025] blk.58.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[ 983/1025] blk.58.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 984/1025] blk.58.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 985/1025] blk.58.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 986/1025] blk.58.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 987/1025] blk.58.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 988/1025] blk.58.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 989/1025] blk.58.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 990/1025] blk.58.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[ 991/1025] blk.58.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[ 992/1025] blk.59.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[ 993/1025] blk.59.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 994/1025] blk.59.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[ 995/1025] blk.59.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 996/1025] blk.59.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[ 997/1025] blk.59.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[ 998/1025] blk.59.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 999/1025] blk.59.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[1000/1025] blk.59.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[1001/1025] blk.59.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[1002/1025] blk.59.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[1003/1025] blk.59.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[1004/1025] blk.59.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[1005/1025] blk.59.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[1006/1025] blk.59.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[1007/1025] blk.59.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[1008/1025] blk.59.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[1009/1025] blk.60.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = q8_0, size = 4.184 MB
-[1010/1025] blk.60.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[1011/1025] blk.60.attn_kv_b.weight - [ 512, 32768, 1, 1], type = q8_0, size = 17.000 MB
-[1012/1025] blk.60.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[1013/1025] blk.60.attn_output.weight - [16384, 7168, 1, 1], type = q8_0, converting to iq5_k_r4 .. size = 119.00 MiB -> 77.00 MiB
-[1014/1025] blk.60.attn_q_a.weight - [ 7168, 1536, 1, 1], type = q8_0, size = 11.156 MB
-[1015/1025] blk.60.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[1016/1025] blk.60.attn_q_b.weight - [ 1536, 24576, 1, 1], type = q8_0, size = 38.250 MB
-[1017/1025] blk.60.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[1018/1025] blk.60.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[1019/1025] blk.60.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[1020/1025] blk.60.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[1021/1025] blk.60.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[1022/1025] blk.60.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-[1023/1025] blk.60.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[1024/1025] blk.60.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = q8_0, converting to iq4_k_r4 .. size = 3808.00 MiB -> 2016.00 MiB
-[1025/1025] blk.60.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = q8_0, converting to iq4_k_r4 .. size = 14.88 MiB -> 7.88 MiB
-llama_model_quantize_internal: model size = 680237.97 MB
-llama_model_quantize_internal: quant size = 364082.97 MB
-
-main: quantize time = 13350534.07 ms
-main: total time = 13350534.07 ms
-
-
-
-This mix functions (albeit a bit slow for my liking) and we know q8 functions as it was tested before closing #285.
-
-I have had two non functional mixes so far as mentioned in https://github.com/ikawrakow/ik_llama.cpp/pull/295#issuecomment-2762814972 and the comments that follow.
-
-Two things I didn't mention over there though
-
-1) My functional DeepSeek-V3-0324 mix used bartowski's imatrix file and the two non functional one used the one from team mradermacher.
-
-2) The second broken mix (where I was going to test setting output.weight to iq6_k), I ended up realizing after I tested it I messed up the custom quant rule and it actually ended up being q6_k_r4 for both `blk.X.attn_output.weight` and `output.weight` so the fact that it didn't work is even more suprising when looking at it versus the working R1 mix, and why my next mix went back to the imatrix dataset I know worked for me.
-
-I just finished testing a 4th mix going back to bartowski's and it is also not functional. It seems to babble vaguely related tokens to ones that make sense before it turns to `Alright` spam (although the probability of Alright is not actually 100% so it will deviate).
-
-Command used to make this fourth quant
-```
-./llama-quantize --imatrix /mnt/sda/deepseek-ai_DeepSeek-V3-0324.imatrix --custom-q ".*\.attn_output.weight=q5_k_r4,output\.weight=q6_k_r4,.*=iq4_k_r4" /mnt/sda/DeepseekV3_0324/DeepseekV3_0324-256x21B-BF16.gguf /mnt/sda/DeepSeek-V3-0324-IQ4_K_R4_ATT4.gguf IQ4_K_R4 48
-```
-
----
-
-👤 **ubergarm** commented the **2025-03-30** at **21:11:28**:
-
-> This mix functions (albeit a bit slow for my liking) and we know q8 functions as it was tested before closing https://github.com/ikawrakow/ik_llama.cpp/issues/285.
-
-Okay, thanks for confirming success with those tensor types. I'll re-cooking again just changing `q8_0_r8` to `q8_0` to see if there is any effect. Plus it would allow use on GPU.
-
-> The second broken mix (where I was going to test setting output.weight to iq6_k), I ended up realizing after I tested it I messed up the custom quant rule and it actually ended up being q6_k_r4 for both blk.X.attn_output.weight and output.weight
-
-> ...4th mix going back to bartowski's and it is also not functional....
-
-Hrmm, I'm wondering if this has something to do with setting `token_embd.weight` weight to repacked quant types? I'm speculating wildly, hopefully my above test will give another datapoint though.
-
-I recall when I used the offline-repack tool with a `Q8_0` it converted everything to `q8_0_r8` except for one tensor, which stuck out to me but I didn't think much of it at the time:
-```
-[1/1025] token_embd.weight - [ 7168, 129280,1,1], type = q8_0, size = 938.984 MB, type = q8_0
-```
-
-> I just finished testing a 4th mix going back to bartowski's and it is also not functional. It seems to babble vaguely related tokens to ones that make sense before it turns to Alright spam (although the probability of Alright is not actually 100% so it will deviate).
-
-I see, yeah a lot of variables in play with multiple imatrix files and all. Interesting it also babbles `Alright` sometimes.
-
-Anyway, I'll keep you posted if I get one cooked up that seems to be working better and narrow down if it is anything odd going on or just not all quants play well together on this model.
-
----
-
-👤 **ubergarm** commented the **2025-03-30** at **21:11:28**:
-
-> This mix functions (albeit a bit slow for my liking) and we know q8 functions as it was tested before closing https://github.com/ikawrakow/ik_llama.cpp/issues/285.
-
-Okay, thanks for confirming success with those tensor types. I'll re-cooking again just changing `q8_0_r8` to `q8_0` to see if there is any effect. Plus it would allow use on GPU.
-
-> The second broken mix (where I was going to test setting output.weight to iq6_k), I ended up realizing after I tested it I messed up the custom quant rule and it actually ended up being q6_k_r4 for both blk.X.attn_output.weight and output.weight
-
-> ...4th mix going back to bartowski's and it is also not functional....
-
-Hrmm, I'm wondering if this has something to do with setting `token_embd.weight` weight to repacked quant types? I'm speculating wildly, hopefully my above test will give another datapoint though.
-
-I recall when I used the offline-repack tool with a `Q8_0` it converted everything to `q8_0_r8` except for one tensor, which stuck out to me but I didn't think much of it at the time:
-```
-
-[ 1/1025] token_embd.weight - [ 7168, 129280, 1, 1], type = q8_0, size = 938.984 MB, type = q8_0
-```
-
-> I just finished testing a 4th mix going back to bartowski's and it is also not functional. It seems to babble vaguely related tokens to ones that make sense before it turns to Alright spam (although the probability of Alright is not actually 100% so it will deviate).
-
-I see, yeah a lot of variables in play with multiple imatrix files and all. Interesting it also babbles `Alright` sometimes.
-
-Anyway, I'll keep you posted if I get one cooked up that seems to be working better and narrow down if it is anything odd going on or just not all quants play well together on this model.
-
----
-
-👤 **saood06** commented the **2025-03-30** at **21:30:18**:
-
-> > This mix functions (albeit a bit slow for my liking) and we know q8 functions as it was tested before closing [#285](https://github.com/ikawrakow/ik_llama.cpp/issues/285).
-
-> Okay, thanks for confirming success with those tensor types. I'll re-cooking again just changing `q8_0_r8` to `q8_0` to see if there is any effect. Plus it would allow use on GPU.
-
-Thanks, I don't have any mix cooking right, but I could do one overnight to test another mix if that would be helpful.
-
-> Hrmm, I'm wondering if this has something to do with setting `token_embd.weight` weight to repacked quant types? I'm speculating wildly, hopefully my above test will give another datapoint though.
->
-> I recall when I used the offline-repack tool with a `Q8_0` it converted everything to `q8_0_r8` except for one tensor, which stuck out to me but I didn't think much of it at the time:
->
-> ```
-> [1/1025] token_embd.weight - [ 7168, 129280,1,1], type = q8_0, size = 938.984 MB, type = q8_0
-> ```
->
-
-I don't think it is a wild speculation.
-
-It might be the reason, see [this](https://github.com/ikawrakow/ik_llama.cpp/pull/272/files#diff-b74fdb6e796b36d230cafcbff50ebd34cf27bd55b6b4ca0ad5a2ff8191b1066bR6784-R6786) and [this](https://github.com/ikawrakow/ik_llama.cpp/blob/4819257ce66a680608cf9c7871156041d00eb7da/src/llama.cpp#L16920).
-
-Also now that you do mention it I do think something about this was brought up at some point but I can't remember where (so no reference).
-
-> > I just finished testing a 4th mix going back to bartowski's and it is also not functional. It seems to babble vaguely related tokens to ones that make sense before it turns to Alright spam (although the probability of Alright is not actually 100% so it will deviate).
->
-> I see, yeah a lot of variables in play with multiple imatrix files and all. Interesting it also babbles `Alright` sometimes.
->
-> Anyway, I'll keep you posted if I get one cooked up that seems to be working better and narrow down if it is anything odd going on or just not all quants play well together on this model.
-
-Thanks, I'll do the same.
-
----
-
-👤 **saood06** commented the **2025-03-30** at **21:30:18**:
-
-> > This mix functions (albeit a bit slow for my liking) and we know q8 functions as it was tested before closing [#285](https://github.com/ikawrakow/ik_llama.cpp/issues/285).
-
-> Okay, thanks for confirming success with those tensor types. I'll re-cooking again just changing `q8_0_r8` to `q8_0` to see if there is any effect. Plus it would allow use on GPU.
-
-Thanks, I don't have any mix cooking right, but I could do one overnight to test another mix if that would be helpful.
-
-> Hrmm, I'm wondering if this has something to do with setting `token_embd.weight` weight to repacked quant types? I'm speculating wildly, hopefully my above test will give another datapoint though.
->
-> I recall when I used the offline-repack tool with a `Q8_0` it converted everything to `q8_0_r8` except for one tensor, which stuck out to me but I didn't think much of it at the time:
->
-> ```
-> [1/1025] token_embd.weight - [ 7168, 129280,1,1], type = q8_0, size = 938.984 MB, type = q8_0
-> ```
->
-
-I don't think it is a wild speculation.
-
-This might be the reason, see [this](https://github.com/ikawrakow/ik_llama.cpp/pull/272/files#diff-b74fdb6e796b36d230cafcbff50ebd34cf27bd55b6b4ca0ad5a2ff8191b1066bR6784-R6786) and [this](https://github.com/ikawrakow/ik_llama.cpp/blob/4819257ce66a680608cf9c7871156041d00eb7da/src/llama.cpp#L16920).
-
-Also now that you do mention it I do think something about this was brought up at some point but I can't remember where (so no reference).
-
-> > I just finished testing a 4th mix going back to bartowski's and it is also not functional. It seems to babble vaguely related tokens to ones that make sense before it turns to Alright spam (although the probability of Alright is not actually 100% so it will deviate).
->
-> I see, yeah a lot of variables in play with multiple imatrix files and all. Interesting it also babbles `Alright` sometimes.
->
-> Anyway, I'll keep you posted if I get one cooked up that seems to be working better and narrow down if it is anything odd going on or just not all quants play well together on this model.
-
----
-
-👤 **ubergarm** commented the **2025-03-31** at **00:42:54**:
-
-> I don't think it is a wild speculation.
->
-> It might be the reason, see [this](https://github.com/ikawrakow/ik_llama.cpp/pull/272/files#diff-b74fdb6e796b36d230cafcbff50ebd34cf27bd55b6b4ca0ad5a2ff8191b1066bR6784-R6786) and [this](https://github.com/ikawrakow/ik_llama.cpp/blob/4819257ce66a680608cf9c7871156041d00eb7da/src/llama.cpp#L16920).
->
-> Also now that you do mention it I do think something about this was brought up at some point but I can't remember where (so no reference).
-
-Wow thanks, you are really good with keeping track of so much disperate information and links haha...
-
-Seems like logic for `token_embd.weight` is `if (new_type == GGML_TYPE_Q8_0_R8) { new_type = GGML_TYPE_Q8_0; }`
-
-And I am currently testing perplexity on my experiment above using `Q8_0` instead of `Q8_0_R8` quant and its looking just fine:
-
-```
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 612 tensors
-llama_model_loader: - type iq4_k_r4: 116 tensors
-llama_model_loader: - type iq5_k_r4: 58 tensors
-```
-
-So probably yeah, the issue I'm seeing here is because I used `q8_0_r8` for `token_embd.weight` which seems like a known invalid combination.
-
-Gonna let it finish up and curious how good the perplexity is relative to the full `Q8_0` hehe... its addictive...
-
----
-
-*UPDATE* Wow!! `3.2596 +/- 0.01786` for this `DeepSeek-V3-0324-IQ4_K_R4.gguf` quant vs full `Q8_0` at `3.2454 +/- 0.01773` in almost half the size!
-
-```bash
-llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
-
-llama_print_timings: load time = 2327.19 ms
-llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: prompt eval time = 3249602.81 ms / 287232 tokens ( 11.31 ms per token, 88.39 tokens per second)
-llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: total time = 3300377.65 ms / 287233 tokens
-
-Final estimate: PPL = 3.2596 +/- 0.01786
-```
-
-
-
----
-
-👤 **ubergarm** commented the **2025-03-31** at **00:42:54**:
-
-> I don't think it is a wild speculation.
->
-> It might be the reason, see [this](https://github.com/ikawrakow/ik_llama.cpp/pull/272/files#diff-b74fdb6e796b36d230cafcbff50ebd34cf27bd55b6b4ca0ad5a2ff8191b1066bR6784-R6786) and [this](https://github.com/ikawrakow/ik_llama.cpp/blob/4819257ce66a680608cf9c7871156041d00eb7da/src/llama.cpp#L16920).
->
-> Also now that you do mention it I do think something about this was brought up at some point but I can't remember where (so no reference).
-
-Wow thanks, you are really good with keeping track of so much disperate information and links haha...
-
-Seems like logic for `token_embd.weight` is `if (new_type == GGML_TYPE_Q8_0_R8) { new_type = GGML_TYPE_Q8_0; }`
-
-And I am currently testing perplexity on my experiment above using `Q8_0` instead of `Q8_0_R8` quant and its looking just fine:
-
-```
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 612 tensors
-llama_model_loader: - type iq4_k_r4: 116 tensors
-llama_model_loader: - type iq5_k_r4: 58 tensors
-```
-
-So probably yeah, the issue I'm seeing here is because I used `q8_0_r8` for `token_embd.weight` which seems like a known invalid combination.
-
-Gonna let it finish up and curious how good the perplexity is relative to the full `Q8_0` hehe... its addictive...
-
----
-
-👤 **saood06** commented the **2025-03-31** at **01:46:10**:
-
-> Wow thanks, you are really good with keeping track of so much disperate information and links haha...
-
-You say this right after I say I don't have a reference (I jest).
-
->
-> And I am currently testing perplexity on my experiment above using `Q8_0` instead of `Q8_0_R8` quant and its looking just fine:
-
-Nice.
-
-> Gonna let it finish up and curious how good the perplexity is relative to the full `Q8_0` hehe... its addictive...
-
-I know, I want to test my pure IQ4_K_R4 (minus the token_embd.weight), I'm probably going to have that quant cook overnight and test it later. The 4th mix was fast in the preliminary performance screening I did before functionality testing it.
-
-I thought about how the ratio of tokens used by me in sweep-bench vs server and I had an idea that I could tweak sweep-bench to do actually useful work instead of just decoding and prefilling random tokens.
-
->UPDATE Wow!! 3.2596 +/- 0.01786 for this DeepSeek-V3-0324-IQ4_K_R4.gguf quant vs full Q8_0 at 3.2454 +/- 0.01773 in almost half the size!
-
-Ooh, nice. If you don't mind would you test pure IQ4_K_R4 with IQ4_K token_embd.weight and see how close that gets? I know `-ser` is designed to be used instead, but it would be interesting see it tested for IQ4_K/IQ4_K_R4.
-
->model size = 386.183 GiB (4.936 BPW)
-
-Just barely out of reach for my 384 GB RAM server, but I also think that using IQ6_K for some of the Q8_0 could get me there without affecting PPL much at all, but I did experiment with something similar with my third IQ4_K_R4 based mix of R1, which I barely used because I preferred the faster mixes.
-
----
-
-👤 **ubergarm** commented the **2025-03-31** at **02:02:07**:
-
-> You say this right after I say I don't have a reference (I jest).
-
-😂
-
-> If you don't mind would you test pure IQ4_K_R4 with IQ4_K token_embd.weight and see how close that gets?
-
-I think I can clean up some disk space now that I know which of my previous gguf's experiments are junk. Do I need to use `--pure` ? Otherwise I'll just update my existing `--custom-q` with your requested types.
-
-> Just barely out of reach for my 384 GB RAM server,
-
-Is this server CPU only? Otherwise all the q8_0's will fit in under 24GB VRAM with 32k context which might barely work for you.
-
-Interesting, yeah chopping the q8_0's could trim a little bit. It's pretty interesting how little of the weights are for attention relative to the MoEs. Psure GPT-3 was like 1/3rd attention weights. Deepseek seems like under 5% or something (didn't actually calculate it). I wonder if making say the last 10 routed experts slightly smaller would save more space while keeping attention maxxed out. Just spitballing, I really dunno what I'm doing haha...
-
----
-
-👤 **saood06** commented the **2025-03-31** at **02:15:36**:
-
-> > If you don't mind would you test pure IQ4_K_R4 with IQ4_K token_embd.weight and see how close that gets?
->
-> I think I can clean up some disk space now that I know which of my previous gguf's experiments are junk. Do I need to use `--pure` ? Otherwise I'll just update my existing `--custom-q` with your requested types.
-
-You can use whatever you find easier, I find `--custom-q` easier as well, what matters is the mix it produces.
-
-> > Just barely out of reach for my 384 GB RAM server,
->
-> Is this server CPU only? Otherwise all the q8_0's will fit in under 24GB VRAM with 32k context which might barely work for you.
-
-The server is CPU only, I have a 3090 but in another machine that could be used with RPC, but my RPC sync still hasn't progressed to test it here, and my initial testing on llama.cpp showed RPC didn't help with the tensor offload/MLA stuff.
-
-> Interesting, yeah chopping the q8_0's could trim a little bit. It's pretty interesting how little of the weights are for attention relative to the MoEs. Psure GPT-3 was like 1/3rd attention weights. Deepseek seems like under 5% or something (didn't actually calculate it). I wonder if making say the last 10 routed experts slightly smaller would save more space while keeping attention maxxed out. Just spitballing, I really dunno what I'm doing haha...
-
-I'm not sure what your trying to say. MoE's are different from dense models, but both have tensors that are more or less sensitive to being quantized.
-
----
-
-👤 **ubergarm** commented the **2025-03-31** at **03:21:46**:
-
-> You can use whatever you find easier, I find --custom-q easier as well, what matters is the mix it produces.
-
-Super, it is cooking now, however, I looks like one of the tensors is not happy with `iq4_k_r4` and falling back to `q5_0`. The log is a bit wonky, but it could just be that unused `attn_k_b.weight` so not an actual issue. I'll let it keep going and hopefully get your perplexity by tomorrow morning!
-
-
-
-quantize snippet for `iq4_k_r4`
-
-```bash
-[ 793/1147] blk.42.attn_k_b.weight - [ 128, 65536, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.42.attn_k_b.weight
-
-
-change_type_if_necessar : tensor cols 128 x 65536 are not divisible by 256, required for iq4_k_r4 - using fallback quantization q5_0
-
-====== llama_model_quantize_internal: did not find weights for blk.42.attn_k_b.weight
-converting to q5_0 .. size = 16.00 MiB -> 5.50 MiB
-[ 794/1147] blk.42.attn_v_b.weight - [ 512, 16384, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.42.attn_v_b.weight
-converting to iq4_k_r4 .. size = 16.00 MiB -> 4.50 MiB
-[ 795/1147] blk.42.attn_output.weight - [16384, 7168, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.42.attn_output.weight
-converting to iq4_k_r4 .. size = 224.00 MiB -> 63.00 MiB
-[ 796/1147] blk.42.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
-[ 797/1147] blk.42.attn_q_a.weight - [ 7168, 1536, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.42.attn_q_a.weight
-converting to iq4_k_r4 .. size = 21.00 MiB -> 5.91 MiB
-[ 798/1147] blk.42.attn_q_b.weight - [ 1536, 24576, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.42.attn_q_b.weight
-converting to iq4_k_r4 .. size = 72.00 MiB -> 20.25 MiB
-[ 799/1147] blk.42.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 800/1147] blk.42.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.42.ffn_down_exps.weight
-converting to iq4_k_r4 .. size = 7168.00 MiB -> 2016.00 MiB
-[ 801/1147] blk.42.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.42.ffn_gate_exps.weight
-converting to iq4_k_r4 .. size = 7168.00 MiB -> 2016.00 MiB
-[ 802/1147] blk.42.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.42.ffn_up_exps.weight
-converting to iq4_k_r4 .. size = 7168.00 MiB -> 2016.00 MiB
-[ 803/1147] blk.42.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
-[ 804/1147] blk.43.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
-[ 805/1147] blk.43.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
-[ 806/1147] blk.43.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.43.ffn_down_shexp.weight
-converting to iq4_k_r4 .. size = 28.00 MiB -> 7.88 MiB
-[ 807/1147] blk.43.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.43.ffn_gate_shexp.weight
-converting to iq4_k_r4 .. size = 28.00 MiB -> 7.88 MiB
-[ 808/1147] blk.43.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.43.ffn_up_shexp.weight
-converting to iq4_k_r4 .. size = 28.00 MiB -> 7.88 MiB
-[ 809/1147] blk.43.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
-[ 810/1147] blk.43.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.43.attn_kv_a_mqa.weight
-converting to iq4_k_r4 .. size = 7.88 MiB -> 2.21 MiB
-[ 811/1147] blk.43.attn_kv_b.weight - [ 512, 32768, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.43.attn_kv_b.weight
-converting to iq4_k_r4 .. size = 32.00 MiB -> 9.00 MiB
-[ 812/1147] blk.43.attn_k_b.weight - [ 128, 65536, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.43.attn_k_b.weight
-
-
-change_type_if_necessar : tensor cols 128 x 65536 are not divisible by 256, required for iq4_k_r4 - using fallback quantization q5_0
-
-====== llama_model_quantize_internal: did not find weights for blk.43.attn_k_b.weight
-converting to q5_0 .. size = 16.00 MiB -> 5.50 MiB
-[ 813/1147] blk.43.attn_v_b.weight - [ 512, 16384, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
-.43.attn_v_b.weight
-```
-
-
-
-> I have a 3090 but in another machine that could be used with RPC
-
-ooh right right, yeah so all CPU it is.
-
-> I'm not sure what your trying to say. MoE's are different from dense models, but both have tensors that are more or less sensitive to being quantized.
-
-Haha, I'm not sure either 💀 lol... I'm just wondering if trimming weight at say the last 10 layers of the *routed experts* (not MoEs) might drop overall size quicker than trimming it from the already fairly small embeddings/dense layers/attention/norms/bias/shared expert layers.
-
----
-
-👤 **saood06** commented the **2025-03-31** at **03:37:38**:
-
-> > You can use whatever you find easier, I find --custom-q easier as well, what matters is the mix it produces.
->
-> Super, it is cooking now, however, I looks like one of the tensors is not happy with `iq4_k_r4` and falling back to `q5_0`.
-
-That is fine and expected for that tensor.
-
->I'll let it keep going and hopefully get your perplexity by tomorrow morning!
-
-Thanks!
-
->ooh right right, yeah so all CPU it is.
-
-There are still models (and configurations) where RPC on ik_llama.cpp would benefit performance such as Miqu based quants. Deepseek is just not one of those.
-
----
-
-👤 **saood06** commented the **2025-03-31** at **03:37:38**:
-
-> > You can use whatever you find easier, I find --custom-q easier as well, what matters is the mix it produces.
->
-> Super, it is cooking now, however, I looks like one of the tensors is not happy with `iq4_k_r4` and falling back to `q5_0`.
-
-That is fine and expected for that tensor.
-
->I'll let it keep going and hopefully get your perplexity by tomorrow morning!
-
-Thanks.
-
->ooh right right, yeah so all CPU it is.
-
-There are still models (and configurations) where RPC on ik_llama.cpp would benefit performance such as Miqu based quants. Deepseek is just not one of those.
-
----
-
-👤 **ikawrakow** commented the **2025-03-31** at **05:50:51**:
-
-So, `token_embd.weight` cannot be quantized with row-interleaved quants (one needs to be able to get individual single rows out if this tensor to fill the input state, but the row-interleaved quants pack 4 or 8 rows together, so this does not work). I have checks in place, but it looks like I'm not catching all possible paths to arrive at an interleaved quants. So, I guess, until I find and fix the issue it is best to just explicitly specify the type of the `token_embd.weight` tensor with a custom rule.
-
-`attn_k_b.weight` can't be k-, i-, or iqk-quant because its row size is 128, so not a multiple of 256 as needed by i-, k-, idk-quants. Normally this should be caught and a corresponding legacy quant with a block size of 32 should be used instead.
-
-> UPDATE Wow!! 3.2596 +/- 0.01786 for this DeepSeek-V3-0324-IQ4_K_R4.gguf quant vs full Q8_0 at 3.2454 +/- 0.01773 in almost half the size!
-
-Amazing! You should publish this model.
-
-I second @saood06's request to explore how much quality degradation there will be from moving the attention tensors and the shared experts to `iq6_k` and `iq5_k`, as this will make CPU-only TG quite a bit faster. For hybrid setups (with attention and shared experts being run on the GPU), one should look into `q6_K/q5_K` instead.
-
----
-
-👤 **saood06** commented the **2025-03-31** at **06:55:11**:
-
->So, token_embd.weight cannot be quantized with row-interleaved quants (one needs to be able to get individual single rows out if this tensor to fill the input state, but the row-interleaved quants pack 4 or 8 rows together, so this does not work). I have checks in place, but it looks like I'm not catching all possible paths to arrive at an interleaved quants.
-
-Thanks for the explanation.
-
-> `attn_k_b.weight` can't be k-, i-, or iqk-quant because its row size is 128, so not a multiple of 256 as needed by i-, k-, idk-quants. Normally this should be caught and a corresponding legacy quant with a block size of 32 should be used instead.
-
-I've had situations where it doesn't and llama-quantize crashes.
-
-command: `./llama-quantize --pure --imatrix /mnt/sda/imatrix_V30324_mrader.dat --output-tensor-type q6_k_r4 /mnt/sda/DeepseekV3_0324/DeepseekV3_0324-256x21B-BF16.gguf /mnt/sda/DeepSeek-V3-0324-IQ4_K_R4_ATT3.gguf IQ4_K_R4 48`
-
-The assert being triggered:
-```
-====== llama_model_quantize_internal: did not find weights for blk.0.attn_k_b.weight
-converting to iq4_k_r4 .. /home/saood06/ik_main/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:5244: GGML_ASSERT(n_per_row%QK_K == 0) failed
-```
-
->
-> > UPDATE Wow!! 3.2596 +/- 0.01786 for this DeepSeek-V3-0324-IQ4_K_R4.gguf quant vs full Q8_0 at 3.2454 +/- 0.01773 in almost half the size!
->
-> Amazing!
-
-Yes. It is impressive how good the quants that can be made from this repo are.
-
-> I second [@saood06](https://github.com/saood06)'s request to explore how much quality degradation there will be from moving the attention tensors and the shared experts to `iq6_k` and `iq5_k`, as this will make CPU-only TG quite a bit faster.
-
-Yes, and maybe also try to use iq5_k_r4 for less MoE down projection tensors, maybe just the first 3. That should shave off a good bit of size and hopefully maintain almost all of the benefit of the MoE down projection tensors with just the first 3. Writing the `--custom-q` command it should be possible to just specify it for blk 3, blk 4, blk 5, as the first three blocks are dense and don't have any MoE down projection tensors, so they start at blk 3.
-
->For hybrid setups (with attention and shared experts being run on the GPU), one should look into `q6_K/q5_K` instead.
-
-I wonder how much extra context that would let you squeeze in. I've gone above 32k before and Deepseek docs say "Note that the CoT output can reach up to 32K tokens".
-
----
-
-👤 **ikawrakow** commented the **2025-03-31** at **08:41:57**:
-
-> I've had situations where it doesn't and llama-quantize crashes.
-
-This happened after PR #294? #294 should have fixed the `--pure` use case.
-
----
-
-👤 **saood06** commented the **2025-03-31** at **08:58:01**:
-
-> > I've had situations where it doesn't and llama-quantize crashes.
->
-> This happened after PR [#294](https://github.com/ikawrakow/ik_llama.cpp/pull/294)? [#294](https://github.com/ikawrakow/ik_llama.cpp/pull/294) should have fixed the `--pure` use case.
-
-This was before, that looks like it would fix it. Thanks.
-
----
-
-👤 **ubergarm** commented the **2025-03-31** at **14:22:13**:
-
-> > I'll let it keep going and hopefully get your perplexity by tomorrow morning!
-
-> Thanks!
-
-Just grabbed the log, here is how your "pure" `iq4_k_r4` stacks up on full perplexity run , size, and duration:
-| Model | Size (GiB) | PPL | Duration (minutes) |
-| --- | --- | --- | --- |
-| DeepSeek-V3-0324-IQ2_K_R4 | 227 | 3.5614 +/- 0.02001 | (different rig) |
-| DeepSeek-V3-0324-PURE-IQ4_K_R4 | 353 | 3.2942 +/- 0.01812 | 47.56 |
-| DeepSeek-V3-0324-IQ4_K_R4 | 387 | 3.2596 +/- 0.01786 | 55.01 |
-| DeepSeek-V3-0324-Q8_0 | 666 | 3.2454 +/- 0.01773 | 68.87 |
-
-
-
-In terms of speed to calculate perplexity, these three were similar setups more or less using a single socket of the Xeon 6980P
-
-
-
-#### "PURE" `IQ4_K_R4` perplexity log details
-```
-main: build = 3613 (4819257c)
-main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
-main: seed = 1337
-
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q5_0: 61 tensors
-llama_model_loader: - type iq4_k: 1 tensors
-llama_model_loader: - type iq4_k_r4: 724 tensors
-
-llm_load_print_meta: model size = 352.470 GiB (4.505 BPW)
-
-perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
-perplexity: 19.63 seconds per pass - ETA 45.88 minutes
-[1]2.4366,[2]3.1393,[3]2.3037,[4]1.9385,[5]1.7532,[6]1.6176,[7]1.5316,[8]1.4745,[9]1.4313,[10]1.3953,[11]1.3829,[12]1.4097,[13]1.4224,[14]1.5443,[15]1.6735,[16]1.7303,[17]1.8888,[18]2.0140,[19]1.9767,[20]1.9637,[21]2.0686,[22]2.0468,[23]2.0218,[24]2.0329,[25]2.0040,[26]1.9824,[27]2.0276,[28]2.0377,[29]2.0839,[30]2.1167,[31]2.1493,[32]2.1657,[33]2.2060,[34]2.2503,[35]2.2965,[36]2.3499,[37]2.3852,[38]2.4336,[39]2.4732,[40]2.5311,[41]2.5728,[42]2.5850,[43]2.6354,[44]2.6530,[45]2.7332,[46]2.7820,[47]2.7394,[48]2.6930,[49]2.6667,[50]2.6835,[51]2.7280,[52]2.7399,[53]2.7902,[54]2.8021,[55]2.8316,[56]2.8626,[57]2.8758,[58]2.9093,[59]2.9190,[60]2.9659,[61]3.0052,[62]3.0520,[63]3.0836,[64]3.1250,[65]3.1341,[66]3.1157,[67]3.0915,[68]3.1179,[69]3.1110,[70]3.1238,[71]3.1416,[72]3.1557,[73]3.1697,[74]3.1909,[75]3.1705,[76]3.1256,[77]3.0826,[78]3.0789,[79]3.0595,[80]3.0426,[81]3.0078,[82]3.0106,[83]2.9793,[84]2.9450,[85]2.9116,[86]2.8887,[87]2.8825,[88]2.8559,[89]2.8395,[90]2.8144,[91]2.7862,[92]2.7616,[93]2.7362,[94]2.7115,[95]2.6895,[96]2.6870,[97]2.6926,[98]2.6774,[99]2.6605,[100]2.6627,[101]2.6544,[102]2.6697,[103]2.6946,[104]2.7113,[105]2.7078,[106]2.7294,[107]2.7536,[108]2.7740,[109]2.8065,[110]2.8397,[111]2.8578,[112]2.8328,[113]2.8199,[114]2.7992,[115]2.7843,[116]2.7698,[117]2.7482,[118]2.7275,[119]2.7064,[120]2.6881,[121]2.6734,[122]2.6562,[123]2.6392,[124]2.6209,[125]2.6041,[126]2.5874,[127]2.5740,[128]2.5650,[129]2.5535,[130]2.5403,[131]2.5311,[132]2.5374,[133]2.5470,[134]2.5539,[135]2.5645,[136]2.5795,[137]2.5931,[138]2.6010,[139]2.6117,[140]2.6123,[141]2.6142,[142]2.6130,[143]2.6143,[144]2.6119,[145]2.6040,[146]2.6025,[147]2.6072,[148]2.6072,[149]2.6088,[150]2.6037,[151]2.6020,[152]2.5995,[153]2.5956,[154]2.5956,[155]2.5999,[156]2.6014,[157]2.6067,[158]2.6150,[159]2.6172,[160]2.6265,[161]2.6347,[162]2.6448,[163]2.6492,[164]2.6696,[165]2.6929,[166]2.7101,[167]2.7218,[168]2.7453,[169]2.7678,[170]2.7894,[171]2.8113,[172]2.7959,[173]2.7801,[174]2.7666,[175]2.7552,[176]2.7436,[177]2.7320,[178]2.7195,[179]2.7066,[180]2.7101,[181]2.7245,[182]2.7393,[183]2.7539,[184]2.7673,[185]2.7776,[186]2.7936,[187]2.8089,[188]2.8233,[189]2.8342,[190]2.8351,[191]2.8425,[192]2.8457,[193]2.8508,[194]2.8699,[195]2.8784,[196]2.8913,[197]2.9010,[198]2.9059,[199]2.9117,[200]2.9111,[201]2.9259,[202]2.9213,[203]2.9270,[204]2.9302,[205]2.9297,[206]2.9326,[207]2.9412,[208]2.9508,[209]2.9597,[210]2.9604,[211]2.9557,[212]2.9561,[213]2.9636,[214]2.9655,[215]2.9709,[216]2.9716,[217]2.9673,[218]2.9673,[219]2.9682,[220]2.9683,[221]2.9689,[222]2.9690,[223]2.9691,[224]2.9737,[225]2.9755,[226]2.9680,[227]2.9658,[228]2.9675,[229]2.9713,[230]2.9773,[231]2.9834,[232]2.9758,[233]2.9687,[234]2.9685,[235]2.9668,[236]2.9753,[237]2.9836,[238]2.9929,[239]3.0028,[240]3.0120,[241]3.0232,[242]3.0379,[243]3.0503,[244]3.0585,[245]3.0702,[246]3.0808,[247]3.0796,[248]3.0754,[249]3.0734,[250]3.0675,[251]3.0655,[252]3.0677,[253]3.0718,[254]3.0790,[255]3.0855,[256]3.0890,[257]3.0915,[258]3.0927,[259]3.0964,[260]3.0987,[261]3.1000,[262]3.0991,[263]3.1047,[264]3.1072,[265]3.1079,[266]3.1095,[267]3.1113,[268]3.1145,[269]3.1173,[270]3.1163,[271]3.1147,[272]3.1084,[273]3.1080,[274]3.1011,[275]3.0904,[276]3.0793,[277]3.0812,[278]3.0911,[279]3.0973,[280]3.1049,[281]3.1121,[282]3.1179,[283]3.1240,[284]3.1302,[285]3.1435,[286]3.1456,[287]3.1488,[288]3.1540,[289]3.1560,[290]3.1480,[291]3.1395,[292]3.1371,[293]3.1359,[294]3.1333,[295]3.1311,[296]3.1328,[297]3.1335,[298]3.1388,[299]3.1447,[300]3.1474,[301]3.1517,[302]3.1536,[303]3.1550,[304]3.1546,[305]3.1661,[306]3.1730,[307]3.1836,[308]3.1729,[309]3.1675,[310]3.1583,[311]3.1607,[312]3.1624,[313]3.1680,[314]3.1704,[315]3.1735,[316]3.1749,[317]3.1767,[318]3.1771,[319]3.1771,[320]3.1812,[321]3.1816,[322]3.1835,[323]3.1896,[324]3.1904,[325]3.1957,[326]3.1999,[327]3.2036,[328]3.2058,[329]3.2078,[330]3.2141,[331]3.2171,[332]3.2210,[333]3.2202,[334]3.2205,[335]3.2212,[336]3.2213,[337]3.2225,[338]3.2227,[339]3.2253,[340]3.2289,[341]3.2341,[342]3.2428,[343]3.2517,[344]3.2569,[345]3.2484,[346]3.2405,[347]3.2354,[348]3.2282,[349]3.2243,[350]3.2229,[351]3.2274,[352]3.2418,[353]3.2506,[354]3.2630,[355]3.2712,[356]3.2767,[357]3.2881,[358]3.2977,[359]3.3005,[360]3.3067,[361]3.3162,[362]3.3246,[363]3.3303,[364]3.3371,[365]3.3426,[366]3.3527,[367]3.3613,[368]3.3678,[369]3.3754,[370]3.3842,[371]3.3974,[372]3.4064,[373]3.4098,[374]3.4130,[375]3.4179,[376]3.4301,[377]3.4412,[378]3.4442,[379]3.4440,[380]3.4407,[381]3.4455,[382]3.4513,[383]3.4546,[384]3.4588,[385]3.4627,[386]3.4688,[387]3.4744,[388]3.4774,[389]3.4675,[390]3.4587,[391]3.4486,[392]3.4433,[393]3.4341,[394]3.4256,[395]3.4167,[396]3.4071,[397]3.3985,[398]3.3894,[399]3.3794,[400]3.3711,[401]3.3614,[402]3.3515,[403]3.3434,[404]3.3336,[405]3.3244,[406]3.3149,[407]3.3058,[408]3.2972,[409]3.2888,[410]3.2830,[411]3.2839,[412]3.2794,[413]3.2811,[414]3.2828,[415]3.2799,[416]3.2799,[417]3.2821,[418]3.2767,[419]3.2778,[420]3.2752,[421]3.2738,[422]3.2743,[423]3.2736,[424]3.2771,[425]3.2768,[426]3.2773,[427]3.2766,[428]3.2791,[429]3.2805,[430]3.2830,[431]3.2838,[432]3.2831,[433]3.2794,[434]3.2796,[435]3.2722,[436]3.2665,[437]3.2625,[438]3.2609,[439]3.2579,[440]3.2627,[441]3.2680,[442]3.2753,[443]3.2732,[444]3.2742,[445]3.2752,[446]3.2792,[447]3.2825,[448]3.2848,[449]3.2878,[450]3.2916,[451]3.2947,[452]3.2968,[453]3.2982,[454]3.2969,[455]3.2993,[456]3.2997,[457]3.3022,[458]3.3073,[459]3.3077,[460]3.3079,[461]3.3048,[462]3.3084,[463]3.3156,[464]3.3208,[465]3.3144,[466]3.3124,[467]3.3104,[468]3.3117,[469]3.3091,[470]3.3065,[471]3.3070,[472]3.3078,[473]3.3071,[474]3.3061,[475]3.3071,[476]3.3057,[477]3.3050,[478]3.3057,[479]3.3075,[480]3.3100,[481]3.3063,[482]3.3098,[483]3.3091,[484]3.3127,[485]3.3189,[486]3.3221,[487]3.3255,[488]3.3309,[489]3.3334,[490]3.3384,[491]3.3444,[492]3.3489,[493]3.3486,[494]3.3498,[495]3.3522,[496]3.3540,[497]3.3568,[498]3.3572,[499]3.3569,[500]3.3608,[501]3.3654,[502]3.3644,[503]3.3631,[504]3.3651,[505]3.3682,[506]3.3761,[507]3.3791,[508]3.3826,[509]3.3753,[510]3.3699,[511]3.3635,[512]3.3592,[513]3.3533,[514]3.3518,[515]3.3536,[516]3.3488,[517]3.3487,[518]3.3473,[519]3.3476,[520]3.3515,[521]3.3505,[522]3.3490,[523]3.3545,[524]3.3535,[525]3.3520,[526]3.3473,[527]3.3423,[528]3.3391,[529]3.3361,[530]3.3332,[531]3.3303,[532]3.3249,[533]3.3190,[534]3.3145,[535]3.3149,[536]3.3173,[537]3.3203,[538]3.3224,[539]3.3250,[540]3.3303,[541]3.3334,[542]3.3357,[543]3.3302,[544]3.3259,[545]3.3256,[546]3.3193,[547]3.3131,[548]3.3067,[549]3.3000,[550]3.2943,[551]3.2882,[552]3.2827,[553]3.2773,[554]3.2754,[555]3.2737,[556]3.2764,[557]3.2803,[558]3.2863,[559]3.2908,[560]3.2961,[561]3.2942,
-llama_print_timings: load time = 2197.28 ms
-llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: prompt eval time = 2802141.29 ms / 287232 tokens ( 9.76 ms per token, 102.50 tokens per second)
-llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: total time = 2853371.87 ms / 287233 tokens
-
-Final estimate: PPL = 3.2942 +/- 0.01812
-```
-
----
-
-👤 **ubergarm** commented the **2025-03-31** at **14:22:13**:
-
-> > I'll let it keep going and hopefully get your perplexity by tomorrow morning!
-
-> Thanks!
-
-Just grabbed the log, here is how your "pure" `iq4_k_r4` stacks up on full perplexity run and size:
-| Model | Size (GiB) | PPL |
-| --- | --- | --- |
-| DeepSeek-V3-0324-IQ2_K_R4 | 227 | 3.5614 +/- 0.02001 |
-| DeepSeek-V3-0324-PURE-IQ4_K_R4 | 353 | 3.2942 +/- 0.01812 |
-| DeepSeek-V3-0324-IQ4_K_R4 | 387 | 3.2596 +/- 0.01786 |
-| DeepSeek-V3-0324-Q8_0 | 666 | 3.2454 +/- 0.01773 |
-
-
-
-#### "PURE" `IQ4_K_R4` perplexity log details
-```
-main: build = 3613 (4819257c)
-main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
-main: seed = 1337
-
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q5_0: 61 tensors
-llama_model_loader: - type iq4_k: 1 tensors
-llama_model_loader: - type iq4_k_r4: 724 tensors
-
-llm_load_print_meta: model size = 352.470 GiB (4.505 BPW)
-
-perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
-perplexity: 19.63 seconds per pass - ETA 45.88 minutes
-[1]2.4366,[2]3.1393,[3]2.3037,[4]1.9385,[5]1.7532,[6]1.6176,[7]1.5316,[8]1.4745,[9]1.4313,[10]1.3953,[11]1.3829,[12]1.4097,[13]1.4224,[14]1.5443,[15]1.6735,[16]1.7303,[17]1.8888,[18]2.0140,[19]1.9767,[20]1.9637,[21]2.0686,[22]2.0468,[23]2.0218,[24]2.0329,[25]2.0040,[26]1.9824,[27]2.0276,[28]2.0377,[29]2.0839,[30]2.1167,[31]2.1493,[32]2.1657,[33]2.2060,[34]2.2503,[35]2.2965,[36]2.3499,[37]2.3852,[38]2.4336,[39]2.4732,[40]2.5311,[41]2.5728,[42]2.5850,[43]2.6354,[44]2.6530,[45]2.7332,[46]2.7820,[47]2.7394,[48]2.6930,[49]2.6667,[50]2.6835,[51]2.7280,[52]2.7399,[53]2.7902,[54]2.8021,[55]2.8316,[56]2.8626,[57]2.8758,[58]2.9093,[59]2.9190,[60]2.9659,[61]3.0052,[62]3.0520,[63]3.0836,[64]3.1250,[65]3.1341,[66]3.1157,[67]3.0915,[68]3.1179,[69]3.1110,[70]3.1238,[71]3.1416,[72]3.1557,[73]3.1697,[74]3.1909,[75]3.1705,[76]3.1256,[77]3.0826,[78]3.0789,[79]3.0595,[80]3.0426,[81]3.0078,[82]3.0106,[83]2.9793,[84]2.9450,[85]2.9116,[86]2.8887,[87]2.8825,[88]2.8559,[89]2.8395,[90]2.8144,[91]2.7862,[92]2.7616,[93]2.7362,[94]2.7115,[95]2.6895,[96]2.6870,[97]2.6926,[98]2.6774,[99]2.6605,[100]2.6627,[101]2.6544,[102]2.6697,[103]2.6946,[104]2.7113,[105]2.7078,[106]2.7294,[107]2.7536,[108]2.7740,[109]2.8065,[110]2.8397,[111]2.8578,[112]2.8328,[113]2.8199,[114]2.7992,[115]2.7843,[116]2.7698,[117]2.7482,[118]2.7275,[119]2.7064,[120]2.6881,[121]2.6734,[122]2.6562,[123]2.6392,[124]2.6209,[125]2.6041,[126]2.5874,[127]2.5740,[128]2.5650,[129]2.5535,[130]2.5403,[131]2.5311,[132]2.5374,[133]2.5470,[134]2.5539,[135]2.5645,[136]2.5795,[137]2.5931,[138]2.6010,[139]2.6117,[140]2.6123,[141]2.6142,[142]2.6130,[143]2.6143,[144]2.6119,[145]2.6040,[146]2.6025,[147]2.6072,[148]2.6072,[149]2.6088,[150]2.6037,[151]2.6020,[152]2.5995,[153]2.5956,[154]2.5956,[155]2.5999,[156]2.6014,[157]2.6067,[158]2.6150,[159]2.6172,[160]2.6265,[161]2.6347,[162]2.6448,[163]2.6492,[164]2.6696,[165]2.6929,[166]2.7101,[167]2.7218,[168]2.7453,[169]2.7678,[170]2.7894,[171]2.8113,[172]2.7959,[173]2.7801,[174]2.7666,[175]2.7552,[176]2.7436,[177]2.7320,[178]2.7195,[179]2.7066,[180]2.7101,[181]2.7245,[182]2.7393,[183]2.7539,[184]2.7673,[185]2.7776,[186]2.7936,[187]2.8089,[188]2.8233,[189]2.8342,[190]2.8351,[191]2.8425,[192]2.8457,[193]2.8508,[194]2.8699,[195]2.8784,[196]2.8913,[197]2.9010,[198]2.9059,[199]2.9117,[200]2.9111,[201]2.9259,[202]2.9213,[203]2.9270,[204]2.9302,[205]2.9297,[206]2.9326,[207]2.9412,[208]2.9508,[209]2.9597,[210]2.9604,[211]2.9557,[212]2.9561,[213]2.9636,[214]2.9655,[215]2.9709,[216]2.9716,[217]2.9673,[218]2.9673,[219]2.9682,[220]2.9683,[221]2.9689,[222]2.9690,[223]2.9691,[224]2.9737,[225]2.9755,[226]2.9680,[227]2.9658,[228]2.9675,[229]2.9713,[230]2.9773,[231]2.9834,[232]2.9758,[233]2.9687,[234]2.9685,[235]2.9668,[236]2.9753,[237]2.9836,[238]2.9929,[239]3.0028,[240]3.0120,[241]3.0232,[242]3.0379,[243]3.0503,[244]3.0585,[245]3.0702,[246]3.0808,[247]3.0796,[248]3.0754,[249]3.0734,[250]3.0675,[251]3.0655,[252]3.0677,[253]3.0718,[254]3.0790,[255]3.0855,[256]3.0890,[257]3.0915,[258]3.0927,[259]3.0964,[260]3.0987,[261]3.1000,[262]3.0991,[263]3.1047,[264]3.1072,[265]3.1079,[266]3.1095,[267]3.1113,[268]3.1145,[269]3.1173,[270]3.1163,[271]3.1147,[272]3.1084,[273]3.1080,[274]3.1011,[275]3.0904,[276]3.0793,[277]3.0812,[278]3.0911,[279]3.0973,[280]3.1049,[281]3.1121,[282]3.1179,[283]3.1240,[284]3.1302,[285]3.1435,[286]3.1456,[287]3.1488,[288]3.1540,[289]3.1560,[290]3.1480,[291]3.1395,[292]3.1371,[293]3.1359,[294]3.1333,[295]3.1311,[296]3.1328,[297]3.1335,[298]3.1388,[299]3.1447,[300]3.1474,[301]3.1517,[302]3.1536,[303]3.1550,[304]3.1546,[305]3.1661,[306]3.1730,[307]3.1836,[308]3.1729,[309]3.1675,[310]3.1583,[311]3.1607,[312]3.1624,[313]3.1680,[314]3.1704,[315]3.1735,[316]3.1749,[317]3.1767,[318]3.1771,[319]3.1771,[320]3.1812,[321]3.1816,[322]3.1835,[323]3.1896,[324]3.1904,[325]3.1957,[326]3.1999,[327]3.2036,[328]3.2058,[329]3.2078,[330]3.2141,[331]3.2171,[332]3.2210,[333]3.2202,[334]3.2205,[335]3.2212,[336]3.2213,[337]3.2225,[338]3.2227,[339]3.2253,[340]3.2289,[341]3.2341,[342]3.2428,[343]3.2517,[344]3.2569,[345]3.2484,[346]3.2405,[347]3.2354,[348]3.2282,[349]3.2243,[350]3.2229,[351]3.2274,[352]3.2418,[353]3.2506,[354]3.2630,[355]3.2712,[356]3.2767,[357]3.2881,[358]3.2977,[359]3.3005,[360]3.3067,[361]3.3162,[362]3.3246,[363]3.3303,[364]3.3371,[365]3.3426,[366]3.3527,[367]3.3613,[368]3.3678,[369]3.3754,[370]3.3842,[371]3.3974,[372]3.4064,[373]3.4098,[374]3.4130,[375]3.4179,[376]3.4301,[377]3.4412,[378]3.4442,[379]3.4440,[380]3.4407,[381]3.4455,[382]3.4513,[383]3.4546,[384]3.4588,[385]3.4627,[386]3.4688,[387]3.4744,[388]3.4774,[389]3.4675,[390]3.4587,[391]3.4486,[392]3.4433,[393]3.4341,[394]3.4256,[395]3.4167,[396]3.4071,[397]3.3985,[398]3.3894,[399]3.3794,[400]3.3711,[401]3.3614,[402]3.3515,[403]3.3434,[404]3.3336,[405]3.3244,[406]3.3149,[407]3.3058,[408]3.2972,[409]3.2888,[410]3.2830,[411]3.2839,[412]3.2794,[413]3.2811,[414]3.2828,[415]3.2799,[416]3.2799,[417]3.2821,[418]3.2767,[419]3.2778,[420]3.2752,[421]3.2738,[422]3.2743,[423]3.2736,[424]3.2771,[425]3.2768,[426]3.2773,[427]3.2766,[428]3.2791,[429]3.2805,[430]3.2830,[431]3.2838,[432]3.2831,[433]3.2794,[434]3.2796,[435]3.2722,[436]3.2665,[437]3.2625,[438]3.2609,[439]3.2579,[440]3.2627,[441]3.2680,[442]3.2753,[443]3.2732,[444]3.2742,[445]3.2752,[446]3.2792,[447]3.2825,[448]3.2848,[449]3.2878,[450]3.2916,[451]3.2947,[452]3.2968,[453]3.2982,[454]3.2969,[455]3.2993,[456]3.2997,[457]3.3022,[458]3.3073,[459]3.3077,[460]3.3079,[461]3.3048,[462]3.3084,[463]3.3156,[464]3.3208,[465]3.3144,[466]3.3124,[467]3.3104,[468]3.3117,[469]3.3091,[470]3.3065,[471]3.3070,[472]3.3078,[473]3.3071,[474]3.3061,[475]3.3071,[476]3.3057,[477]3.3050,[478]3.3057,[479]3.3075,[480]3.3100,[481]3.3063,[482]3.3098,[483]3.3091,[484]3.3127,[485]3.3189,[486]3.3221,[487]3.3255,[488]3.3309,[489]3.3334,[490]3.3384,[491]3.3444,[492]3.3489,[493]3.3486,[494]3.3498,[495]3.3522,[496]3.3540,[497]3.3568,[498]3.3572,[499]3.3569,[500]3.3608,[501]3.3654,[502]3.3644,[503]3.3631,[504]3.3651,[505]3.3682,[506]3.3761,[507]3.3791,[508]3.3826,[509]3.3753,[510]3.3699,[511]3.3635,[512]3.3592,[513]3.3533,[514]3.3518,[515]3.3536,[516]3.3488,[517]3.3487,[518]3.3473,[519]3.3476,[520]3.3515,[521]3.3505,[522]3.3490,[523]3.3545,[524]3.3535,[525]3.3520,[526]3.3473,[527]3.3423,[528]3.3391,[529]3.3361,[530]3.3332,[531]3.3303,[532]3.3249,[533]3.3190,[534]3.3145,[535]3.3149,[536]3.3173,[537]3.3203,[538]3.3224,[539]3.3250,[540]3.3303,[541]3.3334,[542]3.3357,[543]3.3302,[544]3.3259,[545]3.3256,[546]3.3193,[547]3.3131,[548]3.3067,[549]3.3000,[550]3.2943,[551]3.2882,[552]3.2827,[553]3.2773,[554]3.2754,[555]3.2737,[556]3.2764,[557]3.2803,[558]3.2863,[559]3.2908,[560]3.2961,[561]3.2942,
-llama_print_timings: load time = 2197.28 ms
-llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: prompt eval time = 2802141.29 ms / 287232 tokens ( 9.76 ms per token, 102.50 tokens per second)
-llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: total time = 2853371.87 ms / 287233 tokens
-
-Final estimate: PPL = 3.2942 +/- 0.01812
-```
-
----
-
-👤 **ikawrakow** commented the **2025-03-31** at **14:52:10**:
-
-`3.2942` is 1.5% higher than `Q8_0`, so not too bad. I think with `IQ5_K` for the attention tensors and shared experts it should be (almost) on par with the result obtained with `Q8_0` for these.
-
-I'm somewhat surprised that the PP speed of the pure `IQ4_K` is better than the `IQ4_K` mix by almost 15%. Is it so that you used `Q8_0`, and not `Q8_0_R8` for the mix, because there was the issue with the NaN/very high PPL due to row-interleaved quants being used for token embeddings?
-
----
-
-👤 **ubergarm** commented the **2025-03-31** at **15:56:26**:
-
-> 3.2942 is 1.5% higher than Q8_0, so not too bad. I think with IQ5_K for the attention tensors and shared experts it should be (almost) on par with the result obtained with Q8_0 for these.
-
-Nice, getting it dialed in. I don't think @saood06 tried that exact combo in his mixes yet.
-
-> I'm somewhat surprised that the PP speed of the pure IQ4_K is better than the IQ4_K mix by almost 15%. Is it so that you used Q8_0, and not Q8_0_R8 for the mix, because there was the issue with the NaN/very high PPL due to row-interleaved quants being used for token embeddings?
-
-Right, the "non pure" `IQ4_K_R4` here has `Q8_0`s for attention/embeds/dense/shared expert layers as well as `IQ5_K_R4` for routed experted down projections. I just didn't specify `-rtr` on the perplexity script is all. That nan issue has been fixed in the branch I was using.
-
-So the duration is not a fair comparison given the "pure" was using repacked quants while the "non pure" and full `q8_0` were *not* repacked.
-
-Maybe I'll follow up later with proper llama-bench comparisons after getting the mixes dialed in for perplexity.
-
-Can close this issue now then as the original question has been answered.
-
-Thanks!
-
----
-
-👤 **ubergarm** commented the **2025-03-31** at **19:52:27**:
-
-> Maybe I'll follow up later with proper llama-bench comparisons
-
-> I'm somewhat surprised that the PP speed of the pure IQ4_K is better than the IQ4_K mix by almost 15%
-
-@ikawrakow
-
-I did a quick llama-bench comparison between the `PURE-IQ4_K_R4` and the `q8_0`/mix `IQ4_K_R4` (using -rtr 1 for `q8_0_r8` this time) on the CPU only the Xeon 6980P with 88 threads and found the results interesting. The graph shows the "pure" version as baseline 100%.
-
-I believe this is basically the same as @saood06 's pure version rolled last night vs his earlier working mix mentioned above.
-
-
-
-
-
-Command details and raw data
-
-## Common Setup
-```bash
-echo Setting power profile to performance:
-powerprofilesctl set performance
-
-echo Set numa balancing to be off:
-echo 0 | sudo tee /proc/sys/kernel/numa_balancing
-
-echo Maximizing chances of loading model into THPs
-echo always | sudo tee -a /sys/kernel/mm/transparent_hugepage/enabled
-echo always | sudo tee -a /sys/kernel/mm/transparent_hugepage/defrag
-
-echo Dropping all caches... (to hopefully use more THPs)
-sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
-```
-
-## `IQ4_K_R4`
-```bash
-numactl -N 0 -m 0 \
-./build/bin/llama-bench \
- -rtr 1 \
- -thp 0 \
- --mmap 0 \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
- -ctk q8_0 \
- -mla 3 -fa 1 \
- -amb 1024 \
- -fmoe 1 \
- -p 512,8192,16384 -n 0 \
- -gp 512,64 \
- -gp 8192,64 \
- -gp 16384,64 \
- -r 2 \
- --numa numactl \
- --threads 88
-
-## note all q8_0 get repacked with `-rtr 1` to be `q8_r_8` including `attn_k_b.weight` presumably
-llama_model_loader: - type q8_0: 612 tensors
-llama_model_loader: - type iq4_k_r4: 116 tensors
-llama_model_loader: - type iq5_k_r4: 58 tensors
-
-## Confirm fully loaded into THPs
-$ grep Huge /proc/meminfo
-AnonHugePages: 41615360 kB
-ShmemHugePages: 0 kB
-FileHugePages: 0 kB
-HugePages_Total: 0
-HugePages_Free: 0
-HugePages_Rsvd: 0
-HugePages_Surp: 0
-Hugepagesize: 2048 kB
-Hugetlb: 0 kB
-
-$ du /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf
-404947028 /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf
-```
-
-| model | size | params | backend | threads | type_k | fa | mla | amb | mmap | rtr | fmoe | test | t/s |
-| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -: | --: | ----: | ---: | --: | ---: | ------------: | ---------------: |
-============ Repacked 611 tensors
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | pp512 | 122.55 ± 3.11 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | pp8192 | 74.34 ± 2.11 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | pp16384 | 52.68 ± 0.21 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | tg64@pp512 | 8.20 ± 0.00 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | tg64@pp8192 | 6.70 ± 0.00 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | tg64@pp16384 | 5.52 ± 0.00 |
-
-`build: 4819257c (3613)`
-
-## `PURE-IQ4_K_R4`
-```bash
-numactl -N 0 -m 0 \
-./build/bin/llama-bench \
- -thp 0 \
- --mmap 0 \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-PURE-IQ4_K_R4.gguf \
- -ctk q8_0 \
- -mla 3 -fa 1 \
- -amb 1024 \
- -fmoe 1 \
- -p 512,8192,16384 -n 0 \
- -gp 512,64 \
- -gp 8192,64 \
- -gp 16384,64 \
- -r 2 \
- --numa numactl \
- --threads 88
-
-## note the q5_0 attn_k_b.weight so not totally "pure" hah...
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q5_0: 61 tensors
-llama_model_loader: - type iq4_k: 1 tensors
-llama_model_loader: - type iq4_k_r4: 724 tensors
-
-## Confirm fully loaded into THPs
-$ grep Huge /proc/meminfo
-AnonHugePages: 372733952 kB
-ShmemHugePages: 0 kB
-FileHugePages: 0 kB
-HugePages_Total: 0
-HugePages_Free: 0
-HugePages_Rsvd: 0
-HugePages_Surp: 0
-Hugepagesize: 2048 kB
-Hugetlb: 0 kB
-
-$ du /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-PURE-IQ4_K_R4.gguf
-369596400 /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-PURE-IQ4_K_R4.gguf
-```
-
-| model | size | params | backend | threads | type_k | fa | mla | amb | mmap | fmoe | test | t/s |
-| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -: | --: | ----: | ---: | ---: | ------------: | ---------------: |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 112.83 ± 0.69 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 63.66 ± 0.00 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 47.50 ± 0.15 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 8.50 ± 0.00 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 7.13 ± 0.02 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 5.48 ± 0.02 |
-
-`build: 4819257c (3613)`
-
-
-
-> attn_k_b.weight can't be k-, i-, or iqk-quant because its row size is 128, so not a multiple of 256 as needed by i-, k-, idk-quants. Normally this should be caught and a corresponding legacy quant with a block size of 32 should be used instead.
-
-I'm still wondering a bit about that `attn_k_b.weight` error `128 x 65536 are not divisible by 256` which falls back to `q4_0` or `q5_0` etc. However it seems that `q8_0_r8` is okay?
-
-```
-[ 52/1147] blk.3.attn_k_b.weight - [ 128, 65536, 1, 1], type = bf16, Using custom type q8_0_r8 for tensor blk$
-3.attn_k_b.weight
-
-====== llama_model_quantize_internal: did not find weights for blk.3.attn_k_b.weight
-converting to q8_0_r8 .. size = 16.00 MiB -> 8.50 MiB
-```
-
-So wondering if I do a mostly `iq5_k_r4` attention/shared experts, should I let the `attn_k_b.weight` fall back to `q5_0` or set them up to `q8_0_r8` (assuming CPU inference).
-
-Anyway, learning a lot as usual, gonna close this one as solved. Cheers!
-
----
-
-👤 **ubergarm** commented the **2025-03-31** at **19:52:27**:
-
-> Maybe I'll follow up later with proper llama-bench comparisons
-
-> I'm somewhat surprised that the PP speed of the pure IQ4_K is better than the IQ4_K mix by almost 15%
-
-@ikawrakow
-
-I did a quick llama-bench comparison between the `PURE-IQ4_K_R4` and the `q8_0`/mix `IQ4_K_R4` (using -rtr 1 for `q8_0_r8` this time) on the CPU only the Xeon 6980P with 88 threads and found the results interesting. The graph shows the "pure" version as baseline 100%.
-
-I believe this is basically the same as @saood06 's pure version rolled last night vs his earlier working mix mentioned above.
-
-
-
-
-
-Command details and raw data
-
-## Common Setup
-```bash
-echo Setting power profile to performance:
-powerprofilesctl set performance
-
-echo Set numa balancing to be off:
-echo 0 | sudo tee /proc/sys/kernel/numa_balancing
-
-echo Maximizing chances of loading model into THPs
-echo always | sudo tee -a /sys/kernel/mm/transparent_hugepage/enabled
-echo always | sudo tee -a /sys/kernel/mm/transparent_hugepage/defrag
-
-echo Dropping all caches... (to hopefully use more THPs)
-sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
-```
-
-## `IQ4_K_R4`
-```bash
-numactl -N 0 -m 0 \
-./build/bin/llama-bench \
- -rtr 1 \
- -thp 0 \
- --mmap 0 \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
- -ctk q8_0 \
- -mla 3 -fa 1 \
- -amb 1024 \
- -fmoe 1 \
- -p 512,8192,16384 -n 0 \
- -gp 512,64 \
- -gp 8192,64 \
- -gp 16384,64 \
- -r 2 \
- --numa numactl \
- --threads 88
-
-## Confirm fully loaded into THPs
-$ grep Huge /proc/meminfo
-AnonHugePages: 41615360 kB
-ShmemHugePages: 0 kB
-FileHugePages: 0 kB
-HugePages_Total: 0
-HugePages_Free: 0
-HugePages_Rsvd: 0
-HugePages_Surp: 0
-Hugepagesize: 2048 kB
-Hugetlb: 0 kB
-
-$ du /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf
-404947028 /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf
-```
-
-| model | size | params | backend | threads | type_k | fa | mla | amb | mmap | rtr | fmoe | test | t/s |
-| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -: | --: | ----: | ---: | --: | ---: | ------------: | ---------------: |
-============ Repacked 611 tensors
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | pp512 | 122.55 ± 3.11 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | pp8192 | 74.34 ± 2.11 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | pp16384 | 52.68 ± 0.21 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | tg64@pp512 | 8.20 ± 0.00 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | tg64@pp8192 | 6.70 ± 0.00 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | tg64@pp16384 | 5.52 ± 0.00 |
-
-`build: 4819257c (3613)`
-
-## `PURE-IQ4_K_R4`
-```bash
-numactl -N 0 -m 0 \
-./build/bin/llama-bench \
- -thp 0 \
- --mmap 0 \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-PURE-IQ4_K_R4.gguf \
- -ctk q8_0 \
- -mla 3 -fa 1 \
- -amb 1024 \
- -fmoe 1 \
- -p 512,8192,16384 -n 0 \
- -gp 512,64 \
- -gp 8192,64 \
- -gp 16384,64 \
- -r 2 \
- --numa numactl \
- --threads 88
-
-## Confirm fully loaded into THPs
-$ grep Huge /proc/meminfo
-AnonHugePages: 372733952 kB
-ShmemHugePages: 0 kB
-FileHugePages: 0 kB
-HugePages_Total: 0
-HugePages_Free: 0
-HugePages_Rsvd: 0
-HugePages_Surp: 0
-Hugepagesize: 2048 kB
-Hugetlb: 0 kB
-
-$ du /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-PURE-IQ4_K_R4.gguf
-369596400 /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-PURE-IQ4_K_R4.gguf
-```
-
-| model | size | params | backend | threads | type_k | fa | mla | amb | mmap | fmoe | test | t/s |
-| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -: | --: | ----: | ---: | ---: | ------------: | ---------------: |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 112.83 ± 0.69 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 63.66 ± 0.00 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 47.50 ± 0.15 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 8.50 ± 0.00 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 7.13 ± 0.02 |
-| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 5.48 ± 0.02 |
-
-`build: 4819257c (3613)`
-
-
-
-> attn_k_b.weight can't be k-, i-, or iqk-quant because its row size is 128, so not a multiple of 256 as needed by i-, k-, idk-quants. Normally this should be caught and a corresponding legacy quant with a block size of 32 should be used instead.
-
-I'm still wondering a bit about that `attn_k_b.weight` error `128 x 65536 are not divisible by 256` which falls back to `q4_0` or `q5_0` etc. However it seems that `q8_0_r8` is okay?
-
-```
-[ 52/1147] blk.3.attn_k_b.weight - [ 128, 65536, 1, 1], type = bf16, Using custom type q8_0_r8 for tensor blk$
-3.attn_k_b.weight
-
-====== llama_model_quantize_internal: did not find weights for blk.3.attn_k_b.weight
-converting to q8_0_r8 .. size = 16.00 MiB -> 8.50 MiB
-```
-
-So wondering if I do a mostly `iq5_k_r4` attention/shared experts, should I let the `attn_k_b.weight` fall back to `q5_0` or set them up to `q8_0_r8` (assuming CPU inference).
-
-Anyway, learning a lot as usual, gonna close this one as solved. Cheers!
-
----
-
-👤 **saood06** commented the **2025-04-01** at **01:02:46**:
-
-> Just grabbed the log, here is how your "pure" `iq4_k_r4` stacks up on full perplexity run , size, and duration:
-> Model Size (GiB) PPL Duration (minutes)
-> DeepSeek-V3-0324-IQ2_K_R4 227 3.5614 +/- 0.02001 (different rig)
-> DeepSeek-V3-0324-PURE-IQ4_K_R4 353 3.2942 +/- 0.01812 47.56
-> DeepSeek-V3-0324-IQ4_K_R4 387 3.2596 +/- 0.01786 55.01
-> DeepSeek-V3-0324-Q8_0 666 3.2454 +/- 0.01773 68.87
->
-> 
->
-> In terms of speed to calculate perplexity, these three were similar setups more or less using a single socket of the Xeon 6980P
-
-Thanks, it looks like an acceptable loss in quality for me if it performs fast (wasn't able to make the quant overnight, the quant is cooking now)
-
-
-> `3.2942` is 1.5% higher than `Q8_0`, so not too bad.
-
-I agree.
-
->I think with `IQ5_K` for the attention tensors and shared experts it should be (almost) on par with the result obtained with `Q8_0` for these.
-
-It might be, but I probably won't test it as doing full ppl runs takes me way too long, and I think I'll be happy with my "pure" IQ4_K_R4 as that should still be faster, even if it is a bit lower quality.
-
-
-> I did a quick llama-bench comparison between the `PURE-IQ4_K_R4` and the `q8_0`/mix `IQ4_K_R4` (using -rtr 1 for `q8_0_r8` this time) on the CPU only the Xeon 6980P with 88 threads and found the results interesting. The graph shows the "pure" version as baseline 100%.
->
-> 
-
-I'm really surprised that the PURE gains a bit more TG lead at a depth of 8K, but then ends up behind at 16K. This is different from what I've seen when testing. It would be interesting to see the sweep bench and when they actually intersect and how the curves actually look because on my system I've tested up to that depth and the pure still wins out in TG (and it seems like it will always stay ahead with the lead gaining like you saw initially), so I'm curious as to why it ends up losing at higher depths for you.
-
-
-> > attn_k_b.weight can't be k-, i-, or iqk-quant because its row size is 128, so not a multiple of 256 as needed by i-, k-, idk-quants. Normally this should be caught and a corresponding legacy quant with a block size of 32 should be used instead.
->
-> I'm still wondering a bit about that `attn_k_b.weight` error `128 x 65536 are not divisible by 256` which falls back to `q4_0` or `q5_0` etc. However it seems that `q8_0_r8` is okay?
-
-Yes. `q8_0_r8` is not an i-, k-, or iqk-quants.
-
-
-> So wondering if I do a mostly `iq5_k_r4` attention/shared experts, should I let the `attn_k_b.weight` fall back to `q5_0` or set them up to `q8_0_r8` (assuming CPU inference).
-
-Both work and will have tradeoffs. I think `q5_0` is fine, but other people think that tensor is more sensitive and should be set higher when you can.
-
----
-
-👤 **ikawrakow** commented the **2025-04-01** at **08:20:43**:
-
->> I'm still wondering a bit about that attn_k_b.weight error 128 x 65536 are not divisible by 256 which falls back to q4_0 or q5_0 etc. However it seems that q8_0_r8 is okay?
->
-> Both work and will have tradeoffs. I think q5_0 is fine, but other people think that tensor is more sensitive and should be set higher when you can.
-
-Note that `Q5_0` quantization was improved in #295, so it should be fine now. But if in doubt, you can use `Q6_0`, which is basically on par with `Q6_K` after PR #295. For CPU-only you can use `q5_0_r4` or `q6_0_r4`.
-
-> It might be, but I probably won't test it as doing full ppl runs takes me way too long, and I think I'll be happy with my "pure" IQ4_K_R4 as that should still be faster, even if it is a bit lower quality.
-
-Fair enough.
-
-But if you get the urge to experiment and you are content with slight accuracy loss, you may consider `IQ4_KS`. Here is a performance comparison between pure `IQ4_K` and pure `IQ4_KS` for DeepSeek-Lite on my Ryzen-7950X CPU:
-
-| model | size | fa | mla | rtr | fmoe | test | t/s |
-| -------------------- | ---------: | -: | --: | --: | ---: | ------------: | ---------------: |
-| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | pp512 | 700.85 ± 2.43 |
-| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp512 | 34.41 ± 0.00 |
-| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp4096 | 31.93 ± 0.01 |
-| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp16384 | 25.78 ± 0.00 |
-| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | pp512 | 659.06 ± 2.14 |
-| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp512 | 32.04 ± 0.06 |
-| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp4096 | 29.66 ± 0.02 |
-| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp16384 | 23.74 ± 0.00 |
-
-For DeepSeek-Lite we have `PPL(bf16) = 6.767`, `PPL(pure IQ4_K) = 6.821` (so +0.80%), and `PPL(pure IQ4_KS) = 6.858` (so, +1.34%).
-
----
-
-👤 **ubergarm** commented the **2025-04-01** at **15:22:03**:
-
-> > UPDATE Wow!! 3.2596 +/- 0.01786 for this DeepSeek-V3-0324-IQ4_K_R4.gguf quant vs full Q8_0 at 3.2454 +/- 0.01773 in almost half the size!
->
-> Amazing! You should publish this model.
-
-Okay, I have two published `ik_llama.cpp` exclusive quants up on [huggingface ubergarm/DeepSeek-V3-0324-GGUF](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF) repo with hopefully enough of a quick start to get people curious enough to try this fork!
-
-> Note that Q5_0 quantization was improved in https://github.com/ikawrakow/ik_llama.cpp/pull/295, so it should be fine now. But if in doubt, you can use Q6_0, which is basically on par with Q6_K after PR https://github.com/ikawrakow/ik_llama.cpp/pull/295. For CPU-only you can use q5_0_r4 or q6_0_r4
-
-Ahh great, I didn't realize there was a `q5_0_r4`/`q6_0_r4` which is exactly what I was looking for to keep that tensor optimized. So if I re-made the "pure" benchmarked above it could be optimized using the `_r4` for possibly a bit more speed which may be related to:
-
-> I'm really surprised that the PURE gains a bit more TG lead at a depth of 8K, but then ends up behind at 16K. This is different from what I've seen when testing. It would be interesting to see the sweep bench and when they actually intersect and how the curves actually look...
-
-Yeah I was surprised about that too, I still need to dial in how many threads for tg vs pp too as it pp scales up and actually seems to improve with more threads. I'm out tomorrow but would like to finally get a good llama-sweep-bench going, I should have enough info to run it and get a curve. Thanks!
-
----
-
-👤 **saood06** commented the **2025-04-01** at **21:39:19**:
-
-> Fair enough.
->
-> But if you get the urge to experiment and you are content with slight accuracy loss, you may consider `IQ4_KS`. Here is a performance comparison between pure `IQ4_K` and pure `IQ4_KS` for DeepSeek-Lite on my Ryzen-7950X CPU:
-
-| model | size | fa | mla | rtr | fmoe | test | t/s |
-| -------------------- | ---------: | -: | --: | --: | ---: | ------------: | ---------------: |
-| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | pp512 | 700.85 ± 2.43 |
-| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp512 | 34.41 ± 0.00 |
-| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp4096 | 31.93 ± 0.01 |
-| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp16384 | 25.78 ± 0.00 |
-| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | pp512 | 659.06 ± 2.14 |
-| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp512 | 32.04 ± 0.06 |
-| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp4096 | 29.66 ± 0.02 |
-| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp16384 | 23.74 ± 0.00 |
->
-> For DeepSeek-Lite we have `PPL(bf16) = 6.767`, `PPL(pure IQ4_K) = 6.821` (so +0.80%), and `PPL(pure IQ4_KS) = 6.858` (so, +1.34%).
-
-This on the other hand does tempt me. I like my IQ4_K_R4 but trading off more quality for speed is still tempting.
-
-
-
-> Ahh great, I didn't realize there was a `q5_0_r4`/`q6_0_r4` which is exactly what I was looking for to keep that tensor optimized. So if I re-made the "pure" benchmarked above it could be optimized using the `_r4` for possibly a bit more speed
-
-I forgot about it as well, since I just let the fallback handle that tensor.
-
-> Yeah I was surprised about that too, I still need to dial in how many threads for tg vs pp too as it pp scales up and actually seems to improve with more threads. I'm out tomorrow but would like to finally get a good llama-sweep-bench going, I should have enough info to run it and get a curve. Thanks!
-
-If you do it would be interesting to see (also I haven't tested it, but setting -tb in sweep-bench should work and allow you to run different thread counts for TG and PP just like you can for the other examples like server and main).
-
-My "pure" IQ4_K_R4 finished and the preliminary sweep bench results were really good (didn't benchmark very far as I wanted to inference with it, and just wanted to confirm it was loaded in and fast). I'll post a sweep bench graph out to 16K comparing it to some of my old results later.
-
----
-
-👤 **saood06** commented the **2025-04-01** at **21:39:19**:
-
-> Fair enough.
->
-> But if you get the urge to experiment and you are content with slight accuracy loss, you may consider `IQ4_KS`. Here is a performance comparison between pure `IQ4_K` and pure `IQ4_KS` for DeepSeek-Lite on my Ryzen-7950X CPU:
-> model size fa mla rtr fmoe test t/s
-> deepseek2 16B IQ4_KS 8.15 GiB 1 3 1 1 pp512 700.85 ± 2.43
-> deepseek2 16B IQ4_KS 8.15 GiB 1 3 1 1 tg128@pp512 34.41 ± 0.00
-> deepseek2 16B IQ4_KS 8.15 GiB 1 3 1 1 tg128@pp4096 31.93 ± 0.01
-> deepseek2 16B IQ4_KS 8.15 GiB 1 3 1 1 tg128@pp16384 25.78 ± 0.00
-> deepseek2 16B IQ4_K 9.00 GiB 1 3 1 1 pp512 659.06 ± 2.14
-> deepseek2 16B IQ4_K 9.00 GiB 1 3 1 1 tg128@pp512 32.04 ± 0.06
-> deepseek2 16B IQ4_K 9.00 GiB 1 3 1 1 tg128@pp4096 29.66 ± 0.02
-> deepseek2 16B IQ4_K 9.00 GiB 1 3 1 1 tg128@pp16384 23.74 ± 0.00
->
-> For DeepSeek-Lite we have `PPL(bf16) = 6.767`, `PPL(pure IQ4_K) = 6.821` (so +0.80%), and `PPL(pure IQ4_KS) = 6.858` (so, +1.34%).
-
-This on the other hand does tempt me. I like my IQ4_K_R4 but trading off more quality for speed is still tempting.
-
-
-
-> Ahh great, I didn't realize there was a `q5_0_r4`/`q6_0_r4` which is exactly what I was looking for to keep that tensor optimized. So if I re-made the "pure" benchmarked above it could be optimized using the `_r4` for possibly a bit more speed
-
-I forgot about it as well, since I just let the fallback handle that tensor.
-
-> Yeah I was surprised about that too, I still need to dial in how many threads for tg vs pp too as it pp scales up and actually seems to improve with more threads. I'm out tomorrow but would like to finally get a good llama-sweep-bench going, I should have enough info to run it and get a curve. Thanks!
-
-If you do it would be interesting to see.
-
-My "pure" IQ4_K_R4 finished and the preliminary sweep bench results were really good (didn't benchmark very far as I wanted to inference with it, and just wanted to confirm it was loaded in and fast). I'll post a sweep bench graph out to 16K comparing it to some of my old results later.
-
----
-
-👤 **saood06** commented the **2025-04-03** at **03:10:35**:
-
-Here's the full graph comparing my currently used fast quants for R1 and V3. The mixes for both are similar. I'm going to go back and test #287 next with more configurations to see if I can find one that gives me more performance.
-
-
-
-
-
-Not included in the graph, but looking at other tests I ran #259 does seem to have an impact on performance on my system since I had a very similar quant mix with and without those tensors and they performed slightly differently.
-
----
-
-👤 **saood06** commented the **2025-04-03** at **03:10:35**:
-
-Here's the full graph comparing my fast quants for both R1 and V3. The mixes for both are similar. I'm going to go back and test #287 next with more configurations to see if I can find one that works for it.
-
-
-
-
-
-Not included in the graph, but looking at other tests I ran #259 does seem to have an impact on performance on my system since I had a very similar quant mix with and without those tensors and they performed slightly differently.
-
----
-
-👤 **saood06** commented the **2025-04-04** at **13:59:03**:
-
-Finally tested batch performance but this is at depth of 0, I'll test deeper depths later.
-
-
-
-12 is the highest, but 6 gets most of the way there.
-
----
-
-👤 **ubergarm** commented the **2025-04-04** at **15:43:41**:
-
-Currently cooking up a CPU only "speed mix" blend using some of the advice from above. Will keep you posted.
-
-Otherwise, ran a CPU only `llama-sweep-bench` on the `IQ5_K_R4/IQ4_K_R4` routed experts /`q8_0` all else blend. Accidently left the Intel Xeon 6980P in `balanced` mode instead of `performance`, but the trends should be similar.
-
-
-
-
-
-llama-sweep-bench DeepSeek-V3-0324-IQ4_K_R4 logs
-
-```bash
-numactl -N 0 -m 0 \
-./build/bin/llama-sweep-bench \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
- --alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
- --run-time-repack \
- --no-mmap \
- -ctk q8_0 \
- -mla 3 -fa \
- -amb 1024 \
- -fmoe \
- -c 32768 \
- -ub 512 \
- --threads 88 \
- --threads-batch 128 \
- --numa numactl
-
-Current power profile is: balanced
-Current THP enabled and defrag configs are:
-[always] madvise never
-[always] defer defer+madvise madvise never
-Set numa balancing to be:
-0
-
-llama_model_loader: loaded meta data with 50 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = deepseek2
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
-llama_model_loader: - kv 3: general.version str = V3-0324
-llama_model_loader: - kv 4: general.basename str = DeepSeek
-llama_model_loader: - kv 5: general.size_label str = 256x21B
-llama_model_loader: - kv 6: general.license str = mit
-llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
-llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
-llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
-llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
-llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
-llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
-llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
-llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
-llama_model_loader: - kv 16: general.file_type u32 = 340
-llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
-llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
-llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
-llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
-llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
-llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
-llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
-llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
-llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
-llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
-llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
-llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
-llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
-llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
-llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
-llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
-llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
-llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
-llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["...
-llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3...
-llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["...
-llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
-llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
-llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
-llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
-llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
-llama_model_loader: - kv 45: general.quantization_version u32 = 2
-llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
-llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
-llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
-llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 612 tensors
-llama_model_loader: - type iq4_k_r4: 116 tensors
-llama_model_loader: - type iq5_k_r4: 58 tensors
-llm_load_vocab: special tokens cache size = 818
-llm_load_vocab: token to piece cache size = 0.8223 MB
-llm_load_print_meta: format = GGUF V3 (latest)
-llm_load_print_meta: arch = deepseek2
-llm_load_print_meta: vocab type = BPE
-llm_load_print_meta: n_vocab = 129280
-llm_load_print_meta: n_merges = 127741
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 163840
-llm_load_print_meta: n_embd = 7168
-llm_load_print_meta: n_layer = 61
-llm_load_print_meta: n_head = 128
-llm_load_print_meta: n_head_kv = 128
-llm_load_print_meta: n_rot = 64
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_embd_head_k = 192
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 1
-llm_load_print_meta: n_embd_k_gqa = 24576
-llm_load_print_meta: n_embd_v_gqa = 16384
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-06
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 18432
-llm_load_print_meta: n_expert = 256
-llm_load_print_meta: n_expert_used = 8
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 0
-llm_load_print_meta: rope scaling = yarn
-llm_load_print_meta: freq_base_train = 10000.0
-llm_load_print_meta: freq_scale_train = 0.025
-llm_load_print_meta: n_ctx_orig_yarn = 4096
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = IQ4_K_R4 - 4.5 bpw
-llm_load_print_meta: model params = 672.050 B
-llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
-llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
-llm_load_print_meta: general.name = DeepSeek V3 0324
-llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
-llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: LF token = 131 'Ä'
-llm_load_print_meta: max token length = 256
-llm_load_print_meta: n_layer_dense_lead = 3
-llm_load_print_meta: n_lora_q = 1536
-llm_load_print_meta: n_lora_kv = 512
-llm_load_print_meta: n_ff_exp = 2048
-llm_load_print_meta: n_expert_shared = 1
-llm_load_print_meta: expert_weights_scale = 2.5
-llm_load_print_meta: expert_weights_norm = 1
-llm_load_print_meta: expert_gating_func = sigmoid
-llm_load_print_meta: rope_yarn_log_mul = 0.1000
-llm_load_tensors: ggml ctx size = 0.47 MiB
-llm_load_tensors: CPU buffer size = 395450.97 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 32768
-llama_new_context_with_model: n_batch = 2048
-llama_new_context_with_model: n_ubatch = 512
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 3
-llama_new_context_with_model: attn_max_b = 1024
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: CPU KV buffer size = 1166.63 MiB
-llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
-llama_new_context_with_model: CPU output buffer size = 0.49 MiB
-llama_new_context_with_model: CPU compute buffer size = 2662.01 MiB
-llama_new_context_with_model: graph nodes = 5500
-llama_new_context_with_model: graph splits = 1
+llama_model_loader: - type q8_0: 612 tensors
+llama_model_loader: - type iq4_k_r4: 116 tensors
+llama_model_loader: - type iq5_k_r4: 58 tensors
+```
-main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 88, n_threads_batch = 128
+So probably yeah, the issue I'm seeing here is because I used `q8_0_r8` for `token_embd.weight` which seems like a known invalid combination.
-| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
-|-------|--------|--------|----------|----------|----------|----------|
-| 512 | 128 | 0 | 4.412 | 116.05 | 13.303 | 9.62 |
-| 512 | 128 | 512 | 4.384 | 116.79 | 13.639 | 9.38 |
-| 512 | 128 | 1024 | 4.711 | 108.69 | 14.823 | 8.64 |
-| 512 | 128 | 1536 | 5.448 | 93.98 | 15.187 | 8.43 |
-| 512 | 128 | 2048 | 5.361 | 95.51 | 15.282 | 8.38 |
-| 512 | 128 | 2560 | 6.005 | 85.26 | 16.579 | 7.72 |
-| 512 | 128 | 3072 | 6.276 | 81.58 | 15.304 | 8.36 |
-| 512 | 128 | 3584 | 6.383 | 80.21 | 15.072 | 8.49 |
-| 512 | 128 | 4096 | 6.548 | 78.19 | 15.006 | 8.53 |
-| 512 | 128 | 4608 | 7.245 | 70.67 | 15.262 | 8.39 |
-| 512 | 128 | 5120 | 7.498 | 68.29 | 15.404 | 8.31 |
-| 512 | 128 | 5632 | 7.992 | 64.06 | 15.555 | 8.23 |
-| 512 | 128 | 6144 | 7.825 | 65.43 | 16.026 | 7.99 |
-| 512 | 128 | 6656 | 8.140 | 62.90 | 16.011 | 7.99 |
-| 512 | 128 | 7168 | 9.216 | 55.55 | 16.322 | 7.84 |
-| 512 | 128 | 7680 | 9.197 | 55.67 | 16.641 | 7.69 |
-| 512 | 128 | 8192 | 9.601 | 53.33 | 17.393 | 7.36 |
-| 512 | 128 | 8704 | 9.049 | 56.58 | 17.375 | 7.37 |
-| 512 | 128 | 9216 | 9.669 | 52.95 | 17.475 | 7.32 |
-| 512 | 128 | 9728 | 9.592 | 53.38 | 17.728 | 7.22 |
-| 512 | 128 | 10240 | 10.385 | 49.30 | 18.297 | 7.00 |
-| 512 | 128 | 10752 | 10.284 | 49.79 | 18.500 | 6.92 |
-| 512 | 128 | 11264 | 10.422 | 49.13 | 18.387 | 6.96 |
-| 512 | 128 | 11776 | 11.144 | 45.94 | 18.602 | 6.88 |
-| 512 | 128 | 12288 | 11.066 | 46.27 | 19.002 | 6.74 |
-| 512 | 128 | 12800 | 11.749 | 43.58 | 19.933 | 6.42 |
-| 512 | 128 | 13312 | 11.813 | 43.34 | 19.790 | 6.47 |
-| 512 | 128 | 13824 | 12.959 | 39.51 | 18.546 | 6.90 |
-| 512 | 128 | 14336 | 12.402 | 41.28 | 20.914 | 6.12 |
-| 512 | 128 | 14848 | 13.064 | 39.19 | 20.959 | 6.11 |
-| 512 | 128 | 15360 | 13.137 | 38.97 | 21.331 | 6.00 |
-| 512 | 128 | 15872 | 13.158 | 38.91 | 21.756 | 5.88 |
-| 512 | 128 | 16384 | 13.227 | 38.71 | 21.625 | 5.92 |
-| 512 | 128 | 16896 | 14.089 | 36.34 | 22.327 | 5.73 |
-| 512 | 128 | 17408 | 14.251 | 35.93 | 22.982 | 5.57 |
-| 512 | 128 | 17920 | 14.794 | 34.61 | 22.817 | 5.61 |
-| 512 | 128 | 18432 | 14.544 | 35.20 | 23.187 | 5.52 |
-| 512 | 128 | 18944 | 14.835 | 34.51 | 23.744 | 5.39 |
-| 512 | 128 | 19456 | 15.538 | 32.95 | 20.042 | 6.39 |
-| 512 | 128 | 19968 | 16.182 | 31.64 | 24.139 | 5.30 |
-| 512 | 128 | 20480 | 16.972 | 30.17 | 24.933 | 5.13 |
-| 512 | 128 | 20992 | 15.876 | 32.25 | 25.319 | 5.06 |
-| 512 | 128 | 21504 | 16.150 | 31.70 | 25.309 | 5.06 |
-| 512 | 128 | 22016 | 16.810 | 30.46 | 25.217 | 5.08 |
-| 512 | 128 | 22528 | 17.180 | 29.80 | 25.202 | 5.08 |
-| 512 | 128 | 23040 | 18.171 | 28.18 | 25.445 | 5.03 |
-| 512 | 128 | 23552 | 17.318 | 29.56 | 26.029 | 4.92 |
-| 512 | 128 | 24064 | 18.848 | 27.16 | 26.128 | 4.90 |
-| 512 | 128 | 24576 | 18.282 | 28.01 | 26.675 | 4.80 |
-| 512 | 128 | 25088 | 18.234 | 28.08 | 21.079 | 6.07 |
-| 512 | 128 | 25600 | 18.584 | 27.55 | 27.583 | 4.64 |
-| 512 | 128 | 26112 | 19.350 | 26.46 | 27.687 | 4.62 |
-| 512 | 128 | 26624 | 19.053 | 26.87 | 27.982 | 4.57 |
-| 512 | 128 | 27136 | 19.228 | 26.63 | 28.328 | 4.52 |
-| 512 | 128 | 27648 | 20.705 | 24.73 | 28.819 | 4.44 |
-| 512 | 128 | 28160 | 19.993 | 25.61 | 29.508 | 4.34 |
-| 512 | 128 | 28672 | 20.698 | 24.74 | 29.902 | 4.28 |
-| 512 | 128 | 29184 | 20.320 | 25.20 | 29.555 | 4.33 |
-| 512 | 128 | 29696 | 21.366 | 23.96 | 30.114 | 4.25 |
-| 512 | 128 | 30208 | 21.293 | 24.05 | 29.625 | 4.32 |
-| 512 | 128 | 30720 | 21.417 | 23.91 | 22.628 | 5.66 |
-| 512 | 128 | 31232 | 21.941 | 23.34 | 30.653 | 4.18 |
-| 512 | 128 | 31744 | 22.326 | 22.93 | 31.921 | 4.01 |
-| 512 | 128 | 32256 | 23.055 | 22.21 | 31.750 | 4.03 |
-============ Repacked 611 tensors
+Gonna let it finish up and curious how good the perplexity is relative to the full `Q8_0` hehe... its addictive...
+
+---
+
+*UPDATE* Wow!! `3.2596 +/- 0.01786` for this `DeepSeek-V3-0324-IQ4_K_R4.gguf` quant vs full `Q8_0` at `3.2454 +/- 0.01773` in almost half the size!
+
+```bash
+llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
+
+llama_print_timings: load time = 2327.19 ms
+llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: prompt eval time = 3249602.81 ms / 287232 tokens ( 11.31 ms per token, 88.39 tokens per second)
+llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: total time = 3300377.65 ms / 287233 tokens
+Final estimate: PPL = 3.2596 +/- 0.01786
```
-> Finally tested batch performance
+---
-Oh nice, is that with `llama-batched-bench` ?
+👤 **saood06** commented on **2025-03-31** at **01:46:10**
----
+> Wow thanks, you are really good with keeping track of so much disperate information and links haha...
-👤 **ikawrakow** commented the **2025-04-04** at **16:55:06**:
+You say this right after I say I don't have a reference (I jest).
-Nearly a 6X decrease in PP performance is quite a bit more than I'm expecting. In my testing it has been more in the 2.5X range when going to 32k tokens. I wonder if this is due to the balanced performance setting or the huge model (or both).
+>
+> And I am currently testing perplexity on my experiment above using `Q8_0` instead of `Q8_0_R8` quant and its looking just fine:
----
+Nice.
+
+> Gonna let it finish up and curious how good the perplexity is relative to the full `Q8_0` hehe... its addictive...
-👤 **ubergarm** commented the **2025-04-04** at **17:59:03**:
+I know, I want to test my pure IQ4_K_R4 (minus the token_embd.weight), I'm probably going to have that quant cook overnight and test it later. The 4th mix was fast in the preliminary performance screening I did before functionality testing it.
-> Nearly a 6X decrease in PP performance is quite a bit more than I'm expecting. In my testing it has been more in the 2.5X range when going to 32k tokens. I wonder if this is due to the balanced performance setting or the huge model (or both).
+I thought about how the ratio of tokens used by me in sweep-bench vs server and I had an idea that I could tweak sweep-bench to do actually useful work instead of just decoding and prefilling random tokens.
-Yeah, a lot of little variables can effect performance. One other data point I got was from [fairydreaming on r/LocalLLama](https://www.reddit.com/r/LocalLLaMA/comments/1joyl9t/comment/ml1lgob/) which drops off more slowly on their CPU+GPU rig to ~1.5X decrease in PP performance across 32k context.
+>UPDATE Wow!! 3.2596 +/- 0.01786 for this DeepSeek-V3-0324-IQ4_K_R4.gguf quant vs full Q8_0 at 3.2454 +/- 0.01773 in almost half the size!
+
+Ooh, nice. If you don't mind would you test pure IQ4_K_R4 with IQ4_K token_embd.weight and see how close that gets? I know `-ser` is designed to be used instead, but it would be interesting see it tested for IQ4_K/IQ4_K_R4.
+
+>model size = 386.183 GiB (4.936 BPW)
+
+Just barely out of reach for my 384 GB RAM server, but I also think that using IQ6_K for some of the Q8_0 could get me there without affecting PPL much at all, but I did experiment with something similar with my third IQ4_K_R4 based mix of R1, which I barely used because I preferred the faster mixes.
---
-👤 **ikawrakow** commented the **2025-04-04** at **18:02:59**:
+👤 **ubergarm** commented on **2025-03-31** at **02:02:07**
-The TG peaks are also quite interesting. If I could make the performance stay where the peaks are for any `N_KV`, it would be a ~40% improvement at 32k tokens! Here I wonder if it is related to the 88 threads (and the work not splitting very well between them), or somehow related to the `-amb` option.
+> You say this right after I say I don't have a reference (I jest).
-@ubergarm
+😂
-You always use `numactl`. I'm really curious to know what happens if you don't involve `numactl` at all. I.e.,
-```
-./build/bin/llama-sweep-bench \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
- --alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
- --run-time-repack \
- --no-mmap \
- -ctk q8_0 \
- -mla 3 -fa \
- -amb 1024 \
- -fmoe \
- -c 32768 \
- -ub 512 \
- --threads 88 \
- --threads-batch 128
-```
+> If you don't mind would you test pure IQ4_K_R4 with IQ4_K token_embd.weight and see how close that gets?
+
+I think I can clean up some disk space now that I know which of my previous gguf's experiments are junk. Do I need to use `--pure` ? Otherwise I'll just update my existing `--custom-q` with your requested types.
+
+> Just barely out of reach for my 384 GB RAM server,
+
+Is this server CPU only? Otherwise all the q8_0's will fit in under 24GB VRAM with 32k context which might barely work for you.
+
+Interesting, yeah chopping the q8_0's could trim a little bit. It's pretty interesting how little of the weights are for attention relative to the MoEs. Psure GPT-3 was like 1/3rd attention weights. Deepseek seems like under 5% or something (didn't actually calculate it). I wonder if making say the last 10 routed experts slightly smaller would save more space while keeping attention maxxed out. Just spitballing, I really dunno what I'm doing haha...
---
-👤 **ikawrakow** commented the **2025-04-04** at **18:05:38**:
+👤 **saood06** commented on **2025-03-31** at **02:15:36**
-The fairydreaming tests use a GPU for attention, the slower drop in performance is expected in that setup. But for pure CPU inference I'm expecting around 2.5X lower performance at 32k tokens.
+> > If you don't mind would you test pure IQ4_K_R4 with IQ4_K token_embd.weight and see how close that gets?
+>
+> I think I can clean up some disk space now that I know which of my previous gguf's experiments are junk. Do I need to use `--pure` ? Otherwise I'll just update my existing `--custom-q` with your requested types.
+
+You can use whatever you find easier, I find `--custom-q` easier as well, what matters is the mix it produces.
+
+> > Just barely out of reach for my 384 GB RAM server,
+>
+> Is this server CPU only? Otherwise all the q8_0's will fit in under 24GB VRAM with 32k context which might barely work for you.
+
+The server is CPU only, I have a 3090 but in another machine that could be used with RPC, but my RPC sync still hasn't progressed to test it here, and my initial testing on llama.cpp showed RPC didn't help with the tensor offload/MLA stuff.
+
+> Interesting, yeah chopping the q8_0's could trim a little bit. It's pretty interesting how little of the weights are for attention relative to the MoEs. Psure GPT-3 was like 1/3rd attention weights. Deepseek seems like under 5% or something (didn't actually calculate it). I wonder if making say the last 10 routed experts slightly smaller would save more space while keeping attention maxxed out. Just spitballing, I really dunno what I'm doing haha...
+
+I'm not sure what your trying to say. MoE's are different from dense models, but both have tensors that are more or less sensitive to being quantized.
---
-👤 **ubergarm** commented the **2025-04-04** at **21:02:06**:
+👤 **ubergarm** commented on **2025-03-31** at **03:21:46**
-> You always use numactl. I'm really curious to know what happens if you don't involve numactl at all. I.e.,
+> You can use whatever you find easier, I find --custom-q easier as well, what matters is the mix it produces.
-I had some time while waiting for my "speed blend" to rsync between servers and tried the command without any numactl stuff. Interestingly, it loaded mostly on node 1, then some of the weights went into node 0 just before loading finished. I included numastat to show that in the detailed log.
+Super, it is cooking now, however, I looks like one of the tensors is not happy with `iq4_k_r4` and falling back to `q5_0`. The log is a bit wonky, but it could just be that unused `attn_k_b.weight` so not an actual issue. I'll let it keep going and hopefully get your perplexity by tomorrow morning!
-
+
+
+quantize snippet for `iq4_k_r4`
+
+```bash
+[ 793/1147] blk.42.attn_k_b.weight - [ 128, 65536, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.42.attn_k_b.weight
+
+
+change_type_if_necessar : tensor cols 128 x 65536 are not divisible by 256, required for iq4_k_r4 - using fallback quantization q5_0
+
+====== llama_model_quantize_internal: did not find weights for blk.42.attn_k_b.weight
+converting to q5_0 .. size = 16.00 MiB -> 5.50 MiB
+[ 794/1147] blk.42.attn_v_b.weight - [ 512, 16384, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.42.attn_v_b.weight
+converting to iq4_k_r4 .. size = 16.00 MiB -> 4.50 MiB
+[ 795/1147] blk.42.attn_output.weight - [16384, 7168, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.42.attn_output.weight
+converting to iq4_k_r4 .. size = 224.00 MiB -> 63.00 MiB
+[ 796/1147] blk.42.attn_q_a_norm.weight - [ 1536, 1, 1, 1], type = f32, size = 0.006 MB
+[ 797/1147] blk.42.attn_q_a.weight - [ 7168, 1536, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.42.attn_q_a.weight
+converting to iq4_k_r4 .. size = 21.00 MiB -> 5.91 MiB
+[ 798/1147] blk.42.attn_q_b.weight - [ 1536, 24576, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.42.attn_q_b.weight
+converting to iq4_k_r4 .. size = 72.00 MiB -> 20.25 MiB
+[ 799/1147] blk.42.attn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
+[ 800/1147] blk.42.ffn_down_exps.weight - [ 2048, 7168, 256, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.42.ffn_down_exps.weight
+converting to iq4_k_r4 .. size = 7168.00 MiB -> 2016.00 MiB
+[ 801/1147] blk.42.ffn_gate_exps.weight - [ 7168, 2048, 256, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.42.ffn_gate_exps.weight
+converting to iq4_k_r4 .. size = 7168.00 MiB -> 2016.00 MiB
+[ 802/1147] blk.42.ffn_up_exps.weight - [ 7168, 2048, 256, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.42.ffn_up_exps.weight
+converting to iq4_k_r4 .. size = 7168.00 MiB -> 2016.00 MiB
+[ 803/1147] blk.42.ffn_norm.weight - [ 7168, 1, 1, 1], type = f32, size = 0.027 MB
+[ 804/1147] blk.43.exp_probs_b.bias - [ 256, 1, 1, 1], type = f32, size = 0.001 MB
+[ 805/1147] blk.43.ffn_gate_inp.weight - [ 7168, 256, 1, 1], type = f32, size = 7.000 MB
+[ 806/1147] blk.43.ffn_down_shexp.weight - [ 2048, 7168, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.43.ffn_down_shexp.weight
+converting to iq4_k_r4 .. size = 28.00 MiB -> 7.88 MiB
+[ 807/1147] blk.43.ffn_gate_shexp.weight - [ 7168, 2048, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.43.ffn_gate_shexp.weight
+converting to iq4_k_r4 .. size = 28.00 MiB -> 7.88 MiB
+[ 808/1147] blk.43.ffn_up_shexp.weight - [ 7168, 2048, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.43.ffn_up_shexp.weight
+converting to iq4_k_r4 .. size = 28.00 MiB -> 7.88 MiB
+[ 809/1147] blk.43.attn_kv_a_norm.weight - [ 512, 1, 1, 1], type = f32, size = 0.002 MB
+[ 810/1147] blk.43.attn_kv_a_mqa.weight - [ 7168, 576, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.43.attn_kv_a_mqa.weight
+converting to iq4_k_r4 .. size = 7.88 MiB -> 2.21 MiB
+[ 811/1147] blk.43.attn_kv_b.weight - [ 512, 32768, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.43.attn_kv_b.weight
+converting to iq4_k_r4 .. size = 32.00 MiB -> 9.00 MiB
+[ 812/1147] blk.43.attn_k_b.weight - [ 128, 65536, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.43.attn_k_b.weight
-
-llama-sweep-bench without `numactl` stuff
+change_type_if_necessar : tensor cols 128 x 65536 are not divisible by 256, required for iq4_k_r4 - using fallback quantization q5_0
-```bash
-# drop caches
-$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
+====== llama_model_quantize_internal: did not find weights for blk.43.attn_k_b.weight
+converting to q5_0 .. size = 16.00 MiB -> 5.50 MiB
+[ 813/1147] blk.43.attn_v_b.weight - [ 512, 16384, 1, 1], type = bf16, Using custom type iq4_k_r4 for tensor blk
+.43.attn_v_b.weight
+```
-# set to performance this time
-Current power profile is: performance
+
-# always encourages it to use anonhugepages
-# as testing suggets improves performance on this rig
-Current THP enabled and defrag configs are:
-[always]
-[always]
+> I have a 3090 but in another machine that could be used with RPC
-# numa_balancing off
-Set numa balancing to be: 0
+ooh right right, yeah so all CPU it is.
-$ ./build/bin/llama-sweep-bench \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
- --alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
- --run-time-repack \
- --no-mmap \
- -ctk q8_0 \
- -mla 3 -fa \
- -amb 1024 \
- -fmoe \
- -c 32768 \
- -ub 512 \
- --threads 88 \
- --threads-batch 128 2>&1 | tee -a output.log
+> I'm not sure what your trying to say. MoE's are different from dense models, but both have tensors that are more or less sensitive to being quantized.
-llama_model_loader: loaded meta data with 50 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = deepseek2
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
-llama_model_loader: - kv 3: general.version str = V3-0324
-llama_model_loader: - kv 4: general.basename str = DeepSeek
-llama_model_loader: - kv 5: general.size_label str = 256x21B
-llama_model_loader: - kv 6: general.license str = mit
-llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
-llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
-llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
-llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
-llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
-llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
-llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
-llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
-llama_model_loader: - kv 16: general.file_type u32 = 340
-llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
-llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
-llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
-llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
-llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
-llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
-llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
-llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
-llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
-llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
-llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
-llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
-llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
-llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
-llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
-llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
-llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
-llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
-llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["...
-llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3...
-llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["...
-llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
-llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
-llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
-llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
-llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
-llama_model_loader: - kv 45: general.quantization_version u32 = 2
-llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
-llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
-llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
-llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 612 tensors
-llama_model_loader: - type iq4_k_r4: 116 tensors
-llama_model_loader: - type iq5_k_r4: 58 tensors
-llm_load_vocab: special tokens cache size = 818
-llm_load_vocab: token to piece cache size = 0.8223 MB
-llm_load_print_meta: format = GGUF V3 (latest)
-llm_load_print_meta: arch = deepseek2
-llm_load_print_meta: vocab type = BPE
-llm_load_print_meta: n_vocab = 129280
-llm_load_print_meta: n_merges = 127741
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 163840
-llm_load_print_meta: n_embd = 7168
-llm_load_print_meta: n_layer = 61
-llm_load_print_meta: n_head = 128
-llm_load_print_meta: n_head_kv = 128
-llm_load_print_meta: n_rot = 64
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_embd_head_k = 192
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 1
-llm_load_print_meta: n_embd_k_gqa = 24576
-llm_load_print_meta: n_embd_v_gqa = 16384
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-06
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 18432
-llm_load_print_meta: n_expert = 256
-llm_load_print_meta: n_expert_used = 8
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 0
-llm_load_print_meta: rope scaling = yarn
-llm_load_print_meta: freq_base_train = 10000.0
-llm_load_print_meta: freq_scale_train = 0.025
-llm_load_print_meta: n_ctx_orig_yarn = 4096
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = IQ4_K_R4 - 4.5 bpw
-llm_load_print_meta: model params = 672.050 B
-llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
-llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
-llm_load_print_meta: general.name = DeepSeek V3 0324
-llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
-llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: LF token = 131 'Ä'
-llm_load_print_meta: max token length = 256
-llm_load_print_meta: n_layer_dense_lead = 3
-llm_load_print_meta: n_lora_q = 1536
-llm_load_print_meta: n_lora_kv = 512
-llm_load_print_meta: n_ff_exp = 2048
-llm_load_print_meta: n_expert_shared = 1
-llm_load_print_meta: expert_weights_scale = 2.5
-llm_load_print_meta: expert_weights_norm = 1
-llm_load_print_meta: expert_gating_func = sigmoid
-llm_load_print_meta: rope_yarn_log_mul = 0.1000
-llm_load_tensors: ggml ctx size = 0.47 MiB
-llm_load_tensors: CPU buffer size = 395450.97 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 32768
-llama_new_context_with_model: n_batch = 2048
-llama_new_context_with_model: n_ubatch = 512
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 3
-llama_new_context_with_model: attn_max_b = 1024
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: CPU KV buffer size = 1166.63 MiB
-llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
-llama_new_context_with_model: CPU output buffer size = 0.49 MiB
-llama_new_context_with_model: CPU compute buffer size = 2662.01 MiB
-llama_new_context_with_model: graph nodes = 5500
-llama_new_context_with_model: graph splits = 1
+Haha, I'm not sure either 💀 lol... I'm just wondering if trimming weight at say the last 10 layers of the *routed experts* (not MoEs) might drop overall size quicker than trimming it from the already fairly small embeddings/dense layers/attention/norms/bias/shared expert layers.
-main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 88, n_threads_batch = 128
+---
+
+👤 **saood06** commented on **2025-03-31** at **03:37:38**
+
+> > You can use whatever you find easier, I find --custom-q easier as well, what matters is the mix it produces.
+>
+> Super, it is cooking now, however, I looks like one of the tensors is not happy with `iq4_k_r4` and falling back to `q5_0`.
+
+That is fine and expected for that tensor.
+
+>I'll let it keep going and hopefully get your perplexity by tomorrow morning!
+
+Thanks!
+
+>ooh right right, yeah so all CPU it is.
+
+There are still models (and configurations) where RPC on ik_llama.cpp would benefit performance such as Miqu based quants. Deepseek is just not one of those.
+
+---
+
+👤 **ikawrakow** commented on **2025-03-31** at **05:50:51**
+
+So, `token_embd.weight` cannot be quantized with row-interleaved quants (one needs to be able to get individual single rows out if this tensor to fill the input state, but the row-interleaved quants pack 4 or 8 rows together, so this does not work). I have checks in place, but it looks like I'm not catching all possible paths to arrive at an interleaved quants. So, I guess, until I find and fix the issue it is best to just explicitly specify the type of the `token_embd.weight` tensor with a custom rule.
+
+`attn_k_b.weight` can't be k-, i-, or iqk-quant because its row size is 128, so not a multiple of 256 as needed by i-, k-, idk-quants. Normally this should be caught and a corresponding legacy quant with a block size of 32 should be used instead.
+
+> UPDATE Wow!! 3.2596 +/- 0.01786 for this DeepSeek-V3-0324-IQ4_K_R4.gguf quant vs full Q8_0 at 3.2454 +/- 0.01773 in almost half the size!
+
+Amazing! You should publish this model.
+
+I second @saood06's request to explore how much quality degradation there will be from moving the attention tensors and the shared experts to `iq6_k` and `iq5_k`, as this will make CPU-only TG quite a bit faster. For hybrid setups (with attention and shared experts being run on the GPU), one should look into `q6_K/q5_K` instead.
+
+---
+
+👤 **saood06** commented on **2025-03-31** at **06:55:11**
+
+>So, token_embd.weight cannot be quantized with row-interleaved quants (one needs to be able to get individual single rows out if this tensor to fill the input state, but the row-interleaved quants pack 4 or 8 rows together, so this does not work). I have checks in place, but it looks like I'm not catching all possible paths to arrive at an interleaved quants.
+
+Thanks for the explanation.
+
+> `attn_k_b.weight` can't be k-, i-, or iqk-quant because its row size is 128, so not a multiple of 256 as needed by i-, k-, idk-quants. Normally this should be caught and a corresponding legacy quant with a block size of 32 should be used instead.
+
+I've had situations where it doesn't and llama-quantize crashes.
+
+command: `./llama-quantize --pure --imatrix /mnt/sda/imatrix_V30324_mrader.dat --output-tensor-type q6_k_r4 /mnt/sda/DeepseekV3_0324/DeepseekV3_0324-256x21B-BF16.gguf /mnt/sda/DeepSeek-V3-0324-IQ4_K_R4_ATT3.gguf IQ4_K_R4 48`
+
+The assert being triggered:
+```
+====== llama_model_quantize_internal: did not find weights for blk.0.attn_k_b.weight
+converting to iq4_k_r4 .. /home/saood06/ik_main/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:5244: GGML_ASSERT(n_per_row%QK_K == 0) failed
```
-| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
-|-------|--------|--------|----------|----------|----------|----------|
-| 512 | 128 | 0 | 4.214 | 121.49 | 19.559 | 6.54 |
-| 512 | 128 | 512 | 4.304 | 118.97 | 19.317 | 6.63 |
-| 512 | 128 | 1024 | 4.539 | 112.79 | 19.692 | 6.50 |
-| 512 | 128 | 1536 | 4.859 | 105.37 | 20.024 | 6.39 |
-| 512 | 128 | 2048 | 5.429 | 94.31 | 21.110 | 6.06 |
-| 512 | 128 | 2560 | 5.698 | 89.86 | 21.308 | 6.01 |
-| 512 | 128 | 3072 | 5.948 | 86.08 | 21.940 | 5.83 |
-| 512 | 128 | 3584 | 6.368 | 80.40 | 21.664 | 5.91 |
-| 512 | 128 | 4096 | 6.665 | 76.82 | 21.375 | 5.99 |
-| 512 | 128 | 4608 | 7.055 | 72.57 | 21.764 | 5.88 |
-| 512 | 128 | 5120 | 7.397 | 69.22 | 21.929 | 5.84 |
-| 512 | 128 | 5632 | 7.846 | 65.25 | 21.051 | 6.08 |
-| 512 | 128 | 6144 | 8.496 | 60.27 | 23.048 | 5.55 |
-| 512 | 128 | 6656 | 8.884 | 57.63 | 21.473 | 5.96 |
-| 512 | 128 | 7168 | 9.241 | 55.41 | 22.841 | 5.60 |
-| 512 | 128 | 7680 | 9.832 | 52.08 | 21.809 | 5.87 |
-| 512 | 128 | 8192 | 9.957 | 51.42 | 22.837 | 5.60 |
-| 512 | 128 | 8704 | 10.521 | 48.67 | 23.967 | 5.34 |
-| 512 | 128 | 9216 | 10.787 | 47.46 | 23.475 | 5.45 |
-| 512 | 128 | 9728 | 11.187 | 45.77 | 23.407 | 5.47 |
-| 512 | 128 | 10240 | 11.988 | 42.71 | 25.122 | 5.10 |
-| 512 | 128 | 10752 | 12.502 | 40.95 | 24.736 | 5.17 |
-| 512 | 128 | 11264 | 12.874 | 39.77 | 24.705 | 5.18 |
-| 512 | 128 | 11776 | 12.893 | 39.71 | 24.578 | 5.21 |
-| 512 | 128 | 12288 | 13.309 | 38.47 | 25.649 | 4.99 |
-| 512 | 128 | 12800 | 13.647 | 37.52 | 24.652 | 5.19 |
-| 512 | 128 | 13312 | 14.318 | 35.76 | 25.035 | 5.11 |
-| 512 | 128 | 13824 | 14.879 | 34.41 | 24.243 | 5.28 |
-| 512 | 128 | 14336 | 15.221 | 33.64 | 25.826 | 4.96 |
-| 512 | 128 | 14848 | 15.292 | 33.48 | 26.096 | 4.91 |
-| 512 | 128 | 15360 | 15.592 | 32.84 | 25.744 | 4.97 |
-| 512 | 128 | 15872 | 15.757 | 32.49 | 26.224 | 4.88 |
-| 512 | 128 | 16384 | 14.834 | 34.51 | 26.616 | 4.81 |
-| 512 | 128 | 16896 | 15.757 | 32.49 | 27.967 | 4.58 |
-| 512 | 128 | 17408 | 16.378 | 31.26 | 27.682 | 4.62 |
-| 512 | 128 | 17920 | 16.754 | 30.56 | 27.855 | 4.60 |
-| 512 | 128 | 18432 | 17.300 | 29.59 | 27.905 | 4.59 |
-| 512 | 128 | 18944 | 17.347 | 29.52 | 28.338 | 4.52 |
-| 512 | 128 | 19456 | 17.895 | 28.61 | 24.992 | 5.12 |
-| 512 | 128 | 19968 | 18.210 | 28.12 | 28.662 | 4.47 |
-| 512 | 128 | 20480 | 18.579 | 27.56 | 28.880 | 4.43 |
-| 512 | 128 | 20992 | 18.920 | 27.06 | 29.153 | 4.39 |
-| 512 | 128 | 21504 | 19.537 | 26.21 | 29.282 | 4.37 |
-| 512 | 128 | 22016 | 19.716 | 25.97 | 29.682 | 4.31 |
-| 512 | 128 | 22528 | 20.576 | 24.88 | 30.040 | 4.26 |
-| 512 | 128 | 23040 | 20.705 | 24.73 | 30.366 | 4.22 |
-| 512 | 128 | 23552 | 21.201 | 24.15 | 30.501 | 4.20 |
-| 512 | 128 | 24064 | 21.809 | 23.48 | 30.800 | 4.16 |
-| 512 | 128 | 24576 | 22.042 | 23.23 | 30.988 | 4.13 |
-| 512 | 128 | 25088 | 22.660 | 22.59 | 26.174 | 4.89 |
-| 512 | 128 | 25600 | 23.038 | 22.22 | 31.451 | 4.07 |
-| 512 | 128 | 26112 | 23.601 | 21.69 | 31.606 | 4.05 |
-| 512 | 128 | 26624 | 23.744 | 21.56 | 31.454 | 4.07 |
-| 512 | 128 | 27136 | 24.403 | 20.98 | 32.176 | 3.98 |
-| 512 | 128 | 27648 | 24.954 | 20.52 | 31.961 | 4.00 |
-| 512 | 128 | 28160 | 25.142 | 20.36 | 32.050 | 3.99 |
-| 512 | 128 | 28672 | 25.774 | 19.87 | 32.425 | 3.95 |
-| 512 | 128 | 29184 | 25.847 | 19.81 | 33.104 | 3.87 |
-| 512 | 128 | 29696 | 26.218 | 19.53 | 32.757 | 3.91 |
-| 512 | 128 | 30208 | 26.704 | 19.17 | 33.055 | 3.87 |
-| 512 | 128 | 30720 | 27.111 | 18.89 | 27.009 | 4.74 |
-| 512 | 128 | 31232 | 26.987 | 18.97 | 33.298 | 3.84 |
-| 512 | 128 | 31744 | 26.712 | 19.17 | 33.334 | 3.84 |
-| 512 | 128 | 32256 | 28.083 | 18.23 | 33.414 | 3.83 |
+>
+> > UPDATE Wow!! 3.2596 +/- 0.01786 for this DeepSeek-V3-0324-IQ4_K_R4.gguf quant vs full Q8_0 at 3.2454 +/- 0.01773 in almost half the size!
+>
+> Amazing!
-`============ Repacked 611 tensors`
+Yes. It is impressive how good the quants that can be made from this repo are.
-```bash
-$ grep Huge /proc/meminfo
-AnonHugePages: 406736896 kB
-ShmemHugePages: 0 kB
-FileHugePages: 0 kB
-HugePages_Total: 0
-HugePages_Free: 0
-HugePages_Rsvd: 0
-HugePages_Surp: 0
-Hugepagesize: 2048 kB
-Hugetlb: 0 kB
+> I second [@saood06](https://github.com/saood06)'s request to explore how much quality degradation there will be from moving the attention tensors and the shared experts to `iq6_k` and `iq5_k`, as this will make CPU-only TG quite a bit faster.
-$ numastat -m -p $(pidof llama-sweep-bench)
-Per-node process memory usage (in MBs) for PID 659855 (llama-sweep-ben)
- Node 0 Node 1 Total
- --------------- --------------- ---------------
-Huge 0.00 0.00 0.00
-Heap 2.80 34.14 36.94
-Stack 0.04 0.05 0.08
-Private 13999.99 383083.54 397083.52
----------------- --------------- --------------- ---------------
-Total 14002.82 383117.72 397120.54
+Yes, and maybe also try to use iq5_k_r4 for less MoE down projection tensors, maybe just the first 3. That should shave off a good bit of size and hopefully maintain almost all of the benefit of the MoE down projection tensors with just the first 3. Writing the `--custom-q` command it should be possible to just specify it for blk 3, blk 4, blk 5, as the first three blocks are dense and don't have any MoE down projection tensors, so they start at blk 3.
-Per-node system memory usage (in MBs):
- Node 0 Node 1 Total
- --------------- --------------- ---------------
-MemTotal 771710.76 773987.20 1545697.96
-MemFree 743559.40 1745.54 745304.94
-MemUsed 28151.36 772241.67 800393.03
-SwapCached 0.21 0.69 0.90
-Active 14157.56 383159.96 397317.52
-Inactive 8662.71 383016.18 391678.89
-Active(anon) 14076.79 383139.31 397216.09
-Inactive(anon) 3.26 22.98 26.25
-Active(file) 80.78 20.65 101.43
-Inactive(file) 8659.45 382993.20 391652.64
-Unevictable 29.86 5.50 35.36
-Mlocked 21.07 5.50 26.57
-Dirty 20.00 0.05 20.05
-Writeback 0.00 0.00 0.00
-FilePages 8755.46 383025.92 391781.38
-Mapped 82.61 63.21 145.82
-AnonPages 14097.36 383158.36 397255.73
-Shmem 11.92 5.88 17.80
-KernelStack 39.69 38.11 77.80
-PageTables 6.78 775.85 782.62
-SecPageTables 0.00 0.00 0.00
-NFS_Unstable 0.00 0.00 0.00
-Bounce 0.00 0.00 0.00
-WritebackTmp 0.00 0.00 0.00
-Slab 2489.91 2737.77 5227.68
-SReclaimable 402.44 1022.84 1425.27
-SUnreclaim 2087.47 1714.93 3802.40
-AnonHugePages 14010.00 383100.00 397110.00
-ShmemHugePages 0.00 0.00 0.00
-ShmemPmdMapped 0.00 0.00 0.00
-FileHugePages 0.00 0.00 0.00
-FilePmdMapped 0.00 0.00 0.00
-HugePages_Total 0.00 0.00 0.00
-HugePages_Free 0.00 0.00 0.00
-HugePages_Surp 0.00 0.00 0.00
-KReclaimable 402.44 1022.84 1425.27
+>For hybrid setups (with attention and shared experts being run on the GPU), one should look into `q6_K/q5_K` instead.
+
+I wonder how much extra context that would let you squeeze in. I've gone above 32k before and Deepseek docs say "Note that the CoT output can reach up to 32K tokens".
+
+---
+
+👤 **ikawrakow** commented on **2025-03-31** at **08:41:57**
+
+> I've had situations where it doesn't and llama-quantize crashes.
+
+This happened after PR [#294](https://github.com/ikawrakow/ik_llama.cpp/issues/294)? [#294](https://github.com/ikawrakow/ik_llama.cpp/issues/294) should have fixed the `--pure` use case.
+
+---
+
+👤 **saood06** commented on **2025-03-31** at **08:58:01**
+
+> > I've had situations where it doesn't and llama-quantize crashes.
+>
+> This happened after PR [[#294](https://github.com/ikawrakow/ik_llama.cpp/issues/294)](https://github.com/ikawrakow/ik_llama.cpp/pull/294)? [[#294](https://github.com/ikawrakow/ik_llama.cpp/issues/294)](https://github.com/ikawrakow/ik_llama.cpp/pull/294) should have fixed the `--pure` use case.
+
+This was before, that looks like it would fix it. Thanks.
+
+---
+
+👤 **ubergarm** commented on **2025-03-31** at **14:22:13**
+
+> > I'll let it keep going and hopefully get your perplexity by tomorrow morning!
+
+> Thanks!
+
+Just grabbed the log, here is how your "pure" `iq4_k_r4` stacks up on full perplexity run , size, and duration:
+| Model | Size (GiB) | PPL | Duration (minutes) |
+| --- | --- | --- | --- |
+| DeepSeek-V3-0324-IQ2_K_R4 | 227 | 3.5614 +/- 0.02001 | (different rig) |
+| DeepSeek-V3-0324-PURE-IQ4_K_R4 | 353 | 3.2942 +/- 0.01812 | 47.56 |
+| DeepSeek-V3-0324-IQ4_K_R4 | 387 | 3.2596 +/- 0.01786 | 55.01 |
+| DeepSeek-V3-0324-Q8_0 | 666 | 3.2454 +/- 0.01773 | 68.87 |
+
+
+
+In terms of speed to calculate perplexity, these three were similar setups more or less using a single socket of the Xeon 6980P
+
+
+
+#### "PURE" `IQ4_K_R4` perplexity log details
+```
+main: build = 3613 (4819257c)
+main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+main: seed = 1337
+
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q5_0: 61 tensors
+llama_model_loader: - type iq4_k: 1 tensors
+llama_model_loader: - type iq4_k_r4: 724 tensors
+
+llm_load_print_meta: model size = 352.470 GiB (4.505 BPW)
+
+perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
+perplexity: 19.63 seconds per pass - ETA 45.88 minutes
+[1]2.4366,[2]3.1393,[3]2.3037,[4]1.9385,[5]1.7532,[6]1.6176,[7]1.5316,[8]1.4745,[9]1.4313,[10]1.3953,[11]1.3829,[12]1.4097,[13]1.4224,[14]1.5443,[15]1.6735,[16]1.7303,[17]1.8888,[18]2.0140,[19]1.9767,[20]1.9637,[21]2.0686,[22]2.0468,[23]2.0218,[24]2.0329,[25]2.0040,[26]1.9824,[27]2.0276,[28]2.0377,[29]2.0839,[30]2.1167,[31]2.1493,[32]2.1657,[33]2.2060,[34]2.2503,[35]2.2965,[36]2.3499,[37]2.3852,[38]2.4336,[39]2.4732,[40]2.5311,[41]2.5728,[42]2.5850,[43]2.6354,[44]2.6530,[45]2.7332,[46]2.7820,[47]2.7394,[48]2.6930,[49]2.6667,[50]2.6835,[51]2.7280,[52]2.7399,[53]2.7902,[54]2.8021,[55]2.8316,[56]2.8626,[57]2.8758,[58]2.9093,[59]2.9190,[60]2.9659,[61]3.0052,[62]3.0520,[63]3.0836,[64]3.1250,[65]3.1341,[66]3.1157,[67]3.0915,[68]3.1179,[69]3.1110,[70]3.1238,[71]3.1416,[72]3.1557,[73]3.1697,[74]3.1909,[75]3.1705,[76]3.1256,[77]3.0826,[78]3.0789,[79]3.0595,[80]3.0426,[81]3.0078,[82]3.0106,[83]2.9793,[84]2.9450,[85]2.9116,[86]2.8887,[87]2.8825,[88]2.8559,[89]2.8395,[90]2.8144,[91]2.7862,[92]2.7616,[93]2.7362,[94]2.7115,[95]2.6895,[96]2.6870,[97]2.6926,[98]2.6774,[99]2.6605,[100]2.6627,[101]2.6544,[102]2.6697,[103]2.6946,[104]2.7113,[105]2.7078,[106]2.7294,[107]2.7536,[108]2.7740,[109]2.8065,[110]2.8397,[111]2.8578,[112]2.8328,[113]2.8199,[114]2.7992,[115]2.7843,[116]2.7698,[117]2.7482,[118]2.7275,[119]2.7064,[120]2.6881,[121]2.6734,[122]2.6562,[123]2.6392,[124]2.6209,[125]2.6041,[126]2.5874,[127]2.5740,[128]2.5650,[129]2.5535,[130]2.5403,[131]2.5311,[132]2.5374,[133]2.5470,[134]2.5539,[135]2.5645,[136]2.5795,[137]2.5931,[138]2.6010,[139]2.6117,[140]2.6123,[141]2.6142,[142]2.6130,[143]2.6143,[144]2.6119,[145]2.6040,[146]2.6025,[147]2.6072,[148]2.6072,[149]2.6088,[150]2.6037,[151]2.6020,[152]2.5995,[153]2.5956,[154]2.5956,[155]2.5999,[156]2.6014,[157]2.6067,[158]2.6150,[159]2.6172,[160]2.6265,[161]2.6347,[162]2.6448,[163]2.6492,[164]2.6696,[165]2.6929,[166]2.7101,[167]2.7218,[168]2.7453,[169]2.7678,[170]2.7894,[171]2.8113,[172]2.7959,[173]2.7801,[174]2.7666,[175]2.7552,[176]2.7436,[177]2.7320,[178]2.7195,[179]2.7066,[180]2.7101,[181]2.7245,[182]2.7393,[183]2.7539,[184]2.7673,[185]2.7776,[186]2.7936,[187]2.8089,[188]2.8233,[189]2.8342,[190]2.8351,[191]2.8425,[192]2.8457,[193]2.8508,[194]2.8699,[195]2.8784,[196]2.8913,[197]2.9010,[198]2.9059,[199]2.9117,[200]2.9111,[201]2.9259,[202]2.9213,[203]2.9270,[204]2.9302,[205]2.9297,[206]2.9326,[207]2.9412,[208]2.9508,[209]2.9597,[210]2.9604,[211]2.9557,[212]2.9561,[213]2.9636,[214]2.9655,[215]2.9709,[216]2.9716,[217]2.9673,[218]2.9673,[219]2.9682,[220]2.9683,[221]2.9689,[222]2.9690,[223]2.9691,[224]2.9737,[225]2.9755,[226]2.9680,[227]2.9658,[228]2.9675,[229]2.9713,[230]2.9773,[231]2.9834,[232]2.9758,[233]2.9687,[234]2.9685,[235]2.9668,[236]2.9753,[237]2.9836,[238]2.9929,[239]3.0028,[240]3.0120,[241]3.0232,[242]3.0379,[243]3.0503,[244]3.0585,[245]3.0702,[246]3.0808,[247]3.0796,[248]3.0754,[249]3.0734,[250]3.0675,[251]3.0655,[252]3.0677,[253]3.0718,[254]3.0790,[255]3.0855,[256]3.0890,[257]3.0915,[258]3.0927,[259]3.0964,[260]3.0987,[261]3.1000,[262]3.0991,[263]3.1047,[264]3.1072,[265]3.1079,[266]3.1095,[267]3.1113,[268]3.1145,[269]3.1173,[270]3.1163,[271]3.1147,[272]3.1084,[273]3.1080,[274]3.1011,[275]3.0904,[276]3.0793,[277]3.0812,[278]3.0911,[279]3.0973,[280]3.1049,[281]3.1121,[282]3.1179,[283]3.1240,[284]3.1302,[285]3.1435,[286]3.1456,[287]3.1488,[288]3.1540,[289]3.1560,[290]3.1480,[291]3.1395,[292]3.1371,[293]3.1359,[294]3.1333,[295]3.1311,[296]3.1328,[297]3.1335,[298]3.1388,[299]3.1447,[300]3.1474,[301]3.1517,[302]3.1536,[303]3.1550,[304]3.1546,[305]3.1661,[306]3.1730,[307]3.1836,[308]3.1729,[309]3.1675,[310]3.1583,[311]3.1607,[312]3.1624,[313]3.1680,[314]3.1704,[315]3.1735,[316]3.1749,[317]3.1767,[318]3.1771,[319]3.1771,[320]3.1812,[321]3.1816,[322]3.1835,[323]3.1896,[324]3.1904,[325]3.1957,[326]3.1999,[327]3.2036,[328]3.2058,[329]3.2078,[330]3.2141,[331]3.2171,[332]3.2210,[333]3.2202,[334]3.2205,[335]3.2212,[336]3.2213,[337]3.2225,[338]3.2227,[339]3.2253,[340]3.2289,[341]3.2341,[342]3.2428,[343]3.2517,[344]3.2569,[345]3.2484,[346]3.2405,[347]3.2354,[348]3.2282,[349]3.2243,[350]3.2229,[351]3.2274,[352]3.2418,[353]3.2506,[354]3.2630,[355]3.2712,[356]3.2767,[357]3.2881,[358]3.2977,[359]3.3005,[360]3.3067,[361]3.3162,[362]3.3246,[363]3.3303,[364]3.3371,[365]3.3426,[366]3.3527,[367]3.3613,[368]3.3678,[369]3.3754,[370]3.3842,[371]3.3974,[372]3.4064,[373]3.4098,[374]3.4130,[375]3.4179,[376]3.4301,[377]3.4412,[378]3.4442,[379]3.4440,[380]3.4407,[381]3.4455,[382]3.4513,[383]3.4546,[384]3.4588,[385]3.4627,[386]3.4688,[387]3.4744,[388]3.4774,[389]3.4675,[390]3.4587,[391]3.4486,[392]3.4433,[393]3.4341,[394]3.4256,[395]3.4167,[396]3.4071,[397]3.3985,[398]3.3894,[399]3.3794,[400]3.3711,[401]3.3614,[402]3.3515,[403]3.3434,[404]3.3336,[405]3.3244,[406]3.3149,[407]3.3058,[408]3.2972,[409]3.2888,[410]3.2830,[411]3.2839,[412]3.2794,[413]3.2811,[414]3.2828,[415]3.2799,[416]3.2799,[417]3.2821,[418]3.2767,[419]3.2778,[420]3.2752,[421]3.2738,[422]3.2743,[423]3.2736,[424]3.2771,[425]3.2768,[426]3.2773,[427]3.2766,[428]3.2791,[429]3.2805,[430]3.2830,[431]3.2838,[432]3.2831,[433]3.2794,[434]3.2796,[435]3.2722,[436]3.2665,[437]3.2625,[438]3.2609,[439]3.2579,[440]3.2627,[441]3.2680,[442]3.2753,[443]3.2732,[444]3.2742,[445]3.2752,[446]3.2792,[447]3.2825,[448]3.2848,[449]3.2878,[450]3.2916,[451]3.2947,[452]3.2968,[453]3.2982,[454]3.2969,[455]3.2993,[456]3.2997,[457]3.3022,[458]3.3073,[459]3.3077,[460]3.3079,[461]3.3048,[462]3.3084,[463]3.3156,[464]3.3208,[465]3.3144,[466]3.3124,[467]3.3104,[468]3.3117,[469]3.3091,[470]3.3065,[471]3.3070,[472]3.3078,[473]3.3071,[474]3.3061,[475]3.3071,[476]3.3057,[477]3.3050,[478]3.3057,[479]3.3075,[480]3.3100,[481]3.3063,[482]3.3098,[483]3.3091,[484]3.3127,[485]3.3189,[486]3.3221,[487]3.3255,[488]3.3309,[489]3.3334,[490]3.3384,[491]3.3444,[492]3.3489,[493]3.3486,[494]3.3498,[495]3.3522,[496]3.3540,[497]3.3568,[498]3.3572,[499]3.3569,[500]3.3608,[501]3.3654,[502]3.3644,[503]3.3631,[504]3.3651,[505]3.3682,[506]3.3761,[507]3.3791,[508]3.3826,[509]3.3753,[510]3.3699,[511]3.3635,[512]3.3592,[513]3.3533,[514]3.3518,[515]3.3536,[516]3.3488,[517]3.3487,[518]3.3473,[519]3.3476,[520]3.3515,[521]3.3505,[522]3.3490,[523]3.3545,[524]3.3535,[525]3.3520,[526]3.3473,[527]3.3423,[528]3.3391,[529]3.3361,[530]3.3332,[531]3.3303,[532]3.3249,[533]3.3190,[534]3.3145,[535]3.3149,[536]3.3173,[537]3.3203,[538]3.3224,[539]3.3250,[540]3.3303,[541]3.3334,[542]3.3357,[543]3.3302,[544]3.3259,[545]3.3256,[546]3.3193,[547]3.3131,[548]3.3067,[549]3.3000,[550]3.2943,[551]3.2882,[552]3.2827,[553]3.2773,[554]3.2754,[555]3.2737,[556]3.2764,[557]3.2803,[558]3.2863,[559]3.2908,[560]3.2961,[561]3.2942,
+llama_print_timings: load time = 2197.28 ms
+llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: prompt eval time = 2802141.29 ms / 287232 tokens ( 9.76 ms per token, 102.50 tokens per second)
+llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: total time = 2853371.87 ms / 287233 tokens
+
+Final estimate: PPL = 3.2942 +/- 0.01812
```
-
+---
+
+👤 **ikawrakow** commented on **2025-03-31** at **14:52:10**
+
+`3.2942` is 1.5% higher than `Q8_0`, so not too bad. I think with `IQ5_K` for the attention tensors and shared experts it should be (almost) on par with the result obtained with `Q8_0` for these.
+
+I'm somewhat surprised that the PP speed of the pure `IQ4_K` is better than the `IQ4_K` mix by almost 15%. Is it so that you used `Q8_0`, and not `Q8_0_R8` for the mix, because there was the issue with the NaN/very high PPL due to row-interleaved quants being used for token embeddings?
+
+---
+
+👤 **ubergarm** commented on **2025-03-31** at **15:56:26**
+
+> 3.2942 is 1.5% higher than Q8_0, so not too bad. I think with IQ5_K for the attention tensors and shared experts it should be (almost) on par with the result obtained with Q8_0 for these.
+
+Nice, getting it dialed in. I don't think @saood06 tried that exact combo in his mixes yet.
+
+> I'm somewhat surprised that the PP speed of the pure IQ4_K is better than the IQ4_K mix by almost 15%. Is it so that you used Q8_0, and not Q8_0_R8 for the mix, because there was the issue with the NaN/very high PPL due to row-interleaved quants being used for token embeddings?
+
+Right, the "non pure" `IQ4_K_R4` here has `Q8_0`s for attention/embeds/dense/shared expert layers as well as `IQ5_K_R4` for routed experted down projections. I just didn't specify `-rtr` on the perplexity script is all. That nan issue has been fixed in the branch I was using.
+
+So the duration is not a fair comparison given the "pure" was using repacked quants while the "non pure" and full `q8_0` were *not* repacked.
+
+Maybe I'll follow up later with proper llama-bench comparisons after getting the mixes dialed in for perplexity.
+
+Can close this issue now then as the original question has been answered.
+
+Thanks!
---
-👤 **ubergarm** commented the **2025-04-04** at **21:02:06**:
+👤 **ubergarm** commented on **2025-03-31** at **19:52:27**
-> You always use numactl. I'm really curious to know what happens if you don't involve numactl at all. I.e.,
+> Maybe I'll follow up later with proper llama-bench comparisons
-I had some time while waiting for my "speed blend" to rsync between servers and tried the command without any numactl stuff. Interestingly, it loaded mostly on node 1, then some of the weights went into node 0 just before loading finished. I included numastat to show that in the detailed log.
+> I'm somewhat surprised that the PP speed of the pure IQ4_K is better than the IQ4_K mix by almost 15%
-
+@ikawrakow
-
+I did a quick llama-bench comparison between the `PURE-IQ4_K_R4` and the `q8_0`/mix `IQ4_K_R4` (using -rtr 1 for `q8_0_r8` this time) on the CPU only the Xeon 6980P with 88 threads and found the results interesting. The graph shows the "pure" version as baseline 100%.
-llama-sweep-bench without `numactl` stuff
+I believe this is basically the same as @saood06 's pure version rolled last night vs his earlier working mix mentioned above.
-```bash
-# drop caches
-$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
+
-# set to performance this time
-Current power profile is: performance
+
-# always encourages it to use anonhugepages
-# as testing suggets improves performance on this rig
-Current THP enabled and defrag configs are:
-[always]
-[always]
+Command details and raw data
-# numa_balancing off
-Set numa balancing to be: 0
+## Common Setup
+```bash
+echo Setting power profile to performance:
+powerprofilesctl set performance
-$ ./build/bin/llama-sweep-bench \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
- --alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
- --run-time-repack \
- --no-mmap \
- -ctk q8_0 \
- -mla 3 -fa \
- -amb 1024 \
- -fmoe \
- -c 32768 \
- -ub 512 \
- --threads 88 \
- --threads-batch 128 2>&1 | tee -a output.log
+echo Set numa balancing to be off:
+echo 0 | sudo tee /proc/sys/kernel/numa_balancing
-llama_model_loader: loaded meta data with 50 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = deepseek2
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
-llama_model_loader: - kv 3: general.version str = V3-0324
-llama_model_loader: - kv 4: general.basename str = DeepSeek
-llama_model_loader: - kv 5: general.size_label str = 256x21B
-llama_model_loader: - kv 6: general.license str = mit
-llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
-llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
-llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
-llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
-llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
-llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
-llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
-llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
-llama_model_loader: - kv 16: general.file_type u32 = 340
-llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
-llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
-llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
-llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
-llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
-llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
-llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
-llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
-llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
-llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
-llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
-llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
-llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
-llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
-llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
-llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
-llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
-llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
-llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["...
-llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3...
-llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["...
-llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
-llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
-llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
-llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
-llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
-llama_model_loader: - kv 45: general.quantization_version u32 = 2
-llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
-llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
-llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
-llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 612 tensors
-llama_model_loader: - type iq4_k_r4: 116 tensors
-llama_model_loader: - type iq5_k_r4: 58 tensors
-llm_load_vocab: special tokens cache size = 818
-llm_load_vocab: token to piece cache size = 0.8223 MB
-llm_load_print_meta: format = GGUF V3 (latest)
-llm_load_print_meta: arch = deepseek2
-llm_load_print_meta: vocab type = BPE
-llm_load_print_meta: n_vocab = 129280
-llm_load_print_meta: n_merges = 127741
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 163840
-llm_load_print_meta: n_embd = 7168
-llm_load_print_meta: n_layer = 61
-llm_load_print_meta: n_head = 128
-llm_load_print_meta: n_head_kv = 128
-llm_load_print_meta: n_rot = 64
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_embd_head_k = 192
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 1
-llm_load_print_meta: n_embd_k_gqa = 24576
-llm_load_print_meta: n_embd_v_gqa = 16384
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-06
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 18432
-llm_load_print_meta: n_expert = 256
-llm_load_print_meta: n_expert_used = 8
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 0
-llm_load_print_meta: rope scaling = yarn
-llm_load_print_meta: freq_base_train = 10000.0
-llm_load_print_meta: freq_scale_train = 0.025
-llm_load_print_meta: n_ctx_orig_yarn = 4096
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = IQ4_K_R4 - 4.5 bpw
-llm_load_print_meta: model params = 672.050 B
-llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
-llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
-llm_load_print_meta: general.name = DeepSeek V3 0324
-llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
-llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: LF token = 131 'Ä'
-llm_load_print_meta: max token length = 256
-llm_load_print_meta: n_layer_dense_lead = 3
-llm_load_print_meta: n_lora_q = 1536
-llm_load_print_meta: n_lora_kv = 512
-llm_load_print_meta: n_ff_exp = 2048
-llm_load_print_meta: n_expert_shared = 1
-llm_load_print_meta: expert_weights_scale = 2.5
-llm_load_print_meta: expert_weights_norm = 1
-llm_load_print_meta: expert_gating_func = sigmoid
-llm_load_print_meta: rope_yarn_log_mul = 0.1000
-llm_load_tensors: ggml ctx size = 0.47 MiB
-llm_load_tensors: CPU buffer size = 395450.97 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 32768
-llama_new_context_with_model: n_batch = 2048
-llama_new_context_with_model: n_ubatch = 512
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 3
-llama_new_context_with_model: attn_max_b = 1024
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
-llama_kv_cache_init: CPU KV buffer size = 1166.63 MiB
-llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
-llama_new_context_with_model: CPU output buffer size = 0.49 MiB
-llama_new_context_with_model: CPU compute buffer size = 2662.01 MiB
-llama_new_context_with_model: graph nodes = 5500
-llama_new_context_with_model: graph splits = 1
+echo Maximizing chances of loading model into THPs
+echo always | sudo tee -a /sys/kernel/mm/transparent_hugepage/enabled
+echo always | sudo tee -a /sys/kernel/mm/transparent_hugepage/defrag
-main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 88, n_threads_batch = 128
+echo Dropping all caches... (to hopefully use more THPs)
+sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
```
-| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
-|-------|--------|--------|----------|----------|----------|----------|
-| 512 | 128 | 0 | 4.214 | 121.49 | 19.559 | 6.54 |
-| 512 | 128 | 512 | 4.304 | 118.97 | 19.317 | 6.63 |
-| 512 | 128 | 1024 | 4.539 | 112.79 | 19.692 | 6.50 |
-| 512 | 128 | 1536 | 4.859 | 105.37 | 20.024 | 6.39 |
-| 512 | 128 | 2048 | 5.429 | 94.31 | 21.110 | 6.06 |
-| 512 | 128 | 2560 | 5.698 | 89.86 | 21.308 | 6.01 |
-| 512 | 128 | 3072 | 5.948 | 86.08 | 21.940 | 5.83 |
-| 512 | 128 | 3584 | 6.368 | 80.40 | 21.664 | 5.91 |
-| 512 | 128 | 4096 | 6.665 | 76.82 | 21.375 | 5.99 |
-| 512 | 128 | 4608 | 7.055 | 72.57 | 21.764 | 5.88 |
-| 512 | 128 | 5120 | 7.397 | 69.22 | 21.929 | 5.84 |
-| 512 | 128 | 5632 | 7.846 | 65.25 | 21.051 | 6.08 |
-| 512 | 128 | 6144 | 8.496 | 60.27 | 23.048 | 5.55 |
-| 512 | 128 | 6656 | 8.884 | 57.63 | 21.473 | 5.96 |
-| 512 | 128 | 7168 | 9.241 | 55.41 | 22.841 | 5.60 |
-| 512 | 128 | 7680 | 9.832 | 52.08 | 21.809 | 5.87 |
-| 512 | 128 | 8192 | 9.957 | 51.42 | 22.837 | 5.60 |
-| 512 | 128 | 8704 | 10.521 | 48.67 | 23.967 | 5.34 |
-| 512 | 128 | 9216 | 10.787 | 47.46 | 23.475 | 5.45 |
-| 512 | 128 | 9728 | 11.187 | 45.77 | 23.407 | 5.47 |
-| 512 | 128 | 10240 | 11.988 | 42.71 | 25.122 | 5.10 |
-| 512 | 128 | 10752 | 12.502 | 40.95 | 24.736 | 5.17 |
-| 512 | 128 | 11264 | 12.874 | 39.77 | 24.705 | 5.18 |
-| 512 | 128 | 11776 | 12.893 | 39.71 | 24.578 | 5.21 |
-| 512 | 128 | 12288 | 13.309 | 38.47 | 25.649 | 4.99 |
-| 512 | 128 | 12800 | 13.647 | 37.52 | 24.652 | 5.19 |
-| 512 | 128 | 13312 | 14.318 | 35.76 | 25.035 | 5.11 |
-| 512 | 128 | 13824 | 14.879 | 34.41 | 24.243 | 5.28 |
-| 512 | 128 | 14336 | 15.221 | 33.64 | 25.826 | 4.96 |
-| 512 | 128 | 14848 | 15.292 | 33.48 | 26.096 | 4.91 |
-| 512 | 128 | 15360 | 15.592 | 32.84 | 25.744 | 4.97 |
-| 512 | 128 | 15872 | 15.757 | 32.49 | 26.224 | 4.88 |
-| 512 | 128 | 16384 | 14.834 | 34.51 | 26.616 | 4.81 |
-| 512 | 128 | 16896 | 15.757 | 32.49 | 27.967 | 4.58 |
-| 512 | 128 | 17408 | 16.378 | 31.26 | 27.682 | 4.62 |
-| 512 | 128 | 17920 | 16.754 | 30.56 | 27.855 | 4.60 |
-| 512 | 128 | 18432 | 17.300 | 29.59 | 27.905 | 4.59 |
-| 512 | 128 | 18944 | 17.347 | 29.52 | 28.338 | 4.52 |
-| 512 | 128 | 19456 | 17.895 | 28.61 | 24.992 | 5.12 |
-| 512 | 128 | 19968 | 18.210 | 28.12 | 28.662 | 4.47 |
-| 512 | 128 | 20480 | 18.579 | 27.56 | 28.880 | 4.43 |
-| 512 | 128 | 20992 | 18.920 | 27.06 | 29.153 | 4.39 |
-| 512 | 128 | 21504 | 19.537 | 26.21 | 29.282 | 4.37 |
-| 512 | 128 | 22016 | 19.716 | 25.97 | 29.682 | 4.31 |
-| 512 | 128 | 22528 | 20.576 | 24.88 | 30.040 | 4.26 |
-| 512 | 128 | 23040 | 20.705 | 24.73 | 30.366 | 4.22 |
-| 512 | 128 | 23552 | 21.201 | 24.15 | 30.501 | 4.20 |
-| 512 | 128 | 24064 | 21.809 | 23.48 | 30.800 | 4.16 |
-| 512 | 128 | 24576 | 22.042 | 23.23 | 30.988 | 4.13 |
-| 512 | 128 | 25088 | 22.660 | 22.59 | 26.174 | 4.89 |
-| 512 | 128 | 25600 | 23.038 | 22.22 | 31.451 | 4.07 |
-| 512 | 128 | 26112 | 23.601 | 21.69 | 31.606 | 4.05 |
-| 512 | 128 | 26624 | 23.744 | 21.56 | 31.454 | 4.07 |
-| 512 | 128 | 27136 | 24.403 | 20.98 | 32.176 | 3.98 |
-| 512 | 128 | 27648 | 24.954 | 20.52 | 31.961 | 4.00 |
-| 512 | 128 | 28160 | 25.142 | 20.36 | 32.050 | 3.99 |
-| 512 | 128 | 28672 | 25.774 | 19.87 | 32.425 | 3.95 |
-| 512 | 128 | 29184 | 25.847 | 19.81 | 33.104 | 3.87 |
-| 512 | 128 | 29696 | 26.218 | 19.53 | 32.757 | 3.91 |
-| 512 | 128 | 30208 | 26.704 | 19.17 | 33.055 | 3.87 |
-| 512 | 128 | 30720 | 27.111 | 18.89 | 27.009 | 4.74 |
-| 512 | 128 | 31232 | 26.987 | 18.97 | 33.298 | 3.84 |
-| 512 | 128 | 31744 | 26.712 | 19.17 | 33.334 | 3.84 |
-| 512 | 128 | 32256 | 28.083 | 18.23 | 33.414 | 3.83 |
+## `IQ4_K_R4`
+```bash
+numactl -N 0 -m 0 \
+./build/bin/llama-bench \
+ -rtr 1 \
+ -thp 0 \
+ --mmap 0 \
+ --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
+ -ctk q8_0 \
+ -mla 3 -fa 1 \
+ -amb 1024 \
+ -fmoe 1 \
+ -p 512,8192,16384 -n 0 \
+ -gp 512,64 \
+ -gp 8192,64 \
+ -gp 16384,64 \
+ -r 2 \
+ --numa numactl \
+ --threads 88
+
+## note all q8_0 get repacked with `-rtr 1` to be `q8_r_8` including `attn_k_b.weight` presumably
+llama_model_loader: - type q8_0: 612 tensors
+llama_model_loader: - type iq4_k_r4: 116 tensors
+llama_model_loader: - type iq5_k_r4: 58 tensors
+
+## Confirm fully loaded into THPs
+$ grep Huge /proc/meminfo
+AnonHugePages: 41615360 kB
+ShmemHugePages: 0 kB
+FileHugePages: 0 kB
+HugePages_Total: 0
+HugePages_Free: 0
+HugePages_Rsvd: 0
+HugePages_Surp: 0
+Hugepagesize: 2048 kB
+Hugetlb: 0 kB
+
+$ du /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf
+404947028 /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf
+```
+
+| model | size | params | backend | threads | type_k | fa | mla | amb | mmap | rtr | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -: | --: | ----: | ---: | --: | ---: | ------------: | ---------------: |
+============ Repacked 611 tensors
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | pp512 | 122.55 ± 3.11 |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | pp8192 | 74.34 ± 2.11 |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | pp16384 | 52.68 ± 0.21 |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | tg64@pp512 | 8.20 ± 0.00 |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | tg64@pp8192 | 6.70 ± 0.00 |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 386.18 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | 1 | tg64@pp16384 | 5.52 ± 0.00 |
+
+`build: 4819257c (3613)`
+
+## `PURE-IQ4_K_R4`
+```bash
+numactl -N 0 -m 0 \
+./build/bin/llama-bench \
+ -thp 0 \
+ --mmap 0 \
+ --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-PURE-IQ4_K_R4.gguf \
+ -ctk q8_0 \
+ -mla 3 -fa 1 \
+ -amb 1024 \
+ -fmoe 1 \
+ -p 512,8192,16384 -n 0 \
+ -gp 512,64 \
+ -gp 8192,64 \
+ -gp 16384,64 \
+ -r 2 \
+ --numa numactl \
+ --threads 88
+
+## note the q5_0 attn_k_b.weight so not totally "pure" hah...
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q5_0: 61 tensors
+llama_model_loader: - type iq4_k: 1 tensors
+llama_model_loader: - type iq4_k_r4: 724 tensors
+
+## Confirm fully loaded into THPs
+$ grep Huge /proc/meminfo
+AnonHugePages: 372733952 kB
+ShmemHugePages: 0 kB
+FileHugePages: 0 kB
+HugePages_Total: 0
+HugePages_Free: 0
+HugePages_Rsvd: 0
+HugePages_Surp: 0
+Hugepagesize: 2048 kB
+Hugetlb: 0 kB
+
+$ du /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-PURE-IQ4_K_R4.gguf
+369596400 /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-PURE-IQ4_K_R4.gguf
+```
+
+| model | size | params | backend | threads | type_k | fa | mla | amb | mmap | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -: | --: | ----: | ---: | ---: | ------------: | ---------------: |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 112.83 ± 0.69 |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 63.66 ± 0.00 |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 47.50 ± 0.15 |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 8.50 ± 0.00 |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 7.13 ± 0.02 |
+| deepseek2 671B IQ4_K_R4 - 4.5 bpw | 352.47 GiB | 672.05 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 5.48 ± 0.02 |
+
+`build: 4819257c (3613)`
+
+
+
+> attn_k_b.weight can't be k-, i-, or iqk-quant because its row size is 128, so not a multiple of 256 as needed by i-, k-, idk-quants. Normally this should be caught and a corresponding legacy quant with a block size of 32 should be used instead.
+
+I'm still wondering a bit about that `attn_k_b.weight` error `128 x 65536 are not divisible by 256` which falls back to `q4_0` or `q5_0` etc. However it seems that `q8_0_r8` is okay?
+
+```
+[ 52/1147] blk.3.attn_k_b.weight - [ 128, 65536, 1, 1], type = bf16, Using custom type q8_0_r8 for tensor blk$
+3.attn_k_b.weight
+
+====== llama_model_quantize_internal: did not find weights for blk.3.attn_k_b.weight
+converting to q8_0_r8 .. size = 16.00 MiB -> 8.50 MiB
+```
+
+So wondering if I do a mostly `iq5_k_r4` attention/shared experts, should I let the `attn_k_b.weight` fall back to `q5_0` or set them up to `q8_0_r8` (assuming CPU inference).
+
+Anyway, learning a lot as usual, gonna close this one as solved. Cheers!
+
+---
+
+👤 **saood06** commented on **2025-04-01** at **01:02:46**
+
+> Just grabbed the log, here is how your "pure" `iq4_k_r4` stacks up on full perplexity run , size, and duration:
+> Model Size (GiB) PPL Duration (minutes)
+> DeepSeek-V3-0324-IQ2_K_R4 227 3.5614 +/- 0.02001 (different rig)
+> DeepSeek-V3-0324-PURE-IQ4_K_R4 353 3.2942 +/- 0.01812 47.56
+> DeepSeek-V3-0324-IQ4_K_R4 387 3.2596 +/- 0.01786 55.01
+> DeepSeek-V3-0324-Q8_0 666 3.2454 +/- 0.01773 68.87
+>
+> 
+>
+> In terms of speed to calculate perplexity, these three were similar setups more or less using a single socket of the Xeon 6980P
+
+Thanks, it looks like an acceptable loss in quality for me if it performs fast (wasn't able to make the quant overnight, the quant is cooking now)
+
+
+> `3.2942` is 1.5% higher than `Q8_0`, so not too bad.
+
+I agree.
+
+>I think with `IQ5_K` for the attention tensors and shared experts it should be (almost) on par with the result obtained with `Q8_0` for these.
+
+It might be, but I probably won't test it as doing full ppl runs takes me way too long, and I think I'll be happy with my "pure" IQ4_K_R4 as that should still be faster, even if it is a bit lower quality.
+
+
+> I did a quick llama-bench comparison between the `PURE-IQ4_K_R4` and the `q8_0`/mix `IQ4_K_R4` (using -rtr 1 for `q8_0_r8` this time) on the CPU only the Xeon 6980P with 88 threads and found the results interesting. The graph shows the "pure" version as baseline 100%.
+>
+> 
+
+I'm really surprised that the PURE gains a bit more TG lead at a depth of 8K, but then ends up behind at 16K. This is different from what I've seen when testing. It would be interesting to see the sweep bench and when they actually intersect and how the curves actually look because on my system I've tested up to that depth and the pure still wins out in TG (and it seems like it will always stay ahead with the lead gaining like you saw initially), so I'm curious as to why it ends up losing at higher depths for you.
+
+
+> > attn_k_b.weight can't be k-, i-, or iqk-quant because its row size is 128, so not a multiple of 256 as needed by i-, k-, idk-quants. Normally this should be caught and a corresponding legacy quant with a block size of 32 should be used instead.
+>
+> I'm still wondering a bit about that `attn_k_b.weight` error `128 x 65536 are not divisible by 256` which falls back to `q4_0` or `q5_0` etc. However it seems that `q8_0_r8` is okay?
+
+Yes. `q8_0_r8` is not an i-, k-, or iqk-quants.
+
+
+> So wondering if I do a mostly `iq5_k_r4` attention/shared experts, should I let the `attn_k_b.weight` fall back to `q5_0` or set them up to `q8_0_r8` (assuming CPU inference).
+
+Both work and will have tradeoffs. I think `q5_0` is fine, but other people think that tensor is more sensitive and should be set higher when you can.
+
+---
+
+👤 **ikawrakow** commented on **2025-04-01** at **08:20:43**
+
+>> I'm still wondering a bit about that attn_k_b.weight error 128 x 65536 are not divisible by 256 which falls back to q4_0 or q5_0 etc. However it seems that q8_0_r8 is okay?
+>
+> Both work and will have tradeoffs. I think q5_0 is fine, but other people think that tensor is more sensitive and should be set higher when you can.
+
+Note that `Q5_0` quantization was improved in [#295](https://github.com/ikawrakow/ik_llama.cpp/issues/295), so it should be fine now. But if in doubt, you can use `Q6_0`, which is basically on par with `Q6_K` after PR [#295](https://github.com/ikawrakow/ik_llama.cpp/issues/295). For CPU-only you can use `q5_0_r4` or `q6_0_r4`.
+
+> It might be, but I probably won't test it as doing full ppl runs takes me way too long, and I think I'll be happy with my "pure" IQ4_K_R4 as that should still be faster, even if it is a bit lower quality.
+
+Fair enough.
+
+But if you get the urge to experiment and you are content with slight accuracy loss, you may consider `IQ4_KS`. Here is a performance comparison between pure `IQ4_K` and pure `IQ4_KS` for DeepSeek-Lite on my Ryzen-7950X CPU:
+
+| model | size | fa | mla | rtr | fmoe | test | t/s |
+| -------------------- | ---------: | -: | --: | --: | ---: | ------------: | ---------------: |
+| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | pp512 | 700.85 ± 2.43 |
+| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp512 | 34.41 ± 0.00 |
+| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp4096 | 31.93 ± 0.01 |
+| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp16384 | 25.78 ± 0.00 |
+| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | pp512 | 659.06 ± 2.14 |
+| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp512 | 32.04 ± 0.06 |
+| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp4096 | 29.66 ± 0.02 |
+| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp16384 | 23.74 ± 0.00 |
+
+For DeepSeek-Lite we have `PPL(bf16) = 6.767`, `PPL(pure IQ4_K) = 6.821` (so +0.80%), and `PPL(pure IQ4_KS) = 6.858` (so, +1.34%).
-`============ Repacked 611 tensors`
+---
-```bash
-$ grep Huge /proc/meminfo
-AnonHugePages: 406736896 kB
-ShmemHugePages: 0 kB
-FileHugePages: 0 kB
-HugePages_Total: 0
-HugePages_Free: 0
-HugePages_Rsvd: 0
-HugePages_Surp: 0
-Hugepagesize: 2048 kB
-Hugetlb: 0 kB
+👤 **ubergarm** commented on **2025-04-01** at **15:22:03**
-$ numastat -m -p $(pidof llama-sweep-bench)
-Per-node process memory usage (in MBs) for PID 659855 (llama-sweep-ben)
- Node 0 Node 1 Total
- --------------- --------------- ---------------
-Huge 0.00 0.00 0.00
-Heap 2.80 34.14 36.94
-Stack 0.04 0.05 0.08
-Private 13999.99 383083.54 397083.52
----------------- --------------- --------------- ---------------
-Total 14002.82 383117.72 397120.54
+> > UPDATE Wow!! 3.2596 +/- 0.01786 for this DeepSeek-V3-0324-IQ4_K_R4.gguf quant vs full Q8_0 at 3.2454 +/- 0.01773 in almost half the size!
+>
+> Amazing! You should publish this model.
-Per-node system memory usage (in MBs):
- Node 0 Node 1 Total
- --------------- --------------- ---------------
-MemTotal 771710.76 773987.20 1545697.96
-MemFree 743559.40 1745.54 745304.94
-MemUsed 28151.36 772241.67 800393.03
-SwapCached 0.21 0.69 0.90
-Active 14157.56 383159.96 397317.52
-Inactive 8662.71 383016.18 391678.89
-Active(anon) 14076.79 383139.31 397216.09
-Inactive(anon) 3.26 22.98 26.25
-Active(file) 80.78 20.65 101.43
-Inactive(file) 8659.45 382993.20 391652.64
-Unevictable 29.86 5.50 35.36
-Mlocked 21.07 5.50 26.57
-Dirty 20.00 0.05 20.05
-Writeback 0.00 0.00 0.00
-FilePages 8755.46 383025.92 391781.38
-Mapped 82.61 63.21 145.82
-AnonPages 14097.36 383158.36 397255.73
-Shmem 11.92 5.88 17.80
-KernelStack 39.69 38.11 77.80
-PageTables 6.78 775.85 782.62
-SecPageTables 0.00 0.00 0.00
-NFS_Unstable 0.00 0.00 0.00
-Bounce 0.00 0.00 0.00
-WritebackTmp 0.00 0.00 0.00
-Slab 2489.91 2737.77 5227.68
-SReclaimable 402.44 1022.84 1425.27
-SUnreclaim 2087.47 1714.93 3802.40
-AnonHugePages 14010.00 383100.00 397110.00
-ShmemHugePages 0.00 0.00 0.00
-ShmemPmdMapped 0.00 0.00 0.00
-FileHugePages 0.00 0.00 0.00
-FilePmdMapped 0.00 0.00 0.00
-HugePages_Total 0.00 0.00 0.00
-HugePages_Free 0.00 0.00 0.00
-HugePages_Surp 0.00 0.00 0.00
-KReclaimable 402.44 1022.84 1425.27
-```
+Okay, I have two published `ik_llama.cpp` exclusive quants up on [huggingface ubergarm/DeepSeek-V3-0324-GGUF](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF) repo with hopefully enough of a quick start to get people curious enough to try this fork!
-
+> Note that Q5_0 quantization was improved in https://github.com/ikawrakow/ik_llama.cpp/pull/295, so it should be fine now. But if in doubt, you can use Q6_0, which is basically on par with Q6_K after PR https://github.com/ikawrakow/ik_llama.cpp/pull/295. For CPU-only you can use q5_0_r4 or q6_0_r4
----
+Ahh great, I didn't realize there was a `q5_0_r4`/`q6_0_r4` which is exactly what I was looking for to keep that tensor optimized. So if I re-made the "pure" benchmarked above it could be optimized using the `_r4` for possibly a bit more speed which may be related to:
-👤 **saood06** commented the **2025-04-05** at **02:58:44**:
+> I'm really surprised that the PURE gains a bit more TG lead at a depth of 8K, but then ends up behind at 16K. This is different from what I've seen when testing. It would be interesting to see the sweep bench and when they actually intersect and how the curves actually look...
-@ubergarm
+Yeah I was surprised about that too, I still need to dial in how many threads for tg vs pp too as it pp scales up and actually seems to improve with more threads. I'm out tomorrow but would like to finally get a good llama-sweep-bench going, I should have enough info to run it and get a curve. Thanks!
-You can use the script included to plot them together with the legend using the filenames.
+---
-I did it using your raw data.
+👤 **saood06** commented on **2025-04-01** at **21:39:19**
-TG:
-
+> Fair enough.
+>
+> But if you get the urge to experiment and you are content with slight accuracy loss, you may consider `IQ4_KS`. Here is a performance comparison between pure `IQ4_K` and pure `IQ4_KS` for DeepSeek-Lite on my Ryzen-7950X CPU:
-PP:
+| model | size | fa | mla | rtr | fmoe | test | t/s |
+| -------------------- | ---------: | -: | --: | --: | ---: | ------------: | ---------------: |
+| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | pp512 | 700.85 ± 2.43 |
+| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp512 | 34.41 ± 0.00 |
+| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp4096 | 31.93 ± 0.01 |
+| deepseek2 16B IQ4_KS | 8.15 GiB | 1 | 3 | 1 | 1 | tg128@pp16384 | 25.78 ± 0.00 |
+| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | pp512 | 659.06 ± 2.14 |
+| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp512 | 32.04 ± 0.06 |
+| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp4096 | 29.66 ± 0.02 |
+| deepseek2 16B IQ4_K | 9.00 GiB | 1 | 3 | 1 | 1 | tg128@pp16384 | 23.74 ± 0.00 |
+>
+> For DeepSeek-Lite we have `PPL(bf16) = 6.767`, `PPL(pure IQ4_K) = 6.821` (so +0.80%), and `PPL(pure IQ4_KS) = 6.858` (so, +1.34%).
-
+This on the other hand does tempt me. I like my IQ4_K_R4 but trading off more quality for speed is still tempting.
->Oh nice, is that with llama-batched-bench ?
-It is but I just used a script to graph it. Raw results below, the result for B=1, sweep bench result was used.
-| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
-|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
-| 0 | 128 | 2 | 256 | 0.961 | 0.00 | 42.118 | 6.08 | 43.079 | 5.94 |
-| 0 | 128 | 3 | 384 | 0.963 | 0.00 | 46.332 | 8.29 | 47.295 | 8.12 |
-| 0 | 128 | 4 | 512 | 0.971 | 0.00 | 54.238 | 9.44 | 55.209 | 9.27 |
-| 0 | 128 | 5 | 640 | 1.114 | 0.00 | 58.274 | 10.98 | 59.387 | 10.78 |
-| 0 | 128 | 6 | 768 | 0.960 | 0.00 | 64.813 | 11.85 | 65.773 | 11.68 |
-| 0 | 128 | 7 | 896 | 0.959 | 0.00 | 82.076 | 10.92 | 83.035 | 10.79 |
-| 0 | 128 | 8 | 1024 | 0.961 | 0.00 | 88.326 | 11.59 | 89.287 | 11.47 |
-| 0 | 128 | 9 | 1152 | 0.963 | 0.00 | 105.301 | 10.94 | 106.264 | 10.84 |
-| 0 | 128 | 10 | 1280 | 0.960 | 0.00 | 103.148 | 12.41 | 104.108 | 12.29 |
-| 0 | 128 | 11 | 1408 | 0.960 | 0.00 | 118.788 | 11.85 | 119.748 | 11.76 |
-| 0 | 128 | 12 | 1536 | 0.962 | 0.00 | 118.974 | 12.91 | 119.936 | 12.81 |
-| 0 | 128 | 13 | 1664 | 0.965 | 0.00 | 141.875 | 11.73 | 142.840 | 11.65 |
-| 0 | 128 | 14 | 1792 | 0.972 | 0.00 | 150.249 | 11.93 | 151.221 | 11.85 |
-| 0 | 128 | 15 | 1920 | 0.962 | 0.00 | 158.899 | 12.08 | 159.861 | 12.01 |
-| 0 | 128 | 16 | 2048 | 0.965 | 0.00 | 197.818 | 10.35 | 198.783 | 10.30 |
+> Ahh great, I didn't realize there was a `q5_0_r4`/`q6_0_r4` which is exactly what I was looking for to keep that tensor optimized. So if I re-made the "pure" benchmarked above it could be optimized using the `_r4` for possibly a bit more speed
+I forgot about it as well, since I just let the fallback handle that tensor.
-@ikawrakow
+> Yeah I was surprised about that too, I still need to dial in how many threads for tg vs pp too as it pp scales up and actually seems to improve with more threads. I'm out tomorrow but would like to finally get a good llama-sweep-bench going, I should have enough info to run it and get a curve. Thanks!
-> The fairydreaming tests use a GPU for attention, the slower drop in performance is expected in that setup. But for pure CPU inference I'm expecting around 2.5X lower performance at 32k tokens.
+If you do it would be interesting to see (also I haven't tested it, but setting -tb in sweep-bench should work and allow you to run different thread counts for TG and PP just like you can for the other examples like server and main).
-My own results show ~3.5X lower PP performance at just 16k tokens.
+My "pure" IQ4_K_R4 finished and the preliminary sweep bench results were really good (didn't benchmark very far as I wanted to inference with it, and just wanted to confirm it was loaded in and fast). I'll post a sweep bench graph out to 16K comparing it to some of my old results later.
---
-👤 **ikawrakow** commented the **2025-04-05** at **06:07:18**:
+👤 **saood06** commented on **2025-04-03** at **03:10:35**
-I'm almost sure the TG peaks are due to number of threads. If you try with 128 TG threads, performance will be slightly lower at zero context, but for large contexts it should match the peaks for all context lengths.
+Here's the full graph comparing my currently used fast quants for R1 and V3. The mixes for both are similar. I'm going to go back and test [#287](https://github.com/ikawrakow/ik_llama.cpp/issues/287) next with more configurations to see if I can find one that gives me more performance.
+
+
+
+
+
+Not included in the graph, but looking at other tests I ran [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) does seem to have an impact on performance on my system since I had a very similar quant mix with and without those tensors and they performed slightly differently.
---
-👤 **ubergarm** commented the **2025-04-05** at **15:58:02**:
+👤 **saood06** commented on **2025-04-04** at **13:59:03**
-Okay, got my "CPU only speed blend" quant cooked, copied over, perplexity, and a few sweep-bench comparisons against itself with different threads and amb settings.
+Finally tested batch performance but this is at depth of 0, I'll test deeper depths later.
-
+
-DeepSeek-V3-0324-CPU-IQ3_K_R4 "CPU only speed blend" mix
+12 is the highest, but 6 gets most of the way there.
-## tl;dr;
+---
-Mostly ~q6/iq5_k_r4 for embedding/attention/dense layers/shared experts. First 17 routed experts are down/(up|gate) iq5_k_r4/iq4_k_r4 and the remainder are iq4_k_r4/iq3_k_r4.
+👤 **ubergarm** commented on **2025-04-04** at **15:43:41**
-`PPL = 3.3193 +/- 0.01830`
+Currently cooking up a CPU only "speed mix" blend using some of the advice from above. Will keep you posted.
+
+Otherwise, ran a CPU only `llama-sweep-bench` on the `IQ5_K_R4/IQ4_K_R4` routed experts /`q8_0` all else blend. Accidently left the Intel Xeon 6980P in `balanced` mode instead of `performance`, but the trends should be similar.
+
+
+
+
+
+llama-sweep-bench DeepSeek-V3-0324-IQ4_K_R4 logs
```bash
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type iq6_k: 1 tensors
-llama_model_loader: - type q6_0_r4: 61 tensors
-llama_model_loader: - type iq3_k_r4: 82 tensors
-llama_model_loader: - type iq4_k_r4: 75 tensors
-llama_model_loader: - type iq5_k_r4: 567 tensors
+numactl -N 0 -m 0 \
+./build/bin/llama-sweep-bench \
+ --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
+ --alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
+ --run-time-repack \
+ --no-mmap \
+ -ctk q8_0 \
+ -mla 3 -fa \
+ -amb 1024 \
+ -fmoe \
+ -c 32768 \
+ -ub 512 \
+ --threads 88 \
+ --threads-batch 128 \
+ --numa numactl
+
+Current power profile is: balanced
+Current THP enabled and defrag configs are:
+[always] madvise never
+[always] defer defer+madvise madvise never
+Set numa balancing to be:
+0
+llama_model_loader: loaded meta data with 50 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
+llama_model_loader: - kv 3: general.version str = V3-0324
+llama_model_loader: - kv 4: general.basename str = DeepSeek
+llama_model_loader: - kv 5: general.size_label str = 256x21B
+llama_model_loader: - kv 6: general.license str = mit
+llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
+llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
+llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
+llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 16: general.file_type u32 = 340
+llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
+llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
+llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
+llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
+llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
+llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["...
+llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3...
+llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["...
+llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
+llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 45: general.quantization_version u32 = 2
+llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
+llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
+llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
+llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q8_0: 612 tensors
+llama_model_loader: - type iq4_k_r4: 116 tensors
+llama_model_loader: - type iq5_k_r4: 58 tensors
+llm_load_vocab: special tokens cache size = 818
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 163840
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 128
+llm_load_print_meta: n_head_kv = 128
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 24576
+llm_load_print_meta: n_embd_v_gqa = 16384
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 256
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 10000.0
+llm_load_print_meta: freq_scale_train = 0.025
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = IQ3_K - 3.4325 bpw
+llm_load_print_meta: model ftype = IQ4_K_R4 - 4.5 bpw
llm_load_print_meta: model params = 672.050 B
-llm_load_print_meta: model size = 324.011 GiB (4.141 BPW)
-llm_load_print_meta: repeating layers = 322.703 GiB (4.136 BPW, 670.196 B parameters)
+llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
+llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
llm_load_print_meta: general.name = DeepSeek V3 0324
-```
-
-## Perplexity
-```bash
-numactl -N 1 -m 1 \
-./build/bin/llama-perplexity \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ3_K_R4.gguf \
- -ctk q8_0 \
- -mla 3 -fa \
- -amb 512 \
- -fmoe \
- --ctx-size 512 \
- --ubatch-size 512 \
- -f wiki.test.raw \
- --seed 1337 \
- --numa numactl \
- --threads 128
-
-main: build = 3622 (c616306a)
-main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
-main: seed = 1337
-
-llama_kv_cache_init: CPU KV buffer size = 72.91 MiB
-llama_new_context_with_model: KV self size = 72.91 MiB, c^KV (q8_0): 72.91 MiB, kv^T: not used
-llama_new_context_with_model: CPU output buffer size = 1.97 MiB
-llama_new_context_with_model: CPU compute buffer size = 450.01 MiB
-llama_new_context_with_model: graph nodes = 3487
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_layer_dense_lead = 3
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.5
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+llm_load_tensors: ggml ctx size = 0.47 MiB
+llm_load_tensors: CPU buffer size = 395450.97 MiB
+....................................................................................................
+llama_new_context_with_model: n_ctx = 32768
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 3
+llama_new_context_with_model: attn_max_b = 1024
+llama_new_context_with_model: fused_moe = 1
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 0.025
+llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: CPU KV buffer size = 1166.63 MiB
+llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
+llama_new_context_with_model: CPU output buffer size = 0.49 MiB
+llama_new_context_with_model: CPU compute buffer size = 2662.01 MiB
+llama_new_context_with_model: graph nodes = 5500
llama_new_context_with_model: graph splits = 1
-system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
-perplexity: tokenizing the input ..
-perplexity: tokenization took 885.253 ms
-perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
-perplexity: 18.52 seconds per pass - ETA 43.28 minutes
-[1]2.5128,[2]3.1998,[3]2.3365,[4]1.9572,[5]1.7672,[6]1.6281,[7]1.5395,[8]1.4757,[9]1.4355,[10]1.3986,[11]1.3863,[12]1.4171,[13]1.4335,[14]1.5570,[15]1.6860,[16]1.7427,[17]1.9032,[18]2.0271,[19]1.9913,[20]1.9776,[21]2.0854,[22]2.0602,[23]2.0347,[24]2.0476,[25]2.0186,[26]1.9969,[27]2.0413,[28]2.0507,[29]2.0970,[30]2.1295,[31]2.1608,[32]2.1794,[33]2.2186,[34]2.2617,[35]2.3099,[36]2.3635,[37]2.3978,[38]2.4457,[39]2.4853,[40]2.5440,[41]2.5853,[42]2.5976,[43]2.6473,[44]2.6637,[45]2.7436,[46]2.7934,[47]2.7499,[48]2.7051,[49]2.6812,[50]2.6987,[51]2.7413,[52]2.7537,[53]2.8060,[54]2.8201,[55]2.8508,[56]2.8807,[57]2.8940,[58]2.9277,[59]2.9387,[60]2.9864,[61]3.0248,[62]3.0709,[63]3.1017,[64]3.1429,[65]3.1526,[66]3.1355,[67]3.1118,[68]3.1372,[69]3.1314,[70]3.1476,[71]3.1660,[72]3.1796,[73]3.1931,[74]3.2149,[75]3.1951,[76]3.1489,[77]3.1060,[78]3.1012,[79]3.0804,[80]3.0632,[81]3.0289,[82]3.0333,[83]3.0030,[84]2.9691,[85]2.9358,[86]2.9134,[87]2.9083,[88]2.8809,[89]2.8642,[90]2.8387,[91]2.8113,[92]2.7865,[93]2.7604,[94]2.7369,[95]2.7151,[96]2.7141,[97]2.7189,[98]2.7038,[99]2.6870,[100]2.6894,[101]2.6821,[102]2.6980,[103]2.7237,[104]2.7405,[105]2.7372,[106]2.7591,[107]2.7837,[108]2.8041,[109]2.8372,[110]2.8699,[111]2.8884,[112]2.8629,[113]2.8500,[114]2.8292,[115]2.8139,[116]2.8010,[117]2.7792,[118]2.7587,[119]2.7376,[120]2.7196,[121]2.7036,[122]2.6864,[123]2.6691,[124]2.6500,[125]2.6333,[126]2.6165,[127]2.6034,[128]2.5949,[129]2.5838,[130]2.5714,[131]2.5622,[132]2.5688,[133]2.5782,[134]2.5857,[135]2.5965,[136]2.6115,[137]2.6256,[138]2.6335,[139]2.6442,[140]2.6447,[141]2.6465,[142]2.6450,[143]2.6459,[144]2.6432,[145]2.6352,[146]2.6334,[147]2.6377,[148]2.6379,[149]2.6395,[150]2.6337,[151]2.6321,[152]2.6294,[153]2.6255,[154]2.6254,[155]2.6295,[156]2.6307,[157]2.6363,[158]2.6444,[159]2.6469,[160]2.6556,[161]2.6641,[162]2.6743,[163]2.6796,[164]2.6999,[165]2.7236,[166]2.7410,[167]2.7531,[168]2.7770,[169]2.7996,[170]2.8214,[171]2.8429,[172]2.8273,[173]2.8112,[174]2.7987,[175]2.7868,[176]2.7746,[177]2.7635,[178]2.7508,[179]2.7373,[180]2.7409,[181]2.7550,[182]2.7698,[183]2.7839,[184]2.7969,[185]2.8065,[186]2.8224,[187]2.8380,[188]2.8519,[189]2.8622,[190]2.8627,[191]2.8698,[192]2.8729,[193]2.8780,[194]2.8971,[195]2.9057,[196]2.9187,[197]2.9283,[198]2.9329,[199]2.9386,[200]2.9379,[201]2.9528,[202]2.9480,[203]2.9532,[204]2.9558,[205]2.9556,[206]2.9582,[207]2.9667,[208]2.9757,[209]2.9846,[210]2.9847,[211]2.9802,[212]2.9808,[213]2.9883,[214]2.9901,[215]2.9957,[216]2.9962,[217]2.9920,[218]2.9920,[219]2.9927,[220]2.9925,[221]2.9932,[222]2.9930,[223]2.9939,[224]2.9986,[225]3.0004,[226]2.9925,[227]2.9900,[228]2.9914,[229]2.9951,[230]3.0014,[231]3.0074,[232]2.9994,[233]2.9921,[234]2.9923,[235]2.9911,[236]2.9998,[237]3.0079,[238]3.0172,[239]3.0268,[240]3.0361,[241]3.0471,[242]3.0615,[243]3.0741,[244]3.0820,[245]3.0929,[246]3.1031,[247]3.1021,[248]3.0979,[249]3.0960,[250]3.0899,[251]3.0878,[252]3.0899,[253]3.0939,[254]3.1008,[255]3.1070,[256]3.1101,[257]3.1131,[258]3.1144,[259]3.1179,[260]3.1201,[261]3.1214,[262]3.1205,[263]3.1263,[264]3.1286,[265]3.1291,[266]3.1306,[267]3.1327,[268]3.1357,[269]3.1385,[270]3.1378,[271]3.1363,[272]3.1297,[273]3.1294,[274]3.1225,[275]3.1122,[276]3.1010,[277]3.1029,[278]3.1128,[279]3.1187,[280]3.1265,[281]3.1338,[282]3.1394,[283]3.1458,[284]3.1518,[285]3.1654,[286]3.1675,[287]3.1708,[288]3.1759,[289]3.1781,[290]3.1701,[291]3.1613,[292]3.1597,[293]3.1591,[294]3.1570,[295]3.1548,[296]3.1570,[297]3.1575,[298]3.1631,[299]3.1689,[300]3.1718,[301]3.1758,[302]3.1780,[303]3.1795,[304]3.1790,[305]3.1904,[306]3.1973,[307]3.2079,[308]3.1969,[309]3.1920,[310]3.1831,[311]3.1862,[312]3.1877,[313]3.1936,[314]3.1959,[315]3.1990,[316]3.2006,[317]3.2026,[318]3.2032,[319]3.2035,[320]3.2076,[321]3.2078,[322]3.2096,[323]3.2160,[324]3.2167,[325]3.2221,[326]3.2263,[327]3.2302,[328]3.2327,[329]3.2346,[330]3.2409,[331]3.2439,[332]3.2478,[333]3.2467,[334]3.2467,[335]3.2474,[336]3.2475,[337]3.2486,[338]3.2488,[339]3.2512,[340]3.2547,[341]3.2599,[342]3.2687,[343]3.2775,[344]3.2824,[345]3.2740,[346]3.2664,[347]3.2617,[348]3.2543,[349]3.2505,[350]3.2491,[351]3.2537,[352]3.2683,[353]3.2772,[354]3.2897,[355]3.2982,[356]3.3034,[357]3.3150,[358]3.3248,[359]3.3276,[360]3.3340,[361]3.3433,[362]3.3519,[363]3.3572,[364]3.3639,[365]3.3695,[366]3.3796,[367]3.3881,[368]3.3943,[369]3.4019,[370]3.4104,[371]3.4235,[372]3.4322,[373]3.4356,[374]3.4389,[375]3.4437,[376]3.4563,[377]3.4674,[378]3.4704,[379]3.4704,[380]3.4668,[381]3.4718,[382]3.4775,[383]3.4807,[384]3.4850,[385]3.4888,[386]3.4947,[387]3.5004,[388]3.5033,[389]3.4933,[390]3.4842,[391]3.4740,[392]3.4687,[393]3.4596,[394]3.4511,[395]3.4422,[396]3.4325,[397]3.4241,[398]3.4150,[399]3.4048,[400]3.3963,[401]3.3865,[402]3.3766,[403]3.3683,[404]3.3584,[405]3.3492,[406]3.3398,[407]3.3307,[408]3.3220,[409]3.3136,[410]3.3076,[411]3.3086,[412]3.3038,[413]3.3059,[414]3.3075,[415]3.3050,[416]3.3052,[417]3.3071,[418]3.3014,[419]3.3026,[420]3.3000,[421]3.2989,[422]3.2994,[423]3.2989,[424]3.3026,[425]3.3024,[426]3.3029,[427]3.3019,[428]3.3043,[429]3.3055,[430]3.3082,[431]3.3091,[432]3.3081,[433]3.3046,[434]3.3051,[435]3.2979,[436]3.2921,[437]3.2881,[438]3.2863,[439]3.2839,[440]3.2887,[441]3.2943,[442]3.3014,[443]3.2995,[444]3.3002,[445]3.3011,[446]3.3052,[447]3.3086,[448]3.3108,[449]3.3137,[450]3.3174,[451]3.3201,[452]3.3221,[453]3.3237,[454]3.3223,[455]3.3248,[456]3.3250,[457]3.3274,[458]3.3324,[459]3.3327,[460]3.3328,[461]3.3296,[462]3.3332,[463]3.3404,[464]3.3456,[465]3.3391,[466]3.3371,[467]3.3352,[468]3.3366,[469]3.3339,[470]3.3313,[471]3.3317,[472]3.3325,[473]3.3316,[474]3.3305,[475]3.3315,[476]3.3304,[477]3.3295,[478]3.3301,[479]3.3316,[480]3.3341,[481]3.3304,[482]3.3339,[483]3.3334,[484]3.3369,[485]3.3428,[486]3.3461,[487]3.3495,[488]3.3550,[489]3.3575,[490]3.3626,[491]3.3687,[492]3.3732,[493]3.3730,[494]3.3741,[495]3.3762,[496]3.3781,[497]3.3809,[498]3.3814,[499]3.3810,[500]3.3848,[501]3.3892,[502]3.3883,[503]3.3870,[504]3.3888,[505]3.3918,[506]3.3999,[507]3.4030,[508]3.4065,[509]3.3990,[510]3.3941,[511]3.3880,[512]3.3837,[513]3.3780,[514]3.3765,[515]3.3785,[516]3.3735,[517]3.3735,[518]3.3724,[519]3.3725,[520]3.3764,[521]3.3751,[522]3.3735,[523]3.3789,[524]3.3778,[525]3.3762,[526]3.3717,[527]3.3665,[528]3.3636,[529]3.3604,[530]3.3576,[531]3.3545,[532]3.3490,[533]3.3432,[534]3.3388,[535]3.3392,[536]3.3418,[537]3.3449,[538]3.3475,[539]3.3500,[540]3.3552,[541]3.3583,[542]3.3606,[543]3.3552,[544]3.3510,[545]3.3506,[546]3.3443,[547]3.3382,[548]3.3318,[549]3.3255,[550]3.3199,[551]3.3139,[552]3.3083,[553]3.3027,[554]3.3008,[555]3.2993,[556]3.3020,[557]3.3058,[558]3.3116,[559]3.3158,[560]3.3212,[561]3.3193,
-llama_print_timings: load time = 225352.00 ms
-llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: prompt eval time = 2556352.12 ms / 287232 tokens ( 8.90 ms per token, 112.36 tokens per second)
-llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: total time = 2599092.64 ms / 287233 tokens
-
-Final estimate: PPL = 3.3193 +/- 0.01830
-```
-
-## Quantization
-```bash
-#!/usr/bin/env bash
-
-# Notes:
-# https://github.com/ikawrakow/ik_llama.cpp/issues/296#issuecomment-2765210993
-# https://github.com/ikawrakow/ik_llama.cpp/issues/296#issuecomment-2768567062
-custom="
-# Token embedding and output tensors
-# note token_embd cannot be repacked quant type e.g. `*_r4`
-token_embd\.weight=iq6_k
-output\.weight=iq5_k_r4
-output_norm\.weight=iq5_k_r4
-
-# First 3 dense layers (0-3)
-blk\.[0-2]\.attn_k_b.*=q6_0_r4
-blk\.[0-2]\.attn_.*=iq5_k_r4
-blk\.[0-2]\..*=iq5_k_r4
-
-# All attention, norm weights, and bias tensors for MoE layers (3-60)
-# Except blk.*.attn_k_b.weight is not divisible by 256, so no iq6_k, so go with q6_0_r4
-blk\.[3-9]\.attn_k_b.*=q6_0_r4
-blk\.[1-5][0-9]\.attn_k_b.*=q6_0_r4
-blk\.60\.attn_k_b.*=q6_0_r4
+main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 88, n_threads_batch = 128
-blk\.[3-9]\.attn_.*=iq5_k_r4
-blk\.[1-5][0-9]\.attn_.*=iq5_k_r4
-blk\.60\.attn_.*=iq5_k_r4
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 4.412 | 116.05 | 13.303 | 9.62 |
+| 512 | 128 | 512 | 4.384 | 116.79 | 13.639 | 9.38 |
+| 512 | 128 | 1024 | 4.711 | 108.69 | 14.823 | 8.64 |
+| 512 | 128 | 1536 | 5.448 | 93.98 | 15.187 | 8.43 |
+| 512 | 128 | 2048 | 5.361 | 95.51 | 15.282 | 8.38 |
+| 512 | 128 | 2560 | 6.005 | 85.26 | 16.579 | 7.72 |
+| 512 | 128 | 3072 | 6.276 | 81.58 | 15.304 | 8.36 |
+| 512 | 128 | 3584 | 6.383 | 80.21 | 15.072 | 8.49 |
+| 512 | 128 | 4096 | 6.548 | 78.19 | 15.006 | 8.53 |
+| 512 | 128 | 4608 | 7.245 | 70.67 | 15.262 | 8.39 |
+| 512 | 128 | 5120 | 7.498 | 68.29 | 15.404 | 8.31 |
+| 512 | 128 | 5632 | 7.992 | 64.06 | 15.555 | 8.23 |
+| 512 | 128 | 6144 | 7.825 | 65.43 | 16.026 | 7.99 |
+| 512 | 128 | 6656 | 8.140 | 62.90 | 16.011 | 7.99 |
+| 512 | 128 | 7168 | 9.216 | 55.55 | 16.322 | 7.84 |
+| 512 | 128 | 7680 | 9.197 | 55.67 | 16.641 | 7.69 |
+| 512 | 128 | 8192 | 9.601 | 53.33 | 17.393 | 7.36 |
+| 512 | 128 | 8704 | 9.049 | 56.58 | 17.375 | 7.37 |
+| 512 | 128 | 9216 | 9.669 | 52.95 | 17.475 | 7.32 |
+| 512 | 128 | 9728 | 9.592 | 53.38 | 17.728 | 7.22 |
+| 512 | 128 | 10240 | 10.385 | 49.30 | 18.297 | 7.00 |
+| 512 | 128 | 10752 | 10.284 | 49.79 | 18.500 | 6.92 |
+| 512 | 128 | 11264 | 10.422 | 49.13 | 18.387 | 6.96 |
+| 512 | 128 | 11776 | 11.144 | 45.94 | 18.602 | 6.88 |
+| 512 | 128 | 12288 | 11.066 | 46.27 | 19.002 | 6.74 |
+| 512 | 128 | 12800 | 11.749 | 43.58 | 19.933 | 6.42 |
+| 512 | 128 | 13312 | 11.813 | 43.34 | 19.790 | 6.47 |
+| 512 | 128 | 13824 | 12.959 | 39.51 | 18.546 | 6.90 |
+| 512 | 128 | 14336 | 12.402 | 41.28 | 20.914 | 6.12 |
+| 512 | 128 | 14848 | 13.064 | 39.19 | 20.959 | 6.11 |
+| 512 | 128 | 15360 | 13.137 | 38.97 | 21.331 | 6.00 |
+| 512 | 128 | 15872 | 13.158 | 38.91 | 21.756 | 5.88 |
+| 512 | 128 | 16384 | 13.227 | 38.71 | 21.625 | 5.92 |
+| 512 | 128 | 16896 | 14.089 | 36.34 | 22.327 | 5.73 |
+| 512 | 128 | 17408 | 14.251 | 35.93 | 22.982 | 5.57 |
+| 512 | 128 | 17920 | 14.794 | 34.61 | 22.817 | 5.61 |
+| 512 | 128 | 18432 | 14.544 | 35.20 | 23.187 | 5.52 |
+| 512 | 128 | 18944 | 14.835 | 34.51 | 23.744 | 5.39 |
+| 512 | 128 | 19456 | 15.538 | 32.95 | 20.042 | 6.39 |
+| 512 | 128 | 19968 | 16.182 | 31.64 | 24.139 | 5.30 |
+| 512 | 128 | 20480 | 16.972 | 30.17 | 24.933 | 5.13 |
+| 512 | 128 | 20992 | 15.876 | 32.25 | 25.319 | 5.06 |
+| 512 | 128 | 21504 | 16.150 | 31.70 | 25.309 | 5.06 |
+| 512 | 128 | 22016 | 16.810 | 30.46 | 25.217 | 5.08 |
+| 512 | 128 | 22528 | 17.180 | 29.80 | 25.202 | 5.08 |
+| 512 | 128 | 23040 | 18.171 | 28.18 | 25.445 | 5.03 |
+| 512 | 128 | 23552 | 17.318 | 29.56 | 26.029 | 4.92 |
+| 512 | 128 | 24064 | 18.848 | 27.16 | 26.128 | 4.90 |
+| 512 | 128 | 24576 | 18.282 | 28.01 | 26.675 | 4.80 |
+| 512 | 128 | 25088 | 18.234 | 28.08 | 21.079 | 6.07 |
+| 512 | 128 | 25600 | 18.584 | 27.55 | 27.583 | 4.64 |
+| 512 | 128 | 26112 | 19.350 | 26.46 | 27.687 | 4.62 |
+| 512 | 128 | 26624 | 19.053 | 26.87 | 27.982 | 4.57 |
+| 512 | 128 | 27136 | 19.228 | 26.63 | 28.328 | 4.52 |
+| 512 | 128 | 27648 | 20.705 | 24.73 | 28.819 | 4.44 |
+| 512 | 128 | 28160 | 19.993 | 25.61 | 29.508 | 4.34 |
+| 512 | 128 | 28672 | 20.698 | 24.74 | 29.902 | 4.28 |
+| 512 | 128 | 29184 | 20.320 | 25.20 | 29.555 | 4.33 |
+| 512 | 128 | 29696 | 21.366 | 23.96 | 30.114 | 4.25 |
+| 512 | 128 | 30208 | 21.293 | 24.05 | 29.625 | 4.32 |
+| 512 | 128 | 30720 | 21.417 | 23.91 | 22.628 | 5.66 |
+| 512 | 128 | 31232 | 21.941 | 23.34 | 30.653 | 4.18 |
+| 512 | 128 | 31744 | 22.326 | 22.93 | 31.921 | 4.01 |
+| 512 | 128 | 32256 | 23.055 | 22.21 | 31.750 | 4.03 |
+============ Repacked 611 tensors
-blk\.[3-9]\.ffn_norm\.weight=iq5_k_r4
-blk\.[1-5][0-9]\.ffn_norm\.weight=iq5_k_r4
-blk\.60\.ffn_norm\.weight=iq5_k_r4
+```
-blk\.[3-9]\.exp_probs_b\.bias=iq5_k_r4
-blk\.[1-5][0-9]\.exp_probs_b\.bias=iq5_k_r4
-blk\.60\.exp_probs_b\.bias=iq5_k_r4
+
-# Shared Experts (3-60)
-blk\.[3-9]\.ffn_down_shexp\.weight=iq5_k_r4
-blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_k_r4
-blk\.60\.ffn_down_shexp\.weight=iq5_k_r4
+> Finally tested batch performance
-blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq5_k_r4
-blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq5_k_r4
-blk\.60\.ffn_(gate|up)_shexp\.weight=iq5_k_r4
+Oh nice, is that with `llama-batched-bench` ?
-# Routed Experts (3-60)
-# First ~16 layers are more sensitive so keep larger
-blk\.[3-9]\.ffn_down_exps\.weight=iq5_k_r4
-blk\.[1][0-9]\.ffn_down_exps\.weight=iq5_k_r4
-blk\.[2-5][0-9]\.ffn_down_exps\.weight=iq4_k_r4
-blk\.60\.ffn_down_exps\.weight=iq4_k_r4
+---
-blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
-blk\.[1][0-9]\.ffn_(gate|up)_exps\.weight=iq4_k_r4
-blk\.[2-5][0-9]\.ffn_(gate|up)_exps\.weight=iq3_k_r4
-blk\.60\.ffn_(gate|up)_exps\.weight=iq3_k_r4
-"
-custom=$(
- echo "$custom" | grep -v '^#' | \
- sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
-)
+👤 **ikawrakow** commented on **2025-04-04** at **16:55:06**
-./build/bin/llama-quantize \
- --imatrix /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix \
- --token-embedding-type iq6_k \
- --output-tensor-type iq5_k_r4 \
- --custom-q "$custom" \
- /mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf \
- /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ3_K.gguf \
- IQ3_K \
- 24
-```
+Nearly a 6X decrease in PP performance is quite a bit more than I'm expecting. In my testing it has been more in the 2.5X range when going to 32k tokens. I wonder if this is due to the balanced performance setting or the huge model (or both).
-
+---
-@saood06
+👤 **ubergarm** commented on **2025-04-04** at **17:59:03**
-> You can use the script included to plot them together with the legend using the filenames.
+> Nearly a 6X decrease in PP performance is quite a bit more than I'm expecting. In my testing it has been more in the 2.5X range when going to 32k tokens. I wonder if this is due to the balanced performance setting or the huge model (or both).
-Ahh yes, I see, got the script working like so:
-```bash
-$ uv venv ./venv --python 3.12 --python-preference=only-managed
-$ source ./venv/bin/activate
-$ uv pip install pandas matplotlib
-$ python ./examples/sweep-bench/sweep-bench-plot.py \
- DeepSeek-V3-0324-CPU-IQ3_K_R4-tb128-t88-amb1024.md \
- DeepSeek-V3-0324-CPU-IQ3_K_R4-tb128-t128-amb1024.md \
- DeepSeek-V3-0324-CPU-IQ3_K_R4-tb128-t88-amb1536.md
-```
+Yeah, a lot of little variables can effect performance. One other data point I got was from [fairydreaming on r/LocalLLama](https://www.reddit.com/r/LocalLLaMA/comments/1joyl9t/comment/ml1lgob/) which drops off more slowly on their CPU+GPU rig to ~1.5X decrease in PP performance across 32k context.
---
-@ikawrakow
-
-> I'm almost sure the TG peaks are due to number of threads. If you try with 128 TG threads, performance will be slightly lower at zero context, but for large contexts it should match the peaks for all context lengths.
+👤 **ikawrakow** commented on **2025-04-04** at **18:02:59**
-I used saood06's script above to graph these three configurations. The variables between the runs are:
-* `--threads` either 88 or 128
-* `-amb` either 1024 or 1536
+The TG peaks are also quite interesting. If I could make the performance stay where the peaks are for any `N_KV`, it would be a ~40% improvement at 32k tokens! Here I wonder if it is related to the 88 threads (and the work not splitting very well between them), or somehow related to the `-amb` option.
-I left `--threads-batch` constant at 128 using single socket of Intel Xeon 6980P (with numactl).
+@ubergarm
-#### pp
+You always use `numactl`. I'm really curious to know what happens if you don't involve `numactl` at all. I.e.,
+```
+./build/bin/llama-sweep-bench \
+ --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
+ --alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
+ --run-time-repack \
+ --no-mmap \
+ -ctk q8_0 \
+ -mla 3 -fa \
+ -amb 1024 \
+ -fmoe \
+ -c 32768 \
+ -ub 512 \
+ --threads 88 \
+ --threads-batch 128
+```
-
+---
-#### tg
+👤 **ikawrakow** commented on **2025-04-04** at **18:05:38**
-
+The fairydreaming tests use a GPU for attention, the slower drop in performance is expected in that setup. But for pure CPU inference I'm expecting around 2.5X lower performance at 32k tokens.
-## Observations
+---
-* With tg threads 88 the bumps in speed occur at the same place for both `-amb 1024` and `-amb 1536`.
-* Raising tg threads to 128 seems slightly worse with no bumps in speed.
-* Oddly pp had some variability between the runs despite keeping `--threads-batch 128` constant
+👤 **ubergarm** commented on **2025-04-04** at **21:02:06**
-I'm not sure what to try next. I could:
-* play with `numactl --interleave=all llama-sweep-bench --numa distribute` and pump up threads to 256 (each CPU has 128 physical cores).
-* try varying `--threads` to other multiples of 8 e.g. 64,72,80, ,96 to see if it effects the tg bump
-* explore perplexity/speed trade-off using smaller quant vs `-ser 6,1`
+> You always use numactl. I'm really curious to know what happens if you don't involve numactl at all. I.e.,
-That's all for now. Below are just the swee-bench logs for reference. Thanks!
+I had some time while waiting for my "speed blend" to rsync between servers and tried the command without any numactl stuff. Interestingly, it loaded mostly on node 1, then some of the weights went into node 0 just before loading finished. I included numastat to show that in the detailed log.
-## Logs
+
-llama-sweep-bench logs and raw data
+llama-sweep-bench without `numactl` stuff
```bash
-## pp 128 threads, tg 88 threads, amb 1024
-numactl -N 0 -m 0 \
-./build/bin/llama-sweep-bench \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ3_K_R4.gguf \
+# drop caches
+$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
+
+# set to performance this time
+Current power profile is: performance
+
+# always encourages it to use anonhugepages
+# as testing suggets improves performance on this rig
+Current THP enabled and defrag configs are:
+[always]
+[always]
+
+# numa_balancing off
+Set numa balancing to be: 0
+
+$ ./build/bin/llama-sweep-bench \
+ --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
+ --alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
+ --run-time-repack \
--no-mmap \
-ctk q8_0 \
-mla 3 -fa \
@@ -5490,32 +3041,127 @@ numactl -N 0 -m 0 \
-c 32768 \
-ub 512 \
--threads 88 \
- --threads-batch 128 \
- --numa numactl
-
-Current power profile is: performance
-Current THP enabled and defrag configs are:
-[always] madvise never
-[always] defer defer+madvise madvise never
-Set numa balancing to be:
-0
-llama_model_loader: loaded meta data with 50 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-
-IQ3_K_R4.gguf (version GGUF V3 (latest))
+ --threads-batch 128 2>&1 | tee -a output.log
+llama_model_loader: loaded meta data with 50 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
+llama_model_loader: - kv 3: general.version str = V3-0324
+llama_model_loader: - kv 4: general.basename str = DeepSeek
+llama_model_loader: - kv 5: general.size_label str = 256x21B
+llama_model_loader: - kv 6: general.license str = mit
+llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
+llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
+llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
+llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 16: general.file_type u32 = 340
+llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
+llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
+llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
+llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
+llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
+llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["...
+llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3...
+llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["...
+llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
+llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 45: general.quantization_version u32 = 2
+llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
+llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
+llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
+llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type iq6_k: 1 tensors
-llama_model_loader: - type q6_0_r4: 61 tensors
-llama_model_loader: - type iq3_k_r4: 82 tensors
-llama_model_loader: - type iq4_k_r4: 75 tensors
-llama_model_loader: - type iq5_k_r4: 567 tensors
-
+llama_model_loader: - type q8_0: 612 tensors
+llama_model_loader: - type iq4_k_r4: 116 tensors
+llama_model_loader: - type iq5_k_r4: 58 tensors
+llm_load_vocab: special tokens cache size = 818
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 163840
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 128
+llm_load_print_meta: n_head_kv = 128
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 24576
+llm_load_print_meta: n_embd_v_gqa = 16384
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 256
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 10000.0
+llm_load_print_meta: freq_scale_train = 0.025
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = IQ3_K - 3.4325 bpw
+llm_load_print_meta: model ftype = IQ4_K_R4 - 4.5 bpw
llm_load_print_meta: model params = 672.050 B
-llm_load_print_meta: model size = 324.011 GiB (4.141 BPW)
-llm_load_print_meta: repeating layers = 322.703 GiB (4.136 BPW, 670.196 B parameters)
+llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
+llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
llm_load_print_meta: general.name = DeepSeek V3 0324
-
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_layer_dense_lead = 3
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.5
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+llm_load_tensors: ggml ctx size = 0.47 MiB
+llm_load_tensors: CPU buffer size = 395450.97 MiB
+....................................................................................................
llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
@@ -5526,7 +3172,67 @@ llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
-
+llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
+llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: CPU KV buffer size = 1166.63 MiB
llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
@@ -5535,293 +3241,198 @@ llama_new_context_with_model: graph nodes = 5500
llama_new_context_with_model: graph splits = 1
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 88, n_threads_batch = 128
+```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
-| 512 | 128 | 0 | 4.705 | 108.82 | 11.986 | 10.68 |
-| 512 | 128 | 512 | 4.756 | 107.65 | 12.792 | 10.01 |
-| 512 | 128 | 1024 | 5.161 | 99.20 | 12.700 | 10.08 |
-| 512 | 128 | 1536 | 5.728 | 89.39 | 12.775 | 10.02 |
-| 512 | 128 | 2048 | 5.682 | 90.11 | 12.947 | 9.89 |
-| 512 | 128 | 2560 | 6.333 | 80.84 | 14.947 | 8.56 |
-| 512 | 128 | 3072 | 6.517 | 78.57 | 13.199 | 9.70 |
-| 512 | 128 | 3584 | 6.776 | 75.56 | 13.677 | 9.36 |
-| 512 | 128 | 4096 | 7.022 | 72.92 | 13.826 | 9.26 |
-| 512 | 128 | 4608 | 7.585 | 67.51 | 13.937 | 9.18 |
-| 512 | 128 | 5120 | 9.009 | 56.83 | 14.367 | 8.91 |
-| 512 | 128 | 5632 | 8.190 | 62.51 | 14.409 | 8.88 |
-| 512 | 128 | 6144 | 8.799 | 58.19 | 14.651 | 8.74 |
-| 512 | 128 | 6656 | 9.711 | 52.72 | 14.788 | 8.66 |
-| 512 | 128 | 7168 | 9.143 | 56.00 | 15.070 | 8.49 |
-| 512 | 128 | 7680 | 9.905 | 51.69 | 15.394 | 8.31 |
-| 512 | 128 | 8192 | 9.458 | 54.14 | 16.353 | 7.83 |
-| 512 | 128 | 8704 | 10.134 | 50.52 | 15.867 | 8.07 |
-| 512 | 128 | 9216 | 10.179 | 50.30 | 16.088 | 7.96 |
-| 512 | 128 | 9728 | 10.385 | 49.30 | 16.817 | 7.61 |
-| 512 | 128 | 10240 | 10.765 | 47.56 | 17.119 | 7.48 |
-| 512 | 128 | 10752 | 10.896 | 46.99 | 17.115 | 7.48 |
-| 512 | 128 | 11264 | 11.317 | 45.24 | 17.280 | 7.41 |
-| 512 | 128 | 11776 | 11.461 | 44.67 | 17.702 | 7.23 |
-| 512 | 128 | 12288 | 12.248 | 41.80 | 18.129 | 7.06 |
-| 512 | 128 | 12800 | 12.176 | 42.05 | 18.294 | 7.00 |
-| 512 | 128 | 13312 | 12.296 | 41.64 | 18.273 | 7.00 |
-| 512 | 128 | 13824 | 13.446 | 38.08 | 17.938 | 7.14 |
-| 512 | 128 | 14336 | 13.376 | 38.28 | 19.027 | 6.73 |
-| 512 | 128 | 14848 | 13.901 | 36.83 | 19.547 | 6.55 |
-| 512 | 128 | 15360 | 13.727 | 37.30 | 19.853 | 6.45 |
-| 512 | 128 | 15872 | 14.168 | 36.14 | 20.259 | 6.32 |
-| 512 | 128 | 16384 | 14.756 | 34.70 | 20.206 | 6.33 |
-| 512 | 128 | 16896 | 15.237 | 33.60 | 20.719 | 6.18 |
-| 512 | 128 | 17408 | 15.027 | 34.07 | 20.608 | 6.21 |
-| 512 | 128 | 17920 | 15.585 | 32.85 | 21.305 | 6.01 |
-| 512 | 128 | 18432 | 15.882 | 32.24 | 21.786 | 5.88 |
-| 512 | 128 | 18944 | 16.613 | 30.82 | 22.082 | 5.80 |
-| 512 | 128 | 19456 | 16.195 | 31.61 | 18.518 | 6.91 |
-| 512 | 128 | 19968 | 17.213 | 29.75 | 22.846 | 5.60 |
-| 512 | 128 | 20480 | 17.539 | 29.19 | 22.746 | 5.63 |
-| 512 | 128 | 20992 | 17.368 | 29.48 | 23.104 | 5.54 |
-| 512 | 128 | 21504 | 17.592 | 29.10 | 23.148 | 5.53 |
-| 512 | 128 | 22016 | 17.977 | 28.48 | 23.651 | 5.41 |
-| 512 | 128 | 22528 | 18.229 | 28.09 | 23.878 | 5.36 |
-| 512 | 128 | 23040 | 18.590 | 27.54 | 24.244 | 5.28 |
-| 512 | 128 | 23552 | 19.303 | 26.52 | 24.274 | 5.27 |
-| 512 | 128 | 24064 | 19.662 | 26.04 | 25.586 | 5.00 |
-| 512 | 128 | 24576 | 20.019 | 25.58 | 25.427 | 5.03 |
-| 512 | 128 | 25088 | 20.519 | 24.95 | 19.775 | 6.47 |
-| 512 | 128 | 25600 | 20.427 | 25.06 | 26.742 | 4.79 |
-| 512 | 128 | 26112 | 20.727 | 24.70 | 26.280 | 4.87 |
-| 512 | 128 | 26624 | 20.837 | 24.57 | 27.207 | 4.70 |
-| 512 | 128 | 27136 | 21.536 | 23.77 | 27.221 | 4.70 |
-| 512 | 128 | 27648 | 21.512 | 23.80 | 27.161 | 4.71 |
-| 512 | 128 | 28160 | 21.916 | 23.36 | 27.883 | 4.59 |
-| 512 | 128 | 28672 | 22.764 | 22.49 | 27.623 | 4.63 |
-| 512 | 128 | 29184 | 22.665 | 22.59 | 28.389 | 4.51 |
-| 512 | 128 | 29696 | 23.483 | 21.80 | 28.581 | 4.48 |
-| 512 | 128 | 30208 | 23.785 | 21.53 | 28.538 | 4.49 |
-| 512 | 128 | 30720 | 24.100 | 21.24 | 21.589 | 5.93 |
-| 512 | 128 | 31232 | 24.275 | 21.09 | 29.526 | 4.34 |
-| 512 | 128 | 31744 | 24.416 | 20.97 | 28.978 | 4.42 |
-| 512 | 128 | 32256 | 25.127 | 20.38 | 28.427 | 4.50 |
+| 512 | 128 | 0 | 4.214 | 121.49 | 19.559 | 6.54 |
+| 512 | 128 | 512 | 4.304 | 118.97 | 19.317 | 6.63 |
+| 512 | 128 | 1024 | 4.539 | 112.79 | 19.692 | 6.50 |
+| 512 | 128 | 1536 | 4.859 | 105.37 | 20.024 | 6.39 |
+| 512 | 128 | 2048 | 5.429 | 94.31 | 21.110 | 6.06 |
+| 512 | 128 | 2560 | 5.698 | 89.86 | 21.308 | 6.01 |
+| 512 | 128 | 3072 | 5.948 | 86.08 | 21.940 | 5.83 |
+| 512 | 128 | 3584 | 6.368 | 80.40 | 21.664 | 5.91 |
+| 512 | 128 | 4096 | 6.665 | 76.82 | 21.375 | 5.99 |
+| 512 | 128 | 4608 | 7.055 | 72.57 | 21.764 | 5.88 |
+| 512 | 128 | 5120 | 7.397 | 69.22 | 21.929 | 5.84 |
+| 512 | 128 | 5632 | 7.846 | 65.25 | 21.051 | 6.08 |
+| 512 | 128 | 6144 | 8.496 | 60.27 | 23.048 | 5.55 |
+| 512 | 128 | 6656 | 8.884 | 57.63 | 21.473 | 5.96 |
+| 512 | 128 | 7168 | 9.241 | 55.41 | 22.841 | 5.60 |
+| 512 | 128 | 7680 | 9.832 | 52.08 | 21.809 | 5.87 |
+| 512 | 128 | 8192 | 9.957 | 51.42 | 22.837 | 5.60 |
+| 512 | 128 | 8704 | 10.521 | 48.67 | 23.967 | 5.34 |
+| 512 | 128 | 9216 | 10.787 | 47.46 | 23.475 | 5.45 |
+| 512 | 128 | 9728 | 11.187 | 45.77 | 23.407 | 5.47 |
+| 512 | 128 | 10240 | 11.988 | 42.71 | 25.122 | 5.10 |
+| 512 | 128 | 10752 | 12.502 | 40.95 | 24.736 | 5.17 |
+| 512 | 128 | 11264 | 12.874 | 39.77 | 24.705 | 5.18 |
+| 512 | 128 | 11776 | 12.893 | 39.71 | 24.578 | 5.21 |
+| 512 | 128 | 12288 | 13.309 | 38.47 | 25.649 | 4.99 |
+| 512 | 128 | 12800 | 13.647 | 37.52 | 24.652 | 5.19 |
+| 512 | 128 | 13312 | 14.318 | 35.76 | 25.035 | 5.11 |
+| 512 | 128 | 13824 | 14.879 | 34.41 | 24.243 | 5.28 |
+| 512 | 128 | 14336 | 15.221 | 33.64 | 25.826 | 4.96 |
+| 512 | 128 | 14848 | 15.292 | 33.48 | 26.096 | 4.91 |
+| 512 | 128 | 15360 | 15.592 | 32.84 | 25.744 | 4.97 |
+| 512 | 128 | 15872 | 15.757 | 32.49 | 26.224 | 4.88 |
+| 512 | 128 | 16384 | 14.834 | 34.51 | 26.616 | 4.81 |
+| 512 | 128 | 16896 | 15.757 | 32.49 | 27.967 | 4.58 |
+| 512 | 128 | 17408 | 16.378 | 31.26 | 27.682 | 4.62 |
+| 512 | 128 | 17920 | 16.754 | 30.56 | 27.855 | 4.60 |
+| 512 | 128 | 18432 | 17.300 | 29.59 | 27.905 | 4.59 |
+| 512 | 128 | 18944 | 17.347 | 29.52 | 28.338 | 4.52 |
+| 512 | 128 | 19456 | 17.895 | 28.61 | 24.992 | 5.12 |
+| 512 | 128 | 19968 | 18.210 | 28.12 | 28.662 | 4.47 |
+| 512 | 128 | 20480 | 18.579 | 27.56 | 28.880 | 4.43 |
+| 512 | 128 | 20992 | 18.920 | 27.06 | 29.153 | 4.39 |
+| 512 | 128 | 21504 | 19.537 | 26.21 | 29.282 | 4.37 |
+| 512 | 128 | 22016 | 19.716 | 25.97 | 29.682 | 4.31 |
+| 512 | 128 | 22528 | 20.576 | 24.88 | 30.040 | 4.26 |
+| 512 | 128 | 23040 | 20.705 | 24.73 | 30.366 | 4.22 |
+| 512 | 128 | 23552 | 21.201 | 24.15 | 30.501 | 4.20 |
+| 512 | 128 | 24064 | 21.809 | 23.48 | 30.800 | 4.16 |
+| 512 | 128 | 24576 | 22.042 | 23.23 | 30.988 | 4.13 |
+| 512 | 128 | 25088 | 22.660 | 22.59 | 26.174 | 4.89 |
+| 512 | 128 | 25600 | 23.038 | 22.22 | 31.451 | 4.07 |
+| 512 | 128 | 26112 | 23.601 | 21.69 | 31.606 | 4.05 |
+| 512 | 128 | 26624 | 23.744 | 21.56 | 31.454 | 4.07 |
+| 512 | 128 | 27136 | 24.403 | 20.98 | 32.176 | 3.98 |
+| 512 | 128 | 27648 | 24.954 | 20.52 | 31.961 | 4.00 |
+| 512 | 128 | 28160 | 25.142 | 20.36 | 32.050 | 3.99 |
+| 512 | 128 | 28672 | 25.774 | 19.87 | 32.425 | 3.95 |
+| 512 | 128 | 29184 | 25.847 | 19.81 | 33.104 | 3.87 |
+| 512 | 128 | 29696 | 26.218 | 19.53 | 32.757 | 3.91 |
+| 512 | 128 | 30208 | 26.704 | 19.17 | 33.055 | 3.87 |
+| 512 | 128 | 30720 | 27.111 | 18.89 | 27.009 | 4.74 |
+| 512 | 128 | 31232 | 26.987 | 18.97 | 33.298 | 3.84 |
+| 512 | 128 | 31744 | 26.712 | 19.17 | 33.334 | 3.84 |
+| 512 | 128 | 32256 | 28.083 | 18.23 | 33.414 | 3.83 |
+
+`============ Repacked 611 tensors`
+
+```bash
+$ grep Huge /proc/meminfo
+AnonHugePages: 406736896 kB
+ShmemHugePages: 0 kB
+FileHugePages: 0 kB
+HugePages_Total: 0
+HugePages_Free: 0
+HugePages_Rsvd: 0
+HugePages_Surp: 0
+Hugepagesize: 2048 kB
+Hugetlb: 0 kB
+
+$ numastat -m -p $(pidof llama-sweep-bench)
+Per-node process memory usage (in MBs) for PID 659855 (llama-sweep-ben)
+ Node 0 Node 1 Total
+ --------------- --------------- ---------------
+Huge 0.00 0.00 0.00
+Heap 2.80 34.14 36.94
+Stack 0.04 0.05 0.08
+Private 13999.99 383083.54 397083.52
+---------------- --------------- --------------- ---------------
+Total 14002.82 383117.72 397120.54
+
+Per-node system memory usage (in MBs):
+ Node 0 Node 1 Total
+ --------------- --------------- ---------------
+MemTotal 771710.76 773987.20 1545697.96
+MemFree 743559.40 1745.54 745304.94
+MemUsed 28151.36 772241.67 800393.03
+SwapCached 0.21 0.69 0.90
+Active 14157.56 383159.96 397317.52
+Inactive 8662.71 383016.18 391678.89
+Active(anon) 14076.79 383139.31 397216.09
+Inactive(anon) 3.26 22.98 26.25
+Active(file) 80.78 20.65 101.43
+Inactive(file) 8659.45 382993.20 391652.64
+Unevictable 29.86 5.50 35.36
+Mlocked 21.07 5.50 26.57
+Dirty 20.00 0.05 20.05
+Writeback 0.00 0.00 0.00
+FilePages 8755.46 383025.92 391781.38
+Mapped 82.61 63.21 145.82
+AnonPages 14097.36 383158.36 397255.73
+Shmem 11.92 5.88 17.80
+KernelStack 39.69 38.11 77.80
+PageTables 6.78 775.85 782.62
+SecPageTables 0.00 0.00 0.00
+NFS_Unstable 0.00 0.00 0.00
+Bounce 0.00 0.00 0.00
+WritebackTmp 0.00 0.00 0.00
+Slab 2489.91 2737.77 5227.68
+SReclaimable 402.44 1022.84 1425.27
+SUnreclaim 2087.47 1714.93 3802.40
+AnonHugePages 14010.00 383100.00 397110.00
+ShmemHugePages 0.00 0.00 0.00
+ShmemPmdMapped 0.00 0.00 0.00
+FileHugePages 0.00 0.00 0.00
+FilePmdMapped 0.00 0.00 0.00
+HugePages_Total 0.00 0.00 0.00
+HugePages_Free 0.00 0.00 0.00
+HugePages_Surp 0.00 0.00 0.00
+KReclaimable 402.44 1022.84 1425.27
+```
+
+
---
-## pp 128 threads, tg 128 threads, amb 1024
-numactl -N 0 -m 0 \
-./build/bin/llama-sweep-bench \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ3_K_R4.gguf \
- --no-mmap \
- -ctk q8_0 \
- -mla 3 -fa \
- -amb 1024 \
- -fmoe \
- -c 32768 \
- -ub 512 \
- --threads 128 \
- --threads-batch 128 \
- --numa numactl
+👤 **saood06** commented on **2025-04-05** at **02:58:44**
-llm_load_tensors: CPU buffer size = 331786.93 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 32768
-llama_new_context_with_model: n_batch = 2048
-llama_new_context_with_model: n_ubatch = 512
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 3
-llama_new_context_with_model: attn_max_b = 1024
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
+@ubergarm
-llama_kv_cache_init: CPU KV buffer size = 1166.63 MiB
-llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
-llama_new_context_with_model: CPU output buffer size = 0.49 MiB
-llama_new_context_with_model: CPU compute buffer size = 2662.01 MiB
-llama_new_context_with_model: graph nodes = 5500
-llama_new_context_with_model: graph splits = 1
+You can use the script included to plot them together with the legend using the filenames.
-main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 128, n_threads_batch = 128
+I did it using your raw data.
-| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
-|-------|--------|--------|----------|----------|----------|----------|
-| 512 | 128 | 0 | 3.779 | 135.47 | 13.193 | 9.70 |
-| 512 | 128 | 512 | 4.045 | 126.57 | 13.382 | 9.56 |
-| 512 | 128 | 1024 | 4.369 | 117.19 | 13.530 | 9.46 |
-| 512 | 128 | 1536 | 4.770 | 107.33 | 13.700 | 9.34 |
-| 512 | 128 | 2048 | 5.170 | 99.04 | 13.834 | 9.25 |
-| 512 | 128 | 2560 | 5.480 | 93.42 | 13.874 | 9.23 |
-| 512 | 128 | 3072 | 5.845 | 87.59 | 14.029 | 9.12 |
-| 512 | 128 | 3584 | 6.176 | 82.90 | 14.164 | 9.04 |
-| 512 | 128 | 4096 | 6.658 | 76.90 | 14.341 | 8.93 |
-| 512 | 128 | 4608 | 6.973 | 73.42 | 14.519 | 8.82 |
-| 512 | 128 | 5120 | 7.357 | 69.59 | 14.709 | 8.70 |
-| 512 | 128 | 5632 | 7.727 | 66.26 | 14.921 | 8.58 |
-| 512 | 128 | 6144 | 8.305 | 61.65 | 15.091 | 8.48 |
-| 512 | 128 | 6656 | 8.449 | 60.60 | 15.324 | 8.35 |
-| 512 | 128 | 7168 | 9.073 | 56.43 | 15.551 | 8.23 |
-| 512 | 128 | 7680 | 9.224 | 55.51 | 15.783 | 8.11 |
-| 512 | 128 | 8192 | 9.140 | 56.02 | 16.039 | 7.98 |
-| 512 | 128 | 8704 | 9.140 | 56.02 | 16.306 | 7.85 |
-| 512 | 128 | 9216 | 9.465 | 54.09 | 16.553 | 7.73 |
-| 512 | 128 | 9728 | 10.000 | 51.20 | 16.827 | 7.61 |
-| 512 | 128 | 10240 | 10.120 | 50.59 | 17.263 | 7.41 |
-| 512 | 128 | 10752 | 10.410 | 49.18 | 17.336 | 7.38 |
-| 512 | 128 | 11264 | 11.062 | 46.29 | 17.599 | 7.27 |
-| 512 | 128 | 11776 | 11.012 | 46.49 | 17.861 | 7.17 |
-| 512 | 128 | 12288 | 11.309 | 45.27 | 18.129 | 7.06 |
-| 512 | 128 | 12800 | 11.971 | 42.77 | 18.366 | 6.97 |
-| 512 | 128 | 13312 | 12.554 | 40.79 | 18.661 | 6.86 |
-| 512 | 128 | 13824 | 12.917 | 39.64 | 18.894 | 6.77 |
-| 512 | 128 | 14336 | 12.615 | 40.59 | 19.122 | 6.69 |
-| 512 | 128 | 14848 | 13.540 | 37.81 | 19.439 | 6.58 |
-| 512 | 128 | 15360 | 13.878 | 36.89 | 19.695 | 6.50 |
-| 512 | 128 | 15872 | 14.107 | 36.30 | 20.001 | 6.40 |
-| 512 | 128 | 16384 | 13.998 | 36.58 | 20.294 | 6.31 |
-| 512 | 128 | 16896 | 14.100 | 36.31 | 20.600 | 6.21 |
-| 512 | 128 | 17408 | 14.413 | 35.52 | 21.126 | 6.06 |
-| 512 | 128 | 17920 | 14.795 | 34.61 | 21.591 | 5.93 |
-| 512 | 128 | 18432 | 15.112 | 33.88 | 22.046 | 5.81 |
-| 512 | 128 | 18944 | 16.007 | 31.99 | 22.389 | 5.72 |
-| 512 | 128 | 19456 | 16.391 | 31.24 | 22.861 | 5.60 |
-| 512 | 128 | 19968 | 16.073 | 31.85 | 23.214 | 5.51 |
-| 512 | 128 | 20480 | 16.437 | 31.15 | 23.621 | 5.42 |
-| 512 | 128 | 20992 | 16.814 | 30.45 | 24.032 | 5.33 |
-| 512 | 128 | 21504 | 17.145 | 29.86 | 24.297 | 5.27 |
-| 512 | 128 | 22016 | 18.069 | 28.34 | 24.443 | 5.24 |
-| 512 | 128 | 22528 | 17.998 | 28.45 | 24.715 | 5.18 |
-| 512 | 128 | 23040 | 18.518 | 27.65 | 25.119 | 5.10 |
-| 512 | 128 | 23552 | 18.645 | 27.46 | 25.608 | 5.00 |
-| 512 | 128 | 24064 | 19.016 | 26.93 | 26.009 | 4.92 |
-| 512 | 128 | 24576 | 19.271 | 26.57 | 26.465 | 4.84 |
-| 512 | 128 | 25088 | 19.655 | 26.05 | 26.904 | 4.76 |
-| 512 | 128 | 25600 | 19.987 | 25.62 | 27.073 | 4.73 |
-| 512 | 128 | 26112 | 20.322 | 25.19 | 27.443 | 4.66 |
-| 512 | 128 | 26624 | 20.694 | 24.74 | 27.875 | 4.59 |
-| 512 | 128 | 27136 | 20.961 | 24.43 | 28.282 | 4.53 |
-| 512 | 128 | 27648 | 21.311 | 24.02 | 28.494 | 4.49 |
-| 512 | 128 | 28160 | 21.620 | 23.68 | 28.750 | 4.45 |
-| 512 | 128 | 28672 | 22.491 | 22.76 | 28.979 | 4.42 |
-| 512 | 128 | 29184 | 22.813 | 22.44 | 29.399 | 4.35 |
-| 512 | 128 | 29696 | 22.584 | 22.67 | 29.749 | 4.30 |
-| 512 | 128 | 30208 | 22.926 | 22.33 | 30.058 | 4.26 |
-| 512 | 128 | 30720 | 23.372 | 21.91 | 30.385 | 4.21 |
-| 512 | 128 | 31232 | 23.479 | 21.81 | 30.789 | 4.16 |
-| 512 | 128 | 31744 | 23.455 | 21.83 | 31.089 | 4.12 |
-| 512 | 128 | 32256 | 24.589 | 20.82 | 31.422 | 4.07 |
+TG:
+
----
+PP:
-## pp 128 threads, tg 128 threads, amb 1536
+
-numactl -N 0 -m 0 \
-./build/bin/llama-sweep-bench \
- --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-CPU-IQ3_K_R4.gguf \
- --no-mmap \
- -ctk q8_0 \
- -mla 3 -fa \
- -amb 1536 \
- -fmoe \
- -c 32768 \
- -ub 512 \
- --threads 88 \
- --threads-batch 128 \
- --numa numactl
+>Oh nice, is that with llama-batched-bench ?
-llm_load_tensors: CPU buffer size = 331786.93 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 32768
-llama_new_context_with_model: n_batch = 2048
-llama_new_context_with_model: n_ubatch = 512
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 3
-llama_new_context_with_model: attn_max_b = 1536
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
+It is but I just used a script to graph it. Raw results below, the result for B=1, sweep bench result was used.
-llama_kv_cache_init: CPU KV buffer size = 1166.63 MiB
-llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
-llama_new_context_with_model: CPU output buffer size = 0.49 MiB
-llama_new_context_with_model: CPU compute buffer size = 2662.01 MiB
-llama_new_context_with_model: graph nodes = 5500
-llama_new_context_with_model: graph splits = 1
+| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
+|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
+| 0 | 128 | 2 | 256 | 0.961 | 0.00 | 42.118 | 6.08 | 43.079 | 5.94 |
+| 0 | 128 | 3 | 384 | 0.963 | 0.00 | 46.332 | 8.29 | 47.295 | 8.12 |
+| 0 | 128 | 4 | 512 | 0.971 | 0.00 | 54.238 | 9.44 | 55.209 | 9.27 |
+| 0 | 128 | 5 | 640 | 1.114 | 0.00 | 58.274 | 10.98 | 59.387 | 10.78 |
+| 0 | 128 | 6 | 768 | 0.960 | 0.00 | 64.813 | 11.85 | 65.773 | 11.68 |
+| 0 | 128 | 7 | 896 | 0.959 | 0.00 | 82.076 | 10.92 | 83.035 | 10.79 |
+| 0 | 128 | 8 | 1024 | 0.961 | 0.00 | 88.326 | 11.59 | 89.287 | 11.47 |
+| 0 | 128 | 9 | 1152 | 0.963 | 0.00 | 105.301 | 10.94 | 106.264 | 10.84 |
+| 0 | 128 | 10 | 1280 | 0.960 | 0.00 | 103.148 | 12.41 | 104.108 | 12.29 |
+| 0 | 128 | 11 | 1408 | 0.960 | 0.00 | 118.788 | 11.85 | 119.748 | 11.76 |
+| 0 | 128 | 12 | 1536 | 0.962 | 0.00 | 118.974 | 12.91 | 119.936 | 12.81 |
+| 0 | 128 | 13 | 1664 | 0.965 | 0.00 | 141.875 | 11.73 | 142.840 | 11.65 |
+| 0 | 128 | 14 | 1792 | 0.972 | 0.00 | 150.249 | 11.93 | 151.221 | 11.85 |
+| 0 | 128 | 15 | 1920 | 0.962 | 0.00 | 158.899 | 12.08 | 159.861 | 12.01 |
+| 0 | 128 | 16 | 2048 | 0.965 | 0.00 | 197.818 | 10.35 | 198.783 | 10.30 |
-main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 88, n_threads_batch = 128
-| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
-|-------|--------|--------|----------|----------|----------|----------|
-| 512 | 128 | 0 | 4.455 | 114.93 | 12.232 | 10.46 |
-| 512 | 128 | 512 | 4.597 | 111.38 | 12.618 | 10.14 |
-| 512 | 128 | 1024 | 4.789 | 106.91 | 12.856 | 9.96 |
-| 512 | 128 | 1536 | 5.212 | 98.24 | 12.819 | 9.99 |
-| 512 | 128 | 2048 | 5.514 | 92.85 | 13.029 | 9.82 |
-| 512 | 128 | 2560 | 5.848 | 87.56 | 14.833 | 8.63 |
-| 512 | 128 | 3072 | 6.283 | 81.49 | 13.322 | 9.61 |
-| 512 | 128 | 3584 | 6.673 | 76.73 | 13.870 | 9.23 |
-| 512 | 128 | 4096 | 7.769 | 65.90 | 14.078 | 9.09 |
-| 512 | 128 | 4608 | 8.379 | 61.11 | 14.311 | 8.94 |
-| 512 | 128 | 5120 | 7.530 | 67.99 | 14.187 | 9.02 |
-| 512 | 128 | 5632 | 8.165 | 62.70 | 14.485 | 8.84 |
-| 512 | 128 | 6144 | 8.587 | 59.63 | 14.747 | 8.68 |
-| 512 | 128 | 6656 | 9.117 | 56.16 | 15.042 | 8.51 |
-| 512 | 128 | 7168 | 9.610 | 53.28 | 15.254 | 8.39 |
-| 512 | 128 | 7680 | 9.586 | 53.41 | 15.127 | 8.46 |
-| 512 | 128 | 8192 | 9.961 | 51.40 | 15.912 | 8.04 |
-| 512 | 128 | 8704 | 10.993 | 46.58 | 15.844 | 8.08 |
-| 512 | 128 | 9216 | 10.423 | 49.12 | 16.107 | 7.95 |
-| 512 | 128 | 9728 | 10.673 | 47.97 | 16.464 | 7.77 |
-| 512 | 128 | 10240 | 11.141 | 45.96 | 16.899 | 7.57 |
-| 512 | 128 | 10752 | 11.421 | 44.83 | 16.458 | 7.78 |
-| 512 | 128 | 11264 | 14.421 | 35.50 | 17.190 | 7.45 |
-| 512 | 128 | 11776 | 12.696 | 40.33 | 17.436 | 7.34 |
-| 512 | 128 | 12288 | 12.079 | 42.39 | 17.327 | 7.39 |
-| 512 | 128 | 12800 | 12.304 | 41.61 | 17.591 | 7.28 |
-| 512 | 128 | 13312 | 13.400 | 38.21 | 17.857 | 7.17 |
-| 512 | 128 | 13824 | 12.764 | 40.11 | 17.791 | 7.19 |
-| 512 | 128 | 14336 | 13.515 | 37.88 | 18.744 | 6.83 |
-| 512 | 128 | 14848 | 13.556 | 37.77 | 18.888 | 6.78 |
-| 512 | 128 | 15360 | 13.925 | 36.77 | 19.552 | 6.55 |
-| 512 | 128 | 15872 | 14.119 | 36.26 | 20.393 | 6.28 |
-| 512 | 128 | 16384 | 14.246 | 35.94 | 20.078 | 6.38 |
-| 512 | 128 | 16896 | 14.739 | 34.74 | 20.428 | 6.27 |
-| 512 | 128 | 17408 | 15.744 | 32.52 | 21.013 | 6.09 |
-| 512 | 128 | 17920 | 15.983 | 32.03 | 21.100 | 6.07 |
-| 512 | 128 | 18432 | 16.247 | 31.51 | 21.502 | 5.95 |
-| 512 | 128 | 18944 | 16.554 | 30.93 | 21.797 | 5.87 |
-| 512 | 128 | 19456 | 16.923 | 30.25 | 18.987 | 6.74 |
-| 512 | 128 | 19968 | 17.313 | 29.57 | 22.714 | 5.64 |
-| 512 | 128 | 20480 | 17.972 | 28.49 | 22.245 | 5.75 |
-| 512 | 128 | 20992 | 17.986 | 28.47 | 22.409 | 5.71 |
-| 512 | 128 | 21504 | 18.304 | 27.97 | 23.061 | 5.55 |
-| 512 | 128 | 22016 | 19.044 | 26.88 | 23.934 | 5.35 |
-| 512 | 128 | 22528 | 19.563 | 26.17 | 23.447 | 5.46 |
-| 512 | 128 | 23040 | 20.054 | 25.53 | 23.932 | 5.35 |
-| 512 | 128 | 23552 | 20.210 | 25.33 | 24.398 | 5.25 |
-| 512 | 128 | 24064 | 21.129 | 24.23 | 25.225 | 5.07 |
-| 512 | 128 | 24576 | 19.675 | 26.02 | 25.531 | 5.01 |
-| 512 | 128 | 25088 | 20.162 | 25.39 | 19.989 | 6.40 |
-| 512 | 128 | 25600 | 20.685 | 24.75 | 25.551 | 5.01 |
-| 512 | 128 | 26112 | 20.721 | 24.71 | 26.588 | 4.81 |
-| 512 | 128 | 26624 | 20.997 | 24.38 | 27.079 | 4.73 |
-| 512 | 128 | 27136 | 21.587 | 23.72 | 27.030 | 4.74 |
-| 512 | 128 | 27648 | 22.148 | 23.12 | 27.153 | 4.71 |
-| 512 | 128 | 28160 | 22.081 | 23.19 | 27.515 | 4.65 |
-| 512 | 128 | 28672 | 22.620 | 22.64 | 27.332 | 4.68 |
-| 512 | 128 | 29184 | 22.811 | 22.45 | 27.864 | 4.59 |
-| 512 | 128 | 29696 | 22.791 | 22.47 | 28.755 | 4.45 |
-| 512 | 128 | 30208 | 23.195 | 22.07 | 28.234 | 4.53 |
-| 512 | 128 | 30720 | 23.924 | 21.40 | 21.459 | 5.96 |
-| 512 | 128 | 31232 | 23.809 | 21.50 | 29.165 | 4.39 |
-| 512 | 128 | 31744 | 23.712 | 21.59 | 29.106 | 4.40 |
-| 512 | 128 | 32256 | 24.421 | 20.97 | 29.634 | 4.32 |
-```
+@ikawrakow
-
+> The fairydreaming tests use a GPU for attention, the slower drop in performance is expected in that setup. But for pure CPU inference I'm expecting around 2.5X lower performance at 32k tokens.
+
+My own results show ~3.5X lower PP performance at just 16k tokens.
+
+---
+
+👤 **ikawrakow** commented on **2025-04-05** at **06:07:18**
+
+I'm almost sure the TG peaks are due to number of threads. If you try with 128 TG threads, performance will be slightly lower at zero context, but for large contexts it should match the peaks for all context lengths.
---
-👤 **ubergarm** commented the **2025-04-05** at **15:58:02**:
+👤 **ubergarm** commented on **2025-04-05** at **15:58:02**
Okay, got my "CPU only speed blend" quant cooked, copied over, perplexity, and a few sweep-bench comparisons against itself with different threads and amb settings.
@@ -6013,6 +3624,7 @@ I left `--threads-batch` constant at 128 using single socket of Intel Xeon 6980P
I'm not sure what to try next. I could:
* play with `numactl --interleave=all llama-sweep-bench --numa distribute` and pump up threads to 256 (each CPU has 128 physical cores).
* try varying `--threads` to other multiples of 8 e.g. 64,72,80, ,96 to see if it effects the tg bump
+* explore perplexity/speed trade-off using smaller quant vs `-ser 6,1`
That's all for now. Below are just the swee-bench logs for reference. Thanks!
@@ -6366,7 +3978,7 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
---
-👤 **ikawrakow** commented the **2025-04-06** at **07:58:05**:
+👤 **ikawrakow** commented on **2025-04-06** at **07:58:05**
@ubergarm
@@ -6374,6 +3986,6 @@ Thank you for the testing. I have no working hypothesis at this point what is ca
> I'm not sure what to try next
-I added PR #315. It disables K-cache repacking. That has a non-negligible impact on performance for large contexts. Here is a graph that compares your TG results to 3 different runs with DeepSeek-Lite. I have scaled with TG performance at zero context length so we can have them on the same graph. The red symbols are with PR #315. The blue and magenta symbols are with the main branch (one uses `-rtr`, the other uses the offline repacked version of the same model). Important to note that the K-cache repacking is done only for PP, and yet this additional memory allocation does affect TG performance! The effect for DeepSeek-R1/V3 may be bigger as the K-cache is larger. I did have runs where the TG performance drop happened earlier, and they ended with a lower performance at 32k tokens (but I didn't keep the logs for those).
+I added PR [#315](https://github.com/ikawrakow/ik_llama.cpp/issues/315). It disables K-cache repacking. That has a non-negligible impact on performance for large contexts. Here is a graph that compares your TG results to 3 different runs with DeepSeek-Lite. I have scaled with TG performance at zero context length so we can have them on the same graph. The red symbols are with PR [#315](https://github.com/ikawrakow/ik_llama.cpp/issues/315). The blue and magenta symbols are with the main branch (one uses `-rtr`, the other uses the offline repacked version of the same model). Important to note that the K-cache repacking is done only for PP, and yet this additional memory allocation does affect TG performance! The effect for DeepSeek-R1/V3 may be bigger as the K-cache is larger. I did have runs where the TG performance drop happened earlier, and they ended with a lower performance at 32k tokens (but I didn't keep the logs for those).

\ No newline at end of file
diff --git a/github-data/issues/297 - Update gguf-py scripts to support new quant types..md b/github-data/issues/297 - Update gguf-py scripts to support new quant types.md
similarity index 83%
rename from github-data/issues/297 - Update gguf-py scripts to support new quant types..md
rename to github-data/issues/297 - Update gguf-py scripts to support new quant types.md
index bb82e27b8..3593085e2 100644
--- a/github-data/issues/297 - Update gguf-py scripts to support new quant types..md
+++ b/github-data/issues/297 - Update gguf-py scripts to support new quant types.md
@@ -1,4 +1,4 @@
-### 📝 [#297](https://github.com/ikawrakow/ik_llama.cpp/issues/297) - Update gguf-py scripts to support new quant types.
+## 📌 [Issue #297](https://github.com/ikawrakow/ik_llama.cpp/issues/297) - Update gguf-py scripts to support new quant types.
| **Author** | `ubergarm` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
This is more of a convenience and lower priority. I wanted to print out some info with `gguf_dump.py` but looks like possibly just need to add latest quant enum constants into `GGMLQuantizationType` etc...
@@ -45,16 +45,16 @@ Thanks!
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-04-24** at **05:55:57**:
+👤 **saood06** commented on **2025-04-24** at **05:55:57**
@ubergarm
-#298 is now merged in which addressed it.
+[#298](https://github.com/ikawrakow/ik_llama.cpp/issues/298) is now merged in which addressed it.
---
-👤 **ubergarm** commented the **2025-04-24** at **14:35:23**:
+👤 **ubergarm** commented on **2025-04-24** at **14:35:23**
Sweet! Appreciate the update and confirming gguf dump works now with your `V3-0324-IQ4_K_R4` quant!
\ No newline at end of file
diff --git a/github-data/issues/30 - Bug_ Appcrash on Windows 7 with GGML_USE_IQK_MULMAT.md b/github-data/issues/30 - Bug Appcrash on Windows 7 with GGML_USE_IQK_MULMAT.md
similarity index 84%
rename from github-data/issues/30 - Bug_ Appcrash on Windows 7 with GGML_USE_IQK_MULMAT.md
rename to github-data/issues/30 - Bug Appcrash on Windows 7 with GGML_USE_IQK_MULMAT.md
index b6e933b52..da6348d04 100644
--- a/github-data/issues/30 - Bug_ Appcrash on Windows 7 with GGML_USE_IQK_MULMAT.md
+++ b/github-data/issues/30 - Bug Appcrash on Windows 7 with GGML_USE_IQK_MULMAT.md
@@ -1,14 +1,15 @@
-### 🐛 [#30](https://github.com/ikawrakow/ik_llama.cpp/issues/30) - Bug: Appcrash on Windows 7 with GGML_USE_IQK_MULMAT
+## 📌 [Issue #30](https://github.com/ikawrakow/ik_llama.cpp/issues/30) - Bug: Appcrash on Windows 7 with GGML_USE_IQK_MULMAT
| **Author** | `whoreson` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-08-30 |
| **Updated** | 2024-09-19 |
+| **Labels** | `bug`, `wontfix` |
---
-#### Description
+## 📄 Description
### What happened?
@@ -90,9 +91,9 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **whoreson** commented the **2024-08-30** at **20:30:11**:
+👤 **whoreson** commented on **2024-08-30** at **20:30:11**
Q4_1 crash backtrace:
```
@@ -137,7 +138,7 @@ Seems to be different perhaps?.. Still, works with stock llama.cpp.
---
-👤 **ikawrakow** commented the **2024-08-31** at **05:59:09**:
+👤 **ikawrakow** commented on **2024-08-31** at **05:59:09**
Can you post your `system_info` message when these crashes happen? It should look something like this
```
@@ -148,7 +149,7 @@ Thanks!
---
-👤 **whoreson** commented the **2024-08-31** at **08:22:16**:
+👤 **whoreson** commented on **2024-08-31** at **08:22:16**
```
INFO [ main] system info | tid="1" timestamp=1725092503 n_thr
@@ -160,7 +161,7 @@ A = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD =
---
-👤 **ikawrakow** commented the **2024-08-31** at **10:50:07**:
+👤 **ikawrakow** commented on **2024-08-31** at **10:50:07**
I was suspecting something I might have missed between `AVX2` and `AVX`, but no, you have `AVX2`.
@@ -170,7 +171,7 @@ With the second crash you posted a bt (the one during quantization), what are th
---
-👤 **whoreson** commented the **2024-08-31** at **11:56:33**:
+👤 **whoreson** commented on **2024-08-31** at **11:56:33**
Hmm no, all of these are results of llama-cli, not quantize.
@@ -189,13 +190,13 @@ $4 = 0
---
-👤 **ikawrakow** commented the **2024-08-31** at **12:22:17**:
+👤 **ikawrakow** commented on **2024-08-31** at **12:22:17**
Then `y4` must be `null`?
---
-👤 **whoreson** commented the **2024-08-31** at **14:33:20**:
+👤 **whoreson** commented on **2024-08-31** at **14:33:20**
```
(gdb) p y4
@@ -204,7 +205,7 @@ $5 = (block_q8_1_x4 * restrict) 0x3870ca0
---
-👤 **ikawrakow** commented the **2024-08-31** at **15:57:55**:
+👤 **ikawrakow** commented on **2024-08-31** at **15:57:55**
So
* `y4` is not null
@@ -217,25 +218,25 @@ So
---
-👤 **whoreson** commented the **2024-08-31** at **19:21:42**:
+👤 **whoreson** commented on **2024-08-31** at **19:21:42**
Ehm, looks like it's not gonna be that easy... Just tried with TDM-GCC's gcc version 10.3.0 (tdm64-1), and the results are the same.
---
-👤 **whoreson** commented the **2024-08-31** at **19:29:10**:
+👤 **whoreson** commented on **2024-08-31** at **19:29:10**
Hmm... Could it be related that I've been disabling the -muse-unaligned-vector-move assembler flag? I don't have a recent enough binutils for it, and llama.cpp's been working so far...
---
-👤 **whoreson** commented the **2024-08-31** at **19:46:57**:
+👤 **whoreson** commented on **2024-08-31** at **19:46:57**
Alas, no... Same crash with latest mingw's gcc 14.1 and binutils 2.42.
---
-👤 **ikawrakow** commented the **2024-09-01** at **09:34:15**:
+👤 **ikawrakow** commented on **2024-09-01** at **09:34:15**
If you tried 3 different compiler versions and the crash persists, then it is more likely that it is a bug in the code that somehow only shows up on Windows (any Windows or just Windows 7?).
@@ -243,7 +244,7 @@ I see [here](https://github.com/google/sanitizers/wiki/AddressSanitizerWindowsPo
---
-👤 **whoreson** commented the **2024-09-01** at **19:57:45**:
+👤 **whoreson** commented on **2024-09-01** at **19:57:45**
Okay "good news", I've compiled it with the same TDM-GCC on a Windows 11 box (with -mno-avx512f, because it's a much newer CPU), and it crashes there too.
@@ -251,13 +252,13 @@ It works when compiled with the default AVX512 setting.
---
-👤 **ikawrakow** commented the **2024-09-02** at **08:54:50**:
+👤 **ikawrakow** commented on **2024-09-02** at **08:54:50**
Do you find it important to disable AVX512?
---
-👤 **whoreson** commented the **2024-09-02** at **16:31:29**:
+👤 **whoreson** commented on **2024-09-02** at **16:31:29**
Well since the Windows 7 PC in question is only AVX2, I kinda absolutely have to, in order to maintain the comparison...
@@ -265,25 +266,25 @@ So it'd seem to me that there's some AVX2 bug going on on all Windows OSes? I'll
---
-👤 **whoreson** commented the **2024-09-02** at **16:38:57**:
+👤 **whoreson** commented on **2024-09-02** at **16:38:57**
I can set up an rdesktop access if that's at all helpful.
---
-👤 **ikawrakow** commented the **2024-09-02** at **17:31:21**:
+👤 **ikawrakow** commented on **2024-09-02** at **17:31:21**
`-march=native` does not work? This enables the features your CPU supports. If you are setting this manually, you need `FMA` and `F16C` in addition to `AVX2`
---
-👤 **whoreson** commented the **2024-09-03** at **18:21:16**:
+👤 **whoreson** commented on **2024-09-03** at **18:21:16**
Err, I think you misunderstood. I'm using the default flags as usual. In order to test the AVX2 code on the PC which has Windows 11 (to check if it's a 7 vs 11 issue), I had to disable AVX512 on that box - naturally.
---
-👤 **whoreson** commented the **2024-09-14** at **17:00:21**:
+👤 **whoreson** commented on **2024-09-14** at **17:00:21**
> I can set up an rdesktop access if that's at all helpful.
@@ -291,30 +292,30 @@ Sooo... no?
---
-👤 **ikawrakow** commented the **2024-09-15** at **06:25:32**:
+👤 **ikawrakow** commented on **2024-09-15** at **06:25:32**
We can try, but I'm not very hopeful as I haven't touched a Windows computer for 10+ years. What is the Linux rdesktop client one uses these days? I'm on Ubuntu 22.04.
---
-👤 **whoreson** commented the **2024-09-15** at **08:41:29**:
+👤 **whoreson** commented on **2024-09-15** at **08:41:29**
Well, it's called just that, "rdesktop". It works fine. I'll set it up then. Err, can github do private messages? If not, I have Telegram.
---
-👤 **ikawrakow** commented the **2024-09-15** at **10:01:30**:
+👤 **ikawrakow** commented on **2024-09-15** at **10:01:30**
As far as I can tell the private message feature has been removed from Githib. I don't have Telegram. I made my email address public. If you fetch the latest main branch the last commit will have my email.
---
-👤 **whoreson** commented the **2024-09-15** at **11:45:28**:
+👤 **whoreson** commented on **2024-09-15** at **11:45:28**
Cool, just sent you an e-mail (from s*.t*@gmail).
---
-👤 **ikawrakow** commented the **2024-09-19** at **08:49:48**:
+👤 **ikawrakow** commented on **2024-09-19** at **08:49:48**
So, I used the provided `rdesktop` access to try to debug - without success. Supporting exotic systems (and yes, a Windows 7 box in the year 2024 is an exotic system on my book) is not one of the goals here - you are much better served with the mainline `llama.cpp` project.
\ No newline at end of file
diff --git a/github-data/issues/300 - Bug_ IQK_FA_ALL_QUANTS causes failure to compile.md b/github-data/issues/300 - Bug IQK_FA_ALL_QUANTS causes failure to compile.md
similarity index 77%
rename from github-data/issues/300 - Bug_ IQK_FA_ALL_QUANTS causes failure to compile.md
rename to github-data/issues/300 - Bug IQK_FA_ALL_QUANTS causes failure to compile.md
index 8b23bac60..f9d1c0421 100644
--- a/github-data/issues/300 - Bug_ IQK_FA_ALL_QUANTS causes failure to compile.md
+++ b/github-data/issues/300 - Bug IQK_FA_ALL_QUANTS causes failure to compile.md
@@ -1,4 +1,4 @@
-### 🐛 [#300](https://github.com/ikawrakow/ik_llama.cpp/issues/300) - Bug: IQK_FA_ALL_QUANTS causes failure to compile
+## 📌 [Issue #300](https://github.com/ikawrakow/ik_llama.cpp/issues/300) - Bug: IQK_FA_ALL_QUANTS causes failure to compile
| **Author** | `saood06` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -30,9 +30,9 @@ Clear Linux OS
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-03-31** at **11:53:55**:
+👤 **ikawrakow** commented on **2025-03-31** at **11:53:55**
Sorry I broke it again. I'll look into it in a moment.
diff --git a/github-data/issues/305 - Gibberish output when using DeepSeek-V3-0324-IQ2_K_R4 on mixed CPU _ 4 .md b/github-data/issues/305 - Gibberish output when using DeepSeek-V3-0324-IQ2_K_R4 on mixed CPU 4 GPUs with -.md
similarity index 82%
rename from github-data/issues/305 - Gibberish output when using DeepSeek-V3-0324-IQ2_K_R4 on mixed CPU _ 4 .md
rename to github-data/issues/305 - Gibberish output when using DeepSeek-V3-0324-IQ2_K_R4 on mixed CPU 4 GPUs with -.md
index fb74e3e62..2a5db4447 100644
--- a/github-data/issues/305 - Gibberish output when using DeepSeek-V3-0324-IQ2_K_R4 on mixed CPU _ 4 .md
+++ b/github-data/issues/305 - Gibberish output when using DeepSeek-V3-0324-IQ2_K_R4 on mixed CPU 4 GPUs with -.md
@@ -1,4 +1,4 @@
-### 📝 [#305](https://github.com/ikawrakow/ik_llama.cpp/issues/305) - Gibberish output when using DeepSeek-V3-0324-IQ2_K_R4 on mixed CPU + 4 GPUs with -mla (1 or 2)
+## 📌 [Issue #305](https://github.com/ikawrakow/ik_llama.cpp/issues/305) - Gibberish output when using DeepSeek-V3-0324-IQ2_K_R4 on mixed CPU + 4 GPUs with -mla (1 or 2)
| **Author** | `Panchovix` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
HI there, thanks for your work!
@@ -301,9 +301,9 @@ EDIT: To note that other models have the same issue (like the mentioned above),
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-04-01** at **23:49:47**:
+👤 **saood06** commented on **2025-04-01** at **23:49:47**
I'm not sure why your getting bad output, but you might want to look into https://github.com/ikawrakow/ik_llama.cpp/pull/232 instead of just setting `-ngl` this is more tested and offers much higher performance.
@@ -311,7 +311,7 @@ More info about using it here: https://github.com/ikawrakow/ik_llama.cpp/discuss
---
-👤 **Panchovix** commented the **2025-04-02** at **00:02:11**:
+👤 **Panchovix** commented on **2025-04-02** at **00:02:11**
@saood06 Thanks for the suggestion! I did see the post but not sure how to exactly use it, because it seems to use it on a single GPU for all the layers, but on my case I'm using 27 layers of 61 and multiGPU, not sure how to adapt it.
@@ -321,17 +321,7 @@ I will try to rebuild with `cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUA
---
-👤 **Panchovix** commented the **2025-04-02** at **00:02:11**:
-
-@saood06 Thanks for the suggestion! I did see the post but not sure how to exactly use it, because it seems to use it on a single GPU, it is a bit easier, but not sure how to adapt it to multiGPU.
-
-I also did try with -mla 2 and -fa but same issue.
-
-I will try to rebuild with `cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_IQK_FA_ALL_QUANTS=1 -DGGML_BLAS=OFF` to see if it helps.
-
----
-
-👤 **saood06** commented the **2025-04-02** at **00:14:35**:
+👤 **saood06** commented on **2025-04-02** at **00:14:35**
> [@saood06](https://github.com/saood06) Thanks for the suggestion! I did see the post but not sure how to exactly use it, because it seems to use it on a single GPU for all the layers, but on my case I'm using 27 layers of 61 and multiGPU, not sure how to adapt it.
@@ -341,7 +331,7 @@ See https://github.com/ikawrakow/ik_llama.cpp/discussions/242#discussioncomment-
---
-👤 **Panchovix** commented the **2025-04-02** at **00:30:55**:
+👤 **Panchovix** commented on **2025-04-02** at **00:30:55**
@saood06
@@ -351,7 +341,7 @@ Now, still no luck with `cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS
---
-👤 **saood06** commented the **2025-04-02** at **00:34:10**:
+👤 **saood06** commented on **2025-04-02** at **00:34:10**
> [@saood06](https://github.com/saood06)
>
@@ -362,7 +352,7 @@ Let me know if you have any more questions, my Deepseek machine doesn't have a G
---
-👤 **ikawrakow** commented the **2025-04-02** at **05:21:58**:
+👤 **ikawrakow** commented on **2025-04-02** at **05:21:58**
This model is not ideal for your multi-GPU setup. The row-interleaved quants (`X_R4, X_R8`) are best for CPU-only inference. They do not have CUDA matrix multiplication implementation, so all matrix multiplications involving them will run on the CPU, so yes, it will be slower (and your GPU's will be acting as very expensive RAM modules for your CPU).
@@ -374,7 +364,7 @@ All supported models working with mainline `llama.cpp` are supposed to work also
---
-👤 **saood06** commented the **2025-04-02** at **05:30:56**:
+👤 **saood06** commented on **2025-04-02** at **05:30:56**
> This model is not ideal for your multi-GPU setup. The row-interleaved quants (`X_R4, X_R8`) are best for CPU-only inference. They do not have CUDA matrix multiplication implementation, so all matrix multiplications involving them will run on the CPU, so yes, it will be slower (and your GPU's will be acting as very expensive RAM modules for your CPU).
>
@@ -385,18 +375,7 @@ So if he uses -ot then he will be able to offload all those to GPU(s), leaving j
---
-👤 **saood06** commented the **2025-04-02** at **05:30:56**:
-
-> This model is not ideal for your multi-GPU setup. The row-interleaved quants (`X_R4, X_R8`) are best for CPU-only inference. They do not have CUDA matrix multiplication implementation, so all matrix multiplications involving them will run on the CPU, so yes, it will be slower (and your GPU's will be acting as very expensive RAM modules for your CPU).
->
-
-It has `llama_model_loader: - type q8_0: 612 tensors`, this is ubergarm's mix where those are on the tensors that are better suited for GPU.
-
-So if he uses -ot then he will be able to offload all those to GPU(s).
-
----
-
-👤 **ikawrakow** commented the **2025-04-02** at **05:36:41**:
+👤 **ikawrakow** commented on **2025-04-02** at **05:36:41**
> So if he uses -ot then he will be able to offload all those to GPU(s), leaving just the row-interleaved quants to the CPU
@@ -404,7 +383,7 @@ Yes, that's true. But that way they will be using a small fraction of the 120 GB
---
-👤 **saood06** commented the **2025-04-02** at **05:53:21**:
+👤 **saood06** commented on **2025-04-02** at **05:53:21**
> Yes, that's true. But that way they will be using a small fraction of the 120 GB VRAM available.
@@ -412,7 +391,7 @@ In the linked discussion the commenter was never able to get more than one GPU t
---
-👤 **ikawrakow** commented the **2025-04-02** at **05:59:32**:
+👤 **ikawrakow** commented on **2025-04-02** at **05:59:32**
If you have been using UD_Q2_K_XL, try running it with this fork the same way you have in mainline, but add
```
@@ -429,39 +408,23 @@ will have all `ffn_down_exps` tensors and the `ffn_up/gate_exps` for layers 40-e
---
-👤 **ikawrakow** commented the **2025-04-02** at **06:06:28**:
+👤 **ikawrakow** commented on **2025-04-02** at **06:06:28**
> In the linked discussion the commenter was never able to get more than one GPU to be active, has that been fixed?
-I remember #242, but I don't have multiple GPUs to understand why the issue occurs. Apart from this, @davidsyoung has been using it with 16 x 3090, and I do not recall him reporting that only one GPU is being used.
+I remember [#242](https://github.com/ikawrakow/ik_llama.cpp/issues/242), but I don't have multiple GPUs to understand why the issue occurs. Apart from this, @davidsyoung has been using it with 16 x 3090, and I do not recall him reporting that only one GPU is being used.
---
-👤 **ikawrakow** commented the **2025-04-02** at **06:06:28**:
+👤 **saood06** commented on **2025-04-02** at **06:14:41**
-> In the linked discussion the commenter was never able to get more than one GPU to be active, has that been fixed?
-
-I remember #242, but I don't have multiple GPUs to understand why the issue occurs. Apart from this, @davidsyoung has bee using it with 16 x 3090, and I do not recall him reporting that only one GPU is being used.
-
----
-
-👤 **saood06** commented the **2025-04-02** at **06:14:41**:
-
-> I remember [#242](https://github.com/ikawrakow/ik_llama.cpp/discussions/242), but I don't have multiple GPUs to understand why the issue occurs. Apart from this, [@davidsyoung](https://github.com/davidsyoung) has been using it with 16 x 3090, and I do not recall him reporting that only one GPU is being used.
+> I remember [[#242](https://github.com/ikawrakow/ik_llama.cpp/issues/242)](https://github.com/ikawrakow/ik_llama.cpp/discussions/242), but I don't have multiple GPUs to understand why the issue occurs. Apart from this, [@davidsyoung](https://github.com/davidsyoung) has been using it with 16 x 3090, and I do not recall him reporting that only one GPU is being used.
Yes but maybe it is different if it offloaded fully to CUDA, because ThomasBaruzier's who had the issue his comments are at a time when davidsyoung was using ik_llama.cpp. Maybe @Panchovix you can tell us if all GPU's are being used when putting tensors on all of them with -ot.
---
-👤 **saood06** commented the **2025-04-02** at **06:14:41**:
-
-> I remember [#242](https://github.com/ikawrakow/ik_llama.cpp/discussions/242), but I don't have multiple GPUs to understand why the issue occurs. Apart from this, [@davidsyoung](https://github.com/davidsyoung) has been using it with 16 x 3090, and I do not recall him reporting that only one GPU is being used.
-
-Yes but maybe it is different if it offloaded fully to CUDA, because ThomasBaruzier's who had the issue his comments are at a time when davidsyoung was using ik_llama.cpp.
-
----
-
-👤 **davidsyoung** commented the **2025-04-02** at **06:43:15**:
+👤 **davidsyoung** commented on **2025-04-02** at **06:43:15**
Hey just wanted to jump in as tagged above.
@@ -481,27 +444,7 @@ Process of elimination!
---
-👤 **davidsyoung** commented the **2025-04-02** at **06:43:15**:
-
-Hey just wanted to jump in as tagged above.
-
-I never had an issue personally while using with all GPUs being used, but it’s going to be dependent on how GPUs are being balanced across GPUs.
-
-I didn’t have a mixed workflow of CPU/GPU offload like this, but if I was debugging I would go the route of what @ikawrakow is suggesting.
-
-I would also likely just to start, use a less exotic quantisation to rule that out. As you’re doing a mixed offload of GPU/CPU, I would use a standard Q4 quant.
-
-Then from there, I would use -ot commands like suggested above.
-
-Lower down the list of possibilities could be the -mla option you’re using, as it’s possible that combination of mixed offload, quant format, and those commands possibly haven’t been tested too heavily.
-
-It may also just simply be the model with Q2 quant.
-
-Process of elimination!
-
----
-
-👤 **Panchovix** commented the **2025-04-02** at **11:40:17**:
+👤 **Panchovix** commented on **2025-04-02** at **11:40:17**
Hi there guys, just woke up and saw all the new information, many thanks! I will try the suggestions when I come home after work (in about 11 hours).
@@ -512,7 +455,7 @@ From my understanding -ot may result in better performance but not address the g
---
-👤 **Panchovix** commented the **2025-04-02** at **16:35:44**:
+👤 **Panchovix** commented on **2025-04-02** at **16:35:44**
I did try a little via RDP (on Windows though, as I haven't managed to get a RDP client working unattended on Linux)
@@ -528,7 +471,7 @@ Probably I really don't know how to set up the -ot values and/or what does rtr w
---
-👤 **ubergarm** commented the **2025-04-02** at **22:07:04**:
+👤 **ubergarm** commented on **2025-04-02** at **22:07:04**
> I will try later on Linux to see how it behaves.
@@ -546,7 +489,7 @@ Its possible I saw you over on level1techs forum too, feel free to reach out to
---
-👤 **Panchovix** commented the **2025-04-02** at **22:56:11**:
+👤 **Panchovix** commented on **2025-04-02** at **22:56:11**
@ubergarm Thanks! But I think the model won't fit on 192GB RAM + 48GB RAM? Correct me if I'm wrong though. I will checkout the guide!
@@ -558,7 +501,7 @@ I think I went some time ago on level1techs, but never went much anymore because
---
-👤 **ubergarm** commented the **2025-04-03** at **00:42:00**:
+👤 **ubergarm** commented on **2025-04-03** at **00:42:00**
> 192GB RAM + 48GB RAM
@@ -570,14 +513,14 @@ Oops nope, there is a different person over there asking about using multiple GP
---
-👤 **whatever1983** commented the **2025-04-15** at **19:43:49**:
+👤 **whatever1983** commented on **2025-04-15** at **19:43:49**
@ubergarm and @ikawrakow
embedding needs to be iq3_k to emulate IQ2_M for way better coding performance. ikawrakow, can you make that into the IQ2_K_M, IQ2_K_M_R4 standard?
---
-👤 **ubergarm** commented the **2025-04-15** at **20:33:00**:
+👤 **ubergarm** commented on **2025-04-15** at **20:33:00**
@whatever1983
@@ -593,21 +536,7 @@ Sorry I'm confused, if you have a specific reference to the exact quant in quest
---
-👤 **ubergarm** commented the **2025-04-15** at **20:33:00**:
-
-> embedding needs to be iq3_k to emulate IQ2_M for way better coding performance.
-
-Hey bud, which `embedding` are you talking about? If you check the model card side-bar on hf for the [DeepSeek-V3-0324-IQ2_K_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) (about which I assume you are referring?), the `token_embd.weight` is `q8_0`?
-
-> can you make that into the IQ2_K_M, IQ2_K_M_R4 standard?
-
-This fork allows the user to cook up whatever combinations they want with `llama-quantize --quantize-q` ... (and my recipe is shown on the hf model card too). I'm not sure where you're talking about `IQ2_K_M` or `IQ2_K_M_R4` those are not quants with which I'm familiar. You can see the [quants available listed in the `quantize` code here](https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples/quantize/quantize.cpp#L26).
-
-Sorry I'm confused, if you have a specific reference to the exact quant in question I'll be back in office later this week. Cheers!
-
----
-
-👤 **Panchovix** commented the **2025-04-24** at **05:24:37**:
+👤 **Panchovix** commented on **2025-04-24** at **05:24:37**
HI there! Closing as MLA was recently merged into main llamacpp, and it seems to work with CUDA as for now, with newer quants (https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD)
@@ -617,7 +546,7 @@ EDIT: Re-opening as no luck for now either on main llamacpp
---
-👤 **ubergarm** commented the **2025-04-26** at **19:13:31**:
+👤 **ubergarm** commented on **2025-04-26** at **19:13:31**
@Panchovix
@@ -633,7 +562,7 @@ I have not tried that new "Unsloth Dynamic v2.0" quant with MLA, and am not sure
---
-👤 **Panchovix** commented the **2025-04-26** at **20:47:26**:
+👤 **Panchovix** commented on **2025-04-26** at **20:47:26**
Hi there @ubergarm, I did try IQ2_K_R4, but with multiple GPUs. The issue is that with just one GPU, I tried but the model didn't fit with RAM + VRAM (In theory it should but it gave me OOM anywayas).
@@ -641,7 +570,7 @@ As mentioned there, on llamacpp the error seems a bit different, outputing gibbe
---
-👤 **ubergarm** commented the **2025-04-27** at **01:38:32**:
+👤 **ubergarm** commented on **2025-04-27** at **01:38:32**
@Panchovix
@@ -689,53 +618,7 @@ Let me know what errors you get if any trying it this way. If you are still OOMi
---
-👤 **ubergarm** commented the **2025-04-27** at **01:38:32**:
-
-@Panchovix
-
-Give this a try:
-```
-# Install build dependencies and cuda toolkit as needed
-git clone https://github.com/ikawrakow/ik_llama.cpp
-cd ik_llama.cpp
-
-# Configure CUDA+CPU Backend
-cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF
-# Build
-cmake --build ./build --config Release -j $(nproc)
-
-# Confirm
-./build/bin/llama-server --version
-version: 3640 (xxxxxxxx)
-built with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
-
-# API Server using single GPU running out of mmap() only needs >~64GB RAM
-CUDA_VISIBLE_DEVICES="0" \
-./build/bin/llama-server \
- --model ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
- --alias ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4 \
- --ctx-size 16384 \
- -ctk f16 \
- -mla 2 -fa \
- -amb 512 \
- -fmoe \
- --temp 0.3 \
- --min-p 0.05 \
- --n-gpu-layers 63 \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 16 \
- --host 127.0.0.1 \
- --port 8080
-```
-
-You can also try the various unsloth quants (though I've not tested their new MLA quant on mainline `llama.cpp` nor this `ik_llama.cpp` fork... You just can't use `-rtr` as that would disable `mmap` and likely OOM you.
-
-Let me know what errors you get if any trying it this way. If you are still OOMing what is the output of `sudo dmesg -T | grep -i oom` or similar... Thanks!
-
----
-
-👤 **Panchovix** commented the **2025-04-28** at **19:48:30**:
+👤 **Panchovix** commented on **2025-04-28** at **19:48:30**
Sorry for the delay, haven't tested yet as I was trying with normal llamacpp to see how it behaves.
@@ -745,7 +628,7 @@ How can I know the layers, the experts, the size of the experts and such to try
---
-👤 **Panchovix** commented the **2025-04-29** at **04:34:52**:
+👤 **Panchovix** commented on **2025-04-29** at **04:34:52**
Just an small update, found https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/discussions/2#680fad80e3c723c4b1f20c63, then I tested https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/discussions/2#681047075bb07c42d7e44256
@@ -761,7 +644,7 @@ So it is maybe resolved? But the issue seems to come when MLA is mixed with acti
---
-👤 **ubergarm** commented the **2025-04-29** at **16:14:26**:
+👤 **ubergarm** commented on **2025-04-29** at **16:14:26**
> How can I know the layers, the experts, the size of the experts and such to try to use -ot?
@@ -773,7 +656,7 @@ The longer answer is that this is the output you get from `./gguf-py/scripts/ggu
---
-👤 **ubergarm** commented the **2025-04-29** at **16:15:40**:
+👤 **ubergarm** commented on **2025-04-29** at **16:15:40**
> Same seems to happen here with IQ2_K_R4.
@@ -781,13 +664,13 @@ Don't run any `_R4` quants on GPU. Those are repacked for CPU use.
---
-👤 **Panchovix** commented the **2025-04-29** at **16:31:00**:
+👤 **Panchovix** commented on **2025-04-29** at **16:31:00**
Noted, many thanks for all the help! Closing the issue.
---
-👤 **ubergarm** commented the **2025-04-29** at **19:01:45**:
+👤 **ubergarm** commented on **2025-04-29** at **19:01:45**
> Noted, many thanks for all the help! Closing the issue.
@@ -797,7 +680,7 @@ Keeps us posted on your progress and benchmarks as you progress in your journey!
---
-👤 **Panchovix** commented the **2025-04-29** at **19:18:05**:
+👤 **Panchovix** commented on **2025-04-29** at **19:18:05**
Thanks! Yeah, I have 2 motherboards, a X670E Aorus Master and a X670 MSI Carbon, but using the latter now as it lets me use 4x48GB at 6000Mhz.
diff --git a/github-data/issues/306 - Confused by the -mla flag. What_s supported_.md b/github-data/issues/306 - Confused by the -mla flag. Whats supported.md
similarity index 71%
rename from github-data/issues/306 - Confused by the -mla flag. What_s supported_.md
rename to github-data/issues/306 - Confused by the -mla flag. Whats supported.md
index 287c5435f..8a7e33589 100644
--- a/github-data/issues/306 - Confused by the -mla flag. What_s supported_.md
+++ b/github-data/issues/306 - Confused by the -mla flag. Whats supported.md
@@ -1,4 +1,4 @@
-### 📝 [#306](https://github.com/ikawrakow/ik_llama.cpp/issues/306) - Confused by the -mla flag. What's supported?
+## 📌 [Issue #306](https://github.com/ikawrakow/ik_llama.cpp/issues/306) - Confused by the -mla flag. What's supported?
| **Author** | `Downtown-Case` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
Trying to load Deepseek 32B (specifically an IQ4_KS_RQ quantization I just made) with the -mla 2 (or -mla any value) flag gives me a segfault.
@@ -28,9 +28,9 @@ Is that only supported by full Deepseek MoE, not the Qwen distills?
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-02** at **14:55:01**:
+👤 **ikawrakow** commented on **2025-04-02** at **14:55:01**
As far as I know, the distilled models use a standard attention mechanism (same as the underlying model used to prepare the distillation, i.e., Qwen, LLaMA-3, etc.). At least [this one](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF) does.
@@ -38,7 +38,7 @@ I guess, I should add checks to only allow MLA when we have a model using MLA.
---
-👤 **Downtown-Case** commented the **2025-04-02** at **14:59:41**:
+👤 **Downtown-Case** commented on **2025-04-02** at **14:59:41**
Interesting, thanks. I'm playing catch up here, and did find the MLA paper.
@@ -47,21 +47,12 @@ What major models *do* support MLA? Just the MoE deepseek releases? Adapted fine
---
-👤 **Downtown-Case** commented the **2025-04-02** at **14:59:41**:
-
-Interesting, thanks. I'm playing catch up here, and did find the MLA paper.
-
-
-What major models *do* support MLA? Just the MoE deepseek releases?
-
----
-
-👤 **ikawrakow** commented the **2025-04-02** at **15:02:38**:
+👤 **ikawrakow** commented on **2025-04-02** at **15:02:38**
As far as I know, DeepSeek-V2/V3/R1/Lite are the models that use MLA.
---
-👤 **Downtown-Case** commented the **2025-04-02** at **15:17:53**:
+👤 **Downtown-Case** commented on **2025-04-02** at **15:17:53**
Thanks! And I appreciate you posting this repo.
\ No newline at end of file
diff --git a/github-data/issues/308 - Bug_ Compiling for arm64_ error_ cannot convert _const uint32x4_t_ to _.md b/github-data/issues/308 - Bug Compiling for arm64 error cannot convert const uint32x4_t to uint8x16_t and .md
similarity index 89%
rename from github-data/issues/308 - Bug_ Compiling for arm64_ error_ cannot convert _const uint32x4_t_ to _.md
rename to github-data/issues/308 - Bug Compiling for arm64 error cannot convert const uint32x4_t to uint8x16_t and .md
index f7ff4925e..2b916010a 100644
--- a/github-data/issues/308 - Bug_ Compiling for arm64_ error_ cannot convert _const uint32x4_t_ to _.md
+++ b/github-data/issues/308 - Bug Compiling for arm64 error cannot convert const uint32x4_t to uint8x16_t and .md
@@ -1,4 +1,4 @@
-### 🐛 [#308](https://github.com/ikawrakow/ik_llama.cpp/issues/308) - Bug: Compiling for arm64, error: cannot convert ‘const uint32x4_t’ to ‘uint8x16_t’ and similar errors
+## 📌 [Issue #308](https://github.com/ikawrakow/ik_llama.cpp/issues/308) - Bug: Compiling for arm64, error: cannot convert ‘const uint32x4_t’ to ‘uint8x16_t’ and similar errors
| **Author** | `smpurkis` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -46,9 +46,9 @@ Linux instance-20240214-1712 6.8.0-1018-oracle #19~22.04.1-Ubuntu SMP Mon Dec 9
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-03** at **08:29:58**:
+👤 **ikawrakow** commented on **2025-04-03** at **08:29:58**
I'm not sure I want to fix those (I perceive them as useless noise from a compiler trying too hard to protect me). Can you try adding
```
@@ -58,7 +58,7 @@ to the compilation options?
---
-👤 **smpurkis** commented the **2025-04-03** at **08:41:16**:
+👤 **smpurkis** commented on **2025-04-03** at **08:41:16**
Still errors with int and float conversions, e.g.
```
@@ -70,25 +70,25 @@ I also tried adding `-fpermissive`, errors with the same
---
-👤 **smpurkis** commented the **2025-04-03** at **08:43:58**:
+👤 **smpurkis** commented on **2025-04-03** at **08:43:58**
Not sure if it makes any difference my `gcc` and `g++` versions are both `12.3.0`
---
-👤 **ikawrakow** commented the **2025-04-03** at **08:45:17**:
+👤 **ikawrakow** commented on **2025-04-03** at **08:45:17**
I'll try to fix those. Give me a few minutes.
---
-👤 **ikawrakow** commented the **2025-04-03** at **09:04:19**:
+👤 **ikawrakow** commented on **2025-04-03** at **09:04:19**
-Does #309 work?
+Does [#309](https://github.com/ikawrakow/ik_llama.cpp/issues/309) work?
---
-👤 **smpurkis** commented the **2025-04-03** at **11:04:07**:
+👤 **smpurkis** commented on **2025-04-03** at **11:04:07**
Unfortunately not, it fails on only a few things now though
```
@@ -150,13 +150,13 @@ make: *** [Makefile:1083: ggml/src/iqk/iqk_mul_mat.o] Error 1
---
-👤 **ikawrakow** commented the **2025-04-03** at **11:12:17**:
+👤 **ikawrakow** commented on **2025-04-03** at **11:12:17**
Thanks for testing. I have missed this one. The new version should compile now. The warnings are harmless.
---
-👤 **smpurkis** commented the **2025-04-03** at **12:07:50**:
+👤 **smpurkis** commented on **2025-04-03** at **12:07:50**
Not sure if this is an issue with just my setup
I'm getting
@@ -190,7 +190,7 @@ make: *** [Makefile:1376: llama-baby-llama] Error 1
---
-👤 **ikawrakow** commented the **2025-04-03** at **12:21:19**:
+👤 **ikawrakow** commented on **2025-04-03** at **12:21:19**
Is `baby-llama` something that you have modified yourself?
The link command lists all these object files, but normally it should just link against the `common` and `llama` libs:
@@ -206,7 +206,7 @@ Oh, you are using the Makefile? I think it only works with `cmake`. They have de
---
-👤 **smpurkis** commented the **2025-04-03** at **12:26:37**:
+👤 **smpurkis** commented on **2025-04-03** at **12:26:37**
Coolio, will cmake a go
```
@@ -216,7 +216,7 @@ I have made no modifications to any files.
---
-👤 **smpurkis** commented the **2025-04-03** at **12:29:40**:
+👤 **smpurkis** commented on **2025-04-03** at **12:29:40**
Hmm, getting other unresolved references with `cmake`
```
@@ -261,13 +261,13 @@ gmake: *** [Makefile:146: all] Error 2
---
-👤 **ikawrakow** commented the **2025-04-03** at **12:42:54**:
+👤 **ikawrakow** commented on **2025-04-03** at **12:42:54**
Can we take a look at the `compile_commands.json` in the `build` folder?
---
-👤 **smpurkis** commented the **2025-04-03** at **12:44:33**:
+👤 **smpurkis** commented on **2025-04-03** at **12:44:33**
Sure here is it
```[
@@ -857,7 +857,7 @@ Sure here is it
---
-👤 **ikawrakow** commented the **2025-04-03** at **12:55:13**:
+👤 **ikawrakow** commented on **2025-04-03** at **12:55:13**
Are you cross-compiling? The above is missing the native flag, which should be ON by default unless cross-compiling. Can you try adding `-DGGML_NATIVE=1` to the `cmake` command?
@@ -865,7 +865,7 @@ Also not sure about OpenMP on this system (it is better to use it on `x86_64` Li
---
-👤 **smpurkis** commented the **2025-04-03** at **13:09:04**:
+👤 **smpurkis** commented on **2025-04-03** at **13:09:04**
I'm using whatever the default settings are.
Adding `-DGGML_NATIVE=1`, running the following unfortunately still errors
@@ -989,13 +989,13 @@ gmake: *** [Makefile:146: all] Error 2
---
-👤 **smpurkis** commented the **2025-04-03** at **13:10:10**:
+👤 **smpurkis** commented on **2025-04-03** at **13:10:10**
Happy to close this issue if it is too much trouble. I believe this is a similar environment to an android phone running termux, I can try it on that as well.
---
-👤 **ikawrakow** commented the **2025-04-03** at **13:25:32**:
+👤 **ikawrakow** commented on **2025-04-03** at **13:25:32**
No, it would be useful to resolve it (if you have the time to test). I'm curious about performance on a Graviton CPU.
@@ -1006,7 +1006,7 @@ I don't know if the correct flag is `-march=native` or perhaps `-mcpu=native`, o
---
-👤 **smpurkis** commented the **2025-04-03** at **13:40:16**:
+👤 **smpurkis** commented on **2025-04-03** at **13:40:16**
Adding `-march=native` to `-DCMAKE_CXX_FLAGS` and `-DCMAKE_C_FLAGS` worked. In full
```
@@ -1015,19 +1015,19 @@ cmake -B build -DCMAKE_CXX_FLAGS="-fpermissive -flax-vector-conversions -march=n
---
-👤 **ikawrakow** commented the **2025-04-03** at **13:50:21**:
+👤 **ikawrakow** commented on **2025-04-03** at **13:50:21**
Great! Thank you for the patience. If you come around to test, I would be interested in the results.
---
-👤 **smpurkis** commented the **2025-04-03** at **13:58:00**:
+👤 **smpurkis** commented on **2025-04-03** at **13:58:00**
Happy to test/benchmark, is there a script to run similar benchmarks to in the readme?
---
-👤 **ikawrakow** commented the **2025-04-03** at **14:14:50**:
+👤 **ikawrakow** commented on **2025-04-03** at **14:14:50**
The benchmarks were done using `llama-bench`.
@@ -1053,7 +1053,7 @@ Let me know if you have more questions.
---
-👤 **smpurkis** commented the **2025-04-04** at **15:43:05**:
+👤 **smpurkis** commented on **2025-04-04** at **15:43:05**
Here is what I got running the bench script over a variety of qwen 2.5 3b quants from https://huggingface.co/bartowski
@@ -1167,120 +1167,7 @@ ik_llama.cpp is faster on all except q4_0 format.
---
-👤 **smpurkis** commented the **2025-04-04** at **15:43:05**:
-
-Here is what I got running the bench script over a variety of qwen 2.5 3b quants from https://huggingface.co/bartowski
-
-```
-llama.cpp, commit id 74d4f5b041ad837153b0e90fc864b8290e01d8d5
-| model | size | params | backend | threads | test | t/s |
-| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
-| qwen2 3B IQ3_S mix - 3.66 bpw | 1.38 GiB | 3.09 B | CPU | 1 | pp64 | 1.62 ± 0.00 |
-| qwen2 3B IQ3_S mix - 3.66 bpw | 1.38 GiB | 3.09 B | CPU | 1 | tg32 | 1.41 ± 0.00 |
-| qwen2 3B IQ3_S mix - 3.66 bpw | 1.38 GiB | 3.09 B | CPU | 2 | pp64 | 3.23 ± 0.01 |
-| qwen2 3B IQ3_S mix - 3.66 bpw | 1.38 GiB | 3.09 B | CPU | 2 | tg32 | 2.75 ± 0.00 |
-| qwen2 3B IQ3_S mix - 3.66 bpw | 1.38 GiB | 3.09 B | CPU | 3 | pp64 | 4.76 ± 0.01 |
-| qwen2 3B IQ3_S mix - 3.66 bpw | 1.38 GiB | 3.09 B | CPU | 3 | tg32 | 3.78 ± 0.28 |
-| qwen2 3B IQ4_XS - 4.25 bpw | 1.61 GiB | 3.09 B | CPU | 1 | pp64 | 5.90 ± 0.00 |
-| qwen2 3B IQ4_XS - 4.25 bpw | 1.61 GiB | 3.09 B | CPU | 1 | tg32 | 3.83 ± 0.01 |
-| qwen2 3B IQ4_XS - 4.25 bpw | 1.61 GiB | 3.09 B | CPU | 2 | pp64 | 11.65 ± 0.04 |
-| qwen2 3B IQ4_XS - 4.25 bpw | 1.61 GiB | 3.09 B | CPU | 2 | tg32 | 6.93 ± 0.05 |
-| qwen2 3B IQ4_XS - 4.25 bpw | 1.61 GiB | 3.09 B | CPU | 3 | pp64 | 17.01 ± 0.16 |
-| qwen2 3B IQ4_XS - 4.25 bpw | 1.61 GiB | 3.09 B | CPU | 3 | tg32 | 9.37 ± 0.41 |
-| qwen2 3B Q3_K - Large | 1.58 GiB | 3.09 B | CPU | 1 | pp64 | 3.46 ± 0.00 |
-| qwen2 3B Q3_K - Large | 1.58 GiB | 3.09 B | CPU | 1 | tg32 | 2.77 ± 0.01 |
-| qwen2 3B Q3_K - Large | 1.58 GiB | 3.09 B | CPU | 2 | pp64 | 6.89 ± 0.01 |
-| qwen2 3B Q3_K - Large | 1.58 GiB | 3.09 B | CPU | 2 | tg32 | 5.29 ± 0.01 |
-| qwen2 3B Q3_K - Large | 1.58 GiB | 3.09 B | CPU | 3 | pp64 | 9.82 ± 0.57 |
-| qwen2 3B Q3_K - Large | 1.58 GiB | 3.09 B | CPU | 3 | tg32 | 7.24 ± 0.31 |
-| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 1 | pp64 | 16.01 ± 0.02 |
-| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 1 | tg32 | 4.73 ± 0.04 |
-| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 2 | pp64 | 31.59 ± 0.16 |
-| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 2 | tg32 | 8.91 ± 0.15 |
-| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 3 | pp64 | 45.77 ± 0.56 |
-| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 3 | tg32 | 11.86 ± 0.88 |
-| qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CPU | 1 | pp64 | 5.03 ± 0.01 |
-| qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CPU | 1 | tg32 | 3.41 ± 0.01 |
-| qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CPU | 2 | pp64 | 9.95 ± 0.03 |
-| qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CPU | 2 | tg32 | 6.37 ± 0.04 |
-| qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CPU | 3 | pp64 | 14.68 ± 0.20 |
-| qwen2 3B Q4_K - Medium | 1.79 GiB | 3.09 B | CPU | 3 | tg32 | 9.06 ± 0.19 |
-| qwen2 3B Q5_K - Medium | 2.14 GiB | 3.09 B | CPU | 1 | pp64 | 3.44 ± 0.01 |
-| qwen2 3B Q5_K - Medium | 2.14 GiB | 3.09 B | CPU | 1 | tg32 | 2.67 ± 0.02 |
-| qwen2 3B Q5_K - Medium | 2.14 GiB | 3.09 B | CPU | 2 | pp64 | 6.87 ± 0.02 |
-| qwen2 3B Q5_K - Medium | 2.14 GiB | 3.09 B | CPU | 2 | tg32 | 5.06 ± 0.03 |
-| qwen2 3B Q5_K - Medium | 2.14 GiB | 3.09 B | CPU | 3 | pp64 | 10.09 ± 0.07 |
-| qwen2 3B Q5_K - Medium | 2.14 GiB | 3.09 B | CPU | 3 | tg32 | 7.10 ± 0.31 |
-| qwen2 3B Q6_K | 2.36 GiB | 3.09 B | CPU | 1 | pp64 | 2.90 ± 0.00 |
-| qwen2 3B Q6_K | 2.36 GiB | 3.09 B | CPU | 1 | tg32 | 2.23 ± 0.01 |
-| qwen2 3B Q6_K | 2.36 GiB | 3.09 B | CPU | 2 | pp64 | 5.75 ± 0.04 |
-| qwen2 3B Q6_K | 2.36 GiB | 3.09 B | CPU | 2 | tg32 | 4.20 ± 0.03 |
-| qwen2 3B Q6_K | 2.36 GiB | 3.09 B | CPU | 3 | pp64 | 8.46 ± 0.09 |
-| qwen2 3B Q6_K | 2.36 GiB | 3.09 B | CPU | 3 | tg32 | 5.83 ± 0.31 |
-| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CPU | 1 | pp64 | 6.37 ± 0.02 |
-| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CPU | 1 | tg32 | 2.78 ± 0.05 |
-| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CPU | 2 | pp64 | 12.60 ± 0.08 |
-| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CPU | 2 | tg32 | 5.00 ± 0.27 |
-| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CPU | 3 | pp64 | 17.58 ± 0.78 |
-| qwen2 3B Q8_0 | 3.05 GiB | 3.09 B | CPU | 3 | tg32 | 7.12 ± 0.10 |
-
-
-ik_llama.cpp, commit id 310bce3c1db882c2e057582c546a8bc3c04478e1
-| model | size | params | backend | threads | test | t/s |
-| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
-| qwen2 ?B IQ3_S mix - 3.66 bpw | 1.62 GiB | 3.40 B | CPU | 1 | pp64 | 6.13 ± 0.02 |
-| qwen2 ?B IQ3_S mix - 3.66 bpw | 1.62 GiB | 3.40 B | CPU | 1 | tg32 | 1.42 ± 0.00 |
-| qwen2 ?B IQ3_S mix - 3.66 bpw | 1.62 GiB | 3.40 B | CPU | 2 | pp64 | 12.14 ± 0.06 |
-| qwen2 ?B IQ3_S mix - 3.66 bpw | 1.62 GiB | 3.40 B | CPU | 2 | tg32 | 2.79 ± 0.01 |
-| qwen2 ?B IQ3_S mix - 3.66 bpw | 1.62 GiB | 3.40 B | CPU | 3 | pp64 | 17.73 ± 0.26 |
-| qwen2 ?B IQ3_S mix - 3.66 bpw | 1.62 GiB | 3.40 B | CPU | 3 | tg32 | 3.93 ± 0.10 |
-| qwen2 ?B IQ4_XS - 4.25 bpw | 1.85 GiB | 3.40 B | CPU | 1 | pp64 | 8.40 ± 0.04 |
-| qwen2 ?B IQ4_XS - 4.25 bpw | 1.85 GiB | 3.40 B | CPU | 1 | tg32 | 3.74 ± 0.01 |
-| qwen2 ?B IQ4_XS - 4.25 bpw | 1.85 GiB | 3.40 B | CPU | 2 | pp64 | 16.66 ± 0.03 |
-| qwen2 ?B IQ4_XS - 4.25 bpw | 1.85 GiB | 3.40 B | CPU | 2 | tg32 | 7.20 ± 0.10 |
-| qwen2 ?B IQ4_XS - 4.25 bpw | 1.85 GiB | 3.40 B | CPU | 3 | pp64 | 24.33 ± 0.15 |
-| qwen2 ?B IQ4_XS - 4.25 bpw | 1.85 GiB | 3.40 B | CPU | 3 | tg32 | 10.10 ± 0.35 |
-| qwen2 ?B Q3_K - Large | 1.82 GiB | 3.40 B | CPU | 1 | pp64 | 5.75 ± 0.02 |
-| qwen2 ?B Q3_K - Large | 1.82 GiB | 3.40 B | CPU | 1 | tg32 | 2.60 ± 0.01 |
-| qwen2 ?B Q3_K - Large | 1.82 GiB | 3.40 B | CPU | 2 | pp64 | 11.45 ± 0.07 |
-| qwen2 ?B Q3_K - Large | 1.82 GiB | 3.40 B | CPU | 2 | tg32 | 5.07 ± 0.02 |
-| qwen2 ?B Q3_K - Large | 1.82 GiB | 3.40 B | CPU | 3 | pp64 | 16.80 ± 0.19 |
-| qwen2 ?B Q3_K - Large | 1.82 GiB | 3.40 B | CPU | 3 | tg32 | 7.11 ± 0.30 |
-| qwen2 ?B Q4_0 | 1.94 GiB | 3.40 B | CPU | 1 | pp64 | 8.29 ± 0.02 |
-| qwen2 ?B Q4_0 | 1.94 GiB | 3.40 B | CPU | 1 | tg32 | 3.81 ± 0.03 |
-| qwen2 ?B Q4_0 | 1.94 GiB | 3.40 B | CPU | 2 | pp64 | 16.43 ± 0.13 |
-| qwen2 ?B Q4_0 | 1.94 GiB | 3.40 B | CPU | 2 | tg32 | 7.34 ± 0.07 |
-| qwen2 ?B Q4_0 | 1.94 GiB | 3.40 B | CPU | 3 | pp64 | 23.86 ± 0.37 |
-| qwen2 ?B Q4_0 | 1.94 GiB | 3.40 B | CPU | 3 | tg32 | 10.39 ± 0.37 |
-| qwen2 ?B Q4_K - Medium | 2.03 GiB | 3.40 B | CPU | 1 | pp64 | 7.55 ± 0.02 |
-| qwen2 ?B Q4_K - Medium | 2.03 GiB | 3.40 B | CPU | 1 | tg32 | 3.43 ± 0.01 |
-| qwen2 ?B Q4_K - Medium | 2.03 GiB | 3.40 B | CPU | 2 | pp64 | 15.56 ± 0.06 |
-| qwen2 ?B Q4_K - Medium | 2.03 GiB | 3.40 B | CPU | 2 | tg32 | 6.63 ± 0.06 |
-| qwen2 ?B Q4_K - Medium | 2.03 GiB | 3.40 B | CPU | 3 | pp64 | 22.73 ± 0.58 |
-| qwen2 ?B Q4_K - Medium | 2.03 GiB | 3.40 B | CPU | 3 | tg32 | 8.94 ± 0.56 |
-| qwen2 ?B Q5_K - Medium | 2.30 GiB | 3.40 B | CPU | 1 | pp64 | 7.09 ± 0.02 |
-| qwen2 ?B Q5_K - Medium | 2.30 GiB | 3.40 B | CPU | 1 | tg32 | 2.60 ± 0.01 |
-| qwen2 ?B Q5_K - Medium | 2.30 GiB | 3.40 B | CPU | 2 | pp64 | 13.99 ± 0.07 |
-| qwen2 ?B Q5_K - Medium | 2.30 GiB | 3.40 B | CPU | 2 | tg32 | 5.02 ± 0.04 |
-| qwen2 ?B Q5_K - Medium | 2.30 GiB | 3.40 B | CPU | 3 | pp64 | 20.50 ± 0.21 |
-| qwen2 ?B Q5_K - Medium | 2.30 GiB | 3.40 B | CPU | 3 | tg32 | 7.12 ± 0.21 |
-| qwen2 ?B Q6_K | 2.60 GiB | 3.40 B | CPU | 1 | pp64 | 5.35 ± 0.02 |
-| qwen2 ?B Q6_K | 2.60 GiB | 3.40 B | CPU | 1 | tg32 | 2.64 ± 0.01 |
-| qwen2 ?B Q6_K | 2.60 GiB | 3.40 B | CPU | 2 | pp64 | 10.61 ± 0.07 |
-| qwen2 ?B Q6_K | 2.60 GiB | 3.40 B | CPU | 2 | tg32 | 5.14 ± 0.03 |
-| qwen2 ?B Q6_K | 2.60 GiB | 3.40 B | CPU | 3 | pp64 | 15.33 ± 0.61 |
-| qwen2 ?B Q6_K | 2.60 GiB | 3.40 B | CPU | 3 | tg32 | 7.26 ± 0.16 |
-| qwen2 ?B Q8_0 | 3.36 GiB | 3.40 B | CPU | 1 | pp64 | 7.34 ± 0.13 |
-| qwen2 ?B Q8_0 | 3.36 GiB | 3.40 B | CPU | 1 | tg32 | 3.11 ± 0.02 |
-| qwen2 ?B Q8_0 | 3.36 GiB | 3.40 B | CPU | 2 | pp64 | 14.25 ± 0.51 |
-| qwen2 ?B Q8_0 | 3.36 GiB | 3.40 B | CPU | 2 | tg32 | 5.86 ± 0.08 |
-| qwen2 ?B Q8_0 | 3.36 GiB | 3.40 B | CPU | 3 | pp64 | 21.18 ± 0.39 |
-| qwen2 ?B Q8_0 | 3.36 GiB | 3.40 B | CPU | 3 | tg32 | 8.17 ± 0.31 |
-```
-
----
-
-👤 **ikawrakow** commented the **2025-04-04** at **15:49:16**:
+👤 **ikawrakow** commented on **2025-04-04** at **15:49:16**
Thank you for these.
@@ -1290,13 +1177,13 @@ To beat `llama.cpp` also for `Q4_0` quants, you need to use `-rtr 1`.
---
-👤 **smpurkis** commented the **2025-04-04** at **15:55:08**:
+👤 **smpurkis** commented on **2025-04-04** at **15:55:08**
Ah, my mistake, will try again with `-rtr 1`. It has 4 cores, but lags badly when using all 4, so generally use 3 as other services are running on the server.
---
-👤 **smpurkis** commented the **2025-04-04** at **16:04:17**:
+👤 **smpurkis** commented on **2025-04-04** at **16:04:17**
This is the results with `-rtr 1`, a bit slower than llama.cpp, about 30% slower on pp, same speed on tg though
```
@@ -1312,6 +1199,6 @@ This is the results with `-rtr 1`, a bit slower than llama.cpp, about 30% slower
---
-👤 **ikawrakow** commented the **2025-04-04** at **16:11:38**:
+👤 **ikawrakow** commented on **2025-04-04** at **16:11:38**
Interesting. On the M2-Max and any `x86_64` my `Q4_0` implementation beats mainline.
\ No newline at end of file
diff --git a/github-data/issues/314 - Llama 4 Support_.md b/github-data/issues/314 - Llama 4 Support.md
similarity index 72%
rename from github-data/issues/314 - Llama 4 Support_.md
rename to github-data/issues/314 - Llama 4 Support.md
index 2d56455ad..622b1c577 100644
--- a/github-data/issues/314 - Llama 4 Support_.md
+++ b/github-data/issues/314 - Llama 4 Support.md
@@ -1,4 +1,4 @@
-### 📝 [#314](https://github.com/ikawrakow/ik_llama.cpp/issues/314) - Llama 4 Support?
+## 📌 [Issue #314](https://github.com/ikawrakow/ik_llama.cpp/issues/314) - Llama 4 Support?
| **Author** | `Downtown-Case` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
https://huggingface.co/collections/meta-llama/llama-4-67f0c30d9fe03840bc9d0164
@@ -18,9 +18,9 @@ It's 10M context, so there must be some architectural difference from Llama 3.3
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-04-06** at **00:05:11**:
+👤 **saood06** commented on **2025-04-06** at **00:05:11**
>It's 10M context, so there must be some architectural difference from Llama 3.3
@@ -32,7 +32,7 @@ This shares a bit from Command-A:
---
-👤 **Downtown-Case** commented the **2025-04-06** at **02:15:26**:
+👤 **Downtown-Case** commented on **2025-04-06** at **02:15:26**
No MLA, which was my faint hope.
@@ -40,13 +40,7 @@ Some layers are dense though, so maybe this is a good offloading candidate.
---
-👤 **Downtown-Case** commented the **2025-04-06** at **02:15:26**:
-
-No MLA, which was my faint hope.
-
----
-
-👤 **saood06** commented the **2025-04-06** at **04:45:20**:
+👤 **saood06** commented on **2025-04-06** at **04:45:20**
> No MLA, which was my faint hope.
@@ -56,13 +50,13 @@ It would be interesting to see how much context the providers end up offering si
---
-👤 **ikawrakow** commented the **2025-04-08** at **08:04:36**:
+👤 **ikawrakow** commented on **2025-04-08** at **08:04:36**
I'll look into this in the next days. I did try downloading the Scout variant this morning using `huggingface-cli`, but it errored out. I'll try again later.
---
-👤 **Downtown-Case** commented the **2025-04-08** at **16:20:59**:
+👤 **Downtown-Case** commented on **2025-04-08** at **16:20:59**
@ikawrakow I have great success with this:
@@ -72,19 +66,19 @@ It hash checks every file, and will retry each one if it fails or times out.
---
-👤 **Downtown-Case** commented the **2025-04-08** at **16:23:04**:
+👤 **Downtown-Case** commented on **2025-04-08** at **16:23:04**
Oh, and Llama 4 seems to be quite bad at longer context, at least in my quick API tests.
---
-👤 **ikawrakow** commented the **2025-04-08** at **16:25:48**:
+👤 **ikawrakow** commented on **2025-04-08** at **16:25:48**
Bad as not producing good answers, or bad as being slow?
---
-👤 **saood06** commented the **2025-04-08** at **17:06:37**:
+👤 **saood06** commented on **2025-04-08** at **17:06:37**
> Oh, and Llama 4 seems to be quite bad at longer context, at least in my quick API tests.
@@ -92,7 +86,7 @@ Is it good at short contexts?
---
-👤 **Downtown-Case** commented the **2025-04-09** at **14:37:43**:
+👤 **Downtown-Case** commented on **2025-04-09** at **14:37:43**
> Bad as not producing good answers, or bad as being slow?
@@ -108,17 +102,7 @@ No idea, lol. Again I was testing over API, not llama.cpp.
---
-👤 **Downtown-Case** commented the **2025-04-09** at **14:37:43**:
-
-> Bad as not producing good answers, or bad as being slow?
-
-Bad at producing good answers.
-
-My long context tests are questions about long sets of papers or long stories (like novels) that need it to "understand" lots of whole context instead of pluck something out, like "judge these papers against each other," or "describe this character's arc to me," and its... not good. Even at like 70K, much less 1M context.
-
----
-
-👤 **saood06** commented the **2025-04-10** at **03:35:44**:
+👤 **saood06** commented on **2025-04-10** at **03:35:44**
> No idea, lol. Again I was testing over API, not llama.cpp.
diff --git a/github-data/issues/322 - Speculative decoding support.md b/github-data/issues/322 - Speculative decoding support.md
index ea18d6ea3..557646dcd 100644
--- a/github-data/issues/322 - Speculative decoding support.md
+++ b/github-data/issues/322 - Speculative decoding support.md
@@ -1,14 +1,15 @@
-### 📝 [#322](https://github.com/ikawrakow/ik_llama.cpp/issues/322) - Speculative decoding support
+## 📌 [Issue #322](https://github.com/ikawrakow/ik_llama.cpp/issues/322) - Speculative decoding support
| **Author** | `Lissanro` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-04-09 |
| **Updated** | 2025-06-03 |
+| **Labels** | `enhancement`, `help wanted` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -34,21 +35,21 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-09** at **12:29:17**:
+👤 **ikawrakow** commented on **2025-04-09** at **12:29:17**
I have never used or looked into speculative decoding, so it would be something new to learn and wrap my head around what needs to get done.
---
-👤 **orca-zhang** commented the **2025-04-09** at **14:29:57**:
+👤 **orca-zhang** commented on **2025-04-09** at **14:29:57**
That's great. I've tried to make a DRAFT model for speculative decoding but failed.
---
-👤 **saood06** commented the **2025-04-10** at **03:32:44**:
+👤 **saood06** commented on **2025-04-10** at **03:32:44**
> I have never used or looked into speculative decoding, so it would be something new to learn and wrap my head around what needs to get done.
@@ -58,23 +59,13 @@ It was something I was interested in syncing after updating the cache_prompt (an
---
-👤 **saood06** commented the **2025-04-10** at **03:32:44**:
-
-> I have never used or looked into speculative decoding, so it would be something new to learn and wrap my head around what needs to get done.
-
-The speculative example exists here in ik_llama.cpp, but there are a few functional commits that are missing (many commits are just refactorings or non functional tweaks), the speculative-simple and speculative support in server are missing.
-
-It was something I was interested in syncing after updating the cache_prompt (and maybe even adding some stuff to the API that front ends could benefit from for my usecases)
-
----
-
-👤 **orca-zhang** commented the **2025-04-10** at **15:33:41**:
+👤 **orca-zhang** commented on **2025-04-10** at **15:33:41**
I have tested it on the mainline, using UD-Q2_K_XL + DRAFT_0.5B_BF16 parameters `-ot=exp -ngl99 -ngld 99`. Although it is fast, the output quality is very poor, with almost no useful output. The draft model can run at 120 tokens/s, and the final tg can go from 9.35 -> 11.8 tokens/s, with a memory bandwidth of 608GB/s, 2S 6454s with a single 5080. Of course, it may also be a problem of parameter tuning.
---
-👤 **Lissanro** commented the **2025-04-10** at **16:29:34**:
+👤 **Lissanro** commented on **2025-04-10** at **16:29:34**
Speculative decoding should have zero impact on quality of output, since this is its the most important feature, to provide performance boost without affecting the quality. At worst, the draft model will not provide any speed up if it is very unlucky at predicting tokens of the main model.
@@ -82,29 +73,21 @@ If there is any impact on quality of the output from the main model while using
---
-👤 **Lissanro** commented the **2025-04-10** at **16:29:34**:
-
-Speculative decoding should have zero impact on quality of output, since this is the most important feature of the speculative decoding, to provide performance boost without affecting the quality. At worst, the draft model will not provide any speed up if it is very unlucky at predicting tokens of the main model.
-
-If there is any impact on quality of the output from the main model while using a draft model, it means there is a bug somewhere.
-
----
-
-👤 **ikawrakow** commented the **2025-04-10** at **18:19:24**:
+👤 **ikawrakow** commented on **2025-04-10** at **18:19:24**
Isn't this dependent on how it is implemented? If sampling is done without taking into account tokens predicted by the draft model, then sure, the draft model should not affect quality. But if someone was trying to be clever and somehow incorporate the draft tokens into the sampling (e.g., in order to increase acceptance rate), then it can lead to a disaster. I haven't checked how it is done in `llama.cpp`. But if @orca-zhang observes a much reduced quality of the generated output (I assume with otherwise identical parameters apart from using a draft model?), then either there is a bug, or it is not implemented correctly.
---
-👤 **saood06** commented the **2025-06-01** at **07:45:24**:
+👤 **saood06** commented on **2025-06-01** at **07:45:24**
Interestingly Eagle-2 seems like it may be coming to llama.cpp see https://github.com/ggml-org/llama.cpp/pull/13908. I'm keeping my eye on how easy it would be to add support here once there is a working PR in llama.cpp.
---
-👤 **ikawrakow** commented the **2025-06-01** at **09:04:08**:
+👤 **ikawrakow** commented on **2025-06-01** at **09:04:08**
-> Interestingly Eagle-2 seems like it may be coming to llama.cpp see [ggml-org/llama.cpp#13908](https://github.com/ggml-org/llama.cpp/pull/13908). I'm keeping my eye on how easy it would be to add support here once there is a working PR in llama.cpp.
+> Interestingly Eagle-2 seems like it may be coming to llama.cpp see [ggml-org/llama.cpp[#13908](https://github.com/ikawrakow/ik_llama.cpp/issues/13908)](https://github.com/ggml-org/llama.cpp/pull/13908). I'm keeping my eye on how easy it would be to add support here once there is a working PR in llama.cpp.
I know you are very interested in getting Eagle-2 here, but I don't find the results they report particularly impressive..
@@ -112,7 +95,7 @@ They have run benchmarks on an RTX-4080, which is the GPU I have. I also have Qw
---
-👤 **saood06** commented the **2025-06-01** at **09:58:50**:
+👤 **saood06** commented on **2025-06-01** at **09:58:50**
> I know you are very interested in getting Eagle-2 here, but I don't find the results they report particularly impressive..
>
@@ -128,21 +111,7 @@ Edit: See the comment below for a direct comparison, and an explanation for why
---
-👤 **saood06** commented the **2025-06-01** at **09:58:50**:
-
-> I know you are very interested in getting Eagle-2 here, but I don't find the results they report particularly impressive..
->
-> They have run benchmarks on an RTX-4080, which is the GPU I have. I also have Qwen2.5-7B-Instruct handy (is this the model they mean when they say "Qwen2-7B-Instruct"?). With that model in `bf16` (or `f16`) precision and no speculation I get 45 t/s on today's mainline and also with `ik_llama.cpp`. Which would mean a 10% speedup, and not the 35% they report for zero temperature. I guess they compare to mainline speculative implementation, but on my book that comparison is bogus. What they need to compare to is `Max(speculation, no speculation)`. This applies also to the "2.1" speedup, which in reality is just `53/45`, so 18%. If the "baseline" is just 37 t/s, it basically means that the draft model just consumes GPU cycles without resulting in any successful drafts with the current mainline speculative implementation.
-
-I didn't pay much attention to their performance results for a few reasons, first they haven't shared code yet, and hopefully aren't indicative of what the future PR allows for if used properly, and most importantly I have no idea why they are using such a large draft model, as that is far from optimal (even for the "naive" speculative implementation in llama.cpp and in here, I'm fairly certain the typical given advice is to use 10x smaller draft or even smaller for larger models [it is more complicated than that as picking the correct quant type matters]).
-
-For reference they tested with a 2.7GB draft model as stated in the PR, and looking at available Eagle-3 draft models it is 850 MB for [this](https://huggingface.co/yuhuili/EAGLE3-LLaMA3.1-Instruct-8B/tree/main) 8B model, 1.28 GB for [this](https://huggingface.co/yuhuili/EAGLE3-Vicuna1.3-13B/tree/main) 13B model, and 3.15 GB for [this](https://huggingface.co/yuhuili/EAGLE3-LLaMA3.3-Instruct-70B/tree/main) 70B model. Their draft model is closest in size to the 70B when when they were drafting for a 7B model.
-
-The official Eagle based implementations perform well see: https://github.com/hemingkx/Spec-Bench/blob/main/Leaderboard.md.
-
----
-
-👤 **pockers21** commented the **2025-06-03** at **08:21:04**:
+👤 **pockers21** commented on **2025-06-03** at **08:21:04**
> > I know you are very interested in getting Eagle-2 here, but I don't find the results they report particularly impressive..
> > They have run benchmarks on an RTX-4080, which is the GPU I have. I also have Qwen2.5-7B-Instruct handy (is this the model they mean when they say "Qwen2-7B-Instruct"?). With that model in `bf16` (or `f16`) precision and no speculation I get 45 t/s on today's mainline and also with `ik_llama.cpp`. Which would mean a 10% speedup, and not the 35% they report for zero temperature. I guess they compare to mainline speculative implementation, but on my book that comparison is bogus. What they need to compare to is `Max(speculation, no speculation)`. This applies also to the "2.1" speedup, which in reality is just `53/45`, so 18%. If the "baseline" is just 37 t/s, it basically means that the draft model just consumes GPU cycles without resulting in any successful drafts with the current mainline speculative implementation.
@@ -166,7 +135,7 @@ This increases the model size from 1.6GB to 2.7GB. The smaller models you mentio
---
-👤 **saood06** commented the **2025-06-03** at **09:00:43**:
+👤 **saood06** commented on **2025-06-03** at **09:00:43**
> https://huggingface.co/yuhuili/EAGLE-Qwen2-7B-Instruct
>
@@ -182,24 +151,4 @@ Like I said, I'm (patiently) waiting to see the Phase-2 and Phase-3 submissions
>The smaller models you mentioned are EAGLE-3 draft models, not the EAGLE-2 I'm working with here.
-I definitely should have clarified that when I linked the other weights for reference. It's been a while since I've looked into Eagle and I forgot that EAGLE and EAGLE-2 share weights, and they have removed this line from their README ("Compared to EAGLE, EAGLE-2 does not require additional training and uses the same weights.") which would have reminded me, so I decided to reference the newer weights, but the most relevant reference would have been the one you linked. Sorry, that is my mistake, and I have edited my original comment to hopefully prevent anyone from being misled.
-
----
-
-👤 **saood06** commented the **2025-06-03** at **09:00:43**:
-
-> https://huggingface.co/yuhuili/EAGLE-Qwen2-7B-Instruct
->
-> This is the EAGLE-2 Qwen2 7B draft model repository, with a model size of 1.6GB. However, this model doesn't include the lm_head output layer, because in the code implementation, this layer is passed as a parameter at
->
-> https://github.com/SafeAILab/EAGLE/blob/main/eagle/model/cnets1.py#L673C54-L673C58
->
-> Since llama.cpp is not as flexible as Python and needs to specify this layer in the computation graph, I need to append the lm_head layer from the original Qwen2 7B Instruct model to the end of the draft model before converting it to GGUF format. This increases the model size from 1.6GB to 2.7GB.
-
-I see, thank you for the info on why the size is different. I've run into situations where mergekit generated safetensors were larger than expected because they added the lm_head tensor and the llama.cpp conversion script would fail (and in those situations the easiest fix was to remove them from the safetensors).
-
-Like I said, I'm (patiently) waiting to see the Phase-2 and Phase-3 submissions before I form any opinions on implementation and performance, I only commented about the size difference I saw since the conversion code and generated files for it where shared.
-
->The smaller models you mentioned are EAGLE-3 draft models, not the EAGLE-2 I'm working with here.
-
-I definitely should have clarified that when I linked the other weights for reference. It's been a while since I've looked into Eagle and I forgot that EAGLE and EAGLE-2 share weights, and they have removed this line from their README ("Compared to EAGLE, EAGLE-2 does not require additional training and uses the same weights.") which would have reminded me, so I decided to reference the newer weights, but the most relevant reference would have been the one you linked. Sorry, that is my mistake.
\ No newline at end of file
+I definitely should have clarified that when I linked the other weights for reference. It's been a while since I've looked into Eagle and I forgot that EAGLE and EAGLE-2 share weights, and they have removed this line from their README ("Compared to EAGLE, EAGLE-2 does not require additional training and uses the same weights.") which would have reminded me, so I decided to reference the newer weights, but the most relevant reference would have been the one you linked. Sorry, that is my mistake, and I have edited my original comment to hopefully prevent anyone from being misled.
\ No newline at end of file
diff --git a/github-data/issues/335 - Bug_ Llama 4 generates garbage with longer context _64K_ the issue is n.md b/github-data/issues/335 - Bug Llama 4 generates garbage with longer context 64K the issue is not present i.md
similarity index 71%
rename from github-data/issues/335 - Bug_ Llama 4 generates garbage with longer context _64K_ the issue is n.md
rename to github-data/issues/335 - Bug Llama 4 generates garbage with longer context 64K the issue is not present i.md
index 8f9247a46..a59e27a24 100644
--- a/github-data/issues/335 - Bug_ Llama 4 generates garbage with longer context _64K_ the issue is n.md
+++ b/github-data/issues/335 - Bug Llama 4 generates garbage with longer context 64K the issue is not present i.md
@@ -1,4 +1,4 @@
-### 🐛 [#335](https://github.com/ikawrakow/ik_llama.cpp/issues/335) - Bug: Llama 4 generates garbage with longer context (64K+; the issue is not present in the llama.cpp)
+## 📌 [Issue #335](https://github.com/ikawrakow/ik_llama.cpp/issues/335) - Bug: Llama 4 generates garbage with longer context (64K+; the issue is not present in the llama.cpp)
| **Author** | `Lissanro` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -68,9 +68,9 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-20** at **05:54:22**:
+👤 **ikawrakow** commented on **2025-04-20** at **05:54:22**
What happens if you don't use the `-amb 1024` command line argument? You may need to reduce the max. context size without that. I'm trying to pinpoint the problem, and two things come to mind:
* I have a bug when computing attention in chunks. If so, removing `-amb 1024` will make it work correctly
@@ -78,19 +78,19 @@ What happens if you don't use the `-amb 1024` command line argument? You may nee
---
-👤 **Lissanro** commented the **2025-04-20** at **14:02:44**:
+👤 **Lissanro** commented on **2025-04-20** at **14:02:44**
Unfortunately removing `-amb 1024 `did not help, I still get very long bad reply like `0: "0000: 0:00: 0:00: //:0:00:00:` - I let it run for a while, then stopped it because otherwise it probably would have continued until running out of output token limit. Here is full log without `-amb 1024` option in case it is useful: https://pastebin.com/hE8kP3Sn
---
-👤 **ikawrakow** commented the **2025-04-20** at **14:44:19**:
+👤 **ikawrakow** commented on **2025-04-20** at **14:44:19**
OK, thanks. I'll take a closer look when I come back from a short break.
---
-👤 **Lissanro** commented the **2025-04-23** at **05:40:29**:
+👤 **Lissanro** commented on **2025-04-23** at **05:40:29**
Some additional information about reproducing the issue with a smaller Scout model and maybe help to narrow down possible causes:
@@ -106,13 +106,13 @@ I also tried ik_llama.cpp without "-fa -ctk q8_0 -ctv q8_0" but still got bad ou
---
-👤 **ikawrakow** commented the **2025-04-23** at **06:16:05**:
+👤 **ikawrakow** commented on **2025-04-23** at **06:16:05**
Thanks, this is useful. I think I can run Scout with 16k context, so this will make debugging easier.
---
-👤 **ikawrakow** commented the **2025-04-23** at **08:29:12**:
+👤 **ikawrakow** commented on **2025-04-23** at **08:29:12**
Perplexity for context of 16k tokens seems fine:
```
@@ -128,7 +128,7 @@ Can you attach the specific prompt that triggers the bug?
---
-👤 **Lissanro** commented the **2025-04-24** at **06:22:02**:
+👤 **Lissanro** commented on **2025-04-24** at **06:22:02**
I decided to test with your exact quant, I downloaded it here:
@@ -209,82 +209,21 @@ This is how I run it:
---
-👤 **Lissanro** commented the **2025-04-24** at **06:22:02**:
-
-I decided to test with your exact quant, I download it here:
-
-https://huggingface.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF/resolve/main/Llama-4-Scout-17B-16E-Instruct-UD-Q2_K_XL.gguf
-
-After testing with it, I noticed that at 18K input, it still may produce coherent output in many cases, even though quality may be reduced. For example, a prompt to summaries Wikipedia article about AI, truncated to about 18K tokens:
-
-```
-## Summary
-
-Artificial intelligence (AI) refers to the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.
-...
-[few more paragraphs of text that provide seemingly normal summary of the article]
-```
-
-But when I increase input length further (around 23K toknes), it starts to breakdown:
-
-```
-The emergence of generative artificial intelligence (AI) has been seen as a significant breakthrough in the field of artificial intelligence (AI) behavior prediction prediction prediction patterns prediction analysis prediction analysis prediction vehicles and criticism criticism of the behavior of vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles vehicles
-...
-[word "vehicles" is repeated until running out of the token limit]
-```
-
-However, the very beginning still may look OK, and there is still a possibility that it may provide semi-coherent replies to some prompts. But I am pretty sure that using full size article about AI (around 72K) will reliably break it no matter what settings. Using full 72K token long that I share below, you can truncate it to the maximum context window you can run for the best reproducibility.
-
-Here are exact prompts used that reproduce the issue on my side:
-
-https://dragon.studio/2025/04/prompt-23K.txt (truncated Wikipedia article, around 23K tokens long, the result shown above)
-
-https://dragon.studio/2025/04/prompt-76K.txt (full Wikipedia article, around 76K tokens long)
-
-I think just by using long enough prompt it should be possible to reproduce the issue - the longer the prompt, the more reproducible it should be (as shown in the examples, it still starts semi-coherent for 23K long prompt for this combination of quant and prompt).
-
-For full reproducibility, I also provide exact setting I used:
-
-https://dragon.studio/2025/04/send_prompt.py - running this script like this will use fixed seed and determenistic temperature setting for the best reproducibility:
-
-```
-python3 send_prompt.py --temp=0 --seed=0 prompt-23.txt
-```
-
-You do not really need to use the script - it is quite short and does nothing fancy, just sets basic parameters and sends the prompt, then prints out the result. So probably you can just use the prompt in UI of your choice to get the same or similar result by just setting temperature and seed to 0 (not sure if it matters, but my test script by default sets top-k=40, top-p=0.9, min-p=0.1, max-tokens=1024).
-
-This is how I compiled ik_llama.cpp (after running "git clone" in the ~/pkgs folder):
-
-```
-cd ~/pkgs && cmake ik_llama.cpp -B ik_llama.cpp/build -DGGML_CUDA_FA_ALL_QUANTS=ON -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON && cmake --build ik_llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-server
-```
-
-This is how I run it:
-
-```
-~/pkgs/ik_llama.cpp/build/bin/llama-server \
---model /mnt/secondary/neuro/Llama-4-Scout-17B-16E-Instruct-UD-Q2_K_XL.gguf \
---ctx-size 81920 --n-gpu-layers 49 --tensor-split 25,25,25,25 -fa -ctk q8_0 -ctv q8_0 \
---threads 64 --host 0.0.0.0 --port 5000
-```
-
----
-
-👤 **ikawrakow** commented the **2025-04-24** at **08:53:02**:
+👤 **ikawrakow** commented on **2025-04-24** at **08:53:02**
Thank you for this! I can now reproduce it with my setup (single GPU). I was concerned that the bug was somehow related to splitting the model, which would have made it impossible for me to debug. I can now try to find the issue.
---
-👤 **ikawrakow** commented the **2025-04-24** at **11:22:27**:
+👤 **ikawrakow** commented on **2025-04-24** at **11:22:27**
@Lissanro
-#342 should fix it. Can you confirm that it works on your end? Thanks.
+[#342](https://github.com/ikawrakow/ik_llama.cpp/issues/342) should fix it. Can you confirm that it works on your end? Thanks.
---
-👤 **Lissanro** commented the **2025-04-25** at **00:35:45**:
+👤 **Lissanro** commented on **2025-04-25** at **00:35:45**
It seems to fix it.
@@ -316,7 +255,7 @@ If ik_llama.cpp is expected to generated different output given the same seed an
---
-👤 **ikawrakow** commented the **2025-04-25** at **07:00:27**:
+👤 **ikawrakow** commented on **2025-04-25** at **07:00:27**
Thank you for testing.
@@ -324,7 +263,7 @@ The output of `llama.cpp` and `ik_llama.cpp` cannot be identical because the cal
---
-👤 **ikawrakow** commented the **2025-04-25** at **07:06:23**:
+👤 **ikawrakow** commented on **2025-04-25** at **07:06:23**
> By the way, can you please share an exact command to measure perplexity? I could run it on my side to see if there is a difference in perplexity between ik_llama.cpp and llama.cpp, if this a potentially useful information.
diff --git a/github-data/issues/339 - Bug_ bitnet2b_2501 template issues.md b/github-data/issues/339 - Bug bitnet2b_2501 template issues.md
similarity index 63%
rename from github-data/issues/339 - Bug_ bitnet2b_2501 template issues.md
rename to github-data/issues/339 - Bug bitnet2b_2501 template issues.md
index 2f5a39914..d15d992f2 100644
--- a/github-data/issues/339 - Bug_ bitnet2b_2501 template issues.md
+++ b/github-data/issues/339 - Bug bitnet2b_2501 template issues.md
@@ -1,4 +1,4 @@
-### 🐛 [#339](https://github.com/ikawrakow/ik_llama.cpp/issues/339) - Bug: bitnet2b_2501 template issues
+## 📌 [Issue #339](https://github.com/ikawrakow/ik_llama.cpp/issues/339) - Bug: bitnet2b_2501 template issues
| **Author** | `saood06` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -32,14 +32,8 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-04-22** at **07:51:57**:
+👤 **saood06** commented on **2025-04-22** at **07:51:57**
-I think this can actually be closed, the llama_chat_apply_template_internal code looks correct, and I would just need to update the model's GGUF file. I don't use the CLI mode enough to know why it wasn't working there, but now I can get it to function properly in server when I use the correct template.
-
----
-
-👤 **saood06** commented the **2025-04-22** at **07:51:57**:
-
-I think this can actually be closed, the llama_chat_apply_template_internal code looks correct, and I would just need to update the model's GGUF file. I don't use the CLI mode enough to know why it wasn't working there.
\ No newline at end of file
+I think this can actually be closed, the llama_chat_apply_template_internal code looks correct, and I would just need to update the model's GGUF file. I don't use the CLI mode enough to know why it wasn't working there, but now I can get it to function properly in server when I use the correct template.
\ No newline at end of file
diff --git a/github-data/issues/34 - Bug_ FA fails when processing prompt lengths that are not a multiple of .md b/github-data/issues/34 - Bug FA fails when processing prompt lengths that are not a multiple of 8.md
similarity index 67%
rename from github-data/issues/34 - Bug_ FA fails when processing prompt lengths that are not a multiple of .md
rename to github-data/issues/34 - Bug FA fails when processing prompt lengths that are not a multiple of 8.md
index 63a971b5c..085061475 100644
--- a/github-data/issues/34 - Bug_ FA fails when processing prompt lengths that are not a multiple of .md
+++ b/github-data/issues/34 - Bug FA fails when processing prompt lengths that are not a multiple of 8.md
@@ -1,14 +1,15 @@
-### 🐛 [#34](https://github.com/ikawrakow/ik_llama.cpp/issues/34) - Bug: FA fails when processing prompt lengths that are not a multiple of 8
+## 📌 [Issue #34](https://github.com/ikawrakow/ik_llama.cpp/issues/34) - Bug: FA fails when processing prompt lengths that are not a multiple of 8
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-09-02 |
| **Updated** | 2024-09-02 |
+| **Labels** | `bug` |
---
-#### Description
+## 📄 Description
### What happened?
diff --git a/github-data/issues/340 - Bug_ _unknown model architecture_ _cohere2_ when trying to load Command.md b/github-data/issues/340 - Bug unknown model architecture cohere2 when trying to load Command A model.md
similarity index 74%
rename from github-data/issues/340 - Bug_ _unknown model architecture_ _cohere2_ when trying to load Command.md
rename to github-data/issues/340 - Bug unknown model architecture cohere2 when trying to load Command A model.md
index e481f64df..d8b9685e5 100644
--- a/github-data/issues/340 - Bug_ _unknown model architecture_ _cohere2_ when trying to load Command.md
+++ b/github-data/issues/340 - Bug unknown model architecture cohere2 when trying to load Command A model.md
@@ -1,4 +1,4 @@
-### 🐛 [#340](https://github.com/ikawrakow/ik_llama.cpp/issues/340) - Bug: \"unknown model architecture: 'cohere2'\" when trying to load Command A model
+## 📌 [Issue #340](https://github.com/ikawrakow/ik_llama.cpp/issues/340) - Bug: "unknown model architecture: 'cohere2'" when trying to load Command A model
| **Author** | `Alexey-Akishin` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -35,33 +35,33 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-22** at **17:19:53**:
+👤 **ikawrakow** commented on **2025-04-22** at **17:19:53**
I can look into adding it, but I don't have the bandwidth to test every model. Are you willing to test?
---
-👤 **saood06** commented the **2025-04-22** at **17:29:22**:
+👤 **saood06** commented on **2025-04-22** at **17:29:22**
I could test, there is a [small model](https://huggingface.co/dranger003/c4ai-command-r7b-12-2024-GGUF) for it as well. I looked into the code, port looked simple (but would need to be redone because of their refactorings).
---
-👤 **Alexey-Akishin** commented the **2025-04-22** at **17:34:13**:
+👤 **Alexey-Akishin** commented on **2025-04-22** at **17:34:13**
I will be more than happy to test, I build ik_llama.cpp from source, so for example I can test a patch when it is available, no problem.
---
-👤 **mcm007** commented the **2025-04-25** at **07:23:19**:
+👤 **mcm007** commented on **2025-04-25** at **07:23:19**
-Tested on CPU only, the small 7B model works OK with #341 .
+Tested on CPU only, the small 7B model works OK with [#341](https://github.com/ikawrakow/ik_llama.cpp/issues/341) .
---
-👤 **Alexey-Akishin** commented the **2025-04-25** at **09:25:05**:
+👤 **Alexey-Akishin** commented on **2025-04-25** at **09:25:05**
Unfortunately it did not work for me with Command A. I just asked it to summarize first few paragraphs from wiki article about "dog":
@@ -95,13 +95,13 @@ Model I used for testing: https://huggingface.co/bartowski/CohereForAI_c4ai-comm
---
-👤 **ikawrakow** commented the **2025-04-25** at **09:52:50**:
+👤 **ikawrakow** commented on **2025-04-25** at **09:52:50**
It looks like something not quite right with the vocabulary. So, I guess, I need to test with this specific model.
---
-👤 **ikawrakow** commented the **2025-04-25** at **10:29:47**:
+👤 **ikawrakow** commented on **2025-04-25** at **10:29:47**
@Alexey-Akishin
@@ -111,26 +111,26 @@ Thanks.
---
-👤 **ikawrakow** commented the **2025-04-25** at **11:16:54**:
+👤 **ikawrakow** commented on **2025-04-25** at **11:16:54**
So, downloaded this specific model. Works fine on the CPU. Produces gibberish on the GPU with partial offload.Is this model another one of those where one needs `fp32` precision for it to work?
---
-👤 **ikawrakow** commented the **2025-04-25** at **13:00:29**:
+👤 **ikawrakow** commented on **2025-04-25** at **13:00:29**
> Is this model another one of those where one needs fp32 precision for it to work?
-Yes, it is. Setting the precision of the `K*Q` matrix multiplication to `fp32` fixes the gibberish on CUDA. The current state of #341 should also work with the 111B parameter Command-A model.
+Yes, it is. Setting the precision of the `K*Q` matrix multiplication to `fp32` fixes the gibberish on CUDA. The current state of [#341](https://github.com/ikawrakow/ik_llama.cpp/issues/341) should also work with the 111B parameter Command-A model.
---
-👤 **Alexey-Akishin** commented the **2025-04-25** at **21:38:17**:
+👤 **Alexey-Akishin** commented on **2025-04-25** at **21:38:17**
-I just tried latest #341 patch and it works well now! You are right, I was using CUDA (loading the whole model to GPUs). Thank you so much for adding support for Command A!
+I just tried latest [#341](https://github.com/ikawrakow/ik_llama.cpp/issues/341) patch and it works well now! You are right, I was using CUDA (loading the whole model to GPUs). Thank you so much for adding support for Command A!
---
-👤 **ikawrakow** commented the **2025-04-26** at **06:12:51**:
+👤 **ikawrakow** commented on **2025-04-26** at **06:12:51**
OK, thanks for testing. I'll merge the PR and close the issue.
\ No newline at end of file
diff --git a/github-data/issues/345 - build question newbie.md b/github-data/issues/345 - build question newbie.md
index edc49ec31..49faa3239 100644
--- a/github-data/issues/345 - build question newbie.md
+++ b/github-data/issues/345 - build question newbie.md
@@ -1,4 +1,4 @@
-### 📝 [#345](https://github.com/ikawrakow/ik_llama.cpp/issues/345) - build question newbie
+## 📌 [Issue #345](https://github.com/ikawrakow/ik_llama.cpp/issues/345) - build question newbie
| **Author** | `VinnyG9` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
hello, i just found this repo and I'm getting incredible performance on my rock5b SBC
@@ -179,9 +179,9 @@ Thank you very much 🙏
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **VinnyG9** commented the **2025-04-25** at **06:04:04**:
+👤 **VinnyG9** commented on **2025-04-25** at **06:04:04**
llama3.2/gemma3 are way worse on tg but same pp performance
@@ -200,13 +200,13 @@ Log end
---
-👤 **ikawrakow** commented the **2025-04-25** at **06:56:13**:
+👤 **ikawrakow** commented on **2025-04-25** at **06:56:13**
What is cogito?
---
-👤 **saood06** commented the **2025-04-25** at **07:00:09**:
+👤 **saood06** commented on **2025-04-25** at **07:00:09**
> What is cogito?
@@ -214,7 +214,7 @@ I'm assuming he's referring to this: https://huggingface.co/collections/deepcogi
---
-👤 **mcm007** commented the **2025-04-25** at **07:07:28**:
+👤 **mcm007** commented on **2025-04-25** at **07:07:28**
The t/s looks too high for a SBC, maybe the .gguf model is corrupt?
@@ -240,7 +240,7 @@ build: 55fb9c81 (3643)
---
-👤 **saood06** commented the **2025-04-25** at **07:26:13**:
+👤 **saood06** commented on **2025-04-25** at **07:26:13**
Also here are the tables from the first post
@@ -328,7 +328,7 @@ Also here are the tables from the first post
---
-👤 **VinnyG9** commented the **2025-04-25** at **07:27:20**:
+👤 **VinnyG9** commented on **2025-04-25** at **07:27:20**
> The t/s looks too high for a SBC, maybe the .gguf model is corrupt?
>
@@ -363,7 +363,7 @@ run/bench
---
-👤 **VinnyG9** commented the **2025-04-25** at **07:40:54**:
+👤 **VinnyG9** commented on **2025-04-25** at **07:40:54**
yeah, they all output gibberish
@@ -371,7 +371,7 @@ main llama.cpp works no problem
---
-👤 **ikawrakow** commented the **2025-04-25** at **07:48:15**:
+👤 **ikawrakow** commented on **2025-04-25** at **07:48:15**
I'm not familiar with this space, so had to look up what "rock 5b" is. According to [this](https://bret.dk/radxa-rock-5b-review-powerful-rk3588-sbc/) it has one Cortex-A76 and one Cortex-A55 CPU. For this the performance numbers look too high. Which means that most likely the `iqk` matrix multiplications that I have added do not get invoked, and it falls back to stock `ggml` implementation (`ggml` is the inference library behind `llama.cpp`). Most likely something goes wrong there, which leads to crazy performance and gibberish output. I did try to maintain this use case (the fallback to stock `ggml`) in a working condition for a while, but I think it is broken now.
@@ -379,7 +379,7 @@ I assume you are running Linux on this board? Can you do `cat /proc/cpuinfo`?
---
-👤 **saood06** commented the **2025-04-25** at **07:56:57**:
+👤 **saood06** commented on **2025-04-25** at **07:56:57**
> I'm not familiar with this space, so had to look up what "rock 5b" is. According to [this](https://bret.dk/radxa-rock-5b-review-powerful-rk3588-sbc/) it has one Cortex-A76 and one Cortex-A55 CPU. For this the performance numbers look too high.
@@ -387,15 +387,7 @@ It has eight cores in total, "Quad Cortex®-A76 @ 2.2~2.4GHz and a Quad Cortex®
---
-👤 **saood06** commented the **2025-04-25** at **07:56:57**:
-
-> I'm not familiar with this space, so had to look up what "rock 5b" is. According to [this](https://bret.dk/radxa-rock-5b-review-powerful-rk3588-sbc/) it has one Cortex-A76 and one Cortex-A55 CPU. For this the performance numbers look too high.
-
-It has eight cores in total, "Quad Cortex®-A76 @ 2.2~2.4GHz and a Quad Cortex®-A55 @ 1.8GHz" from [what I think is the official product page](https://radxa.com/products/rock5/5b/#techspec). But that is still too high.
-
----
-
-👤 **VinnyG9** commented the **2025-04-25** at **08:05:16**:
+👤 **VinnyG9** commented on **2025-04-25** at **08:05:16**
>
> I assume you are running Linux on this board? Can you do `cat /proc/cpuinfo`?
@@ -427,7 +419,7 @@ i get same performance on q4km and iq4nl are you sure it has to do with iqk mm?
---
-👤 **VinnyG9** commented the **2025-04-25** at **08:12:29**:
+👤 **VinnyG9** commented on **2025-04-25** at **08:12:29**
>
> But even with that the performance still seems too high.
@@ -444,7 +436,7 @@ maybe it's some dependency missing? but main runs normally...
---
-👤 **ikawrakow** commented the **2025-04-25** at **08:16:42**:
+👤 **ikawrakow** commented on **2025-04-25** at **08:16:42**
> i get same performance on q4km and iq4nl are you sure it has to do with iqk mm?
@@ -454,7 +446,7 @@ The CPU flags look completely unfamiliar, so I cannot deduce from there if the `
---
-👤 **VinnyG9** commented the **2025-04-25** at **08:22:04**:
+👤 **VinnyG9** commented on **2025-04-25** at **08:22:04**
> > i get same performance on q4km and iq4nl are you sure it has to do with iqk mm?
>
@@ -468,7 +460,7 @@ they do
---
-👤 **ikawrakow** commented the **2025-04-25** at **08:24:21**:
+👤 **ikawrakow** commented on **2025-04-25** at **08:24:21**
So, I'm finding that the `asimddp` CPU feature that you have should enable `__ARM_FEATURE_DOTPROD`. With that things should work correctly.
@@ -476,13 +468,7 @@ What is the compiler being used?
---
-👤 **ikawrakow** commented the **2025-04-25** at **08:24:21**:
-
-So, I'm finding that the `asimddp` feature that you have should enable `__ARM_FEATURE_DOTPROD`. With that things should work correctly.
-
----
-
-👤 **VinnyG9** commented the **2025-04-25** at **08:27:08**:
+👤 **VinnyG9** commented on **2025-04-25** at **08:27:08**
> So, I'm finding that the `asimddp` CPU feature that you have should enable `__ARM_FEATURE_DOTPROD`. With that things should work correctly.
>
@@ -499,7 +485,7 @@ could the llamafile feature interfere?
---
-👤 **ikawrakow** commented the **2025-04-25** at **08:41:23**:
+👤 **ikawrakow** commented on **2025-04-25** at **08:41:23**
Mainline `llama.cpp` has now a much more sophisticated CPU feature detection than this project that got added after I forked. Here things are more on the "do it yourself" level. What you can do to see if the features added by this repo are working, add
```
@@ -509,7 +495,7 @@ just before [this line](https://github.com/ikawrakow/ik_llama.cpp/blob/f176122a3
---
-👤 **ikawrakow** commented the **2025-04-25** at **08:56:26**:
+👤 **ikawrakow** commented on **2025-04-25** at **08:56:26**
> could the llamafile feature interfere?
@@ -517,7 +503,7 @@ Normally no, but you can disable it just in case with `-DGGML_LLAMAFILE=0`
---
-👤 **VinnyG9** commented the **2025-04-25** at **09:01:18**:
+👤 **VinnyG9** commented on **2025-04-25** at **09:01:18**
> Mainline `llama.cpp` has now a much more sophisticated CPU feature detection than this project that got added after I forked. Here things are more on the "do it yourself" level. What you can do to see if the features added by this repo are working, add
>
@@ -534,31 +520,31 @@ i get a build error
---
-👤 **ikawrakow** commented the **2025-04-25** at **09:02:45**:
+👤 **ikawrakow** commented on **2025-04-25** at **09:02:45**
Yes.
---
-👤 **ikawrakow** commented the **2025-04-25** at **09:03:57**:
+👤 **ikawrakow** commented on **2025-04-25** at **09:03:57**
Sorry, also add the same `printf` line in the `iqk_mul_mat` function just above that.
---
-👤 **VinnyG9** commented the **2025-04-25** at **09:06:13**:
+👤 **VinnyG9** commented on **2025-04-25** at **09:06:13**

---
-👤 **ikawrakow** commented the **2025-04-25** at **09:08:36**:
+👤 **ikawrakow** commented on **2025-04-25** at **09:08:36**
Then you need to add `#include ` near the beginning of the file.
---
-👤 **VinnyG9** commented the **2025-04-25** at **09:09:40**:
+👤 **VinnyG9** commented on **2025-04-25** at **09:09:40**
> Then you need to add `#include ` near the beginning of the file.
@@ -570,13 +556,13 @@ now it errors but keeps building
---
-👤 **ikawrakow** commented the **2025-04-25** at **09:12:34**:
+👤 **ikawrakow** commented on **2025-04-25** at **09:12:34**
The warning is harmless. What happens after you run it?
---
-👤 **VinnyG9** commented the **2025-04-25** at **09:15:41**:
+👤 **VinnyG9** commented on **2025-04-25** at **09:15:41**
> The warning is harmless. What happens after you run it?
@@ -584,13 +570,13 @@ floods the terminal with "iqk is not enabled"
---
-👤 **ikawrakow** commented the **2025-04-25** at **09:18:13**:
+👤 **ikawrakow** commented on **2025-04-25** at **09:18:13**
OK, so we know that the build does not work on your system. Your CPU supports the necessary features, so we need to understand why the compiler is not enabling them, so we can fix it.
---
-👤 **VinnyG9** commented the **2025-04-25** at **09:20:29**:
+👤 **VinnyG9** commented on **2025-04-25** at **09:20:29**
> OK, so we know that the build does not work on your system. Your CPU supports the necessary features, so we need to understand why the compiler is not enabling them, so we can fix it.
@@ -598,13 +584,13 @@ i can try with clang19?
---
-👤 **ikawrakow** commented the **2025-04-25** at **09:22:44**:
+👤 **ikawrakow** commented on **2025-04-25** at **09:22:44**
Yes, you can try building with `clang`, maybe this will fix it. But if not, I guess I need to add the ability to manually set compiler flags.
---
-👤 **VinnyG9** commented the **2025-04-25** at **09:24:15**:
+👤 **VinnyG9** commented on **2025-04-25** at **09:24:15**
i got this with clang build setup
not sure why as I'd seen openmp found earlier
@@ -619,13 +605,13 @@ CMake Warning at ggml/src/CMakeLists.txt:167 (message):
---
-👤 **ikawrakow** commented the **2025-04-25** at **09:25:33**:
+👤 **ikawrakow** commented on **2025-04-25** at **09:25:33**
`OpenMP` is not really required. On my M2-Max laptop it actually hurts performance.
---
-👤 **VinnyG9** commented the **2025-04-25** at **09:29:02**:
+👤 **VinnyG9** commented on **2025-04-25** at **09:29:02**
same error on clang19
@@ -633,9 +619,9 @@ same error on clang19
---
-👤 **ikawrakow** commented the **2025-04-25** at **09:46:07**:
+👤 **ikawrakow** commented on **2025-04-25** at **09:46:07**
-So, I made PR #347
+So, I made PR [#347](https://github.com/ikawrakow/ik_llama.cpp/issues/347)
Can you try
```
@@ -646,7 +632,7 @@ cmake -DGGML_ARCH_FLAGS="-march=armv8.2-a+dotprod+fp16" (plus other things you w
---
-👤 **VinnyG9** commented the **2025-04-25** at **11:05:45**:
+👤 **VinnyG9** commented on **2025-04-25** at **11:05:45**

@@ -665,7 +651,7 @@ also not able to use the -fa flag
---
-👤 **ikawrakow** commented the **2025-04-25** at **11:14:20**:
+👤 **ikawrakow** commented on **2025-04-25** at **11:14:20**
Great. Not sure what could be wrong with `Q4_0` as it does work on my M2-Max. Mainline has done optimizations for `Q4_0` and `IQ4_NL` on ARM, so for these there will not be much difference (my implementation is faster than theirs on the M2-Max, but I guess my optimizations are too aggressive for the A76, so mainline ends up being faster for these two quants on a lower spec Arm CPU).
@@ -675,7 +661,7 @@ Why? What happens?
---
-👤 **VinnyG9** commented the **2025-04-25** at **18:28:37**:
+👤 **VinnyG9** commented on **2025-04-25** at **18:28:37**
> Great. Not sure what could be wrong with `Q4_0` as it does work on my M2-Max. Mainline has done optimizations for `Q4_0` and `IQ4_NL` on ARM, so for these there will not be much difference (my implementation is faster than theirs on the M2-Max, but I guess my optimizations are too aggressive for the A76, so mainline ends up being faster for these two quants on a lower spec Arm CPU).
>
@@ -697,7 +683,7 @@ offtopic: from what i got reading llamacpp issues llamafile enables tinyblas? it
---
-👤 **VinnyG9** commented the **2025-04-25** at **19:53:33**:
+👤 **VinnyG9** commented on **2025-04-25** at **19:53:33**
got some decent performance with bitnet new model, however if i disable OpenMP, tg drops to 16t/s:
@@ -729,7 +715,7 @@ but at least speed didn't drop much with longer text
---
-👤 **VinnyG9** commented the **2025-04-25** at **21:26:36**:
+👤 **VinnyG9** commented on **2025-04-25** at **21:26:36**
>
> ```
@@ -763,7 +749,7 @@ can someone explain why I'm not benefitting from the arm repack thing? like is n
---
-👤 **saood06** commented the **2025-04-26** at **00:36:22**:
+👤 **saood06** commented on **2025-04-26** at **00:36:22**
>nosme actually only worked on main only on clang
@@ -771,7 +757,7 @@ So for ik_llama.cpp was there a difference between clang and gcc now that you go
---
-👤 **ikawrakow** commented the **2025-04-26** at **06:12:06**:
+👤 **ikawrakow** commented on **2025-04-26** at **06:12:06**
> can someone explain why I'm not benefitting from the arm repack thing? like is not IQ4_NL supposed to run faster?
@@ -782,7 +768,7 @@ I find it interesting that explicitly disabling some features with `-march=armv8
---
-👤 **ikawrakow** commented the **2025-04-26** at **07:30:23**:
+👤 **ikawrakow** commented on **2025-04-26** at **07:30:23**
> got some decent performance with bitnet new model, however if i disable OpenMP, tg drops to 16t/s:
@@ -795,7 +781,7 @@ I think you can only get more bandwidth utilized if both CPUs get used. Unfortun
---
-👤 **VinnyG9** commented the **2025-04-26** at **17:55:29**:
+👤 **VinnyG9** commented on **2025-04-26** at **17:55:29**
> > nosme actually only worked on main only on clang
>
@@ -825,19 +811,13 @@ llama.cpp$ build/bin/llama-bench -m ../models/embed/bge-m3-Q4_0.gguf -p 64,128,2
---
-👤 **ikawrakow** commented the **2025-04-27** at **06:13:34**:
+👤 **ikawrakow** commented on **2025-04-27** at **06:13:34**
If your 566M parameter Bert model is something like [this one](https://huggingface.co/blogcncom/bge-m3-Q4_0-GGUF), 200 MiB out of 400 MiB are token embeddings. Only a tiny fraction of these 200 MiB gets actually used (~1000 bytes per generated token), so effectively you are running a 200 MiB model, so memory bandwidth utilized during TG is `120 t/s x 0.2 GiB = 24 GiB/s.`
---
-👤 **ikawrakow** commented the **2025-04-27** at **06:13:34**:
-
-If your 335M parameter Bert model is something like [this one](https://huggingface.co/blogcncom/bge-m3-Q4_0-GGUF), 200 MiB out of 400 MiB are token embeddings. Only a tiny fraction of these 200 MiB gets actually used (~1000 bytes per generated token), so effectively you are running a 200 MiB model, so memory bandwidth utilized during TG is `120 t/s x 0.2 GiB = 24 GiB/s.`
-
----
-
-👤 **VinnyG9** commented the **2025-04-30** at **04:45:02**:
+👤 **VinnyG9** commented on **2025-04-30** at **04:45:02**
> If your 566M parameter Bert model is something like [this one](https://huggingface.co/blogcncom/bge-m3-Q4_0-GGUF), 200 MiB out of 400 MiB are token embeddings. Only a tiny fraction of these 200 MiB gets actually used (~1000 bytes per generated token), so effectively you are running a 200 MiB model, so memory bandwidth utilized during TG is `120 t/s x 0.2 GiB = 24 GiB/s.`
diff --git a/github-data/issues/353 - Binaries releases for Windows _.md b/github-data/issues/353 - Binaries releases for Windows.md
similarity index 74%
rename from github-data/issues/353 - Binaries releases for Windows _.md
rename to github-data/issues/353 - Binaries releases for Windows.md
index dde516fa4..51cde7e0e 100644
--- a/github-data/issues/353 - Binaries releases for Windows _.md
+++ b/github-data/issues/353 - Binaries releases for Windows.md
@@ -1,4 +1,4 @@
-### 📝 [#353](https://github.com/ikawrakow/ik_llama.cpp/issues/353) - Binaries releases for Windows ?
+## 📌 [Issue #353](https://github.com/ikawrakow/ik_llama.cpp/issues/353) - Binaries releases for Windows ?
| **Author** | `lbarasc` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
Hi,
@@ -18,9 +18,9 @@ Thank you.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-29** at **13:55:36**:
+👤 **ikawrakow** commented on **2025-04-29** at **13:55:36**
If this repository gains more momentum and there are users testing on Windows and providing feedback, sure, we can consider releasing Windows binaries.
@@ -33,13 +33,13 @@ Another thing is that this project does not aim at providing the broad hardware
---
-👤 **PmNz8** commented the **2025-04-30** at **22:54:13**:
+👤 **PmNz8** commented on **2025-04-30** at **22:54:13**
I managed to compile from source for Windows cpu, but not for cuda - it is above my skills level. Having (best automatically) compiled binaries available on github would be great! I can always test some binaries if that would be helpful, one of my machine runs intel with avx512 (rocket lake), the other is AMD zen 3 + Nvidia ada.
---
-👤 **saood06** commented the **2025-05-01** at **07:32:23**:
+👤 **saood06** commented on **2025-05-01** at **07:32:23**
> * I don't have access to a Windows machine
> * I don't feel OK releasing builds that were never tested
@@ -48,19 +48,13 @@ If you want to do occasional releases (since we don't have CI like mainline does
---
-👤 **SpookyT00th** commented the **2025-05-01** at **22:11:05**:
+👤 **SpookyT00th** commented on **2025-05-01** at **22:11:05**
I noticed you mentioned that this is intended to support newer GPUs. Do you know if the Nvidia V100 (Volta Architecture) is supported? also, does this support tensor parallelism? i want to fit this model across 128GB VRAM : https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF
---
-👤 **SpookyT00th** commented the **2025-05-01** at **22:11:05**:
-
-I noticed you mentioned that this is intended to support newer GPUs. Do you know if the Nvidia V100 (Volta Architecture) is supported?
-
----
-
-👤 **saood06** commented the **2025-05-02** at **03:05:53**:
+👤 **saood06** commented on **2025-05-02** at **03:05:53**
>also, does this support tensor parallelism? i want to fit this model across 128GB VRAM : https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF
@@ -68,13 +62,13 @@ For MoE models such as the one you linked, `-split-mode row` does not function,
---
-👤 **sousekd** commented the **2025-05-29** at **20:39:13**:
+👤 **sousekd** commented on **2025-05-29** at **20:39:13**
I would be happy to test on AMD Epyc Turin + RTX 4090 / RTX Pro 6000, if builds are provided.
---
-👤 **Thireus** commented the **2025-06-03** at **17:54:35**:
+👤 **Thireus** commented on **2025-06-03** at **17:54:35**
If anyone wants to give a go to the build I've created, and report back if it works decently... https://github.com/Thireus/ik_llama.cpp/releases
@@ -83,13 +77,13 @@ See https://github.com/Thireus/ik_llama.cpp/blob/main/.github/workflows/release.
---
-👤 **lbarasc** commented the **2025-06-03** at **19:25:40**:
+👤 **lbarasc** commented on **2025-06-03** at **19:25:40**
Well thank you !! i will test this on my server.
---
-👤 **ikawrakow** commented the **2025-06-05** at **07:05:32**:
+👤 **ikawrakow** commented on **2025-06-05** at **07:05:32**
How is the testing going here?
@@ -105,7 +99,7 @@ So, to cover pre-build binaries for Windows users, one would need 6 different bu
---
-👤 **PmNz8** commented the **2025-06-06** at **19:01:35**:
+👤 **PmNz8** commented on **2025-06-06** at **19:01:35**
@Thireus for me your binaries do not run. I try something simple like .\llama-cli.exe -m "D:\LLMs\bartowski\Qwen_Qwen3-4B-GGUF\Qwen_Qwen3-4B-Q8_0.gguf" and all I get in the log is:
@@ -124,23 +118,7 @@ Windows 11 + RTX 4090 @ 576.52 drivers.
---
-👤 **PmNz8** commented the **2025-06-06** at **19:01:35**:
-
-@Thireus for me your binaries do not run. I try something simple like .\llama-cli.exe -m "D:\LLMs\bartowski\Qwen_Qwen3-4B-GGUF\Qwen_Qwen3-4B-Q8_0.gguf" and all I get in the log is:
-
-```
-[1749236397] Log start
-[1749236397] Cmd: C:\Users\dawidgaming\Downloads\ik_llama-main-b3770-5a8bb97-bin-win-cuda-12.8-x64\llama-cli.exe -m D:\LLMs\bartowski\Qwen_Qwen3-4B-GGUF\Qwen_Qwen3-4B-Q8_0.gguf
-[1749236397] main: build = 1 (5a8bb97)
-[1749236397] main: built with MSVC 19.29.30159.0 for
-[1749236397] main: seed = 1749236397
-[1749236397] main: llama backend init
-[1749236397] main: load the model and apply lora adapter, if any
-```
-
----
-
-👤 **kiron111** commented the **2025-06-06** at **19:55:45**:
+👤 **kiron111** commented on **2025-06-06** at **19:55:45**
> If anyone wants to give a go to the build I've created, and report back if it works decently... https://github.com/Thireus/ik_llama.cpp/releases
>
diff --git a/github-data/issues/358 - Bug_ IQK_FA_ALL_QUANTS causes failure to compile.md b/github-data/issues/358 - Bug IQK_FA_ALL_QUANTS causes failure to compile.md
similarity index 79%
rename from github-data/issues/358 - Bug_ IQK_FA_ALL_QUANTS causes failure to compile.md
rename to github-data/issues/358 - Bug IQK_FA_ALL_QUANTS causes failure to compile.md
index cd8bd21cb..1faedfd5f 100644
--- a/github-data/issues/358 - Bug_ IQK_FA_ALL_QUANTS causes failure to compile.md
+++ b/github-data/issues/358 - Bug IQK_FA_ALL_QUANTS causes failure to compile.md
@@ -1,4 +1,4 @@
-### 🐛 [#358](https://github.com/ikawrakow/ik_llama.cpp/issues/358) - Bug: IQK_FA_ALL_QUANTS causes failure to compile
+## 📌 [Issue #358](https://github.com/ikawrakow/ik_llama.cpp/issues/358) - Bug: IQK_FA_ALL_QUANTS causes failure to compile
| **Author** | `saood06` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
diff --git a/github-data/issues/361 - Bug Build not detecting some supported ARM CPUs.md b/github-data/issues/361 - Bug Build not detecting some supported ARM CPUs.md
new file mode 100644
index 000000000..83e222864
--- /dev/null
+++ b/github-data/issues/361 - Bug Build not detecting some supported ARM CPUs.md
@@ -0,0 +1,35 @@
+## 📌 [Issue #361](https://github.com/ikawrakow/ik_llama.cpp/issues/361) - Bug: Build not detecting some supported ARM CPUs
+
+| **Author** | `saood06` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Created** | 2025-04-30 |
+| **Updated** | 2025-05-02 |
+
+---
+
+## 📄 Description
+
+### What happened?
+
+This was reported in [#345](https://github.com/ikawrakow/ik_llama.cpp/issues/345) and I was also able to reproduce it on an Android device, there is a workaround with [#347](https://github.com/ikawrakow/ik_llama.cpp/issues/347) but ideally you should not need to set the architecture flag manually. This does not seem to affect the Apple ARM devices.
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-05-02** at **05:23:08**
+
+We can add something along the lines of mainline's automatic CPU feature detection. But I also have the experience that since they added the feature, mainline runs slower on my M2-Max CPU as it enables the `i8mm` CPU feature, but my guess is that this is emulated and not an actual feature of the M2 CPU.
+
+---
+
+👤 **saood06** commented on **2025-05-02** at **05:38:14**
+
+> We can add something along the lines of mainline's automatic CPU feature detection.
+
+Yes, I just created the issue since I hadn't looked into it fully.
+
+>But I also have the experience that since they added the feature, mainline runs slower on my M2-Max CPU as it enables the `i8mm` CPU feature, but my guess is that this is emulated and not an actual feature of the M2 CPU.
+
+That aligns with what was reported in [#345](https://github.com/ikawrakow/ik_llama.cpp/issues/345) where the user had better performance with `-march=armv8.2-a+dotprod+fp16+noi8mm+nosve+nosme` over just `"-march=armv8.2-a+dotprod+fp16"`. So it may not be just the M2 CPU. I'm not very familiar with the actual hardware implementation of the recent ARM extensions so I can't really say.
\ No newline at end of file
diff --git a/github-data/issues/361 - Bug_ Build not detecting some supported ARM CPUs.md b/github-data/issues/361 - Bug_ Build not detecting some supported ARM CPUs.md
deleted file mode 100644
index a4f2c61e8..000000000
--- a/github-data/issues/361 - Bug_ Build not detecting some supported ARM CPUs.md
+++ /dev/null
@@ -1,35 +0,0 @@
-### 🐛 [#361](https://github.com/ikawrakow/ik_llama.cpp/issues/361) - Bug: Build not detecting some supported ARM CPUs
-
-| **Author** | `saood06` |
-| :--- | :--- |
-| **State** | ✅ **Open** |
-| **Created** | 2025-04-30 |
-| **Updated** | 2025-05-02 |
-
----
-
-#### Description
-
-### What happened?
-
-This was reported in #345 and I was also able to reproduce it on an Android device, there is a workaround with #347 but ideally you should not need to set the architecture flag manually. This does not seem to affect the Apple ARM devices.
-
----
-
-#### 💬 Conversation
-
-👤 **ikawrakow** commented the **2025-05-02** at **05:23:08**:
-
-We can add something along the lines of mainline's automatic CPU feature detection. But I also have the experience that since they added the feature, mainline runs slower on my M2-Max CPU as it enables the `i8mm` CPU feature, but my guess is that this is emulated and not an actual feature of the M2 CPU.
-
----
-
-👤 **saood06** commented the **2025-05-02** at **05:38:14**:
-
-> We can add something along the lines of mainline's automatic CPU feature detection.
-
-Yes, I just created the issue since I hadn't looked into it fully.
-
->But I also have the experience that since they added the feature, mainline runs slower on my M2-Max CPU as it enables the `i8mm` CPU feature, but my guess is that this is emulated and not an actual feature of the M2 CPU.
-
-That aligns with what was reported in #345 where the user had better performance with `-march=armv8.2-a+dotprod+fp16+noi8mm+nosve+nosme` over just `"-march=armv8.2-a+dotprod+fp16"`. So it may not be just the M2 CPU. I'm not very familiar with the actual hardware implementation of the recent ARM extensions so I can't really say.
\ No newline at end of file
diff --git a/github-data/issues/362 - README language is vague wrt. _quantization improvements_.md b/github-data/issues/362 - README language is vague wrt. quantization improvements.md
similarity index 89%
rename from github-data/issues/362 - README language is vague wrt. _quantization improvements_.md
rename to github-data/issues/362 - README language is vague wrt. quantization improvements.md
index 41675d12f..e997fdfe3 100644
--- a/github-data/issues/362 - README language is vague wrt. _quantization improvements_.md
+++ b/github-data/issues/362 - README language is vague wrt. quantization improvements.md
@@ -1,4 +1,4 @@
-### 📝 [#362](https://github.com/ikawrakow/ik_llama.cpp/issues/362) - README language is vague wrt. \"quantization improvements\"
+## 📌 [Issue #362](https://github.com/ikawrakow/ik_llama.cpp/issues/362) - README language is vague wrt. "quantization improvements"
| **Author** | `usrlocalben` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -33,15 +33,15 @@ https://github.com/ikawrakow/ik_llama.cpp/commit/98d1626469879d35faba9cb7e9d0b1d
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-04-30** at **23:21:14**:
+👤 **saood06** commented on **2025-04-30** at **23:21:14**
As the README mentions you can often find detailed information in PRs. https://github.com/ikawrakow/ik_llama.cpp/pull/295 and https://github.com/ikawrakow/ik_llama.cpp/pull/302 are the related PRs
---
-👤 **ikawrakow** commented the **2025-05-01** at **16:41:52**:
+👤 **ikawrakow** commented on **2025-05-01** at **16:41:52**
Would you like to have links to the specific PR's in the News section? I did try this along with a short description initially, but then it becomes kind of too long for a News section.
@@ -72,6 +72,6 @@ Here is where the user needs to understand what the improvement was so they can
---
-👤 **usrlocalben** commented the **2025-05-13** at **13:16:29**:
+👤 **usrlocalben** commented on **2025-05-13** at **13:16:29**
Thanks for the commentary and also the README updates w/PR links on the line-items. I now resolve the language this way: To Quantize is a verb/action and therefore strongly refers to _computing_ the quant, i.e. llama-quantize. Closing
\ No newline at end of file
diff --git a/github-data/issues/363 - Bug_ Gibberish output when using flash attention using Mistral-Small-I.md b/github-data/issues/363 - Bug Gibberish output when using flash attention using Mistral-Small-Instruct-240.md
similarity index 95%
rename from github-data/issues/363 - Bug_ Gibberish output when using flash attention using Mistral-Small-I.md
rename to github-data/issues/363 - Bug Gibberish output when using flash attention using Mistral-Small-Instruct-240.md
index 95651bdf2..e81fd3682 100644
--- a/github-data/issues/363 - Bug_ Gibberish output when using flash attention using Mistral-Small-I.md
+++ b/github-data/issues/363 - Bug Gibberish output when using flash attention using Mistral-Small-Instruct-240.md
@@ -1,4 +1,4 @@
-### 🐛 [#363](https://github.com/ikawrakow/ik_llama.cpp/issues/363) - Bug: Gibberish output when using flash attention using Mistral-Small-Instruct-2409-Q6_K and Gemma-3-12b-it-q4_0 on CPU
+## 📌 [Issue #363](https://github.com/ikawrakow/ik_llama.cpp/issues/363) - Bug: Gibberish output when using flash attention using Mistral-Small-Instruct-2409-Q6_K and Gemma-3-12b-it-q4_0 on CPU
| **Author** | `djg26` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -36,15 +36,15 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-01** at **06:59:00**:
+👤 **ikawrakow** commented on **2025-05-01** at **06:59:00**
-Thank you for the bug report. Can you confirm that #364 fixes it?
+Thank you for the bug report. Can you confirm that [#364](https://github.com/ikawrakow/ik_llama.cpp/issues/364) fixes it?
---
-👤 **djg26** commented the **2025-05-01** at **12:07:51**:
+👤 **djg26** commented on **2025-05-01** at **12:07:51**
It's a little better, however it still breaks down after a few tokens, becoming gibberish again with a longer context. Tested with Mistral-Small-Instruct-2409-Q6_K with the following command ./llama-server -m ~/Downloads/Mistral-Small-Instruct-2409-Q6_K.gguf -ctk q8_0 -ctv q8_0 -fa -c 32768
With flash attention:
@@ -57,13 +57,13 @@ Without flash attention:
---
-👤 **ikawrakow** commented the **2025-05-01** at **13:38:15**:
+👤 **ikawrakow** commented on **2025-05-01** at **13:38:15**
What about Gemma?
---
-👤 **djg26** commented the **2025-05-01** at **13:53:04**:
+👤 **djg26** commented on **2025-05-01** at **13:53:04**
Same issue with Gemma as well.
@@ -77,7 +77,7 @@ Without flash attention:
---
-👤 **ikawrakow** commented the **2025-05-01** at **15:45:01**:
+👤 **ikawrakow** commented on **2025-05-01** at **15:45:01**
So, `f16` and `Q6_0` KV cache works. Here is the 5 paragraph story I get with `Q6_0`:
```
@@ -97,7 +97,7 @@ In the case of `Q8_0` KV cache, it does start OK, but transitions to repetitions
---
-👤 **ikawrakow** commented the **2025-05-01** at **16:15:55**:
+👤 **ikawrakow** commented on **2025-05-01** at **16:15:55**
OK, this should work now with `Q8_0` KV cache. Here the 5 paragraph story I get with
```
@@ -118,19 +118,19 @@ Hours later, soaked and shivering, Silas stood on the beach, watching as the coa
---
-👤 **djg26** commented the **2025-05-01** at **17:04:17**:
+👤 **djg26** commented on **2025-05-01** at **17:04:17**
It does work better now, but I'm still having repetition issues when the context has a few 1,000 tokens in it, both with Gemma3 and Mistral small. This is with the KV cache both on Q8_0.
---
-👤 **ikawrakow** commented the **2025-05-01** at **17:14:24**:
+👤 **ikawrakow** commented on **2025-05-01** at **17:14:24**
You need to give me a way to reproduce it.
---
-👤 **djg26** commented the **2025-05-01** at **17:35:28**:
+👤 **djg26** commented on **2025-05-01** at **17:35:28**
With q8_0 on kv cache
./llama-server -m ~/Downloads/gemma-3-12b-it-q4_0.gguf -ctk q8_0 -ctv q8_0 -fa -c 32768
@@ -144,7 +144,7 @@ I'm not very used to llama-cli so I've been doing it via llama-server's webui.
---
-👤 **ikawrakow** commented the **2025-05-01** at **18:18:58**:
+👤 **ikawrakow** commented on **2025-05-01** at **18:18:58**
Can you post `cat /proc/cpuinfo`? Thanks.
@@ -157,18 +157,7 @@ Can you post `cat /proc/cpuinfo`?
---
-👤 **ikawrakow** commented the **2025-05-01** at **18:18:58**:
-
-Here is what I get:
-
-
-
-
-Can you post `cat /proc/cpuinfo`?
-
----
-
-👤 **djg26** commented the **2025-05-01** at **18:31:10**:
+👤 **djg26** commented on **2025-05-01** at **18:31:10**
```
processor : 0
@@ -623,7 +612,7 @@ power management:
---
-👤 **djg26** commented the **2025-05-01** at **18:49:24**:
+👤 **djg26** commented on **2025-05-01** at **18:49:24**
I get the same issue using mistral small as well.
Command used: ./llama-server -m ~/Downloads/Mistral-Small-Instruct-2409-Q6_K.gguf -ctk q8_0 -ctv q8_0 -fa -c 32768
@@ -636,7 +625,7 @@ no Q8_0 KV cache
---
-👤 **djg26** commented the **2025-05-01** at **19:32:56**:
+👤 **djg26** commented on **2025-05-01** at **19:32:56**
Running with K at f16 and V at q8_0 seems to work fine.
./llama-server -m ~/Downloads/Mistral-Small-Instruct-2409-Q6_K.gguf -ctv q8_0 -fa -c 32768
@@ -644,30 +633,30 @@ Running with K at f16 and V at q8_0 seems to work fine.
---
-👤 **ikawrakow** commented the **2025-05-02** at **05:10:26**:
+👤 **ikawrakow** commented on **2025-05-02** at **05:10:26**
I'll have to investigate in more detail then. In the meantime, just don't use `Q8_0` for K-cache.
---
-👤 **ikawrakow** commented the **2025-05-04** at **06:04:20**:
+👤 **ikawrakow** commented on **2025-05-04** at **06:04:20**
-Not sure what to do with this one. It works fine on both of my systems (Zen4 and vanilla AVX2) after #364
+Not sure what to do with this one. It works fine on both of my systems (Zen4 and vanilla AVX2) after [#364](https://github.com/ikawrakow/ik_llama.cpp/issues/364)
---
-👤 **djg26** commented the **2025-05-04** at **14:39:17**:
+👤 **djg26** commented on **2025-05-04** at **14:39:17**
It seems like having K be ```q8_0``` and flash attention turned on doesn't work, at least for me. I've tested it on another computer with a ryzen 5 5600x and it still starts getting repetitive like before after some tokens. If I don't enable ```-fa``` and just have the K be ```q8_0``` then the models work fine. It also works fine if K is something like ```q6_0``` with the ```-fa``` flag.
---
-👤 **djg26** commented the **2025-05-04** at **17:18:01**:
+👤 **djg26** commented on **2025-05-04** at **17:18:01**
Qwen3-30B-A3B also works fine when using ```q8_0``` KV and flash attention.
---
-👤 **djg26** commented the **2025-05-09** at **17:16:43**:
+👤 **djg26** commented on **2025-05-09** at **17:16:43**
Closing as after doing another git pull to the latest version it seems to work fine now.
\ No newline at end of file
diff --git a/github-data/issues/365 - Bug_ Updated BitNet arch bitnet-b1.58.md b/github-data/issues/365 - Bug Updated BitNet arch bitnet-b1.58.md
similarity index 55%
rename from github-data/issues/365 - Bug_ Updated BitNet arch bitnet-b1.58.md
rename to github-data/issues/365 - Bug Updated BitNet arch bitnet-b1.58.md
index 78e48b09c..a47abf396 100644
--- a/github-data/issues/365 - Bug_ Updated BitNet arch bitnet-b1.58.md
+++ b/github-data/issues/365 - Bug Updated BitNet arch bitnet-b1.58.md
@@ -1,4 +1,4 @@
-### 🐛 [#365](https://github.com/ikawrakow/ik_llama.cpp/issues/365) - Bug: Updated BitNet arch bitnet-b1.58
+## 📌 [Issue #365](https://github.com/ikawrakow/ik_llama.cpp/issues/365) - Bug: Updated BitNet arch bitnet-b1.58
| **Author** | `jdluzen` |
| :--- | :--- |
@@ -8,13 +8,13 @@
---
-#### Description
+## 📄 Description
### What happened?
I'm very rusty at ggml, quants, etc. so please forgive my ignorance.
I've been attempting to get BitNet running, and by that I mean the _new_ BitNet as of April 23rd. MS uploaded a new version to HF, replacing the old one, and it seems to have breaking changes.
-From what I gather, #337 add support for the original 2025 BitNet with arch `bitnet-25`, but now the new one is `bitnet-b1.58`. I've been trying to add the changes from https://github.com/microsoft/BitNet/pull/212 with limited success. I'm also guessing that I need https://github.com/ggml-org/llama.cpp/compare/gg/bitnet since I am crashing because `vec_dot` is null at https://github.com/ikawrakow/ik_llama.cpp/blob/main/ggml/src/ggml.c#L14311 when `type` is `GGML_TYPE_I2_S` 36. Will try to get that implementation going next. I'm also on Windows arm64 which makes things more fun 😅
+From what I gather, [#337](https://github.com/ikawrakow/ik_llama.cpp/issues/337) add support for the original 2025 BitNet with arch `bitnet-25`, but now the new one is `bitnet-b1.58`. I've been trying to add the changes from https://github.com/microsoft/BitNet/pull/212 with limited success. I'm also guessing that I need https://github.com/ggml-org/llama.cpp/compare/gg/bitnet since I am crashing because `vec_dot` is null at https://github.com/ikawrakow/ik_llama.cpp/blob/main/ggml/src/ggml.c#L14311 when `type` is `GGML_TYPE_I2_S` 36. Will try to get that implementation going next. I'm also on Windows arm64 which makes things more fun 😅
Am I on the right track here?
### Name and Version
@@ -33,9 +33,9 @@ Windows
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **usatenko** commented the **2025-05-02** at **01:26:16**:
+👤 **usatenko** commented on **2025-05-02** at **01:26:16**
looks like I faced the same problem on macos, new ms model
`./bin/llama-quantize --allow-requantize models/ggml-model-i2_s.gguf ggml-model-i2_s_bn.gguf iq2_bn`
@@ -82,68 +82,23 @@ the model is from here: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf
---
-👤 **usatenko** commented the **2025-05-02** at **01:26:16**:
-
-looks like I faced the same problem on macos, new ms model
-`./bin/llama-quantize --allow-requantize models/ggml-model-i2_s.gguf ggml-model-i2_s_bn.gguf iq2_bn`
-```
-main: build = 3657 (98d16264)
-main: built with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
-main: quantizing 'models/ggml-model-i2_s.gguf' to 'ggml-model-i2_s_bn.gguf' as IQ2_BN
-llama_model_loader: loaded meta data with 24 key-value pairs and 332 tensors from models/ggml-model-i2_s.gguf (version GGUF V3 (latest))
-llama_model_loader: unknown type i2_s
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = bitnet-b1.58
-llama_model_loader: - kv 1: general.name str = bitnet2b
-llama_model_loader: - kv 2: bitnet-b1.58.vocab_size u32 = 128256
-llama_model_loader: - kv 3: bitnet-b1.58.context_length u32 = 4096
-llama_model_loader: - kv 4: bitnet-b1.58.embedding_length u32 = 2560
-llama_model_loader: - kv 5: bitnet-b1.58.block_count u32 = 30
-llama_model_loader: - kv 6: bitnet-b1.58.feed_forward_length u32 = 6912
-llama_model_loader: - kv 7: bitnet-b1.58.rope.dimension_count u32 = 128
-llama_model_loader: - kv 8: bitnet-b1.58.attention.head_count u32 = 20
-llama_model_loader: - kv 9: bitnet-b1.58.attention.head_count_kv u32 = 5
-llama_model_loader: - kv 10: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 11: bitnet-b1.58.attention.layer_norm_rms_epsilon f32 = 0.000010
-llama_model_loader: - kv 12: bitnet-b1.58.rope.freq_base f32 = 500000.000000
-llama_model_loader: - kv 13: general.file_type u32 = 40
-llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
-llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
-llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
-llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
-llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 128000
-llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 128001
-llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 128001
-llama_model_loader: - kv 22: tokenizer.chat_template str = {% for message in messages %}{% if lo...
-llama_model_loader: - kv 23: general.quantization_version u32 = 2
-llama_model_loader: - type f32: 121 tensors
-llama_model_loader: - type f16: 1 tensors
-llama_model_loader: - type i2_s: 210 tensors
-llama_model_quantize: failed to quantize: unknown model architecture: 'bitnet-b1.58'
-main: failed to quantize model from 'models/ggml-model-i2_s.gguf'
-```
-@ikawrakow can you help?
-
----
-
-👤 **saood06** commented the **2025-05-02** at **03:29:42**:
+👤 **saood06** commented on **2025-05-02** at **03:29:42**
I looked into this, and was able to reproduce and then port the commit that fixes it.
-I have made #366 that adds the new name.
+I have made [#366](https://github.com/ikawrakow/ik_llama.cpp/issues/366) that adds the new name.
I also confirmed that this is only a name change, as I ran gguf-hash.py on both the newly converted gguf based on the updated model and the one I had previously converted available [here](https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF/tree/main) and the hashes are the same.
---
-👤 **usatenko** commented the **2025-05-02** at **10:18:54**:
+👤 **usatenko** commented on **2025-05-02** at **10:18:54**
thank you, it works now
---
-👤 **jdluzen** commented the **2025-05-03** at **02:01:15**:
+👤 **jdluzen** commented on **2025-05-03** at **02:01:15**
Thanks, those were the changes that I was trying to implement. Glad to know it works for others.
I switched back to Winx64 for now, but it seems my problems could be more than just this. Is the original model supposed to just work out of the box? https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/tree/main
@@ -152,7 +107,7 @@ Using a debug build `llama-cli.exe -m ggml-model-i2_s.gguf -p "hi what are you"`
---
-👤 **ikawrakow** commented the **2025-05-03** at **06:41:56**:
+👤 **ikawrakow** commented on **2025-05-03** at **06:41:56**
The Microsoft model uses their own quantization type `I2_S`. To use it with `ik_llama.cpp` you need to convert it like this
```
diff --git a/github-data/issues/367 - Bug_ IQ1_S_R4_ IQ1_M_R4 failed on Qwen3-235B-A22B.md b/github-data/issues/367 - Bug IQ1_S_R4 IQ1_M_R4 failed on Qwen3-235B-A22B.md
similarity index 80%
rename from github-data/issues/367 - Bug_ IQ1_S_R4_ IQ1_M_R4 failed on Qwen3-235B-A22B.md
rename to github-data/issues/367 - Bug IQ1_S_R4 IQ1_M_R4 failed on Qwen3-235B-A22B.md
index 8f6ac2d10..485ea2794 100644
--- a/github-data/issues/367 - Bug_ IQ1_S_R4_ IQ1_M_R4 failed on Qwen3-235B-A22B.md
+++ b/github-data/issues/367 - Bug IQ1_S_R4 IQ1_M_R4 failed on Qwen3-235B-A22B.md
@@ -1,4 +1,4 @@
-### 🐛 [#367](https://github.com/ikawrakow/ik_llama.cpp/issues/367) - Bug: IQ1_S_R4, IQ1_M_R4 failed on Qwen3-235B-A22B
+## 📌 [Issue #367](https://github.com/ikawrakow/ik_llama.cpp/issues/367) - Bug: IQ1_S_R4, IQ1_M_R4 failed on Qwen3-235B-A22B
| **Author** | `Flying-Cloud` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -51,47 +51,47 @@ sti1 >= 0 && besti2 >= 0 && best_k >= 0) failed
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Flying-Cloud** commented the **2025-05-03** at **10:26:11**:
+👤 **Flying-Cloud** commented on **2025-05-03** at **10:26:11**
Oh I guess it because 1536 / 256 = 6 which is not divisible by 4?
---
-👤 **ikawrakow** commented the **2025-05-03** at **10:29:06**:
+👤 **ikawrakow** commented on **2025-05-03** at **10:29:06**
The number of rows must be a multiple of 4, not the number of blocks. Qwen3-235B-A22B should work with any `_R4` or `_R8` quant. The issue is in the quantization function itself. I'll look into it.
---
-👤 **ikawrakow** commented the **2025-05-03** at **11:01:47**:
+👤 **ikawrakow** commented on **2025-05-03** at **11:01:47**
-There is PR #368. Does it fix it? I cannot actually run such a large model (not enough RAM, not enough disk space), so it is a bit if a guessing game.
+There is PR [#368](https://github.com/ikawrakow/ik_llama.cpp/issues/368). Does it fix it? I cannot actually run such a large model (not enough RAM, not enough disk space), so it is a bit if a guessing game.
---
-👤 **Flying-Cloud** commented the **2025-05-03** at **11:32:22**:
+👤 **Flying-Cloud** commented on **2025-05-03** at **11:32:22**
-> There is PR [#368](https://github.com/ikawrakow/ik_llama.cpp/pull/368). Does it fix it? I cannot actually run such a large model (not enough RAM, not enough disk space), so it is a bit if a guessing game.
+> There is PR [[#368](https://github.com/ikawrakow/ik_llama.cpp/issues/368)](https://github.com/ikawrakow/ik_llama.cpp/pull/368). Does it fix it? I cannot actually run such a large model (not enough RAM, not enough disk space), so it is a bit if a guessing game.
It works! No longer error displayed. So what's the matter here. It seems like there are some near-zero weights in gate_proj weights?
---
-👤 **ikawrakow** commented the **2025-05-03** at **11:35:12**:
+👤 **ikawrakow** commented on **2025-05-03** at **11:35:12**
Either near zero weights, or the more tricky one, mismatching imatrix. Mismatching in the sense that the imatrix importances are zero where the model weights are not zero.
---
-👤 **Flying-Cloud** commented the **2025-05-03** at **11:37:10**:
+👤 **Flying-Cloud** commented on **2025-05-03** at **11:37:10**
Got it. It make sense since I notice the imatrix I downloaded from unsloth is computed through only 46 chunks. Thanks for your quick reply!
---
-👤 **Flying-Cloud** commented the **2025-05-03** at **15:36:56**:
+👤 **Flying-Cloud** commented on **2025-05-03** at **15:36:56**
Sorry to bother you again. I just found that IQ1_M_R4 fail in the deep layer of Qwen3-235B-A22B: blk.18.ffn_down_exps.weight
I try to revise the code from:
@@ -114,30 +114,7 @@ Still same Error as the issue begins.
---
-👤 **Flying-Cloud** commented the **2025-05-03** at **15:36:56**:
-
-Sorry to bother you again. I just found that IQ1_M_R4 fail in the deep layer of Qwen3-235B-A22B: blk.18.ffn_down_exps.weight
-I try to revise to code from:
-```python
-float sumwx = 0;
- for (int j = 0; j < kBlockSize; ++j) sumwx += weight[j]*std::abs(xb[j]);
- if (!sumwx) {
- for (int j = 0; j < kBlockSize; ++j) weight[j] = sqrt(sigma2 + xb[j]*xb[j]);
- }
-```
-to
-```python
-float sumwx = 0;
- for (int j = 0; j < kBlockSize; ++j) sumwx += weight[j];
- if (sumwx < 1e-3) {
- for (int j = 0; j < kBlockSize; ++j) weight[j] = sqrt(sigma2 + xb[j]*xb[j]);
- }
-```
-Still same Error as the issue begins.
-
----
-
-👤 **ikawrakow** commented the **2025-05-03** at **15:49:48**:
+👤 **ikawrakow** commented on **2025-05-03** at **15:49:48**
So, we need to see what these values are that cause the assert.
Just before
@@ -158,7 +135,7 @@ The strange part if that in the log that you posted above the assert is on line
---
-👤 **Flying-Cloud** commented the **2025-05-03** at **16:13:39**:
+👤 **Flying-Cloud** commented on **2025-05-03** at **16:13:39**
I apply this code, and the results are:
```
@@ -184,21 +161,21 @@ Values:
---
-👤 **ikawrakow** commented the **2025-05-03** at **16:15:48**:
+👤 **ikawrakow** commented on **2025-05-03** at **16:15:48**
Oh, I see. Give me a minute, I'll push a fix.
---
-👤 **ikawrakow** commented the **2025-05-03** at **16:31:08**:
+👤 **ikawrakow** commented on **2025-05-03** at **16:31:08**
-See #371.
+See [#371](https://github.com/ikawrakow/ik_llama.cpp/issues/371).
The issue was I checked for very small values in a block of 32 quants. But then we quantize 2 blocks of 16 each. Hence, it can happen that the block of 32 has non-zero values, but one of the blocks of 16 does not.
---
-👤 **Flying-Cloud** commented the **2025-05-03** at **16:50:07**:
+👤 **Flying-Cloud** commented on **2025-05-03** at **16:50:07**
```
[ 22/1131] blk.1.ffn_down_exps.weight - [ 1536, 4096, 128, 1], type = bf16, converting to iq1_m_r4 .. Failed to find optimum division
@@ -227,13 +204,13 @@ I guess should change "!sumwx" to "sumwx < {a small threshold}"?
---
-👤 **ikawrakow** commented the **2025-05-03** at **17:01:35**:
+👤 **ikawrakow** commented on **2025-05-03** at **17:01:35**
I pushed another attempt.
---
-👤 **Flying-Cloud** commented the **2025-05-03** at **17:29:44**:
+👤 **Flying-Cloud** commented on **2025-05-03** at **17:29:44**
I tried the new attempt and it overcomes the barrier of "blk.13 down_exps" and "blk.18. down_exps"
If this success with whole quantization process for Qwen3-235B, I will check the ppl to ensure that it functions well.
@@ -241,7 +218,7 @@ It might takes a few time and I will let you know right away
---
-👤 **whatever1983** commented the **2025-05-03** at **20:24:08**:
+👤 **whatever1983** commented on **2025-05-03** at **20:24:08**
Seriously, are you guys crazy to quant the Qwen3 series with IQ1S? I am having trouble generating a working python Tetris game using 30B-A3B using IQ5K that I am forced to use IQ6K. The Qwen3 is a regression many ways trying to use too little active parameters, the end result is that any quantization at all wrecks coding performance.
@@ -251,17 +228,7 @@ Jack Ma is too focused on proving to the market that making active parameters as
---
-👤 **whatever1983** commented the **2025-05-03** at **20:24:08**:
-
-Seriously, are you guys crazy to quant the Qwen3 series with IQ1S? I am having trouble generating a working python Tetris game using 30B-A3B using IQ5K that I am forced to use IQ6K. The Qwen3 is a regression many ways trying to use too little active parameters, the end result is that quanting at all recks coding performance.
-
-Just a interesting observation, DS 0324 IQ2M is able to generate a fully working Tetris that's way more beautiful.
-
-Jack Ma is too focused on proving to the market that making active parameters as little as possible is the way to greater AI, which is totally wrong. You know, shorting the US market as a way of payment for releasing shitty little models is not the way forward for better AI.
-
----
-
-👤 **Flying-Cloud** commented the **2025-05-04** at **04:11:04**:
+👤 **Flying-Cloud** commented on **2025-05-04** at **04:11:04**
> I tried the new attempt and it overcomes the barrier of "blk.13 down_exps" and "blk.18. down_exps" If this success with whole quantization process for Qwen3-235B, I will check the ppl to ensure that it functions well. It might takes a few time and I will let you know right away
@@ -290,7 +257,7 @@ I suspect that Jack Ma kept the more powerful Qwen-Max model internally while ch
---
-👤 **ikawrakow** commented the **2025-05-04** at **04:19:01**:
+👤 **ikawrakow** commented on **2025-05-04** at **04:19:01**
> BTW, thish push has a minor typo: "1e-14f" instead of "1e-14"
@@ -298,7 +265,7 @@ I suspect that Jack Ma kept the more powerful Qwen-Max model internally while ch
---
-👤 **ikawrakow** commented the **2025-05-04** at **04:27:20**:
+👤 **ikawrakow** commented on **2025-05-04** at **04:27:20**
> Seriously, are you guys crazy to quant the Qwen3 series with IQ1S? I am having trouble generating a working python Tetris game using 30B-A3B using IQ5K that I am forced to use IQ6K.
diff --git a/github-data/issues/373 - DeepSeekV3 0324 can_t load newest UD quants _with MLA_. Older quant wor.md b/github-data/issues/373 - DeepSeekV3 0324 cant load newest UD quants with MLA. Older quant works but with .md
similarity index 88%
rename from github-data/issues/373 - DeepSeekV3 0324 can_t load newest UD quants _with MLA_. Older quant wor.md
rename to github-data/issues/373 - DeepSeekV3 0324 cant load newest UD quants with MLA. Older quant works but with .md
index 5c345f875..a45fffb33 100644
--- a/github-data/issues/373 - DeepSeekV3 0324 can_t load newest UD quants _with MLA_. Older quant wor.md
+++ b/github-data/issues/373 - DeepSeekV3 0324 cant load newest UD quants with MLA. Older quant works but with .md
@@ -1,4 +1,4 @@
-### 📝 [#373](https://github.com/ikawrakow/ik_llama.cpp/issues/373) - DeepSeekV3 0324 can't load newest UD quants (with MLA). Older quant works but with slower pre processing than gen speed (CPU + CUDA)
+## 📌 [Issue #373](https://github.com/ikawrakow/ik_llama.cpp/issues/373) - DeepSeekV3 0324 can't load newest UD quants (with MLA). Older quant works but with slower pre processing than gen speed (CPU + CUDA)
| **Author** | `Panchovix` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
Hi there!
@@ -49,20 +49,20 @@ Ran it with
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **clockworkwhale** commented the **2025-05-04** at **01:38:06**:
+👤 **clockworkwhale** commented on **2025-05-04** at **01:38:06**
Confirmed I am also getting the exact same "check_tensor_dims: tensor 'blk.0.attn_q_b.weight' has wrong shape" error when attempting to load the newer quants with ik_llama.
---
-👤 **ikawrakow** commented the **2025-05-04** at **04:15:58**:
+👤 **ikawrakow** commented on **2025-05-04** at **04:15:58**
Please file an issue with mainline `llama.cpp` and/or the creators of the quantized model. MLA implementation existed here long before mainline `llama.cpp` had one, and they decided to make it incompatible with existing GGUFs. The implementation here works with the original GGUFs, and creates the tensors necessary for MLA on-the-fly during model load. The same could have (and should have) be done in mainline.
---
-👤 **Panchovix** commented the **2025-05-09** at **19:17:25**:
+👤 **Panchovix** commented on **2025-05-09** at **19:17:25**
Closing as it is fixed now on https://github.com/ikawrakow/ik_llama.cpp/commit/43a154d8b8b0e9217114577442cecb224a488d45
\ No newline at end of file
diff --git a/github-data/issues/376 - Bug_ unknown model architecture_ _deci_ _when loading Llama-3_1-Nemotro.md b/github-data/issues/376 - Bug unknown model architecture deci when loading Llama-3_1-Nemotron-Ultra-253B.md
similarity index 93%
rename from github-data/issues/376 - Bug_ unknown model architecture_ _deci_ _when loading Llama-3_1-Nemotro.md
rename to github-data/issues/376 - Bug unknown model architecture deci when loading Llama-3_1-Nemotron-Ultra-253B.md
index 00b5a809c..bdda42519 100644
--- a/github-data/issues/376 - Bug_ unknown model architecture_ _deci_ _when loading Llama-3_1-Nemotro.md
+++ b/github-data/issues/376 - Bug unknown model architecture deci when loading Llama-3_1-Nemotron-Ultra-253B.md
@@ -1,14 +1,15 @@
-### 🐛 [#376](https://github.com/ikawrakow/ik_llama.cpp/issues/376) - Bug: unknown model architecture: 'deci' (when loading Llama-3_1-Nemotron-Ultra-253B)
+## 📌 [Issue #376](https://github.com/ikawrakow/ik_llama.cpp/issues/376) - Bug: unknown model architecture: 'deci' (when loading Llama-3_1-Nemotron-Ultra-253B)
| **Author** | `Lissanro` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-05-04 |
| **Updated** | 2025-05-09 |
+| **Assignees** | `saood06` |
---
-#### Description
+## 📄 Description
### What happened?
@@ -86,21 +87,21 @@ munmap_chunk(): invalid pointer
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-04** at **09:35:03**:
+👤 **ikawrakow** commented on **2025-05-04** at **09:35:03**
I can take a look, but as with other giant models, I cannot test. Are you willing to test and provide benchmarks?
---
-👤 **saood06** commented the **2025-05-04** at **09:38:35**:
+👤 **saood06** commented on **2025-05-04** at **09:38:35**
I'm already working on it.
---
-👤 **Lissanro** commented the **2025-05-04** at **11:01:33**:
+👤 **Lissanro** commented on **2025-05-04** at **11:01:33**
> Are you willing to test and provide benchmarks?
@@ -110,7 +111,7 @@ As of benchmarks, at very least I planned to test input processing and output ge
---
-👤 **saood06** commented the **2025-05-04** at **11:07:21**:
+👤 **saood06** commented on **2025-05-04** at **11:07:21**
>Sure, I will be happy to test, at both short and log context lengths.
@@ -122,13 +123,13 @@ You can use sweep-bench to do that.
---
-👤 **Lissanro** commented the **2025-05-04** at **11:34:07**:
+👤 **Lissanro** commented on **2025-05-04** at **11:34:07**
I do not have the smaller model yet but I can try downloading it, for example from here https://huggingface.co/bartowski/Llama-3_1-Nemotron-51B-Instruct-GGUF (I only have 4G connection though and have some things still downloading, but I should be able to get the 51B within 2 days in case it will be needed for testing).
---
-👤 **saood06** commented the **2025-05-04** at **11:46:19**:
+👤 **saood06** commented on **2025-05-04** at **11:46:19**
>but I should be able to get the 51B within 2 days in case it will be needed for testing
diff --git a/github-data/issues/378 - Feature Request_ Use ik_llama.cpp with llama-cpp-python.md b/github-data/issues/378 - Feature Request Use ik_llama.cpp with llama-cpp-python.md
similarity index 72%
rename from github-data/issues/378 - Feature Request_ Use ik_llama.cpp with llama-cpp-python.md
rename to github-data/issues/378 - Feature Request Use ik_llama.cpp with llama-cpp-python.md
index cd37a7085..70e288a03 100644
--- a/github-data/issues/378 - Feature Request_ Use ik_llama.cpp with llama-cpp-python.md
+++ b/github-data/issues/378 - Feature Request Use ik_llama.cpp with llama-cpp-python.md
@@ -1,14 +1,15 @@
-### ✨ [#378](https://github.com/ikawrakow/ik_llama.cpp/issues/378) - Feature Request: Use ik_llama.cpp with llama-cpp-python
+## 📌 [Issue #378](https://github.com/ikawrakow/ik_llama.cpp/issues/378) - Feature Request: Use ik_llama.cpp with llama-cpp-python
| **Author** | `kadongre` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-05-04 |
| **Updated** | 2025-05-25 |
+| **Labels** | `enhancement`, `help wanted` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -34,27 +35,27 @@ Would be useful to leverage any of these mechanisms for ik_llama to utilize the
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-04** at **15:40:03**:
+👤 **ikawrakow** commented on **2025-05-04** at **15:40:03**
I'm not a Python person. `ik_llama.cpp` is a fork of `llama.cpp` and hence has inherited whatever Python bindings were there in June of last year. But I have no idea if they still work and, if not, what needs to get done.
---
-👤 **saood06** commented the **2025-05-04** at **16:28:41**:
+👤 **saood06** commented on **2025-05-04** at **16:28:41**
He is asking about `llama-cpp-python` which is it's own project that pulls in llama.cpp as a submodule: https://github.com/abetlen/llama-cpp-python/tree/main/vendor
---
-👤 **ikawrakow** commented the **2025-05-04** at **16:42:48**:
+👤 **ikawrakow** commented on **2025-05-04** at **16:42:48**
I see. Is it even possible to have `ik_llama.cpp` live as a sub-module in that project? Mainline has been very busy pushing pieces of code from here to there, renaming functions, changing interfaces for no actual benefit, etc. So, my guess is that it will not be easy, if it is even possible.
---
-👤 **Ph0rk0z** commented the **2025-05-14** at **16:17:57**:
+👤 **Ph0rk0z** commented on **2025-05-14** at **16:17:57**
Besides stuff like -ot and other new features, can just grab the revision from around the forking. IIRC, something around 3.0. They all have tags. Then it's a matter of adding most missing function names in ~2 places. Make it pull ik_llama instead of llama.cpp as the sub-module.
@@ -62,25 +63,19 @@ All the bindings do is call C++ functions from the library. Not sure why you'd w
---
-👤 **ikawrakow** commented the **2025-05-14** at **16:35:11**:
+👤 **ikawrakow** commented on **2025-05-14** at **16:35:11**
You want to do it?
---
-👤 **Ph0rk0z** commented the **2025-05-14** at **16:51:04**:
+👤 **Ph0rk0z** commented on **2025-05-14** at **16:51:04**
I was going to do it to maybe use ik_llama with textgen webui but it's a whole separate repo. Out of scope from here. It's been just as easy to run llama-server.. the only reason to bother is to use HF sampling instead of built in. IK is missing nsigma sampler and --cache-reuse stuff, textgen at least has context shifting in hf_llama.cpp mode.
---
-👤 **Ph0rk0z** commented the **2025-05-14** at **16:51:04**:
-
-I was going to do it to maybe use ik_llama with textgen webui but it's a whole separate repo. Out of scope from here. It's been just as easy to run llama-server.. the only reason to bother is to use HF sampling instead of built in. IK is missing nsigma and --cache-reuse stuff, textgen at least has context shifting in hf_llama.cpp mode.
-
----
-
-👤 **saood06** commented the **2025-05-25** at **05:05:19**:
+👤 **saood06** commented on **2025-05-25** at **05:05:19**
@ikawrakow
diff --git a/github-data/issues/379 - Bug_ Cannot build on WoA.md b/github-data/issues/379 - Bug Cannot build on WoA.md
similarity index 93%
rename from github-data/issues/379 - Bug_ Cannot build on WoA.md
rename to github-data/issues/379 - Bug Cannot build on WoA.md
index 7442a3886..8ed51fd9a 100644
--- a/github-data/issues/379 - Bug_ Cannot build on WoA.md
+++ b/github-data/issues/379 - Bug Cannot build on WoA.md
@@ -1,4 +1,4 @@
-### 🐛 [#379](https://github.com/ikawrakow/ik_llama.cpp/issues/379) - Bug: Cannot build on WoA
+## 📌 [Issue #379](https://github.com/ikawrakow/ik_llama.cpp/issues/379) - Bug: Cannot build on WoA
| **Author** | `jdluzen` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -54,9 +54,9 @@ Windows
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-05** at **05:08:38**:
+👤 **ikawrakow** commented on **2025-05-05** at **05:08:38**
The `ik_llama.cpp` build is less automated than mainline. I think you are the first to try building on Windows for ARM. You may need to manually specify the compiler options to make it work like this
```
diff --git a/github-data/issues/380 - Drop at the start of generation.md b/github-data/issues/380 - Drop at the start of generation.md
index 4f2b56447..63307493e 100644
--- a/github-data/issues/380 - Drop at the start of generation.md
+++ b/github-data/issues/380 - Drop at the start of generation.md
@@ -1,4 +1,4 @@
-### 📝 [#380](https://github.com/ikawrakow/ik_llama.cpp/issues/380) - Drop at the start of generation
+## 📌 [Issue #380](https://github.com/ikawrakow/ik_llama.cpp/issues/380) - Drop at the start of generation
| **Author** | `intulint` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
After the generation starts, the server crashes. This only happens on the Qwen3-30B-A3B, and I checked different quant. Regular dense models work, including other dense qwen3.
What could be the problem? I liked the acceleration in dense models, I thought moe would fly.
@@ -23,15 +23,15 @@ cmake --build ./build --config Release -j 16
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-05** at **05:12:28**:
+👤 **ikawrakow** commented on **2025-05-05** at **05:12:28**
Can you post the output of the above commands (including the `cmake` commands)? Thanks.
---
-👤 **intulint** commented the **2025-05-05** at **10:10:19**:
+👤 **intulint** commented on **2025-05-05** at **10:10:19**
Sure, but it turned out to be a lot of text. I also noticed that it takes a long time to assemble in a single thread of unicode.cpp
unicode-data.cpp. I don't know if this is normal or not.
@@ -1575,7 +1575,7 @@ PS C:\neuro\ik_llama.cpp\build\bin\Release>
---
-👤 **intulint** commented the **2025-05-05** at **10:14:49**:
+👤 **intulint** commented on **2025-05-05** at **10:14:49**
Even the benchmark crashes during generation. I don't know what the problem is, but it seems to be related to what happens during generation.
@@ -1703,7 +1703,7 @@ PS C:\neuro\ik_llama.cpp\build\bin\Release>
---
-👤 **ikawrakow** commented the **2025-05-05** at **10:22:33**:
+👤 **ikawrakow** commented on **2025-05-05** at **10:22:33**
Can you try running with `-t 8`?
@@ -1711,15 +1711,7 @@ If that works, try also adding `-fa -rtr -fmoe`.
---
-👤 **ikawrakow** commented the **2025-05-05** at **10:22:33**:
-
-Can you try running with `-t 8`?
-
-If that works, try also adding `-fa -rtr`.
-
----
-
-👤 **intulint** commented the **2025-05-05** at **10:42:45**:
+👤 **intulint** commented on **2025-05-05** at **10:42:45**
8 cores make no difference.
-fa -rtr -fmoe Finally it works, but I noticed that every time before writing a comma the generation stops for half a second. The first time I see this.
@@ -2289,7 +2281,7 @@ srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
---
-👤 **ikawrakow** commented the **2025-05-05** at **11:00:59**:
+👤 **ikawrakow** commented on **2025-05-05** at **11:00:59**
So, with `-rtr -fa -fmoe` it works, but TG is slow (slower than `llama.cpp`). How much slower?
What about prompt processing, or when you have a few thousand tokens in the KV cache?
@@ -2299,13 +2291,13 @@ Without these flags it does not work. If you try `-rtr -fmoe` and `-fa -fmoe` se
---
-👤 **intulint** commented the **2025-05-05** at **11:05:55**:
+👤 **intulint** commented on **2025-05-05** at **11:05:55**
The speeds are in my message above, it is of course long, but I tried to give all the information
---
-👤 **intulint** commented the **2025-05-05** at **11:15:26**:
+👤 **intulint** commented on **2025-05-05** at **11:15:26**
-fa -fmoe - works, but also pauses before displaying commas. The speed is also low
@@ -2316,7 +2308,7 @@ INFO [ print_timings] generation eval time = 40935.66 ms / 426 run
---
-👤 **ikawrakow** commented the **2025-05-05** at **11:15:51**:
+👤 **ikawrakow** commented on **2025-05-05** at **11:15:51**
Ah, OK. I see
* `ik_llama.cpp`: PP = 76.3 t/s (512 tokens), TG = 11.4 t/s (647 tokens)
@@ -2326,7 +2318,7 @@ Correct? I think it would be more fair to compare for the same (or at least simi
---
-👤 **intulint** commented the **2025-05-05** at **11:35:12**:
+👤 **intulint** commented on **2025-05-05** at **11:35:12**
llama.cpp ~ 1000 - 500
prompt eval time = 35744.63 ms / 1053 tokens ( 33.95 ms per token, 29.46 tokens per second)
@@ -2339,7 +2331,7 @@ INFO [ print_timings] generation eval time = 40472.90 ms / 422 run
---
-👤 **ikawrakow** commented the **2025-05-05** at **11:41:03**:
+👤 **ikawrakow** commented on **2025-05-05** at **11:41:03**
OK, thanks. I'll look into the failure without flash attention.
@@ -2349,13 +2341,13 @@ Sorry for asking, but in what language is your conversation? I'm asking because
---
-👤 **intulint** commented the **2025-05-05** at **11:43:33**:
+👤 **intulint** commented on **2025-05-05** at **11:43:33**
This is a good question, I somehow didn't pay attention to what language the pauses in generation are in. Usually Russian, but also English. I'll check now. We need generation in English, right? Or is it important that the entire context is in one language?
---
-👤 **ikawrakow** commented the **2025-05-05** at **11:46:02**:
+👤 **ikawrakow** commented on **2025-05-05** at **11:46:02**
> Or is it important that the entire context is in one language?
@@ -2363,25 +2355,25 @@ I don't know. Just looking for clues what could be slowing it down.
---
-👤 **intulint** commented the **2025-05-05** at **11:54:19**:
+👤 **intulint** commented on **2025-05-05** at **11:54:19**
I launched it only in English and looked more closely, a pause in generation appears after or before the comma is displayed. It lasts a noticeable fraction of a second, and generation continues. Usually in such places - "Okay, the", "So, if", "than B, the"
---
-👤 **intulint** commented the **2025-05-05** at **11:56:28**:
+👤 **intulint** commented on **2025-05-05** at **11:56:28**
To avoid confusion, I checked in 2 frontends. I noticed pauses only on commas.
---
-👤 **ikawrakow** commented the **2025-05-05** at **11:57:24**:
+👤 **ikawrakow** commented on **2025-05-05** at **11:57:24**
Interesting. I don't observe such effects on my Linux box. Are the sampling parameters exactly the same?
---
-👤 **intulint** commented the **2025-05-05** at **12:01:40**:
+👤 **intulint** commented on **2025-05-05** at **12:01:40**
In the native front the servers are standard as far as I understand. I only changed the max tokens when measuring the speed. It didn't affect the pauses.
@@ -2391,13 +2383,13 @@ In the native front the servers are standard as far as I understand. I only chan
---
-👤 **intulint** commented the **2025-05-05** at **12:16:22**:
+👤 **intulint** commented on **2025-05-05** at **12:16:22**
Maybe it's a compiler version? I don't know much, but as I understand it, a fresh one was used during assembly. I remember there were messages during assembly about changing the format of variables and that data loss could occur.
---
-👤 **ikawrakow** commented the **2025-05-05** at **12:17:11**:
+👤 **ikawrakow** commented on **2025-05-05** at **12:17:11**
For reference, here is what I get on my vanilla AVX2 Linux box using 8 threads with the commands
```
@@ -2427,19 +2419,19 @@ Not sure how to debug. I don't have access to a Windows box.
---
-👤 **intulint** commented the **2025-05-05** at **12:23:26**:
+👤 **intulint** commented on **2025-05-05** at **12:23:26**
Got it. I'll try to figure out how and by how much to downgrade the compiler, maybe that will help. If not, I don't know what to do next, I'll run it with llama.cpp.
---
-👤 **ikawrakow** commented the **2025-05-05** at **12:31:36**:
+👤 **ikawrakow** commented on **2025-05-05** at **12:31:36**
You can try building with `GCC or clang`. I cannot give you instructions how one does that as it is a long time since I last did that, so I have forgotten. But IIRC, the GCC build was running ~40% faster than the MSVC build. It wasn't an LLM, but it did involve algorithms with heavy number crunching. It must have been around 2017-2018, so don't know if MSVC has improved since then.
---
-👤 **intulint** commented the **2025-05-05** at **12:33:50**:
+👤 **intulint** commented on **2025-05-05** at **12:33:50**
>Is the llama.cpp build done with MSVC or with GCC/clang?
@@ -2460,33 +2452,13 @@ PS C:\neuro\ik_llama.cpp\build\bin\Release> .\llama-sweep-bench.exe -m F:\llm\Qw
---
-👤 **intulint** commented the **2025-05-05** at **12:33:50**:
-
->Is the llama.cpp build done with MSVC or with GCC/clang?
-I have written a script that downloads the latest official releases; I have never compiled such large projects myself before.
-
-By the way, yes, we found the parameters under which it starts.
-PS C:\neuro\ik_llama.cpp\build\bin\Release> .\llama-sweep-bench.exe -m F:\llm\Qwen3-30B-A3B-Q5_K_M.gguf -c 4096 -t 8 -fa -fmoe
-| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
-|-------|--------|--------|----------|----------|----------|----------|
-| 512 | 128 | 0 | 9.384 | 54.56 | 8.596 | 14.89 |
-| 512 | 128 | 512 | 10.704 | 47.83 | 8.700 | 14.71 |
-| 512 | 128 | 1024 | 10.833 | 47.26 | 8.572 | 14.93 |
-| 512 | 128 | 1536 | 11.697 | 43.77 | 8.849 | 14.47 |
-| 512 | 128 | 2048 | 12.257 | 41.77 | 9.372 | 13.66 |
-| 512 | 128 | 2560 | 13.290 | 38.53 | 9.859 | 12.98 |
-| 512 | 128 | 3072 | 14.514 | 35.28 | 11.724 | 10.92 |
-| 512 | 128 | 3584 | 14.406 | 35.54 | 10.795 | 11.86 |
-
----
-
-👤 **intulint** commented the **2025-05-05** at **12:35:11**:
+👤 **intulint** commented on **2025-05-05** at **12:35:11**
Got it, I'll try it in the evening if I figure it out.
---
-👤 **ikawrakow** commented the **2025-05-05** at **12:46:18**:
+👤 **ikawrakow** commented on **2025-05-05** at **12:46:18**
You didn't say what your CPU was, so here another reference point from me on a more recent CPU (Ryzen-7950X). Again using 8 threads to be comparable to yours, same command as above:
@@ -2531,13 +2503,13 @@ In comparison, mainline `llama.cpp` on the same computer (just pulled and rebuil
---
-👤 **intulint** commented the **2025-05-05** at **12:59:45**:
+👤 **intulint** commented on **2025-05-05** at **12:59:45**
Ah, indeed. This is an assembly on an old server processor 1660v4 with 4 memory channels, 32 GB in total. The speeds during generation are quite good, since the memory gives somewhere around 55 GB/s. Of course, this is not comparable with modern processors.
---
-👤 **saood06** commented the **2025-05-05** at **22:30:50**:
+👤 **saood06** commented on **2025-05-05** at **22:30:50**
> You can try building with `GCC or clang`. I cannot give you instructions how one does that as it is a long time since I last did that, so I have forgotten.
@@ -2545,7 +2517,7 @@ The easiest way I found to use non MSVC to compile this on Windows was with http
---
-👤 **alex1284B** commented the **2025-05-14** at **16:37:33**:
+👤 **alex1284B** commented on **2025-05-14** at **16:37:33**
I think I have a similar problem, Qwen3 does not produce valid output after two lines of tokens, I tried different quantz IQ_K Q6, the same problems. But Qwen2.5 is fine. Base llama.cpp works fine also. Linux, only CPU.
I'm not sure but the line of samplers is different than base llama.cpp
@@ -2717,7 +2689,7 @@ llama_print_timings: total time = 61937,29 ms / 1527 tokens`
---
-👤 **ikawrakow** commented the **2025-05-14** at **16:57:33**:
+👤 **ikawrakow** commented on **2025-05-14** at **16:57:33**
@alex1284B
@@ -2725,12 +2697,12 @@ I tried your prompt and I see that it does not work. But of you add `-fa -fmoe`,
---
-👤 **alex1284B** commented the **2025-05-14** at **17:23:47**:
+👤 **alex1284B** commented on **2025-05-14** at **17:23:47**
Thank you, I probably missed these options for starting. My bad.
---
-👤 **ikawrakow** commented the **2025-05-25** at **07:10:25**:
+👤 **ikawrakow** commented on **2025-05-25** at **07:10:25**
-Closed via #420
\ No newline at end of file
+Closed via [#420](https://github.com/ikawrakow/ik_llama.cpp/issues/420)
\ No newline at end of file
diff --git a/github-data/issues/381 - ik_llama.cpp_ggml_src_ggml-cuda_fattn.cu_66_ fatal error after latest.md b/github-data/issues/381 - ik_llama.cppggmlsrcggml-cudafattn.cu66 fatal error after latest.md
similarity index 86%
rename from github-data/issues/381 - ik_llama.cpp_ggml_src_ggml-cuda_fattn.cu_66_ fatal error after latest.md
rename to github-data/issues/381 - ik_llama.cppggmlsrcggml-cudafattn.cu66 fatal error after latest.md
index db9b1c7e0..200ebb71a 100644
--- a/github-data/issues/381 - ik_llama.cpp_ggml_src_ggml-cuda_fattn.cu_66_ fatal error after latest.md
+++ b/github-data/issues/381 - ik_llama.cppggmlsrcggml-cudafattn.cu66 fatal error after latest.md
@@ -1,4 +1,4 @@
-### 📝 [#381](https://github.com/ikawrakow/ik_llama.cpp/issues/381) - ik_llama.cpp/ggml/src/ggml-cuda/fattn.cu:66: fatal error after latest
+## 📌 [Issue #381](https://github.com/ikawrakow/ik_llama.cpp/issues/381) - ik_llama.cpp/ggml/src/ggml-cuda/fattn.cu:66: fatal error after latest
| **Author** | `nux` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
did git pull and tried llama-bench:
~/dev/ik_llama.cpp $ ./build/bin/llama-bench -m /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf -p 512 -t 32 -mla 2 -fa 1 -fmoe 1 -ngl 99 --override-tensor "exps=CPU" -amb 512 -ctk q8_0 -ctv q8_0
@@ -75,17 +75,17 @@ Can give more info if needed. Tried to put this on reddit post but got "Server e
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-05** at **05:43:02**:
+👤 **ikawrakow** commented on **2025-05-05** at **05:43:02**
-Thank you for the bug report. PR #370 broke it. Can you check if it works for you now? Thanks.
+Thank you for the bug report. PR [#370](https://github.com/ikawrakow/ik_llama.cpp/issues/370) broke it. Can you check if it works for you now? Thanks.
As a side note: The row-interleaved quants (`*_R4, *_R8`) are not ideal when running on the GPU as there is no CUDA support for them. The effect will be that all calculations will be run on the CPU, and your GPU will be acting as a very expensive RAM module. If you are using partial offload to the GPU, the better option is to use a model without row-interleaved quants, and to specify `-rtr` on the command line. In that case, the tensors that are not offloaded to the GPU will get run-time repacked to row-interleaved for better performance (but this will make model loading time longer).
---
-👤 **nux** commented the **2025-05-05** at **06:55:21**:
+👤 **nux** commented on **2025-05-05** at **06:55:21**
I rebuilt with the latest changes and it works
@@ -95,23 +95,13 @@ Will consider bug report closed - thanks!
---
-👤 **nux** commented the **2025-05-05** at **06:55:21**:
-
-I rebuilt with the latest changes and it works
-
-On that side note - I've stuck with ubergarm/DeepSeek-V3-0324-GGUF IQ4_K_R4 as it's worked. Would love to hear recommendation on what I should look into or direction I should go for a 768GB ram 3090 setup. Still quite new to this.
-
-Will consider bug report closed - thanks!
-
----
-
-👤 **ikawrakow** commented the **2025-05-05** at **07:09:19**:
+👤 **ikawrakow** commented on **2025-05-05** at **07:09:19**
If you are new to this and don't want to get involved with making your own quantized models, perhaps we should ask @ubergarm to publish his models without row interleaving so they can be run efficiently with with full/partial GPU offload.
---
-👤 **ikawrakow** commented the **2025-05-05** at **07:20:54**:
+👤 **ikawrakow** commented on **2025-05-05** at **07:20:54**
What you can try in the meantime is to see if you get better performance by running CPU-only.
@@ -123,7 +113,7 @@ and then run as you have done above but without the `-ngl 99` argument and using
---
-👤 **nux** commented the **2025-05-05** at **14:49:30**:
+👤 **nux** commented on **2025-05-05** at **14:49:30**
I will look into making my own quantized models
@@ -140,7 +130,7 @@ Thanks!
---
-👤 **ikawrakow** commented the **2025-05-05** at **15:04:10**:
+👤 **ikawrakow** commented on **2025-05-05** at **15:04:10**
> Did something change or a misunderstanding somewhere?
@@ -148,6 +138,6 @@ Oh, I see these have all attention tensors quantized with `Q8_0`. Sorry, didn't
---
-👤 **ubergarm** commented the **2025-05-05** at **15:21:39**:
+👤 **ubergarm** commented on **2025-05-05** at **15:21:39**
Thanks, yeah going forward I've started to release non-repacked quants as a lot of multi-gpu people were complaining. Then folks who want can offline-repack themselves which seems a bit more flexible for general audience.
\ No newline at end of file
diff --git a/github-data/issues/383 - Bug_ Loading DeepSeek R1T Chimera causes _llama_model_load_ error loadi.md b/github-data/issues/383 - Bug Loading DeepSeek R1T Chimera causes llama_model_load error loading model che.md
similarity index 67%
rename from github-data/issues/383 - Bug_ Loading DeepSeek R1T Chimera causes _llama_model_load_ error loadi.md
rename to github-data/issues/383 - Bug Loading DeepSeek R1T Chimera causes llama_model_load error loading model che.md
index 307b11de7..5f1f35270 100644
--- a/github-data/issues/383 - Bug_ Loading DeepSeek R1T Chimera causes _llama_model_load_ error loadi.md
+++ b/github-data/issues/383 - Bug Loading DeepSeek R1T Chimera causes llama_model_load error loading model che.md
@@ -1,14 +1,14 @@
-### 🐛 [#383](https://github.com/ikawrakow/ik_llama.cpp/issues/383) - Bug: Loading DeepSeek R1T Chimera causes \"llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q_b.weight' has wrong shape; expected 1536, 73728, got 1536, 24576, 1, 1\"
+## 📌 [Issue #383](https://github.com/ikawrakow/ik_llama.cpp/issues/383) - Bug: Loading DeepSeek R1T Chimera causes "llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q_b.weight' has wrong shape; expected 1536, 73728, got 1536, 24576, 1, 1"
| **Author** | `Alexey-Akishin` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-05-06 |
-| **Updated** | 2025-06-01 |
+| **Updated** | 2025-07-25 |
---
-#### Description
+## 📄 Description
### What happened?
@@ -163,9 +163,9 @@ llama_load_model_from_file: failed to load model
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-05-06** at **00:29:56**:
+👤 **saood06** commented on **2025-05-06** at **00:29:56**
The reason you are seeing an error is the MLA implementation here and in mainline is no longer compatible, and the model linked is using the incompatible MLA implementation. We support creating the MLA tensors on the fly for models that existed before the MLA implementation or models that are converted using convert_hf_to_gguf.py from this repo, where it will add the MLA tensors used here.
@@ -173,13 +173,7 @@ If you want to use the model you can by directly converting from https://hugging
---
-👤 **saood06** commented the **2025-05-06** at **00:29:56**:
-
-The MLA implementation here and in mainline is no longer compatible. We support creating the MLA tensors on the fly for models that existed before the MLA implementation or models that are converted using convert_hf_to_gguf.py from this repo, where it will add the MLA tensors used here.
-
----
-
-👤 **Alexey-Akishin** commented the **2025-05-06** at **00:59:06**:
+👤 **Alexey-Akishin** commented on **2025-05-06** at **00:59:06**
Oh, I see. Is there a way to somehow salvage https://huggingface.co/bullerwins/DeepSeek-R1T-Chimera-GGUF/tree/main/DeepSeek-R1T-Chimera-Q4_K_M quant, either remove or convert incompatible MLA tensors? Maybe there is a way to upconvert it to bf16 and then from there to the quant I need, or that wouldn't work? Unfortunately no other quants on huggingface exist.
@@ -189,17 +183,7 @@ If nothing can be done and it is not a bug, I understand, but I suggest consider
---
-👤 **Alexey-Akishin** commented the **2025-05-06** at **00:59:06**:
-
-Oh, I see. https://huggingface.co/tngtech/DeepSeek-R1T-Chimera seems 163 files, mostly 4.3 GB in size, so about 700GB or half a month of downloading non-stop in my case, or maybe two months if I get speed limited for the rest of the month (since I already made multiple downloads this months and have only 1TB traffic limit before speed is limited).
-
-Is there a way to somehow salvage https://huggingface.co/bullerwins/DeepSeek-R1T-Chimera-GGUF/tree/main/DeepSeek-R1T-Chimera-Q4_K_M quant, either remove or convert incompatible MLA tensors? Maybe upconvert it to bf16 and then from there to the quant I need, or that wouldn't work? Unfortunately no other quants on huggingface exist.
-
-If nothing can be done and it is not a bug, I understand, but I suggest considering adding a clear error message, so it would be easier for users to understand that they are trying to run incompatible quant.
-
----
-
-👤 **saood06** commented the **2025-05-06** at **01:14:23**:
+👤 **saood06** commented on **2025-05-06** at **01:14:23**
> Oh, I see. Is there a way to somehow salvage https://huggingface.co/bullerwins/DeepSeek-R1T-Chimera-GGUF/tree/main/DeepSeek-R1T-Chimera-Q4_K_M quant, either remove or convert incompatible MLA tensors?
@@ -221,7 +205,7 @@ We may end up doing that, I know for now the README for this repo mentions it sa
---
-👤 **Lissanro** commented the **2025-05-06** at **05:08:04**:
+👤 **Lissanro** commented on **2025-05-06** at **05:08:04**
I downloaded the same not compatible quant few days ago, but seeing this bug report inspired me to create a request for the quant creator to consider create one that is compatible with ik_llama.cpp https://huggingface.co/bullerwins/DeepSeek-R1T-Chimera-GGUF/discussions/1 (I figured if it is not just me who needs it, maybe they will consider it). I am yet to download full version of it to make my own quant (I will not be able to upload it though, since I have less than 1 Mbps for upload but around 10-40 Mbps for download).
@@ -231,7 +215,7 @@ Given how much ik_llama.cpp implementation is more mature and faster (by more th
---
-👤 **saood06** commented the **2025-05-06** at **05:37:18**:
+👤 **saood06** commented on **2025-05-06** at **05:37:18**
> I downloaded the same not compatible quant few days ago, but seeing this bug report inspired me to create a request for the quant creator to consider create one that is compatible with ik_llama.cpp https://huggingface.co/bullerwins/DeepSeek-R1T-Chimera-GGUF/discussions/1 (I figured if it is not just me who needs it, maybe they will consider it).
@@ -249,7 +233,7 @@ It is because less people know about and thus use ik_llama.cpp. It also doesn't
---
-👤 **ikawrakow** commented the **2025-05-06** at **05:38:30**:
+👤 **ikawrakow** commented on **2025-05-06** at **05:38:30**
> If nothing can be done and it is not a bug, I understand, but I suggest considering adding a clear error message, so it would be easier for users to understand that they are trying to run incompatible quant.
@@ -259,7 +243,7 @@ I personally find the approach taken in mainline llama.cpp plain irresponsible.
---
-👤 **saood06** commented the **2025-05-06** at **05:42:15**:
+👤 **saood06** commented on **2025-05-06** at **05:42:15**
> This is why I added the IMPORTANT note on the ik_llama.cpp main page, hoping to prevent at least some users wasting their time and traffic limits downloading a giant incompatible model.
@@ -269,15 +253,7 @@ Edit: Thanks for fixing it.
---
-👤 **saood06** commented the **2025-05-06** at **05:42:15**:
-
-> This is why I added the IMPORTANT note on the ik_llama.cpp main page, hoping to prevent at least some users wasting their time and traffic limits downloading a giant incompatible model.
-
-Minor note, there are some typos in that note: "scrip" and "safetnosrs".
-
----
-
-👤 **Lissanro** commented the **2025-05-06** at **07:37:41**:
+👤 **Lissanro** commented on **2025-05-06** at **07:37:41**
> I still think a a script somewhat inspired by https://huggingface.co/stduhpf/google-gemma-3-27b-it-qat-q4_0-gguf-small/blob/main/swap_embeds.py could remove the incorrect tensors
@@ -291,21 +267,7 @@ It may take many days before I have the full Chimera, but if I will figure out a
---
-👤 **Lissanro** commented the **2025-05-06** at **07:37:41**:
-
-> I still think a a script somewhat inspired by https://huggingface.co/stduhpf/google-gemma-3-27b-it-qat-q4_0-gguf-small/blob/main/swap_embeds.py could remove the incorrect tensors
-
-I tested it further, and I think a script will not help in this case. Even though https://huggingface.co/bullerwins/DeepSeek-R1T-Chimera-GGUF/tree/main/DeepSeek-R1T-Chimera-Q4_K_M is similar in size to old Q4_K_M quant of R1 by Unsloth (when they did not have UD or XL versions of it), the quality is much lower. It failed many tests, most of my tests are specific to my real world use cases, but some are generic public tests or common questions, for example easiest one to check and that reveals quantization degradation in reasoning models very well, is the [maze test](https://www.reddit.com/r/LocalLLaMA/comments/1j4lqe6/test_if_your_api_provider_is_quantizing_your/) - Chimera at OpenRouter passes it, and so does Q4_K_M quant of R1 from Unsloth, but https://huggingface.co/bullerwins/DeepSeek-R1T-Chimera-GGUF/tree/main/DeepSeek-R1T-Chimera-Q4_K_M consistently fails it.
-
-The point is, even if it was possible to somehow recover this Q4 quant to make it work with ik_llama.cpp, its quality is very bad, so it still would be necessary to recreate it from scratch. I guess I just keep downloading the full version via my 4G connection and hope the provider will not limit my speed.
-
-So far, I only created my own repacked quants for ik_llama.cpp, but not from scratch (last time I checked, on the fly conversion was disabling mmap, so I had to repack R1 and V3 quants to use them without performance loss). I know I will need to convert to bf16 first, but I am not yet sure how to create proper quant that would be comparable to UD-Q4_K_XL from Unsloth in quality. I plan to go through some articles Unsloth posted, maybe they shared how they did it.
-
-It may take many days before I have the full Chimera, but if I will figure out a set of commands to convert to a good ik_llama.cpp quant, I will share here (if this discussion is closed by then, then I will just edit my existing message to add the info to avoid reopening it).
-
----
-
-👤 **ikawrakow** commented the **2025-05-06** at **07:54:50**:
+👤 **ikawrakow** commented on **2025-05-06** at **07:54:50**
> is similar in size to old Q4_K_M quant of R1 by Unsloth (when they did not have UD XL version of it), the quality is much lower.
@@ -313,7 +275,7 @@ This is because the `llama.cpp` experts who decided that breaking backwards comp
---
-👤 **saood06** commented the **2025-05-06** at **10:05:33**:
+👤 **saood06** commented on **2025-05-06** at **10:05:33**
> but I am not yet sure how to create proper quant that would be comparable to UD-Q4_K_XL from Unsloth in quality. I plan to go through some articles Unsloth posted, maybe they shared how they did it.
>
@@ -323,7 +285,7 @@ I would recommend actually looking into the quant types that are exclusive to th
---
-👤 **Ph0rk0z** commented the **2025-05-06** at **12:27:36**:
+👤 **Ph0rk0z** commented on **2025-05-06** at **12:27:36**
I too want to try this model with it's selective thinking. I rather download it than R1 or V3 alone.
@@ -333,7 +295,7 @@ Deepseek v2.5 should be safe, right? https://huggingface.co/bartowski/DeepSeek-V
---
-👤 **city96** commented the **2025-05-08** at **11:41:22**:
+👤 **city96** commented on **2025-05-08** at **11:41:22**
@saood06
> You probably could make a script that does that (I have been meaning to make one that merges my V3 and R1 GGUF in the same way chimera does to avoid downloading it since as you know these models are large).
@@ -344,7 +306,7 @@ It's based on the discussion on [HuggingFace](https://huggingface.co/tngtech/Dee
---
-👤 **saood06** commented the **2025-05-08** at **22:12:47**:
+👤 **saood06** commented on **2025-05-08** at **22:12:47**
> Not 100% sure if it's correct but I made a script that attempts to do that - [GitHub Gist](https://gist.github.com/city96/a05cb7ec6664a5085efb007497f2049b).
>
@@ -354,7 +316,7 @@ Thank you for this. I saw the beginning of that discussion but I hadn't checked
---
-👤 **Lissanro** commented the **2025-05-09** at **02:12:03**:
+👤 **Lissanro** commented on **2025-05-09** at **02:12:03**
I finally finished downloading unquantized Chimera, but cannot figure out how to convert it to BF16 in order to generate my own quants for ik_llama.cpp. I would greatly appreciate if anybody have any idea how to do it?
@@ -396,7 +358,7 @@ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 d
---
-👤 **saood06** commented the **2025-05-09** at **02:20:14**:
+👤 **saood06** commented on **2025-05-09** at **02:20:14**
>I finally finished downloading unquantized Chimera, but cannot figure out how to convert it to BF16 in order to generate my own quants for ik_llama.cpp. I would greatly appreciate if anybody have any idea how to do it?
@@ -406,17 +368,7 @@ I mentioned this before but I'll repeat since I think it still holds true, I've
---
-👤 **saood06** commented the **2025-05-09** at **02:20:14**:
-
->I finally finished downloading unquantized Chimera, but cannot figure out how to convert it to BF16 in order to generate my own quants for ik_llama.cpp. I would greatly appreciate if anybody have any idea how to do it?
-
-The solution I've given others and have used myself is to use this method https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6.
-
-I mentioned this before but I'll repeat since I think it still holds true, I've thought about porting that here but the triton dependence adds more complication than I think it is worth for most people, when more fp8 native models are released, I think something along the lines of https://github.com/ggml-org/llama.cpp/pull/10055 is the best path forward.
-
----
-
-👤 **Lissanro** commented the **2025-05-09** at **05:58:13**:
+👤 **Lissanro** commented on **2025-05-09** at **05:58:13**
It seems the tutorial is outdated. Just creating venv on the next step produces errors about not being able to satisfy dependencies, do you know by any chance what Python version was recommended at the time the tutorial was written? On Ubuntu 25.04, Python 3.13 is the default, but it did not work, failing to satisfy some dependencies. So I tried from scratch with older version of Python:
@@ -469,7 +421,7 @@ I will keep trying to find a solution and if I find one, I will share here. If s
---
-👤 **saood06** commented the **2025-05-09** at **06:34:18**:
+👤 **saood06** commented on **2025-05-09** at **06:34:18**
> It seems the tutorial is outdated. Just creating venv on the next step produces errors about not being able to satisfy dependencies, do you know by any chance what Python version was recommended at the time the tutorial was written? On Ubuntu 25.04, Python 3.13 is the default, but it did not work, failing to satisfy some dependencies.
@@ -533,81 +485,7 @@ index 25f5b017..615457d1 100644
```
->I tried looking into your second link, but it seems the patch wasn't updated in a while and no longer applies to llama.cpp, I tried few different old commits but could not find one yet where it applies successfully. Maybe I need to try even older llama.cpp commits, but not sure, if I go too far into the past, would it even support DeepSeek V3 architecture to convert to BF16? I also could not find any example command how to convert using [ggml-org/llama.cpp#10055](https://github.com/ggml-org/llama.cpp/pull/10055) - maybe it is something obvious I missed, perhaps because I never created GGUF before.
-
-I am sorry, I did not link that for you to use, just as a reference to what I see as a better long term solution to the greater issue of handling fp8 native models would be.
-
-> I will keep trying to find a solution and if I find one, I will share here. If someone has any ideas or an advice, I would appreciate it greatly.
-
-If you feel like trying one more time with triton (and no guarantees that it will work), you can try building the commit I was on (with my changes) on 3.13 and see if that works for you?
-
----
-
-👤 **saood06** commented the **2025-05-09** at **06:34:18**:
-
-> It seems the tutorial is outdated. Just creating venv on the next step produces errors about not being able to satisfy dependencies, do you know by any chance what Python version was recommended at the time the tutorial was written? On Ubuntu 25.04, Python 3.13 is the default, but it did not work, failing to satisfy some dependencies.
-
-I do not, but I know the system where I used triton for this is 3.13.
-
-> It seems you were right about triton dependency adding complications...
-
-I am not happy to be proven right. I ran into some complications myself (but was able to get past them), but up till now I've never had someone I recommended this solution not work for them (which is why I kept recommending it even if I don't think it is the ideal solution). I am really sorry if I wasted your time with something that didn't work for you.
-
-Taking a look at my install `pip list` has:
-
-`triton 3.2.0+git4ce833eb [local path]`
-
-(more specifically this commit hash 4ce833ebbce7b91564d7cc1f30573eb1129629f9)
-
-Looking at the path it was installed and doing a git diff (since I remember having to change things in order to get it to compile, sorry I normally have full logs of what I do but the ones for this session is one of the ones I do not have)
-
-```diff
-diff --git a/CMakeLists.txt b/CMakeLists.txt
-index de6ed239..d8cadd8b 100644
---- a/CMakeLists.txt
-+++ b/CMakeLists.txt
-@@ -143,7 +143,7 @@ endfunction()
-
- # Disable warnings that show up in external code (gtest;pybind11)
- if(NOT MSVC)
-- set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Werror -Wno-covered-switch-default -fvisibility=hidden")
-+ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-covered-switch-default -fvisibility=hidden")
- else()
- set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /wd4244 /wd4624 /wd4715 /wd4530")
- endif()
-diff --git a/third_party/cpu/CMakeLists.txt b/third_party/cpu/CMakeLists.txt
-index 25f5b017..615457d1 100644
---- a/third_party/cpu/CMakeLists.txt
-+++ b/third_party/cpu/CMakeLists.txt
-@@ -1,14 +1,14 @@
- # Find OneDNN ukernel library
--find_package(dnnl CONFIG)
--if (dnnl_FOUND)
-- message(STATUS "Found OneDNN/DNNL")
-- add_compile_definitions(ONEDNN_AVAILABLE)
-- get_target_property(dnnl_include DNNL::dnnl INTERFACE_INCLUDE_DIRECTORIES)
-- # currently used only in triton_cpu.cc and in ConvertDotToOneDNN
-- include_directories(${dnnl_include})
--else ()
-- message(STATUS "Could NOT find OneDNN/DNNL")
--endif()
-+#find_package(dnnl CONFIG)
-+#if (dnnl_FOUND)
-+# message(STATUS "Found OneDNN/DNNL")
-+# add_compile_definitions(ONEDNN_AVAILABLE)
-+# get_target_property(dnnl_include DNNL::dnnl INTERFACE_INCLUDE_DIRECTORIES)
-+# # currently used only in triton_cpu.cc and in ConvertDotToOneDNN
-+# include_directories(${dnnl_include})
-+#else ()
-+# message(STATUS "Could NOT find OneDNN/DNNL")
-+#endif()
-
- # Find XSMM ukernel library
- find_library(LIBXSMM xsmm
-
-```
-
->I tried looking into your second link, but it seems the patch wasn't updated in a while and no longer applies to llama.cpp, I tried few different old commits but could not find one yet where it applies successfully. Maybe I need to try even older llama.cpp commits, but not sure, if I go too far into the past, would it even support DeepSeek V3 architecture to convert to BF16? I also could not find any example command how to convert using [ggml-org/llama.cpp#10055](https://github.com/ggml-org/llama.cpp/pull/10055) - maybe it is something obvious I missed, perhaps because I never created GGUF before.
+>I tried looking into your second link, but it seems the patch wasn't updated in a while and no longer applies to llama.cpp, I tried few different old commits but could not find one yet where it applies successfully. Maybe I need to try even older llama.cpp commits, but not sure, if I go too far into the past, would it even support DeepSeek V3 architecture to convert to BF16? I also could not find any example command how to convert using [ggml-org/llama.cpp[#10055](https://github.com/ikawrakow/ik_llama.cpp/issues/10055)](https://github.com/ggml-org/llama.cpp/pull/10055) - maybe it is something obvious I missed, perhaps because I never created GGUF before.
I am sorry, I did not link that for you to use, just as a reference to what I see as a better long term solution to the greater issue of handling fp8 native models would be.
@@ -617,7 +495,7 @@ If you feel like trying one more time with triton (and no guarantees that it wil
---
-👤 **Panchovix** commented the **2025-05-09** at **19:19:58**:
+👤 **Panchovix** commented on **2025-05-09** at **19:19:58**
Issue should be fixed now on https://github.com/ikawrakow/ik_llama.cpp/commit/43a154d8b8b0e9217114577442cecb224a488d45
@@ -627,7 +505,7 @@ EDIT: Can confirm Chimera works fine as well.
---
-👤 **Lissanro** commented the **2025-05-11** at **07:05:01**:
+👤 **Lissanro** commented on **2025-05-11** at **07:05:01**
@saood06 Thank you, I was able to create BF16 quant after all. I switched to the system version of Python 3.13 without venv, I have applied the patch you shared and also had to bump up torch version in requirements/requirements-convert_hf_to_gguf.txt to torch~=2.5.0, otherwise it refused to proceed on my system. Without venv, I also was able to build triton-cpu. I am not sure exactly what helped out of these steps, so some of them may be unneccary. I finally was able to create BF16 command using this command:
@@ -653,7 +531,7 @@ This is the command I used to create imatrix.dat:
```
numactl --cpunodebind=0 --interleave=all~/pkgs/ik_llama.cpp/build/bin/llama-imatrix \
--model /mnt/neuro/DeepSeek-R1T-Chimera-256x21B-Q8_0-163840seq.gguf \
---ctx-size 102400 --n-gpu-layers 62 --tensor-split 15,25,30,30 -mla 3 -fa -ctk q8_0 -amb 1024 -fmoe -b 4096 -ub 4096 \
+--ctx-size 102400 --n-gpu-layers 62 --tensor-split 15,25,30,30 -mla 1 -fa -ctk q8_0 -amb 1024 -b 4096 -ub 4096 \
-ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0, blk\.3\.ffn_down_exps=CUDA0" \
-ot "blk\.4\.ffn_up_exps=CUDA1, blk\.4\.ffn_gate_exps=CUDA1, blk\.4\.ffn_down_exps=CUDA1" \
-ot "blk\.5\.ffn_up_exps=CUDA2, blk\.5\.ffn_gate_exps=CUDA2, blk\.5\.ffn_down_exps=CUDA2" \
@@ -664,6 +542,8 @@ numactl --cpunodebind=0 --interleave=all~/pkgs/ik_llama.cpp/build/bin/llama-imat
--ctx-size 512
```
+UPDATE: I removed `-fmoe` and `-mla 3` options from the command above - it turns out, using them greatly reduced imatrix quality and this is what caused by original imatrix to be much smaller than expected - instead, it is better to use `-mla 1` without `-fmoe`.
+
Context length optional, but it was mentioned [here](https://github.com/ggml-org/llama.cpp/pull/13199#issuecomment-2849293461) that Unsloth may be setting it to something higher than default 512, "possibly using 6144 - 12288" (later testing demonstrated that making imatrix with non-default context length does not help with long context performance, so if unsure better stick with the default 512 length).
More information about dynamic quant creation is here in comments:
@@ -680,10 +560,6 @@ I probably could have just used calibration_datav3.txt and nothing else, but cal
By the way, I remember a post where someone tested creating imatrix.dat file from BF16, Q8, Q6 and some lower quants, and then creating imatrix quant from BF16 with it, and the conclusion was the result was practically identical, especially if higher quants are used to create the imatrix. I did not save the link to it at the time (it was long before now), but I thought I mention it. This means if you are short on memory, you can use Q6 or even non-imatrix Q4 if you must, but using Q8 is recommended if possible to build the imatrix.dat.
-My imatrix: https://dragon.studio/2025/05/DeepSeek-R1T-Chimera-imatrix-8192seq.dat (I renamed it from imatrix.dat for clarity), it took about 12 hours to generate on EPYC 7763 64-core at 3.25 GHz.
-
-Also, here is another imatrix file for recent R1 0528 version: https://dragon.studio/2025/06/imatrix-DeepSeek-R1-0528.dat
-
Now, we can create the final quant:
```
@@ -706,71 +582,7 @@ Note: this comment was updated more recently then following messages below. So,
---
-👤 **Lissanro** commented the **2025-05-11** at **07:05:01**:
-
-@saood06 Thank you, I was able to create BF16 quant after all. I switched to the system version of Python 3.13 without venv, I have applied the patch you shared and also had to bump up torch version in requirements/requirements-convert_hf_to_gguf.txt to torch~=2.5.0, otherwise it refused to proceed on my system. Without venv, I also was able to build triton-cpu. I am not sure exactly what helped out of these steps, so some of them may be unneccary. I finally was able to create BF16 command using this command:
-
- python3 llama.cpp/convert_hf_to_gguf.py --outtype bf16 --split-max-size 50G /mnt/secondary/neuro/DeepSeek-R1T-Chimera-163840seq
-
-...where llama.cpp is the special version from [the tutorial](https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446/discussions/1#67a327570051a98a96ded9e6) you have shared earlier.
-
-Then, using ik_llama.cpp, I created my first GGUF quant, using Q6_K_R4 format:
-
-```
-~/pkgs/ik_llama.cpp/build/bin/llama-quantize \
-/mnt/secondary/neuro/DeepSeek-R1T-Chimera-163840seq/DeepSeek-R1T-Chimera-256x21B-163840seq-BF16-00001-of-00030.gguf \
-/mnt/secondary/neuro/DeepSeek-R1T-Chimera-163840seq/DeepSeek-R1T-Chimera-256x21B-Q6_K_R4-163840seq.gguf \
-Q6_K_R4
-```
-
-This is usable quant, but it is slow (I get about 2 tokens/s instead of 8 tokens/s like with Q4_K_M or UD-Q4_K_XL). However, I had to consider different solution, given I already know that Q4_K_M breaks the Chimera model (since Q4_K_M from huggingface fails the [maze test](https://www.reddit.com/r/LocalLLaMA/comments/1j4lqe6/test_if_your_api_provider_is_quantizing_your/), while Q6_K Chimera quant succeeds, and R1 Q4 quants from Unsloth also succeed).
-
-It turned out that creation of Dynamic Quants [is not documented yet and active work in progress](https://www.reddit.com/r/LocalLLaMA/comments/1kjshnd/comment/mrpacfb/), so I decided to go with creating IQ and imatrix based quants in the hope they work better than normal Q4_K_M from the huggingface.
-
-This is the command I used to create imatrix.dat:
-
-```
-~/pkgs/ik_llama.cpp/build/bin/llama-imatrix \
--m /mnt/neuro/text-generation-webui/models/DeepSeek-R1T-Chimera-256x21B-Q6_K_R4-163840seq/DeepSeek-R1T-Chimera-256x21B-Q6_K_R4-163840seq.gguf \
--f ~/pkgs/imatrix/all.txt \
---n-gpu-layers 62 --tensor-split 25,23,26,26 -mla 2 -fa -ctk q8_0 -amb 1024 -fmoe \
--ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0" \
--ot "blk\.4\.ffn_up_exps=CUDA1, blk\.4\.ffn_gate_exps=CUDA1" \
--ot "blk\.5\.ffn_up_exps=CUDA2, blk\.5\.ffn_gate_exps=CUDA2" \
--ot "blk\.6\.ffn_up_exps=CUDA3, blk\.6\.ffn_gate_exps=CUDA3" \
--ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" \
---threads 64
-```
-
-The all.txt file is a merge of these (I had to conver parquet to txt first):
-https://huggingface.co/datasets/eaddario/imatrix-calibration/resolve/main/calibration_all_large.parquet
-https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/2c64bb691316d32915b188e495754ef34931ae71/calibration_datav3.txt
-https://gist.github.com/bartowski1182/f003237f2e8612278a6d01622af1cb6f/raw/6cf9d7538b3a234952d927459d0ce42cb3d3ea6e/qwen_calibration_with_chat.txt
-(also, some personal data, but probably will have little compared to the three datasets above).
-
-I probably could have just used calibration_datav3.txt and nothing else, but calibration_all_large contained many languages that are not well represented in calibration_datav3.txt or qwen_calibration_with_chat.txt, and I happen to need support for multiple languages since I often do translation work.
-
-By the way, I remember a post where someone tested creating imatrix.dat file from BF16, Q8, Q6 and some lower quants, and then creating imatrix quant from BF16 with it, and the conclusion was the result was practically identical, especially if higher quants are used to create the imatrix. I did not save the link to it at the time (it was long before now), but I thought I mention it, to explain why I used Q6_K for this purpose.
-
-Estimated time to generate imatrix.dat was 16 hours, and I am still waiting for it to finish. Once I complete generating the imatrix.dat, I plan to run this command to create a final quant:
-
-```
-~/pkgs/ik_llama.cpp/build/bin/llama-quantize \
---imatrix imatrix.dat \
-/mnt/secondary/neuro/DeepSeek-R1T-Chimera-163840seq/DeepSeek-R1T-Chimera-256x21B-163840seq-BF16-00001-of-00030.gguf \
-/mnt/secondary/neuro/DeepSeek-R1T-Chimera-163840seq/DeepSeek-R1T-Chimera-256x21B-IQ4_K_R4-163840seq.gguf \
-IQ4_K_R4
-```
-
-I also plan to try other methods besides IQ4_K_R4, like IQ4_NL_R4 - to see if I will get better performance on my rig with CPU+GPU inference.
-
-Due to my upload speed being around 1Mbps on average, I will not be able to share any of my quants, but I hope documenting the process will help others who may want to create their own quant. Even once this issue is closed, I still will be able to link here in case I want to share my steps elsewhere, since there was a lot of useful discussion and valuable information shared in this thread.
-
-By the way, I also confirm that loading the existing quant from huggingface works now - so it seems the original issue that was reported is fixed. It is amazing that we can now use new MLA-enabled quants created by llama.cpp, but creating own quant may help to achieve better quality and performance, especially for models with very limited selection of quants like in this case. However, figuring out how to do it was really big challenge, and I wouldn't be able to do it without help. Big thanks to @saood06 and @ikawrakow!
-
----
-
-👤 **saood06** commented the **2025-05-11** at **08:19:17**:
+👤 **saood06** commented on **2025-05-11** at **08:19:17**
>Due to my upload speed being around 1Mbps on average, I will not be able to share any of my quants, but I hope documenting the process will help others who may want to create their own quant.
@@ -782,13 +594,13 @@ It is understandable, I am in the same position as are many others. I would be v
---
-👤 **ikawrakow** commented the **2025-05-12** at **05:41:16**:
+👤 **ikawrakow** commented on **2025-05-12** at **05:41:16**
I think this is solved now.
---
-👤 **Alexey-Akishin** commented the **2025-05-12** at **15:33:55**:
+👤 **Alexey-Akishin** commented on **2025-05-12** at **15:33:55**
I just tested and solved indeed, thank you so much!
@@ -796,13 +608,13 @@ I understand from the discussion that the pre-made quant from HG is not perfect,
---
-👤 **Lissanro** commented the **2025-05-12** at **15:57:49**:
+👤 **Lissanro** commented on **2025-05-12** at **15:57:49**
@saood06 I have updated my previous comment based on your feedback: added imatrix link (it turned out to be 130MB) and also fixed commands to properly generate and repack quant using R4 only where needed as you have suggested (the repack pattern for CPU may need to be adjusted for a specific configuration, unless it happen to match with mine). Hope the experience I shared will be useful to those who decide to generate their own quant.
---
-👤 **saood06** commented the **2025-05-13** at **00:31:52**:
+👤 **saood06** commented on **2025-05-13** at **00:31:52**
>added imatrix link (it turned out to be 130MB)
@@ -814,7 +626,7 @@ Thank you for documenting this to help others.
---
-👤 **ubergarm** commented the **2025-05-13** at **20:37:25**:
+👤 **ubergarm** commented on **2025-05-13** at **20:37:25**
@Lissanro great job jumping through all the hoops and finding the breadcrumbs spread around github, reddit, etc!
@@ -905,103 +717,12 @@ custom=$(
---
-👤 **ubergarm** commented the **2025-05-13** at **20:37:25**:
-
-@Lissanro great job jumping through all the hoops and finding the breadcrumbs spread around github, reddit, etc!
-
-> My imatrix: https://dragon.studio/2025/05/DeepSeek-R1T-Chimera-imatrix-8192seq.dat
-
-Just curious, given the date on this is ~3 days ago, I'm guessing it wasn't created with this https://github.com/ikawrakow/ik_llama.cpp/pull/411 ? Not sure how much it will effect you if you're using mostly >~4bpw quants.
-
-If you're looking for speed, a recent PR improved CUDA performance on `iq4_ks`. I'm toying with maybe making a new quant something like this, just playing around for now though.
-
-
-
-Possible quant recipe
-
-```
-#!/usr/bin/env bash
-
-# Notes:
-# https://github.com/ikawrakow/ik_llama.cpp/issues/296#issuecomment-2765210993
-# https://github.com/ikawrakow/ik_llama.cpp/issues/296#issuecomment-2768567062
-custom="
-# Token embedding and output tensors (GPU)
-# Remember only use _r4 for CPU *only* or offline repack later
-# Remember all attention and shexp isn't so big so could go all q8_0 and still fit under 24GB VRAM w/ 32k MLA context
-# note token_embd cannot be repacked quant type
-token_embd\.weight=iq6_k
-output\.weight=iq6_k
-output_norm\.weight=iq6_k
-
-# First 3 dense layers (0-3) (GPU)
-blk\.[0-2]\.attn_k_b.*=q6_0
-blk\.[0-2]\.attn_.*=iq6_k
-blk\.[0-2]\..*=iq6_k
-
-# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
-# Except blk.*.attn_k_b.weight is not divisible by 256 and no iq6_k so go with q6_0
-blk\.[3-9]\.attn_k_b.*=q6_0
-blk\.[1-5][0-9]\.attn_k_b.*=q6_0
-blk\.60\.attn_k_b.*=q6_0
-
-blk\.[3-9]\.attn_.*=iq6_k
-blk\.[1-5][0-9]\.attn_.*=iq6_k
-blk\.60\.attn_.*=iq6_k
-
-blk\.[3-9]\.ffn_norm\.weight=iq6_k
-blk\.[1-5][0-9]\.ffn_norm\.weight=iq6_k
-blk\.60\.ffn_norm\.weight=iq6_k
-
-blk\.[3-9]\.exp_probs_b\.bias=iq6_k
-blk\.[1-5][0-9]\.exp_probs_b\.bias=iq6_k
-blk\.60\.exp_probs_b\.bias=iq6_k
-
-# Shared Experts (3-60) (GPU)
-blk\.[3-9]\.ffn_down_shexp\.weight=iq6_k
-blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq6_k
-blk\.60\.ffn_down_shexp\.weight=iq6_k
-
-blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
-blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq6_k
-blk\.60\.ffn_(gate|up)_shexp\.weight=iq6_k
-
-# Most of the model size is below
-# Routed Experts (3-60) (CPU)
-# usually ffn_down is made a bit bigger than ffn_(gate|up) but you do you
-blk\.[3-9]\.ffn_down_exps\.weight=iq4_ks
-blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq4_ks
-blk\.60\.ffn_down_exps\.weight=iq4_ks
-
-blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
-blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq4_ks
-blk\.60\.ffn_(gate|up)_exps\.weight=iq4_ks
-"
-
-custom=$(
- echo "$custom" | grep -v '^#' | \
- sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
-)
-
-./build/bin/llama-quantize \
- --imatrix /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix \
- --custom-q "$custom" \
- /mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf \
- /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_KS.gguf \
- IQ4_KS \
- 24
-```
-
-
-
----
-
-👤 **Lissanro** commented the **2025-05-13** at **22:16:19**:
+👤 **Lissanro** commented on **2025-05-13** at **22:16:19**
@ubergarm
Thank you for sharing the recipe, I will give it a try, every bit of speed up will make a difference for me. I may have to wait until I get new 8TB SSD, I should get it within 1-2 days (since I ran out of space on my SSDs, and trying to load models from 16TB HDD takes hours instead of minutes like on SSD, making hard to experiment).
-As of #411, it says "This PR fixes imatrix calculation for llama.cpp-style MLA GGUFs", but I generated my imatrix from a normal GGUF derived from BF16 (using ik_llama.cpp's tools), which in turn was derived from the original fp8 model. So most likely it will not have effect on my imatrix, but please correct me if I am wrong and if it worth regenarating.
+As of [#411](https://github.com/ikawrakow/ik_llama.cpp/issues/411), it says "This PR fixes imatrix calculation for llama.cpp-style MLA GGUFs", but I generated my imatrix from a normal GGUF derived from BF16 (using ik_llama.cpp's tools), which in turn was derived from the original fp8 model. So most likely it will not have effect on my imatrix, but please correct me if I am wrong and if it worth regenarating.
@saood06 Not sure then why my imatrix is smaller, but I created it using ik_llama.cpp's llama-imatrix, maybe the larger versions were created by some other tool, or used some special settings?
@@ -1048,59 +769,7 @@ In case someone else decides to test their quants, the command needs to be adjus
---
-👤 **Lissanro** commented the **2025-05-13** at **22:16:19**:
-
-@ubergarm
-Thank you for sharing the recipe, I will give it a try, every bit of speed up will make a difference for me. I may have to wait until I get new 8TB SSD, I should get it within 1-2 days (since I ran out of space on my SSDs, and trying to load models from 16TB HDD takes hours instead of minutes like on SSD, making hard to experiment).
-
-As of #411, it says "This PR fixes imatrix calculation for llama.cpp-style MLA GGUFs", but I generated my imatrix from a normal GGUF derived from BF16 (using ik_llama.cpp's tools), which in turn was derived from the original fp8 model. So most likely it will not have effect on my imatrix, but please correct me if I am wrong and if it worth regenarating.
-
-@saood06 Not sure then why my imatrix is smaller, but I created it using ik_llama.cpp's llama-imatrix, maybe the larger versions were created by some other tool, or used some special settings?
-
-I tried creating another imatrix with default 512 context length, and then compare perplexity of quants generated from it, and this is the result (in R4 quants, only tensors that I run on CPU were repacked as R4):
-
-```
-IQ4_K_R4 from imatrix generated using n_ctx=512:
-Final estimate: PPL = 3.2911 +/- 0.01817 (perplexity tested with n_ctx=512)
-Final estimate: PPL = 3.0219 +/- 0.01568 (perplexity tested with n_ctx=8192)
-```
-
-```
-IQ4_K_R4 from imatrix generated using n_ctx=512
-Final estimate: PPL = 3.2911 +/- 0.01816 (perplexity tested with n_ctx=512)
-Final estimate: PPL = 3.0230 +/- 0.01569 (perplexity tested with n_ctx=8192)
-```
-
-```
-Q6_K reference quant:
-Final estimate: PPL = 3.2611 +/- 0.01791 (perplexity tested with n_ctx=512)
-Final estimate: PPL = 3.0039 +/- 0.01554 (perplexity tested with n_ctx=8192)
-```
-
-The conclusion it seems that generating imatrix with longer context either does not make a difference or makes quality very slightly worse (but within margin of error, so hard to tell). So generating imatrix with the default n_ctx=512 should be sufficient (it was suggested by someone in the discussions I linked in my earlier post that Unsloth may have been using context length within 6144 - 12288 range to generate imatrix, so I wanted to see if it actually makes a difference, but apparently not).
-
-For reference, this is the command I used to test perplexity:
-
-```
-numactl --cpunodebind=0 --interleave=all ~/pkgs/ik_llama.cpp/build/bin/llama-perplexity \
---model /path/to/model.gguf --n-gpu-layers 62 --tensor-split 25,23,26,26 \
--mla 3 -fa -ctk q8_0 -amb 1024 -fmoe \
--ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0" \
--ot "blk\.4\.ffn_up_exps=CUDA1, blk\.4\.ffn_gate_exps=CUDA1" \
--ot "blk\.5\.ffn_up_exps=CUDA2, blk\.5\.ffn_gate_exps=CUDA2" \
--ot "blk\.6\.ffn_up_exps=CUDA3, blk\.6\.ffn_gate_exps=CUDA3" \
--ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" \
---threads 64 -f /home/lissanro/pkgs/ik_llama.cpp/wikitext-2-raw/wiki.test.ra \
---ctx-size 512
-```
-
-In case someone else decides to test their quants, the command needs to be adjusted for a specific configuration, for non-repacked quants -rtr option may be needed, and ctx-size is 512 by default but can be changed if needed. And to get wiki.test.ra, I had to run the following command:
-
-`~/pkgs/ik_llama.cpp/scripts/get-wikitext-2.sh`
-
----
-
-👤 **saood06** commented the **2025-05-13** at **22:29:01**:
+👤 **saood06** commented on **2025-05-13** at **22:29:01**
> trying to load models from 16TB HDD takes hours instead of minutes like on SSD, making hard to experiment).
@@ -1120,7 +789,7 @@ Thank you for your testing and sharing of the results.
---
-👤 **ubergarm** commented the **2025-05-14** at **01:37:52**:
+👤 **ubergarm** commented on **2025-05-14** at **01:37:52**
@Lissanro
@@ -1197,38 +866,7 @@ With luck I'll have some updated perplexity values using the latest method for g
---
-👤 **ubergarm** commented the **2025-05-14** at **01:37:52**:
-
-> if it's worth regenerating.
-
-tbh I'm not sure myself. if you're using all > ~4bpw quants it might not make a huge deal.
-
-> Not sure then why my imatrix is smaller
-
-I just converted [tngtech/DeepSeek-R1T-Chimera](https://huggingface.co/tngtech/DeepSeek-R1T-Chimera) fp8 to bf16 GGUF with evshiron's llama.cpp fork and triton-cpu. I can't run the full bf16 easily with enough RAM in a single NUMA node so just made a full q8_0 version without imatrix first. Then using the q8_0 as my baseline I kept it simple and old school with
-
-```bash
-numactl -N 0 -m 0 \
-./build/bin/llama-imatrix \
- --verbosity 1 \
- -m /media/b/data2/models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-Q8_0.gguf \
- -f calibration_data_v5_rc.txt \
- -o DeepSeek-R1T-Chimera.imatrix \
- --ctx-size 512 \
- --numa numactl \
- --threads 40
-```
-Resulting imatrix size is 942MiB and when using it to quantize it prints out: `720 importance matrix entries ... on 213 chunks`.
-
-> The conclusion it seems that generating imatrix with longer context either does not make a difference or makes quality very slightly worse (but within margin of error, so hard to tell). So generating imatrix with the default n_ctx=512 should be sufficient
-
-Hey appreciate the additional data points with your practical empirical approach. If you follow along there is already [much interesting old discussions still available](https://github.com/ggml-org/llama.cpp/discussions/5263) which suggests the same. Apparently [unsloth is using longer context length at least for some GGUF imatrix files now](https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF/discussions/8#6821262ba2ff408c1deccba6) but to be honest I don't follow their logic nor yet see any clear evidence. (I'm not saying its wrong, it might be so, but I don't know.)
-
-With luck I'll have some updated perplexity values using the latest method for generating imatrix and update you. Thanks for sharing your research!
-
----
-
-👤 **Lissanro** commented the **2025-05-15** at **05:49:40**:
+👤 **Lissanro** commented on **2025-05-15** at **05:49:40**
> Resulting imatrix size is 942MiB and when using it to quantize it prints out: 720 importance matrix entries ... on 213 chunks.
@@ -1265,37 +903,7 @@ Since performance and size are similar, normal imatrix IQ4_K_R4 quant seem to be
---
-👤 **Lissanro** commented the **2025-05-15** at **05:49:40**:
-
-> Resulting imatrix size is 942MiB and when using it to quantize it prints out: 720 importance matrix entries ... on 213 chunks.
-
-For me it shows "load_imatrix: loaded 543 importance matrix entries from DeepSeek-R1T-Chimera-imatrix.dat computed on 3660 chunks" (probably because I am using large input file) and resulting size is 130 MB. I wonder what makes mine smaller, maybe because I am creating it from Q6_K instead of Q8_0? However, my imatrix file seems to work as expected as far as I can tell.
-
-> Possible quant recipe
-
-I have tested the recipe for the IQ4_KS quant and based on perplexity it seems to be quite good, the size is slightly smaller, perplexity remained almost exactly the same as for IQ4_K and performance remained similar (slightly more than 8 tokens/s for both IQ4_K and IQ4_KS quants, with only necessary for CPU tensors converted to R4, on EPYC 7763 + 1 TB 3200MHz RAM + 4x3090 GPUs):
-
-```
-IQ4_KS_R (339G)
-Final estimate: PPL = 3.2876 +/- 0.01807
-Final estimate: PPL = 3.0262 +/- 0.01568
-```
-
-```
-IQ4_K_R4 (356G):
-Final estimate: PPL = 3.2911 +/- 0.01817 (perplexity tested with n_ctx=512)
-Final estimate: PPL = 3.0219 +/- 0.01568 (perplexity tested with n_ctx=8192)
-```
-
-```
-Q6_K reference quant (515G):
-Final estimate: PPL = 3.2611 +/- 0.01791 (perplexity tested with n_ctx=512)
-Final estimate: PPL = 3.0039 +/- 0.01554 (perplexity tested with n_ctx=8192)
-```
-
----
-
-👤 **ubergarm** commented the **2025-05-15** at **14:49:15**:
+👤 **ubergarm** commented on **2025-05-15** at **14:49:15**
@Lissanro
@@ -1364,60 +972,4 @@ I guess I'm not sure it applies to use this tokenized style maze test on models
This specific alpha maze test seems to be used on SFTd models to compare the underlying model architectures ability to solve spatial tasks, not to compare quantizations of a model not SFTd with these tokens.
-But I dunno, maybe it is useful?
-
----
-
-👤 **ubergarm** commented the **2025-05-15** at **14:49:15**:
-
-@Lissanro
-
-> However, my imatrix file seems to work as expected as far as I can tell.
-
-Yeah it seems like just having almost any imatrix is generally better than not.
-
-I just got my first numbers on this [DeepSeek-R1T-Chimera-IQ4_KS](https://huggingface.co/ubergarm/DeepSeek-R1T-Chimera-GGUF#deepseek-r1t-chimera-iq4_ks)*:
-
-```
-IQ4_KS - 338.456 GiB - 4.326 BPW
-Final estimate: PPL = 3.4082 +/- 0.01892
-```
-
-*it is super slow to upload, not sure it will ever finish lol... The new imatrix is at least there computed with the latest fixes from PR411
-
-Need to run one on the Q8_0 for comparison but its kinda slow as I haven't optimized the command on this remote rig.
-
-My PPL is higher than yours, could be using iq6_k for all attention, but you have the longer imatrix corpus as well. Too many variables to know for sure but at least another data point.
-
-
-
-perplexity command
-
-```
-# running on single 4090 GPU with plenty of RAM
-$ wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
-$ gunzip wiki.test.raw.gz
-$ numactl -N 0 -m 0 \
-./build/bin/llama-perplexity \
- -m /models/ubergarm/DeepSeek-R1T-Chimera-GGUF/DeepSeek-R1T-Chimera-IQ4_KS.gguf \
- -f wiki.test.raw \
- --ctx-size 512 \
- --ubatch-size 512 \
- --seed 1337 \
- -ctk f16 \
- -fa \
- -mla 3 \
- -amb 512 \
- -fmoe \
- -ngl 99 \
- --override-tensor exps=CPU \
- -rtr \
- --numa numactl \
- --threads 40
-```
-
-
-
-> Further testing revealed Q4_KS quant quality dropped significantly in reasoning tasks, most noticeable in the [maze test](https://www.reddit.com/r/LocalLLaMA/comments/1j4lqe6/test_if_your_api_provider_is_quantizing_your/):
-
-Huh fascinating, I wonder what is going on there. Both the K and KS have similar perplexities. I haven't looked into the maze test but will maybe try it out on some smaller models locally soon just to see. Is it failing in terms of getting the output directions correct with you as a human looking at the result? Or is it some syntactical errors with it messing up the `<|up|>` formatting resulting in a "failed" run as computed by some python script? I assume sampling may effect the output somewhat. But if it works reliably it could be a useful test, thanks for sharing!
\ No newline at end of file
+But I dunno, maybe it is useful?
\ No newline at end of file
diff --git a/github-data/issues/387 - Bug_ bitnet 1.58 on termux segmentation fault.md b/github-data/issues/387 - Bug bitnet 1.58 on termux segmentation fault.md
similarity index 72%
rename from github-data/issues/387 - Bug_ bitnet 1.58 on termux segmentation fault.md
rename to github-data/issues/387 - Bug bitnet 1.58 on termux segmentation fault.md
index 1651c13fe..0185c75b6 100644
--- a/github-data/issues/387 - Bug_ bitnet 1.58 on termux segmentation fault.md
+++ b/github-data/issues/387 - Bug bitnet 1.58 on termux segmentation fault.md
@@ -1,4 +1,4 @@
-### 🐛 [#387](https://github.com/ikawrakow/ik_llama.cpp/issues/387) - Bug: bitnet 1.58 on termux segmentation fault
+## 📌 [Issue #387](https://github.com/ikawrakow/ik_llama.cpp/issues/387) - Bug: bitnet 1.58 on termux segmentation fault
| **Author** | `Benjamin-Wegener` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -37,9 +37,9 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Benjamin-Wegener** commented the **2025-05-06** at **17:42:16**:
+👤 **Benjamin-Wegener** commented on **2025-05-06** at **17:42:16**
used
cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF
@@ -47,19 +47,19 @@ cmake --build ./build --config Release -j $(nproc)
---
-👤 **ikawrakow** commented the **2025-05-06** at **17:45:58**:
+👤 **ikawrakow** commented on **2025-05-06** at **17:45:58**
You need to convert the model. If you don't find how, I'll add the instructions when back at a computer.
---
-👤 **Benjamin-Wegener** commented the **2025-05-06** at **18:09:09**:
+👤 **Benjamin-Wegener** commented on **2025-05-06** at **18:09:09**
thanks, ill report back
---
-👤 **Benjamin-Wegener** commented the **2025-05-06** at **19:04:56**:
+👤 **Benjamin-Wegener** commented on **2025-05-06** at **19:04:56**
~/ik_llama.cpp $ ./build/bin/llama-quantize --allow-requantize ./models/bitnet1582b4t-iq2_bn_r4.gguf\?download\=true ./models/bitnet.gguf iq2_bn_r4
@@ -69,7 +69,7 @@ Llama: :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
---
-👤 **ikawrakow** commented the **2025-05-06** at **19:37:24**:
+👤 **ikawrakow** commented on **2025-05-06** at **19:37:24**
You need to convert the `i2_s` model that you downloaded previously
```
@@ -79,9 +79,9 @@ You need to convert the `i2_s` model that you downloaded previously
---
-👤 **saood06** commented the **2025-05-06** at **19:51:09**:
+👤 **saood06** commented on **2025-05-06** at **19:51:09**
-I think the issue is #361 which can be worked around using #347
+I think the issue is [#361](https://github.com/ikawrakow/ik_llama.cpp/issues/361) which can be worked around using [#347](https://github.com/ikawrakow/ik_llama.cpp/issues/347)
One indicator of that is if the build process took a short amount of time.
@@ -95,23 +95,9 @@ To test in the server you can send the following request which is lifted straigh
---
-👤 **saood06** commented the **2025-05-06** at **19:51:09**:
+👤 **Benjamin-Wegener** commented on **2025-05-07** at **06:28:44**
-I think the issue is #361 which can be worked around using #347
-
-One indicator of that is if the build process took a short amount of time.
-
-Try adding `-DGGML_ARCH_FLAGS="-march=armv8.2-a+dotprod+fp16"` to your build.
-
-To test in the server you can send the following request which is lifted straight from from their [transformers PR](https://github.com/huggingface/transformers/pull/37503/files) (the BOS token is ommited as ik_llama.cpp/llama.cpp automatically inserts one):
-
-"User: Hey, are you conscious? Can you talk to me?<|eot_id|>Assistant:"
-
----
-
-👤 **Benjamin-Wegener** commented the **2025-05-07** at **06:28:44**:
-
-> I think the issue is [#361](https://github.com/ikawrakow/ik_llama.cpp/issues/361) which can be worked around using [#347](https://github.com/ikawrakow/ik_llama.cpp/pull/347)
+> I think the issue is [[#361](https://github.com/ikawrakow/ik_llama.cpp/issues/361)](https://github.com/ikawrakow/ik_llama.cpp/issues/361) which can be worked around using [[#347](https://github.com/ikawrakow/ik_llama.cpp/issues/347)](https://github.com/ikawrakow/ik_llama.cpp/pull/347)
>
> One indicator of that is if the build process took a short amount of time.
>
@@ -127,7 +113,7 @@ that helps, now its working, thank you
---
-👤 **Benjamin-Wegener** commented the **2025-05-09** at **04:30:45**:
+👤 **Benjamin-Wegener** commented on **2025-05-09** at **04:30:45**
just for convenience all subsequential commands to install bitnet (or other cpu models) on a fresh termux aarch64:
```bash
@@ -143,42 +129,20 @@ wget https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/resolve/main/ggml-
---
-👤 **Benjamin-Wegener** commented the **2025-05-09** at **04:30:45**:
-
-just for convenience all subsequential commands to install bitnet (or other cpu models) on a fresh termux aarch64:
-`
-apt update && apt install wget cmake git -y
-git clone https://github.com/ikawrakow/ik_llama.cpp
-cd ik_llama.cpp
-cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DGGML_ARCH_FLAGS="-march=armv8.2-a+dotprod+fp16"
-cmake --build ./build --config Release -j $(nproc)
-wget https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/resolve/main/ggml-model-i2_s.gguf?download=true -O ./models/ggml-model-i2_s.gguf
-./build/bin/llama-quantize --allow-requantize ./models/ggml-model-is_s.gguf ./models/bitnet.gguf iq2_bn_r4
-./build/bin/llama-server -mla 3 --model ./models/bitnet.gguf
-`
-
----
-
-👤 **ikawrakow** commented the **2025-05-09** at **08:19:12**:
+👤 **ikawrakow** commented on **2025-05-09** at **08:19:12**
@Benjamin-Wegener Thank you for these instructions. Do you mind if I take them and make a Discussion for better visibility. Or, if you prefer, you can do it yourself. Let me know.
---
-👤 **Benjamin-Wegener** commented the **2025-05-09** at **09:20:13**:
+👤 **Benjamin-Wegener** commented on **2025-05-09** at **09:20:13**
sure, will do
EDIT: done https://github.com/ikawrakow/ik_llama.cpp/discussions/401
---
-👤 **Benjamin-Wegener** commented the **2025-05-09** at **09:20:13**:
-
-sure, will do
-
----
-
-👤 **Manamama** commented the **2025-05-23** at **08:50:18**:
+👤 **Manamama** commented on **2025-05-23** at **08:50:18**
FYI, I have tested your https://github.com/ikawrakow/ik_llama.cpp/issues/387#issuecomment-2865065414 out of curiosity on my "somewhat contaminated" Termux.
@@ -301,134 +265,13 @@ I have taken a peek at this `quantize-stats.cpp` and these strings asre indeed t
---
-👤 **Manamama** commented the **2025-05-23** at **08:50:18**:
-
-FYI, I have tested your https://github.com/ikawrakow/ik_llama.cpp/issues/387#issuecomment-2865065414 out of curiosity on my "somewhat contaminated" Termux.
-
-Both llama.cpp and yours used to compile fine, but at least today:
-1. llama.cpp still compiles fine (but then seg faults on some ggufs only, see https://github.com/ggml-org/llama.cpp/issues/13708#issuecomment-2902117306)
-2. Your one, when I do just that: https://github.com/ikawrakow/ik_llama.cpp/issues/387#issuecomment-2865065414, causes:
-
-```
-Environment at system:
-Linux localhost 4.14.186+ #1 SMP PREEMPT Thu Mar 17 16:28:22 CST 2022 aarch64 Android
-
-
-PATH: /data/data/com.termux/files/usr/google-cloud-sdk/bin:/data/data/com.termux/files/home/.opam/default/bin:/data/data/com.termux/files/usr/bin:/system/bin/:/data/data/com.termux/files/usr/bin:/system/bin/:/data/data/com.termux/files/usr/bin:/data/data/com.termux/files/usr/bin/texlive:/data/data/com.termux/files/usr/bin/texlive:/data/data/com.termux/files/home/.local/bin:/build-tools/30.0.3
-
-LD_PRELOAD: /data/data/com.termux/files/usr/lib/libtermux-exec-direct-ld-preload.so
-
-LD_LIBRARY_PATH:
-
-CC: clang
-CXX: clang++
-C_INCLUDE_PATH:
-FC: lfortran
-CFLAGS:
-CXXFLAGS:
-LDFLAGS: -llog -largp -lm
-CPPFLAGS:
-CMAKE_PREFIX_PATH: :/data/data/com.termux/files/usr/lib/cmake/Qt6HostInfo
+👤 **ikawrakow** commented on **2025-05-23** at **09:02:05**
-JAVA_HOME: /data/data/com.termux/files/usr/lib/jvm/java-17-openjdk
-ANDROID_NDK: /storage/emulated/0/Download/android-ndk-r26b
-ANDROID_SDK: /storage/sdcard1/Installs/Android_ndk_sdk/SDK
-
-```
-~/downloads $ git clone https://github.com/ikawrakow/ik_llama.cpp
-cd ik_llama.cpp
-Cloning into 'ik_llama.cpp'...
-remote: Enumerating objects: 29327, done.
-remote: Counting objects: 100% (8480/8480), done.
-remote: Compressing objects: 100% (788/788), done.
-remote: Total 29327 (delta 8003), reused 7707 (delta 7692), pack-reused 20847 (from 2)
-Receiving objects: 100% (29327/29327), 34.13 MiB | 98.00 KiB/s, done.
-Resolving deltas: 100% (22227/22227), done.
-Updating files: 100% (1027/1027), done.
-~/downloads/ik_llama.cpp $ cd ik^C
-~/downloads/ik_llama.cpp $ ls
- AUTHORS CMakePresets.json convert_hf_to_gguf_update.py examples gguf-py Makefile Package.swift pyproject.toml requirements.txt tests
- ci common convert_llama_ggml_to_gguf.py flake.lock grammars media pocs pyrightconfig.json scripts
- cmake CONTRIBUTING.md convert_lora_to_gguf.py flake.nix include models poetry.lock README.md spm-headers
- CMakeLists.txt convert_hf_to_gguf.py docs ggml LICENSE mypy.ini prompts requirements src
-~/downloads/ik_llama.cpp $
-cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DGGML_ARCH_FLAGS="-march=armv8.2-a+dotprod+fp16"
-cmake --build ./build --config Release -j $(nproc)
--- The C compiler identification is Clang 20.1.5
--- The CXX compiler identification is Clang 20.1.5
--- Detecting C compiler ABI info
--- Detecting C compiler ABI info - done
--- Check for working C compiler: /data/data/com.termux/files/usr/bin/clang - skipped
--- Detecting C compile features
--- Detecting C compile features - done
--- Detecting CXX compiler ABI info
--- Detecting CXX compiler ABI info - done
--- Check for working CXX compiler: /data/data/com.termux/files/usr/bin/clang++ - skipped
--- Detecting CXX compile features
--- Detecting CXX compile features - done
--- Found Git: /data/data/com.termux/files/usr/bin/git (found version "2.49.0")
--- Performing Test CMAKE_HAVE_LIBC_PTHREAD
--- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
--- Check if compiler accepts -pthread
--- Check if compiler accepts -pthread - yes
--- Found Threads: TRUE
--- Found OpenMP_C: -fopenmp=libomp (found version "5.1")
--- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
--- Found OpenMP: TRUE (found version "5.1")
--- OpenMP found
--- Using optimized iqk matrix multiplications
--- Enabling IQK Flash Attention kernels
--- Using llamafile
--- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
--- CMAKE_SYSTEM_PROCESSOR: aarch64
--- ARM detected
--- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E
--- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
--- Looking for pthread_create in pthreads
--- Looking for pthread_create in pthreads - not found
--- Looking for pthread_create in pthread
--- Looking for pthread_create in pthread - found
--- ARCH_FLAGS = -march=native
--- Configuring done (17.5s)
--- Generating done (1.4s)
--- Build files have been written to: /data/data/com.termux/files/home/downloads/ik_llama.cpp/build
-[ 0%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
-[ 1%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
-...
-[ 79%] Building CXX object examples/perplexity/CMakeFiles/llama-perplexity.dir/perplexity.cpp.o
-[ 80%] Linking CXX executable ../../bin/llama-perplexity
-[ 80%] Built target llama-perplexity
-[ 81%] Building CXX object examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/quantize-stats.cpp.o
-/data/data/com.termux/files/home/downloads/ik_llama.cpp/examples/quantize-stats/quantize-stats.cpp:782:57: error: expected ')'
- 782 | if (sumqx*sumqx*sumq2i[j] > best]) {
- | ^
-/data/data/com.termux/files/home/downloads/ik_llama.cpp/examples/quantize-stats/quantize-stats.cpp:782:28: note: to match this '('
- 782 | if (sumqx*sumqx*sumq2i[j] > best]) {
- | ^
-/data/data/com.termux/files/home/downloads/ik_llama.cpp/examples/quantize-stats/quantize-stats.cpp:782:57: error: expected expression
- 782 | if (sumqx*sumqx*sumq2i[j] > best]) {
- | ^
-/data/data/com.termux/files/home/downloads/ik_llama.cpp/examples/quantize-stats/quantize-stats.cpp:782:58: error: expected expression
- 782 | if (sumqx*sumqx*sumq2i[j] > best]) {
- | ^
-3 errors generated.
-make[2]: *** [examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/build.make:79: examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/quantize-stats.cpp.o] Error 1
-make[1]: *** [CMakeFiles/Makefile2:3920: examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/all] Error 2
-make: *** [Makefile:146: all] Error 2
-```
-
-I have taken a peek at this `quantize-stats.cpp` and these strings asre indeed there, but I am bad in counting the closing brackets vs the opening ones by hand ...
-```
-
----
-
-👤 **ikawrakow** commented the **2025-05-23** at **09:02:05**:
-
-Does #445 fix it?
+Does [#445](https://github.com/ikawrakow/ik_llama.cpp/issues/445) fix it?
---
-👤 **Manamama** commented the **2025-05-23** at **18:34:02**:
+👤 **Manamama** commented on **2025-05-23** at **18:34:02**
Yes, it compiles now.
Testing:
@@ -459,17 +302,4 @@ Quick update: my trick does not help either.
```
after recompilation, too.
-Ver. 1.3
-
----
-
-👤 **Manamama** commented the **2025-05-23** at **18:34:02**:
-
-Yes, it compiles now.
-Testing:
-```
-wget https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/resolve/main/ggml-model-i2_s.gguf?download=true -O ./models/ggml-model-i2_s.gguf
-./build/bin/llama-quantize --allow-requantize ./models/ggml-model-is_s.gguf ./models/bitnet.gguf iq2_bn_r4
-./build/bin/llama-server -mla 3 --model ./models/bitnet.gguf
-```
-...
\ No newline at end of file
+Ver. 1.3
\ No newline at end of file
diff --git a/github-data/issues/388 - Bug_ Clash with mainline llama.cpp .so files.md b/github-data/issues/388 - Bug Clash with mainline llama.cpp .so files.md
similarity index 74%
rename from github-data/issues/388 - Bug_ Clash with mainline llama.cpp .so files.md
rename to github-data/issues/388 - Bug Clash with mainline llama.cpp .so files.md
index 1ea6a9761..30c2f61b2 100644
--- a/github-data/issues/388 - Bug_ Clash with mainline llama.cpp .so files.md
+++ b/github-data/issues/388 - Bug Clash with mainline llama.cpp .so files.md
@@ -1,4 +1,4 @@
-### 🐛 [#388](https://github.com/ikawrakow/ik_llama.cpp/issues/388) - Bug: Clash with mainline llama.cpp .so files
+## 📌 [Issue #388](https://github.com/ikawrakow/ik_llama.cpp/issues/388) - Bug: Clash with mainline llama.cpp .so files
| **Author** | `Manamama` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -105,9 +105,9 @@ Most Linuxex, I presume.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Manamama** commented the **2025-05-06** at **19:03:45**:
+👤 **Manamama** commented on **2025-05-06** at **19:03:45**
Update, still seg fault:
@@ -307,7 +307,7 @@ Segmentation fault
---
-👤 **Manamama** commented the **2025-05-06** at **19:16:14**:
+👤 **Manamama** commented on **2025-05-06** at **19:16:14**
Oh, identical in Termux. Grok AI wrote the below, sorry for the dump paste:
@@ -318,7 +318,7 @@ Oh, identical in Termux. Grok AI wrote the below, sorry for the dump paste:
> Environment:
> OS: Android (Termux)
>
-> Kernel: Linux localhost 4.14.186+ #1 SMP PREEMPT Thu Mar 17 16:28:22 CST 2022 aarch64 Android
+> Kernel: Linux localhost 4.14.186+ [#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) SMP PREEMPT Thu Mar 17 16:28:22 CST 2022 aarch64 Android
>
> Architecture: aarch64
>
@@ -437,11 +437,11 @@ Oh, identical in Termux. Grok AI wrote the below, sorry for the dump paste:
> The system llama-cli issue suggests a broader problem with Termux package management or incomplete installations, which may require coordination with Termux maintainers.
>
> References:
-> uname -a: Linux localhost 4.14.186+ #1 SMP PREEMPT Thu Mar 17 16:28:22 CST 2022 aarch64 Android
+> uname -a: Linux localhost 4.14.186+ [#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) SMP PREEMPT Thu Mar 17 16:28:22 CST 2022 aarch64 Android
---
-👤 **Manamama** commented the **2025-05-06** at **19:57:18**:
+👤 **Manamama** commented on **2025-05-06** at **19:57:18**
Update: this avoids seg faults in Ubuntu: https://github.com/ikawrakow/ik_llama.cpp/issues/387#issuecomment-2855735935
@@ -489,7 +489,7 @@ drwxrwxrwx 1 root root 8192 Apr 22 16:18 ../
---
-👤 **saood06** commented the **2025-05-06** at **19:58:26**:
+👤 **saood06** commented on **2025-05-06** at **19:58:26**
How did you build this on Ubuntu and Android? Do you mind sharing the logs from both builds?
@@ -497,7 +497,7 @@ Also on termux you may want to try adding "-DGGML_ARCH_FLAGS="-march=armv8.2-a+d
---
-👤 **saood06** commented the **2025-05-06** at **20:01:17**:
+👤 **saood06** commented on **2025-05-06** at **20:01:17**
>But I am not sure why the size got from tiny to minuscule:
@@ -512,7 +512,7 @@ which is expected.
---
-👤 **Manamama** commented the **2025-05-06** at **20:10:16**:
+👤 **Manamama** commented on **2025-05-06** at **20:10:16**
Re Droid only.
@@ -749,244 +749,7 @@ I shall `mv` once again and retry your `"-DGGML_ARCH_FLAGS="-march=armv8.2-a+dot
---
-👤 **Manamama** commented the **2025-05-06** at **20:10:16**:
-
-Re Droid only.
-
-New Termux session, so LD_LIBRARY_PATH is standard:
-```
-~/downloads/ik_llama.cpp $ echo $LD_LIBRARY_PATH
-
-~/downloads/ik_llama.cpp $
-```
-- Termux pix up the defaults then, I presume.
-
-
-We move the old working /bin files and recompile and test:
-
-```
-~/downloads/ik_llama.cpp $ ls bin/
- llama-baby-llama llama-cvector-generator llama-gguf-split llama-lookup-create llama-q8dot llama-speculative test-chat-template test-quantize-fns
- llama-batched llama-embedding llama-gritlm llama-lookup-merge llama-quantize llama-sweep-bench test-grad0 test-quantize-perf
- llama-batched-bench llama-eval-callback llama-imatrix llama-lookup-stats llama-quantize-stats llama-tokenize test-grammar-integration test-rope
- llama-bench llama-export-lora llama-infill llama-minicpmv-cli llama-retrieval llama-vdot test-grammar-parser test-sampling
- llama-bench-matmult llama-gbnf-validator llama-llava-cli llama-parallel llama-save-load-state test-autorelease test-json-schema-to-grammar test-tokenizer-0
- llama-cli llama-gguf llama-lookahead llama-passkey llama-server test-backend-ops test-llama-grammar test-tokenizer-1-bpe
- llama-convert-llama2c-to-ggml llama-gguf-hash llama-lookup llama-perplexity llama-simple test-c test-model-load-cancel test-tokenizer-1-spm
-~/downloads/ik_llama.cpp $ mv bin/ bin.1
-~/downloads/ik_llama.cpp $ rm CMakeCache.txt
-~/downloads/ik_llama.cpp $ cmake .
--- The C compiler identification is Clang 20.1.3
--- The CXX compiler identification is Clang 20.1.3
--- Detecting C compiler ABI info
--- Detecting C compiler ABI info - done
--- Check for working C compiler: /data/data/com.termux/files/usr/bin/clang - skipped
--- Detecting C compile features
--- Detecting C compile features - done
--- Detecting CXX compiler ABI info
--- Detecting CXX compiler ABI info - done
--- Check for working CXX compiler: /data/data/com.termux/files/usr/bin/clang++ - skipped
--- Detecting CXX compile features
--- Detecting CXX compile features - done
--- Found Git: /data/data/com.termux/files/usr/bin/git (found version "2.49.0")
--- Performing Test CMAKE_HAVE_LIBC_PTHREAD
--- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
--- Check if compiler accepts -pthread
--- Check if compiler accepts -pthread - yes
--- Found Threads: TRUE
--- Found OpenMP_C: -fopenmp=libomp (found version "5.1")
--- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
--- Found OpenMP: TRUE (found version "5.1")
--- OpenMP found
--- Using optimized iqk matrix multiplications
--- Using llamafile
--- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
--- CMAKE_SYSTEM_PROCESSOR: aarch64
--- ARM detected
--- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E
--- Performing Test COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
--- Looking for pthread_create in pthreads
--- Looking for pthread_create in pthreads - not found
--- Looking for pthread_create in pthread
--- Looking for pthread_create in pthread - found
--- Configuring done (12.1s)
--- Generating done (0.4s)
--- Build files have been written to: /data/data/com.termux/files/home/downloads/ik_llama.cpp
-~/downloads/ik_llama.cpp $ make
-[ 6%] Built target ggml
-[ 10%] Built target llama
-[ 11%] Built target build_info
-[ 15%] Built target common
-[ 16%] Linking CXX executable ../bin/test-tokenizer-0
-[ 17%] Built target test-tokenizer-0
-[ 18%] Linking CXX executable ../bin/test-tokenizer-1-bpe
-[ 18%] Built target test-tokenizer-1-bpe
-[ 19%] Linking CXX executable ../bin/test-tokenizer-1-spm
-[ 19%] Built target test-tokenizer-1-spm
-[ 19%] Linking CXX executable ../bin/test-quantize-fns
-[ 20%] Built target test-quantize-fns
-[ 21%] Linking CXX executable ../bin/test-quantize-perf
-[ 22%] Built target test-quantize-perf
-[ 22%] Linking CXX executable ../bin/test-sampling
-[ 23%] Built target test-sampling
-[ 23%] Linking CXX executable ../bin/test-chat-template
-[ 24%] Built target test-chat-template
-[ 24%] Linking CXX executable ../bin/test-grammar-parser
-[ 25%] Built target test-grammar-parser
-[ 26%] Linking CXX executable ../bin/test-llama-grammar
-[ 27%] Built target test-llama-grammar
-[ 28%] Linking CXX executable ../bin/test-grammar-integration
-[ 29%] Built target test-grammar-integration
-[ 30%] Linking CXX executable ../bin/test-grad0
-[ 31%] Built target test-grad0
-[ 31%] Linking CXX executable ../bin/test-backend-ops
-[ 32%] Built target test-backend-ops
-[ 33%] Linking CXX executable ../bin/test-rope
-[ 34%] Built target test-rope
-[ 35%] Linking CXX executable ../bin/test-model-load-cancel
-[ 36%] Built target test-model-load-cancel
-[ 37%] Linking CXX executable ../bin/test-autorelease
-[ 38%] Built target test-autorelease
-[ 38%] Linking CXX executable ../bin/test-json-schema-to-grammar
-[ 40%] Built target test-json-schema-to-grammar
-[ 41%] Linking C executable ../bin/test-c
-[ 42%] Built target test-c
-[ 42%] Linking CXX executable ../../bin/llama-cvector-generator
-[ 43%] Built target llama-cvector-generator
-[ 43%] Linking CXX executable ../../bin/llama-baby-llama
-[ 44%] Built target llama-baby-llama
-[ 44%] Linking CXX executable ../../bin/llama-batched-bench
-[ 45%] Built target llama-batched-bench
-[ 45%] Linking CXX executable ../../bin/llama-batched
-[ 46%] Built target llama-batched
-[ 47%] Linking CXX executable ../../bin/llama-bench-matmult
-[ 47%] Built target llama-bench-matmult
-[ 48%] Linking CXX executable ../../bin/llama-convert-llama2c-to-ggml
-[ 48%] Built target llama-convert-llama2c-to-ggml
-[ 48%] Linking CXX executable ../../bin/llama-embedding
-[ 49%] Built target llama-embedding
-[ 50%] Linking CXX executable ../../bin/llama-eval-callback
-[ 51%] Built target llama-eval-callback
-[ 52%] Linking CXX executable ../../bin/llama-export-lora
-[ 52%] Built target llama-export-lora
-[ 53%] Linking CXX executable ../../bin/llama-gbnf-validator
-[ 53%] Built target llama-gbnf-validator
-[ 54%] Built target sha256
-[ 55%] Built target xxhash
-[ 55%] Built target sha1
-[ 55%] Linking CXX executable ../../bin/llama-gguf-hash
-[ 56%] Built target llama-gguf-hash
-[ 56%] Linking CXX executable ../../bin/llama-gguf-split
-[ 57%] Built target llama-gguf-split
-[ 58%] Linking CXX executable ../../bin/llama-gguf
-[ 58%] Built target llama-gguf
-[ 58%] Linking CXX executable ../../bin/llama-gritlm
-[ 59%] Built target llama-gritlm
-[ 60%] Linking CXX executable ../../bin/llama-imatrix
-[ 61%] Built target llama-imatrix
-[ 62%] Linking CXX executable ../../bin/llama-infill
-[ 62%] Built target llama-infill
-[ 63%] Linking CXX executable ../../bin/llama-bench
-[ 64%] Built target llama-bench
-[ 66%] Built target llava
-[ 67%] Built target llava_static
-[ 67%] Built target llava_shared
-[ 68%] Linking CXX executable ../../bin/llama-llava-cli
-[ 68%] Built target llama-llava-cli
-[ 69%] Linking CXX executable ../../bin/llama-minicpmv-cli
-[ 69%] Built target llama-minicpmv-cli
-[ 70%] Linking CXX executable ../../bin/llama-lookahead
-[ 70%] Built target llama-lookahead
-[ 70%] Linking CXX executable ../../bin/llama-lookup
-[ 71%] Built target llama-lookup
-[ 71%] Linking CXX executable ../../bin/llama-lookup-create
-[ 72%] Built target llama-lookup-create
-[ 72%] Linking CXX executable ../../bin/llama-lookup-merge
-[ 73%] Built target llama-lookup-merge
-[ 74%] Linking CXX executable ../../bin/llama-lookup-stats
-[ 75%] Built target llama-lookup-stats
-[ 76%] Linking CXX executable ../../bin/llama-cli
-[ 76%] Built target llama-cli
-[ 77%] Linking CXX executable ../../bin/llama-parallel
-[ 77%] Built target llama-parallel
-[ 78%] Linking CXX executable ../../bin/llama-passkey
-[ 78%] Built target llama-passkey
-[ 78%] Linking CXX executable ../../bin/llama-perplexity
-[ 79%] Built target llama-perplexity
-[ 80%] Linking CXX executable ../../bin/llama-quantize-stats
-[ 80%] Built target llama-quantize-stats
-[ 81%] Linking CXX executable ../../bin/llama-quantize
-[ 82%] Built target llama-quantize
-[ 83%] Linking CXX executable ../../bin/llama-retrieval
-[ 83%] Built target llama-retrieval
-[ 84%] Linking CXX executable ../../bin/llama-server
-[ 93%] Built target llama-server
-[ 94%] Linking CXX executable ../../bin/llama-save-load-state
-[ 94%] Built target llama-save-load-state
-[ 95%] Linking CXX executable ../../bin/llama-simple
-[ 95%] Built target llama-simple
-[ 96%] Linking CXX executable ../../bin/llama-speculative
-[ 96%] Built target llama-speculative
-[ 96%] Linking CXX executable ../../bin/llama-sweep-bench
-[ 97%] Built target llama-sweep-bench
-[ 97%] Linking CXX executable ../../bin/llama-tokenize
-[ 98%] Built target llama-tokenize
-[ 98%] Linking CXX executable ../../bin/llama-vdot
-[ 99%] Built target llama-vdot
-[ 99%] Linking CXX executable ../../bin/llama-q8dot
-[100%] Built target llama-q8dot
-~/downloads/ik_llama.cpp $ ls bin/
- llama-baby-llama llama-cvector-generator llama-gguf-split llama-lookup-create llama-q8dot llama-speculative test-chat-template test-quantize-fns
- llama-batched llama-embedding llama-gritlm llama-lookup-merge llama-quantize llama-sweep-bench test-grad0 test-quantize-perf
- llama-batched-bench llama-eval-callback llama-imatrix llama-lookup-stats llama-quantize-stats llama-tokenize test-grammar-integration test-rope
- llama-bench llama-export-lora llama-infill llama-minicpmv-cli llama-retrieval llama-vdot test-grammar-parser test-sampling
- llama-bench-matmult llama-gbnf-validator llama-llava-cli llama-parallel llama-save-load-state test-autorelease test-json-schema-to-grammar test-tokenizer-0
- llama-cli llama-gguf llama-lookahead llama-passkey llama-server test-backend-ops test-llama-grammar test-tokenizer-1-bpe
- llama-convert-llama2c-to-ggml llama-gguf-hash llama-lookup llama-perplexity llama-simple test-c test-model-load-cancel test-tokenizer-1-spm
-~/downloads/ik_llama.cpp $ ldd bin/llama-cli
- liblog.so => /system/lib64/liblog.so
- libargp.so => /data/data/com.termux/files/usr/lib/libargp.so
- libllama.so => /data/data/com.termux/files/usr/lib/libllama.so
- libc.so => /system/lib64/libc.so
- libggml.so => /data/data/com.termux/files/usr/lib/libggml.so
- libc++_shared.so => /data/data/com.termux/files/usr/lib/libc++_shared.so
- libdl.so => /system/lib64/libdl.so
- libm.so => /system/lib64/libm.so
- libc++.so => /system/lib64/libc++.so
- ld-android.so => /system/lib64/ld-android.so
- libggml-cpu.so => /data/data/com.termux/files/usr/lib/libggml-cpu.so
- libggml-base.so => /data/data/com.termux/files/usr/lib/libggml-base.so
-~/downloads/ik_llama.cpp $ bin/llama-cli
-CANNOT LINK EXECUTABLE "bin/llama-cli": cannot locate symbol "llama_print_timings" referenced by "/data/data/com.termux/files/home/downloads/ik_llama.cpp/bin/llama-cli"...
-~/downloads/ik_llama.cpp $
-
-```
-
-Only after my trick above it picks up the rigth .so files:
-
-```
-~/downloads/ik_llama.cpp $ cat _path.sh
-export LD_LIBRARY_PATH=$(pwd)/src/:$(pwd)/ggml/src/:$LD_LIBRARY_PATH
-~/downloads/ik_llama.cpp $ source _path.sh
-~/downloads/ik_llama.cpp $ ldd bin/llama-cli
- liblog.so => /system/lib64/liblog.so
- libargp.so => /data/data/com.termux/files/usr/lib/libargp.so
- libllama.so => /data/data/com.termux/files/home/downloads/ik_llama.cpp/src/libllama.so
- libc.so => /system/lib64/libc.so
- libggml.so => /data/data/com.termux/files/home/downloads/ik_llama.cpp/ggml/src/libggml.so
- libc++_shared.so => /data/data/com.termux/files/usr/lib/libc++_shared.so
- libdl.so => /system/lib64/libdl.so
- libm.so => /system/lib64/libm.so
- libc++.so => /system/lib64/libc++.so
- ld-android.so => /system/lib64/ld-android.so
-~/downloads/ik_llama.cpp $
-
-```
-I shall `mv` once again and retry your `"-DGGML_ARCH_FLAGS="-march=armv8.2-a+dotprod+fp16"` ...
-
----
-
-👤 **Manamama** commented the **2025-05-06** at **20:22:09**:
+👤 **Manamama** commented on **2025-05-06** at **20:22:09**
The experiment with the flags (methinks, they should not help here, it is the `rpath` type problem) - sorry for pasting all together - do take a peek at my juggling the LD_LIBRARY_PATH to default there so as to evoke that seg fault at first:
@@ -1096,7 +859,7 @@ make: jobserver mkfifo: /data/local/tmp/GMfifo22430: Permission denied
---
-👤 **saood06** commented the **2025-05-06** at **20:36:26**:
+👤 **saood06** commented on **2025-05-06** at **20:36:26**
>[has not progressed, while clang takes some 12 percent of CPU. ]
@@ -1104,7 +867,7 @@ Are you sure? I remember from when I was testing on Android, building the `iqk`
---
-👤 **Manamama** commented the **2025-05-06** at **20:39:52**:
+👤 **Manamama** commented on **2025-05-06** at **20:39:52**
OK, after probably half an hour (vs the asap compilation without these switches):
@@ -1177,72 +940,23 @@ main: seed = 1746564079
---
-👤 **Manamama** commented the **2025-05-06** at **20:39:52**:
-
-OK, after probably half an hour (vs the asap compilation without these switches):
-
-```
-[ 87%] Linking CXX executable ../../bin/llama-vdot
-[ 88%] Built target llama-sweep-bench
-[ 89%] Built target llama-speculative
-[ 89%] Built target llama-tokenize
-[ 89%] Linking CXX executable ../../bin/llama-q8dot
-[ 90%] Built target llama-vdot
-[ 91%] Built target llama-q8dot
-[100%] Built target llama-server
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $
-~/downloads/ik_llama.cpp $ ldd bin/llama-cli
- liblog.so => /system/lib64/liblog.so
- libargp.so => /data/data/com.termux/files/usr/lib/libargp.so
- libllama.so => /data/data/com.termux/files/usr/lib/libllama.so
- libc.so => /system/lib64/libc.so
- libggml.so => /data/data/com.termux/files/usr/lib/libggml.so
- libc++_shared.so => /data/data/com.termux/files/usr/lib/libc++_shared.so
- libdl.so => /system/lib64/libdl.so
- libm.so => /system/lib64/libm.so
- libc++.so => /system/lib64/libc++.so
- ld-android.so => /system/lib64/ld-android.so
- libggml-cpu.so => /data/data/com.termux/files/usr/lib/libggml-cpu.so
- libggml-base.so => /data/data/com.termux/files/usr/lib/libggml-base.so
-~/downloads/ik_llama.cpp $ bin/llama-cli
-CANNOT LINK EXECUTABLE "bin/llama-cli": cannot locate symbol "llama_print_timings" referenced by "/data/data/com.termux/files/home/downloads/ik_llama.cpp/bin/llama-cli"...
-~/downloads/ik_llama.cpp $
-
-```
-
-So `rpath` like is needed (or my ugly trick).
-
----
-
-👤 **ikawrakow** commented the **2025-05-07** at **07:00:41**:
+👤 **ikawrakow** commented on **2025-05-07** at **07:00:41**
> OK, after probably half an hour (vs the asap compilation without these switches):
-ASAP compilation means the resulting build is useless. The `iqk_mul_mat.cpp` file that takes a very long time to compile is 18,000 lines of heavily templated C++ code, so yes, it takes a long time to compile. There is issue #183 precisely because of that.
+ASAP compilation means the resulting build is useless. The `iqk_mul_mat.cpp` file that takes a very long time to compile is 18,000 lines of heavily templated C++ code, so yes, it takes a long time to compile. There is issue [#183](https://github.com/ikawrakow/ik_llama.cpp/issues/183) precisely because of that.
Concerning the clash with mainline `llama.cpp`: OK, so this project does not consider the possibility of having mainline installed to a system-wide directory, and then trying to use `ik_llama.cpp` built in a user folder. So, yes, you need to use something like `LD_LIBRARY_PATH` to have the user build directory searched first.
---
-👤 **ikawrakow** commented the **2025-05-25** at **07:09:05**:
+👤 **ikawrakow** commented on **2025-05-25** at **07:09:05**
I don't think we will be solving this one.
---
-👤 **Manamama** commented the **2025-05-25** at **18:13:04**:
+👤 **Manamama** commented on **2025-05-25** at **18:13:04**
Note to self:
```
diff --git a/github-data/issues/389 - Bug_ llama-batched-bench crashed with batch size _2.md b/github-data/issues/389 - Bug llama-batched-bench crashed with batch size 2.md
similarity index 61%
rename from github-data/issues/389 - Bug_ llama-batched-bench crashed with batch size _2.md
rename to github-data/issues/389 - Bug llama-batched-bench crashed with batch size 2.md
index 559276fa4..82e56b1cf 100644
--- a/github-data/issues/389 - Bug_ llama-batched-bench crashed with batch size _2.md
+++ b/github-data/issues/389 - Bug llama-batched-bench crashed with batch size 2.md
@@ -1,4 +1,4 @@
-### 🐛 [#389](https://github.com/ikawrakow/ik_llama.cpp/issues/389) - Bug: llama-batched-bench crashed with batch size >2
+## 📌 [Issue #389](https://github.com/ikawrakow/ik_llama.cpp/issues/389) - Bug: llama-batched-bench crashed with batch size >2
| **Author** | `QuPengfei` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -260,15 +260,15 @@ Aborted (core dumped)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-07** at **05:21:57**:
+👤 **ikawrakow** commented on **2025-05-07** at **05:21:57**
This assert almost always indicates a NaN somewhere in the calculation. What happens if you remove `-amb 1 -ser 7,1 -mla 1`
---
-👤 **QuPengfei** commented the **2025-05-07** at **06:58:07**:
+👤 **QuPengfei** commented on **2025-05-07** at **06:58:07**
Just confirmed, this happened with -ser 7,1.
@@ -283,7 +283,7 @@ Pengfei
---
-👤 **ikawrakow** commented the **2025-05-07** at **07:04:50**:
+👤 **ikawrakow** commented on **2025-05-07** at **07:04:50**
Try building with BLAS disabled. I expect this to improve performance quite a bit.
@@ -291,7 +291,7 @@ I'll have to investigate why `-ser 7,1` leads to a problem. Normally it should w
---
-👤 **QuPengfei** commented the **2025-05-07** at **13:04:45**:
+👤 **QuPengfei** commented on **2025-05-07** at **13:04:45**
@ikawrakow
@@ -520,7 +520,7 @@ main: n_kv_max = 8192, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_sha
/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
+/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: OMP: Warning [#191](https://github.com/ikawrakow/ik_llama.cpp/issues/191): Forking a process while a parallel region is active is potentially unsafe.
/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
GGML_ASSERT(fms.S[j] > 0) failed
@@ -645,390 +645,12 @@ GGML_ASSERT(fms.S[j] > 0) failed
/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
-OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
+OMP: Warning [#191](https://github.com/ikawrakow/ik_llama.cpp/issues/191): Forking a process while a parallel region is active is potentially unsafe.
+OMP: Warning [#191](https://github.com/ikawrakow/ik_llama.cpp/issues/191): Forking a process while a parallel region is active is potentially unsafe.
libggml.so(+0x221ab)[0x77d53049d1ab]
libggml.so(ggml_abort+0x15e)[0x77d53049f76e]
libggml.so(+0x1c1217)[0x77d53063c217]
-OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
-libggml.so(+0x1caef9)[0x77d530645ef9]
-libggml.so(+0x96ff2f)[0x77d530deaf2f]
-libggml.so(+0xc4787f)[0x77d5310c287f]
-libggml.so(_Z19iqk_flash_attn_impliiiiiiiiiiiPKfPKvS2_S2_ffPfS3_S3_+0x74b)[0x77d5310d275b]
-libggml.so(iqk_flash_attn_noalibi+0xa70)[0x77d5310d3760]
-libggml.so(+0x2dee0)[0x77d5304a8ee0]
-libggml.so(+0x61f52)[0x77d5304dcf52]
-libggml.so(+0x636bc)[0x77d5304de6bc]
-libggml.so(+0x638a9)[0x77d5304de8a9]
-/usr/local/lib/libiomp5.so(+0xa942b)[0x77d5314a942b]
-/usr/local/lib/libiomp5.so(__kmp_invoke_microtask+0x93)[0x77d531545603]
-/usr/local/lib/libiomp5.so(+0xca633)[0x77d5314ca633]
-/usr/local/lib/libiomp5.so(+0xc90ae)[0x77d5314c90ae]
-/usr/local/lib/libiomp5.so(+0x146c21)[0x77d531546c21]
-/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x77d5300baac3]
-/lib/x86_64-linux-gnu/libc.so.6(+0x126850)[0x77d53014c850]
-Aborted (core dumped)
-
----
-
-👤 **QuPengfei** commented the **2025-05-07** at **13:04:45**:
-
-@ikawrakow
-
-i see the similar issue on the DeepSeek-R1-Q4_K_M
-
-here are observation with different runs:
-- if run with --cache-type-k q4_0, bs1 got lower performance and bs2 performance is back.
-
-
-
-- if run with --cache-type-k q8_0, bs1 performance is normal but failed when bs > 2
-- if remove -ser 7,1 , performance will be very low.
-
-here is command and log:
-====
-numactl -m 1 -C 128-255 ./llama-batched-bench -m /models1/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf -c 8192 -b 2048 -ub 512 -ngl 0 -npp 128 -ntg 128 -npl 1,2,4,8 --cache-type-k q8_0 --numa numactl --threads 64 --threads-batch 128 -fa -fmoe -amb 1 -ser 7,1 -mla 0 --no-mmap
-warning: not compiled with GPU offload support, --gpu-layers option will be ignored
-warning: see main README.md for information on enabling GPU BLAS support
-WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
-llama_model_loader: additional 8 GGUFs metadata loaded.
-llama_model_loader: loaded meta data with 48 key-value pairs and 1025 tensors from /models1/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = deepseek2
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16
-llama_model_loader: - kv 3: general.quantized_by str = Unsloth
-llama_model_loader: - kv 4: general.size_label str = 256x20B
-llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
-llama_model_loader: - kv 6: deepseek2.block_count u32 = 61
-llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840
-llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168
-llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432
-llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128
-llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128
-llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000
-llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8
-llama_model_loader: - kv 15: deepseek2.leading_dense_block_count u32 = 3
-llama_model_loader: - kv 16: deepseek2.vocab_size u32 = 129280
-llama_model_loader: - kv 17: deepseek2.attention.q_lora_rank u32 = 1536
-llama_model_loader: - kv 18: deepseek2.attention.kv_lora_rank u32 = 512
-llama_model_loader: - kv 19: deepseek2.attention.key_length u32 = 192
-llama_model_loader: - kv 20: deepseek2.attention.value_length u32 = 128
-llama_model_loader: - kv 21: deepseek2.expert_feed_forward_length u32 = 2048
-llama_model_loader: - kv 22: deepseek2.expert_count u32 = 256
-llama_model_loader: - kv 23: deepseek2.expert_shared_count u32 = 1
-llama_model_loader: - kv 24: deepseek2.expert_weights_scale f32 = 2.500000
-llama_model_loader: - kv 25: deepseek2.expert_weights_norm bool = true
-llama_model_loader: - kv 26: deepseek2.expert_gating_func u32 = 2
-llama_model_loader: - kv 27: deepseek2.rope.dimension_count u32 = 64
-llama_model_loader: - kv 28: deepseek2.rope.scaling.type str = yarn
-llama_model_loader: - kv 29: deepseek2.rope.scaling.factor f32 = 40.000000
-llama_model_loader: - kv 30: deepseek2.rope.scaling.original_context_length u32 = 4096
-llama_model_loader: - kv 31: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
-llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 33: tokenizer.ggml.pre str = deepseek-v3
-llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<▒...
-llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
-llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
-llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 0
-llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 1
-llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 128815
-llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 41: tokenizer.ggml.add_eos_token bool = false
-llama_model_loader: - kv 42: tokenizer.chat_template str = {% if not add_generation_prompt is de...
-llama_model_loader: - kv 43: general.quantization_version u32 = 2
-llama_model_loader: - kv 44: general.file_type u32 = 15
-llama_model_loader: - kv 45: split.no u16 = 0
-llama_model_loader: - kv 46: split.tensors.count i32 = 1025
-llama_model_loader: - kv 47: split.count u16 = 9
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q4_K: 606 tensors
-llama_model_loader: - type q6_K: 58 tensors
-llm_load_vocab: special tokens cache size = 819
-llm_load_vocab: token to piece cache size = 0.8223 MB
-llm_load_print_meta: format = GGUF V3 (latest)
-llm_load_print_meta: arch = deepseek2
-llm_load_print_meta: vocab type = BPE
-llm_load_print_meta: n_vocab = 129280
-llm_load_print_meta: n_merges = 127741
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 163840
-llm_load_print_meta: n_embd = 7168
-llm_load_print_meta: n_layer = 61
-llm_load_print_meta: n_head = 128
-llm_load_print_meta: n_head_kv = 128
-llm_load_print_meta: n_rot = 64
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_swa_pattern = 1
-llm_load_print_meta: n_embd_head_k = 192
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 1
-llm_load_print_meta: n_embd_k_gqa = 24576
-llm_load_print_meta: n_embd_v_gqa = 16384
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-06
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 18432
-llm_load_print_meta: n_expert = 256
-llm_load_print_meta: n_expert_used = 8
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 0
-llm_load_print_meta: rope scaling = yarn
-llm_load_print_meta: freq_base_train = 10000.0
-llm_load_print_meta: freq_scale_train = 0.025
-llm_load_print_meta: n_ctx_orig_yarn = 4096
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = Q4_K - Medium
-llm_load_print_meta: model params = 671.026 B
-llm_load_print_meta: model size = 376.650 GiB (4.822 BPW)
-llm_load_print_meta: repeating layers = 375.457 GiB (4.820 BPW, 669.173 B parameters)
-llm_load_print_meta: general.name = DeepSeek R1 BF16
-llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
-llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: PAD token = 128815 '<|PAD▁TOKEN|>'
-llm_load_print_meta: LF token = 131 'Ä'
-llm_load_print_meta: max token length = 256
-llm_load_print_meta: n_layer_dense_lead = 3
-llm_load_print_meta: n_lora_q = 1536
-llm_load_print_meta: n_lora_kv = 512
-llm_load_print_meta: n_ff_exp = 2048
-llm_load_print_meta: n_expert_shared = 1
-llm_load_print_meta: expert_weights_scale = 2.5
-llm_load_print_meta: expert_weights_norm = 1
-llm_load_print_meta: expert_gating_func = sigmoid
-llm_load_print_meta: rope_yarn_log_mul = 0.1000
-llm_load_tensors: ggml ctx size = 0.42 MiB
-llm_load_tensors: CPU buffer size = 385689.63 MiB
-....................................................................................................
-============ llm_load_tensors: need to compute 61 wk_b tensors
-Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.27.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.52.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
-llama_new_context_with_model: n_ctx = 8192
-llama_new_context_with_model: n_batch = 2048
-llama_new_context_with_model: n_ubatch = 512
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 0
-llama_new_context_with_model: attn_max_b = 1
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = 7, 1
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: CPU KV buffer size = 28060.00 MiB
-llama_new_context_with_model: KV self size = 28060.00 MiB, K (q8_0): 12444.00 MiB, V (f16): 15616.00 MiB
-llama_new_context_with_model: CPU output buffer size = 3.95 MiB
-llama_new_context_with_model: CPU compute buffer size = 266.50 MiB
-llama_new_context_with_model: graph nodes = 3365
-llama_new_context_with_model: graph splits = 1
-
-main: n_kv_max = 8192, n_batch = 2048, n_ubatch = 512, flash_attn = 1, is_pp_shared = 0, n_gpu_layers = 0, n_threads = 64, n_threads_batch = 128
-
-| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
-|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
-| 128 | 128 | 1 | 256 | 1.560 | 82.05 | 10.533 | 12.15 | 12.094 | 21.17 |
-| 128 | 128 | 2 | 512 | 2.663 | 96.14 | 9.856 | 25.97 | 12.519 | 40.90 |
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: /app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: /app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: /app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: /app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: /app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: /app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: /app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: /app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: /app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failedGGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-GGML_ASSERT(fms.S[j] > 0) failed
-GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-GGML_ASSERT(fms.S[j] > 0) failed
-GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-
-GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: /app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-GGML_ASSERT(fms.S[j] > 0) failed
-
-
-
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-
-
-
-
-/app/llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:16600: GGML_ASSERT(fms.S[j] > 0) failed
-OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
-OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
-libggml.so(+0x221ab)[0x77d53049d1ab]
-libggml.so(ggml_abort+0x15e)[0x77d53049f76e]
-libggml.so(+0x1c1217)[0x77d53063c217]
-OMP: Warning #191: Forking a process while a parallel region is active is potentially unsafe.
+OMP: Warning [#191](https://github.com/ikawrakow/ik_llama.cpp/issues/191): Forking a process while a parallel region is active is potentially unsafe.
libggml.so(+0x1caef9)[0x77d530645ef9]
libggml.so(+0x96ff2f)[0x77d530deaf2f]
libggml.so(+0xc4787f)[0x77d5310c287f]
@@ -1049,13 +671,13 @@ Aborted (core dumped)
---
-👤 **saood06** commented the **2025-05-16** at **11:09:52**:
+👤 **saood06** commented on **2025-05-16** at **11:09:52**
-Now that SER has been fixed (#404 #415 #416) can you try again?
+Now that SER has been fixed ([#404](https://github.com/ikawrakow/ik_llama.cpp/issues/404) [#415](https://github.com/ikawrakow/ik_llama.cpp/issues/415) [#416](https://github.com/ikawrakow/ik_llama.cpp/issues/416)) can you try again?
---
-👤 **QuPengfei** commented the **2025-05-21** at **01:20:24**:
+👤 **QuPengfei** commented on **2025-05-21** at **01:20:24**
thanks. it worked now.
diff --git a/github-data/issues/398 - Bug_ -fmoe causing illegal memory access.md b/github-data/issues/398 - Bug -fmoe causing illegal memory access.md
similarity index 85%
rename from github-data/issues/398 - Bug_ -fmoe causing illegal memory access.md
rename to github-data/issues/398 - Bug -fmoe causing illegal memory access.md
index 88d740659..95f89837e 100644
--- a/github-data/issues/398 - Bug_ -fmoe causing illegal memory access.md
+++ b/github-data/issues/398 - Bug -fmoe causing illegal memory access.md
@@ -1,4 +1,4 @@
-### 🐛 [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398) - Bug: -fmoe causing illegal memory access
+## 📌 [Issue #398](https://github.com/ikawrakow/ik_llama.cpp/issues/398) - Bug: -fmoe causing illegal memory access
| **Author** | `pt13762104` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -192,22 +192,22 @@ CUDA error: an illegal memory access was encountered
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-08** at **11:11:23**:
+👤 **ikawrakow** commented on **2025-05-08** at **11:11:23**
Can you add the command line you used? Thanks.
---
-👤 **pt13762104** commented the **2025-05-08** at **14:15:50**:
+👤 **pt13762104** commented on **2025-05-08** at **14:15:50**
`ik_llama.cpp/build/bin/llama-server -m /root/Qwen3-30B-A3B-UD-Q4_K_XL.gguf -c 32768 -fmoe -fa -ngl 99`
It starts to do this in 2-3 prompts. Maybe it's related to the fact that the T4 doesn't have BF16 capability?
---
-👤 **ikawrakow** commented the **2025-05-08** at **14:42:29**:
+👤 **ikawrakow** commented on **2025-05-08** at **14:42:29**
It is more likely due to a bug that shows up in a multi-GPU setup that I cannot debug because I only have a single GPU.
@@ -225,7 +225,7 @@ to put the first 30 layers on the first GPU and everything else on the CPU.
---
-👤 **pt13762104** commented the **2025-05-09** at **01:35:39**:
+👤 **pt13762104** commented on **2025-05-09** at **01:35:39**
I can't even try this:
```
@@ -872,31 +872,25 @@ munmap_chunk(): invalid pointer # could be free() or it just disappears
---
-👤 **pt13762104** commented the **2025-05-09** at **01:36:06**:
+👤 **pt13762104** commented on **2025-05-09** at **01:36:06**
Removing `.*=CUDA0` fixed that
---
-👤 **pt13762104** commented the **2025-05-09** at **01:36:06**:
-
-Let me try IQ4_K model instead.
-
----
-
-👤 **pt13762104** commented the **2025-05-09** at **01:59:34**:
+👤 **pt13762104** commented on **2025-05-09** at **01:59:34**
@ikawrakow I haven't found issues while using -fmoe on 1 GPU. It seems like a multi-GPU issue, given that the error always occur on device 1. The IQ4_K model doesn't seem to run into this bug.
---
-👤 **Ph0rk0z** commented the **2025-05-09** at **11:52:43**:
+👤 **Ph0rk0z** commented on **2025-05-09** at **11:52:43**
I'm not sure how it is done here but afaik, real cudaMemcpyAsync is not supported on SM75.
---
-👤 **schynce** commented the **2025-05-12** at **18:47:03**:
+👤 **schynce** commented on **2025-05-12** at **18:47:03**
Hey @ikawrakow and @pt13762104,
@@ -967,13 +961,13 @@ I would be happy to provide logs or test specific configurations to help debug t
---
-👤 **Ph0rk0z** commented the **2025-05-13** at **11:51:23**:
+👤 **Ph0rk0z** commented on **2025-05-13** at **11:51:23**
Oh snap.. that's the FA error?! Try without flash attention and see if it still crashes.
---
-👤 **ikawrakow** commented the **2025-05-13** at **12:33:36**:
+👤 **ikawrakow** commented on **2025-05-13** at **12:33:36**
> Only the mix-IQ3_K seems to be working without crashing (and it is a ik_llama.cpp specific). The crash happens regardless of -fmoe. I can run the mix-IQ3_K quant with -fmoe without problems, like this:
@@ -983,7 +977,7 @@ The problem is that I cannot trigger the bug on my single-GPU system. I need to
---
-👤 **schynce** commented the **2025-05-13** at **22:33:11**:
+👤 **schynce** commented on **2025-05-13** at **22:33:11**
> Oh snap.. that's the FA error?! Try without flash attention and see if it still crashes.
@@ -1004,7 +998,7 @@ Long prompts seemed to reliably crash it before with flash attention. So, I ran
---
-👤 **Panchovix** commented the **2025-05-14** at **16:32:23**:
+👤 **Panchovix** commented on **2025-05-14** at **16:32:23**
Just chiming in, I get a CUDA illegal memory access when using -fmoe on DeepSeekV3 0324
@@ -1111,221 +1105,7 @@ Not using -fmoe makes it work without issues.
---
-👤 **Panchovix** commented the **2025-05-14** at **16:32:23**:
-
-Just chiming in, I get a CUDA illegal memory access when using -fmoe on DeepSeekV3 0324
-
-```
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: CUDA0 KV buffer size = 468.00 MiB
-llama_kv_cache_init: CUDA1 KV buffer size = 360.00 MiB
-llama_kv_cache_init: CUDA2 KV buffer size = 360.00 MiB
-llama_kv_cache_init: CUDA3 KV buffer size = 360.00 MiB
-llama_kv_cache_init: CUDA4 KV buffer size = 648.00 MiB
-llama_new_context_with_model: KV self size = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
-llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
-llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
-llama_new_context_with_model: CUDA0 compute buffer size = 3520.01 MiB
-llama_new_context_with_model: CUDA1 compute buffer size = 1540.01 MiB
-llama_new_context_with_model: CUDA2 compute buffer size = 1540.01 MiB
-llama_new_context_with_model: CUDA3 compute buffer size = 1540.01 MiB
-llama_new_context_with_model: CUDA4 compute buffer size = 1540.02 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 312.02 MiB
-llama_new_context_with_model: graph nodes = 3304
-llama_new_context_with_model: graph splits = 393
-INFO [ init] initializing slots | tid="140562497785856" timestamp=1747239254 n_slots=1
-INFO [ init] new slot | tid="140562497785856" timestamp=1747239254 id_slot=0 n_ctx_slot=32768
-INFO [ main] model loaded | tid="140562497785856" timestamp=1747239254
-INFO [ main] chat template | tid="140562497785856" timestamp=1747239254 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-INFO [ main] HTTP server listening | tid="140562497785856" timestamp=1747239254 n_threads_http="15" port="8080" hostname="127.0.0.1"
-INFO [ update_slots] all slots are idle | tid="140562497785856" timestamp=1747239254
-INFO [ launch_slot_with_task] slot is processing task | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0 p0=0
-CUDA error: an illegal memory access was encountered
- current device: 0, in function ggml_cuda_op_mul_mat at /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:1743
- cudaGetLastError()
-/run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
-[New LWP 25355]
-[New LWP 25354]
-[New LWP 25353]
-[New LWP 25352]
-[New LWP 25351]
-[New LWP 25350]
-[New LWP 25349]
-[New LWP 25348]
-[New LWP 25347]
-[New LWP 25346]
-[New LWP 25345]
-[New LWP 25344]
-[New LWP 25343]
-[New LWP 25342]
-[New LWP 25341]
-[New LWP 25340]
-[New LWP 24655]
-[New LWP 24654]
-[New LWP 24653]
-[New LWP 24652]
-[New LWP 24651]
-[New LWP 24650]
-[New LWP 24649]
-[New LWP 23954]
-[New LWP 23953]
-[New LWP 23952]
-[New LWP 23951]
-[New LWP 23950]
-[New LWP 23949]
-[New LWP 23948]
-[New LWP 23947]
-[New LWP 23942]
-[New LWP 23941]
-[New LWP 23940]
-
-This GDB supports auto-downloading debuginfo from the following URLs:
-
-Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
-Debuginfod has been disabled.
-To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
-Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
-Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
-Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
-Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
-[Thread debugging using libthread_db enabled]
-Using host libthread_db library "/lib64/libthread_db.so.1".
-0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
-#0 0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
-#1 0x00007fd73d07b9da in __internal_syscall_cancel () from /lib64/libc.so.6
-#2 0x00007fd73d07ba24 in __syscall_cancel () from /lib64/libc.so.6
-#3 0x00007fd73d0eb5af in wait4 () from /lib64/libc.so.6
-#4 0x00007fd741c58908 in ggml_abort () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
-#5 0x00007fd741dded43 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
-#6 0x00007fd741decb09 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.1] () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
-#7 0x00007fd741df42dd in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
-#8 0x00007fd741caf9b3 in ggml_backend_sched_graph_compute_async () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
-#9 0x00007fd79656af1a in llama_decode () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/src/libllama.so
-#10 0x000000000049a2d4 in server_context::update_slots() ()
-#11 0x000000000046cafc in server_queue::start_loop() ()
-#12 0x0000000000416977 in main ()
-[Inferior 1 (process 23939) detached]
-```
-
-Ran it with
-
-```
-./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-00001-of-00007.gguf' -c 32768 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" -ot "blk.(7|8|9|10).ffn.=CUDA1" -ot "blk.(11|12|13|14).ffn.=CUDA2" -ot "blk.(15|16|17).ffn.=CUDA3" -ot "blk.(18|19|20|21|22|23|24|25).ffn.=CUDA4" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048 -mla 1
-```
-
-Not using -fmoe makes it work without issues.
-
----
-
-👤 **p4s2wd** commented the **2025-05-15** at **00:13:20**:
-
-> 顺便说一下,我在 DeepSeekV3 0324 上使用 -fmoe 时遇到了 CUDA 非法内存访问
->
-> ```
-> llama_new_context_with_model: freq_scale = 0.025
-> llama_kv_cache_init: CUDA0 KV buffer size = 468.00 MiB
-> llama_kv_cache_init: CUDA1 KV buffer size = 360.00 MiB
-> llama_kv_cache_init: CUDA2 KV buffer size = 360.00 MiB
-> llama_kv_cache_init: CUDA3 KV buffer size = 360.00 MiB
-> llama_kv_cache_init: CUDA4 KV buffer size = 648.00 MiB
-> llama_new_context_with_model: KV self size = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
-> llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
-> llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
-> llama_new_context_with_model: CUDA0 compute buffer size = 3520.01 MiB
-> llama_new_context_with_model: CUDA1 compute buffer size = 1540.01 MiB
-> llama_new_context_with_model: CUDA2 compute buffer size = 1540.01 MiB
-> llama_new_context_with_model: CUDA3 compute buffer size = 1540.01 MiB
-> llama_new_context_with_model: CUDA4 compute buffer size = 1540.02 MiB
-> llama_new_context_with_model: CUDA_Host compute buffer size = 312.02 MiB
-> llama_new_context_with_model: graph nodes = 3304
-> llama_new_context_with_model: graph splits = 393
-> INFO [ init] initializing slots | tid="140562497785856" timestamp=1747239254 n_slots=1
-> INFO [ init] new slot | tid="140562497785856" timestamp=1747239254 id_slot=0 n_ctx_slot=32768
-> INFO [ main] model loaded | tid="140562497785856" timestamp=1747239254
-> INFO [ main] chat template | tid="140562497785856" timestamp=1747239254 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-> INFO [ main] HTTP server listening | tid="140562497785856" timestamp=1747239254 n_threads_http="15" port="8080" hostname="127.0.0.1"
-> INFO [ update_slots] all slots are idle | tid="140562497785856" timestamp=1747239254
-> INFO [ launch_slot_with_task] slot is processing task | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0
-> INFO [ update_slots] kv cache rm [p0, end) | tid="140562497785856" timestamp=1747239313 id_slot=0 id_task=0 p0=0
-> CUDA error: an illegal memory access was encountered
-> current device: 0, in function ggml_cuda_op_mul_mat at /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:1743
-> cudaGetLastError()
-> /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
-> [New LWP 25355]
-> [New LWP 25354]
-> [New LWP 25353]
-> [New LWP 25352]
-> [New LWP 25351]
-> [New LWP 25350]
-> [New LWP 25349]
-> [New LWP 25348]
-> [New LWP 25347]
-> [New LWP 25346]
-> [New LWP 25345]
-> [New LWP 25344]
-> [New LWP 25343]
-> [New LWP 25342]
-> [New LWP 25341]
-> [New LWP 25340]
-> [New LWP 24655]
-> [New LWP 24654]
-> [New LWP 24653]
-> [New LWP 24652]
-> [New LWP 24651]
-> [New LWP 24650]
-> [New LWP 24649]
-> [New LWP 23954]
-> [New LWP 23953]
-> [New LWP 23952]
-> [New LWP 23951]
-> [New LWP 23950]
-> [New LWP 23949]
-> [New LWP 23948]
-> [New LWP 23947]
-> [New LWP 23942]
-> [New LWP 23941]
-> [New LWP 23940]
->
-> This GDB supports auto-downloading debuginfo from the following URLs:
->
-> Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
-> Debuginfod has been disabled.
-> To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
-> Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
-> Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
-> Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
-> Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
-> [Thread debugging using libthread_db enabled]
-> Using host libthread_db library "/lib64/libthread_db.so.1".
-> 0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
-> #0 0x00007fd73d0876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
-> #1 0x00007fd73d07b9da in __internal_syscall_cancel () from /lib64/libc.so.6
-> #2 0x00007fd73d07ba24 in __syscall_cancel () from /lib64/libc.so.6
-> #3 0x00007fd73d0eb5af in wait4 () from /lib64/libc.so.6
-> #4 0x00007fd741c58908 in ggml_abort () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
-> #5 0x00007fd741dded43 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
-> #6 0x00007fd741decb09 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.1] () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
-> #7 0x00007fd741df42dd in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
-> #8 0x00007fd741caf9b3 in ggml_backend_sched_graph_compute_async () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/ggml/src/libggml.so
-> #9 0x00007fd79656af1a in llama_decode () from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/lenux/src/libllama.so
-> #10 0x000000000049a2d4 in server_context::update_slots() ()
-> #11 0x000000000046cafc in server_queue::start_loop() ()
-> #12 0x0000000000416977 in main ()
-> [Inferior 1 (process 23939) detached]
-> ```
->
-> 运行它
->
-> ```
-> ./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-00001-of-00007.gguf' -c 32768 --no-mmap -ngl 999 -ot "blk.(0|1|2|3|4|5|6).ffn.=CUDA0" -ot "blk.(7|8|9|10).ffn.=CUDA1" -ot "blk.(11|12|13|14).ffn.=CUDA2" -ot "blk.(15|16|17).ffn.=CUDA3" -ot "blk.(18|19|20|21|22|23|24|25).ffn.=CUDA4" -ot "ffn.*=CPU" -fa -mg 0 -ub 2048 -mla 1 -fmoe
-> ```
->
-> 不使用 -fm
-
----
-
-👤 **p4s2wd** commented the **2025-05-15** at **00:21:27**:
+👤 **p4s2wd** commented on **2025-05-15** at **00:21:27**
> Just chiming in, I get a CUDA illegal memory access when using -fmoe on DeepSeekV3 0324
>
@@ -1434,7 +1214,7 @@ As you're using GPU+CPU, please try to replace "-mla 1" with "-mla 2".
---
-👤 **ikawrakow** commented the **2025-05-15** at **04:35:23**:
+👤 **ikawrakow** commented on **2025-05-15** at **04:35:23**
> As you're using GPU+CPU, please try to replace "-mla 1" with "-mla 2".
@@ -1444,7 +1224,7 @@ Concerning the error, it is not triggered in a function related to `-fmoe`, so I
---
-👤 **Panchovix** commented the **2025-05-15** at **22:22:06**:
+👤 **Panchovix** commented on **2025-05-15** at **22:22:06**
Okay tested again, after updating and rebooting Fedora and now -fmoe works fine with MLA 1 + FA on CUDA+CPU (I use it like to save vram on compute buffers)
@@ -1452,7 +1232,7 @@ Not sure exactly what would have causes the issue.
---
-👤 **schynce** commented the **2025-05-15** at **22:32:20**:
+👤 **schynce** commented on **2025-05-15** at **22:32:20**
> Okay tested again, after updating and rebooting Fedora and now -fmoe works fine with MLA 1 + FA on CUDA+CPU (I use it like to save vram on compute buffers)
>
@@ -1462,19 +1242,13 @@ Are you sure that it is actually fixed? I am asking because I had some commands
---
-👤 **Panchovix** commented the **2025-05-15** at **22:45:52**:
+👤 **Panchovix** commented on **2025-05-15** at **22:45:52**
@schynce you're correct, tried a few more and it got the illegal memory access again.
---
-👤 **Panchovix** commented the **2025-05-15** at **22:45:52**:
-
-@schynce you're correct, tried a few more it got the illegal memory access.
-
----
-
-👤 **divine-taco** commented the **2025-05-19** at **23:10:44**:
+👤 **divine-taco** commented on **2025-05-19** at **23:10:44**
Another data point. I'm not entirely sure `-fmoe` is the problem here. This is running multi gpu (3090) with cpu offload.
@@ -1497,34 +1271,15 @@ Suspect https://github.com/ikawrakow/ik_llama.cpp/issues/425 may be the same iss
---
-👤 **divine-taco** commented the **2025-05-19** at **23:10:44**:
-
-Another data point. I'm not entirely sure `-fmoe` is the problem here. This is running multi gpu (3090) with cpu offload.
-
-I can also report that it is rare for the crash to occur immediately. It's usually after a handful of turns.
-
-Note this seems this a recently introduced bug:
-`-fmoe -mla 2` does not crash on 6c23618ca5d680bd00f06a143dc4a1b386c827e3
-
-It stopped working somewhen after this.
-`-fmoe -mla 2` is broken for 2ec2229f2e9847d4e96bd7f163201810c8f8299a
-
-`-mla 2` without fmoe is also broken for 2ec2229f2e9847d4e96bd7f163201810c8f8299a
-
-If I get some time this week I'll try to isolate when the bug was introduced.
-Probably worth someone else trying `6c23618ca5d680bd00f06a143dc4a1b386c827e3` to confirm this is the same issue everyone seems to be running into with multi gpu.
-
----
-
-👤 **ikawrakow** commented the **2025-05-20** at **04:34:00**:
+👤 **ikawrakow** commented on **2025-05-20** at **04:34:00**
@divine-taco It would be useful to share your command line when reporting a problem.
-The most significant change between https://github.com/ikawrakow/ik_llama.cpp/commit/6c23618ca5d680bd00f06a143dc4a1b386c827e3 and https://github.com/ikawrakow/ik_llama.cpp/commit/2ec2229f2e9847d4e96bd7f163201810c8f8299a is PR #405. Prior to this PR the fused `ffn_up/ffn_gate` operation was not offloaded to the GPU if the tensors were on the CPU. After #405 the op is offloaded. You can disable that and restore the behavior prior to #405 using `-op 29,0`. Can you try that? Thanks.
+The most significant change between https://github.com/ikawrakow/ik_llama.cpp/commit/6c23618ca5d680bd00f06a143dc4a1b386c827e3 and https://github.com/ikawrakow/ik_llama.cpp/commit/2ec2229f2e9847d4e96bd7f163201810c8f8299a is PR [#405](https://github.com/ikawrakow/ik_llama.cpp/issues/405). Prior to this PR the fused `ffn_up/ffn_gate` operation was not offloaded to the GPU if the tensors were on the CPU. After [#405](https://github.com/ikawrakow/ik_llama.cpp/issues/405) the op is offloaded. You can disable that and restore the behavior prior to [#405](https://github.com/ikawrakow/ik_llama.cpp/issues/405) using `-op 29,0`. Can you try that? Thanks.
---
-👤 **divine-taco** commented the **2025-05-20** at **05:56:42**:
+👤 **divine-taco** commented on **2025-05-20** at **05:56:42**
~~@ikawrakow `-op 29,0` seems to fix the issues running with the latest commit - 2ec2229f2e9847d4e96bd7f163201810c8f8299a~~
@@ -1559,53 +1314,23 @@ CUDA error: an illegal memory access was encountered
---
-👤 **divine-taco** commented the **2025-05-20** at **05:56:42**:
-
-@ikawrakow `-op 29,0` seems to fix the issues running with the latest commit - 2ec2229f2e9847d4e96bd7f163201810c8f8299a
-
-Full command:
-
-```
-llama-server \
- --parallel 1 \
- -ctk f16 -ctv f16 \
- -ts 17,17,17,17,17,17,17,17,17 \
- --model /home/mx01/DeepSeek-V3-0324-GGUF-Q8_0 --host 0.0.0.0 --port 8080 \
- --ctx-size 44000 \
- -fmoe -rtr -mla 3 -fa \
- -b 2048 -ub 2048 -amb 512 \
- -op 29,0 \
- --no-mmap \
- --threads 64 --threads-batch 64 \
- -ngl 99 \
- -ot exps=CPU
-```
-
----
-
-👤 **schynce** commented the **2025-05-20** at **13:44:34**:
+👤 **schynce** commented on **2025-05-20** at **13:44:34**
For me, the best way to trigger the bug quickly is to dump in a 30K token prompt. It seems to crash during the prompt processing or before generating a single token.
---
-👤 **schynce** commented the **2025-05-20** at **13:44:34**:
-
-For me, the best way to trigger the bug quickly is to dump in a 30K token prompt. It seems to crash during the prompt processing.
-
----
-
-👤 **ikawrakow** commented the **2025-05-20** at **14:23:18**:
+👤 **ikawrakow** commented on **2025-05-20** at **14:23:18**
-Does PR #438 help?
+Does PR [#438](https://github.com/ikawrakow/ik_llama.cpp/issues/438) help?
---
-👤 **schynce** commented the **2025-05-20** at **15:58:47**:
+👤 **schynce** commented on **2025-05-20** at **15:58:47**
-> Does PR [#438](https://github.com/ikawrakow/ik_llama.cpp/pull/438) help?
+> Does PR [[#438](https://github.com/ikawrakow/ik_llama.cpp/issues/438)](https://github.com/ikawrakow/ik_llama.cpp/pull/438) help?
-I tested #438 (branch ik/desperate_bug_fix_attempt) but unfortunately, it crashed almost straight away:
+I tested [#438](https://github.com/ikawrakow/ik_llama.cpp/issues/438) (branch ik/desperate_bug_fix_attempt) but unfortunately, it crashed almost straight away:
```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
@@ -1633,9 +1358,9 @@ Aborted (core dumped)
---
-👤 **divine-taco** commented the **2025-05-20** at **21:36:55**:
+👤 **divine-taco** commented on **2025-05-20** at **21:36:55**
-~~PR #438 - 82871cc2a3366dfdeff758f04fdfcf5ae5859829 - looks to fix the issue for me. Tried 30 turn completions at long context and saw no issues.~~
+~~PR [#438](https://github.com/ikawrakow/ik_llama.cpp/issues/438) - 82871cc2a3366dfdeff758f04fdfcf5ae5859829 - looks to fix the issue for me. Tried 30 turn completions at long context and saw no issues.~~
Command used:
```
@@ -1655,33 +1380,11 @@ llama-server \
@schynce - Have a link to the Qwen3-235B-A22B quant you used? I can try that as well.
-Update: Failed with illegal memory access again on PR #438 with deepseek 0324 after I ran some automated completions tests. I don't have enough data yet to be confident, but it does seem to fail less frequently. I'll try running `--mla 2` on PR #438 to see if this makes any difference.
-
----
-
-👤 **divine-taco** commented the **2025-05-20** at **21:36:55**:
-
-PR #438 - 82871cc2a3366dfdeff758f04fdfcf5ae5859829 - looks to fix the issue for me. Tried 30 turn completions at long context and saw no issues.
-
-Command used:
-```
-llama-server \
- --parallel 1 \
- -ctk f16 -ctv f16 \
- -ts 17,17,17,17,17,17,17,17,17 \
- --model /home/mx01/DeepSeek-V3-0324-GGUF-Q8_0 --host 0.0.0.0 --port 8080 \
- --ctx-size 44000 \
- -fmoe -rtr -mla 3 -fa \
- -b 2048 -ub 2048 -amb 512 \
- --no-mmap \
- --threads 64 --threads-batch 64 \
- -ngl 99 \
- -ot exps=CPU
-```
+Update: Failed with illegal memory access again on PR [#438](https://github.com/ikawrakow/ik_llama.cpp/issues/438) with deepseek 0324 after I ran some automated completions tests. I don't have enough data yet to be confident, but it does seem to fail less frequently. I'll try running `--mla 2` on PR [#438](https://github.com/ikawrakow/ik_llama.cpp/issues/438) to see if this makes any difference.
---
-👤 **schynce** commented the **2025-05-20** at **21:49:54**:
+👤 **schynce** commented on **2025-05-20** at **21:49:54**
@divine-taco
@@ -1693,15 +1396,15 @@ However, I notice that there have been some updates in the first split file sinc
---
-👤 **ikawrakow** commented the **2025-05-21** at **06:02:41**:
+👤 **ikawrakow** commented on **2025-05-21** at **06:02:41**
-Please use branch in PR #442 and post the CUDA call trace that will be printed when the application crashes.
+Please use branch in PR [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) and post the CUDA call trace that will be printed when the application crashes.
---
-👤 **schynce** commented the **2025-05-21** at **12:11:08**:
+👤 **schynce** commented on **2025-05-21** at **12:11:08**
-> Please use branch in PR [#442](https://github.com/ikawrakow/ik_llama.cpp/pull/442) and post the CUDA call trace that will be printed when the application crashes.
+> Please use branch in PR [[#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442)](https://github.com/ikawrakow/ik_llama.cpp/pull/442) and post the CUDA call trace that will be printed when the application crashes.
```
llm_load_tensors: CUDA_Host buffer size = 52313.37 MiB
@@ -1789,7 +1492,7 @@ The program is not being run.
---
-👤 **ikawrakow** commented the **2025-05-21** at **12:37:17**:
+👤 **ikawrakow** commented on **2025-05-21** at **12:37:17**
Thank you!
@@ -1797,7 +1500,7 @@ So, it crashes in a matrix multiplication. I have pushed another commit on the b
---
-👤 **schynce** commented the **2025-05-21** at **13:29:25**:
+👤 **schynce** commented on **2025-05-21** at **13:29:25**
> Thank you!
>
@@ -1848,13 +1551,13 @@ CUDA error: an illegal memory access was encountered
---
-👤 **ikawrakow** commented the **2025-05-21** at **13:55:41**:
+👤 **ikawrakow** commented on **2025-05-21** at **13:55:41**
I was confused. If there was something wrong with the matrix multiplications, it would have aborted there. The computations succeed, but then something goes wrong in the back-end. I have now added 2 additional asserts in the back-end at the place where the back-trace was when we did the debugging session.
---
-👤 **schynce** commented the **2025-05-21** at **14:10:05**:
+👤 **schynce** commented on **2025-05-21** at **14:10:05**
> I was confused. If there was something wrong with the matrix multiplications, it would have aborted there. The computations succeed, but then something goes wrong in the back-end. I have now added 2 additional asserts in the back-end at the place where the back-trace was when we did the debugging session.
@@ -1903,19 +1606,19 @@ CUDA error: an illegal memory access was encountered
---
-👤 **ikawrakow** commented the **2025-05-21** at **14:27:12**:
+👤 **ikawrakow** commented on **2025-05-21** at **14:27:12**
Thanks! I'll keep digging.
---
-👤 **ikawrakow** commented the **2025-05-21** at **15:26:00**:
+👤 **ikawrakow** commented on **2025-05-21** at **15:26:00**
I have now added a trace to the back-end, so when the crash occurs it will print from where `ggml_backend_cuda_synchronize` was called. Can you try another time? Thanks!
---
-👤 **schynce** commented the **2025-05-21** at **16:31:48**:
+👤 **schynce** commented on **2025-05-21** at **16:31:48**
> I have now added a trace to the back-end, so when the crash occurs it will print from where `ggml_backend_cuda_synchronize` was called. Can you try another time? Thanks!
@@ -1962,13 +1665,13 @@ CUDA error: an illegal memory access was encountered
---
-👤 **ikawrakow** commented the **2025-05-21** at **16:43:24**:
+👤 **ikawrakow** commented on **2025-05-21** at **16:43:24**
@schynce You are running with `--no-kv-offload`, right? Your error is different. What happens if you don't use `--no-kv-offload`?
---
-👤 **schynce** commented the **2025-05-21** at **16:55:42**:
+👤 **schynce** commented on **2025-05-21** at **16:55:42**
> [@schynce](https://github.com/schynce) You are running with `--no-kv-offload`, right? Your error is different. What happens if you don't use `--no-kv-offload`?
@@ -2046,15 +1749,15 @@ CUDA error: an illegal memory access was encountered
---
-👤 **ikawrakow** commented the **2025-05-22** at **06:44:46**:
+👤 **ikawrakow** commented on **2025-05-22** at **06:44:46**
-If you are not tired of testing, there are new changes on #442
+If you are not tired of testing, there are new changes on [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442)
---
-👤 **schynce** commented the **2025-05-22** at **07:43:25**:
+👤 **schynce** commented on **2025-05-22** at **07:43:25**
-> If you are not tired of testing, there are new changes on [#442](https://github.com/ikawrakow/ik_llama.cpp/pull/442)
+> If you are not tired of testing, there are new changes on [[#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442)](https://github.com/ikawrakow/ik_llama.cpp/pull/442)
Not even close to being tired yet, thank you for taking the time to look into this :)
diff --git a/github-data/issues/407 - Feature Request_ Support for function calling in llama-server.md b/github-data/issues/407 - Feature Request Support for function calling in llama-server.md
similarity index 76%
rename from github-data/issues/407 - Feature Request_ Support for function calling in llama-server.md
rename to github-data/issues/407 - Feature Request Support for function calling in llama-server.md
index 218f8c901..b71adfe66 100644
--- a/github-data/issues/407 - Feature Request_ Support for function calling in llama-server.md
+++ b/github-data/issues/407 - Feature Request Support for function calling in llama-server.md
@@ -1,14 +1,15 @@
-### ✨ [#407](https://github.com/ikawrakow/ik_llama.cpp/issues/407) - Feature Request: Support for function calling in llama-server
+## 📌 [Issue #407](https://github.com/ikawrakow/ik_llama.cpp/issues/407) - Feature Request: Support for function calling in llama-server
| **Author** | `vijaysaayi` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-05-11 |
| **Updated** | 2025-06-08 |
+| **Labels** | `enhancement`, `help wanted` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -36,9 +37,9 @@ https://github.com/ggml-org/llama.cpp/pull/9639
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-12** at **05:38:22**:
+👤 **ikawrakow** commented on **2025-05-12** at **05:38:22**
I have never used function calling myself, so I'm not familiar with this feature.
@@ -46,30 +47,30 @@ Help will be appreciated.
---
-👤 **vijaysaayi** commented the **2025-05-16** at **15:47:05**:
+👤 **vijaysaayi** commented on **2025-05-16** at **15:47:05**
Thanks for all the efforts on this. Would it be possible to update to latest llama.cpp (These functionalities are implemented)
---
-👤 **ikawrakow** commented the **2025-05-16** at **15:52:09**:
+👤 **ikawrakow** commented on **2025-05-16** at **15:52:09**
The code here has not been synced with `llama.cpp` since last August, and as a result the two code bases have totally diverged. Almost nothing is just a copy/paste from upstream.
---
-👤 **ubergarm** commented the **2025-05-18** at **15:38:27**:
+👤 **ubergarm** commented on **2025-05-18** at **15:38:27**
@vijaysaayi Check out this wrapper/reverse-proxy which might be able to do what you want: https://github.com/ikawrakow/ik_llama.cpp/discussions/403#discussioncomment-13098276
---
-👤 **vijaysaayi** commented the **2025-05-26** at **07:57:13**:
+👤 **vijaysaayi** commented on **2025-05-26** at **07:57:13**
Thanks for sharing this. I will check this out.
---
-👤 **mtcl** commented the **2025-06-08** at **06:07:47**:
+👤 **mtcl** commented on **2025-06-08** at **06:07:47**
@vijaysaayi let me know if you need any help with the function calling wrapper. Here is the video walkthrough of it. https://www.youtube.com/watch?v=JGo9HfkzAmc
\ No newline at end of file
diff --git a/github-data/issues/412 - Bug_ Static asserts trip during compile..md b/github-data/issues/412 - Bug Static asserts trip during compile.md
similarity index 83%
rename from github-data/issues/412 - Bug_ Static asserts trip during compile..md
rename to github-data/issues/412 - Bug Static asserts trip during compile.md
index b955a1a52..5b0552879 100644
--- a/github-data/issues/412 - Bug_ Static asserts trip during compile..md
+++ b/github-data/issues/412 - Bug Static asserts trip during compile.md
@@ -1,4 +1,4 @@
-### 🐛 [#412](https://github.com/ikawrakow/ik_llama.cpp/issues/412) - Bug: Static asserts trip during compile.
+## 📌 [Issue #412](https://github.com/ikawrakow/ik_llama.cpp/issues/412) - Bug: Static asserts trip during compile.
| **Author** | `Ph0rk0z` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -52,50 +52,38 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-12** at **11:41:48**:
+👤 **ikawrakow** commented on **2025-05-12** at **11:41:48**
What is the architecture?
---
-👤 **Ph0rk0z** commented the **2025-05-12** at **11:51:28**:
+👤 **Ph0rk0z** commented on **2025-05-12** at **11:51:28**
The system? It's a xeon 5120 w/cuda. I tested qwen 235 with the binary that came out and it worked. Haven't tried deepseek yet.
---
-👤 **Ph0rk0z** commented the **2025-05-12** at **11:51:28**:
-
-The system? It's a xeon 5120. I tested qwen 235 with the binary that came out and it worked. Haven't tried deepseek yet.
-
----
-
-👤 **ikawrakow** commented the **2025-05-12** at **11:53:12**:
+👤 **ikawrakow** commented on **2025-05-12** at **11:53:12**
I mean the CUDA architecture (Turing, Ampere, etc.). Or simpler, what is the GPU?
---
-👤 **ikawrakow** commented the **2025-05-12** at **11:53:12**:
-
-I mean the CUDA architecture (Turing, Ampere, etc.)
-
----
-
-👤 **Ph0rk0z** commented the **2025-05-12** at **12:03:40**:
+👤 **Ph0rk0z** commented on **2025-05-12** at **12:03:40**
I have ampere and turning but only inferencing on ampere. I guess turning gets picked up during compile.
---
-👤 **ikawrakow** commented the **2025-05-12** at **12:04:28**:
+👤 **ikawrakow** commented on **2025-05-12** at **12:04:28**
-Does #413 fix it?
+Does [#413](https://github.com/ikawrakow/ik_llama.cpp/issues/413) fix it?
---
-👤 **Ph0rk0z** commented the **2025-05-12** at **12:08:03**:
+👤 **Ph0rk0z** commented on **2025-05-12** at **12:08:03**
Yep, just undid my comments and changed it to CC_TURNING
\ No newline at end of file
diff --git a/github-data/issues/419 - qwen3 metrics in expert parallel_2x P100_.md b/github-data/issues/419 - qwen3 metrics in expert parallel2x P100.md
similarity index 95%
rename from github-data/issues/419 - qwen3 metrics in expert parallel_2x P100_.md
rename to github-data/issues/419 - qwen3 metrics in expert parallel2x P100.md
index ac07530fa..090b57a72 100644
--- a/github-data/issues/419 - qwen3 metrics in expert parallel_2x P100_.md
+++ b/github-data/issues/419 - qwen3 metrics in expert parallel2x P100.md
@@ -1,4 +1,4 @@
-### 📝 [#419](https://github.com/ikawrakow/ik_llama.cpp/issues/419) - qwen3 metrics in expert parallel(2x P100)
+## 📌 [Issue #419](https://github.com/ikawrakow/ik_llama.cpp/issues/419) - qwen3 metrics in expert parallel(2x P100)
| **Author** | `VinnyG9` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
so i set a snoop mode in bios which does some kind of speculative decoding called Home dir w/ OSB+, and it gives a big boost with numa enabled
all tests with HT off
@@ -107,15 +107,15 @@ WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to i
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-15** at **04:26:42**:
+👤 **ikawrakow** commented on **2025-05-15** at **04:26:42**
You regex is incorrect, so everything goes to the GPU. Try `-ot exps=CPU` instead. When that works and you see how much VRAM you have left on each GPU, you can offload some of the experts to the GPU using additional regular expressions for that that precede the `exps=CPU` expression.
---
-👤 **VinnyG9** commented the **2025-05-15** at **14:08:28**:
+👤 **VinnyG9** commented on **2025-05-15** at **14:08:28**
> You regex is incorrect, so everything goes to the GPU. Try `-ot exps=CPU` instead. When that works and you see how much VRAM you have left on each GPU, you can offload some of the experts to the GPU using additional regular expressions for that that precede the `exps=CPU` expression.
@@ -137,7 +137,7 @@ https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html
---
-👤 **ikawrakow** commented the **2025-05-15** at **14:13:55**:
+👤 **ikawrakow** commented on **2025-05-15** at **14:13:55**
The attention tensors are on the GPU, so you don't really want to use `-nkvo` (unless extremely desperate to save more VRAM).
@@ -145,7 +145,7 @@ What is the quantization type you are using? Full log, including command line ar
---
-👤 **VinnyG9** commented the **2025-05-15** at **17:31:23**:
+👤 **VinnyG9** commented on **2025-05-15** at **17:31:23**
when i do "exps\.=CPU" only 6GB total are offloaded to the GPUs is that normal?
in contrast if i offload 95 instead of 94 layers it triggers the 300GB alloc bug again:
@@ -169,13 +169,13 @@ log> https://pastebin.com/1VEd7tuD
---
-👤 **VinnyG9** commented the **2025-05-15** at **18:31:10**:
+👤 **VinnyG9** commented on **2025-05-15** at **18:31:10**
this tensor override thing makes no sense, i'm testing the Q2K quant it's using 40% of vram and if i set only one more tensor-layer the cuda malloc explodes
---
-👤 **Ph0rk0z** commented the **2025-05-15** at **21:23:16**:
+👤 **Ph0rk0z** commented on **2025-05-15** at **21:23:16**
>in contrast if i offload 95 instead of 94 layers it triggers the 300GB alloc bug again:
@@ -185,7 +185,7 @@ I had best luck with numa distribute. Maybe you should do a benchmark of your ra
---
-👤 **ubergarm** commented the **2025-05-16** at **21:30:59**:
+👤 **ubergarm** commented on **2025-05-16** at **21:30:59**
@Fuckingnameless
@@ -203,7 +203,7 @@ have fun!
---
-👤 **VinnyG9** commented the **2025-05-17** at **01:18:44**:
+👤 **VinnyG9** commented on **2025-05-17** at **01:18:44**
> > in contrast if i offload 95 instead of 94 layers it triggers the 300GB alloc bug again:
>
@@ -223,7 +223,7 @@ numa is not working right for me i need to fiddle with snoop modes is my guess
---
-👤 **VinnyG9** commented the **2025-05-17** at **01:25:58**:
+👤 **VinnyG9** commented on **2025-05-17** at **01:25:58**
> [@Fuckingnameless](https://github.com/Fuckingnameless)
>
@@ -250,7 +250,7 @@ i thought that was default, also read somewhere that doing 2 copies aka data par
---
-👤 **ubergarm** commented the **2025-05-17** at **14:41:33**:
+👤 **ubergarm** commented on **2025-05-17** at **14:41:33**
@Fuckingnameless
@@ -270,7 +270,7 @@ I have a [whole discussion on the NUMA stuff here](https://github.com/ggml-org/l
---
-👤 **Ph0rk0z** commented the **2025-05-17** at **15:03:48**:
+👤 **Ph0rk0z** commented on **2025-05-17** at **15:03:48**
>Also as @Ph0rk0z you might want to try compiling with -DGGML_SCHED_MAX_COPIES=1
@@ -288,7 +288,7 @@ If you do it sequentially and just fill as many layers before OOM, you'll have a
---
-👤 **VinnyG9** commented the **2025-05-18** at **02:01:19**:
+👤 **VinnyG9** commented on **2025-05-18** at **02:01:19**
> > Also as [@Ph0rk0z](https://github.com/Ph0rk0z) you might want to try compiling with -DGGML_SCHED_MAX_COPIES=1
>
@@ -310,13 +310,13 @@ I updated the OP with benchmarks
---
-👤 **Ph0rk0z** commented the **2025-05-18** at **11:33:22**:
+👤 **Ph0rk0z** commented on **2025-05-18** at **11:33:22**
Try some different regex for CPU. In the benchmark command line above its missing the wildcard.
---
-👤 **VinnyG9** commented the **2025-05-20** at **14:49:53**:
+👤 **VinnyG9** commented on **2025-05-20** at **14:49:53**
$ CUDA_VISIBLE_DEVICES=0,1 bin/llama-bench -t 31 -p 64,128,256 -n 32,64,128 -m moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ngl 94 -ot "blk.([0-9]|[1][0-3]).ffn_.*=CUDA1","output.=CUDA1","blk.([0-3][0-9]|4[0-6]).ffn_norm.=CUDA1" -ot "blk.(4[7-9]|[5-9][0-9]).ffn_norm.=CUDA0" -ot "blk.([3][1-9]|[4-9][0-9]).ffn_.*=CPU" -fa 1 -fmoe 1 -rtr 1 --numa distribute
@@ -362,7 +362,7 @@ norm layers split 1/1, output layers on last gpu
---
-👤 **saood06** commented the **2025-05-25** at **05:08:13**:
+👤 **saood06** commented on **2025-05-25** at **05:08:13**
> ̶E̶d̶i̶t̶;̶ ̶f̶i̶x̶e̶d̶ ̶b̶y̶ ̶d̶i̶s̶a̶b̶l̶i̶n̶g̶ ̶c̶u̶b̶l̶a̶s̶
diff --git a/github-data/issues/420 - Bug_ standard attention is broken.md b/github-data/issues/420 - Bug standard attention is broken.md
similarity index 74%
rename from github-data/issues/420 - Bug_ standard attention is broken.md
rename to github-data/issues/420 - Bug standard attention is broken.md
index f79a6042b..8a309f8df 100644
--- a/github-data/issues/420 - Bug_ standard attention is broken.md
+++ b/github-data/issues/420 - Bug standard attention is broken.md
@@ -1,4 +1,4 @@
-### 🐛 [#420](https://github.com/ikawrakow/ik_llama.cpp/issues/420) - Bug: standard attention is broken
+## 📌 [Issue #420](https://github.com/ikawrakow/ik_llama.cpp/issues/420) - Bug: standard attention is broken
| **Author** | `ikawrakow` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
diff --git a/github-data/issues/423 - Bug_ Compile failure undefined reference to _void mul_mat_q_case.md b/github-data/issues/423 - Bug Compile failure undefined reference to void mul_mat_q_case.md
similarity index 81%
rename from github-data/issues/423 - Bug_ Compile failure undefined reference to _void mul_mat_q_case.md
rename to github-data/issues/423 - Bug Compile failure undefined reference to void mul_mat_q_case.md
index 60b49406c..7fa5e32f6 100644
--- a/github-data/issues/423 - Bug_ Compile failure undefined reference to _void mul_mat_q_case.md
+++ b/github-data/issues/423 - Bug Compile failure undefined reference to void mul_mat_q_case.md
@@ -1,4 +1,4 @@
-### 🐛 [#423](https://github.com/ikawrakow/ik_llama.cpp/issues/423) - Bug: Compile failure undefined reference to `void mul_mat_q_case
+## 📌 [Issue #423](https://github.com/ikawrakow/ik_llama.cpp/issues/423) - Bug: Compile failure undefined reference to `void mul_mat_q_case
| **Author** | `nux` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -44,7 +44,7 @@ cmake --build build --config Release -j --clean-first
3d92d7f8
-Debian latest: Linux red 6.1.0-34-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.135-1 (2025-04-25) x86_64 GNU/Linux
+Debian latest: Linux red 6.1.0-34-amd64 [#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) SMP PREEMPT_DYNAMIC Debian 6.1.135-1 (2025-04-25) x86_64 GNU/Linux
### What operating system are you seeing the problem on?
@@ -59,14 +59,14 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-15** at **13:50:36**:
+👤 **ikawrakow** commented on **2025-05-15** at **13:50:36**
Sorry, forgot to add a file. It should work now.
---
-👤 **nux** commented the **2025-05-15** at **13:50:54**:
+👤 **nux** commented on **2025-05-15** at **13:50:54**
Thanks! Committed fix before my attempt to build just llama-server completed!
\ No newline at end of file
diff --git a/github-data/issues/425 - Bug_ CUDA error_ an illegal memory access was encountered.md b/github-data/issues/425 - Bug CUDA error an illegal memory access was encountered.md
similarity index 89%
rename from github-data/issues/425 - Bug_ CUDA error_ an illegal memory access was encountered.md
rename to github-data/issues/425 - Bug CUDA error an illegal memory access was encountered.md
index 6532adf74..9b7be3226 100644
--- a/github-data/issues/425 - Bug_ CUDA error_ an illegal memory access was encountered.md
+++ b/github-data/issues/425 - Bug CUDA error an illegal memory access was encountered.md
@@ -1,4 +1,4 @@
-### 🐛 [#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425) - Bug: CUDA error: an illegal memory access was encountered
+## 📌 [Issue #425](https://github.com/ikawrakow/ik_llama.cpp/issues/425) - Bug: CUDA error: an illegal memory access was encountered
| **Author** | `nux` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -44,16 +44,16 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-15** at **14:03:32**:
+👤 **ikawrakow** commented on **2025-05-15** at **14:03:32**
What was the command line?
Are you running this model for the first time? If not, did you experience this error on an earlier `ik_llama.cpp` version?
---
-👤 **nux** commented the **2025-05-15** at **14:15:21**:
+👤 **nux** commented on **2025-05-15** at **14:15:21**
Here is the command I am running:
/home/nux/dev/ik_llama.cpp/build/bin/llama-server --model /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf --ctx-size 32768 -mla 2 -fa -amb 512 -fmoe --temp 0.3 --min-p 0.05 --n-gpu-layers 63 --override-tensor "exps=CPU" --parallel 1 --threads 32 --host 0.0.0.0 --port 8081
@@ -76,13 +76,13 @@ Do you want me to try and get the prompt posted for you? Would try and remove pa
---
-👤 **Panchovix** commented the **2025-05-15** at **14:18:06**:
+👤 **Panchovix** commented on **2025-05-15** at **14:18:06**
If you try without -fmoe, does it works?
---
-👤 **nux** commented the **2025-05-15** at **14:19:31**:
+👤 **nux** commented on **2025-05-15** at **14:19:31**
Nope:
@@ -99,19 +99,19 @@ nux@red ~/dev/ik_llama.cpp $ /home/nux/dev/ik_llama.cpp/build/bin/llama-server -
---
-👤 **nux** commented the **2025-05-15** at **14:19:54**:
+👤 **nux** commented on **2025-05-15** at **14:19:54**
Would you like me to try with llama.cpp vanilla? Err...I'm not sure that model loads there. Perhaps I could try other models if you think it would be useful
---
-👤 **Panchovix** commented the **2025-05-15** at **14:21:22**:
+👤 **Panchovix** commented on **2025-05-15** at **14:21:22**
I think R4 doesn't work on llamacpp, yeah. You can try with unsloth quants there https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD.
---
-👤 **ikawrakow** commented the **2025-05-15** at **14:24:51**:
+👤 **ikawrakow** commented on **2025-05-15** at **14:24:51**
There is a place in the log that looks like this:
```
@@ -123,7 +123,7 @@ Seeing this will be helpful.
---
-👤 **nux** commented the **2025-05-15** at **14:29:04**:
+👤 **nux** commented on **2025-05-15** at **14:29:04**
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 612 tensors
@@ -139,13 +139,13 @@ Edit: I will throw another prompt at the model I had a problem with for some oth
---
-👤 **nux** commented the **2025-05-15** at **14:35:38**:
+👤 **nux** commented on **2025-05-15** at **14:35:38**
Worked with another php related prompt (first one had a ~80 line function pasted in, this one was only 5 lines). Odd...
---
-👤 **ikawrakow** commented the **2025-05-15** at **14:36:20**:
+👤 **ikawrakow** commented on **2025-05-15** at **14:36:20**
> It looks like I do have unsloth/DeepSeek-V3-0324-GGUF/UD-Q4_K_XL on a network storage. If you want me to test that I can.
@@ -159,7 +159,7 @@ I have no hypothesis what changed. You can try using `-mla 3` instead of `-mla 2
---
-👤 **nux** commented the **2025-05-15** at **18:04:11**:
+👤 **nux** commented on **2025-05-15** at **18:04:11**
Interesting...I've been trying various combinations of models/parameters, and so far here's what I have:
@@ -192,7 +192,7 @@ total time = 120104.78 ms / 1141 tokens
---
-👤 **ciprianveg** commented the **2025-05-17** at **07:10:26**:
+👤 **ciprianveg** commented on **2025-05-17** at **07:10:26**
> Here is the command I am running: /home/nux/dev/ik_llama.cpp/build/bin/llama-server --model /mnt/nvme/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf --ctx-size 32768 -mla 2 -fa -amb 512 -fmoe --temp 0.3 --min-p 0.05 --n-gpu-layers 63 --override-tensor "exps=CPU" --parallel 1 --threads 32 --host 0.0.0.0 --port 8081
>
@@ -210,13 +210,13 @@ Similar thing happens to me, it worked 2 days ago, i rebuilt it with latest sour
---
-👤 **ikawrakow** commented the **2025-05-17** at **07:36:58**:
+👤 **ikawrakow** commented on **2025-05-17** at **07:36:58**
@ciprianveg Can you also give the build for the last version that worked, tell us if the crash happens during PP or during TG, and post the line from the log where it says where the illegal memory access was encountered? Thanks. Also, is it a single GPU or a multi-GPU setup?
---
-👤 **ciprianveg** commented the **2025-05-17** at **08:23:13**:
+👤 **ciprianveg** commented on **2025-05-17** at **08:23:13**
Hello, it was built from main 20 h ago, now i rebuilt from main 30m ago with latest changes (from 2h ago) and same error:
INFO [ update_slots] kv cache rm [p0, end) | tid="136731577430016" timestamp=1747469764 id_slot=0 id_task=0 p0=0
@@ -250,43 +250,14 @@ last main pull done, that worked was 3 days ago..
---
-👤 **ciprianveg** commented the **2025-05-17** at **08:23:13**:
-
-Hello, it was built from main 20 h ago, now i rebuilt from main 30m ago with latest changes (from 2h ago) and same error:
-INFO [ update_slots] kv cache rm [p0, end) | tid="136731577430016" timestamp=1747469764 id_slot=0 id_task=0 p0=0
-VERB [ update_slots] prompt processing progress | tid="136731577430016" timestamp=1747469764 id_slot=0 n_past=33 n_ctx=20480 n_tokens=33 progress=1.0
-VERB [ update_slots] prompt done | tid="136731577430016" timestamp=1747469764 id_slot=0 n_past=33 n_ctx=20480 n_tokens=33
-VERB [ update_slots] decoding batch | tid="136731577430016" timestamp=1747469764 n_tokens=33
-CUDA error: an illegal memory access was encountered
- current device: 2, in function ggml_backend_cuda_synchronize at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:3067
- cudaStreamSynchronize(cuda_ctx->stream())
-/home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
-Could not attach to process. If your uid matches the uid of the target
-process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
-again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
-ptrace: Operation not permitted.
-No stack.
-The program is not being run.
-Aborted (core dumped)
-
-This was the test query: Tell me a random fun fact about the Roman Empire
-
----
-
-👤 **ciprianveg** commented the **2025-05-17** at **08:31:25**:
+👤 **ciprianveg** commented on **2025-05-17** at **08:31:25**
it happens with both Qwen3-235B-A22B-UD-Q3_K_XL and Qwen3-235B-A22B-UD-Q4_K_XL. I am using 2 3090 gpus and 2 A4000, built with 1 copy of cache parameter. I think the multiple gpus can be the issue.. but iti is very strange that llama sweep bench works
..
---
-👤 **ciprianveg** commented the **2025-05-17** at **08:31:25**:
-
-it happens with both Qwen3-235B-A22B-UD-Q3_K_XL and Qwen3-235B-A22B-UD-Q4_K_XL
-
----
-
-👤 **ikawrakow** commented the **2025-05-17** at **08:37:44**:
+👤 **ikawrakow** commented on **2025-05-17** at **08:37:44**
Strange. Nothing really changed since 3 days ago that could affect your use case.
The illegal memory access is triggered in the back-end, so most likely when data is being copied from the CPU to the GPU.
@@ -299,14 +270,14 @@ to checkout the last version from 4 days ago, and then build & run as usual?
---
-👤 **ciprianveg** commented the **2025-05-17** at **08:40:15**:
+👤 **ciprianveg** commented on **2025-05-17** at **08:40:15**
I will try and let you know. I added 2 more gpus to my first 2... maybe it
also matters
On Sat, 17 May 2025, 11:38 Kawrakow, ***@***.***> wrote:
-> *ikawrakow* left a comment (ikawrakow/ik_llama.cpp#425)
+> *ikawrakow* left a comment (ikawrakow/ik_llama.cpp[#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425))
>
>
> Strange. Nothing really changed since 3 days ago that could affect your
@@ -333,49 +304,43 @@ On Sat, 17 May 2025, 11:38 Kawrakow, ***@***.***> wrote:
---
-👤 **ciprianveg** commented the **2025-05-17** at **09:15:05**:
+👤 **ciprianveg** commented on **2025-05-17** at **09:15:05**
i checked out and built the above version from 4 days ago and the same error, so it looks like it has to do with multiple gpus..
---
-👤 **ikawrakow** commented the **2025-05-17** at **09:19:01**:
+👤 **ikawrakow** commented on **2025-05-17** at **09:19:01**
OK, it is the bug that happens with multiple GPUs and partial offload (multi-GPU with full offload is known to work) that has been reported by several users. It is a bug that I currently cannot solve because I don't have access to a multi-GPU system.
---
-👤 **ciprianveg** commented the **2025-05-17** at **09:22:18**:
+👤 **ciprianveg** commented on **2025-05-17** at **09:22:18**
i tried same command, on llama.cpp, without -fmoe (obvious) and it works, much slower pp process peed but it works. On ik_llama same error happens with or without -fmoe param.
---
-👤 **ciprianveg** commented the **2025-05-17** at **09:22:18**:
-
-i treied same command, on llama.cpp, without -fmoe (obvious) and it works, much slower pp process peed but it works
-
----
-
-👤 **ciprianveg** commented the **2025-05-17** at **09:25:06**:
+👤 **ciprianveg** commented on **2025-05-17** at **09:25:06**
what is very strange is that the sweep-bench works, till the max cache length set, so what can be different?
---
-👤 **ikawrakow** commented the **2025-05-17** at **09:33:11**:
+👤 **ikawrakow** commented on **2025-05-17** at **09:33:11**
Are you exceeding the max cache size and it crashes then? Or does it crash before?
---
-👤 **ciprianveg** commented the **2025-05-17** at **09:34:12**:
+👤 **ciprianveg** commented on **2025-05-17** at **09:34:12**
llama-sweep-bench works till it exceeds the max cache size
---
-👤 **ikawrakow** commented the **2025-05-17** at **09:37:03**:
+👤 **ikawrakow** commented on **2025-05-17** at **09:37:03**
> llama-sweep-bench works till it exceeds the max cache size
@@ -383,7 +348,7 @@ Yes, I got that part. So, I'm wondering if `llama-server` crashes after the max.
---
-👤 **ikawrakow** commented the **2025-05-17** at **09:56:36**:
+👤 **ikawrakow** commented on **2025-05-17** at **09:56:36**
> llama-sweep-bench works till it exceeds the max cache size
@@ -391,7 +356,7 @@ OK, this gives me another idea. Can you try running `sweep-bench` with some unus
---
-👤 **ciprianveg** commented the **2025-05-17** at **10:11:51**:
+👤 **ciprianveg** commented on **2025-05-17** at **10:11:51**
I tried with unusual ub it still works, also with unusual nbatch and it works..
main: n_kv_max = 20480, n_batch = 1234, n_ubatch = 873, flash_attn = 1, n_gpu_layers = 99, n_threads = 16, n_threads_batch = 16
@@ -402,31 +367,31 @@ main: n_kv_max = 20480, n_batch = 1234, n_ubatch = 873, flash_attn = 1, n_gpu_la
---
-👤 **ikawrakow** commented the **2025-05-17** at **10:17:34**:
+👤 **ikawrakow** commented on **2025-05-17** at **10:17:34**
OK, this is becoming a real puzzle. Have you tried `llama-cli` ?
---
-👤 **ciprianveg** commented the **2025-05-17** at **14:44:57**:
+👤 **ciprianveg** commented on **2025-05-17** at **14:44:57**
llama-cli seems to work, but is not webui issue as it appeared also from other client
---
-👤 **nux** commented the **2025-05-17** at **15:00:59**:
+👤 **nux** commented on **2025-05-17** at **15:00:59**
Was reading latest comments on this and wanted to point out I have a single GPU. If you want me to test any more stuff let me know
---
-👤 **ciprianveg** commented the **2025-05-17** at **15:02:55**:
+👤 **ciprianveg** commented on **2025-05-17** at **15:02:55**
On one gpu the issue doesn't happen
---
-👤 **ikawrakow** commented the **2025-05-17** at **15:19:32**:
+👤 **ikawrakow** commented on **2025-05-17** at **15:19:32**
It seems the issue only occurs when using `llama-server`.
@@ -438,99 +403,55 @@ and would send the backtrace when it crashes, that would be very useful.
---
-👤 **nux** commented the **2025-05-17** at **15:43:50**:
+👤 **nux** commented on **2025-05-17** at **15:43:50**
-#0 __pthread_kill_implementation (threadid=, signo=signo@entry=6,
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) __pthread_kill_implementation (threadid=, signo=signo@entry=6,
no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
-#1 0x00007fffeb8a9f4f in __pthread_kill_internal (signo=6, threadid=)
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) 0x00007fffeb8a9f4f in __pthread_kill_internal (signo=6, threadid=)
at ./nptl/pthread_kill.c:78
-#2 0x00007fffeb85afb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
-#3 0x00007fffeb845472 in __GI_abort () at ./stdlib/abort.c:79
-#4 0x000055555558ff52 in ggml_abort (
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) 0x00007fffeb85afb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007fffeb845472 in __GI_abort () at ./stdlib/abort.c:79
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) 0x000055555558ff52 in ggml_abort (
file=0x55555634ba10 "/home/nux/dev/ik_llama.cpp/ggml/src/ggml-cuda.cu", line=110,
fmt=) at /home/nux/dev/ik_llama.cpp/ggml/src/ggml.c:270
-#5 0x0000555555810534 in ggml_cuda_error (
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x0000555555810534 in ggml_cuda_error (
stmt=stmt@entry=0x55555634c128 "cudaStreamSynchronize(cuda_ctx->stream())",
func=func@entry=0x55555634b5bc "ggml_backend_cuda_synchronize",
file=file@entry=0x55555634ba10 "/home/nux/dev/ik_llama.cpp/ggml/src/ggml-cuda.cu",
line=line@entry=3067, msg=0x7ffff7c95d68 "an illegal memory access was encountered")
at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-cuda.cu:110
-#6 0x0000555555810f0a in ggml_backend_cuda_synchronize (backend=)
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x0000555555810f0a in ggml_backend_cuda_synchronize (backend=)
at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-cuda.cu:3067
-#7 0x00005555557f627b in ggml_backend_synchronize (backend=0x555566e6d9b0)
+[#7](https://github.com/ikawrakow/ik_llama.cpp/issues/7) 0x00005555557f627b in ggml_backend_synchronize (backend=0x555566e6d9b0)
at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-backend.c:273
-#8 ggml_backend_sched_compute_splits (sched=0x5555647fdcb0)
+[#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) ggml_backend_sched_compute_splits (sched=0x5555647fdcb0)
at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-backend.c:1833
-#9 ggml_backend_sched_graph_compute_async (sched=0x5555647fdcb0, graph=)
+[#9](https://github.com/ikawrakow/ik_llama.cpp/issues/9) ggml_backend_sched_graph_compute_async (sched=0x5555647fdcb0, graph=)
at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-backend.c:2043
-#10 0x00005555556fef93 in llama_graph_compute (n_threads=32, gf=0x7f9f020fc030, lctx=...)
+[#10](https://github.com/ikawrakow/ik_llama.cpp/issues/10) 0x00005555556fef93 in llama_graph_compute (n_threads=32, gf=0x7f9f020fc030, lctx=...)
at /home/nux/dev/ik_llama.cpp/src/llama.cpp:17694
-#11 llama_decode_internal (batch_all=..., lctx=...)
+[#11](https://github.com/ikawrakow/ik_llama.cpp/issues/11) llama_decode_internal (batch_all=..., lctx=...)
at /home/nux/dev/ik_llama.cpp/src/llama.cpp:17910
-#12 llama_decode (ctx=0x555563ffcf60, batch=...) at /home/nux/dev/ik_llama.cpp/src/llama.cpp:22305
-#13 0x000055555567ad49 in server_context::update_slots (this=0x7fffffffda30)
+[#12](https://github.com/ikawrakow/ik_llama.cpp/issues/12) llama_decode (ctx=0x555563ffcf60, batch=...) at /home/nux/dev/ik_llama.cpp/src/llama.cpp:22305
+[#13](https://github.com/ikawrakow/ik_llama.cpp/issues/13) 0x000055555567ad49 in server_context::update_slots (this=0x7fffffffda30)
--Type for more, q to quit, c to continue without paging--
at /home/nux/dev/ik_llama.cpp/examples/server/server.cpp:2355
-#14 0x0000555555655b4a in std::function::operator()() const (this=0x7fffffffe650)
+[#14](https://github.com/ikawrakow/ik_llama.cpp/issues/14) 0x0000555555655b4a in std::function::operator()() const (this=0x7fffffffe650)
at /usr/include/c++/12/bits/std_function.h:591
-#15 server_queue::start_loop (this=this@entry=0x7fffffffe568)
+[#15](https://github.com/ikawrakow/ik_llama.cpp/issues/15) server_queue::start_loop (this=this@entry=0x7fffffffe568)
at /home/nux/dev/ik_llama.cpp/examples/server/server.cpp:501
-#16 0x00005555555936d0 in main (argc=, argv=)
+[#16](https://github.com/ikawrakow/ik_llama.cpp/issues/16) 0x00005555555936d0 in main (argc=, argv=)
at /home/nux/dev/ik_llama.cpp/examples/server/server.cpp:3509
---
-👤 **nux** commented the **2025-05-17** at **15:43:50**:
-
-`
-#0 __pthread_kill_implementation (threadid=, signo=signo@entry=6,
- no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
-#1 0x00007fffeb8a9f4f in __pthread_kill_internal (signo=6, threadid=)
- at ./nptl/pthread_kill.c:78
-#2 0x00007fffeb85afb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
-#3 0x00007fffeb845472 in __GI_abort () at ./stdlib/abort.c:79
-#4 0x000055555558ff52 in ggml_abort (
- file=0x55555634ba10 "/home/nux/dev/ik_llama.cpp/ggml/src/ggml-cuda.cu", line=110,
- fmt=) at /home/nux/dev/ik_llama.cpp/ggml/src/ggml.c:270
-#5 0x0000555555810534 in ggml_cuda_error (
- stmt=stmt@entry=0x55555634c128 "cudaStreamSynchronize(cuda_ctx->stream())",
- func=func@entry=0x55555634b5bc "ggml_backend_cuda_synchronize",
- file=file@entry=0x55555634ba10 "/home/nux/dev/ik_llama.cpp/ggml/src/ggml-cuda.cu",
- line=line@entry=3067, msg=0x7ffff7c95d68 "an illegal memory access was encountered")
- at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-cuda.cu:110
-#6 0x0000555555810f0a in ggml_backend_cuda_synchronize (backend=)
- at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-cuda.cu:3067
-#7 0x00005555557f627b in ggml_backend_synchronize (backend=0x555566e6d9b0)
- at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-backend.c:273
-#8 ggml_backend_sched_compute_splits (sched=0x5555647fdcb0)
- at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-backend.c:1833
-#9 ggml_backend_sched_graph_compute_async (sched=0x5555647fdcb0, graph=)
- at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-backend.c:2043
-#10 0x00005555556fef93 in llama_graph_compute (n_threads=32, gf=0x7f9f020fc030, lctx=...)
- at /home/nux/dev/ik_llama.cpp/src/llama.cpp:17694
-#11 llama_decode_internal (batch_all=..., lctx=...)
- at /home/nux/dev/ik_llama.cpp/src/llama.cpp:17910
-#12 llama_decode (ctx=0x555563ffcf60, batch=...) at /home/nux/dev/ik_llama.cpp/src/llama.cpp:22305
-#13 0x000055555567ad49 in server_context::update_slots (this=0x7fffffffda30)
---Type for more, q to quit, c to continue without paging--
- at /home/nux/dev/ik_llama.cpp/examples/server/server.cpp:2355
-#14 0x0000555555655b4a in std::function::operator()() const (this=0x7fffffffe650)
- at /usr/include/c++/12/bits/std_function.h:591
-#15 server_queue::start_loop (this=this@entry=0x7fffffffe568)
- at /home/nux/dev/ik_llama.cpp/examples/server/server.cpp:501
-#16 0x00005555555936d0 in main (argc=, argv=)
- at /home/nux/dev/ik_llama.cpp/examples/server/server.cpp:3509
-`
-
----
-
-👤 **nux** commented the **2025-05-17** at **15:46:08**:
+👤 **nux** commented on **2025-05-17** at **15:46:08**
[llama-server-bt-full.txt](https://github.com/user-attachments/files/20265607/llama-server-bt-full.txt) Or is this better?
---
-👤 **ikawrakow** commented the **2025-05-17** at **16:21:42**:
+👤 **ikawrakow** commented on **2025-05-17** at **16:21:42**
@nux Thank you for the backtrace. I cannot diagnose what has happened from it alone. I could now start asking you to give me the values of some variables, but this is really too tedious. But perhaps just one thing:
```
@@ -540,13 +461,13 @@ p *input
---
-👤 **nux** commented the **2025-05-17** at **16:34:05**:
+👤 **nux** commented on **2025-05-17** at **16:34:05**
Yes I can do that - how exactly do I get that for you? I had to look up that I have to type run into gdb the first time. Never used gdb before.
---
-👤 **ikawrakow** commented the **2025-05-17** at **16:41:44**:
+👤 **ikawrakow** commented on **2025-05-17** at **16:41:44**
When it crashes, and the backtrace is the same as before, you can select the frame where it is in the ` ggml_backend_sched_compute_splits` function. You do this by typing `frame 8` (8 was the frame index in the backtrace you sent). And then you type `p *input`. This will output the content of the `input` tensor. The code is basically iterating over the inputs of the next operation in the graph, and copying data to the appropriate back-end if needed, and I want to see what is the tensor being processed when the crash happens.
@@ -554,16 +475,10 @@ But I have to go now, I'll look at the outcome tomorrow.
---
-👤 **ikawrakow** commented the **2025-05-17** at **16:41:44**:
-
-When it crashes, and the backtrace is the same as before, you can select the frame where it is in the ` ggml_backend_sched_compute_splits` function. You do this by typing `frame 8` (8 was the frame index in the backtrace you sent). And then you type `p *input`. This will output the content of the `input` tensor. The code is basically iterating over the inputs of the next operation in the graph, and copying data to the appropriate back-end if needed, and I want to see what is the tensor being processed when the crash happens.
-
----
-
-👤 **nux** commented the **2025-05-17** at **17:19:05**:
+👤 **nux** commented on **2025-05-17** at **17:19:05**
(gdb) frame 8
-#8 ggml_backend_sched_compute_splits (sched=0x5555647fdcb0)
+[#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) ggml_backend_sched_compute_splits (sched=0x5555647fdcb0)
at /home/nux/dev/ik_llama.cpp/ggml/src/ggml-backend.c:1833
1833 ggml_backend_synchronize(input_backend);
(gdb) p *input
@@ -575,13 +490,13 @@ $1 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x5555641b
---
-👤 **ikawrakow** commented the **2025-05-18** at **06:19:46**:
+👤 **ikawrakow** commented on **2025-05-18** at **06:19:46**
-@nux Thank you! Based on the above, I have added PR #430. Hopefully this fixes it.
+@nux Thank you! Based on the above, I have added PR [#430](https://github.com/ikawrakow/ik_llama.cpp/issues/430). Hopefully this fixes it.
---
-👤 **ciprianveg** commented the **2025-05-18** at **07:27:52**:
+👤 **ciprianveg** commented on **2025-05-18** at **07:27:52**
cd ik_llama.cpp/
git checkout disable_multi_add
@@ -610,35 +525,7 @@ Same command works on llama.cpp
---
-👤 **ciprianveg** commented the **2025-05-18** at **07:27:52**:
-
-1990 cd ik_llama.cpp/
- 1991 git checkout disable_multi_add
- 1992 git fetch origin
- 1993 git checkout ik/disable_multi_add
- 1994 git pull origin ik/disable_multi_add
- 1996 history | grep cmake
- 1997 cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
- 1998 cmake --build ./build --config Release -j $(nproc)
- 1999 ./build/bin/llama-server --model /home/ciprian/ai/models/Qwen3-235B-UD_Q4_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf --alias Qwen3-235B-A22B-UD-Q4_K_XL -fa -fmoe -ctk q8_0 -ctv q8_0 -c 20480 -ot "blk.(?:[x]|[5-9][0-9]).ffn.*=CPU" -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 --no-mmap --ubatch-size 3072 --batch-size 3072 -ts 68,70,60,240 -v
-same issue: (maybe it has something todo with the chat template considering the sweep-bench and cli are working fine?)
-
-INFO [ update_slots] kv cache rm [p0, end) | tid="124177210875904" timestamp=1747553203 id_slot=0 id_task=0 p0=0
-VERB [ update_slots] prompt processing progress | tid="124177210875904" timestamp=1747553203 id_slot=0 n_past=18 n_ctx=20480 n_tokens=18 progress=1.0
-VERB [ update_slots] prompt done | tid="124177210875904" timestamp=1747553203 id_slot=0 n_past=18 n_ctx=20480 n_tokens=18
-VERB [ update_slots] decoding batch | tid="124177210875904" timestamp=1747553203 n_tokens=18
-CUDA error: an illegal memory access was encountered
- current device: 2, in function ggml_backend_cuda_synchronize at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:3067
- cudaStreamSynchronize(cuda_ctx->stream())
-/home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
-Could not attach to process. If your uid matches the uid of the target
-process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
-again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
-ptrace: Operation not permitted.
-
----
-
-👤 **ikawrakow** commented the **2025-05-18** at **07:42:09**:
+👤 **ikawrakow** commented on **2025-05-18** at **07:42:09**
@ciprianveg Thanks for testing. Are you willing to do a similar debugging session?
```
@@ -649,7 +536,7 @@ When it crashes, `type backtrace` and post the output.
---
-👤 **ciprianveg** commented the **2025-05-18** at **08:00:14**:
+👤 **ciprianveg** commented on **2025-05-18** at **08:00:14**
sure:
VERB [ update_slots] decoding batch | tid="140737203113984" timestamp=1747555159 n_tokens=18
@@ -671,37 +558,37 @@ Download failed: Invalid argument. Continuing without source file ./nptl/./nptl
__pthread_kill_implementation (no_tid=0, signo=6, threadid=) at ./nptl/pthread_kill.c:44
warning: 44 ./nptl/pthread_kill.c: No such file or directory
(gdb) backtrace
-#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=) at ./nptl/pthread_kill.c:44
-#1 __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78
-#2 __GI___pthread_kill (threadid=, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
-#3 0x00007fffee84527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
-#4 0x00007fffee8288ff in __GI_abort () at ./stdlib/abort.c:79
-#5 0x00007fffef0333a5 in ggml_abort (file=0x7fffefa4cfc0 "/home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu", line=110, fmt=0x7fffefa35a7c "CUDA error")
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) __pthread_kill_implementation (no_tid=0, signo=6, threadid=) at ./nptl/pthread_kill.c:44
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) __GI___pthread_kill (threadid=, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007fffee84527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) 0x00007fffee8288ff in __GI_abort () at ./stdlib/abort.c:79
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffef0333a5 in ggml_abort (file=0x7fffefa4cfc0 "/home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu", line=110, fmt=0x7fffefa35a7c "CUDA error")
at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml.c:270
-#6 0x00007fffef18ed67 in ggml_cuda_error (stmt=stmt@entry=0x7fffefa4d698 "cudaStreamSynchronize(cuda_ctx->stream())", func=func@entry=0x7fffefa35b77 "ggml_backend_cuda_synchronize",
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffef18ed67 in ggml_cuda_error (stmt=stmt@entry=0x7fffefa4d698 "cudaStreamSynchronize(cuda_ctx->stream())", func=func@entry=0x7fffefa35b77 "ggml_backend_cuda_synchronize",
file=file@entry=0x7fffefa4cfc0 "/home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu", line=line@entry=3067, msg=0x7fffee48ece8 "an illegal memory access was encountered")
at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:110
-#7 0x00007fffef18f8aa in ggml_backend_cuda_synchronize (backend=) at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:3067
-#8 0x00007fffef0aeed8 in ggml_backend_sched_compute_splits (sched=0x55555655d7c0) at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-backend.c:1837
-#9 ggml_backend_sched_graph_compute_async (sched=0x55555655d7c0, graph=) at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-backend.c:2043
-#10 0x00007ffff7ea3803 in llama_graph_compute (n_threads=16, gf=0x7fdfa06fb030, lctx=...) at /home/ciprian/ai/ik_llama.cpp/src/llama.cpp:17688
-#11 llama_decode_internal (batch_all=..., lctx=...) at /home/ciprian/ai/ik_llama.cpp/src/llama.cpp:17904
-#12 llama_decode (ctx=0x55555b677230, batch=...) at /home/ciprian/ai/ik_llama.cpp/src/llama.cpp:22299
-#13 0x0000555555608122 in server_context::update_slots (this=0x7fffffffccc0) at /home/ciprian/ai/ik_llama.cpp/examples/server/server.cpp:2355
-#14 0x00005555555e235b in std::function::operator()() const (this=0x7fffffffd8e0) at /usr/include/c++/13/bits/std_function.h:591
-#15 server_queue::start_loop (this=this@entry=0x7fffffffd7f8) at /home/ciprian/ai/ik_llama.cpp/examples/server/server.cpp:501
-#16 0x000055555557e3dc in main (argc=, argv=) at /home/ciprian/ai/ik_llama.cpp/examples/server/server.cpp:3509
+[#7](https://github.com/ikawrakow/ik_llama.cpp/issues/7) 0x00007fffef18f8aa in ggml_backend_cuda_synchronize (backend=) at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:3067
+[#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) 0x00007fffef0aeed8 in ggml_backend_sched_compute_splits (sched=0x55555655d7c0) at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-backend.c:1837
+[#9](https://github.com/ikawrakow/ik_llama.cpp/issues/9) ggml_backend_sched_graph_compute_async (sched=0x55555655d7c0, graph=) at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-backend.c:2043
+[#10](https://github.com/ikawrakow/ik_llama.cpp/issues/10) 0x00007ffff7ea3803 in llama_graph_compute (n_threads=16, gf=0x7fdfa06fb030, lctx=...) at /home/ciprian/ai/ik_llama.cpp/src/llama.cpp:17688
+[#11](https://github.com/ikawrakow/ik_llama.cpp/issues/11) llama_decode_internal (batch_all=..., lctx=...) at /home/ciprian/ai/ik_llama.cpp/src/llama.cpp:17904
+[#12](https://github.com/ikawrakow/ik_llama.cpp/issues/12) llama_decode (ctx=0x55555b677230, batch=...) at /home/ciprian/ai/ik_llama.cpp/src/llama.cpp:22299
+[#13](https://github.com/ikawrakow/ik_llama.cpp/issues/13) 0x0000555555608122 in server_context::update_slots (this=0x7fffffffccc0) at /home/ciprian/ai/ik_llama.cpp/examples/server/server.cpp:2355
+[#14](https://github.com/ikawrakow/ik_llama.cpp/issues/14) 0x00005555555e235b in std::function::operator()() const (this=0x7fffffffd8e0) at /usr/include/c++/13/bits/std_function.h:591
+[#15](https://github.com/ikawrakow/ik_llama.cpp/issues/15) server_queue::start_loop (this=this@entry=0x7fffffffd7f8) at /home/ciprian/ai/ik_llama.cpp/examples/server/server.cpp:501
+[#16](https://github.com/ikawrakow/ik_llama.cpp/issues/16) 0x000055555557e3dc in main (argc=, argv=) at /home/ciprian/ai/ik_llama.cpp/examples/server/server.cpp:3509
(gdb)
---
-👤 **ciprianveg** commented the **2025-05-18** at **08:01:05**:
+👤 **ciprianveg** commented on **2025-05-18** at **08:01:05**
this is from ik/disable_multi_add branch
---
-👤 **ikawrakow** commented the **2025-05-18** at **08:11:02**:
+👤 **ikawrakow** commented on **2025-05-18** at **08:11:02**
OK, now
```
@@ -727,22 +614,22 @@ p *split->inputs[1], etc., up to j
---
-👤 **ciprianveg** commented the **2025-05-18** at **08:13:48**:
+👤 **ciprianveg** commented on **2025-05-18** at **08:13:48**
(gdb) frame 8
-#8 0x00007fffef0aeed8 in ggml_backend_sched_compute_splits (sched=0x55555655d7c0) at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-backend.c:1837
+[#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) 0x00007fffef0aeed8 in ggml_backend_sched_compute_splits (sched=0x55555655d7c0) at /home/ciprian/ai/ik_llama.cpp/ggml/src/ggml-backend.c:1837
1837 ggml_backend_synchronize(split_backend);
(gdb)
---
-👤 **ikawrakow** commented the **2025-05-18** at **08:16:19**:
+👤 **ikawrakow** commented on **2025-05-18** at **08:16:19**
And the second part with `p sched->n_splits` etc.?
---
-👤 **ciprianveg** commented the **2025-05-18** at **08:20:09**:
+👤 **ciprianveg** commented on **2025-05-18** at **08:20:09**
(gdb) p sched->n_splits
$1 = 93
@@ -806,7 +693,7 @@ Cannot access memory at address 0x0
---
-👤 **ikawrakow** commented the **2025-05-18** at **09:03:55**:
+👤 **ikawrakow** commented on **2025-05-18** at **09:03:55**
Don't know. Thanks for helping.
@@ -814,7 +701,7 @@ It is attempting to copy the inputs for layer 43 to a GPU. They consist of the r
---
-👤 **ciprianveg** commented the **2025-05-19** at **12:30:36**:
+👤 **ciprianveg** commented on **2025-05-19** at **12:30:36**
Hello, some feedback that might help: With 3 gpus it is working, and considering that is faster than llama.cpp with 4 gpus, it is a win for me. Just fyi, it is not the gpu, because i put all the 3 gpus combination among all my gpus, to be sure i do not have a deffective one and they worked. Maybe because the last pcie is at lower speed and lags behind the rest? and maybe in llama.cpp lower speed being achieved it is still speedy enough?
@@ -822,7 +709,7 @@ Non related question, is there a downside to set a large u_batch, n_batch? setti
---
-👤 **ikawrakow** commented the **2025-05-19** at **12:48:12**:
+👤 **ikawrakow** commented on **2025-05-19** at **12:48:12**
> Non related question, is there a downside to set a large u_batch, n_batch? setting u_batch =3072, n_batch=3072 increased the pp speed from 80t/s (when they were set to 1024) to 180t/s
@@ -834,7 +721,7 @@ The reason MoE models are different from dense models are the experts. If you us
---
-👤 **ikawrakow** commented the **2025-05-19** at **13:06:22**:
+👤 **ikawrakow** commented on **2025-05-19** at **13:06:22**
> Hello, some feedback that might help: With 3 gpus it is working,
@@ -844,13 +731,13 @@ I'm maybe grasping at straws here, but is it possible that your power supply can
---
-👤 **ikawrakow** commented the **2025-05-19** at **13:11:59**:
+👤 **ikawrakow** commented on **2025-05-19** at **13:11:59**
Also related to `u-batch`: If you don't have enough VRAM to go to batch=u-batch=4096, but PP performance is important to you, you may keep one extra layer per GPU on the CPU so you can use the larger u-batch. This will slightly slow down TG, but the decrease in TG performance with fewer layers offloaded to the GPU is quite modest, so you may still prefer the increase in PP performance.
---
-👤 **ciprianveg** commented the **2025-05-19** at **13:15:24**:
+👤 **ciprianveg** commented on **2025-05-19** at **13:15:24**
> > Hello, some feedback that might help: With 3 gpus it is working,
>
@@ -862,7 +749,7 @@ I don't think the power is the issue, nvidia-smi shows the power usage very low,
---
-👤 **Lissanro** commented the **2025-05-20** at **11:13:14**:
+👤 **Lissanro** commented on **2025-05-20** at **11:13:14**
I think I have the same issue, seems to happen periodically. I am using the following command:
@@ -895,56 +782,15 @@ I am using 4x3090 GPUs on EPYC 7763 with 1TB 3200MHz RAM. I am using 2880W serve
---
-👤 **Lissanro** commented the **2025-05-20** at **11:13:14**:
-
-I think I have the same issue, seems to happen periodically. I am using the following command:
-
-```
-/pkgs/ik_llama.cpp/build/bin/llama-server \
---model /mnt/neuro/models/DeepSeek-R1T-Chimera-256x21B-IQ4_K_R4-163840seq/DeepSeek-R1T-Chimera-256x21B-IQ4_K_R4-163840seq.gguf \
---ctx-size 81920 --n-gpu-layers 62 --tensor-split 25,23,26,26 -mla 3 -fa -ctk q8_0 -amb 1024 -fmoe \
--ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0" \
--ot "blk\.4\.ffn_up_exps=CUDA1, blk\.4\.ffn_gate_exps=CUDA1" \
--ot "blk\.5\.ffn_up_exps=CUDA2, blk\.5\.ffn_gate_exps=CUDA2" \
--ot "blk\.6\.ffn_up_exps=CUDA3, blk\.6\.ffn_gate_exps=CUDA3" \
--ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" \
---threads 64 --host 0.0.0.0 --port 5000
-```
-
-Few lines of log before the error and the error itself look very similar to this bug report:
-
-```
-INFO [ log_server_request] request | tid="139488642715648" timestamp=1747701084 remote_addr="127.0.0.1" remote_port=57838 status=200 method="POST" path="/completion" params={}
-INFO [ update_slots] all slots are idle | tid="139972738117632" timestamp=1747701084
-INFO [ launch_slot_with_task] slot is processing task | tid="139972738117632" timestamp=1747726885 id_slot=0 id_task=11339
-INFO [ update_slots] kv cache rm [p0, end) | tid="139972738117632" timestamp=1747726886 id_slot=0 id_task=11339 p0=47064
-CUDA error: an illegal memory access was encountered
- current device: 0, in function ggml_backend_cuda_synchronize at /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu:3067
- cudaStreamSynchronize(cuda_ctx->stream())
-/home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
-```
-
-I am using 4x3090 GPUs on EPYC 7763 with 1TB 3200MHz RAM. I am using server grade PSU to power the video cards and online UPS, and GPUs are stable in all other tasks, including passing overnight memtest_vulkan testing (which verifies VRAM integrity). In case additional debug information from my side could be of help, please let me know.
-
----
-
-👤 **ikawrakow** commented the **2025-05-20** at **14:26:32**:
+👤 **ikawrakow** commented on **2025-05-20** at **14:26:32**
@Lissanro All the experts in this mode use `*_R4` quants? If so, why are you offloading them to the GPUs? The data will have to be copied back to the CPU to do the matrix multiplications.
-To all participants: Does #438 help?
+To all participants: Does [#438](https://github.com/ikawrakow/ik_llama.cpp/issues/438) help?
---
-👤 **ikawrakow** commented the **2025-05-20** at **14:26:32**:
-
-@Lissanro All the experts in this mode use `*_R4` quants? If so, why are you offloading them to the GPUs? The data will have to be copied back to the CPU to do the matrix multiplications.
-
-@all Does #438 help?
-
----
-
-👤 **nux** commented the **2025-05-20** at **14:56:58**:
+👤 **nux** commented on **2025-05-20** at **14:56:58**
Just rebuilt and tried and got the error:
May 20 09:47:03 red llama-swap[1412]: CUDA error: an illegal memory access was encountered
@@ -959,13 +805,13 @@ I sent another prompt with only the regex and it didn't crash....hmm
---
-👤 **Panchovix** commented the **2025-05-20** at **14:57:15**:
+👤 **Panchovix** commented on **2025-05-20** at **14:57:15**
I will try to test ASAP, I'm on vacations so my time is a bit more limited to try it via ssh
---
-👤 **ciprianveg** commented the **2025-05-20** at **15:21:17**:
+👤 **ciprianveg** commented on **2025-05-20** at **15:21:17**
same error:
@@ -979,46 +825,46 @@ again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
---
-👤 **ikawrakow** commented the **2025-05-20** at **15:23:43**:
+👤 **ikawrakow** commented on **2025-05-20** at **15:23:43**
-OK, thanks. So #438 does not fix it.
+OK, thanks. So [#438](https://github.com/ikawrakow/ik_llama.cpp/issues/438) does not fix it.
---
-👤 **ciprianveg** commented the **2025-05-20** at **16:07:36**:
+👤 **ciprianveg** commented on **2025-05-20** at **16:07:36**
@ikawrakow can it have something to do with not sanitizing the prompt? it would explain why in bench and cli it doesnt happen..
openwebui appends the "/no_prompt" and some tools. It is strange that I removed "\no_think" from the prompt and it didn't crash.. Cand be also related to exact prompt length and how it is split..
---
-👤 **ikawrakow** commented the **2025-05-20** at **16:25:22**:
+👤 **ikawrakow** commented on **2025-05-20** at **16:25:22**
@ciprianveg I don't know. The crash reports are inconsistent with any hypothesis that I had. And in my own testing I'm just not able to crash it. Some users have found workarounds. For some users it does not crash. I have no idea what it is.
---
-👤 **ciprianveg** commented the **2025-05-20** at **16:54:27**:
+👤 **ciprianveg** commented on **2025-05-20** at **16:54:27**
Workarounds other than limiting the no of gpus?
---
-👤 **nux** commented the **2025-05-20** at **17:03:07**:
+👤 **nux** commented on **2025-05-20** at **17:03:07**
I only have one GPU. If I put a single layer -ngl 1 on the gpu it will crash for me. https://github.com/ikawrakow/ik_llama.cpp/issues/425#issuecomment-2884657811
---
-👤 **ikawrakow** commented the **2025-05-21** at **04:43:40**:
+👤 **ikawrakow** commented on **2025-05-21** at **04:43:40**
-> I only have one GPU. If I put a single layer -ngl 1 on the gpu it will crash for me. [#425 (comment)](https://github.com/ikawrakow/ik_llama.cpp/issues/425#issuecomment-2884657811)
+> I only have one GPU. If I put a single layer -ngl 1 on the gpu it will crash for me. [[#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425) (comment)](https://github.com/ikawrakow/ik_llama.cpp/issues/425#issuecomment-2884657811)
This is what makes it even more confusing. Everybody else reporting a crash has more than one GPU. I have one GPU and can never make it fail. I almost always use partial offload as only toy models fit on my 16 GB GPU.
---
-👤 **Lissanro** commented the **2025-05-21** at **05:24:08**:
+👤 **Lissanro** commented on **2025-05-21** at **05:24:08**
@ikawrakow
> All the experts in this mode use *_R4 quants? If so, why are you offloading them to the GPUs? The data will have to be copied back to the CPU to do the matrix multiplications.
@@ -1039,25 +885,25 @@ In the meantime, I will keep testing using the latest patch to see if the crash
---
-👤 **ikawrakow** commented the **2025-05-21** at **06:02:16**:
+👤 **ikawrakow** commented on **2025-05-21** at **06:02:16**
-Please use branch in PR #442 and post the CUDA call trace that will be printed when the application crashes.
+Please use branch in PR [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) and post the CUDA call trace that will be printed when the application crashes.
---
-👤 **ikawrakow** commented the **2025-05-21** at **06:18:06**:
+👤 **ikawrakow** commented on **2025-05-21** at **06:18:06**
@Lissanro
If you are observing such huge compute buffers, you most likely need to rebuild using `-DGGML_SCHED_MAX_COPIES=1`.
-There was also PR #405, which changed the GPU offload policy. After that PR the fused experts operation that gets used when `-fmoe` is specified gets offloaded to the GPU for PP. This speeds up PP quite a bit especially, if you use a large value for u-batch. But the offloading will only happen if the tensors are not repacked. After rebuilding with `-DGGML_SCHED_MAX_COPIES=1` you can try using your not repacked model with `-b 4096 -ub 4096`. If you don't have enough VRAM, you can offload fewer tensors to the GPU. The larger u-batch will increase PP speed with a very modest impact on TG performance due to the fewer experts offloaded to the GPU. With experts ops offloaded to the GPU it is also better to offload all 3 types of experts (as opposed to pre-#405, where it was better to offload more layers of `ffn_up_exps` and `ffn_gate_exps`).
+There was also PR [#405](https://github.com/ikawrakow/ik_llama.cpp/issues/405), which changed the GPU offload policy. After that PR the fused experts operation that gets used when `-fmoe` is specified gets offloaded to the GPU for PP. This speeds up PP quite a bit especially, if you use a large value for u-batch. But the offloading will only happen if the tensors are not repacked. After rebuilding with `-DGGML_SCHED_MAX_COPIES=1` you can try using your not repacked model with `-b 4096 -ub 4096`. If you don't have enough VRAM, you can offload fewer tensors to the GPU. The larger u-batch will increase PP speed with a very modest impact on TG performance due to the fewer experts offloaded to the GPU. With experts ops offloaded to the GPU it is also better to offload all 3 types of experts (as opposed to pre-[#405](https://github.com/ikawrakow/ik_llama.cpp/issues/405), where it was better to offload more layers of `ffn_up_exps` and `ffn_gate_exps`).
-The downside of the above is that you will increase the probability for a crash. But if you use #442, this may help debug the issue.
+The downside of the above is that you will increase the probability for a crash. But if you use [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442), this may help debug the issue.
---
-👤 **Lissanro** commented the **2025-05-21** at **13:24:25**:
+👤 **Lissanro** commented on **2025-05-21** at **13:24:25**
@ikawrakow Thank you, I recompiled with `-DGGML_SCHED_MAX_COPIES=1` as you suggested and now can use `-b 4096 -ub 4096`, and I had room to add more tensors as well:
@@ -1079,7 +925,7 @@ By the way, is my understanding correct that repacking no longer necessary, or i
---
-Unfortunately, the issue is still there (I have applied #442 for debugging):
+Unfortunately, the issue is still there (I have applied [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) for debugging):
```txt
CUDA error: an illegal memory access was encountered
@@ -1130,80 +976,7 @@ Another observation, does not seem to depend on context length. Both short (less
---
-👤 **Lissanro** commented the **2025-05-21** at **13:24:25**:
-
-@ikawrakow Thank you, I recompiled with `-DGGML_SCHED_MAX_COPIES=1` as you suggested and now can use `-b 4096 -ub 4096`, and I had room to add more tensors as well:
-
-```
-/pkgs/ik_llama.cpp/build/bin/llama-server \
---model /mnt/neuro/models/DeepSeek-R1T-Chimera-256x21B-IQ4_K-163840seq/DeepSeek-R1T-Chimera-256x21B-IQ4_K-163840seq.gguf \
---ctx-size 81920 --n-gpu-layers 62 --tensor-split 25,23,26,26 -mla 3 -fa -ctk q8_0 -amb 1024 -fmoe -b 4096 -ub 4096 \
--ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0, blk\.3\.ffn_down_exps=CUDA0" \
--ot "blk\.4\.ffn_up_exps=CUDA1, blk\.4\.ffn_gate_exps=CUDA1, blk\.4\.ffn_down_exps=CUDA1" \
--ot "blk\.5\.ffn_up_exps=CUDA2, blk\.5\.ffn_gate_exps=CUDA2, blk\.5\.ffn_down_exps=CUDA2" \
--ot "blk\.6\.ffn_up_exps=CUDA3, blk\.6\.ffn_gate_exps=CUDA3, blk\.6\.ffn_down_exps=CUDA3" \
--ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" \
---threads 64 --host 0.0.0.0 --port 5000
-```
-
-Now I am getting 100-105 tokens/s for input processing, with little impact on generation speed - which is excellent, given I often work with long context tasks and long prompts.
-
-By the way, is my understanding correct that repacking no longer necessary, or is there still some benefit to repack CPU-only tensors as R4?
-
----
-
-Unfortunately, the issue is still there (I have applied #439 and #442):
-
-```txt
-CUDA error: an illegal memory access was encountered
- current device: 0, in function ggml_backend_cuda_synchronize at /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu:3085
- cudaStreamSynchronize(cuda_ctx->stream())
-========================== CUDA trace: 5239365 previous calls
- 5239364: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239363: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
- 5239362: function ggml_cuda_op_mul_mat_cublas, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1388
- 5239361: function ggml_cuda_op_mul_mat_cublas, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1387
- 5239360: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239359: function ggml_cuda_set_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
- 5239358: function ggml_cuda_set_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
- 5239357: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
- 5239356: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239355: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239354: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239353: function ggml_cuda_set_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
- 5239352: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1632
- 5239351: function ggml_cuda_set_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 129
- 5239350: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
- 5239349: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239348: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239347: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1745
- 5239346: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1735
- 5239345: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
- 5239344: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239343: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239342: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1745
- 5239341: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1735
- 5239340: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
- 5239339: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239338: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239337: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1745
- 5239336: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1735
- 5239335: function ggml_cuda_op_mul_mat, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 1755
- 5239334: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239333: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
- 5239332: function ggml_cuda_get_device, file /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu, line 140
-/home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu:122: CUDA error
-```
-
-As far as I can tell, probability of it happening is about the same as before. What I noticed though, it seems to never happen on the first try, usually when I try to regenerate a message, or maybe on the next message. It is also hard to reproduce - using exactly the same input prompt, sometimes I can regenerate messages all I want, sometimes it crashes on the second try.
-
-For some reason, if I let it generate without thinking first, then try to force thinking by specifying "" as the start of a reply, and then regenerate a message, it is very likely to crash ("" by itself does not cause the crash, if AI's reply starts with it, and I then regenerate, then it does not crash usually regardless if the next message with or without thinking). Not sure yet if this is truly affects probability of the crash or just few coincidences, but I thought I mention this - I tried few times with different prompts and seems like generating first message without thinking, then with thinking, is the fastest way to trigger the bug.
-
-Another observation, does not seem to depend on context length. Both short (less than 1K) and long (40K+) context seem to have about the same probability of the crash.
-
----
-
-👤 **ikawrakow** commented the **2025-05-21** at **13:48:35**:
+👤 **ikawrakow** commented on **2025-05-21** at **13:48:35**
> By the way, is my understanding correct that repacking no longer necessary, or is there still some benefit to repack CPU-only tensors as R4?
@@ -1211,13 +984,13 @@ It depends where the matrix multiplications for PP are done (TG is always done w
---
-👤 **ikawrakow** commented the **2025-05-21** at **15:15:52**:
+👤 **ikawrakow** commented on **2025-05-21** at **15:15:52**
-I have added a trace to synchronize calls in the ggml-backend to #442 if someone wants to try.
+I have added a trace to synchronize calls in the ggml-backend to [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) if someone wants to try.
---
-👤 **ciprianveg** commented the **2025-05-21** at **15:58:44**:
+👤 **ciprianveg** commented on **2025-05-21** at **15:58:44**
Hi @ikawrakow, here it is:
@@ -1262,7 +1035,7 @@ CUDA error: an illegal memory access was encountered
---
-👤 **maxious** commented the **2025-05-21** at **15:59:52**:
+👤 **maxious** commented on **2025-05-21** at **15:59:52**
same here
```
@@ -1310,7 +1083,7 @@ CUDA error: an illegal memory access was encountered
---
-👤 **ikawrakow** commented the **2025-05-21** at **17:04:08**:
+👤 **ikawrakow** commented on **2025-05-21** at **17:04:08**
In both of these data is copied from one device to another. Then the back-end attempts to synchronize before copying the next tensor, and that's where it crashes.
@@ -1320,19 +1093,13 @@ I could try printf debugging (will flood your terminals with printouts), but it
---
-👤 **ciprianveg** commented the **2025-05-21** at **18:17:31**:
+👤 **ikawrakow** commented on **2025-05-22** at **06:45:01**
-do these suggestions make sense or are hallucinations: https://chat.qwen.ai/s/b35fc22c-a36c-4b50-a296-6058ba15f313?fev=0.0.95
+If you are not tired of testing, there are new changes on [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442)
---
-👤 **ikawrakow** commented the **2025-05-22** at **06:45:01**:
-
-If you are not tired of testing, there are new changes on #442
-
----
-
-👤 **ciprianveg** commented the **2025-05-22** at **07:06:53**:
+👤 **ciprianveg** commented on **2025-05-22** at **07:06:53**
Hi @ikawrakow, this is the log:
ggml_backend_cuda_synchronize: curent device is 3, context device is 0
@@ -1412,7 +1179,7 @@ Could not attach to process. If your uid matches the uid of the target
---
-👤 **ikawrakow** commented the **2025-05-22** at **07:15:50**:
+👤 **ikawrakow** commented on **2025-05-22** at **07:15:50**
Thanks!
@@ -1420,7 +1187,7 @@ What if you build with `-DGGML_CUDA_NO_PEER_COPY=1` ?
---
-👤 **ciprianveg** commented the **2025-05-22** at **07:31:23**:
+👤 **ciprianveg** commented on **2025-05-22** at **07:31:23**
i built it like this:
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_NO_PEER_COPY=1
@@ -1437,7 +1204,7 @@ llm_load_tensors: CUDA3 buffer size = 11628.83 MiB
---
-👤 **ikawrakow** commented the **2025-05-22** at **07:36:22**:
+👤 **ikawrakow** commented on **2025-05-22** at **07:36:22**
OK, then discard `DGGML_CUDA_NO_PEER_COPY=1`. There was another peer to peer copy without a check, so pushed a new commit.
@@ -1445,7 +1212,7 @@ The thing I don't understand is how this can work in `llama.cpp` when I don't se
---
-👤 **ciprianveg** commented the **2025-05-22** at **07:44:58**:
+👤 **ciprianveg** commented on **2025-05-22** at **07:44:58**
i build without: -DGGML_CUDA_NO_PEER_COPY=1 and i still get the loading seg fault(should i delete all build dir to start from 0?):
llm_load_tensors: offloaded 95/95 layers to GPU
@@ -1459,7 +1226,7 @@ llm_load_tensors: CUDA3 buffer size = 12339.69 MiB
---
-👤 **ikawrakow** commented the **2025-05-22** at **08:13:56**:
+👤 **ikawrakow** commented on **2025-05-22** at **08:13:56**
Are you using `ccache`? My experience with `ccache` is that it does get confused and does not always rebuild correctly.
@@ -1467,13 +1234,13 @@ If you don't have anything of value in the build folder, yes, just delete it and
---
-👤 **ikawrakow** commented the **2025-05-22** at **08:14:40**:
+👤 **ikawrakow** commented on **2025-05-22** at **08:14:40**
Oh, and pull another time.
---
-👤 **ciprianveg** commented the **2025-05-22** at **08:53:00**:
+👤 **ciprianveg** commented on **2025-05-22** at **08:53:00**
@ikawrakow Done:
INFO [ update_slots] kv cache rm [p0, end) | tid="134731138850816" timestamp=1747903765 id_slot=0 id_task=0 p0=0
@@ -1662,7 +1429,7 @@ ggml_backend_cuda_synchronize: reverting device to 3
---
-👤 **ikawrakow** commented the **2025-05-22** at **09:41:44**:
+👤 **ikawrakow** commented on **2025-05-22** at **09:41:44**
So, there is no peer-to-peer access for your devices?
@@ -1670,7 +1437,7 @@ OK, so then let's try to follow the other Qwen3 suggestion: use `cuda-memcheck y
---
-👤 **ciprianveg** commented the **2025-05-22** at **11:30:31**:
+👤 **ciprianveg** commented on **2025-05-22** at **11:30:31**
it is a lot of output from compute-sanitizer:
@@ -3129,19 +2896,19 @@ The program is not being run.
---
-👤 **ikawrakow** commented the **2025-05-22** at **12:31:43**:
+👤 **ikawrakow** commented on **2025-05-22** at **12:31:43**
Thank you for this. You are using UD-Q4_K_XL ?
---
-👤 **ciprianveg** commented the **2025-05-22** at **12:33:49**:
+👤 **ciprianveg** commented on **2025-05-22** at **12:33:49**
Yes. Same thing happens also with UD-Q3_K_XL, in ik_llama only. Do you want me to test with another 235b model? A non UD one?
---
-👤 **ikawrakow** commented the **2025-05-22** at **13:44:32**:
+👤 **ikawrakow** commented on **2025-05-22** at **13:44:32**
So, the only hypothesis I can make is that somehow the tensor metadata for one of the tensors is incorrect (else we cannot get the out of bounds access reported by the sanitizer). That's why I asked for the model. In UD-XL the `ffn_down` experts are quantized with more bits than `ffn_up` and `ffn_gate` in the first few layers. If we somehow are using the metadata (quantization type, etc.) for such a tensor in later layers, than we can get the out-of-bounds access.
@@ -3149,7 +2916,7 @@ To confirm, I have pushed another change that checks for an error in `ggml_cuda
---
-👤 **ciprianveg** commented the **2025-05-22** at **14:06:23**:
+👤 **ciprianveg** commented on **2025-05-22** at **14:06:23**
@ikawrakow, logs:
@@ -3210,25 +2977,25 @@ The program is not being run.
---
-👤 **ikawrakow** commented the **2025-05-22** at **14:11:23**:
+👤 **ikawrakow** commented on **2025-05-22** at **14:11:23**
This is a new. What is different to the previous times?
---
-👤 **ciprianveg** commented the **2025-05-22** at **14:22:44**:
+👤 **ciprianveg** commented on **2025-05-22** at **14:22:44**
Just git pull and rebuilt..
---
-👤 **ikawrakow** commented the **2025-05-22** at **14:24:29**:
+👤 **ikawrakow** commented on **2025-05-22** at **14:24:29**
You left out `-fmoe`
---
-👤 **ciprianveg** commented the **2025-05-22** at **14:45:09**:
+👤 **ciprianveg** commented on **2025-05-22** at **14:45:09**
@ikawrakow, you are right:
@@ -3255,13 +3022,13 @@ The program is not being run.
---
-👤 **ikawrakow** commented the **2025-05-22** at **15:04:05**:
+👤 **ikawrakow** commented on **2025-05-22** at **15:04:05**
Are you tired of testing yet? I have pushed another change.
---
-👤 **ikawrakow** commented the **2025-05-22** at **15:27:06**:
+👤 **ikawrakow** commented on **2025-05-22** at **15:27:06**
Btw, with the regex you are using for the tensor overrides, the small `ffn` tensors (`ffn_gate_inp` and `ffn_norm`) remain on the CPU. This results in more graph splits. Testing with Qwen3-30B-A3B with a single RTX-4080, I get
@@ -3272,13 +3039,13 @@ PP is the same.
---
-👤 **ciprianveg** commented the **2025-05-22** at **15:32:27**:
+👤 **ciprianveg** commented on **2025-05-22** at **15:32:27**
I will rebuild, change the regex and retest, in about an hour, i am out a bit..
On Thu, 22 May 2025, 18:27 Kawrakow, ***@***.***> wrote:
-> *ikawrakow* left a comment (ikawrakow/ik_llama.cpp#425)
+> *ikawrakow* left a comment (ikawrakow/ik_llama.cpp[#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425))
>
>
> Btw, with the regex you are using for the tensor overrides, the small ffn
@@ -3304,39 +3071,7 @@ On Thu, 22 May 2025, 18:27 Kawrakow, ***@***.***> wrote:
---
-👤 **ciprianveg** commented the **2025-05-22** at **15:32:27**:
-
-I will change the regex and retest, in about an hour, i am out a bit..
-
-On Thu, 22 May 2025, 18:27 Kawrakow, ***@***.***> wrote:
-
-> *ikawrakow* left a comment (ikawrakow/ik_llama.cpp#425)
->
->
-> Btw, with the regex you are using for the tensor overrides, the small ffn
-> tensors (ffn_gate_inp and ffn_norm) remain on the CPU. This results in
-> more graph splits. Testing with Qwen3-30B-A3B with a single RTX-4080, I get
->
-> - TG = 70.4 t/s using -ot "blk\.[3-4][0-9].ffn_.*_exps=CPU". There are
-> 38 graph splits
-> - TG = 66.7 t/s using `-ot "blk.[3-4][0-9].ffn.*=CPU". There are 74
-> graph splits.
->
-> PP is the same.
->
-> —
-> Reply to this email directly, view it on GitHub
-> ,
-> or unsubscribe
->
-> .
-> You are receiving this because you were mentioned.Message ID:
-> ***@***.***>
->
-
----
-
-👤 **ciprianveg** commented the **2025-05-22** at **17:43:10**:
+👤 **ciprianveg** commented on **2025-05-22** at **17:43:10**
Hi @ikawrakow, here it is:
ggml_backend_cuda_buffer_cpy_tensor: attempt to copy from device 0 to device 2 without access enabled
@@ -3388,13 +3123,13 @@ The program is not being run.
---
-👤 **ciprianveg** commented the **2025-05-22** at **18:09:34**:
+👤 **ciprianveg** commented on **2025-05-22** at **18:09:34**
and also thanks for the regex tip, i got a 6% increase in gen speed.
---
-👤 **ikawrakow** commented the **2025-05-23** at **05:06:33**:
+👤 **ikawrakow** commented on **2025-05-23** at **05:06:33**
Hopefully the last change fixes it...
@@ -3402,15 +3137,15 @@ There really was a bug showing up when 2 or 3 tokens are processed.
---
-👤 **ciprianveg** commented the **2025-05-23** at **08:17:10**:
+👤 **ciprianveg** commented on **2025-05-23** at **08:17:10**
I won't be able to test it till tomorrow evening..
---
-👤 **Lissanro** commented the **2025-05-23** at **08:19:53**:
+👤 **Lissanro** commented on **2025-05-23** at **08:19:53**
-I rebuilt from the latest git, and it crashed when regenarating reply by getting triggered the same way as before, so unfortunately seem to be no change on my end. However, for some strange reason applying #442 "fixes" the bug. Below I provide detailed debug info.
+I rebuilt from the latest git, and it crashed when regenarating reply by getting triggered the same way as before, so unfortunately seem to be no change on my end. However, for some strange reason applying [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) "fixes" the bug. Below I provide detailed debug info.
First, I generate reply without thinking, which works fine, then with the `` tag, which crashes it; if I start generating first message with `` then the bug usually does not trigger when I try to regenerate it. May be it has nothing to do with the thinking mode, but slightly bigger partial match in the cache when the next message regenerates, forcing slightly different timings? Just regenerating non-thinking replies or thinking replies may not trigger it at all, but so far, generating non-thinking then thinking reply triggers it in all of cases that I have tried regardless if prompt is less than 1K tokens or 40K+ tokens long. Since I tried relatively few times, I am not yet 100% sure if is the most reliable way to trigger it, but so far it does it for me:
@@ -3421,7 +3156,7 @@ CUDA error: an illegal memory access was encountered
/home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
```
-With #442 applied the bug does not trigger anymore (or becomes much less probable to happen), but I get a lot of warnings like both before I send my first prompt, and after:
+With [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) applied the bug does not trigger anymore (or becomes much less probable to happen), but I get a lot of warnings like both before I send my first prompt, and after:
```
ggml_backend_cuda_cpy_tensor_async: attempt to copy from device 0 to device 1 without access enabled
@@ -3430,38 +3165,38 @@ ggml_backend_cuda_buffer_cpy_tensor: attempt to copy from device 0 to device 1 w
Full log (most of repeated lines replaced with "..." since they look the same) after generating first reply: https://pastebin.com/8F1YNFyw
-Second log after generating the second reply with the `` tag, which usually triggers the bug without #442 applied: https://pastebin.com/VUgDKehw
+Second log after generating the second reply with the `` tag, which usually triggers the bug without [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) applied: https://pastebin.com/VUgDKehw
-My only guess, #442 changes timings somehow and workarounds the bug in most cases. Just to be sure, I tried rebuilding without the patch, and the bug is back again, very reproducible using the method described above, no matter the content of the prompt as far as I can tell.
+My only guess, [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) changes timings somehow and workarounds the bug in most cases. Just to be sure, I tried rebuilding without the patch, and the bug is back again, very reproducible using the method described above, no matter the content of the prompt as far as I can tell.
-Previously, I tried with older #442 version and the bug still could trigger (I shared the debug output here in the previous messages), so I guess updated version #442 started to work as a workaround.
+Previously, I tried with older [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) version and the bug still could trigger (I shared the debug output here in the previous messages), so I guess updated version [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) started to work as a workaround.
Also, I wonder if it is supposed to attempt to copy from device to device without access enabled? Maybe fixing this warning could lead to an actual fix?
---
-👤 **ikawrakow** commented the **2025-05-23** at **08:23:31**:
+👤 **ikawrakow** commented on **2025-05-23** at **08:23:31**
-The bug is fixed on #442, but only as of this morning European time.
+The bug is fixed on [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442), but only as of this morning European time.
-It is not fixed on the main branch. I wanted to first have confirmation that the last change in #442 actually fixes it before making a fresh bug fix PR.
+It is not fixed on the main branch. I wanted to first have confirmation that the last change in [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) actually fixes it before making a fresh bug fix PR.
---
-👤 **ikawrakow** commented the **2025-05-23** at **08:47:09**:
+👤 **ikawrakow** commented on **2025-05-23** at **08:47:09**
> Also, I wonder if it is supposed to attempt to copy from device to device without access enabled? Maybe fixing this warning could lead to an actual fix?
So, this was a wisdom from Qwen3. But the only place in mainline `llama.cpp` where peer-to-peer access is explicitly enabled or disabled is when using split mode row, which is not the case here. Considering that mainline works, these checks are not required.
-The bug was in the matrix-vector multiplication kernel. It only shows up when the number of rows being processed (i.e., tokens) is 2 or 3 (the matrix-vector kernel confusingly processes up to 8 rows). This is not used during TG, and only triggers if an expert ends up with 2 or 3 rows, which is rare. I think all other changes on #442 are not required. The reason it took me so long to find is my lack of GPU experience (and my laziness to actually read the CUDA API specification). I realized only yesterday that checking for an error after launching a CUDA kernel does not tell us that the kernel was successfully executed, but only tells us that the kernel was successfully **queued** for execution. If there is a bug in the kernel (e.g., illegal memory access), the resulting error will get reported in some later call. Hence we were observing the illegal memory access error in synchronization calls, which made me think that there was something wrong in the back-end, data copying between devices, etc. So, most of what Qwen3 wrote were useless hallucinations. But at the end Qwen3 was actually useful, as the hallucinations were what made me go and read the CUDA programming guide.
+The bug was in the matrix-vector multiplication kernel. It only shows up when the number of rows being processed (i.e., tokens) is 2 or 3 (the matrix-vector kernel confusingly processes up to 8 rows). This is not used during TG, and only triggers if an expert ends up with 2 or 3 rows, which is rare. I think all other changes on [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) are not required. The reason it took me so long to find is my lack of GPU experience (and my laziness to actually read the CUDA API specification). I realized only yesterday that checking for an error after launching a CUDA kernel does not tell us that the kernel was successfully executed, but only tells us that the kernel was successfully **queued** for execution. If there is a bug in the kernel (e.g., illegal memory access), the resulting error will get reported in some later call. Hence we were observing the illegal memory access error in synchronization calls, which made me think that there was something wrong in the back-end, data copying between devices, etc. So, most of what Qwen3 wrote were useless hallucinations. But at the end Qwen3 was actually useful, as the hallucinations were what made me go and read the CUDA programming guide.
---
-👤 **Lissanro** commented the **2025-05-23** at **08:56:21**:
+👤 **Lissanro** commented on **2025-05-23** at **08:56:21**
> The bug is fixed on https://github.com/ikawrakow/ik_llama.cpp/pull/442, but only as of this morning European time.
-I see, I guess I got confused by "CUDA call tracer #442" title, and did not pay enough attention to notice it also adds fixes, not just call traces. My apologies.
+I see, I guess I got confused by "CUDA call tracer [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442)" title, and did not pay enough attention to notice it also adds fixes, not just call traces. My apologies.
In order to confirm what fixed the bug, I rebuilt with only [Fix bug in MMVQ kernel](https://github.com/ikawrakow/ik_llama.cpp/pull/442/commits/b79be8a191c10883a84d725ae9e70ec693ab3b6b) applied, and the bug seems to be fixed as far as I can tell using just this one commit.
\ No newline at end of file
diff --git a/github-data/issues/432 - Refactor_ GGUF v14 broke compatibility with IQx_KS quants.md b/github-data/issues/432 - Refactor GGUF v14 broke compatibility with IQx_KS quants.md
similarity index 77%
rename from github-data/issues/432 - Refactor_ GGUF v14 broke compatibility with IQx_KS quants.md
rename to github-data/issues/432 - Refactor GGUF v14 broke compatibility with IQx_KS quants.md
index 934299724..8e07c019c 100644
--- a/github-data/issues/432 - Refactor_ GGUF v14 broke compatibility with IQx_KS quants.md
+++ b/github-data/issues/432 - Refactor GGUF v14 broke compatibility with IQx_KS quants.md
@@ -1,4 +1,4 @@
-### 📝 [#432](https://github.com/ikawrakow/ik_llama.cpp/issues/432) - Refactor: GGUF v14 broke compatibility with IQx_KS quants
+## 📌 [Issue #432](https://github.com/ikawrakow/ik_llama.cpp/issues/432) - Refactor: GGUF v14 broke compatibility with IQx_KS quants
| **Author** | `Nexesenex` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### Background Description
@@ -42,17 +42,17 @@ Well, I tried to check that by myself when GGUF v14 was out, where was the intro
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-18** at **15:07:29**:
+👤 **ikawrakow** commented on **2025-05-18** at **15:07:29**
-#45 in this repository or a PR somewhere else?
+[#45](https://github.com/ikawrakow/ik_llama.cpp/issues/45) in this repository or a PR somewhere else?
What is GGUF v14 anyway and why should we care about it here?
---
-👤 **Nexesenex** commented the **2025-05-18** at **15:21:31**:
+👤 **Nexesenex** commented on **2025-05-18** at **15:21:31**
Yes, PR 45 in the IK Llama repo.
@@ -75,30 +75,7 @@ I just wanted to point out what happened, because I spent a few hours trying to
---
-👤 **Nexesenex** commented the **2025-05-18** at **15:21:31**:
-
-Yes, PR 45 in the IK Llama repo.
-
-Since the 14th revision of the GGUF format went out on mainline, it seems that some screws got tightened.
-
-https://github.com/ggml-org/llama.cpp/pull/11030
-
-Maybe one of those 2 "restrictions" :
-```
-- Restricted the key general.alignment to uint32_t and powers of 2. On master this key can be set to other types (allowing users to write a file that then causes an error on read) and other values (which don't work correctly with GGML_PAD). There is now a macro GGUF_KEY_GENERAL_ALIGNMENT since this key has a special meaning.
-- If user code tries to call gguf_get_arr_data on a string array an error is raised. On master this returns a pointer of type gguf_str, a type defined in ggml.c. I would consider this a misuse of the API.
-```
-
-Before that PR, I could use all your quants (at the time, IQ_K, IQ_KS, and IQ_KT). After that, only the first gen of IQ_K quants (2,3,4,5,6) are functioning, the rest produce offset errors.
-
-You have absolutely no reason to help me on this, except to maintain some relative compatibility between the quants produced by IK_LLama and a fork of mainline implementing the IK quants.
-But I understand perfectly that you most likely will not want to waste your time trying to fix compatibility with some - potentially adverse - or at least factually incompatible mainline coding and refactoring which is unrelated to IK_LLama.
-
-I just wanted to point out what happened, because I spent a few hours trying to figure this out a few months ago before giving up, and deciding to follow the mainline move to avoid a growing merge-hell later on.
-
----
-
-👤 **ikawrakow** commented the **2025-05-18** at **15:44:53**:
+👤 **ikawrakow** commented on **2025-05-18** at **15:44:53**
@Nexesenex
@@ -158,7 +135,7 @@ Let me know how it goes.
---
-👤 **JohannesGaessler** commented the **2025-05-18** at **16:28:45**:
+👤 **JohannesGaessler** commented on **2025-05-18** at **16:28:45**
On the mainline repository the implementation is
@@ -173,7 +150,7 @@ Doing this calculation manually can be seen as a defect but it only manifests as
---
-👤 **ikawrakow** commented the **2025-05-18** at **16:38:18**:
+👤 **ikawrakow** commented on **2025-05-18** at **16:38:18**
> On the mainline repository the implementation is
@@ -183,13 +160,13 @@ Yes, this is the current implementation. But that implementation can change, and
---
-👤 **JohannesGaessler** commented the **2025-05-18** at **16:52:34**:
+👤 **JohannesGaessler** commented on **2025-05-18** at **16:52:34**
Yes, I agree that it's better to use `ggml_row_size`. If I write new code or touch existing code I will replace it as appropriate. It's a defect. But as there are no inputs that can provoke incorrect results on the mainline repository this defect is not manifesting as a bug and it is fairly low-priority. If this issue is of higher priority for someone else they will need to go through the code and fix the defect where applicable themself.
---
-👤 **Nexesenex** commented the **2025-05-18** at **17:43:13**:
+👤 **Nexesenex** commented on **2025-05-18** at **17:43:13**
@ikawrakow : it works. Tyvm!
@@ -203,12 +180,12 @@ Note : I speak on my own and sole behalf, but I needed to say this.
---
-👤 **ikawrakow** commented the **2025-05-19** at **14:11:02**:
+👤 **ikawrakow** commented on **2025-05-19** at **14:11:02**
@Nexesenex I think I can close this now.
---
-👤 **Nexesenex** commented the **2025-05-19** at **15:03:12**:
+👤 **Nexesenex** commented on **2025-05-19** at **15:03:12**
Yep. Thank again, @ikawrakow.
\ No newline at end of file
diff --git a/github-data/issues/433 - Feature Request_ CORS support.md b/github-data/issues/433 - Feature Request CORS support.md
similarity index 85%
rename from github-data/issues/433 - Feature Request_ CORS support.md
rename to github-data/issues/433 - Feature Request CORS support.md
index 638d97212..cbc4644e3 100644
--- a/github-data/issues/433 - Feature Request_ CORS support.md
+++ b/github-data/issues/433 - Feature Request CORS support.md
@@ -1,14 +1,15 @@
-### ✨ [#433](https://github.com/ikawrakow/ik_llama.cpp/issues/433) - Feature Request: CORS support
+## 📌 [Issue #433](https://github.com/ikawrakow/ik_llama.cpp/issues/433) - Feature Request: CORS support
| **Author** | `KCS-Mack` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-05-18 |
| **Updated** | 2025-05-18 |
+| **Labels** | `enhancement` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -37,9 +38,9 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-05-18** at **15:39:10**:
+👤 **ubergarm** commented on **2025-05-18** at **15:39:10**
You could use any reverse proxy to add this yourself e.g. nginx, caddy server, etc.
diff --git a/github-data/issues/436 - Bug_ Saving the prompt cache causes Segfault.md b/github-data/issues/436 - Bug Saving the prompt cache causes Segfault.md
similarity index 84%
rename from github-data/issues/436 - Bug_ Saving the prompt cache causes Segfault.md
rename to github-data/issues/436 - Bug Saving the prompt cache causes Segfault.md
index 1bf441a39..fe6373545 100644
--- a/github-data/issues/436 - Bug_ Saving the prompt cache causes Segfault.md
+++ b/github-data/issues/436 - Bug Saving the prompt cache causes Segfault.md
@@ -1,4 +1,4 @@
-### 🐛 [#436](https://github.com/ikawrakow/ik_llama.cpp/issues/436) - Bug: Saving the prompt cache causes Segfault
+## 📌 [Issue #436](https://github.com/ikawrakow/ik_llama.cpp/issues/436) - Bug: Saving the prompt cache causes Segfault
| **Author** | `saood06` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -36,9 +36,9 @@ Segmentation fault (core dumped)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-05-28** at **06:30:58**:
+👤 **saood06** commented on **2025-05-28** at **06:30:58**
I finally got some time to look into this more and I think the cause of issue seems to be the fact that the function [here](https://github.com/ikawrakow/ik_llama.cpp/blob/ccd6d9cdf6851f7042c48d682daf47bc0e2eca27/src/llama.cpp#L21453) references kv_self.k_l and kv_self.v_l and since I was using Deepseek with FlashMLA-3 where kv_l see [here](https://github.com/ikawrakow/ik_llama.cpp/blob/ccd6d9cdf6851f7042c48d682daf47bc0e2eca27/src/llama.cpp#L2995) is used instead (and kvt_l would have also been used if I was using a different implementation of MLA).
@@ -46,7 +46,7 @@ I finally got some time to look into this more and I think the cause of issue se
---
-👤 **ikawrakow** commented the **2025-05-28** at **08:08:32**:
+👤 **ikawrakow** commented on **2025-05-28** at **08:08:32**
Yes, this part has not been updated at all. There are two issues:
* Using `kv_l` and possibly `kvt_l` instead of `k_l` and `v_l`. I guess, it would be best to just get rid of `kv_l` and `kvt_l` (they came from the initial implementation) and just use `k_l` and `v_l` instead. This would be relatively easy to change.
@@ -54,7 +54,7 @@ Yes, this part has not been updated at all. There are two issues:
---
-👤 **saood06** commented the **2025-05-28** at **08:56:12**:
+👤 **saood06** commented on **2025-05-28** at **08:56:12**
>Using `kv_l` and possibly `kvt_l` instead of `k_l` and `v_l`. I guess, it would be best to just get rid of `kv_l` and `kvt_l` (they came from the initial implementation) and just use `k_l` and `v_l` instead. This would be relatively easy to change.
@@ -66,13 +66,13 @@ That is your decision to make. Alternatively couldn't we just put a warning when
---
-👤 **ikawrakow** commented the **2025-05-28** at **09:17:16**:
+👤 **ikawrakow** commented on **2025-05-28** at **09:17:16**
OK, let's start with the required changes without worrying about `Q8_KV`. Do you want to do it?
---
-👤 **saood06** commented the **2025-05-28** at **09:25:04**:
+👤 **saood06** commented on **2025-05-28** at **09:25:04**
>Do you want to do it?
@@ -80,7 +80,7 @@ I don't mind giving it an attempt, but I'm heading off for now and won't be avai
---
-👤 **ikawrakow** commented the **2025-05-28** at **09:30:40**:
+👤 **ikawrakow** commented on **2025-05-28** at **09:30:40**
> but I'm heading off for now and won't be available till tomorrow at the earliest.
@@ -90,13 +90,13 @@ I'm experimenting with some stuff right now, but if I find a moment before tomor
---
-👤 **ikawrakow** commented the **2025-05-28** at **11:21:25**:
+👤 **ikawrakow** commented on **2025-05-28** at **11:21:25**
-See #469
+See [#469](https://github.com/ikawrakow/ik_llama.cpp/issues/469)
---
-👤 **saood06** commented the **2025-06-02** at **01:23:45**:
+👤 **saood06** commented on **2025-06-02** at **01:23:45**
Although it was tested and works, there may still be some issues with it, since I just crashed with this when attempting to save (and it didn't even write the prompt to the file before it crashed)
@@ -141,17 +141,7 @@ Edit: Happens consistently now (might be larger prompts?) and might as well shar
---
-👤 **saood06** commented the **2025-06-02** at **01:23:45**:
-
-Although it was tested and works, there may still be some issues with it, since I just crashed with this.
-
-`/ik_llama.cpp/ggml/src/ggml-backend.c:251: GGML_ASSERT(offset + size <= ggml_nbytes(tensor) && "tensor read out of bounds") failed`
-
-I have the coredump and will debug it later.
-
----
-
-👤 **saood06** commented the **2025-06-03** at **12:52:48**:
+👤 **saood06** commented on **2025-06-03** at **12:52:48**
I poked around the coredump a bit, and for the ggml_backend_tensor_get call I saw the offset is 0, with size of 175865856. I manually calculated ggml_nbytes to be 92307456, which is close to half the size.
@@ -163,13 +153,13 @@ Would confirming that it breaks past a token size be useful? Or is there somethi
---
-👤 **ikawrakow** commented the **2025-06-03** at **13:37:33**:
+👤 **ikawrakow** commented on **2025-06-03** at **13:37:33**
There is a confusion with the size of the tensor, and one needs to carefully go through the code to sort it out. As I wrote earlier, I have changed the K cache to be `k_had_size x n_head x n_tokens`, while the code is written from the point of view that the K cache is `k_head_size * n_head x n_tokens`. Somewhere things go wrong because of that. If you don't see it, and I don't see it, I can revert the shape change (it is isolated to a very few places).
---
-👤 **saood06** commented the **2025-06-03** at **14:15:24**:
+👤 **saood06** commented on **2025-06-03** at **14:15:24**
> There is a confusion with the size of the tensor, and one needs to carefully go through the code to sort it out. As I wrote earlier, I have changed the K cache to be `k_had_size x n_head x n_tokens`, while the code is written from the point of view that the K cache is `k_head_size * n_head x n_tokens`. Somewhere things go wrong because of that. If you don't see it, and I don't see it, I can revert the shape change (it is isolated to a very few places).
@@ -179,7 +169,7 @@ I will gladly test whatever change you think will fix this (whether that be if y
---
-👤 **saood06** commented the **2025-06-06** at **06:49:31**:
+👤 **saood06** commented on **2025-06-06** at **06:49:31**
@ikawrakow
@@ -220,7 +210,7 @@ I do think the changes needed will be isolated to `write_kv_cache_data` / `read_
---
-👤 **ikawrakow** commented the **2025-06-06** at **07:10:08**:
+👤 **ikawrakow** commented on **2025-06-06** at **07:10:08**
We have `n_embd_k_gqa = n_embd_head_k * n_head_kv`, so a 1D tensor of size `n_embd_k_gqa * kv_size` is the same as a 1D tensor of size `n_embd_head_k * n_head_kv * kv_size`, which can be viewed as a 2D tensor of size `n_embd_head_k x n_head_kv*kv_size`.
@@ -230,7 +220,7 @@ Does this answer the question?
---
-👤 **ikawrakow** commented the **2025-06-06** at **07:26:35**:
+👤 **ikawrakow** commented on **2025-06-06** at **07:26:35**
So, the presence of `hparams.n_embd_k_s()` (needed for Mamba) makes it more complicated. But my K-cache change to 2D does not work with Mamba anyway (does `ik_llama.cpp` work for Mamba at all? I wouldn't think so).
@@ -238,7 +228,7 @@ So, we can simply disregard Mamba. One needs to change `n_embd_k_gqa` in case it
---
-👤 **saood06** commented the **2025-06-06** at **07:29:43**:
+👤 **saood06** commented on **2025-06-06** at **07:29:43**
> We have `n_embd_k_gqa = n_embd_head_k * n_head_kv`, so a 1D tensor of size `n_embd_k_gqa * kv_size` is the same as a 1D tensor of size `n_embd_head_k * n_head_kv * kv_size`, which can be viewed as a 2D tensor of size `n_embd_head_k x n_head_kv*kv_size`.
@@ -254,13 +244,13 @@ I think so. That does line up with the ~43x factor that the size was off by. (Fo
---
-👤 **ikawrakow** commented the **2025-06-06** at **07:35:42**:
+👤 **ikawrakow** commented on **2025-06-06** at **07:35:42**
So, this is done just using the `llama_hparams` struct. Which does not know if MLA is being used because the MLA flag is in the `llama_cparams` struct. I have run into this stupid issue a number of times, but never took the time to sort this out. The cache writing needs to know if MLA was used to calculate it so it can use and record the correct cache size.
---
-👤 **saood06** commented the **2025-06-06** at **07:47:29**:
+👤 **saood06** commented on **2025-06-06** at **07:47:29**
> So, this is done just using the `llama_hparams` struct. Which does not know if MLA is being used because the MLA flag is in the `llama_cparams` struct. I have run into this stupid issue a number of times, but never took the time to sort this out. The cache writing needs to know if MLA was used to calculate it so it can use and record the correct cache size.
@@ -268,33 +258,25 @@ You have access to the ctx object (which contains `cparams` which is a `llama_cp
---
-👤 **saood06** commented the **2025-06-06** at **07:47:29**:
-
-> So, this is done just using the `llama_hparams` struct. Which does not know if MLA is being used because the MLA flag is in the `llama_cparams` struct. I have run into this stupid issue a number of times, but never took the time to sort this out. The cache writing needs to know if MLA was used to calculate it so it can use and record the correct cache size.
-
-You have access to the ctx object (which contains llama_cparams) so I don't see why that is an issue.
-
----
-
-👤 **ikawrakow** commented the **2025-06-06** at **07:52:34**:
+👤 **ikawrakow** commented on **2025-06-06** at **07:52:34**
> You have access to the ctx object (which contains llama_cparams) so I don't see why that is an issue.
-You don't have access to `llama_cparams` when loading the mode for instance. If you have access to the context when writing the cache, you can do it that way. Otherwise, #490 has a quick hack to add the MLA flag to `llama_hparams`. If it set, the `n_embd_k_gqa()` will now return the correct size needed when writing the cache.
+You don't have access to `llama_cparams` when loading the mode for instance. If you have access to the context when writing the cache, you can do it that way. Otherwise, [#490](https://github.com/ikawrakow/ik_llama.cpp/issues/490) has a quick hack to add the MLA flag to `llama_hparams`. If it set, the `n_embd_k_gqa()` will now return the correct size needed when writing the cache.
---
-👤 **saood06** commented the **2025-06-06** at **08:04:43**:
+👤 **saood06** commented on **2025-06-06** at **08:04:43**
->You don't have access to `llama_cparams` when loading the mode for instance. If you have access to the context when writing the cache, you can do it that way. Otherwise, [#490](https://github.com/ikawrakow/ik_llama.cpp/issues/490) has a quick hack to add the MLA flag to `llama_hparams`. If it set, the `n_embd_k_gqa()` will now return the correct size needed when writing the cache.
+>You don't have access to `llama_cparams` when loading the mode for instance. If you have access to the context when writing the cache, you can do it that way. Otherwise, [[#490](https://github.com/ikawrakow/ik_llama.cpp/issues/490)](https://github.com/ikawrakow/ik_llama.cpp/issues/490) has a quick hack to add the MLA flag to `llama_hparams`. If it set, the `n_embd_k_gqa()` will now return the correct size needed when writing the cache.
-I'm testing a fix without #490. If it works I'll make the PR. I don't think #490 is needed for this, but you know better if it is helpful in other situations.
+I'm testing a fix without [#490](https://github.com/ikawrakow/ik_llama.cpp/issues/490). If it works I'll make the PR. I don't think [#490](https://github.com/ikawrakow/ik_llama.cpp/issues/490) is needed for this, but you know better if it is helpful in other situations.
---
-👤 **saood06** commented the **2025-06-06** at **08:50:01**:
+👤 **saood06** commented on **2025-06-06** at **08:50:01**
-Just in case anyone reads through this later #496 is the PR with the hack that was not used, and not #490.
+Just in case anyone reads through this later [#496](https://github.com/ikawrakow/ik_llama.cpp/issues/496) is the PR with the hack that was not used, and not [#490](https://github.com/ikawrakow/ik_llama.cpp/issues/490).
>(does ik_llama.cpp work for Mamba at all? I wouldn't think so).
diff --git a/github-data/issues/437 - Feature Request_ support intel amx for further accelerate.md b/github-data/issues/437 - Feature Request support intel amx for further accelerate.md
similarity index 82%
rename from github-data/issues/437 - Feature Request_ support intel amx for further accelerate.md
rename to github-data/issues/437 - Feature Request support intel amx for further accelerate.md
index 702365e6c..3f809e27f 100644
--- a/github-data/issues/437 - Feature Request_ support intel amx for further accelerate.md
+++ b/github-data/issues/437 - Feature Request support intel amx for further accelerate.md
@@ -1,14 +1,15 @@
-### ✨ [#437](https://github.com/ikawrakow/ik_llama.cpp/issues/437) - Feature Request: support intel amx for further accelerate
+## 📌 [Issue #437](https://github.com/ikawrakow/ik_llama.cpp/issues/437) - Feature Request: support intel amx for further accelerate
| **Author** | `zhaoyukoon` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-05-20 |
| **Updated** | 2025-07-06 |
+| **Labels** | `enhancement` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -33,9 +34,9 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-20** at **09:13:24**:
+👤 **ikawrakow** commented on **2025-05-20** at **09:13:24**
If someone gives me access to a system with AMX support, then sure, I would work on that.
@@ -43,7 +44,7 @@ But out of curiosity, do you have a performance comparison between ik_llama.cpp
---
-👤 **zhaoyukoon** commented the **2025-05-20** at **10:11:14**:
+👤 **zhaoyukoon** commented on **2025-05-20** at **10:11:14**
> If someone gives me access to a system with AMX support, then sure, I would work on that.
>
@@ -59,7 +60,7 @@ https://mp.weixin.qq.com/s/vIrvbVJ6Nv00Ehre1zZwMw [In Chinese]
---
-👤 **ikawrakow** commented the **2025-05-20** at **10:37:39**:
+👤 **ikawrakow** commented on **2025-05-20** at **10:37:39**
I cannot say that I'm particularly impressed with the performance reported in [ktransformers-Intel-AMX](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/AMX.md). For convenience here is what they report:
@@ -89,7 +90,7 @@ So, 15X their prefill performance and 3X their "4-way decode" performance ("cons
---
-👤 **ikawrakow** commented the **2025-05-20** at **10:48:25**:
+👤 **ikawrakow** commented on **2025-05-20** at **10:48:25**
> I can access a server equipped with AMX Intel CPUs, however I have no permission to add other uses. I can help to run test on this server.
@@ -97,7 +98,7 @@ This will be way too tedious. I have to build with AMX instructions enabled, the
---
-👤 **zhaoyukoon** commented the **2025-05-20** at **11:08:03**:
+👤 **zhaoyukoon** commented on **2025-05-20** at **11:08:03**
> > I can access a server equipped with AMX Intel CPUs, however I have no permission to add other uses. I can help to run test on this server.
>
@@ -107,7 +108,7 @@ Do you have any requirements on CPU and memory for development? Is server with 1
---
-👤 **ikawrakow** commented the **2025-05-20** at **11:47:29**:
+👤 **ikawrakow** commented on **2025-05-20** at **11:47:29**
> Do you have any requirements on CPU and memory for development? Is server with 16 vCPU AMX and 32GB enough?
@@ -122,7 +123,7 @@ Let's also make sure that the expectations are aligned:
---
-👤 **kirnat** commented the **2025-05-20** at **14:17:01**:
+👤 **kirnat** commented on **2025-05-20** at **14:17:01**
While I’d be excited to see AMX support, I can’t say the kTransformers Qwen3 benchmark proves its usefulness. I can’t verify the pp/tg window sizes or the exact model they used, but as an inexact comparison, I got the below results in ik_llama for Qwen3 235B with Xeon 8480 (ES), 8-channel 4800MT DDR5 and a blackwell GPU.
@@ -150,41 +151,13 @@ Thanks a lot for the impressive work ikawrakow!
---
-👤 **kirnat** commented the **2025-05-20** at **14:17:01**:
-
-While I’d be excited to see AMX support, I can’t say the kTransformers Qwen3 benchmark proves its usefulness. I can’t verify the pp/tg window sizes or the exact model they used, but as an inexact comparison, I got the below results for Qwen3 235B with Xeon 8480 (ES), 8-channel 4800MT DDR5 and a blackwell GPU.
-
-Model used:
-**unsloth/Qwen3-235B-A22B-GGUF/UD-Q4_K_XL/Qwen3-235B-A22B-UD-Q4_K_XL**
-| size | params | backend | ngl | threads | n_batch | n_ubatch | fa | rtr | fmoe | test | t/s |
-| ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --: | ---: | ------------: | ---------------: |
-| 124.91 GiB | 235.09 B | CUDA | 93 | 52 | 8192 | 8192 | 1 | 1 | 1 | pp2048 | 192.02 ± 0.06 |
-| 124.91 GiB | 235.09 B | CUDA | 93 | 52 | 8192 | 8192 | 1 | 1 | 1 | pp16384 | 185.33 ± 0.34 |
-| 124.91 GiB | 235.09 B | CUDA | 93 | 52 | 8192 | 8192 | 1 | 1 | 1 | tg512 | 18.74 ± 0.02 |
-| 124.91 GiB | 235.09 B | CUDA | 93 | 52 | 8192 | 8192 | 1 | 1 | 1 | tg2048 | 18.58 ± 0.03 |
-
-The 30B model performs really well on CPU only, below is with GPU hidden.
-
-Model used:
-**unsloth/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-UD-Q4_K_XL**
-| size | params | backend | ngl | threads | fa | fmoe | test | t/s |
-| ---------: | ---------: | ---------- | --: | ------: | -: | ---: | ------------: | ---------------: |
-| 16.49 GiB | 30.53 B | CUDA | 0 | 32 | 1 | 1 | pp512 | 510.65 ± 2.49 |
-| 16.49 GiB | 30.53 B | CUDA | 0 | 32 | 1 | 1 | pp2048 | 454.62 ± 0.18 |
-| 16.49 GiB | 30.53 B | CUDA | 0 | 32 | 1 | 1 | tg128 | 69.77 ± 0.02 |
-| 16.49 GiB | 30.53 B | CUDA | 0 | 32 | 1 | 1 | tg512 | 69.15 ± 0.01 |
-
-Thanks a lot for the impressive work ikawrakow!
-
----
-
-👤 **ikawrakow** commented the **2025-05-20** at **15:06:56**:
+👤 **ikawrakow** commented on **2025-05-20** at **15:06:56**
Has anyone tried mainline `llama.cpp` AMX implementation?
---
-👤 **zhaoyukoon** commented the **2025-05-20** at **16:09:15**:
+👤 **zhaoyukoon** commented on **2025-05-20** at **16:09:15**
> Has anyone tried mainline `llama.cpp` AMX implementation?
@@ -194,7 +167,7 @@ It seems that llama.cpp supports amx
---
-👤 **ikawrakow** commented the **2025-05-20** at **16:28:01**:
+👤 **ikawrakow** commented on **2025-05-20** at **16:28:01**
> It seems that llama.cpp supports amx.
@@ -202,7 +175,7 @@ That's why I asked if somebody has tried. It would be even more interesting if s
---
-👤 **kirnat** commented the **2025-05-20** at **19:18:44**:
+👤 **kirnat** commented on **2025-05-20** at **19:18:44**
### Confirming AMX buffer
**llama.cpp/build/bin/llama-cli -m ./models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf**
@@ -254,7 +227,7 @@ Let me know if you want me to test with another model specific settings. I used
---
-👤 **ikawrakow** commented the **2025-05-20** at **19:33:34**:
+👤 **ikawrakow** commented on **2025-05-20** at **19:33:34**
Thanks!
@@ -262,7 +235,7 @@ You could try adding `-rtr 1` to the `ik_llama.cpp` benchmark run. This normally
---
-👤 **kirnat** commented the **2025-05-20** at **20:47:09**:
+👤 **kirnat** commented on **2025-05-20** at **20:47:09**
I hadn't even considered it for CPU only inference. I have used it alot day to day for hybrid inference with great results.
@@ -279,13 +252,13 @@ Still amazed how relatively low the slow down is in ik at larger context sizes.
---
-👤 **ikawrakow** commented the **2025-05-21** at **04:39:40**:
+👤 **ikawrakow** commented on **2025-05-21** at **04:39:40**
So, `ik_llama.cpp` without AMX is nearly two times faster than `llama.cpp` with AMX.
---
-👤 **mtcl** commented the **2025-06-08** at **06:03:35**:
+👤 **mtcl** commented on **2025-06-08** at **06:03:35**
Specifically for the new r1-0528 (but results are similar for v3-0324):
@@ -301,7 +274,7 @@ I love your work here @ikawrakow and I would love to contribute in anyway to mak
---
-👤 **ikawrakow** commented the **2025-06-08** at **06:26:52**:
+👤 **ikawrakow** commented on **2025-06-08** at **06:26:52**
> I have an amx supported pc, and I can confirm that performance for ktransformers is noticibly better than ik_llama and llama.cpp (in that same order) for prompt processing. In general I get about 50 prefill (prompt processing) on ktransformers and 10 tk/s on generation. It is the prompt processing that has massive benefits on ktransformers. I get less than half on the prompt processing on ik_llama.cpp. token generation is comparable (but ktransformers has about 10% advantage).
@@ -309,7 +282,7 @@ If you share your `ik_llama.cpp` command line you used to measure performance, p
---
-👤 **ubergarm** commented the **2025-06-08** at **15:48:30**:
+👤 **ubergarm** commented on **2025-06-08** at **15:48:30**
@mtcl
@@ -329,7 +302,7 @@ But as ik says, share your commands and might be able to get you a boost.
---
-👤 **mtcl** commented the **2025-06-09** at **00:26:07**:
+👤 **mtcl** commented on **2025-06-09** at **00:26:07**
@ubergarm and @ikawrakow Below is for Qwen3-235 billion parameter model. Tthank you for the pointers! For the qwen models, i added "-b 2048 -ub 2048" and that resulted in the max speeds for me. I am getting 150+ prompt processing tk/seconds on that now! That is insane!
@@ -459,13 +432,13 @@ I will make a full video on this and will post this, unedited version of it, so
---
-👤 **mtcl** commented the **2025-06-09** at **00:27:01**:
+👤 **mtcl** commented on **2025-06-09** at **00:27:01**
@ubergarm Would you be able to post a guide on how to make the IQ4 version of the Qwen Model?
---
-👤 **ikawrakow** commented the **2025-06-09** at **04:18:48**:
+👤 **ikawrakow** commented on **2025-06-09** at **04:18:48**
@mtcl
@@ -475,7 +448,7 @@ On the "crash": the DeepSeek self attention mechanism is special (different from
---
-👤 **mtcl** commented the **2025-06-09** at **04:24:19**:
+👤 **mtcl** commented on **2025-06-09** at **04:24:19**
> [@mtcl](https://github.com/mtcl)
>
@@ -489,13 +462,13 @@ https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/tree/main/IQ3_K_R4
---
-👤 **ikawrakow** commented the **2025-06-09** at **04:25:44**:
+👤 **ikawrakow** commented on **2025-06-09** at **04:25:44**
That is with `ik_llama.cpp`. My question was what model are you running with KTransformers?
---
-👤 **mtcl** commented the **2025-06-09** at **04:30:16**:
+👤 **mtcl** commented on **2025-06-09** at **04:30:16**
> That is with `ik_llama.cpp`. My question was what model are you running with KTransformers?
@@ -506,7 +479,7 @@ do you know if by using multiple 4090s i can increase context limit? I am also g
---
-👤 **ikawrakow** commented the **2025-06-09** at **04:41:50**:
+👤 **ikawrakow** commented on **2025-06-09** at **04:41:50**
> do you know if by using multiple 4090s i cna increate context limit? I am also getting a 5090 tomorrow, so potentially it will help with more context on one GPU.
@@ -514,7 +487,7 @@ Yes, some people with multiple GPU's have reported running full context length.
---
-👤 **mtcl** commented the **2025-06-09** at **04:45:10**:
+👤 **mtcl** commented on **2025-06-09** at **04:45:10**
Can you please help me in modifying this command to get more context length with 2X4090 setup.
@@ -540,19 +513,19 @@ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
---
-👤 **ikawrakow** commented the **2025-06-09** at **04:48:05**:
+👤 **ikawrakow** commented on **2025-06-09** at **04:48:05**
Can you post the log? I don't know by heart how much VRAM gets used for model weights and KV cache, and how big CUDA compute buffers are.
---
-👤 **ikawrakow** commented the **2025-06-09** at **05:06:31**:
+👤 **ikawrakow** commented on **2025-06-09** at **05:06:31**
I think if you are able to offload two layers of experts per GPU you have in the range of 11 GB free on each GPU excuding the experts. It is likely that if you don't offload any experts to the GPU, you can a) nearly double prefill speed by using `-b 4096 -ub 4096` or b) increase context length to at least 65k tokens, or c) do a) and b).
---
-👤 **mtcl** commented the **2025-06-09** at **05:32:06**:
+👤 **mtcl** commented on **2025-06-09** at **05:32:06**
ok I posted the whole video here, showing every command i ran with all the log outputs.
@@ -566,7 +539,7 @@ I am trying to understand how do i achieve this. What command can i run to give
---
-👤 **mtcl** commented the **2025-06-09** at **05:46:15**:
+👤 **mtcl** commented on **2025-06-09** at **05:46:15**
i tried modifying the command like this, but i get error:
@@ -943,13 +916,13 @@ Segmentation fault (core dumped)
---
-👤 **ikawrakow** commented the **2025-06-09** at **05:49:54**:
+👤 **ikawrakow** commented on **2025-06-09** at **05:49:54**
Try `cmake -DGGML_SCHED_MAX_COPIES=1 ...`
---
-👤 **mtcl** commented the **2025-06-09** at **05:52:47**:
+👤 **mtcl** commented on **2025-06-09** at **05:52:47**
OK, I can do that. I had used this command earlier:
@@ -963,19 +936,19 @@ let me do that, rebuild and come back here.
---
-👤 **ikawrakow** commented the **2025-06-09** at **05:53:12**:
+👤 **ikawrakow** commented on **2025-06-09** at **05:53:12**
Yes.
---
-👤 **ikawrakow** commented the **2025-06-09** at **06:05:31**:
+👤 **ikawrakow** commented on **2025-06-09** at **06:05:31**
So, I normally don't watch YT. Had to go to another computer as I don't have sound on my development machine. I tried watching the video, but it is constantly being interrupted by advertisements. I think it is better to keep the conversation here. From the log you posted we see that the KV cache is just 600 MB, and the model is taking just ~14 GB. The only thing that we still need to see is the compute buffer size after you rebuild and ran with `-DGGML_SCHED_MAX_COPIES=1`
---
-👤 **mtcl** commented the **2025-06-09** at **06:09:53**:
+👤 **mtcl** commented on **2025-06-09** at **06:09:53**
I understand, here is the full log after your suggestion.
@@ -1646,289 +1619,128 @@ INFO [ update_slots] all slots are idle | tid="124625576644608" times
---
-👤 **mtcl** commented the **2025-06-09** at **06:09:53**:
+👤 **mtcl** commented on **2025-06-09** at **06:10:23**
-I understand, here is the full log after your suggestion.
+I have about 18GB on both GPUs now.
-(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
--- OpenMP found
--- Using optimized iqk matrix multiplications
--- Enabling IQK Flash Attention kernels
--- Using llamafile
--- CUDA found
--- Using CUDA architectures: native
--- CUDA host compiler is GNU 13.3.0
+---
--- Warning: ccache not found - consider installing it for faster compilation or disable this warning with GGML_CCACHE=OFF
--- CMAKE_SYSTEM_PROCESSOR: x86_64
--- x86 detected
--- ARCH_FLAGS = -march=native
--- Configuring done (0.4s)
--- Generating done (0.1s)
--- Build files have been written to: /home/mukul/dev-ai/ik_llama.cpp/build
-(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ cmake --build ./build --config Release -j 100
-[ 0%] Built target build_info
-[ 0%] Built target sha256
-[ 1%] Built target xxhash
-[ 1%] Built target sha1
-[ 1%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
-[ 2%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
-[ 2%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
-[ 2%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
-[ 3%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/acc.cu.o
-[ 3%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/arange.cu.o
-[ 4%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/argsort.cu.o
-[ 4%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/binbcast.cu.o
-[ 4%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/clamp.cu.o
-[ 5%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/concat.cu.o
-[ 5%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/conv-transpose-1d.cu.o
-[ 5%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/convert.cu.o
-[ 6%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/cpy.cu.o
-[ 6%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/diagmask.cu.o
-[ 6%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/dmmv.cu.o
-[ 7%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/iqk_mmvq.cu.o
-[ 8%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/fattn-new-mma.cu.o
-[ 9%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/fattn-tile-f16.cu.o
-[ 9%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/mmvq.cu.o
-[ 9%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/fattn-tile-f32.cu.o
-[ 10%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/fattn.cu.o
-[ 11%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/pool2d.cu.o
-[ 11%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/im2col.cu.o
-[ 11%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/getrows.cu.o
-[ 11%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/quantize.cu.o
-[ 11%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/mmq.cu.o
-[ 11%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/norm.cu.o
-[ 11%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/pad.cu.o
-[ 12%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/scale.cu.o
-[ 12%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/rope.cu.o
-[ 12%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/softcap.cu.o
-[ 13%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/sumrows.cu.o
-[ 13%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/tsembd.cu.o
-[ 13%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/softmax.cu.o
-[ 13%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/unary.cu.o
-[ 14%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/upscale.cu.o
-[ 14%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda.cu.o
-[ 15%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb16.cu.o
-[ 15%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqfloat-cpb32.cu.o
-[ 15%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb16.cu.o
-[ 16%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb32.cu.o
-[ 16%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-wmma-f16-instance-kqhalf-cpb8.cu.o
-[ 16%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_1-ncols2_8.cu.o
-[ 17%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_16-ncols2_1.cu.o
-[ 17%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_16-ncols2_2.cu.o
-[ 17%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_16-ncols2_4.cu.o
-[ 18%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_4.cu.o
-[ 18%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_2-ncols2_8.cu.o
-[ 18%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_32-ncols2_1.cu.o
-[ 19%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_32-ncols2_2.cu.o
-[ 19%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_2.cu.o
-[ 20%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_8.cu.o
-[ 20%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_64-ncols2_1.cu.o
-[ 21%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_1.cu.o
-[ 21%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_2.cu.o
-[ 21%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_4-ncols2_4.cu.o
-[ 21%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_4.cu.o
-[ 22%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-mma-f16-instance-ncols1_8-ncols2_8.cu.o
-[ 22%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq1_s.cu.o
-[ 23%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_k.cu.o
-[ 23%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_ks.cu.o
-[ 23%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_s.cu.o
-[ 23%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_xxs.cu.o
-[ 24%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq2_xs.cu.o
-[ 24%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq1_s_r4.cu.o
-[ 24%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq3_k.cu.o
-[ 25%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq3_s.cu.o
-[ 26%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq4_k.cu.o
-[ 26%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq4_ks.cu.o
-[ 26%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq4_ks_r4.cu.o
-[ 27%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq4_nl.cu.o
-[ 27%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq4_xs.cu.o
-[ 27%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq5_k.cu.o
-[ 28%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq5_ks.cu.o
-[ 28%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq3_xxs.cu.o
-[ 28%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq5_ks_r4.cu.o
-[ 29%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q2_k.cu.o
-[ 29%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q3_k.cu.o
-[ 29%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-iq6_k.cu.o
-[ 29%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q4_0.cu.o
-[ 30%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q4_1.cu.o
-[ 30%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q4_k.cu.o
-[ 30%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q5_0.cu.o
-[ 31%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q5_1.cu.o
-[ 31%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q5_k.cu.o
-[ 32%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q6_0.cu.o
-[ 32%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q6_k.cu.o
-[ 32%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/mmq-instance-q8_0.cu.o
-[ 33%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q4_0-q4_0.cu.o
-[ 33%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q4_0-q4_0.cu.o
-[ 33%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-q8_0.cu.o
-[ 34%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs192-q8_0-q8_0.cu.o
-[ 34%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs256-q8_0-q8_0.cu.o
-[ 34%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-q8_0.cu.o
-[ 35%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs192-q8_0-q8_0.cu.o
-[ 35%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs256-q8_0-q8_0.cu.o
-[ 35%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-f16-f16.cu.o
-[ 36%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs192-f16-f16.cu.o
-[ 36%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs256-f16-f16.cu.o
-[ 37%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs64-f16-f16.cu.o
-[ 37%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-f16-f16.cu.o
-[ 37%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs192-f16-f16.cu.o
-[ 38%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs256-f16-f16.cu.o
-[ 38%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs64-f16-f16.cu.o
-[ 38%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-iq4_nl.cu.o
-[ 39%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-iq4_nl.cu.o
-[ 39%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-iq4_nl-iq4_nl.cu.o
-[ 39%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-iq4_nl-iq4_nl.cu.o
-[ 40%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q6_0-q5_0.cu.o
-[ 40%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q6_0-q5_0.cu.o
-[ 40%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f16-instance-hs128-q8_0-q6_0.cu.o
-[ 41%] Building CUDA object ggml/src/CMakeFiles/ggml.dir/ggml-cuda/template-instances/fattn-vec-f32-instance-hs128-q8_0-q6_0.cu.o
-[ 41%] Building CXX object ggml/src/CMakeFiles/ggml.dir/llamafile/sgemm.cpp.o
-[ 41%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_mul_mat.cpp.o
-[ 42%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_flash_attn.cpp.o
-[ 42%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_576_512.cpp.o
-[ 43%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_192_128.cpp.o
-[ 43%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_256_256.cpp.o
-[ 43%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_128_128.cpp.o
-[ 44%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_96_96.cpp.o
-[ 44%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_64_64.cpp.o
-[ 44%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_floats.cpp.o
-[ 45%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_kquants.cpp.o
-[ 45%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_ktquants.cpp.o
-[ 45%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iquants.cpp.o
-[ 46%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iqk_quants.cpp.o
-[ 46%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_1bit.cpp.o
-[ 46%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_legacy_quants.cpp.o
-[ 47%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_quantize.cpp.o
-[ 47%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-aarch64.c.o
-[ 48%] Linking CXX shared library libggml.so
-[ 48%] Built target ggml
-[ 48%] Linking CXX executable ../../bin/llama-gguf-hash
-[ 48%] Linking CXX shared library libllama.so
-[ 48%] Linking CXX executable ../../bin/llama-gguf
-[ 49%] Built target llama-gguf-hash
-[ 50%] Built target llama-gguf
-[ 52%] Built target llama
-[ 52%] Linking C executable ../bin/test-c
-[ 52%] Linking CXX executable ../../bin/llama-quantize-stats
-[ 53%] Linking CXX executable ../../bin/llama-bench-matmult
-[ 54%] Built target llava
-[ 57%] Built target common
-[ 58%] Built target llava_static
-[ 59%] Linking CXX executable ../bin/test-tokenizer-0
-[ 59%] Linking CXX executable ../bin/test-tokenizer-1-bpe
-[ 59%] Linking CXX shared library libllava_shared.so
-[ 59%] Linking CXX executable ../bin/test-tokenizer-1-spm
-[ 59%] Linking CXX executable ../bin/test-chat-template
-[ 59%] Linking CXX executable ../bin/test-quantize-perf
-[ 60%] Linking CXX executable ../bin/test-model-load-cancel
-[ 62%] Linking CXX executable ../bin/test-quantize-fns
-[ 62%] Linking CXX executable ../bin/test-grammar-parser
-[ 63%] Linking CXX executable ../bin/test-backend-ops
-[ 63%] Linking CXX executable ../bin/test-sampling
-[ 64%] Linking CXX executable ../../bin/llama-baby-llama
-[ 64%] Linking CXX executable ../../bin/llama-convert-llama2c-to-ggml
-[ 65%] Linking CXX executable ../bin/test-grammar-integration
-[ 66%] Linking CXX executable ../../bin/llama-cvector-generator
-[ 67%] Linking CXX executable ../bin/test-autorelease
-[ 67%] Built target test-c
-[ 67%] Linking CXX executable ../../bin/llama-batched
-[ 67%] Linking CXX executable ../../bin/llama-embedding
-[ 67%] Linking CXX executable ../../bin/llama-imatrix
-[ 68%] Linking CXX executable ../../bin/llama-infill
-[ 69%] Linking CXX executable ../../bin/llama-gguf-split
-[ 70%] Linking CXX executable ../bin/test-llama-grammar
-[ 70%] Linking CXX executable ../bin/test-rope
-[ 71%] Linking CXX executable ../../bin/llama-bench
-[ 71%] Linking CXX executable ../../bin/llama-batched-bench
-[ 72%] Linking CXX executable ../../bin/llama-lookahead
-[ 72%] Linking CXX executable ../bin/test-grad0
-[ 72%] Linking CXX executable ../../bin/llama-minicpmv-cli
-[ 73%] Linking CXX executable ../../bin/llama-export-lora
-[ 73%] Linking CXX executable ../../bin/llama-eval-callback
-[ 73%] Linking CXX executable ../../bin/llama-gritlm
-[ 74%] Linking CXX executable ../bin/test-json-schema-to-grammar
-[ 75%] Linking CXX executable ../../bin/llama-lookup-create
-[ 75%] Linking CXX executable ../../bin/llama-gbnf-validator
-[ 75%] Linking CXX executable ../../bin/llama-lookup-merge
-[ 75%] Linking CXX executable ../../bin/llama-parallel
-[ 75%] Linking CXX executable ../../bin/llama-lookup
-[ 75%] Linking CXX executable ../../bin/llama-llava-cli
-[ 75%] Linking CXX executable ../../bin/llama-lookup-stats
-[ 75%] Linking CXX executable ../../bin/llama-cli
-[ 75%] Linking CXX executable ../../bin/llama-passkey
-[ 76%] Linking CXX executable ../../bin/llama-quantize
-[ 76%] Linking CXX executable ../../bin/llama-perplexity
-[ 77%] Linking CXX executable ../../bin/llama-retrieval
-[ 77%] Linking CXX executable ../../bin/llama-speculative
-[ 77%] Linking CXX executable ../../bin/llama-sweep-bench
-[ 78%] Linking CXX executable ../../bin/llama-simple
-[ 78%] Linking CXX executable ../../bin/llama-vdot
-[ 79%] Linking CXX executable ../../bin/llama-server
-[ 80%] Linking CXX executable ../../bin/llama-q8dot
-[ 80%] Linking CXX executable ../../bin/llama-save-load-state
-[ 81%] Linking CXX executable ../../bin/llama-tokenize
-[ 81%] Built target llama-bench-matmult
-[ 82%] Built target llama-quantize-stats
-[ 82%] Built target llava_shared
-[ 83%] Built target test-grad0
-[ 83%] Built target test-quantize-fns
-[ 84%] Built target test-autorelease
-[ 84%] Built target test-llama-grammar
-[ 84%] Built target llama-lookup-merge
-[ 85%] Built target llama-gbnf-validator
-[ 85%] Built target test-sampling
-[ 86%] Built target test-grammar-integration
-[ 86%] Built target llama-q8dot
-[ 86%] Built target test-grammar-parser
-[ 86%] Built target llama-vdot
-[ 87%] Built target test-tokenizer-1-spm
-[ 87%] Built target test-tokenizer-1-bpe
-[ 87%] Built target test-tokenizer-0
-[ 88%] Built target test-chat-template
-[ 88%] Built target test-json-schema-to-grammar
-[ 88%] Built target test-model-load-cancel
-[ 88%] Built target llama-cvector-generator
-[ 88%] Built target llama-batched
-[ 89%] Built target llama-imatrix
-[ 89%] Built target llama-minicpmv-cli
-[ 90%] Built target llama-batched-bench
-[ 90%] Built target llama-infill
-[ 90%] Built target llama-gritlm
-[ 92%] Built target llama-eval-callback
-[ 92%] Built target llama-lookahead
-[ 93%] Built target llama-lookup-stats
-[ 94%] Built target llama-convert-llama2c-to-ggml
-[ 94%] Built target llama-retrieval
-[ 94%] Built target llama-bench
-[ 94%] Built target llama-llava-cli
-[ 94%] Built target llama-parallel
-[ 94%] Built target llama-export-lora
-[ 95%] Built target llama-passkey
-[ 95%] Built target llama-cli
-[ 95%] Built target llama-speculative
-[ 95%] Built target llama-save-load-state
-[ 95%] Built target llama-gguf-split
-[ 95%] Built target test-backend-ops
-[ 95%] Built target llama-tokenize
-[ 95%] Built target llama-simple
-[ 95%] Built target llama-embedding
-[ 96%] Built target llama-server
-[ 96%] Built target llama-lookup-create
-[ 96%] Built target llama-perplexity
-[ 97%] Built target llama-lookup
-[ 98%] Built target test-rope
-[ 99%] Built target test-quantize-perf
-[100%] Built target llama-sweep-bench
-[100%] Built target llama-quantize
-[100%] Built target llama-baby-llama
+👤 **ikawrakow** commented on **2025-06-09** at **06:14:03**
+
+OK, first try `-b 4096 -ub 4096` to see if this will fit. If it fits, you will give you a much better prefill if you are processing long contexts.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-09** at **06:15:16**
+
+OK, with 4 tokens of prompt nobody can get more than 10 t/s prefill. You need to try a few thousand tokens prompt (that's when the prefill speed starts to matter).
+
+---
+
+👤 **mtcl** commented on **2025-06-09** at **06:16:51**
+
+Oh yes, i just tried a hi, i understand that 10tk/sec is not accurate for such a short prompt. I am running it on about 16k tokens now and i will report back. after that i will modify this and send same back to the model.
+
+```bash
+CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
+ --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
+ --alias ubergarm/DeepSeek-R1-0528-GGUF \
+ --ctx-size 32768 \
+ -ctk q8_0 \
+ -mla 3 -fa \
+ -b 4096 -ub 4096 \
+ -amb 512 \
+ -fmoe \
+ --n-gpu-layers 63 \
+ -ot "blk\.(3)\.ffn_.*=CUDA0" \
+ -ot "blk\.(5)\.ffn_.*=CUDA1" \
+ --override-tensor exps=CPU \
+ --parallel 1 \
+ --threads 57 \
+ --host 0.0.0.0 \
+ --port 10002
+```
+
+---
+
+👤 **saood06** commented on **2025-06-09** at **06:19:38**
+
+> Try `cmake -DGGML_SCHED_MAX_COPIES=1 ...`
+
+I keep forgetting to mention this to you, but I think the reason people keep needing to set this is pipeline parallelism checks for whether the model is fully offloaded by `model->n_gpu_layers > (int)model->hparams.n_layer` and with tensor offload that assumption is no longer true. So just adding a check for if `override-tensor` is used and not enabling it would solve the issue (and I think that is what mainline did from my memory)
+
+---
+
+👤 **mtcl** commented on **2025-06-09** at **06:26:03**
+
+> OK, with 4 tokens of prompt nobody can get more than 10 t/s prefill. You need to try a few thousand tokens prompt (that's when the prefill speed starts to matter).
+
+OK here is the 16K context processing:
+
+```
+INFO [ launch_slot_with_task] slot is processing task | tid="124625576644608" timestamp=1749449891 id_slot=0 id_task=1558
+INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449891 id_slot=0 id_task=1558 p0=2
+
+
+
+
+
+INFO [ log_server_request] request | tid="124623282716672" timestamp=1749449897 remote_addr="172.17.0.3" remote_port=46266 status=200 method="GET" path="/v1/models" params={}
+INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449924 id_slot=0 id_task=1558 p0=2050
+INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449957 id_slot=0 id_task=1558 p0=4098
+INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449990 id_slot=0 id_task=1558 p0=6146
+INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450024 id_slot=0 id_task=1558 p0=8194
+INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450057 id_slot=0 id_task=1558 p0=10242
+INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450091 id_slot=0 id_task=1558 p0=12290
+INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450125 id_slot=0 id_task=1558 p0=14338
+INFO [ log_server_request] request | tid="124623274323968" timestamp=1749450145 remote_addr="172.17.0.3" remote_port=33860 status=200 method="GET" path="/v1/models" params={}
+INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450159 id_slot=0 id_task=1558 p0=16386
+INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450194 id_slot=0 id_task=1558 p0=18434
+INFO [ print_timings] prompt eval time = 336582.34 ms / 19383 tokens ( 17.36 ms per token, 57.59 tokens per second) | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 t_prompt_processing=336582.34 n_prompt_tokens_processed=19383 t_token=17.364821751018937 n_tokens_second=57.58769161804508
+INFO [ print_timings] generation eval time = 42214.69 ms / 388 runs ( 108.80 ms per token, 9.19 tokens per second) | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 t_token_generation=42214.691 n_decoded=388 t_token=108.80075 n_tokens_second=9.191113112731301
+INFO [ print_timings] total time = 378797.03 ms | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 t_prompt_processing=336582.34 t_token_generation=42214.691 t_total=378797.031
+INFO [ update_slots] slot released | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 n_ctx=32768 n_past=19772 n_system_tokens=0 n_cache_tokens=19772 truncated=false
+INFO [ update_slots] all slots are idle | tid="124625576644608" timestamp=1749450270
+INFO [ log_server_request] request | tid="124623291109376" timestamp=1749450270 remote_addr="172.17.0.3" remote_port=46258 status=200 method="POST" path="/v1/chat/completions" params={}
+INFO [ update_slots] all slots are idle | tid="124625576644608" timestamp=1749450270
+
+
+```
+
+i will now use below and try:
+
+```bash
+CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
+ --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
+ --alias ubergarm/DeepSeek-R1-0528-GGUF \
+ --ctx-size 32768 \
+ -ctk q8_0 \
+ -mla 3 -fa \
+ -b 4096 -ub 4096 \
+ -amb 512 \
+ -fmoe \
+ --n-gpu-layers 63 \
+ -ot "blk\.(3)\.ffn_.*=CUDA0" \
+ -ot "blk\.(5)\.ffn_.*=CUDA1" \
+ --override-tensor exps=CPU \
+ --parallel 1 \
+ --threads 57 \
+ --host 0.0.0.0 \
+ --port 10002
+```
+
+---
+
+👤 **mtcl** commented on **2025-06-09** at **06:35:12**
+
+```
(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
--model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
--alias ubergarm/DeepSeek-R1-0528-GGUF \
--ctx-size 32768 \
-ctk q8_0 \
-mla 3 -fa \
- -b 2048 -ub 2048 \
+ -b 4096 -ub 4096 \
-amb 512 \
-fmoe \
--n-gpu-layers 63 \
@@ -1939,13 +1751,15 @@ I understand, here is the full log after your suggestion.
--threads 57 \
--host 0.0.0.0 \
--port 10002
+```
+```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="124625576644608" timestamp=1749449241 build=3737 commit="58f08e43"
-INFO [ main] system info | tid="124625576644608" timestamp=1749449241 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+INFO [ main] build info | tid="132484236767232" timestamp=1749450548 build=3737 commit="58f08e43"
+INFO [ main] system info | tid="132484236767232" timestamp=1749450548 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 6 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 52 key-value pairs and 1147 tensors from /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
@@ -2266,8 +2080,8 @@ llm_load_tensors: CUDA0 buffer size = 13995.99 MiB
llm_load_tensors: CUDA1 buffer size = 13730.03 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 32768
-llama_new_context_with_model: n_batch = 2048
-llama_new_context_with_model: n_ubatch = 2048
+llama_new_context_with_model: n_batch = 4096
+llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 512
@@ -2280,57 +2094,65 @@ llama_kv_cache_init: CUDA1 KV buffer size = 573.76 MiB
llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
-llama_new_context_with_model: CUDA0 compute buffer size = 3588.01 MiB
-llama_new_context_with_model: CUDA1 compute buffer size = 3560.02 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 312.02 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 4104.02 MiB
+llama_new_context_with_model: CUDA1 compute buffer size = 4176.03 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 624.05 MiB
llama_new_context_with_model: graph nodes = 8245
llama_new_context_with_model: graph splits = 149
-INFO [ init] initializing slots | tid="124625576644608" timestamp=1749449287 n_slots=1
-INFO [ init] new slot | tid="124625576644608" timestamp=1749449287 id_slot=0 n_ctx_slot=32768
-INFO [ main] model loaded | tid="124625576644608" timestamp=1749449287
-INFO [ main] chat template | tid="124625576644608" timestamp=1749449287 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-INFO [ main] HTTP server listening | tid="124625576644608" timestamp=1749449287 n_threads_http="111" port="10002" hostname="0.0.0.0"
-INFO [ update_slots] all slots are idle | tid="124625576644608" timestamp=1749449287
-INFO [ log_server_request] request | tid="124623467307008" timestamp=1749449303 remote_addr="172.17.0.3" remote_port=41390 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="124623458914304" timestamp=1749449310 remote_addr="172.17.0.3" remote_port=41406 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="124623375036416" timestamp=1749449312 remote_addr="172.17.0.3" remote_port=50732 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="124623366643712" timestamp=1749449314 remote_addr="172.17.0.3" remote_port=50738 status=200 method="GET" path="/v1/models" params={}
-INFO [ launch_slot_with_task] slot is processing task | tid="124625576644608" timestamp=1749449314 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449314 id_slot=0 id_task=0 p0=0
-INFO [ print_timings] prompt eval time = 373.61 ms / 4 tokens ( 93.40 ms per token, 10.71 tokens per second) | tid="124625576644608" timestamp=1749449333 id_slot=0 id_task=0 t_prompt_processing=373.606 n_prompt_tokens_processed=4 t_token=93.4015 n_tokens_second=10.70646617024352
-INFO [ print_timings] generation eval time = 17732.40 ms / 181 runs ( 97.97 ms per token, 10.21 tokens per second) | tid="124625576644608" timestamp=1749449333 id_slot=0 id_task=0 t_token_generation=17732.405 n_decoded=181 t_token=97.96908839779005 n_tokens_second=10.20730126567716
-INFO [ print_timings] total time = 18106.01 ms | tid="124625576644608" timestamp=1749449333 id_slot=0 id_task=0 t_prompt_processing=373.606 t_token_generation=17732.405 t_total=18106.011
-INFO [ update_slots] slot released | tid="124625576644608" timestamp=1749449333 id_slot=0 id_task=0 n_ctx=32768 n_past=184 n_system_tokens=0 n_cache_tokens=184 truncated=false
-INFO [ update_slots] all slots are idle | tid="124625576644608" timestamp=1749449333
-INFO [ log_server_request] request | tid="124623358251008" timestamp=1749449333 remote_addr="172.17.0.3" remote_port=50750 status=200 method="POST" path="/v1/chat/completions" params={}
-INFO [ update_slots] all slots are idle | tid="124625576644608" timestamp=1749449333
+INFO [ init] initializing slots | tid="132484236767232" timestamp=1749450594 n_slots=1
+INFO [ init] new slot | tid="132484236767232" timestamp=1749450594 id_slot=0 n_ctx_slot=32768
+INFO [ main] model loaded | tid="132484236767232" timestamp=1749450594
+INFO [ main] chat template | tid="132484236767232" timestamp=1749450594 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
+INFO [ main] HTTP server listening | tid="132484236767232" timestamp=1749450594 n_threads_http="111" port="10002" hostname="0.0.0.0"
+INFO [ update_slots] all slots are idle | tid="132484236767232" timestamp=1749450594
+INFO [ log_server_request] request | tid="132482150162432" timestamp=1749450608 remote_addr="172.17.0.3" remote_port=51312 status=200 method="GET" path="/v1/models" params={}
+INFO [ log_server_request] request | tid="132482141769728" timestamp=1749450614 remote_addr="172.17.0.3" remote_port=39990 status=200 method="GET" path="/v1/models" params={}
+```
+
+```
+INFO [ launch_slot_with_task] slot is processing task | tid="132484236767232" timestamp=1749450615 id_slot=0 id_task=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="132484236767232" timestamp=1749450615 id_slot=0 id_task=0 p0=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="132484236767232" timestamp=1749450650 id_slot=0 id_task=0 p0=4096
+INFO [ update_slots] kv cache rm [p0, end) | tid="132484236767232" timestamp=1749450686 id_slot=0 id_task=0 p0=8192
+INFO [ update_slots] kv cache rm [p0, end) | tid="132484236767232" timestamp=1749450723 id_slot=0 id_task=0 p0=12288
+INFO [ update_slots] kv cache rm [p0, end) | tid="132484236767232" timestamp=1749450761 id_slot=0 id_task=0 p0=16384
+INFO [ print_timings] prompt eval time = 182575.11 ms / 19385 tokens ( 9.42 ms per token, 106.18 tokens per second) | tid="132484236767232" timestamp=1749450856 id_slot=0 id_task=0 t_prompt_processing=182575.108 n_prompt_tokens_processed=19385 t_token=9.418370286303844 n_tokens_second=106.1754814900616
+INFO [ print_timings] generation eval time = 59163.59 ms / 538 runs ( 109.97 ms per token, 9.09 tokens per second) | tid="132484236767232" timestamp=1749450856 id_slot=0 id_task=0 t_token_generation=59163.594 n_decoded=538 t_token=109.96950557620818 n_tokens_second=9.09342999007126
+INFO [ print_timings] total time = 241738.70 ms | tid="132484236767232" timestamp=1749450856 id_slot=0 id_task=0 t_prompt_processing=182575.108 t_token_generation=59163.594 t_total=241738.702
+INFO [ update_slots] slot released | tid="132484236767232" timestamp=1749450856 id_slot=0 id_task=0 n_ctx=32768 n_past=19922 n_system_tokens=0 n_cache_tokens=19922 truncated=false
+INFO [ update_slots] all slots are idle | tid="132484236767232" timestamp=1749450856
+INFO [ log_server_request] request | tid="132482133377024" timestamp=1749450856 remote_addr="172.17.0.3" remote_port=39998 status=200 method="POST" path="/v1/chat/completions" params={}
+INFO [ update_slots] all slots are idle | tid="132484236767232" timestamp=1749450856
+```
---
-👤 **mtcl** commented the **2025-06-09** at **06:10:23**:
+👤 **mtcl** commented on **2025-06-09** at **06:36:14**
-I have about 18GB on both GPUs now.
+So by using two GPUs I can get 100+ tokens/second pp on the IQ3 Quants. Impressive! I also have 32k context length and I got ctk of Q8. So overall amazing results!
---
-👤 **ikawrakow** commented the **2025-06-09** at **06:14:03**:
+👤 **ikawrakow** commented on **2025-06-09** at **06:37:13**
-OK, first try `-b 4096 -ub 4096` to see if this will fit. If it fits, you will give you a much better prefill if you are processing long contexts.
+OK, that's great! I think you have enough free VRAM to increase the context to 65k or even 131k tokens.
---
-👤 **ikawrakow** commented the **2025-06-09** at **06:15:16**:
+👤 **mtcl** commented on **2025-06-09** at **06:37:46**
-OK, with 4 tokens of prompt nobody can get more than 10 t/s prefill. You need to try a few thousand tokens prompt (that's when the prefill speed starts to matter).
+let me try 64K, if I can do that I will be super happy!!
---
-👤 **mtcl** commented the **2025-06-09** at **06:16:51**:
+👤 **mtcl** commented on **2025-06-09** at **06:38:57**
-Oh yes, i just tried a hi, i understand that 10tk/sec is not accurate for such a short prompt. I am running it on about 16k tokens now and i will report back. after that i will modify this and send same back to the model.
+> OK, that's great! I think you have enough free VRAM to increase the context to 65k or even 131k tokens.
+
+Can you please review my prompt to see if this is correctly optimized?
```bash
-CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
+(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
--model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
--alias ubergarm/DeepSeek-R1-0528-GGUF \
--ctx-size 32768 \
@@ -2349,163 +2171,52 @@ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
--port 10002
```
+i am especially looking at ot attribute there.
+
---
-👤 **saood06** commented the **2025-06-09** at **06:19:38**:
+👤 **ikawrakow** commented on **2025-06-09** at **06:45:07**
-> Try `cmake -DGGML_SCHED_MAX_COPIES=1 ...`
+You need `--ctx-size 65536` to get 65k tokens.
-I keep forgetting to mention this to you, but I think the reason people keep needing to set this is pipeline parallelism checks for whether the model is fully offloaded by `model->n_gpu_layers > (int)model->hparams.n_layer` and with tensor offload that assumption is no longer true. So just adding a check for if `override-tensor` is used and not enabling it would solve the issue (and I think that is what mainline did from my memory)
+You are offloading only 1 layer per GPU, that is not going to make a big difference for performance (you gain in the range of 3-5%).
----
+Btw, I'm not familiar with the Xeon CPU you have. I would be also interested to see CPU-only performance on it. To do that, you just prepend `CUDA_VISIBLE_DEVICES=` to the command line when starting the server.
-👤 **mtcl** commented the **2025-06-09** at **06:26:03**:
+---
-> OK, with 4 tokens of prompt nobody can get more than 10 t/s prefill. You need to try a few thousand tokens prompt (that's when the prefill speed starts to matter).
+👤 **mtcl** commented on **2025-06-09** at **06:45:16**
-OK here is the 16K context processing:
+ok it crashed at 64K
```
-INFO [ launch_slot_with_task] slot is processing task | tid="124625576644608" timestamp=1749449891 id_slot=0 id_task=1558
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449891 id_slot=0 id_task=1558 p0=2
-
-
-
+(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
+ --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
+ --alias ubergarm/DeepSeek-R1-0528-GGUF \
+ --ctx-size 65536 \
+ -ctk q8_0 \
+ -mla 3 -fa \
+ -b 4096 -ub 4096 \
+ -amb 512 \
+ -fmoe \
+ --n-gpu-layers 63 \
+ -ot "blk\.(3)\.ffn_.*=CUDA0" \
+ -ot "blk\.(5)\.ffn_.*=CUDA1" \
+ --override-tensor exps=CPU \
+ --parallel 1 \
+ --threads 57 \
+ --host 0.0.0.0 \
+ --port 10002
+```
-
-INFO [ log_server_request] request | tid="124623282716672" timestamp=1749449897 remote_addr="172.17.0.3" remote_port=46266 status=200 method="GET" path="/v1/models" params={}
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449924 id_slot=0 id_task=1558 p0=2050
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449957 id_slot=0 id_task=1558 p0=4098
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449990 id_slot=0 id_task=1558 p0=6146
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450024 id_slot=0 id_task=1558 p0=8194
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450057 id_slot=0 id_task=1558 p0=10242
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450091 id_slot=0 id_task=1558 p0=12290
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450125 id_slot=0 id_task=1558 p0=14338
-INFO [ log_server_request] request | tid="124623274323968" timestamp=1749450145 remote_addr="172.17.0.3" remote_port=33860 status=200 method="GET" path="/v1/models" params={}
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450159 id_slot=0 id_task=1558 p0=16386
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450194 id_slot=0 id_task=1558 p0=18434
-INFO [ print_timings] prompt eval time = 336582.34 ms / 19383 tokens ( 17.36 ms per token, 57.59 tokens per second) | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 t_prompt_processing=336582.34 n_prompt_tokens_processed=19383 t_token=17.364821751018937 n_tokens_second=57.58769161804508
-INFO [ print_timings] generation eval time = 42214.69 ms / 388 runs ( 108.80 ms per token, 9.19 tokens per second) | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 t_token_generation=42214.691 n_decoded=388 t_token=108.80075 n_tokens_second=9.191113112731301
-INFO [ print_timings] total time = 378797.03 ms | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 t_prompt_processing=336582.34 t_token_generation=42214.691 t_total=378797.031
-INFO [ update_slots] slot released | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 n_ctx=32768 n_past=19772 n_system_tokens=0 n_cache_tokens=19772 truncated=false
-INFO [ update_slots] all slots are idle | tid="124625576644608" timestamp=1749450270
-INFO [ log_server_request] request | tid="124623291109376" timestamp=1749450270 remote_addr="172.17.0.3" remote_port=46258 status=200 method="POST" path="/v1/chat/completions" params={}
-INFO [ update_slots] all slots are idle | tid="124625576644608" timestamp=1749450270
-
-
-```
-
-i will now use below and try:
-
-```bash
-CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
- --alias ubergarm/DeepSeek-R1-0528-GGUF \
- --ctx-size 32768 \
- -ctk q8_0 \
- -mla 3 -fa \
- -b 4096 -ub 4096 \
- -amb 512 \
- -fmoe \
- --n-gpu-layers 63 \
- -ot "blk\.(3)\.ffn_.*=CUDA0" \
- -ot "blk\.(5)\.ffn_.*=CUDA1" \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 57 \
- --host 0.0.0.0 \
- --port 10002
-```
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **06:26:03**:
-
-> OK, with 4 tokens of prompt nobody can get more than 10 t/s prefill. You need to try a few thousand tokens prompt (that's when the prefill speed starts to matter).
-
-OK here is the 16K context processing:
-
-```
-INFO [ launch_slot_with_task] slot is processing task | tid="124625576644608" timestamp=1749449891 id_slot=0 id_task=1558
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449891 id_slot=0 id_task=1558 p0=2
-
-
-
-
-
-INFO [ log_server_request] request | tid="124623282716672" timestamp=1749449897 remote_addr="172.17.0.3" remote_port=46266 status=200 method="GET" path="/v1/models" params={}
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449924 id_slot=0 id_task=1558 p0=2050
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449957 id_slot=0 id_task=1558 p0=4098
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749449990 id_slot=0 id_task=1558 p0=6146
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450024 id_slot=0 id_task=1558 p0=8194
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450057 id_slot=0 id_task=1558 p0=10242
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450091 id_slot=0 id_task=1558 p0=12290
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450125 id_slot=0 id_task=1558 p0=14338
-INFO [ log_server_request] request | tid="124623274323968" timestamp=1749450145 remote_addr="172.17.0.3" remote_port=33860 status=200 method="GET" path="/v1/models" params={}
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450159 id_slot=0 id_task=1558 p0=16386
-INFO [ update_slots] kv cache rm [p0, end) | tid="124625576644608" timestamp=1749450194 id_slot=0 id_task=1558 p0=18434
-INFO [ print_timings] prompt eval time = 336582.34 ms / 19383 tokens ( 17.36 ms per token, 57.59 tokens per second) | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 t_prompt_processing=336582.34 n_prompt_tokens_processed=19383 t_token=17.364821751018937 n_tokens_second=57.58769161804508
-INFO [ print_timings] generation eval time = 42214.69 ms / 388 runs ( 108.80 ms per token, 9.19 tokens per second) | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 t_token_generation=42214.691 n_decoded=388 t_token=108.80075 n_tokens_second=9.191113112731301
-INFO [ print_timings] total time = 378797.03 ms | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 t_prompt_processing=336582.34 t_token_generation=42214.691 t_total=378797.031
-INFO [ update_slots] slot released | tid="124625576644608" timestamp=1749450270 id_slot=0 id_task=1558 n_ctx=32768 n_past=19772 n_system_tokens=0 n_cache_tokens=19772 truncated=false
-INFO [ update_slots] all slots are idle | tid="124625576644608" timestamp=1749450270
-INFO [ log_server_request] request | tid="124623291109376" timestamp=1749450270 remote_addr="172.17.0.3" remote_port=46258 status=200 method="POST" path="/v1/chat/completions" params={}
-INFO [ update_slots] all slots are idle | tid="124625576644608" timestamp=1749450270
-
-
-```
-
-i will now use below and try:
-
-```bash
-CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
- --alias ubergarm/DeepSeek-R1-0528-GGUF \
- --ctx-size 32768 \
- -ctk q8_0 \
- -mla 3 -fa \
- -b 4096 -ub 4096 \
- -amb 512 \
- -fmoe \
- --n-gpu-layers 20 \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 57 \
- --host 0.0.0.0 \
- --port 10002
-```
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **06:35:12**:
-
-```
-(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
- --alias ubergarm/DeepSeek-R1-0528-GGUF \
- --ctx-size 32768 \
- -ctk q8_0 \
- -mla 3 -fa \
- -b 4096 -ub 4096 \
- -amb 512 \
- -fmoe \
- --n-gpu-layers 63 \
- -ot "blk\.(3)\.ffn_.*=CUDA0" \
- -ot "blk\.(5)\.ffn_.*=CUDA1" \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 57 \
- --host 0.0.0.0 \
- --port 10002
-```
```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="132484236767232" timestamp=1749450548 build=3737 commit="58f08e43"
-INFO [ main] system info | tid="132484236767232" timestamp=1749450548 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+INFO [ main] build info | tid="132104685867008" timestamp=1749451448 build=3737 commit="58f08e43"
+INFO [ main] system info | tid="132104685867008" timestamp=1749451448 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 6 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 52 key-value pairs and 1147 tensors from /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
@@ -2825,7 +2536,7 @@ llm_load_tensors: CPU buffer size = 938.98 MiB
llm_load_tensors: CUDA0 buffer size = 13995.99 MiB
llm_load_tensors: CUDA1 buffer size = 13730.03 MiB
....................................................................................................
-llama_new_context_with_model: n_ctx = 32768
+llama_new_context_with_model: n_ctx = 65536
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
@@ -2835,117 +2546,36 @@ llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: CUDA0 KV buffer size = 592.89 MiB
-llama_kv_cache_init: CUDA1 KV buffer size = 573.76 MiB
-llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
+llama_kv_cache_init: CUDA0 KV buffer size = 1185.77 MiB
+llama_kv_cache_init: CUDA1 KV buffer size = 1147.51 MiB
+llama_new_context_with_model: KV self size = 2333.25 MiB, c^KV (q8_0): 2333.25 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
-llama_new_context_with_model: CUDA0 compute buffer size = 4104.02 MiB
-llama_new_context_with_model: CUDA1 compute buffer size = 4176.03 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 624.05 MiB
-llama_new_context_with_model: graph nodes = 8245
-llama_new_context_with_model: graph splits = 149
-INFO [ init] initializing slots | tid="132484236767232" timestamp=1749450594 n_slots=1
-INFO [ init] new slot | tid="132484236767232" timestamp=1749450594 id_slot=0 n_ctx_slot=32768
-INFO [ main] model loaded | tid="132484236767232" timestamp=1749450594
-INFO [ main] chat template | tid="132484236767232" timestamp=1749450594 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-INFO [ main] HTTP server listening | tid="132484236767232" timestamp=1749450594 n_threads_http="111" port="10002" hostname="0.0.0.0"
-INFO [ update_slots] all slots are idle | tid="132484236767232" timestamp=1749450594
-INFO [ log_server_request] request | tid="132482150162432" timestamp=1749450608 remote_addr="172.17.0.3" remote_port=51312 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="132482141769728" timestamp=1749450614 remote_addr="172.17.0.3" remote_port=39990 status=200 method="GET" path="/v1/models" params={}
-```
-
-```
-INFO [ launch_slot_with_task] slot is processing task | tid="132484236767232" timestamp=1749450615 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="132484236767232" timestamp=1749450615 id_slot=0 id_task=0 p0=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="132484236767232" timestamp=1749450650 id_slot=0 id_task=0 p0=4096
-INFO [ update_slots] kv cache rm [p0, end) | tid="132484236767232" timestamp=1749450686 id_slot=0 id_task=0 p0=8192
-INFO [ update_slots] kv cache rm [p0, end) | tid="132484236767232" timestamp=1749450723 id_slot=0 id_task=0 p0=12288
-INFO [ update_slots] kv cache rm [p0, end) | tid="132484236767232" timestamp=1749450761 id_slot=0 id_task=0 p0=16384
-INFO [ print_timings] prompt eval time = 182575.11 ms / 19385 tokens ( 9.42 ms per token, 106.18 tokens per second) | tid="132484236767232" timestamp=1749450856 id_slot=0 id_task=0 t_prompt_processing=182575.108 n_prompt_tokens_processed=19385 t_token=9.418370286303844 n_tokens_second=106.1754814900616
-INFO [ print_timings] generation eval time = 59163.59 ms / 538 runs ( 109.97 ms per token, 9.09 tokens per second) | tid="132484236767232" timestamp=1749450856 id_slot=0 id_task=0 t_token_generation=59163.594 n_decoded=538 t_token=109.96950557620818 n_tokens_second=9.09342999007126
-INFO [ print_timings] total time = 241738.70 ms | tid="132484236767232" timestamp=1749450856 id_slot=0 id_task=0 t_prompt_processing=182575.108 t_token_generation=59163.594 t_total=241738.702
-INFO [ update_slots] slot released | tid="132484236767232" timestamp=1749450856 id_slot=0 id_task=0 n_ctx=32768 n_past=19922 n_system_tokens=0 n_cache_tokens=19922 truncated=false
-INFO [ update_slots] all slots are idle | tid="132484236767232" timestamp=1749450856
-INFO [ log_server_request] request | tid="132482133377024" timestamp=1749450856 remote_addr="172.17.0.3" remote_port=39998 status=200 method="POST" path="/v1/chat/completions" params={}
-INFO [ update_slots] all slots are idle | tid="132484236767232" timestamp=1749450856
-```
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **06:36:14**:
-
-So by using two GPUs I can get 100+ tokens/second pp on the IQ3 Quants. Impressive! I also have 32k context length and I got ctk of Q8. So overall amazing results!
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **06:36:14**:
-
-So by using two GPUs I can get 100+ tokens/second pp on the IQ3 Quants. Impressive!
-
----
-
-👤 **ikawrakow** commented the **2025-06-09** at **06:37:13**:
-
-OK, that's great! I think you have enough free VRAM to increase the context to 65k or even 131k tokens.
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **06:37:46**:
-
-let me try 64K, if I can do that I will be super happy!!
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **06:38:57**:
-
-> OK, that's great! I think you have enough free VRAM to increase the context to 65k or even 131k tokens.
-
-Can you please review my prompt to see if this is correctly optimized?
-
-```bash
-(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
- --alias ubergarm/DeepSeek-R1-0528-GGUF \
- --ctx-size 32768 \
- -ctk q8_0 \
- -mla 3 -fa \
- -b 4096 -ub 4096 \
- -amb 512 \
- -fmoe \
- --n-gpu-layers 63 \
- -ot "blk\.(3)\.ffn_.*=CUDA0" \
- -ot "blk\.(5)\.ffn_.*=CUDA1" \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 57 \
- --host 0.0.0.0 \
- --port 10002
+ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7688.02 MiB on device 0: cudaMalloc failed: out of memory
+ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 8061468672
+llama_new_context_with_model: failed to allocate compute buffers
+llama_init_from_gpt_params: error: failed to create context with model '/media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf'
+ ERR [ load_model] unable to load model | tid="132104685867008" timestamp=1749451473 model="/media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf"
+Segmentation fault (core dumped)
```
-i am especially looking at ot attribute there.
-
---
-👤 **ikawrakow** commented the **2025-06-09** at **06:45:07**:
-
-You need `--ctx-size 65536` to get 65k tokens.
-
-You are offloading only 1 layer per GPU, that is not going to make a big difference for performance (you gain in the range of 3-5%).
+👤 **ikawrakow** commented on **2025-06-09** at **06:48:26**
-Btw, I'm not familiar with the Xeon CPU you have. I would be also interested to see CPU-only performance on it. To do that, you just prepend `CUDA_VISIBLE_DEVICES=` to the command line when starting the server.
+OK, I guess you just remove `-ot "blk\.(3)\.ffn_.*=CUDA0"` and `-ot "blk\.(5)\.ffn_.*=CUDA0"` arguments. You will get 3-5% lower performance, but you should be able to run with 65k context.
---
-👤 **mtcl** commented the **2025-06-09** at **06:45:16**:
+👤 **mtcl** commented on **2025-06-09** at **06:48:34**
-ok it crashed at 64K
+OK i changed the context size to 60K and it worked:
```
(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
--model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
--alias ubergarm/DeepSeek-R1-0528-GGUF \
- --ctx-size 65536 \
+ --ctx-size 61440 \
-ctk q8_0 \
-mla 3 -fa \
-b 4096 -ub 4096 \
@@ -2967,8 +2597,8 @@ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="132104685867008" timestamp=1749451448 build=3737 commit="58f08e43"
-INFO [ main] system info | tid="132104685867008" timestamp=1749451448 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+INFO [ main] build info | tid="134334334976000" timestamp=1749451619 build=3737 commit="58f08e43"
+INFO [ main] system info | tid="134334334976000" timestamp=1749451619 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 6 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 52 key-value pairs and 1147 tensors from /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
@@ -3288,7 +2918,7 @@ llm_load_tensors: CPU buffer size = 938.98 MiB
llm_load_tensors: CUDA0 buffer size = 13995.99 MiB
llm_load_tensors: CUDA1 buffer size = 13730.03 MiB
....................................................................................................
-llama_new_context_with_model: n_ctx = 65536
+llama_new_context_with_model: n_ctx = 61440
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
@@ -3298,30 +2928,69 @@ llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: CUDA0 KV buffer size = 1185.77 MiB
-llama_kv_cache_init: CUDA1 KV buffer size = 1147.51 MiB
-llama_new_context_with_model: KV self size = 2333.25 MiB, c^KV (q8_0): 2333.25 MiB, kv^T: not used
+llama_kv_cache_init: CUDA0 KV buffer size = 1111.66 MiB
+llama_kv_cache_init: CUDA1 KV buffer size = 1075.80 MiB
+llama_new_context_with_model: KV self size = 2187.42 MiB, c^KV (q8_0): 2187.42 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
-ggml_backend_cuda_buffer_type_alloc_buffer: allocating 7688.02 MiB on device 0: cudaMalloc failed: out of memory
-ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 8061468672
-llama_new_context_with_model: failed to allocate compute buffers
-llama_init_from_gpt_params: error: failed to create context with model '/media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf'
- ERR [ load_model] unable to load model | tid="132104685867008" timestamp=1749451473 model="/media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf"
-Segmentation fault (core dumped)
-```
-
----
-
-👤 **ikawrakow** commented the **2025-06-09** at **06:48:26**:
-
-OK, I guess you just remove `-ot "blk\.(3)\.ffn_.*=CUDA0"` and `-ot "blk\.(5)\.ffn_.*=CUDA0"` arguments. You will get 3-5% lower performance, but you should be able to run with 65k context.
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **06:48:34**:
-
-OK i changed the context size to 60K and it worked:
+llama_new_context_with_model: CUDA0 compute buffer size = 7272.02 MiB
+llama_new_context_with_model: CUDA1 compute buffer size = 6632.03 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 1072.05 MiB
+llama_new_context_with_model: graph nodes = 13613
+llama_new_context_with_model: graph splits = 149
+INFO [ init] initializing slots | tid="134334334976000" timestamp=1749451665 n_slots=1
+INFO [ init] new slot | tid="134334334976000" timestamp=1749451665 id_slot=0 n_ctx_slot=61440
+INFO [ main] model loaded | tid="134334334976000" timestamp=1749451665
+INFO [ main] chat template | tid="134334334976000" timestamp=1749451665 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
+INFO [ main] HTTP server listening | tid="134334334976000" timestamp=1749451665 n_threads_http="111" port="10002" hostname="0.0.0.0"
+INFO [ update_slots] all slots are idle | tid="134334334976000" timestamp=1749451665
+
+
+```
+
+---
+
+👤 **mtcl** commented on **2025-06-09** at **06:49:32**
+
+> OK, I guess you just remove `-ot "blk\.(3)\.ffn_.*=CUDA0"` and `-ot "blk\.(5)\.ffn_.*=CUDA0"` arguments. You will get 3-5% lower performance, but you should be able to run with 65k context.
+
+Let me try this right after I run a 16k context processing right now.
+
+---
+
+👤 **mtcl** commented on **2025-06-09** at **06:51:52**
+
+OK with the 60K context eventhough the server started, it crashed when i sent 16k context as a prompt:
+
+```
+erver_request] request | tid="134332240879616" timestamp=1749451794 remote_addr="172.17.0.3" remote_port=42270 status=200 method="GET" path="/v1/models" params={}
+INFO [ log_server_request] request | tid="134332232486912" timestamp=1749451795 remote_addr="172.17.0.3" remote_port=42272 status=200 method="GET" path="/v1/models" params={}
+INFO [ log_server_request] request | tid="134332148609024" timestamp=1749451800 remote_addr="172.17.0.3" remote_port=42282 status=200 method="GET" path="/v1/models" params={}
+INFO [ launch_slot_with_task] slot is processing task | tid="134334334976000" timestamp=1749451801 id_slot=0 id_task=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="134334334976000" timestamp=1749451801 id_slot=0 id_task=0 p0=0
+CUDA error: out of memory
+ current device: 0, in function alloc at /home/mukul/dev-ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:384
+ cuMemCreate(&handle, reserve_size, &prop, 0)
+/home/mukul/dev-ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
+Could not attach to process. If your uid matches the uid of the target
+process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
+again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
+ptrace: Operation not permitted.
+No stack.
+The program is not being run.
+Aborted (core dumped)
+```
+
+I will just try your suggestion now
+
+
+> OK, I guess you just remove `-ot "blk\.(3)\.ffn_.*=CUDA0"` and `-ot "blk\.(5)\.ffn_.*=CUDA0"` arguments. You will get 3-5% lower performance, but you should be able to run with 65k context.
+
+---
+
+👤 **mtcl** commented on **2025-06-09** at **06:59:50**
+
+Here are the results:
```
(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
@@ -3334,8 +3003,6 @@ OK i changed the context size to 60K and it worked:
-amb 512 \
-fmoe \
--n-gpu-layers 63 \
- -ot "blk\.(3)\.ffn_.*=CUDA0" \
- -ot "blk\.(5)\.ffn_.*=CUDA1" \
--override-tensor exps=CPU \
--parallel 1 \
--threads 57 \
@@ -3343,14 +3010,15 @@ OK i changed the context size to 60K and it worked:
--port 10002
```
+
```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="134334334976000" timestamp=1749451619 build=3737 commit="58f08e43"
-INFO [ main] system info | tid="134334334976000" timestamp=1749451619 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+INFO [ main] build info | tid="133677812011008" timestamp=1749451965 build=3737 commit="58f08e43"
+INFO [ main] system info | tid="133677812011008" timestamp=1749451965 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 6 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 52 key-value pairs and 1147 tensors from /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
@@ -3472,25 +3140,15 @@ llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
llm_load_tensors: ggml ctx size = 1.40 MiB
-Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA1
-Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA1
-Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CUDA1
-Tensor blk.5.ffn_down_exps.weight buffer type overriden to CUDA1
-Tensor blk.5.ffn_up_exps.weight buffer type overriden to CUDA1
-Tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CUDA1
-Tensor blk.5.ffn_down_shexp.weight buffer type overriden to CUDA1
-Tensor blk.5.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
@@ -3659,7 +3317,7 @@ Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
-llm_load_tensors: CPU buffer size = 36486.67 MiB
+llm_load_tensors: CPU buffer size = 41735.95 MiB
llm_load_tensors: CPU buffer size = 43905.23 MiB
llm_load_tensors: CPU buffer size = 43534.23 MiB
llm_load_tensors: CPU buffer size = 43534.23 MiB
@@ -3667,8 +3325,8 @@ llm_load_tensors: CPU buffer size = 43905.23 MiB
llm_load_tensors: CPU buffer size = 43534.23 MiB
llm_load_tensors: CPU buffer size = 44473.21 MiB
llm_load_tensors: CPU buffer size = 938.98 MiB
-llm_load_tensors: CUDA0 buffer size = 13995.99 MiB
-llm_load_tensors: CUDA1 buffer size = 13730.03 MiB
+llm_load_tensors: CUDA0 buffer size = 9056.64 MiB
+llm_load_tensors: CUDA1 buffer size = 8687.38 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 61440
llama_new_context_with_model: n_batch = 4096
@@ -3690,65 +3348,103 @@ llama_new_context_with_model: CUDA1 compute buffer size = 6632.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1072.05 MiB
llama_new_context_with_model: graph nodes = 13613
llama_new_context_with_model: graph splits = 149
-INFO [ init] initializing slots | tid="134334334976000" timestamp=1749451665 n_slots=1
-INFO [ init] new slot | tid="134334334976000" timestamp=1749451665 id_slot=0 n_ctx_slot=61440
-INFO [ main] model loaded | tid="134334334976000" timestamp=1749451665
-INFO [ main] chat template | tid="134334334976000" timestamp=1749451665 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-INFO [ main] HTTP server listening | tid="134334334976000" timestamp=1749451665 n_threads_http="111" port="10002" hostname="0.0.0.0"
-INFO [ update_slots] all slots are idle | tid="134334334976000" timestamp=1749451665
+INFO [ init] initializing slots | tid="133677812011008" timestamp=1749452022 n_slots=1
+INFO [ init] new slot | tid="133677812011008" timestamp=1749452022 id_slot=0 n_ctx_slot=61440
+INFO [ main] model loaded | tid="133677812011008" timestamp=1749452022
+INFO [ main] chat template | tid="133677812011008" timestamp=1749452022 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
+INFO [ main] HTTP server listening | tid="133677812011008" timestamp=1749452022 n_threads_http="111" port="10002" hostname="0.0.0.0"
+```
+
+```
+INFO [ update_slots] all slots are idle | tid="133677812011008" timestamp=1749452022
+INFO [ log_server_request] request | tid="133675714863104" timestamp=1749452032 remote_addr="172.17.0.3" remote_port=52820 status=200 method="GET" path="/v1/models" params={}
+INFO [ log_server_request] request | tid="133675706470400" timestamp=1749452066 remote_addr="172.17.0.3" remote_port=60306 status=200 method="GET" path="/v1/models" params={}
+INFO [ launch_slot_with_task] slot is processing task | tid="133677812011008" timestamp=1749452066 id_slot=0 id_task=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="133677812011008" timestamp=1749452066 id_slot=0 id_task=0 p0=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="133677812011008" timestamp=1749452106 id_slot=0 id_task=0 p0=4096
+INFO [ update_slots] kv cache rm [p0, end) | tid="133677812011008" timestamp=1749452143 id_slot=0 id_task=0 p0=8192
+INFO [ update_slots] kv cache rm [p0, end) | tid="133677812011008" timestamp=1749452181 id_slot=0 id_task=0 p0=12288
+INFO [ update_slots] kv cache rm [p0, end) | tid="133677812011008" timestamp=1749452220 id_slot=0 id_task=0 p0=16384
+INFO [ print_timings] prompt eval time = 188356.81 ms / 17617 tokens ( 10.69 ms per token, 93.53 tokens per second) | tid="133677812011008" timestamp=1749452317 id_slot=0 id_task=0 t_prompt_processing=188356.814 n_prompt_tokens_processed=17617 t_token=10.691764432082648 n_tokens_second=93.52993197262298
+INFO [ print_timings] generation eval time = 62522.24 ms / 540 runs ( 115.78 ms per token, 8.64 tokens per second) | tid="133677812011008" timestamp=1749452317 id_slot=0 id_task=0 t_token_generation=62522.242 n_decoded=540 t_token=115.78192962962963 n_tokens_second=8.636926359742507
+INFO [ print_timings] total time = 250879.06 ms | tid="133677812011008" timestamp=1749452317 id_slot=0 id_task=0 t_prompt_processing=188356.814 t_token_generation=62522.242 t_total=250879.056
+INFO [ update_slots] slot released | tid="133677812011008" timestamp=1749452317 id_slot=0 id_task=0 n_ctx=61440 n_past=18156 n_system_tokens=0 n_cache_tokens=18156 truncated=false
+INFO [ update_slots] all slots are idle | tid="133677812011008" timestamp=1749452317
+INFO [ log_server_request] request | tid="133675622592512" timestamp=1749452317 remote_addr="172.17.0.3" remote_port=60314 status=200 method="POST" path="/v1/chat/completions" params={}
+INFO [ update_slots] all slots are idle | tid="133677812011008" timestamp=1749452317
```
---
-👤 **mtcl** commented the **2025-06-09** at **06:49:32**:
+👤 **mtcl** commented on **2025-06-09** at **07:01:50**
-> OK, I guess you just remove `-ot "blk\.(3)\.ffn_.*=CUDA0"` and `-ot "blk\.(5)\.ffn_.*=CUDA0"` arguments. You will get 3-5% lower performance, but you should be able to run with 65k context.
+You are absolutely the best! I can now fit 60K context with 93 tk/s prefill and 8.63 tk/s generation. So i have to pick and choose what I want the most. That helps. Thank you again!
-Let me try this right after I run a 16k context processing right now.
+---
+
+👤 **ikawrakow** commented on **2025-06-09** at **07:05:30**
+
+@saood06 Can you point me to the specific place whee the check is being made. But apart from this, I still think there is a bug because it does not make sense that the scheduler wants to allocate such insane amount of memory. I haven't come around to look why that happens.
---
-👤 **mtcl** commented the **2025-06-09** at **06:51:52**:
+👤 **mtcl** commented on **2025-06-09** at **07:09:03**
-OK with the 60K context eventhough the server started, it crashed when i sent 16k context as a prompt:
+would you know if I want to do to optimize my Qwen3 startup command here? what should I change here?
-```
-erver_request] request | tid="134332240879616" timestamp=1749451794 remote_addr="172.17.0.3" remote_port=42270 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="134332232486912" timestamp=1749451795 remote_addr="172.17.0.3" remote_port=42272 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="134332148609024" timestamp=1749451800 remote_addr="172.17.0.3" remote_port=42282 status=200 method="GET" path="/v1/models" params={}
-INFO [ launch_slot_with_task] slot is processing task | tid="134334334976000" timestamp=1749451801 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="134334334976000" timestamp=1749451801 id_slot=0 id_task=0 p0=0
-CUDA error: out of memory
- current device: 0, in function alloc at /home/mukul/dev-ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:384
- cuMemCreate(&handle, reserve_size, &prop, 0)
-/home/mukul/dev-ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
-Could not attach to process. If your uid matches the uid of the target
-process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
-again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
-ptrace: Operation not permitted.
-No stack.
-The program is not being run.
-Aborted (core dumped)
-```
+this is what I had on @ubergarm 's guide
-I will just try your suggestion now
+```bash
+CUDA_VISIBLE_DEVICES="1" ./build/bin/llama-server \
+ --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
+ --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
+ -fa \
+ -ctk q8_0 -ctv q8_0 \
+ -c 32768 \
+ -fmoe \
+ -b 1024 -ub 1024 \
+ -amb 512 \
+ -rtr \
+ -ot blk\.1[2-9]\.ffn.*=CPU \
+ -ot blk\.[2-8][0-9]\.ffn.*=CPU \
+ -ot blk\.9[0-3]\.ffn.*=CPU \
+ -ngl 99 \
+ --threads 57 \
+ --host 0.0.0.0 \
+ --port 10002
+```
+and I am trying this now, is there something else i should bring over from above command though?
-> OK, I guess you just remove `-ot "blk\.(3)\.ffn_.*=CUDA0"` and `-ot "blk\.(5)\.ffn_.*=CUDA0"` arguments. You will get 3-5% lower performance, but you should be able to run with 65k context.
+```bash
+CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
+ --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
+ --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
+ --ctx-size 65536 \
+ -ctk q8_0 \
+ -mla 3 -fa \
+ -b 4096 -ub 4096 \
+ -amb 512 \
+ -fmoe \
+ --n-gpu-layers 63 \
+ --override-tensor exps=CPU \
+ --parallel 1 \
+ --threads 57 \
+ --host 0.0.0.0 \
+ --port 10002
+```
---
-👤 **mtcl** commented the **2025-06-09** at **06:59:50**:
-
-Here are the results:
+👤 **mtcl** commented on **2025-06-09** at **07:10:22**
```
(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf \
- --alias ubergarm/DeepSeek-R1-0528-GGUF \
- --ctx-size 61440 \
+ --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
+ --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
+ --ctx-size 65536 \
-ctk q8_0 \
-mla 3 -fa \
-b 4096 -ub 4096 \
@@ -3761,995 +3457,18 @@ Here are the results:
--host 0.0.0.0 \
--port 10002
```
-
-
```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="133677812011008" timestamp=1749451965 build=3737 commit="58f08e43"
-INFO [ main] system info | tid="133677812011008" timestamp=1749451965 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
-llama_model_loader: additional 6 GGUFs metadata loaded.
-llama_model_loader: loaded meta data with 52 key-value pairs and 1147 tensors from /media/mukul/backup/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf (version GGUF V3 (latest))
+INFO [ main] build info | tid="133181503258624" timestamp=1749452693 build=3737 commit="58f08e43"
+INFO [ main] system info | tid="133181503258624" timestamp=1749452693 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+llama_model_loader: additional 2 GGUFs metadata loaded.
+llama_model_loader: loaded meta data with 40 key-value pairs and 1131 tensors from /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = deepseek2
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = DeepSeek R1 0528
-llama_model_loader: - kv 3: general.version str = 0528
-llama_model_loader: - kv 4: general.basename str = DeepSeek-R1
-llama_model_loader: - kv 5: general.size_label str = 256x21B
-llama_model_loader: - kv 6: deepseek2.block_count u32 = 61
-llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840
-llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168
-llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432
-llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128
-llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128
-llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000
-llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8
-llama_model_loader: - kv 15: general.file_type u32 = 339
-llama_model_loader: - kv 16: deepseek2.leading_dense_block_count u32 = 3
-llama_model_loader: - kv 17: deepseek2.vocab_size u32 = 129280
-llama_model_loader: - kv 18: deepseek2.attention.q_lora_rank u32 = 1536
-llama_model_loader: - kv 19: deepseek2.attention.kv_lora_rank u32 = 512
-llama_model_loader: - kv 20: deepseek2.attention.key_length u32 = 192
-llama_model_loader: - kv 21: deepseek2.attention.value_length u32 = 128
-llama_model_loader: - kv 22: deepseek2.expert_feed_forward_length u32 = 2048
-llama_model_loader: - kv 23: deepseek2.expert_count u32 = 256
-llama_model_loader: - kv 24: deepseek2.expert_shared_count u32 = 1
-llama_model_loader: - kv 25: deepseek2.expert_weights_scale f32 = 2.500000
-llama_model_loader: - kv 26: deepseek2.expert_weights_norm bool = true
-llama_model_loader: - kv 27: deepseek2.expert_gating_func u32 = 2
-llama_model_loader: - kv 28: deepseek2.rope.dimension_count u32 = 64
-llama_model_loader: - kv 29: deepseek2.rope.scaling.type str = yarn
-llama_model_loader: - kv 30: deepseek2.rope.scaling.factor f32 = 40.000000
-llama_model_loader: - kv 31: deepseek2.rope.scaling.original_context_length u32 = 4096
-llama_model_loader: - kv 32: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
-llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 34: tokenizer.ggml.pre str = deepseek-v3
-llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
-llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
-llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
-llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 0
-llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 1
-llama_model_loader: - kv 40: tokenizer.ggml.padding_token_id u32 = 1
-llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 42: tokenizer.ggml.add_eos_token bool = false
-llama_model_loader: - kv 43: tokenizer.chat_template str = {% if not add_generation_prompt is de...
-llama_model_loader: - kv 44: general.quantization_version u32 = 2
-llama_model_loader: - kv 45: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-R1...
-llama_model_loader: - kv 46: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
-llama_model_loader: - kv 47: quantize.imatrix.entries_count i32 = 721
-llama_model_loader: - kv 48: quantize.imatrix.chunks_count i32 = 812
-llama_model_loader: - kv 49: split.no u16 = 0
-llama_model_loader: - kv 50: split.count u16 = 7
-llama_model_loader: - kv 51: split.tensors.count i32 = 1147
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 612 tensors
-llama_model_loader: - type iq3_k_r4: 116 tensors
-llama_model_loader: - type iq4_ks_r4: 58 tensors
-llm_load_vocab: special tokens cache size = 818
-llm_load_vocab: token to piece cache size = 0.8223 MB
-llm_load_print_meta: format = GGUF V3 (latest)
-llm_load_print_meta: arch = deepseek2
-llm_load_print_meta: vocab type = BPE
-llm_load_print_meta: n_vocab = 129280
-llm_load_print_meta: n_merges = 127741
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 163840
-llm_load_print_meta: n_embd = 7168
-llm_load_print_meta: n_layer = 61
-llm_load_print_meta: n_head = 128
-llm_load_print_meta: n_head_kv = 128
-llm_load_print_meta: n_rot = 64
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_swa_pattern = 1
-llm_load_print_meta: n_embd_head_k = 192
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 1
-llm_load_print_meta: n_embd_k_gqa = 24576
-llm_load_print_meta: n_embd_v_gqa = 16384
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-06
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 18432
-llm_load_print_meta: n_expert = 256
-llm_load_print_meta: n_expert_used = 8
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 0
-llm_load_print_meta: rope scaling = yarn
-llm_load_print_meta: freq_base_train = 10000.0
-llm_load_print_meta: freq_scale_train = 0.025
-llm_load_print_meta: n_ctx_orig_yarn = 4096
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = IQ3_K_R4 - 3.4325 bpw
-llm_load_print_meta: model params = 672.050 B
-llm_load_print_meta: model size = 300.938 GiB (3.847 BPW)
-llm_load_print_meta: repeating layers = 299.104 GiB (3.834 BPW, 670.196 B parameters)
-llm_load_print_meta: general.name = DeepSeek R1 0528
-llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
-llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: LF token = 131 'Ä'
-llm_load_print_meta: max token length = 256
-llm_load_print_meta: n_layer_dense_lead = 3
-llm_load_print_meta: n_lora_q = 1536
-llm_load_print_meta: n_lora_kv = 512
-llm_load_print_meta: n_ff_exp = 2048
-llm_load_print_meta: n_expert_shared = 1
-llm_load_print_meta: expert_weights_scale = 2.5
-llm_load_print_meta: expert_weights_norm = 1
-llm_load_print_meta: expert_gating_func = sigmoid
-llm_load_print_meta: rope_yarn_log_mul = 0.1000
-llm_load_tensors: ggml ctx size = 1.40 MiB
-Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
-llm_load_tensors: offloading 61 repeating layers to GPU
-llm_load_tensors: offloading non-repeating layers to GPU
-llm_load_tensors: offloaded 62/62 layers to GPU
-llm_load_tensors: CPU buffer size = 41735.95 MiB
-llm_load_tensors: CPU buffer size = 43905.23 MiB
-llm_load_tensors: CPU buffer size = 43534.23 MiB
-llm_load_tensors: CPU buffer size = 43534.23 MiB
-llm_load_tensors: CPU buffer size = 43905.23 MiB
-llm_load_tensors: CPU buffer size = 43534.23 MiB
-llm_load_tensors: CPU buffer size = 44473.21 MiB
-llm_load_tensors: CPU buffer size = 938.98 MiB
-llm_load_tensors: CUDA0 buffer size = 9056.64 MiB
-llm_load_tensors: CUDA1 buffer size = 8687.38 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 61440
-llama_new_context_with_model: n_batch = 4096
-llama_new_context_with_model: n_ubatch = 4096
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 3
-llama_new_context_with_model: attn_max_b = 512
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: CUDA0 KV buffer size = 1111.66 MiB
-llama_kv_cache_init: CUDA1 KV buffer size = 1075.80 MiB
-llama_new_context_with_model: KV self size = 2187.42 MiB, c^KV (q8_0): 2187.42 MiB, kv^T: not used
-llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
-llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
-llama_new_context_with_model: CUDA0 compute buffer size = 7272.02 MiB
-llama_new_context_with_model: CUDA1 compute buffer size = 6632.03 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 1072.05 MiB
-llama_new_context_with_model: graph nodes = 13613
-llama_new_context_with_model: graph splits = 149
-INFO [ init] initializing slots | tid="133677812011008" timestamp=1749452022 n_slots=1
-INFO [ init] new slot | tid="133677812011008" timestamp=1749452022 id_slot=0 n_ctx_slot=61440
-INFO [ main] model loaded | tid="133677812011008" timestamp=1749452022
-INFO [ main] chat template | tid="133677812011008" timestamp=1749452022 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-INFO [ main] HTTP server listening | tid="133677812011008" timestamp=1749452022 n_threads_http="111" port="10002" hostname="0.0.0.0"
-```
-
-
-```
-INFO [ update_slots] all slots are idle | tid="133677812011008" timestamp=1749452022
-INFO [ log_server_request] request | tid="133675714863104" timestamp=1749452032 remote_addr="172.17.0.3" remote_port=52820 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="133675706470400" timestamp=1749452066 remote_addr="172.17.0.3" remote_port=60306 status=200 method="GET" path="/v1/models" params={}
-INFO [ launch_slot_with_task] slot is processing task | tid="133677812011008" timestamp=1749452066 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="133677812011008" timestamp=1749452066 id_slot=0 id_task=0 p0=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="133677812011008" timestamp=1749452106 id_slot=0 id_task=0 p0=4096
-INFO [ update_slots] kv cache rm [p0, end) | tid="133677812011008" timestamp=1749452143 id_slot=0 id_task=0 p0=8192
-INFO [ update_slots] kv cache rm [p0, end) | tid="133677812011008" timestamp=1749452181 id_slot=0 id_task=0 p0=12288
-INFO [ update_slots] kv cache rm [p0, end) | tid="133677812011008" timestamp=1749452220 id_slot=0 id_task=0 p0=16384
-INFO [ print_timings] prompt eval time = 188356.81 ms / 17617 tokens ( 10.69 ms per token, 93.53 tokens per second) | tid="133677812011008" timestamp=1749452317 id_slot=0 id_task=0 t_prompt_processing=188356.814 n_prompt_tokens_processed=17617 t_token=10.691764432082648 n_tokens_second=93.52993197262298
-INFO [ print_timings] generation eval time = 62522.24 ms / 540 runs ( 115.78 ms per token, 8.64 tokens per second) | tid="133677812011008" timestamp=1749452317 id_slot=0 id_task=0 t_token_generation=62522.242 n_decoded=540 t_token=115.78192962962963 n_tokens_second=8.636926359742507
-INFO [ print_timings] total time = 250879.06 ms | tid="133677812011008" timestamp=1749452317 id_slot=0 id_task=0 t_prompt_processing=188356.814 t_token_generation=62522.242 t_total=250879.056
-INFO [ update_slots] slot released | tid="133677812011008" timestamp=1749452317 id_slot=0 id_task=0 n_ctx=61440 n_past=18156 n_system_tokens=0 n_cache_tokens=18156 truncated=false
-INFO [ update_slots] all slots are idle | tid="133677812011008" timestamp=1749452317
-INFO [ log_server_request] request | tid="133675622592512" timestamp=1749452317 remote_addr="172.17.0.3" remote_port=60314 status=200 method="POST" path="/v1/chat/completions" params={}
-INFO [ update_slots] all slots are idle | tid="133677812011008" timestamp=1749452317
-
-```
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **07:01:50**:
-
-You are absolutely the best! I can now fit 60K context with 93 tk/s prefill and 8.63 tk/s generation. So i have to pick and choose what I want the most. That helps. Thank you again!
-
----
-
-👤 **ikawrakow** commented the **2025-06-09** at **07:05:30**:
-
-@saood06 Can you point me to the specific place whee the check is being made. But apart from this, I still think there is a bug because it does not make sense that the scheduler wants to allocate such insane amount of memory. I haven't come around to look why that happens.
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **07:09:03**:
-
-would you know if I want to do to optimize my Qwen3 startup command here? what should I change here?
-
-this is what I had on @ubergarm 's guide
-
-```bash
-CUDA_VISIBLE_DEVICES="1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
- --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
- -fa \
- -ctk q8_0 -ctv q8_0 \
- -c 32768 \
- -fmoe \
- -b 1024 -ub 1024 \
- -amb 512 \
- -rtr \
- -ot blk\.1[2-9]\.ffn.*=CPU \
- -ot blk\.[2-8][0-9]\.ffn.*=CPU \
- -ot blk\.9[0-3]\.ffn.*=CPU \
- -ngl 99 \
- --threads 57 \
- --host 0.0.0.0 \
- --port 10002
-```
-
-and I am trying this now, is there something else i should bring over from above command though?
-
-```bash
-CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
- --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
- --ctx-size 65536 \
- -ctk q8_0 \
- -mla 3 -fa \
- -b 4096 -ub 4096 \
- -amb 512 \
- -fmoe \
- --n-gpu-layers 63 \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 57 \
- --host 0.0.0.0 \
- --port 10002
-```
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **07:10:22**:
-
-```
-(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
- --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
- --ctx-size 65536 \
- -ctk q8_0 \
- -mla 3 -fa \
- -b 4096 -ub 4096 \
- -amb 512 \
- -fmoe \
- --n-gpu-layers 63 \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 57 \
- --host 0.0.0.0 \
- --port 10002
-```
-```
-ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
-ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
-ggml_cuda_init: found 2 CUDA devices:
- Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
- Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="133181503258624" timestamp=1749452693 build=3737 commit="58f08e43"
-INFO [ main] system info | tid="133181503258624" timestamp=1749452693 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
-llama_model_loader: additional 2 GGUFs metadata loaded.
-llama_model_loader: loaded meta data with 40 key-value pairs and 1131 tensors from /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = qwen3moe
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = Qwen3 235B A22B
-llama_model_loader: - kv 3: general.basename str = Qwen3
-llama_model_loader: - kv 4: general.size_label str = 235B-A22B
-llama_model_loader: - kv 5: general.license str = apache-2.0
-llama_model_loader: - kv 6: general.license.link str = https://huggingface.co/Qwen/Qwen3-235...
-llama_model_loader: - kv 7: general.tags arr[str,1] = ["text-generation"]
-llama_model_loader: - kv 8: qwen3moe.block_count u32 = 94
-llama_model_loader: - kv 9: qwen3moe.context_length u32 = 40960
-llama_model_loader: - kv 10: qwen3moe.embedding_length u32 = 4096
-llama_model_loader: - kv 11: qwen3moe.feed_forward_length u32 = 12288
-llama_model_loader: - kv 12: qwen3moe.attention.head_count u32 = 64
-llama_model_loader: - kv 13: qwen3moe.attention.head_count_kv u32 = 4
-llama_model_loader: - kv 14: qwen3moe.rope.freq_base f32 = 1000000.000000
-llama_model_loader: - kv 15: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 16: qwen3moe.expert_used_count u32 = 8
-llama_model_loader: - kv 17: qwen3moe.attention.key_length u32 = 128
-llama_model_loader: - kv 18: qwen3moe.attention.value_length u32 = 128
-llama_model_loader: - kv 19: general.file_type u32 = 139
-llama_model_loader: - kv 20: qwen3moe.expert_count u32 = 128
-llama_model_loader: - kv 21: qwen3moe.expert_feed_forward_length u32 = 1536
-llama_model_loader: - kv 22: general.quantization_version u32 = 2
-llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
-llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
-llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
-llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
-llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
-llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151643
-llama_model_loader: - kv 30: tokenizer.ggml.bos_token_id u32 = 151643
-llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
-llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
-llama_model_loader: - kv 33: quantize.imatrix.file str = /mnt/raid/models/ubergarm/Qwen3-235B-...
-llama_model_loader: - kv 34: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
-llama_model_loader: - kv 35: quantize.imatrix.entries_count i32 = 753
-llama_model_loader: - kv 36: quantize.imatrix.chunks_count i32 = 225
-llama_model_loader: - kv 37: split.no u16 = 0
-llama_model_loader: - kv 38: split.count u16 = 3
-llama_model_loader: - kv 39: split.tensors.count i32 = 1131
-llama_model_loader: - type f32: 471 tensors
-llama_model_loader: - type q8_0: 2 tensors
-llama_model_loader: - type iq3_k: 188 tensors
-llama_model_loader: - type iq4_k: 94 tensors
-llama_model_loader: - type iq6_k: 376 tensors
-llm_load_vocab: special tokens cache size = 26
-llm_load_vocab: token to piece cache size = 0.9311 MB
-llm_load_print_meta: format = GGUF V3 (latest)
-llm_load_print_meta: arch = qwen3moe
-llm_load_print_meta: vocab type = BPE
-llm_load_print_meta: n_vocab = 151936
-llm_load_print_meta: n_merges = 151387
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 40960
-llm_load_print_meta: n_embd = 4096
-llm_load_print_meta: n_layer = 94
-llm_load_print_meta: n_head = 64
-llm_load_print_meta: n_head_kv = 4
-llm_load_print_meta: n_rot = 128
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_swa_pattern = 1
-llm_load_print_meta: n_embd_head_k = 128
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 16
-llm_load_print_meta: n_embd_k_gqa = 512
-llm_load_print_meta: n_embd_v_gqa = 512
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-06
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 12288
-llm_load_print_meta: n_expert = 128
-llm_load_print_meta: n_expert_used = 8
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 2
-llm_load_print_meta: rope scaling = linear
-llm_load_print_meta: freq_base_train = 1000000.0
-llm_load_print_meta: freq_scale_train = 1
-llm_load_print_meta: n_ctx_orig_yarn = 40960
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: model type = ?B
-llm_load_print_meta: model ftype = IQ3_K - 3.4325 bpw
-llm_load_print_meta: model params = 235.094 B
-llm_load_print_meta: model size = 106.830 GiB (3.903 BPW)
-llm_load_print_meta: repeating layers = 105.598 GiB (3.879 BPW, 233.849 B parameters)
-llm_load_print_meta: general.name = Qwen3 235B A22B
-llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
-llm_load_print_meta: EOS token = 151645 '<|im_end|>'
-llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
-llm_load_print_meta: LF token = 148848 'ÄĬ'
-llm_load_print_meta: EOT token = 151645 '<|im_end|>'
-llm_load_print_meta: max token length = 256
-llm_load_print_meta: n_ff_exp = 1536
-llm_load_tensors: ggml ctx size = 1.49 MiB
-Tensor blk.0.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.0.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.0.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.2.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.2.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.61.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.61.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.61.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.62.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.62.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.62.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.63.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.63.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.63.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.64.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.64.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.64.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.65.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.65.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.65.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.66.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.66.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.66.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.67.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.67.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.67.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.68.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.68.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.68.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.69.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.69.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.69.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.70.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.70.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.70.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.71.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.71.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.71.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.72.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.72.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.72.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.73.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.73.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.73.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.74.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.74.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.74.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.75.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.75.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.75.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.76.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.76.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.76.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.77.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.77.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.77.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.78.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.78.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.78.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.79.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.79.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.79.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.80.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.80.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.80.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.81.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.81.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.81.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.82.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.82.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.82.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.83.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.83.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.83.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.84.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.84.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.84.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.85.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.85.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.85.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.86.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.86.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.86.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.87.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.87.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.87.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.88.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.88.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.88.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.89.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.89.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.89.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.90.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.90.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.90.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.93.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.93.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.93.ffn_up_exps.weight buffer type overriden to CPU
-llm_load_tensors: offloading 63 repeating layers to GPU
-llm_load_tensors: offloaded 63/95 layers to GPU
-llm_load_tensors: CPU buffer size = 36422.69 MiB
-llm_load_tensors: CPU buffer size = 37141.03 MiB
-llm_load_tensors: CPU buffer size = 35082.59 MiB
-llm_load_tensors: CPU buffer size = 36291.28 MiB
-llm_load_tensors: CPU buffer size = 1722.64 MiB
-llm_load_tensors: CUDA0 buffer size = 1808.69 MiB
-llm_load_tensors: CUDA1 buffer size = 1867.03 MiB
-....................................................................................................
-=====================================================================
- MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA
-=====================================================================
-llama_new_context_with_model: n_ctx = 65536
-llama_new_context_with_model: n_batch = 4096
-llama_new_context_with_model: n_ubatch = 4096
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 0
-llama_new_context_with_model: attn_max_b = 512
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 1000000.0
-llama_new_context_with_model: freq_scale = 1
-ggml_cuda_host_malloc: failed to allocate 3038.00 MiB of pinned memory: invalid argument
-llama_kv_cache_init: CPU KV buffer size = 3038.00 MiB
-llama_kv_cache_init: CUDA0 KV buffer size = 3038.02 MiB
-llama_kv_cache_init: CUDA1 KV buffer size = 3136.02 MiB
-llama_new_context_with_model: KV self size = 9212.00 MiB, K (q8_0): 3196.00 MiB, V (f16): 6016.00 MiB
-llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
-llama_new_context_with_model: CUDA0 compute buffer size = 3068.61 MiB
-llama_new_context_with_model: CUDA1 compute buffer size = 896.03 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 1088.05 MiB
-llama_new_context_with_model: graph nodes = 3672
-llama_new_context_with_model: graph splits = 595
-INFO [ init] initializing slots | tid="133181503258624" timestamp=1749452713 n_slots=1
-INFO [ init] new slot | tid="133181503258624" timestamp=1749452713 id_slot=0 n_ctx_slot=65536
-INFO [ main] model loaded | tid="133181503258624" timestamp=1749452713
-INFO [ main] chat template | tid="133181503258624" timestamp=1749452713 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
-INFO [ main] HTTP server listening | tid="133181503258624" timestamp=1749452713 n_threads_http="111" port="10002" hostname="0.0.0.0"
-INFO [ update_slots] all slots are idle | tid="133181503258624" timestamp=1749452713
-INFO [ log_server_request] request | tid="133179400769536" timestamp=1749452792 remote_addr="172.17.0.3"
-INFO [ launch_slot_with_task] slot is processing task | tid="133181503258624" timestamp=1749452800 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452800 id_slot=0 id_task=0 p0=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452820 id_slot=0 id_task=0 p0=4096
-INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452834 id_slot=0 id_task=0 p0=8192
-INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452848 id_slot=0 id_task=0 p0=12288
-INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452862 id_slot=0 id_task=0 p0=16384
-INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452877 id_slot=0 id_task=0 p0=20480
-.INFO [ print_timings] prompt eval time = 89767.54 ms / 21880 tokens ( 4.10 ms per token, 243.74 tokens per second) | tid="133181503258624" timestamp=1749452963 id_slot=0 id_task=0 t_prompt_processing=89767.542 n_prompt_tokens_processed=21880 t_token=4.1027212979890315 n_tokens_second=243.74066073904527
-INFO [ print_timings] generation eval time = 72821.50 ms / 563 runs ( 129.35 ms per token, 7.73 tokens per second) | tid="133181503258624" timestamp=1749452963 id_slot=0 id_task=0 t_token_generation=72821.5 n_decoded=563 t_token=129.34547069271758 n_tokens_second=7.731233220958097
-INFO [ print_timings] total time = 162589.04 ms | tid="133181503258624" timestamp=1749452963 id_slot=0 id_task=0 t_prompt_processing=89767.542 t_token_generation=72821.5 t_total=162589.04200000002
-INFO [ update_slots] slot released | tid="133181503258624" timestamp=1749452963 id_slot=0 id_task=0 n_ctx=65536 n_past=22442 n_system_tokens=0 n_cache_tokens=22442 truncated=false
-INFO [ update_slots] all slots are idle | tid="133181503258624" timestamp=1749452963
-INFO [ log_server_request] request | tid="133179285434368" timestamp=1749452963 remote_addr="172.17.0.3" remote_port=47690 status=200 method="POST" path="/v1/chat/completions" params={}
-INFO [ update_slots] all slots are idle | tid="133181503258624" timestamp=1749452963
-```
-
----
-
-👤 **ikawrakow** commented the **2025-06-09** at **07:17:58**:
-
-* You need `-ctv q8_0` to also have the V cache quantized.
-* You need to change `--n-gpu-layers` to 100 (Qwen3 has more layers than DeepSeek
-* Remove `-mla` (not applicable to any model other than DeepSeek)
-* You don't need the `-amb` argument
-
-Let's see what buffer sizes you get with that. Then we will know how many layers you can put on the GPU.
-
----
-
-👤 **ikawrakow** commented the **2025-06-09** at **07:19:10**:
-
-Oh, try changing threads to 56 from 57. 57 is a really strange number of threads.
-
----
-
-👤 **saood06** commented the **2025-06-09** at **07:19:50**:
-
-> [@saood06](https://github.com/saood06) Can you point me to the specific place whee the check is being made.
-
-@ikawrakow
-
-https://github.com/ikawrakow/ik_llama.cpp/blob/58f08e43859a942dcc4d585f04b729eb50603264/src/llama.cpp#L20758
-
->But apart from this, I still think there is a bug because it does not make sense that the scheduler wants to allocate such insane amount of memory.
-
-Yes that doesn't make much sense to me either.
-
-> I haven't come around to look why that happens.
-
-If you ever do I'd be interested to hear anything you find out.
-
----
-
-👤 **saood06** commented the **2025-06-09** at **07:19:50**:
-
-> [@saood06](https://github.com/saood06) Can you point me to the specific place whee the check is being made.
-
-https://github.com/ikawrakow/ik_llama.cpp/blob/58f08e43859a942dcc4d585f04b729eb50603264/src/llama.cpp#L20758
-
->But apart from this, I still think there is a bug because it does not make sense that the scheduler wants to allocate such insane amount of memory.
-
-Yes that doesn't make much sense to me either.
-
-> I haven't come around to look why that happens.
-
-If you ever do I'd be interested to hear anything you find out.
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **07:20:00**:
-
-ok i tried something before your comment, so I will post that here anyway, and then I will now try your settings in there.
-
-below is from what I was experimenting with 32k
-
-```
-(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0,1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
- --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
- -fa \
- -ctk q8_0 -ctv q8_0 \
- -c 32768 \
- -fmoe \
- -b 4096 -ub 4096 \
- -amb 512 \
- -rtr \
- -ot blk\.1[2-9]\.ffn.*=CPU \
- -ot blk\.[2-8][0-9]\.ffn.*=CPU \
- -ot blk\.9[0-3]\.ffn.*=CPU \
- -ngl 99 \
- --threads 57 \
- --host 0.0.0.0 \
- --port 10002
-```
-
-```
-ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
-ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
-ggml_cuda_init: found 2 CUDA devices:
- Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
- Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="135402454921216" timestamp=1749453237 build=3737 commit="58f08e43"
-INFO [ main] system info | tid="135402454921216" timestamp=1749453237 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
-llama_model_loader: additional 2 GGUFs metadata loaded.
-llama_model_loader: loaded meta data with 40 key-value pairs and 1131 tensors from /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = qwen3moe
+llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3 235B A22B
llama_model_loader: - kv 3: general.basename str = Qwen3
@@ -4849,426 +3568,302 @@ llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp = 1536
llm_load_tensors: ggml ctx size = 1.49 MiB
-Tensor blk.12.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CPU
+Tensor blk.0.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.0.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.0.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.2.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.2.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.48.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.49.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.50.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.51.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.52.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.53.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.54.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.55.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.56.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.57.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.58.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.59.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.60.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.61.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.61.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.61.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.62.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.62.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.62.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.63.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.63.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.63.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.64.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.64.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.64.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.65.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.65.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.65.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.66.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.66.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.66.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.67.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.67.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.67.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.68.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.68.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.68.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.69.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.69.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.69.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.70.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.70.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.70.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.71.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.71.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.71.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.72.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.72.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.72.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.73.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.73.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.73.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.74.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.74.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.74.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.75.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.75.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.75.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.76.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.76.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.76.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.77.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.77.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.77.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.78.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.78.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.78.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.79.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.79.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.79.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.80.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.80.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.80.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.81.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.81.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.81.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.82.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.82.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.82.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.83.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.83.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.83.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.84.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.84.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.84.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.85.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.85.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.85.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.86.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.86.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.86.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.87.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.87.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.87.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.88.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.88.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.88.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.89.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.89.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.89.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.90.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.90.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.90.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.91.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.91.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.92.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.92.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.93.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.93.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_up_exps.weight buffer type overriden to CPU
-llm_load_tensors: offloading 94 repeating layers to GPU
-llm_load_tensors: offloading non-repeating layers to GPU
-llm_load_tensors: offloaded 95/95 layers to GPU
-llm_load_tensors: CPU buffer size = 89709.28 MiB
-llm_load_tensors: CUDA_Host buffer size = 630.59 MiB
-llm_load_tensors: CUDA0 buffer size = 15775.66 MiB
-llm_load_tensors: CUDA1 buffer size = 3278.08 MiB
+llm_load_tensors: offloading 63 repeating layers to GPU
+llm_load_tensors: offloaded 63/95 layers to GPU
+llm_load_tensors: CPU buffer size = 36422.69 MiB
+llm_load_tensors: CPU buffer size = 37141.03 MiB
+llm_load_tensors: CPU buffer size = 35082.59 MiB
+llm_load_tensors: CPU buffer size = 36291.28 MiB
+llm_load_tensors: CPU buffer size = 1722.64 MiB
+llm_load_tensors: CUDA0 buffer size = 1808.69 MiB
+llm_load_tensors: CUDA1 buffer size = 1867.03 MiB
....................................................................................................
-============ Repacked 246 tensors
-llama_new_context_with_model: n_ctx = 32768
+=====================================================================
+ MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA
+=====================================================================
+llama_new_context_with_model: n_ctx = 65536
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
@@ -5278,73 +3873,101 @@ llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
-llama_kv_cache_init: CUDA0 KV buffer size = 1598.02 MiB
-llama_kv_cache_init: CUDA1 KV buffer size = 1598.02 MiB
-llama_new_context_with_model: KV self size = 3196.00 MiB, K (q8_0): 1598.00 MiB, V (q8_0): 1598.00 MiB
+ggml_cuda_host_malloc: failed to allocate 3038.00 MiB of pinned memory: invalid argument
+llama_kv_cache_init: CPU KV buffer size = 3038.00 MiB
+llama_kv_cache_init: CUDA0 KV buffer size = 3038.02 MiB
+llama_kv_cache_init: CUDA1 KV buffer size = 3136.02 MiB
+llama_new_context_with_model: KV self size = 9212.00 MiB, K (q8_0): 3196.00 MiB, V (f16): 6016.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
-llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
-llama_new_context_with_model: CUDA0 compute buffer size = 2096.02 MiB
-llama_new_context_with_model: CUDA1 compute buffer size = 2502.00 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 576.05 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 3068.61 MiB
+llama_new_context_with_model: CUDA1 compute buffer size = 896.03 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 1088.05 MiB
llama_new_context_with_model: graph nodes = 3672
-llama_new_context_with_model: graph splits = 378
-INFO [ init] initializing slots | tid="135402454921216" timestamp=1749453354 n_slots=1
-INFO [ init] new slot | tid="135402454921216" timestamp=1749453354 id_slot=0 n_ctx_slot=32768
-INFO [ main] model loaded | tid="135402454921216" timestamp=1749453354
-INFO [ main] chat template | tid="135402454921216" timestamp=1749453354 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
-INFO [ main] HTTP server listening | tid="135402454921216" timestamp=1749453354 n_threads_http="111" port="10002" hostname="0.0.0.0"
-INFO [ update_slots] all slots are idle | tid="135402454921216" timestamp=1749453354
-INFO [ log_server_request] request | tid="135400345559040" timestamp=1749453357 remote_addr="172.17.0.3" remote_port=37824 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="135400337166336" timestamp=1749453360 remote_addr="172.17.0.3" remote_port=37832 status=200 method="GET" path="/v1/models" params={}
-INFO [ launch_slot_with_task] slot is processing task | tid="135402454921216" timestamp=1749453362 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453362 id_slot=0 id_task=0 p0=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453376 id_slot=0 id_task=0 p0=4096
-INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453389 id_slot=0 id_task=0 p0=8192
-INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453403 id_slot=0 id_task=0 p0=12288
-INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453418 id_slot=0 id_task=0 p0=16384
-INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453433 id_slot=0 id_task=0 p0=20480
-INFO [ print_timings] prompt eval time = 82402.70 ms / 21880 tokens ( 3.77 ms per token, 265.53 tokens per second) | tid="135402454921216" timestamp=1749453488 id_slot=0 id_task=0 t_prompt_processing=82402.695 n_prompt_tokens_processed=21880 t_token=3.7661195155393057 n_tokens_second=265.52529622969246
-INFO [ print_timings] generation eval time = 43959.36 ms / 547 runs ( 80.36 ms per token, 12.44 tokens per second) | tid="135402454921216" timestamp=1749453488 id_slot=0 id_task=0 t_token_generation=43959.358 n_decoded=547 t_token=80.36445703839122 n_tokens_second=12.443311842725272
-INFO [ print_timings] total time = 126362.05 ms | tid="135402454921216" timestamp=1749453488 id_slot=0 id_task=0 t_prompt_processing=82402.695 t_token_generation=43959.358 t_total=126362.05300000001
-INFO [ update_slots] slot released | tid="135402454921216" timestamp=1749453488 id_slot=0 id_task=0 n_ctx=32768 n_past=22426 n_system_tokens=0 n_cache_tokens=22426 truncated=false
-INFO [ update_slots] all slots are idle | tid="135402454921216" timestamp=1749453488
-INFO [ log_server_request] request | tid="135400328773632" timestamp=1749453488 remote_addr="172.17.0.3" remote_port=48428 status=200 method="POST" path="/v1/chat/completions" params={}
-INFO [ update_slots] all slots are idle | tid="135402454921216" timestamp=1749453488
-
+llama_new_context_with_model: graph splits = 595
+INFO [ init] initializing slots | tid="133181503258624" timestamp=1749452713 n_slots=1
+INFO [ init] new slot | tid="133181503258624" timestamp=1749452713 id_slot=0 n_ctx_slot=65536
+INFO [ main] model loaded | tid="133181503258624" timestamp=1749452713
+INFO [ main] chat template | tid="133181503258624" timestamp=1749452713 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
+INFO [ main] HTTP server listening | tid="133181503258624" timestamp=1749452713 n_threads_http="111" port="10002" hostname="0.0.0.0"
+INFO [ update_slots] all slots are idle | tid="133181503258624" timestamp=1749452713
+INFO [ log_server_request] request | tid="133179400769536" timestamp=1749452792 remote_addr="172.17.0.3"
+INFO [ launch_slot_with_task] slot is processing task | tid="133181503258624" timestamp=1749452800 id_slot=0 id_task=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452800 id_slot=0 id_task=0 p0=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452820 id_slot=0 id_task=0 p0=4096
+INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452834 id_slot=0 id_task=0 p0=8192
+INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452848 id_slot=0 id_task=0 p0=12288
+INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452862 id_slot=0 id_task=0 p0=16384
+INFO [ update_slots] kv cache rm [p0, end) | tid="133181503258624" timestamp=1749452877 id_slot=0 id_task=0 p0=20480
+.INFO [ print_timings] prompt eval time = 89767.54 ms / 21880 tokens ( 4.10 ms per token, 243.74 tokens per second) | tid="133181503258624" timestamp=1749452963 id_slot=0 id_task=0 t_prompt_processing=89767.542 n_prompt_tokens_processed=21880 t_token=4.1027212979890315 n_tokens_second=243.74066073904527
+INFO [ print_timings] generation eval time = 72821.50 ms / 563 runs ( 129.35 ms per token, 7.73 tokens per second) | tid="133181503258624" timestamp=1749452963 id_slot=0 id_task=0 t_token_generation=72821.5 n_decoded=563 t_token=129.34547069271758 n_tokens_second=7.731233220958097
+INFO [ print_timings] total time = 162589.04 ms | tid="133181503258624" timestamp=1749452963 id_slot=0 id_task=0 t_prompt_processing=89767.542 t_token_generation=72821.5 t_total=162589.04200000002
+INFO [ update_slots] slot released | tid="133181503258624" timestamp=1749452963 id_slot=0 id_task=0 n_ctx=65536 n_past=22442 n_system_tokens=0 n_cache_tokens=22442 truncated=false
+INFO [ update_slots] all slots are idle | tid="133181503258624" timestamp=1749452963
+INFO [ log_server_request] request | tid="133179285434368" timestamp=1749452963 remote_addr="172.17.0.3" remote_port=47690 status=200 method="POST" path="/v1/chat/completions" params={}
+INFO [ update_slots] all slots are idle | tid="133181503258624" timestamp=1749452963
```
---
-👤 **mtcl** commented the **2025-06-09** at **07:26:45**:
+👤 **ikawrakow** commented on **2025-06-09** at **07:17:58**
-> * You need `-ctv q8_0` to also have the V cache quantized.
->
-> * You need to change `--n-gpu-layers` to 100 (Qwen3 has more layers than DeepSeek
->
-> * Remove `-mla` (not applicable to any model other than DeepSeek)
->
-> * You don't need the `-amb` argument
->
->
-> Let's see what buffer sizes you get with that. Then we will know how many layers you can put on the GPU.
+* You need `-ctv q8_0` to also have the V cache quantized.
+* You need to change `--n-gpu-layers` to 100 (Qwen3 has more layers than DeepSeek
+* Remove `-mla` (not applicable to any model other than DeepSeek)
+* You don't need the `-amb` argument
-Here are the results (both of my GPUs have about 10GB VRAM occupied now)
+Let's see what buffer sizes you get with that. Then we will know how many layers you can put on the GPU.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-09** at **07:19:10**
+
+Oh, try changing threads to 56 from 57. 57 is a really strange number of threads.
+
+---
+
+👤 **saood06** commented on **2025-06-09** at **07:19:50**
+
+> [@saood06](https://github.com/saood06) Can you point me to the specific place whee the check is being made.
+
+@ikawrakow
+
+https://github.com/ikawrakow/ik_llama.cpp/blob/58f08e43859a942dcc4d585f04b729eb50603264/src/llama.cpp#L20758
+
+>But apart from this, I still think there is a bug because it does not make sense that the scheduler wants to allocate such insane amount of memory.
+
+Yes that doesn't make much sense to me either.
+
+> I haven't come around to look why that happens.
+
+If you ever do I'd be interested to hear anything you find out.
+
+---
+
+👤 **mtcl** commented on **2025-06-09** at **07:20:00**
+
+ok i tried something before your comment, so I will post that here anyway, and then I will now try your settings in there.
+
+below is from what I was experimenting with 32k
```
-(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
+(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0,1" ./build/bin/llama-server \
+ --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
--alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
- --ctx-size 65536 \
- -ctk q8_0 -ctv q8_0 \
- -fa \
- -b 4096 -ub 4096 \
- -fmoe \
- --n-gpu-layers 100 \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 56 \
- --host 0.0.0.0 \
- --port 10002
+ -fa \
+ -ctk q8_0 -ctv q8_0 \
+ -c 32768 \
+ -fmoe \
+ -b 4096 -ub 4096 \
+ -amb 512 \
+ -rtr \
+ -ot blk\.1[2-9]\.ffn.*=CPU \
+ -ot blk\.[2-8][0-9]\.ffn.*=CPU \
+ -ot blk\.9[0-3]\.ffn.*=CPU \
+ -ngl 99 \
+ --threads 57 \
+ --host 0.0.0.0 \
+ --port 10002
```
```
@@ -5353,8 +3976,8 @@ ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="126884355235840" timestamp=1749453754 build=3737 commit="58f08e43"
-INFO [ main] system info | tid="126884355235840" timestamp=1749453754 n_threads=56 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+INFO [ main] build info | tid="135402454921216" timestamp=1749453237 build=3737 commit="58f08e43"
+INFO [ main] system info | tid="135402454921216" timestamp=1749453237 n_threads=57 n_threads_batch=-1 total_threads=112 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: additional 2 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 40 key-value pairs and 1131 tensors from /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
@@ -5458,346 +4081,473 @@ llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_ff_exp = 1536
llm_load_tensors: ggml ctx size = 1.49 MiB
-Tensor blk.0.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.0.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.0.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.2.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.2.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.12.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.13.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.14.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.17.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.18.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.19.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.20.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.21.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.22.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.23.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.24.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.25.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.48.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.48.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.49.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.49.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.50.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.50.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.51.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.51.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.52.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.52.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.53.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.53.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.54.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.54.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.55.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.55.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.56.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.56.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.57.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.57.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.58.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.58.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.59.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.59.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.60.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.60.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.61.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.61.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.61.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.61.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.62.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.62.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.62.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.62.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.63.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.63.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.63.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.63.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.64.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.64.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.64.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.64.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.65.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.65.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.65.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.65.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.66.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.66.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.66.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.66.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.67.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.67.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.67.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.67.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.68.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.68.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.68.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.68.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.69.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.69.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.69.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.69.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.70.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.70.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.70.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.70.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.71.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.71.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.71.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.71.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.72.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.72.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.72.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.72.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.73.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.73.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.73.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.73.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.74.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.74.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.74.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.74.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.75.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.75.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.75.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.75.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.76.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.76.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.76.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.76.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.77.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.77.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.77.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.77.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.78.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.78.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.78.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.78.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.79.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.79.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.79.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.79.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.80.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.80.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.80.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.80.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.81.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.81.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.81.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.81.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.82.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.82.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.82.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.82.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.83.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.83.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.83.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.83.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.84.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.84.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.84.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.84.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.85.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.85.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.85.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.85.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.86.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.86.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.86.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.86.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.87.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.87.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.87.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.87.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.88.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.88.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.88.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.88.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.89.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.89.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.89.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.89.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.90.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.90.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.90.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.90.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.91.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.91.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.92.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.92.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.93.ffn_norm.weight buffer type overriden to CPU
+Tensor blk.93.ffn_gate_inp.weight buffer type overriden to CPU
Tensor blk.93.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.93.ffn_up_exps.weight buffer type overriden to CPU
llm_load_tensors: offloading 94 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 95/95 layers to GPU
-llm_load_tensors: CPU buffer size = 36422.69 MiB
-llm_load_tensors: CPU buffer size = 37141.03 MiB
-llm_load_tensors: CPU buffer size = 35082.59 MiB
-llm_load_tensors: CPU buffer size = 630.59 MiB
-llm_load_tensors: CUDA0 buffer size = 2742.20 MiB
-llm_load_tensors: CUDA1 buffer size = 3372.81 MiB
+llm_load_tensors: CPU buffer size = 89709.28 MiB
+llm_load_tensors: CUDA_Host buffer size = 630.59 MiB
+llm_load_tensors: CUDA0 buffer size = 15775.66 MiB
+llm_load_tensors: CUDA1 buffer size = 3278.08 MiB
....................................................................................................
-llama_new_context_with_model: n_ctx = 65536
+============ Repacked 246 tensors
+llama_new_context_with_model: n_ctx = 32768
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
-llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
-llama_kv_cache_init: CUDA0 KV buffer size = 3196.02 MiB
-llama_kv_cache_init: CUDA1 KV buffer size = 3196.02 MiB
-llama_new_context_with_model: KV self size = 6392.00 MiB, K (q8_0): 3196.00 MiB, V (q8_0): 3196.00 MiB
+llama_kv_cache_init: CUDA0 KV buffer size = 1598.02 MiB
+llama_kv_cache_init: CUDA1 KV buffer size = 1598.02 MiB
+llama_new_context_with_model: KV self size = 3196.00 MiB, K (q8_0): 1598.00 MiB, V (q8_0): 1598.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
-llama_new_context_with_model: CUDA0 compute buffer size = 2432.02 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 2096.02 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 2502.00 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 1088.05 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 576.05 MiB
llama_new_context_with_model: graph nodes = 3672
-llama_new_context_with_model: graph splits = 238
-INFO [ init] initializing slots | tid="126884355235840" timestamp=1749453800 n_slots=1
-INFO [ init] new slot | tid="126884355235840" timestamp=1749453800 id_slot=0 n_ctx_slot=65536
-INFO [ main] model loaded | tid="126884355235840" timestamp=1749453800
-INFO [ main] chat template | tid="126884355235840" timestamp=1749453800 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
-INFO [ main] HTTP server listening | tid="126884355235840" timestamp=1749453800 n_threads_http="111" port="10002" hostname="0.0.0.0"
-INFO [ update_slots] all slots are idle | tid="126884355235840" timestamp=1749453800
-INFO [ log_server_request] request | tid="126882284560384" timestamp=1749453803 remote_addr="172.17.0.3" remote_port=55816 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="126882276167680" timestamp=1749453805 remote_addr="172.17.0.3" remote_port=55832 status=200 method="GET" path="/v1/models" params={}
-INFO [ launch_slot_with_task] slot is processing task | tid="126884355235840" timestamp=1749453805 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="126884355235840" timestamp=1749453805 id_slot=0 id_task=0 p0=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="126884355235840" timestamp=1749453821 id_slot=0 id_task=0 p0=4096
-INFO [ update_slots] kv cache rm [p0, end) | tid="126884355235840" timestamp=1749453835 id_slot=0 id_task=0 p0=8192
-INFO [ update_slots] kv cache rm [p0, end) | tid="126884355235840" timestamp=1749453849 id_slot=0 id_task=0 p0=12288
-INFO [ update_slots] kv cache rm [p0, end) | tid="126884355235840" timestamp=1749453863 id_slot=0 id_task=0 p0=16384
-INFO [ update_slots] kv cache rm [p0, end) | tid="126884355235840" timestamp=1749453877 id_slot=0 id_task=0 p0=20480
-INFO [ log_server_request] request | tid="126882183897088" timestamp=1749453927 remote_addr="172.17.0.3" remote_port=34580 status=200 method="GET" path="/v1/models" params={}
-INFO [ print_timings] prompt eval time = 83617.03 ms / 21880 tokens ( 3.82 ms per token, 261.67 tokens per second) | tid="126884355235840" timestamp=1749453936 id_slot=0 id_task=0 t_prompt_processing=83617.034 n_prompt_tokens_processed=21880 t_token=3.821619469835466 n_tokens_second=261.6691713796019
-INFO [ print_timings] generation eval time = 46598.42 ms / 473 runs ( 98.52 ms per token, 10.15 tokens per second) | tid="126884355235840" timestamp=1749453936 id_slot=0 id_task=0 t_token_generation=46598.424 n_decoded=473 t_token=98.51675264270612 n_tokens_second=10.150557881528353
-INFO [ print_timings] total time = 130215.46 ms | tid="126884355235840" timestamp=1749453936 id_slot=0 id_task=0 t_prompt_processing=83617.034 t_token_generation=46598.424 t_total=130215.458
-INFO [ update_slots] slot released | tid="126884355235840" timestamp=1749453936 id_slot=0 id_task=0 n_ctx=65536 n_past=22352 n_system_tokens=0 n_cache_tokens=22352 truncated=false
-INFO [ update_slots] all slots are idle | tid="126884355235840" timestamp=1749453936
-INFO [ log_server_request] request | tid="126882192289792" timestamp=1749453936 remote_addr="172.17.0.3" remote_port=55846 status=200 method="POST" path="/v1/chat/completions" params={}
-INFO [ update_slots] all slots are idle | tid="126884355235840" timestamp=1749453936
+llama_new_context_with_model: graph splits = 378
+INFO [ init] initializing slots | tid="135402454921216" timestamp=1749453354 n_slots=1
+INFO [ init] new slot | tid="135402454921216" timestamp=1749453354 id_slot=0 n_ctx_slot=32768
+INFO [ main] model loaded | tid="135402454921216" timestamp=1749453354
+INFO [ main] chat template | tid="135402454921216" timestamp=1749453354 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
+INFO [ main] HTTP server listening | tid="135402454921216" timestamp=1749453354 n_threads_http="111" port="10002" hostname="0.0.0.0"
+INFO [ update_slots] all slots are idle | tid="135402454921216" timestamp=1749453354
+INFO [ log_server_request] request | tid="135400345559040" timestamp=1749453357 remote_addr="172.17.0.3" remote_port=37824 status=200 method="GET" path="/v1/models" params={}
+INFO [ log_server_request] request | tid="135400337166336" timestamp=1749453360 remote_addr="172.17.0.3" remote_port=37832 status=200 method="GET" path="/v1/models" params={}
+INFO [ launch_slot_with_task] slot is processing task | tid="135402454921216" timestamp=1749453362 id_slot=0 id_task=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453362 id_slot=0 id_task=0 p0=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453376 id_slot=0 id_task=0 p0=4096
+INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453389 id_slot=0 id_task=0 p0=8192
+INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453403 id_slot=0 id_task=0 p0=12288
+INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453418 id_slot=0 id_task=0 p0=16384
+INFO [ update_slots] kv cache rm [p0, end) | tid="135402454921216" timestamp=1749453433 id_slot=0 id_task=0 p0=20480
+INFO [ print_timings] prompt eval time = 82402.70 ms / 21880 tokens ( 3.77 ms per token, 265.53 tokens per second) | tid="135402454921216" timestamp=1749453488 id_slot=0 id_task=0 t_prompt_processing=82402.695 n_prompt_tokens_processed=21880 t_token=3.7661195155393057 n_tokens_second=265.52529622969246
+INFO [ print_timings] generation eval time = 43959.36 ms / 547 runs ( 80.36 ms per token, 12.44 tokens per second) | tid="135402454921216" timestamp=1749453488 id_slot=0 id_task=0 t_token_generation=43959.358 n_decoded=547 t_token=80.36445703839122 n_tokens_second=12.443311842725272
+INFO [ print_timings] total time = 126362.05 ms | tid="135402454921216" timestamp=1749453488 id_slot=0 id_task=0 t_prompt_processing=82402.695 t_token_generation=43959.358 t_total=126362.05300000001
+INFO [ update_slots] slot released | tid="135402454921216" timestamp=1749453488 id_slot=0 id_task=0 n_ctx=32768 n_past=22426 n_system_tokens=0 n_cache_tokens=22426 truncated=false
+INFO [ update_slots] all slots are idle | tid="135402454921216" timestamp=1749453488
+INFO [ log_server_request] request | tid="135400328773632" timestamp=1749453488 remote_addr="172.17.0.3" remote_port=48428 status=200 method="POST" path="/v1/chat/completions" params={}
+INFO [ update_slots] all slots are idle | tid="135402454921216" timestamp=1749453488
+
```
---
-👤 **mtcl** commented the **2025-06-09** at **07:26:45**:
+👤 **mtcl** commented on **2025-06-09** at **07:26:45**
> * You need `-ctv q8_0` to also have the V cache quantized.
>
@@ -5810,7 +4560,7 @@ INFO [ update_slots] all slots are idle | tid="126884355235840" times
>
> Let's see what buffer sizes you get with that. Then we will know how many layers you can put on the GPU.
-Here are the results:
+Here are the results (both of my GPUs have about 10GB VRAM occupied now)
```
(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
@@ -6279,7 +5029,7 @@ INFO [ update_slots] all slots are idle | tid="126884355235840" times
---
-👤 **ikawrakow** commented the **2025-06-09** at **07:32:50**:
+👤 **ikawrakow** commented on **2025-06-09** at **07:32:50**
So, it looks like you can put about 15 layers on each GPU. Try adding before the CPU override
```
@@ -6290,7 +5040,7 @@ If it crashes with OOM, keep reducing the number of layers by 1 until it runs (b
---
-👤 **mtcl** commented the **2025-06-09** at **07:34:05**:
+👤 **mtcl** commented on **2025-06-09** at **07:34:05**
so something like this?
@@ -6353,32 +5103,7 @@ Segmentation fault (core dumped)
---
-👤 **mtcl** commented the **2025-06-09** at **07:34:05**:
-
-so something like this?
-
-```bash
-CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
- --model /media/mukul/backup/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
- --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
- --ctx-size 65536 \
- -ctk q8_0 -ctv q8_0 \
- -fa \
- -b 4096 -ub 4096 \
- -fmoe \
- --n-gpu-layers 100 \
- -ot "blk\.[0-9]\.ffn=CUDA0,blk\.1[0-4]\.ffn=CUDA0 \
- -ot "blk\.1[5-9]\.ffn=CUDA1,blk\.2[0-9]\.ffn=CUDA1 \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 56 \
- --host 0.0.0.0 \
- --port 10002
-```
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **07:37:16**:
+👤 **mtcl** commented on **2025-06-09** at **07:37:16**
> If it crashes with OOM, keep reducing the number of layers by 1 until it runs (but adjust the regex so the offloaded layers are consecutive).
@@ -6405,7 +5130,7 @@ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-server \
---
-👤 **ikawrakow** commented the **2025-06-09** at **07:41:21**:
+👤 **ikawrakow** commented on **2025-06-09** at **07:41:21**
Oh, sorry, it is missing the closing quotes. It must be
```
@@ -6422,7 +5147,7 @@ etc.
---
-👤 **saood06** commented the **2025-06-09** at **08:05:23**:
+👤 **saood06** commented on **2025-06-09** at **08:05:23**
> ```
> -ot "blk\.[0-9]\.ffn=CUDA0,blk\.1[0-3]\.ffn=CUDA0" \
@@ -6438,7 +5163,7 @@ Just to be clear the second line should be
---
-👤 **mtcl** commented the **2025-06-09** at **08:09:04**:
+👤 **mtcl** commented on **2025-06-09** at **08:09:04**
ok, i used qwen3 to help and i started reducing one by one and this is the one that works with 22GB per GPU:
@@ -6955,7 +5680,7 @@ INFO [ update_slots] all slots are idle | tid="137280187613184" times
---
-👤 **mtcl** commented the **2025-06-09** at **08:11:45**:
+👤 **mtcl** commented on **2025-06-09** at **08:11:45**
And this is when I sent 20K+ prompts to this, like this is insane for me! 329 tokens/second prefill and 13-14 tk/second on token generation.
@@ -6977,35 +5702,13 @@ INFO [ update_slots] all slots are idle | tid="137280187613184" times
---
-👤 **mtcl** commented the **2025-06-09** at **08:11:45**:
-
-And this is when I sent 20K+ prompts to this:
-
-```
-NFO [ update_slots] kv cache rm [p0, end) | tid="137280187613184" timestamp=1749456577 id_slot=0 id_task=4315 p0=4099
-INFO [ update_slots] kv cache rm [p0, end) | tid="137280187613184" timestamp=1749456588 id_slot=0 id_task=4315 p0=8195
-INFO [ update_slots] kv cache rm [p0, end) | tid="137280187613184" timestamp=1749456599 id_slot=0 id_task=4315 p0=12291
-INFO [ update_slots] kv cache rm [p0, end) | tid="137280187613184" timestamp=1749456610 id_slot=0 id_task=4315 p0=16387
-INFO [ update_slots] kv cache rm [p0, end) | tid="137280187613184" timestamp=1749456622 id_slot=0 id_task=4315 p0=20483
-INFO [ print_timings] prompt eval time = 66324.27 ms / 21877 tokens ( 3.03 ms per token, 329.85 tokens per second) | tid="137280187613184" timestamp=1749456668 id_slot=0 id_task=4315 t_prompt_processing=66324.269 n_prompt_tokens_processed=21877 t_token=3.0316893998263015 n_tokens_second=329.8490933989789
-INFO [ print_timings] generation eval time = 35943.26 ms / 476 runs ( 75.51 ms per token, 13.24 tokens per second) | tid="137280187613184" timestamp=1749456668 id_slot=0 id_task=4315 t_token_generation=35943.258 n_decoded=476 t_token=75.5110462184874 n_tokens_second=13.243095547988442
-INFO [ print_timings] total time = 102267.53 ms | tid="137280187613184" timestamp=1749456668 id_slot=0 id_task=4315 t_prompt_processing=66324.269 t_token_generation=35943.258 t_total=102267.527
-INFO [ update_slots] slot released | tid="137280187613184" timestamp=1749456668 id_slot=0 id_task=4315 n_ctx=65536 n_past=22355 n_system_tokens=0 n_cache_tokens=22355 truncated=false
-INFO [ update_slots] all slots are idle | tid="137280187613184" timestamp=1749456668
-INFO [ log_server_request] request | tid="137278001233920" timestamp=1749456668 remote_addr="172.17.0.3" remote_port=43228 status=200 method="POST" path="/v1/chat/completions" params={}
-INFO [ update_slots] all slots are idle | tid="137280187613184" timestamp=1749456668
-
-```
-
----
-
-👤 **ikawrakow** commented the **2025-06-09** at **08:23:01**:
+👤 **ikawrakow** commented on **2025-06-09** at **08:23:01**
You may want to try the `iq4_ks_r4` model from the same @ubergarm HF repository later (this will give you another video for your channel 😄 ). DeepSeek should run with the same command you used for `iq3_k_r4`. For Qwen3 you may need to reduce number of layers (but start with the 12 layers you used here and only reduce if necessary). Prefill is likely to be better. TG is not easy to predict. The `iq4_ks` matrix multiplication kernel is faster, but there is more data to be fetched from RAM, so one needs to try to see what happens.
---
-👤 **mtcl** commented the **2025-06-09** at **08:27:17**:
+👤 **mtcl** commented on **2025-06-09** at **08:27:17**
> You may want to try the `iq4_ks_r4` model from the same @ubergarm HF repository later (this will give you another video for your channel 😄 ). DeepSeek should run with the same command you used for `iq3_k_r4`. For Qwen3 you may need to reduce number of layers (but start with the 12 layers you used here and only reduce if necessary). Prefill is likely to be better. TG is not easy to predict. The `iq4_ks` matrix multiplication kernel is faster, but there is more data to be fetched from RAM, so one needs to try to see what happens.
@@ -7016,7 +5719,7 @@ If I want to make my own quants, is there a guide out there for it? Like I want
---
-👤 **ikawrakow** commented the **2025-06-09** at **08:38:44**:
+👤 **ikawrakow** commented on **2025-06-09** at **08:38:44**
To make you own quants, you need an imatrix file. You can get those from ubergarm, Bartowsi or Unsloth. Then you use
```
@@ -7032,7 +5735,7 @@ will change all tensors with names that match the regular expression `attn` to u
---
-👤 **mtcl** commented the **2025-06-09** at **09:10:42**:
+👤 **mtcl** commented on **2025-06-09** at **09:10:42**
Thank you for the details on the Quants!
@@ -7438,25 +6141,25 @@ Aborted (core dumped)
---
-👤 **mtcl** commented the **2025-06-09** at **09:11:49**:
+👤 **mtcl** commented on **2025-06-09** at **09:11:49**
once it hit the max context, instead of gracefully stopping, it quit.
---
-👤 **ikawrakow** commented the **2025-06-09** at **09:16:26**:
+👤 **ikawrakow** commented on **2025-06-09** at **09:16:26**
As the message tells you, context shifting is not supported for DeepSeek. You started the server with a context of 40960 tokens (which is the Qwen3 maximum context size), and then tried to have a prompt with more than 40k tokens.
---
-👤 **ikawrakow** commented the **2025-06-09** at **09:18:53**:
+👤 **ikawrakow** commented on **2025-06-09** at **09:18:53**
It is more tricky to do context shifting with MLA, so that's not implemented.
---
-👤 **saood06** commented the **2025-06-09** at **09:19:00**:
+👤 **saood06** commented on **2025-06-09** at **09:19:00**
> once it hit the max context, instead of gracefully stopping, it quit.
@@ -7466,7 +6169,7 @@ I'm not sure if mainline ever fixed it, but I am just very careful about not hit
---
-👤 **mtcl** commented the **2025-06-09** at **09:23:43**:
+👤 **mtcl** commented on **2025-06-09** at **09:23:43**
OK, i downloaded the IQ4 of @ubergarm and ran this, it worked with shorter prompt, but with 10K prompt it failed for me, can you please take a look here too?
@@ -7875,7 +6578,7 @@ Aborted (core dumped)
---
-👤 **ikawrakow** commented the **2025-06-09** at **09:29:48**:
+👤 **ikawrakow** commented on **2025-06-09** at **09:29:48**
Remove
```
@@ -7885,7 +6588,7 @@ Remove
---
-👤 **ikawrakow** commented the **2025-06-09** at **09:38:22**:
+👤 **ikawrakow** commented on **2025-06-09** at **09:38:22**
> I'm not sure if mainline ever fixed it, but I am just very careful about not hitting it. (My frontend uses the tokenize endpoint so I can know exactly how many tokens I am at all times [even before I send it])
@@ -7895,7 +6598,7 @@ Strange that there are no issues related to that. Maybe I'm just missing somethi
---
-👤 **mtcl** commented the **2025-06-09** at **09:39:11**:
+👤 **mtcl** commented on **2025-06-09** at **09:39:11**
I was also able to do 32k with -b 2048 -ub 2048. I will try your option after this
@@ -7980,7 +6683,7 @@ INFO [ update_slots] all slots are idle | tid="140240679325696" times
---
-👤 **mtcl** commented the **2025-06-09** at **09:42:24**:
+👤 **mtcl** commented on **2025-06-09** at **09:42:24**
And this is with 64K context.
@@ -8337,95 +7040,45 @@ llama_new_context_with_model: KV self size = 2333.25 MiB, c^KV (q8_0): 2333.25
llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model: CUDA0 compute buffer size = 7688.02 MiB
-llama_new_context_with_model: CUDA1 compute buffer size = 6992.03 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 1136.05 MiB
-llama_new_context_with_model: graph nodes = 13613
-llama_new_context_with_model: graph splits = 149
-INFO [ init] initializing slots | tid="128557749198848" timestamp=1749461834 n_slots=1
-INFO [ init] new slot | tid="128557749198848" timestamp=1749461834 id_slot=0 n_ctx_slot=65536
-INFO [ main] model loaded | tid="128557749198848" timestamp=1749461834
-INFO [ main] chat template | tid="128557749198848" timestamp=1749461834 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-INFO [ main] HTTP server listening | tid="128557749198848" timestamp=1749461834 n_threads_http="111" port="10002" hostname="0.0.0.0"
-INFO [ update_slots] all slots are idle | tid="128557749198848" timestamp=1749461834
-INFO [ log_server_request] request | tid="128291451105280" timestamp=1749461834 remote_addr="172.17.0.3" remote_port=44606 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="128291442712576" timestamp=1749461834 remote_addr="172.17.0.3" remote_port=44614 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="128555551813632" timestamp=1749461843 remote_addr="172.17.0.3" remote_port=51610 status=200 method="GET" path="/v1/models" params={}
-INFO [ log_server_request] request | tid="128555543420928" timestamp=1749461855 remote_addr="172.17.0.3" remote_port=39760 status=200 method="GET" path="/v1/models" params={}
-INFO [ launch_slot_with_task] slot is processing task | tid="128557749198848" timestamp=1749461855 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="128557749198848" timestamp=1749461855 id_slot=0 id_task=0 p0=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="128557749198848" timestamp=1749461910 id_slot=0 id_task=0 p0=4096
-
-
-INFO [ print_timings] prompt eval time = 109691.87 ms / 7074 tokens ( 15.51 ms per token, 64.49 tokens per second) | tid="128557749198848" timestamp=1749462092 id_slot=0 id_task=0 t_prompt_processing=109691.87 n_prompt_tokens_processed=7074 t_token=15.506342945999434 n_tokens_second=64.48973839173314
-INFO [ print_timings] generation eval time = 127046.30 ms / 1118 runs ( 113.64 ms per token, 8.80 tokens per second) | tid="128557749198848" timestamp=1749462092 id_slot=0 id_task=0 t_token_generation=127046.301 n_decoded=1118 t_token=113.63712075134168 n_tokens_second=8.799941369406733
-INFO [ print_timings] total time = 236738.17 ms | tid="128557749198848" timestamp=1749462092 id_slot=0 id_task=0 t_prompt_processing=109691.87 t_token_generation=127046.301 t_total=236738.171
-INFO [ update_slots] slot released | tid="128557749198848" timestamp=1749462092 id_slot=0 id_task=0 n_ctx=65536 n_past=8191 n_system_tokens=0 n_cache_tokens=8191 truncated=false
-INFO [ update_slots] all slots are idle | tid="128557749198848" timestamp=1749462092
-INFO [ log_server_request] request | tid="128555535028224" timestamp=1749462092 remote_addr="172.17.0.3" remote_port=39770 status=200 method="POST" path="/v1/chat/completions" params={}
-INFO [ update_slots] all slots are idle | tid="128557749198848" timestamp=1749462092
-```
-
----
-
-👤 **saood06** commented the **2025-06-09** at **09:42:30**:
-
->Strange that there are no issues related to that. Maybe I'm just missing something and it does work? Or maybe it is just that with the mainline snail speed it is very hard to arrive at the situation where context shifting is needed.
-
-I thought there was in mainline, this has been an issue for as long as I can remember, but once I experienced it, I just became vigilant about not hitting the limit. (why I liked #290 as it is a QoL feature as you can overallocate KV cache without being punished with a crash, just degraded performance once you cross the threshold)
-
----
-
-👤 **saood06** commented the **2025-06-09** at **09:42:30**:
-
->Strange that there are no issues related to that. Maybe I'm just missing something and it does work? Or maybe it is just that with the mainline snail speed it is very hard to arrive at the situation where context shifting is needed.
-
-I thought there was in mainline, this has been an issue for as long as I can remember, but once I experienced it, I just became vigilant about not hitting the limit. (#290 is a QoL feature as you can overallocate KV cache without being punished with a crash, just degraded performance)
-
----
-
-👤 **ubergarm** commented the **2025-06-09** at **14:22:17**:
-
-@mtcl
-
-Looks like y'all had a busy day! Glad to see you managed to achieve much improved speeds learning the commands to match your hardware.
-
-If you want a more visible and understandable benchmark of speeds for a given configuration, you can change out the binary from `llama-server` to `llama-sweep-bench` and run it e.g.:
-
-```bash
-(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-sweep-bench \
- --model /home/mukul/dev-ai/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ4_KS_R4/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf \
- --alias ubergarm/DeepSeek-R1-0528-IQ4_KS_R4 \
- --ctx-size 65536 \
- -ctk q8_0 \
- -mla 3 -fa \
- -b 4096 -ub 4096 \
- -amb 512 \
- -fmoe \
- --n-gpu-layers 63 \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 57 \
- --host 0.0.0.0 \
- --port 10002
-```
-
-You can remove `--alias` `--parallel` `--host` and `--port` but its probably fine to just leave them there as well (to keep it simple) as they are not used for `llama-sweep-bench`.
+llama_new_context_with_model: CUDA1 compute buffer size = 6992.03 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 1136.05 MiB
+llama_new_context_with_model: graph nodes = 13613
+llama_new_context_with_model: graph splits = 149
+INFO [ init] initializing slots | tid="128557749198848" timestamp=1749461834 n_slots=1
+INFO [ init] new slot | tid="128557749198848" timestamp=1749461834 id_slot=0 n_ctx_slot=65536
+INFO [ main] model loaded | tid="128557749198848" timestamp=1749461834
+INFO [ main] chat template | tid="128557749198848" timestamp=1749461834 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
+INFO [ main] HTTP server listening | tid="128557749198848" timestamp=1749461834 n_threads_http="111" port="10002" hostname="0.0.0.0"
+INFO [ update_slots] all slots are idle | tid="128557749198848" timestamp=1749461834
+INFO [ log_server_request] request | tid="128291451105280" timestamp=1749461834 remote_addr="172.17.0.3" remote_port=44606 status=200 method="GET" path="/v1/models" params={}
+INFO [ log_server_request] request | tid="128291442712576" timestamp=1749461834 remote_addr="172.17.0.3" remote_port=44614 status=200 method="GET" path="/v1/models" params={}
+INFO [ log_server_request] request | tid="128555551813632" timestamp=1749461843 remote_addr="172.17.0.3" remote_port=51610 status=200 method="GET" path="/v1/models" params={}
+INFO [ log_server_request] request | tid="128555543420928" timestamp=1749461855 remote_addr="172.17.0.3" remote_port=39760 status=200 method="GET" path="/v1/models" params={}
+INFO [ launch_slot_with_task] slot is processing task | tid="128557749198848" timestamp=1749461855 id_slot=0 id_task=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="128557749198848" timestamp=1749461855 id_slot=0 id_task=0 p0=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="128557749198848" timestamp=1749461910 id_slot=0 id_task=0 p0=4096
-Then you can see how the speed drops with longer context and better understand the consequences of long context etc.
-> Would you be able to post a guide on how to make the IQ4 version of the Qwen Model?
+INFO [ print_timings] prompt eval time = 109691.87 ms / 7074 tokens ( 15.51 ms per token, 64.49 tokens per second) | tid="128557749198848" timestamp=1749462092 id_slot=0 id_task=0 t_prompt_processing=109691.87 n_prompt_tokens_processed=7074 t_token=15.506342945999434 n_tokens_second=64.48973839173314
+INFO [ print_timings] generation eval time = 127046.30 ms / 1118 runs ( 113.64 ms per token, 8.80 tokens per second) | tid="128557749198848" timestamp=1749462092 id_slot=0 id_task=0 t_token_generation=127046.301 n_decoded=1118 t_token=113.63712075134168 n_tokens_second=8.799941369406733
+INFO [ print_timings] total time = 236738.17 ms | tid="128557749198848" timestamp=1749462092 id_slot=0 id_task=0 t_prompt_processing=109691.87 t_token_generation=127046.301 t_total=236738.171
+INFO [ update_slots] slot released | tid="128557749198848" timestamp=1749462092 id_slot=0 id_task=0 n_ctx=65536 n_past=8191 n_system_tokens=0 n_cache_tokens=8191 truncated=false
+INFO [ update_slots] all slots are idle | tid="128557749198848" timestamp=1749462092
+INFO [ log_server_request] request | tid="128555535028224" timestamp=1749462092 remote_addr="172.17.0.3" remote_port=39770 status=200 method="POST" path="/v1/chat/completions" params={}
+INFO [ update_slots] all slots are idle | tid="128557749198848" timestamp=1749462092
+```
-I have posted a [quant cookers guide here](https://github.com/ikawrakow/ik_llama.cpp/discussions/434) to help people get started and show some examples. As ik mentions, feel free to use an existing imatrix file from myself, unsloth, or bartowski etc. Or you can make your own using the instructions I provided.
+---
-If you check my huggingface repo's I list some of my "secret recipes", which you can use as a starting point for your mixes.
+👤 **saood06** commented on **2025-06-09** at **09:42:30**
-The guide does not discuss how to convert DeepSeek fp8 to bf16 GGUF. That is an extra first step only for DeepSeek safetensors. You can find some of that buried in my original guide in a details fold about `triton-cpu` and the evshiron llama.cpp fork. Give you have newer GPUs you might be able to cast it from fp8 directly on GPU with the "normal" way, but I've never done that myself.
+>Strange that there are no issues related to that. Maybe I'm just missing something and it does work? Or maybe it is just that with the mainline snail speed it is very hard to arrive at the situation where context shifting is needed.
-Enjoy your setup and new GPUs and good luck with your latest videos!
+I thought there was in mainline, this has been an issue for as long as I can remember, but once I experienced it, I just became vigilant about not hitting the limit. (why I liked [#290](https://github.com/ikawrakow/ik_llama.cpp/issues/290) as it is a QoL feature as you can overallocate KV cache without being punished with a crash, just degraded performance once you cross the threshold)
---
-👤 **ubergarm** commented the **2025-06-09** at **14:22:17**:
+👤 **ubergarm** commented on **2025-06-09** at **14:22:17**
@mtcl
@@ -8459,13 +7112,15 @@ Then you can see how the speed drops with longer context and better understand t
I have posted a [quant cookers guide here](https://github.com/ikawrakow/ik_llama.cpp/discussions/434) to help people get started and show some examples. As ik mentions, feel free to use an existing imatrix file from myself, unsloth, or bartowski etc. Or you can make your own using the instructions I provided.
+If you check my huggingface repo's I list some of my "secret recipes", which you can use as a starting point for your mixes.
+
The guide does not discuss how to convert DeepSeek fp8 to bf16 GGUF. That is an extra first step only for DeepSeek safetensors. You can find some of that buried in my original guide in a details fold about `triton-cpu` and the evshiron llama.cpp fork. Give you have newer GPUs you might be able to cast it from fp8 directly on GPU with the "normal" way, but I've never done that myself.
Enjoy your setup and new GPUs and good luck with your latest videos!
---
-👤 **ikawrakow** commented the **2025-06-09** at **15:19:10**:
+👤 **ikawrakow** commented on **2025-06-09** at **15:19:10**
@ubergarm
@@ -8473,7 +7128,7 @@ Btw, what type did you use for the `output.weight` tensor in these models? The H
---
-👤 **ubergarm** commented the **2025-06-09** at **15:32:13**:
+👤 **ubergarm** commented on **2025-06-09** at **15:32:13**
> Btw, what type did you use for the output.weight tensor in these models? The HF model browser does not work with them and don't feel like downloading 50 GB to check. Or more general, did you use IQ6_K for any of your published models?
@@ -8487,13 +7142,13 @@ If you need more specifics I can run a gguf-dump on anything.
---
-👤 **ikawrakow** commented the **2025-06-09** at **15:39:13**:
+👤 **ikawrakow** commented on **2025-06-09** at **15:39:13**
No need, thanks. It is just that as we were running these experiments I thought that the compute buffers were larger than I was expecting them to be, and one hypothesis I had was that the output tensor was `IQ6_K` and that `IQ6_K` does not have MMQ, so needs to be dequantized to `f16`, and that increases the compute buffer quite a bit. But I just checked, `IQ6_K` does have MMQ, so that's not it.
---
-👤 **mtcl** commented the **2025-06-09** at **19:18:33**:
+👤 **mtcl** commented on **2025-06-09** at **19:18:33**
> [@mtcl](https://github.com/mtcl)
@@ -8722,128 +7377,7 @@ main: n_kv_max = 40960, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_l
---
-👤 **mtcl** commented the **2025-06-09** at **19:18:33**:
-
-> [@mtcl](https://github.com/mtcl)
->
-> Looks like y'all had a busy day! Glad to see you managed to achieve much improved speeds learning the commands to match your hardware.
->
-> If you want a more visible and understandable benchmark of speeds for a given configuration, you can change out the binary from `llama-server` to `llama-sweep-bench` and run it e.g.:
->
-> (base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-sweep-bench \
-> --model /home/mukul/dev-ai/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ4_KS_R4/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf \
-> --alias ubergarm/DeepSeek-R1-0528-IQ4_KS_R4 \
-> --ctx-size 65536 \
-> -ctk q8_0 \
-> -mla 3 -fa \
-> -b 4096 -ub 4096 \
-> -amb 512 \
-> -fmoe \
-> --n-gpu-layers 63 \
-> --override-tensor exps=CPU \
-> --parallel 1 \
-> --threads 57 \
-> --host 0.0.0.0 \
-> --port 10002
->
-> You can remove `--alias` `--parallel` `--host` and `--port` but its probably fine to just leave them there as well (to keep it simple) as they are not used for `llama-sweep-bench`.
->
-> Then you can see how the speed drops with longer context and better understand the consequences of long context etc.
->
-> > Would you be able to post a guide on how to make the IQ4 version of the Qwen Model?
->
-> I have posted a [quant cookers guide here](https://github.com/ikawrakow/ik_llama.cpp/discussions/434) to help people get started and show some examples. As ik mentions, feel free to use an existing imatrix file from myself, unsloth, or bartowski etc. Or you can make your own using the instructions I provided.
->
-> If you check my huggingface repo's I list some of my "secret recipes", which you can use as a starting point for your mixes.
->
-> The guide does not discuss how to convert DeepSeek fp8 to bf16 GGUF. That is an extra first step only for DeepSeek safetensors. You can find some of that buried in my original guide in a details fold about `triton-cpu` and the evshiron llama.cpp fork. Give you have newer GPUs you might be able to cast it from fp8 directly on GPU with the "normal" way, but I've never done that myself.
->
-> Enjoy your setup and new GPUs and good luck with your latest videos!
-
-Thank you @ubergarm !
-
-Now I need to learn how to read these outputs :)
-
-```
-(base) mukul@jarvis:~/dev-ai/ik_llama.cpp$ CUDA_VISIBLE_DEVICES="0, 1" ./build/bin/llama-sweep-bench \
- --model /home/mukul/dev-ai/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ4_KS_R4/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf \
- --alias ubergarm/DeepSeek-R1-0528-IQ4_KS_R4 \
- --ctx-size 65536 \
- -ctk q8_0 \
- -mla 3 -fa \
- -b 4096 -ub 4096 \
- -amb 512 \
- -fmoe \
- --n-gpu-layers 63 \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 57 \
- --host 0.0.0.0 \
- --port 10002
-```
-```
-llm_load_tensors: offloading 61 repeating layers to GPU
-llm_load_tensors: offloading non-repeating layers to GPU
-llm_load_tensors: offloaded 62/62 layers to GPU
-llm_load_tensors: CPU buffer size = 38317.39 MiB
-llm_load_tensors: CPU buffer size = 42582.45 MiB
-llm_load_tensors: CPU buffer size = 40481.67 MiB
-llm_load_tensors: CPU buffer size = 42840.67 MiB
-llm_load_tensors: CPU buffer size = 40481.67 MiB
-llm_load_tensors: CPU buffer size = 42840.67 MiB
-llm_load_tensors: CPU buffer size = 40481.67 MiB
-llm_load_tensors: CPU buffer size = 42840.67 MiB
-llm_load_tensors: CPU buffer size = 41420.65 MiB
-llm_load_tensors: CPU buffer size = 938.98 MiB
-llm_load_tensors: CUDA0 buffer size = 9056.64 MiB
-llm_load_tensors: CUDA1 buffer size = 8687.38 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 65536
-llama_new_context_with_model: n_batch = 4096
-llama_new_context_with_model: n_ubatch = 4096
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 3
-llama_new_context_with_model: attn_max_b = 512
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: CUDA0 KV buffer size = 1185.77 MiB
-llama_kv_cache_init: CUDA1 KV buffer size = 1147.51 MiB
-llama_new_context_with_model: KV self size = 2333.25 MiB, c^KV (q8_0): 2333.25 MiB, kv^T: not used
-llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
-llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
-llama_new_context_with_model: CUDA0 compute buffer size = 7688.02 MiB
-llama_new_context_with_model: CUDA1 compute buffer size = 6992.03 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 1136.05 MiB
-llama_new_context_with_model: graph nodes = 13613
-llama_new_context_with_model: graph splits = 149
-
-main: n_kv_max = 65536, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 63, n_threads = 57, n_threads_batch = 57
-
-| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
-|-------|--------|--------|----------|----------|----------|----------|
-| 4096 | 1024 | 0 | 40.839 | 100.30 | 124.582 | 8.22 |
-| 4096 | 1024 | 4096 | 40.847 | 100.28 | 112.796 | 9.08 |
-| 4096 | 1024 | 8192 | 41.224 | 99.36 | 116.865 | 8.76 |
-| 4096 | 1024 | 12288 | 41.860 | 97.85 | 115.780 | 8.84 |
-| 4096 | 1024 | 16384 | 42.717 | 95.89 | 110.798 | 9.24 |
-| 4096 | 1024 | 20480 | 43.358 | 94.47 | 119.139 | 8.59 |
-| 4096 | 1024 | 24576 | 44.067 | 92.95 | 118.138 | 8.67 |
-| 4096 | 1024 | 28672 | 44.897 | 91.23 | 120.028 | 8.53 |
-| 4096 | 1024 | 32768 | 46.109 | 88.83 | 116.720 | 8.77 |
-| 4096 | 1024 | 36864 | 47.268 | 86.65 | 119.693 | 8.56 |
-| 4096 | 1024 | 40960 | 48.326 | 84.76 | 124.217 | 8.24 |
-| 4096 | 1024 | 45056 | 47.720 | 85.83 | 122.807 | 8.34 |
-| 4096 | 1024 | 49152 | 48.337 | 84.74 | 129.565 | 7.90 |
-| 4096 | 1024 | 53248 | 49.039 | 83.53 | 128.600 | 7.96 |
-| 4096 | 1024 | 57344 | 49.896 | 82.09 | 119.462 | 8.57 |
-| 4096 | 1024 | 61440 | 51.657 | 79.29 | 130.716 | 7.83 |
-```
-
----
-
-👤 **mtcl** commented the **2025-06-09** at **20:58:25**:
+👤 **mtcl** commented on **2025-06-09** at **20:58:25**
@ubergarm or @ikawrakow a question for you, if I don't want to cook my own quant, what is the easiest way to find the quant on huggingface that will be most compatible with ik_llama?
@@ -8851,7 +7385,7 @@ Do all these parameters also work with q4_k_m if i already have some q4_k_m mode
---
-👤 **ubergarm** commented the **2025-06-09** at **21:40:36**:
+👤 **ubergarm** commented on **2025-06-09** at **21:40:36**
@mtcl
@@ -8881,7 +7415,7 @@ Again for Qwen3-235B-A22B you see how the big increase in prompt processing does
---
-👤 **ubergarm** commented the **2025-06-09** at **21:48:00**:
+👤 **ubergarm** commented on **2025-06-09** at **21:48:00**
> if I don't want to cook my own quant, what is the easiest way to find the quant on huggingface that will be most compatible with ik_llama?
@@ -8897,13 +7431,13 @@ Cheers!
---
-👤 **mtcl** commented the **2025-06-09** at **23:55:27**:
+👤 **mtcl** commented on **2025-06-09** at **23:55:27**
Hmm, I'm struggling with this 5090 and ik_llama not sure what's going wrong here. It works fine with 4090 but crashes on 5090.
---
-👤 **mtcl** commented the **2025-06-10** at **00:30:44**:
+👤 **mtcl** commented on **2025-06-10** at **00:30:44**
5090 errors out:
```
@@ -9629,13 +8163,13 @@ INFO [ update_slots] all slots are idle | tid="128766126837760" times
---
-👤 **mtcl** commented the **2025-06-16** at **22:38:57**:
+👤 **mtcl** commented on **2025-06-16** at **22:38:57**
How do i check if my AMX enabled processor is using its "AMX capabilities"? Is there anyway to perform a build on mainline with any specific parameter that enabled AMX and then I can run a comparison between IK and mainline?
---
-👤 **ubergarm** commented the **2025-06-16** at **22:56:24**:
+👤 **ubergarm** commented on **2025-06-16** at **22:56:24**
@mtcl
@@ -9658,7 +8192,7 @@ I forget your exact rig specs, besides 2x5090s 😛 , but as the linked disussio
---
-👤 **mtcl** commented the **2025-06-16** at **23:13:18**:
+👤 **mtcl** commented on **2025-06-16** at **23:13:18**
Hey Thank you for the reply! I checked `lscpu | grep -i amx`, I have all these three these flags `amx_bf16 amx_tile amx_int8`. But how do i make sure that i am compiling the mainline with the correct AMX extensions in the compiled library? Should I use something like this? I do not even know if this is a flag, but i did that anyway `-DGGML_USE_AMX` because I saw it somewhere and cannot locate it anymore.
@@ -9675,7 +8209,7 @@ And i have 2x5090 + 2x4090s now. I was going to sell 4090s but then i could not
---
-👤 **ubergarm** commented the **2025-06-17** at **00:46:02**:
+👤 **ubergarm** commented on **2025-06-17** at **00:46:02**
@mtcl
@@ -9697,7 +8231,7 @@ It might be slower.
---
-👤 **mtcl** commented the **2025-06-17** at **00:56:37**:
+👤 **mtcl** commented on **2025-06-17** at **00:56:37**
> > I was going to sell 4090s but then i could not lol :)
>
@@ -9707,7 +8241,7 @@ I would love to hear your thoughts on how to effectively use this much of VRAM.
---
-👤 **SlavikCA** commented the **2025-07-06** at **05:01:34**:
+👤 **SlavikCA** commented on **2025-07-06** at **05:01:34**
I ran this model https://huggingface.co/unsloth/DeepSeek-TNG-R1T2-Chimera-GGUF
with UD-IQ2_M quants (213 GB)
@@ -9836,134 +8370,7 @@ But will it even more faster with AMX?
---
-👤 **SlavikCA** commented the **2025-07-06** at **05:01:34**:
-
-I ran this model https://huggingface.co/unsloth/DeepSeek-TNG-R1T2-Chimera-GGUF
-with UD-IQ2_M quants (213 GB)
-Both on llama.cpp (in Docker) and ik_llama.cpp
-
-System:
-- Ubuntu 24.04
-- Intel Xeon W5-3425 (12 cores, AMX)
-- 512GB DDR5-4800 (8 channels * 64GB), but my memory somehow still not working at the top speed.
-- RTX 4090D 48GB VRAM
-
-**llama.cpp:**
-```
-prompt eval time = 58561.21 ms / 1273 tokens ( 46.00 ms per token, 21.74 tokens per second)
- eval time = 371584.74 ms / 1566 tokens ( 237.28 ms per token, 4.21 tokens per second)
-```
-
-**ik_llama.cpp:**
-```
-prompt eval time = 21474.45 ms / 1265 tokens ( 16.98 ms per token, 58.91 tokens per second)
-generation eval time = 396856.15 ms / 1690 runs ( 234.83 ms per token, 4.26 tokens per second)
-```
-
-So, token generation is about the same, but prompt eval is almost 3x faster on llama.cpp and I think that's because of AMX. But I'm not sure how to confirm that.
-
-llama.cpp params:
-```
---model /models/UD-IQ2_M/DeepSeek-TNG-R1T2-Chimera-UD-IQ2_M-00001-of-00005.gguf
---ctx-size 32768
---cache-type-k q8_0
---cache-type-v q8_0
---flash-attn
---threads 12
---host 0.0.0.0 --port 37000
---temp 0.6 --top-p 0.95
---n-gpu-layers 999
---override-tensor "blk\.(3|4|5|6|7|8|9|10|11)\.ffn_.*=CUDA0"
---override-tensor exps=CPU
-```
-
-llama.cpp logs:
-```
-ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
-ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
- Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
-load_backend: loaded CUDA backend from /app/[libggml-cuda.so](http://libggml-cuda.so/)
-load_backend: loaded CPU backend from /app/[libggml-cpu-sapphirerapids.so](http://libggml-cpu-sapphirerapids.so/)
-build: 5830 (bac8bed2) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
-system_info: n_threads = 12 (n_threads_batch = 12) / 12
- | CUDA : ARCHS = 500,610,700,750,800,860,890
- | USE_GRAPHS = 1
- | PEER_MAX_BATCH_SIZE = 128
- | CPU : SSE3 = 1
- | SSSE3 = 1
- | AVX = 1
- | AVX2 = 1
- | F16C = 1
- | FMA = 1
- | BMI2 = 1
- | AVX512 = 1
- | AVX512_VBMI = 1
- | AVX512_VNNI = 1
- | AVX512_BF16 = 1
- | AMX_INT8 = 1
- | LLAMAFILE = 1
- | OPENMP = 1
- | REPACK = 1
-```
-
-ik_llama params:
-```
-./llama-server \
- --model /models/UD-IQ2_M/DeepSeek-TNG-R1T2-Chimera-UD-IQ2_M-00001-of-00005.gguf \
- --ctx-size 32768 \
- -b 4096 -ub 4096 \
- -ctk q8_0 -fa -mla 3 \
- -amb 512 \
- -fmoe \
- --temp 0.6 --top-p 0.95 \
- --n-gpu-layers 999 \
- --override-tensor "blk\.(3|4|5|6|7|8|9|10)\.ffn_.*=CUDA0" \
- --override-tensor exps=CPU \
- --parallel 1 \
- --threads 12 \
- --host 0.0.0.0 --port 41000
-```
-
-ik_llama logs:
-```
-gml_cuda_init: GGML_CUDA_FORCE_MMQ: no
-ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
-ggml_cuda_init: found 1 CUDA devices:
- Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="123756457574400" timestamp=1751776004 build=3787 commit="0678427f"
-INFO [ main] system info | tid="123756457574400" timestamp=1751776004 n_threads=12 n_threads_batch=-1 total_threads=12 system_info="
-| AVX = 1
-| AVX_VNNI = 1
-| AVX2 = 1
-| AVX512 = 1
-| AVX512_VBMI = 1
-| AVX512_VNNI = 1
-| AVX512_BF16 = 1
-| FMA = 1
-| NEON = 0
-| SVE = 0
-| ARM_FMA = 0
-| F16C = 1
-| FP16_VA = 0
-| WASM_SIMD = 0
-| BLAS = 1
-| SSE3 = 1
-| SSSE3 = 1
-| VSX = 0
-| MATMUL_INT8 = 0
-| LLAMAFILE = 1 | "
-llama_model_loader: additional 4 GGUFs metadata loaded.
-llama_model_loader: loaded meta data with 69 key-value pairs and 1086 tensors from /home/slavik/.cache/huggingface/hub/models--unsloth--DeepSeek-TNG-R1T2-Chimera-GGUF/snapshots/1703b3d3bc20c493045ecc8e521a12a62d3b83a6/UD-IQ2_M/DeepSeek-TNG-R1T2-Chimera-UD-IQ2_M-00001-of-00005.gguf (version GGUF V3 (latest))
-==========================================================================
-Detected incompatible DeepSeek model.
-Will try to fix, but there are no guarantees
-
-*** Your prompt processing speed will be crippled ***
-```
-
----
-
-👤 **ikawrakow** commented the **2025-07-06** at **05:24:45**:
+👤 **ikawrakow** commented on **2025-07-06** at **05:24:45**
> So, token generation is about the same, but prompt eval is almost 3x faster on llama.cpp and I think that's because of AMX. But I'm not sure how to confirm that.
@@ -9971,7 +8378,7 @@ You mean `ik_llama.cpp` is almost 3X faster than `llama.cpp`?
---
-👤 **SlavikCA** commented the **2025-07-06** at **05:29:39**:
+👤 **SlavikCA** commented on **2025-07-06** at **05:29:39**
🤦
You're right.
diff --git a/github-data/issues/440 - Feature Request_ Top n-sigma sampler.md b/github-data/issues/440 - Feature Request Top n-sigma sampler.md
similarity index 85%
rename from github-data/issues/440 - Feature Request_ Top n-sigma sampler.md
rename to github-data/issues/440 - Feature Request Top n-sigma sampler.md
index 2111dda79..69bd0d0cf 100644
--- a/github-data/issues/440 - Feature Request_ Top n-sigma sampler.md
+++ b/github-data/issues/440 - Feature Request Top n-sigma sampler.md
@@ -1,14 +1,15 @@
-### ✨ [#440](https://github.com/ikawrakow/ik_llama.cpp/issues/440) - Feature Request: Top n-sigma sampler
+## 📌 [Issue #440](https://github.com/ikawrakow/ik_llama.cpp/issues/440) - Feature Request: Top n-sigma sampler
| **Author** | `Ph0rk0z` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-05-20 |
| **Updated** | 2025-06-03 |
+| **Labels** | `enhancement` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -35,14 +36,14 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-20** at **15:47:10**:
+👤 **ikawrakow** commented on **2025-05-20** at **15:47:10**
So, the quoted PR just integrates it into the standard `llama.cpp` sampling mechanism. The actual sampler is implemented in their PR 11233. I looked at 11233, and it is a pretty trivial thing, so very easy to implement. I had never actually looked at the sampling code, but a quick check shows that it is not a copy/paste. Also this has been completely reorganized in mainline (they just love pushing pieces of code from here to there). Here sampling is part of `common`, over there it is now part of `llama.cpp` itself. So, adding a new sampler involves me first getting familiar with how sampling is done in this fork.
---
-👤 **Ph0rk0z** commented the **2025-06-03** at **13:58:36**:
+👤 **Ph0rk0z** commented on **2025-06-03** at **13:58:36**
https://github.com/ikawrakow/ik_llama.cpp/pull/489
\ No newline at end of file
diff --git a/github-data/issues/447 - Compilation Error_ Error C2676.md b/github-data/issues/447 - Compilation Error Error C2676.md
similarity index 75%
rename from github-data/issues/447 - Compilation Error_ Error C2676.md
rename to github-data/issues/447 - Compilation Error Error C2676.md
index 52d16559a..150f479ca 100644
--- a/github-data/issues/447 - Compilation Error_ Error C2676.md
+++ b/github-data/issues/447 - Compilation Error Error C2676.md
@@ -1,4 +1,4 @@
-### 📝 [#447](https://github.com/ikawrakow/ik_llama.cpp/issues/447) - Compilation Error: Error C2676
+## 📌 [Issue #447](https://github.com/ikawrakow/ik_llama.cpp/issues/447) - Compilation Error: Error C2676
| **Author** | `quasar-of-mikus` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
Got this when trying to compile the latest commit. The last time I ran a build was commit `2ec2229` and that was successful.
Windows 10
@@ -62,15 +62,15 @@ C:\Textgen\ik_llama.cpp>
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-23** at **12:10:01**:
+👤 **ikawrakow** commented on **2025-05-23** at **12:10:01**
-Does #448 fix it?
+Does [#448](https://github.com/ikawrakow/ik_llama.cpp/issues/448) fix it?
---
-👤 **quasar-of-mikus** commented the **2025-05-23** at **12:30:39**:
+👤 **quasar-of-mikus** commented on **2025-05-23** at **12:30:39**
Yep, it compiles and runs fine with that PR. Don't know if this is related but I saw this message come up even though it built:
```
@@ -100,14 +100,14 @@ ction taking 0 arguments [C:\Textgen\ik_llama.cpp\build\examples\quantize-stats\
---
-👤 **ikawrakow** commented the **2025-05-23** at **12:56:37**:
+👤 **ikawrakow** commented on **2025-05-23** at **12:56:37**
These are in the `quantize-stats` tool that fails to build (but everything else build correctly).
Somehow MSVC disagrees with GCC and clang on the scope of `constexpr`'s. Can you check if the commit I just pushed fixes it? Thanks.
---
-👤 **quasar-of-mikus** commented the **2025-05-23** at **13:14:15**:
+👤 **quasar-of-mikus** commented on **2025-05-23** at **13:14:15**
No, on commit [f015390](https://github.com/ikawrakow/ik_llama.cpp/pull/448/commits/f015390efa54b21752e3a76c212c93614cfff7ca) I am still getting an error, same as last time minus an error for `kBlockSize`:
```
@@ -131,31 +131,7 @@ ction taking 0 arguments [C:\Textgen\ik_llama.cpp\build\examples\quantize-stats\
---
-👤 **quasar-of-mikus** commented the **2025-05-23** at **13:14:15**:
-
-No, I am still getting an error, same as last time minus an error for `kBlockSize`:
-```
-C:\Textgen\ik_llama.cpp\examples\quantize-stats\quantize-stats.cpp(555,1): error C3493: 'kGroupSize' cannot be implicit
-ly captured because no default capture mode has been specified [C:\Textgen\ik_llama.cpp\build\examples\quantize-stats\l
-lama-quantize-stats.vcxproj]
-C:\Textgen\ik_llama.cpp\examples\quantize-stats\quantize-stats.cpp(678,1): error C3493: 'kNg' cannot be implicitly capt
-ured because no default capture mode has been specified [C:\Textgen\ik_llama.cpp\build\examples\quantize-stats\llama-qu
-antize-stats.vcxproj]
-C:\Textgen\ik_llama.cpp\examples\quantize-stats\quantize-stats.cpp(693,5): error C2064: term does not evaluate to a fun
-ction taking 0 arguments [C:\Textgen\ik_llama.cpp\build\examples\quantize-stats\llama-quantize-stats.vcxproj]
-C:\Textgen\ik_llama.cpp\examples\quantize-stats\quantize-stats.cpp(780,1): error C3493: 'kNumVal' cannot be implicitly
-captured because no default capture mode has been specified [C:\Textgen\ik_llama.cpp\build\examples\quantize-stats\llam
-a-quantize-stats.vcxproj]
-C:\Textgen\ik_llama.cpp\examples\quantize-stats\quantize-stats.cpp(824,5): error C2064: term does not evaluate to a fun
-ction taking 0 arguments [C:\Textgen\ik_llama.cpp\build\examples\quantize-stats\llama-quantize-stats.vcxproj]
- llama-gguf.vcxproj -> C:\Textgen\ik_llama.cpp\build\bin\Release\llama-gguf.exe
- llama-gguf-hash.vcxproj -> C:\Textgen\ik_llama.cpp\build\bin\Release\llama-gguf-hash.exe
- llama-bench-matmult.vcxproj -> C:\Textgen\ik_llama.cpp\build\bin\Release\llama-bench-matmult.exe
-```
-
----
-
-👤 **ikawrakow** commented the **2025-05-23** at **13:29:23**:
+👤 **ikawrakow** commented on **2025-05-23** at **13:29:23**
And now?
@@ -163,7 +139,7 @@ I never work on Windows, but from what I hear from `llama.cpp` users `clang` pro
---
-👤 **quasar-of-mikus** commented the **2025-05-23** at **13:44:54**:
+👤 **quasar-of-mikus** commented on **2025-05-23** at **13:44:54**
It works now, no more errors during compilation.
>from what I hear from llama.cpp users clang produces faster code than MSVC.
diff --git a/github-data/issues/450 - Bug Performance regression.md b/github-data/issues/450 - Bug Performance regression.md
new file mode 100644
index 000000000..b7df3eeb3
--- /dev/null
+++ b/github-data/issues/450 - Bug Performance regression.md
@@ -0,0 +1,1561 @@
+## 📌 [Issue #450](https://github.com/ikawrakow/ik_llama.cpp/issues/450) - Bug: Performance regression
+
+| **Author** | `cmoncure` |
+| :--- | :--- |
+| **State** | ❌ **Closed** |
+| **Created** | 2025-05-23 |
+| **Updated** | 2025-05-30 |
+
+---
+
+## 📄 Description
+
+### What happened?
+
+After this PR: Refactor iqk_mul_mat.cpp ([#435](https://github.com/ikawrakow/ik_llama.cpp/issues/435))
+
+This commit results in a significant performance regression for me, established by git bisect.
+My TG drops by about 30% on DeepSeek. (12.5 t/s => 9.5 t/s)
+
+https://github.com/ikawrakow/ik_llama.cpp/commit/b94cd3b632a78dfb46b18d52b84be66bcf26166a is the first bad commit
+commit https://github.com/ikawrakow/ik_llama.cpp/commit/b94cd3b632a78dfb46b18d52b84be66bcf26166a (HEAD)
+Author: Kawrakow [iwankawrakow@gmail.com](mailto:iwankawrakow@gmail.com)
+Date: Thu May 22 10:05:51 2025 +0300
+
+Refactor iqk_mul_mat.cpp ([#435](https://github.com/ikawrakow/ik_llama.cpp/issues/435))
+
+
+
+### Name and Version
+
+$ ./llama-cli --version
+version: 3705 (ec456322)
+built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
+
+~/ik_llama.cpp/build/bin/llama-server \
+-mla 3 -fa \
+-ctk q8_0 \
+-ctv q8_0 \
+--ctx-size 32768 \
+-fmoe \
+-amb 512 \
+-b 1024 \
+-ub 1024 \
+-sm none \
+--numa isolate \
+--threads 16 \
+--threads-batch 32 \
+--n-gpu-layers 99 \
+--override-tensor exps=CPU \
+--override-tensor attn=CUDA0 \
+--override-tensor exp=CUDA0 \
+--override-tensor blk.*.ffn_gate_inp.weight=CUDA0 \
+--override-tensor blk.*.ffn_down.weight=CUDA0 \
+--override-tensor blk.*.ffn_gate.weight=CUDA0 \
+--override-tensor blk.*.ffn_norm.weight=CUDA0 \
+--override-tensor blk.*.ffn_up_shexp.weight=CUDA0 \
+--override-tensor blk.*.ffn_down_shexp.weight=CUDA0 \
+--override-tensor blk.*.ffn_gate_shexp.weight=CUDA0 \
+--override-tensor blk.*.ffn_gate_inp.weight=CUDA0 \
+--host 0.0.0.0 \
+--port 7862 \
+--alias DeepSeek/DeepSeek-V3-0324-IQ4_K_R4 \
+-m ~/AIModels/textgen/DeepSeek-V3-0324-IQ4_K_R4.gguf
+
+### What operating system are you seeing the problem on?
+
+Linux
+
+### Relevant log output
+
+```shell
+
+```
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-05-23** at **12:49:28**
+
+What is the CPU being used and how was the performance regression determined?
+Log output (including when the server starts) could help.
+
+---
+
+👤 **cmoncure** commented on **2025-05-23** at **13:53:03**
+
+CPU is EPYC 9175F
+I used `git bisect` from HEAD~14 and ran the same prompt against each one. Performance is good on every commit prior to this one.
+
+GOOD log:
+
+$ ./build/bin/llama-cli --version
+version: 3703 (a2b5057a)
+built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
+
+
+```ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
+ Device 1: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
+INFO [ main] build info | tid="136521606795264" timestamp=1748008001 build=3703 commit="a2b5057a"
+INFO [ main] system info | tid="136521606795264" timestamp=1748008001 n_threads=16 n_threads_batch=32 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+llama_model_loader: loaded meta data with 53 key-value pairs and 1147 tensors from /home/corey/AIModels/textgen/DeepSeek-V3-0324-IQ4_K_R4.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
+llama_model_loader: - kv 3: general.version str = V3-0324
+llama_model_loader: - kv 4: general.basename str = DeepSeek
+llama_model_loader: - kv 5: general.size_label str = 256x21B
+llama_model_loader: - kv 6: general.license str = mit
+llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
+llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
+llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
+llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 16: general.file_type u32 = 340
+llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
+llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
+llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
+llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
+llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
+llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
+llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
+llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
+llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 45: general.quantization_version u32 = 2
+llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
+llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
+llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
+llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
+llama_model_loader: - kv 50: split.no u16 = 0
+llama_model_loader: - kv 51: split.count u16 = 0
+llama_model_loader: - kv 52: split.tensors.count i32 = 1147
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q8_0: 612 tensors
+llama_model_loader: - type iq4_k_r4: 116 tensors
+llama_model_loader: - type iq5_k_r4: 58 tensors
+llm_load_vocab: special tokens cache size = 818
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 163840
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 128
+llm_load_print_meta: n_head_kv = 128
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 24576
+llm_load_print_meta: n_embd_v_gqa = 16384
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 256
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 10000.0
+llm_load_print_meta: freq_scale_train = 0.025
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = 671B
+llm_load_print_meta: model ftype = IQ4_K_R4 - 4.5 bpw
+llm_load_print_meta: model params = 672.050 B
+llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
+llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
+llm_load_print_meta: general.name = DeepSeek V3 0324
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_layer_dense_lead = 3
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.5
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+llm_load_tensors: ggml ctx size = 0.93 MiB
+Tensor blk.0.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.0.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.0.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.0.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.0.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.0.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.0.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.0.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.0.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.0.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.0.ffn_gate.weight buffer type overriden to CUDA0
+Tensor blk.0.ffn_down.weight buffer type overriden to CUDA0
+Tensor blk.1.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.1.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.1.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.1.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.1.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.1.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.1.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.1.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.1.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.1.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.1.ffn_gate.weight buffer type overriden to CUDA0
+Tensor blk.1.ffn_down.weight buffer type overriden to CUDA0
+Tensor blk.2.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.2.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.2.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.2.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.2.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.2.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.2.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.2.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.2.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.2.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.2.ffn_gate.weight buffer type overriden to CUDA0
+Tensor blk.2.ffn_down.weight buffer type overriden to CUDA0
+Tensor blk.3.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.3.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.3.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.3.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.3.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.3.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.3.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.3.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.3.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.3.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.3.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.4.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.4.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.4.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.4.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.4.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.4.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.4.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.4.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.4.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.4.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.4.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.4.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.5.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.5.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.5.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.5.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.5.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.5.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.5.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.5.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.5.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.5.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.5.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.6.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.6.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.6.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.6.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.6.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.6.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.6.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.6.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.6.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.6.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.6.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.6.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.7.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.7.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.7.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.7.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.7.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.7.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.7.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.7.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.7.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.7.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.7.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.7.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.8.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.8.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.8.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.8.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.8.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.8.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.8.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.8.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.8.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.8.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.8.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.8.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.9.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.9.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.9.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.9.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.9.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.9.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.9.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.9.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.9.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.9.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.9.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.9.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.10.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.10.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.10.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.10.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.10.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.10.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.10.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.10.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.10.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.10.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.10.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.10.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.11.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.11.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.11.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.11.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.11.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.11.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.11.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.11.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.11.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.11.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.11.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.12.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.12.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.12.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.12.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.12.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.12.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.12.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.12.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.12.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.12.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.12.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.12.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.13.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.13.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.13.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.13.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.13.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.13.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.13.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.13.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.13.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.13.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.13.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.13.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.14.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.14.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.14.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.14.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.14.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.14.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.14.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.14.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.14.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.14.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.14.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.14.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.14.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.14.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.14.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.15.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.15.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.15.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.15.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.15.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.15.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.15.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.15.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.15.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.15.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.15.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.15.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.15.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.15.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.16.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.16.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.16.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.16.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.16.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.16.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.16.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.16.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.16.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.16.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.16.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.16.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.16.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.16.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.17.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.17.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.17.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.17.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.17.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.17.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.17.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.17.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.17.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.17.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.17.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.17.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.17.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.17.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.17.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.18.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.18.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.18.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.18.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.18.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.18.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.18.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.18.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.18.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.18.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.18.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.18.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.18.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.18.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.18.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.19.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.19.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.19.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.19.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.19.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.19.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.19.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.19.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.19.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.19.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.19.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.19.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.19.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.19.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.19.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.20.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.20.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.20.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.20.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.20.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.20.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.20.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.20.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.20.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.20.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.20.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.20.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.20.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.20.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.20.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.21.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.21.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.21.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.21.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.21.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.21.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.21.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.21.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.21.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.21.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.21.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.21.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.21.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.21.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.21.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.22.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.22.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.22.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.22.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.22.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.22.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.22.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.22.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.22.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.22.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.22.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.22.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.22.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.22.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.22.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.23.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.23.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.23.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.23.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.23.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.23.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.23.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.23.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.23.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.23.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.23.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.23.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.23.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.23.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.23.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.24.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.24.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.24.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.24.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.24.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.24.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.24.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.24.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.24.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.24.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.24.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.24.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.24.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.24.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.24.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.25.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.25.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.25.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.25.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.25.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.25.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.25.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.25.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.25.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.25.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.25.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.25.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.25.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.25.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.25.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.26.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.26.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.26.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.26.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.26.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.26.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.26.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.26.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.26.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.26.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.26.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.26.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.26.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.26.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.27.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.27.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.27.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.27.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.27.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.27.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.27.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.27.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.27.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.27.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.27.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.27.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.27.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.27.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.28.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.28.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.28.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.28.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.28.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.28.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.28.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.28.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.28.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.28.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.28.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.28.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.28.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.28.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.29.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.29.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.29.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.29.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.29.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.29.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.29.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.29.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.29.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.29.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.29.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.29.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.29.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.29.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.30.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.30.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.30.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.30.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.30.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.30.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.30.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.30.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.30.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.30.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.30.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.30.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.30.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.30.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.31.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.31.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.31.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.31.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.31.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.31.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.31.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.31.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.31.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.31.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.31.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.31.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.31.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.31.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.32.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.32.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.32.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.32.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.32.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.32.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.32.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.32.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.32.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.32.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.32.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.32.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.32.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.32.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.33.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.33.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.33.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.33.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.33.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.33.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.33.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.33.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.33.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.33.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.33.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.33.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.33.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.33.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.34.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.34.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.34.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.34.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.34.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.34.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.34.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.34.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.34.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.34.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.34.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.34.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.34.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.34.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.35.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.35.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.35.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.35.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.35.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.35.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.35.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.35.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.35.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.35.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.35.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.35.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.35.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.35.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.36.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.36.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.36.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.36.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.36.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.36.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.36.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.36.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.36.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.36.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.36.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.36.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.36.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.36.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.37.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.37.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.37.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.37.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.37.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.37.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.37.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.37.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.37.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.37.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.37.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.37.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.37.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.37.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.38.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.38.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.38.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.38.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.38.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.38.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.38.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.38.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.38.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.38.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.38.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.38.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.38.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.38.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.39.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.39.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.39.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.39.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.39.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.39.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.39.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.39.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.39.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.39.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.39.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.39.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.39.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.39.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.40.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.40.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.40.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.40.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.40.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.40.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.40.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.40.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.40.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.40.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.40.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.40.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.40.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.40.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.41.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.41.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.41.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.41.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.41.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.41.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.41.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.41.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.41.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.41.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.41.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.41.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.41.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.41.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.42.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.42.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.42.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.42.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.42.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.42.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.42.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.42.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.42.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.42.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.42.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.42.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.42.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.42.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.43.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.43.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.43.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.43.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.43.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.43.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.43.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.43.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.43.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.43.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.43.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.43.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.43.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.43.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.44.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.44.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.44.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.44.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.44.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.44.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.44.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.44.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.44.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.44.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.44.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.44.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.44.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.44.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.45.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.45.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.45.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.45.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.45.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.45.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.45.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.45.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.45.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.45.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.45.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.45.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.45.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.45.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.46.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.46.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.46.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.46.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.46.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.46.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.46.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.46.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.46.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.46.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.46.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.46.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.46.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.46.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.47.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.47.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.47.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.47.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.47.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.47.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.47.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.47.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.47.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.47.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.47.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.47.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.47.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.47.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.48.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.48.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.48.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.48.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.48.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.48.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.48.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.48.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.48.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.48.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.48.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.48.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.48.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.48.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.48.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.48.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.49.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.49.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.49.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.49.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.49.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.49.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.49.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.49.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.49.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.49.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.49.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.49.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.49.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.49.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.49.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.49.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.50.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.50.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.50.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.50.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.50.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.50.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.50.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.50.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.50.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.50.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.50.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.50.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.50.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.50.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.50.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.50.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.51.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.51.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.51.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.51.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.51.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.51.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.51.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.51.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.51.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.51.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.51.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.51.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.51.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.51.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.51.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.51.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.52.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.52.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.52.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.52.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.52.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.52.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.52.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.52.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.52.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.52.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.52.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.52.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.52.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.52.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.52.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.52.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.53.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.53.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.53.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.53.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.53.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.53.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.53.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.53.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.53.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.53.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.53.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.53.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.53.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.53.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.53.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.53.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.54.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.54.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.54.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.54.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.54.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.54.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.54.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.54.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.54.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.54.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.54.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.54.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.54.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.54.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.54.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.54.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.55.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.55.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.55.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.55.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.55.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.55.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.55.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.55.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.55.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.55.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.55.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.55.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.55.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.55.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.55.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.55.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.56.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.56.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.56.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.56.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.56.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.56.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.56.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.56.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.56.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.56.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.56.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.56.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.56.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.56.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.56.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.56.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.57.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.57.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.57.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.57.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.57.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.57.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.57.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.57.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.57.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.57.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.57.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.57.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.57.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.57.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.57.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.57.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.58.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.58.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.58.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.58.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.58.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.58.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.58.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.58.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.58.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.58.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.58.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.58.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.58.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.58.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.58.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.58.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.59.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.59.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.59.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.59.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.59.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.59.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.59.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.59.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.59.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.59.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.59.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.59.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.59.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.59.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.59.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.59.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.60.attn_norm.weight buffer type overriden to CUDA0
+Tensor blk.60.attn_q_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.60.attn_kv_a_norm.weight buffer type overriden to CUDA0
+Tensor blk.60.attn_q_a.weight buffer type overriden to CUDA0
+Tensor blk.60.attn_q_b.weight buffer type overriden to CUDA0
+Tensor blk.60.attn_kv_a_mqa.weight buffer type overriden to CUDA0
+Tensor blk.60.attn_kv_b.weight buffer type overriden to CUDA0
+Tensor blk.60.attn_k_b.weight buffer type overriden to CUDA0
+Tensor blk.60.attn_v_b.weight buffer type overriden to CUDA0
+Tensor blk.60.attn_output.weight buffer type overriden to CUDA0
+Tensor blk.60.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.60.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.60.exp_probs_b.bias buffer type overriden to CUDA0
+Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.60.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.60.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.60.ffn_up_shexp.weight buffer type overriden to CUDA0
+llm_load_tensors: offloading 61 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 62/62 layers to GPU
+llm_load_tensors: CPU buffer size = 392428.85 MiB
+llm_load_tensors: CPU buffer size = 938.98 MiB
+llm_load_tensors: CUDA0 buffer size = 17744.02 MiB
+....................................................................................................
+llama_new_context_with_model: n_ctx = 32768
+llama_new_context_with_model: n_batch = 1024
+llama_new_context_with_model: n_ubatch = 1024
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 3
+llama_new_context_with_model: attn_max_b = 512
+llama_new_context_with_model: fused_moe = 1
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 0.025
+llama_kv_cache_init: CUDA0 KV buffer size = 1166.65 MiB
+llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
+llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 3650.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 352.01 MiB
+llama_new_context_with_model: graph nodes = 8245
+llama_new_context_with_model: graph splits = 118
+INFO [ init] initializing slots | tid="136521606795264" timestamp=1748008022 n_slots=1
+INFO [ init] new slot | tid="136521606795264" timestamp=1748008022 id_slot=0 n_ctx_slot=32768
+INFO [ main] model loaded | tid="136521606795264" timestamp=1748008022
+INFO [ main] chat template | tid="136521606795264" timestamp=1748008022 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
+INFO [ main] HTTP server listening | tid="136521606795264" timestamp=1748008022 n_threads_http="31" port="7862" hostname="0.0.0.0"
+INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008022
+INFO [ launch_slot_with_task] slot is processing task | tid="136521606795264" timestamp=1748008040 id_slot=0 id_task=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008040 id_slot=0 id_task=0 p0=0
+INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008051 id_slot=0 id_task=0 p0=1024
+INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008063 id_slot=0 id_task=0 p0=2048
+INFO [ print_timings] prompt eval time = 25767.00 ms / 2190 tokens ( 11.77 ms per token, 84.99 tokens per second) | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_prompt_processing=25767.002 n_prompt_tokens_processed=2190 t_token=11.765754337899544 n_tokens_second=84.9924255836981
+INFO [ print_timings] generation eval time = 15701.68 ms / 222 runs ( 70.73 ms per token, 14.14 tokens per second) | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_token_generation=15701.681 n_decoded=222 t_token=70.7282927927928 n_tokens_second=14.138613566279941
+INFO [ print_timings] total time = 41468.68 ms | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_prompt_processing=25767.002 t_token_generation=15701.681 t_total=41468.683000000005
+INFO [ update_slots] slot released | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 n_ctx=32768 n_past=2411 n_system_tokens=0 n_cache_tokens=2411 truncated=false
+INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008081
+INFO [ log_server_request] request | tid="136105332502528" timestamp=1748008081 remote_addr="10.254.1.2" remote_port=51316 status=200 method="POST" path="/completion" params={}
+INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008081
+```
+
+BAD log:
+
+$ ./build-bad/bin/llama-cli --version
+version: 3705 (ec456322)
+built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
+
+(by way of `diff`)
+```
+$ diff goodlog badlog
+5,6c5,6
+< INFO [ main] build info | tid="136521606795264" timestamp=1748008001 build=3703 commit="a2b5057a"
+< INFO [ main] system info | tid="136521606795264" timestamp=1748008001 n_threads=16 n_threads_batch=32 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+---
+> INFO [ main] build info | tid="127511205212160" timestamp=1748008231 build=3705 commit="ec456322"
+> INFO [ main] system info | tid="127511205212160" timestamp=1748008231 n_threads=16 n_threads_batch=32 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+1293,1309c1293,1309
+< INFO [ init] initializing slots | tid="136521606795264" timestamp=1748008022 n_slots=1
+< INFO [ init] new slot | tid="136521606795264" timestamp=1748008022 id_slot=0 n_ctx_slot=32768
+< INFO [ main] model loaded | tid="136521606795264" timestamp=1748008022
+< INFO [ main] chat template | tid="136521606795264" timestamp=1748008022 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
+< INFO [ main] HTTP server listening | tid="136521606795264" timestamp=1748008022 n_threads_http="31" port="7862" hostname="0.0.0.0"
+< INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008022
+< INFO [ launch_slot_with_task] slot is processing task | tid="136521606795264" timestamp=1748008040 id_slot=0 id_task=0
+< INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008040 id_slot=0 id_task=0 p0=0
+< INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008051 id_slot=0 id_task=0 p0=1024
+< INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008063 id_slot=0 id_task=0 p0=2048
+< INFO [ print_timings] prompt eval time = 25767.00 ms / 2190 tokens ( 11.77 ms per token, 84.99 tokens per second) | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_prompt_processing=25767.002 n_prompt_tokens_processed=2190 t_token=11.765754337899544 n_tokens_second=84.9924255836981
+< INFO [ print_timings] generation eval time = 15701.68 ms / 222 runs ( 70.73 ms per token, 14.14 tokens per second) | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_token_generation=15701.681 n_decoded=222 t_token=70.7282927927928 n_tokens_second=14.138613566279941
+< INFO [ print_timings] total time = 41468.68 ms | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_prompt_processing=25767.002 t_token_generation=15701.681 t_total=41468.683000000005
+< INFO [ update_slots] slot released | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 n_ctx=32768 n_past=2411 n_system_tokens=0 n_cache_tokens=2411 truncated=false
+< INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008081
+< INFO [ log_server_request] request | tid="136105332502528" timestamp=1748008081 remote_addr="10.254.1.2" remote_port=51316 status=200 method="POST" path="/completion" params={}
+< INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008081
+---
+> INFO [ init] initializing slots | tid="127511205212160" timestamp=1748008241 n_slots=1
+> INFO [ init] new slot | tid="127511205212160" timestamp=1748008241 id_slot=0 n_ctx_slot=32768
+> INFO [ main] model loaded | tid="127511205212160" timestamp=1748008241
+> INFO [ main] chat template | tid="127511205212160" timestamp=1748008241 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
+> INFO [ main] HTTP server listening | tid="127511205212160" timestamp=1748008241 n_threads_http="31" port="7862" hostname="0.0.0.0"
+> INFO [ update_slots] all slots are idle | tid="127511205212160" timestamp=1748008241
+> INFO [ launch_slot_with_task] slot is processing task | tid="127511205212160" timestamp=1748008291 id_slot=0 id_task=0
+> INFO [ update_slots] kv cache rm [p0, end) | tid="127511205212160" timestamp=1748008291 id_slot=0 id_task=0 p0=0
+> INFO [ update_slots] kv cache rm [p0, end) | tid="127511205212160" timestamp=1748008303 id_slot=0 id_task=0 p0=1024
+> INFO [ update_slots] kv cache rm [p0, end) | tid="127511205212160" timestamp=1748008315 id_slot=0 id_task=0 p0=2048
+> INFO [ print_timings] prompt eval time = 25845.83 ms / 2190 tokens ( 11.80 ms per token, 84.73 tokens per second) | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 t_prompt_processing=25845.833 n_prompt_tokens_processed=2190 t_token=11.801750228310501 n_tokens_second=84.73319470879504
+> INFO [ print_timings] generation eval time = 21665.24 ms / 222 runs ( 97.59 ms per token, 10.25 tokens per second) | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 t_token_generation=21665.244 n_decoded=222 t_token=97.59118918918918 n_tokens_second=10.246826668557253
+> INFO [ print_timings] total time = 47511.08 ms | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 t_prompt_processing=25845.833 t_token_generation=21665.244 t_total=47511.077
+> INFO [ update_slots] slot released | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 n_ctx=32768 n_past=2411 n_system_tokens=0 n_cache_tokens=2411 truncated=false
+> INFO [ update_slots] all slots are idle | tid="127511205212160" timestamp=1748008339
+> INFO [ log_server_request] request | tid="127095162204160" timestamp=1748008339 remote_addr="10.254.1.2" remote_port=43794 status=200 method="POST" path="/completion" params={}
+> INFO [ update_slots] all slots are idle | tid="127511205212160" timestamp=1748008339
+```
+
+---
+
+👤 **ikawrakow** commented on **2025-05-23** at **15:09:25**
+
+In my case I see zero difference between current main branch and a2b5057a0c9a2758830b6f841bb22150d2511bb1. Tested with DeepSeek-Lite (the 16B little sibling of DeepSeek-V3/R1) and Qwen3-30B-A3B using the exact same custom quantization as yours.
+
+My CPU is Ryzen-7950X, so Zen4 core. Yours is Zen5, so both use the exact same implementation.
+
+I wouldn't know why the performance would change. The 18k LOC `iqk_mul_mat.cpp` got refactored into multiple files for faster build times. There was zero change done in [#435](https://github.com/ikawrakow/ik_llama.cpp/issues/435).
+
+I would try `echo 3 | sudo tee /proc/sys/vm/drop_caches`, and then load the model with the **main branch first** to see what happens.
+
+---
+
+👤 **cmoncure** commented on **2025-05-23** at **16:01:17**
+
+Dropped cache.
+
+Main (bad) build first "ec456322"
+```
+[ print_timings] prompt eval time = 34619.60 ms / 2190 tokens ( 15.81 ms per token, 63.26 tokens per second) | tid="138682949877760" timestamp=1748014236 id_slot=0 id_task=0 t_prompt_processing=34619.603 n_prompt_tokens_processed=2190 t_token=15.80803789954338 n_tokens_second=63.25895764893664
+INFO [ print_timings] generation eval time = 22553.81 ms / 222 runs ( 101.59 ms per token, 9.84 tokens per second) | tid="138682949877760" timestamp=1748014236 id_slot=0 id_task=0 t_token_generation=22553.805 n_decoded=222 t_token=101.59371621621622 n_tokens_second=9.843128465462923
+```
+
+Switch to good build "a2b5057a"
+```
+INFO [ print_timings] prompt eval time = 48430.56 ms / 2190 tokens ( 22.11 ms per token, 45.22 tokens per second) | tid="128418970439680" timestamp=1748014922 id_slot=0 id_task=0 t_prompt_processing=48430.56 n_prompt_tokens_processed=2190 t_token=22.11441095890411 n_tokens_second=45.21938214218461
+INFO [ print_timings] generation eval time = 24928.21 ms / 222 runs ( 112.29 ms per token, 8.91 tokens per second) | tid="128418970439680" timestamp=1748014922 id_slot=0 id_task=0 t_token_generation=24928.211 n_decoded=222 t_token=112.28923873873873 n_tokens_second=8.905572886879046
+```
+
+Well now both are bad.
+
+Switch back to version: 3692 (b90d6ede)
+```
+INFO [ print_timings] prompt eval time = 25607.00 ms / 2190 tokens ( 11.69 ms per token, 85.52 tokens per second) | tid="132738167939072" timestamp=1748015946 id_slot=0 id_task=0 t_prompt_processing=25606.997 n_prompt_tokens_processed=2190 t_token=11.692692694063927 n_tokens_second=85.52349969033854
+INFO [ print_timings] generation eval time = 15771.66 ms / 222 runs ( 71.04 ms per token, 14.08 tokens per second) | tid="132738167939072" timestamp=1748015946 id_slot=0 id_task=0 t_token_generation=15771.659 n_decoded=222 t_token=71.04350900900901 n_tokens_second=14.075881300755997
+```
+Alright, we're in business again. I'll re-bisect dropping the cache each time.
+
+---
+
+👤 **ikawrakow** commented on **2025-05-23** at **16:28:30**
+
+So, you cannot base your measurement on just a single load and one run with 2000 prompt tokens and 200 generated tokens. These giant models take some time to "warm up".
+
+Your CPU has 16 cores, does `--threads-batch 32` help? In my case it always decreases performance compared to just using 16 threads on my 16-core CPU.
+
+You could try a much simpler tensor override rule. Just `-exps=CPU -ngl 100`.
+
+---
+
+👤 **cmoncure** commented on **2025-05-23** at **18:33:25**
+
+> These giant models take some time to "warm up".
+
+This differs from my observations, but I'll take it under advisement and post average results from 4 runs with 4 separate prompts, circling back to reuse one prompt at the end, and dropping cache with each build.
+
+methodology:
+1. echo 3 | sudo tee /proc/sys/vm/drop_caches
+2. git checkout
+3. cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
+4. cmake --build build --config Release -j16
+5. (my llama-server command)
+6. prompt A
+7. prompt B
+8. prompt C
+9. prompt A (repeated)
+
+Runs:
+1. version: 3698 (134d548) => 12.59 t/s (avg)
+2. version: 3701 (b3036a8) => 12.50 t/s (avg)
+3. version: 3703 (a2b5057) => 12.58 t/s (avg)
+4. version: 3704 (b94cd3b) => 9.78 t/s (avg) !
+5. version: 3703 (a2b5057) => 12.68 t/s (avg)
+6. version: 3704 (b94cd3b) => 9.85 t/s (avg) !
+
+(variance <= 0.14s in all runs)
+
+Sure looks like version 3704 is bad. Maybe some compiler optimizations aren't applying?
+
+---
+
+👤 **Ph0rk0z** commented on **2025-05-23** at **19:34:30**
+
+Try with llama sweep bench to get a better average. I didn't notice anything either but I was just using qwen.
+
+---
+
+👤 **saood06** commented on **2025-05-24** at **23:53:08**
+
+@cmoncure
+
+Do you mind trying if setting GGML_LTO on when building it helps?
+
+---
+
+👤 **cmoncure** commented on **2025-05-30** at **23:32:18**
+
+Newer versions seem to have improved (to within 10% of a2b5057) so I'm closing this.
\ No newline at end of file
diff --git a/github-data/issues/450 - Bug_ Performance regression.md b/github-data/issues/450 - Bug_ Performance regression.md
deleted file mode 100644
index 9e8f550ba..000000000
--- a/github-data/issues/450 - Bug_ Performance regression.md
+++ /dev/null
@@ -1,4236 +0,0 @@
-### 🐛 [#450](https://github.com/ikawrakow/ik_llama.cpp/issues/450) - Bug: Performance regression
-
-| **Author** | `cmoncure` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-05-23 |
-| **Updated** | 2025-05-30 |
-
----
-
-#### Description
-
-### What happened?
-
-After this PR: Refactor iqk_mul_mat.cpp (#435)
-
-This commit results in a significant performance regression for me, established by git bisect.
-My TG drops by about 30% on DeepSeek. (12.5 t/s => 9.5 t/s)
-
-https://github.com/ikawrakow/ik_llama.cpp/commit/b94cd3b632a78dfb46b18d52b84be66bcf26166a is the first bad commit
-commit https://github.com/ikawrakow/ik_llama.cpp/commit/b94cd3b632a78dfb46b18d52b84be66bcf26166a (HEAD)
-Author: Kawrakow [iwankawrakow@gmail.com](mailto:iwankawrakow@gmail.com)
-Date: Thu May 22 10:05:51 2025 +0300
-
-Refactor iqk_mul_mat.cpp (#435)
-
-
-
-### Name and Version
-
-$ ./llama-cli --version
-version: 3705 (ec456322)
-built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
-
-~/ik_llama.cpp/build/bin/llama-server \
--mla 3 -fa \
--ctk q8_0 \
--ctv q8_0 \
---ctx-size 32768 \
--fmoe \
--amb 512 \
--b 1024 \
--ub 1024 \
--sm none \
---numa isolate \
---threads 16 \
---threads-batch 32 \
---n-gpu-layers 99 \
---override-tensor exps=CPU \
---override-tensor attn=CUDA0 \
---override-tensor exp=CUDA0 \
---override-tensor blk.*.ffn_gate_inp.weight=CUDA0 \
---override-tensor blk.*.ffn_down.weight=CUDA0 \
---override-tensor blk.*.ffn_gate.weight=CUDA0 \
---override-tensor blk.*.ffn_norm.weight=CUDA0 \
---override-tensor blk.*.ffn_up_shexp.weight=CUDA0 \
---override-tensor blk.*.ffn_down_shexp.weight=CUDA0 \
---override-tensor blk.*.ffn_gate_shexp.weight=CUDA0 \
---override-tensor blk.*.ffn_gate_inp.weight=CUDA0 \
---host 0.0.0.0 \
---port 7862 \
---alias DeepSeek/DeepSeek-V3-0324-IQ4_K_R4 \
--m ~/AIModels/textgen/DeepSeek-V3-0324-IQ4_K_R4.gguf
-
-### What operating system are you seeing the problem on?
-
-Linux
-
-### Relevant log output
-
-```shell
-
-```
-
----
-
-#### 💬 Conversation
-
-👤 **ikawrakow** commented the **2025-05-23** at **12:49:28**:
-
-What is the CPU being used and how was the performance regression determined?
-Log output (including when the server starts) could help.
-
----
-
-👤 **cmoncure** commented the **2025-05-23** at **13:53:03**:
-
-CPU is EPYC 9175F
-I used `git bisect` from HEAD~14 and ran the same prompt against each one. Performance is good on every commit prior to this one.
-
-GOOD log:
-
-$ ./build/bin/llama-cli --version
-version: 3703 (a2b5057a)
-built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
-
-
-```ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
-ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
-ggml_cuda_init: found 2 CUDA devices:
- Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
- Device 1: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="136521606795264" timestamp=1748008001 build=3703 commit="a2b5057a"
-INFO [ main] system info | tid="136521606795264" timestamp=1748008001 n_threads=16 n_threads_batch=32 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
-llama_model_loader: loaded meta data with 53 key-value pairs and 1147 tensors from /home/corey/AIModels/textgen/DeepSeek-V3-0324-IQ4_K_R4.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = deepseek2
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
-llama_model_loader: - kv 3: general.version str = V3-0324
-llama_model_loader: - kv 4: general.basename str = DeepSeek
-llama_model_loader: - kv 5: general.size_label str = 256x21B
-llama_model_loader: - kv 6: general.license str = mit
-llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
-llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
-llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
-llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
-llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
-llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
-llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
-llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
-llama_model_loader: - kv 16: general.file_type u32 = 340
-llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
-llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
-llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
-llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
-llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
-llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
-llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
-llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
-llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
-llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
-llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
-llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
-llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
-llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
-llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
-llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
-llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
-llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
-llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
-llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
-llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
-llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
-llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
-llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
-llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
-llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
-llama_model_loader: - kv 45: general.quantization_version u32 = 2
-llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
-llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
-llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
-llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
-llama_model_loader: - kv 50: split.no u16 = 0
-llama_model_loader: - kv 51: split.count u16 = 0
-llama_model_loader: - kv 52: split.tensors.count i32 = 1147
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 612 tensors
-llama_model_loader: - type iq4_k_r4: 116 tensors
-llama_model_loader: - type iq5_k_r4: 58 tensors
-llm_load_vocab: special tokens cache size = 818
-llm_load_vocab: token to piece cache size = 0.8223 MB
-llm_load_print_meta: format = GGUF V3 (latest)
-llm_load_print_meta: arch = deepseek2
-llm_load_print_meta: vocab type = BPE
-llm_load_print_meta: n_vocab = 129280
-llm_load_print_meta: n_merges = 127741
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 163840
-llm_load_print_meta: n_embd = 7168
-llm_load_print_meta: n_layer = 61
-llm_load_print_meta: n_head = 128
-llm_load_print_meta: n_head_kv = 128
-llm_load_print_meta: n_rot = 64
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_swa_pattern = 1
-llm_load_print_meta: n_embd_head_k = 192
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 1
-llm_load_print_meta: n_embd_k_gqa = 24576
-llm_load_print_meta: n_embd_v_gqa = 16384
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-06
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 18432
-llm_load_print_meta: n_expert = 256
-llm_load_print_meta: n_expert_used = 8
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 0
-llm_load_print_meta: rope scaling = yarn
-llm_load_print_meta: freq_base_train = 10000.0
-llm_load_print_meta: freq_scale_train = 0.025
-llm_load_print_meta: n_ctx_orig_yarn = 4096
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = IQ4_K_R4 - 4.5 bpw
-llm_load_print_meta: model params = 672.050 B
-llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
-llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
-llm_load_print_meta: general.name = DeepSeek V3 0324
-llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
-llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: LF token = 131 'Ä'
-llm_load_print_meta: max token length = 256
-llm_load_print_meta: n_layer_dense_lead = 3
-llm_load_print_meta: n_lora_q = 1536
-llm_load_print_meta: n_lora_kv = 512
-llm_load_print_meta: n_ff_exp = 2048
-llm_load_print_meta: n_expert_shared = 1
-llm_load_print_meta: expert_weights_scale = 2.5
-llm_load_print_meta: expert_weights_norm = 1
-llm_load_print_meta: expert_gating_func = sigmoid
-llm_load_print_meta: rope_yarn_log_mul = 0.1000
-llm_load_tensors: ggml ctx size = 0.93 MiB
-Tensor blk.0.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.ffn_gate.weight buffer type overriden to CUDA0
-Tensor blk.0.ffn_down.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.ffn_gate.weight buffer type overriden to CUDA0
-Tensor blk.1.ffn_down.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.ffn_gate.weight buffer type overriden to CUDA0
-Tensor blk.2.ffn_down.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.3.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.4.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.5.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.6.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.7.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.8.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.9.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.10.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.11.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.12.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.13.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.14.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.15.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.16.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.17.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.18.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.19.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.20.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.21.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.22.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.23.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.24.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.25.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.26.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.27.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.28.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.29.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.30.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.31.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.32.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.33.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.34.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.35.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.36.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.37.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.38.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.39.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.40.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.41.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.42.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.43.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.44.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.45.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.46.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.47.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.48.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.49.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.50.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.51.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.52.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.53.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.54.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.55.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.56.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.57.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.58.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.59.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.60.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_up_shexp.weight buffer type overriden to CUDA0
-llm_load_tensors: offloading 61 repeating layers to GPU
-llm_load_tensors: offloading non-repeating layers to GPU
-llm_load_tensors: offloaded 62/62 layers to GPU
-llm_load_tensors: CPU buffer size = 392428.85 MiB
-llm_load_tensors: CPU buffer size = 938.98 MiB
-llm_load_tensors: CUDA0 buffer size = 17744.02 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 32768
-llama_new_context_with_model: n_batch = 1024
-llama_new_context_with_model: n_ubatch = 1024
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 3
-llama_new_context_with_model: attn_max_b = 512
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: CUDA0 KV buffer size = 1166.65 MiB
-llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
-llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
-llama_new_context_with_model: CUDA0 compute buffer size = 3650.00 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 352.01 MiB
-llama_new_context_with_model: graph nodes = 8245
-llama_new_context_with_model: graph splits = 118
-INFO [ init] initializing slots | tid="136521606795264" timestamp=1748008022 n_slots=1
-INFO [ init] new slot | tid="136521606795264" timestamp=1748008022 id_slot=0 n_ctx_slot=32768
-INFO [ main] model loaded | tid="136521606795264" timestamp=1748008022
-INFO [ main] chat template | tid="136521606795264" timestamp=1748008022 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-INFO [ main] HTTP server listening | tid="136521606795264" timestamp=1748008022 n_threads_http="31" port="7862" hostname="0.0.0.0"
-INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008022
-INFO [ launch_slot_with_task] slot is processing task | tid="136521606795264" timestamp=1748008040 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008040 id_slot=0 id_task=0 p0=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008051 id_slot=0 id_task=0 p0=1024
-INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008063 id_slot=0 id_task=0 p0=2048
-INFO [ print_timings] prompt eval time = 25767.00 ms / 2190 tokens ( 11.77 ms per token, 84.99 tokens per second) | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_prompt_processing=25767.002 n_prompt_tokens_processed=2190 t_token=11.765754337899544 n_tokens_second=84.9924255836981
-INFO [ print_timings] generation eval time = 15701.68 ms / 222 runs ( 70.73 ms per token, 14.14 tokens per second) | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_token_generation=15701.681 n_decoded=222 t_token=70.7282927927928 n_tokens_second=14.138613566279941
-INFO [ print_timings] total time = 41468.68 ms | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_prompt_processing=25767.002 t_token_generation=15701.681 t_total=41468.683000000005
-INFO [ update_slots] slot released | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 n_ctx=32768 n_past=2411 n_system_tokens=0 n_cache_tokens=2411 truncated=false
-INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008081
-INFO [ log_server_request] request | tid="136105332502528" timestamp=1748008081 remote_addr="10.254.1.2" remote_port=51316 status=200 method="POST" path="/completion" params={}
-INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008081
-```
-
-BAD log:
-
-$ ./build-bad/bin/llama-cli --version
-version: 3705 (ec456322)
-built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
-
-(by way of `diff`)
-```
-$ diff goodlog badlog
-5,6c5,6
-< INFO [ main] build info | tid="136521606795264" timestamp=1748008001 build=3703 commit="a2b5057a"
-< INFO [ main] system info | tid="136521606795264" timestamp=1748008001 n_threads=16 n_threads_batch=32 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
----
-> INFO [ main] build info | tid="127511205212160" timestamp=1748008231 build=3705 commit="ec456322"
-> INFO [ main] system info | tid="127511205212160" timestamp=1748008231 n_threads=16 n_threads_batch=32 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
-1293,1309c1293,1309
-< INFO [ init] initializing slots | tid="136521606795264" timestamp=1748008022 n_slots=1
-< INFO [ init] new slot | tid="136521606795264" timestamp=1748008022 id_slot=0 n_ctx_slot=32768
-< INFO [ main] model loaded | tid="136521606795264" timestamp=1748008022
-< INFO [ main] chat template | tid="136521606795264" timestamp=1748008022 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-< INFO [ main] HTTP server listening | tid="136521606795264" timestamp=1748008022 n_threads_http="31" port="7862" hostname="0.0.0.0"
-< INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008022
-< INFO [ launch_slot_with_task] slot is processing task | tid="136521606795264" timestamp=1748008040 id_slot=0 id_task=0
-< INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008040 id_slot=0 id_task=0 p0=0
-< INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008051 id_slot=0 id_task=0 p0=1024
-< INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008063 id_slot=0 id_task=0 p0=2048
-< INFO [ print_timings] prompt eval time = 25767.00 ms / 2190 tokens ( 11.77 ms per token, 84.99 tokens per second) | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_prompt_processing=25767.002 n_prompt_tokens_processed=2190 t_token=11.765754337899544 n_tokens_second=84.9924255836981
-< INFO [ print_timings] generation eval time = 15701.68 ms / 222 runs ( 70.73 ms per token, 14.14 tokens per second) | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_token_generation=15701.681 n_decoded=222 t_token=70.7282927927928 n_tokens_second=14.138613566279941
-< INFO [ print_timings] total time = 41468.68 ms | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_prompt_processing=25767.002 t_token_generation=15701.681 t_total=41468.683000000005
-< INFO [ update_slots] slot released | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 n_ctx=32768 n_past=2411 n_system_tokens=0 n_cache_tokens=2411 truncated=false
-< INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008081
-< INFO [ log_server_request] request | tid="136105332502528" timestamp=1748008081 remote_addr="10.254.1.2" remote_port=51316 status=200 method="POST" path="/completion" params={}
-< INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008081
----
-> INFO [ init] initializing slots | tid="127511205212160" timestamp=1748008241 n_slots=1
-> INFO [ init] new slot | tid="127511205212160" timestamp=1748008241 id_slot=0 n_ctx_slot=32768
-> INFO [ main] model loaded | tid="127511205212160" timestamp=1748008241
-> INFO [ main] chat template | tid="127511205212160" timestamp=1748008241 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-> INFO [ main] HTTP server listening | tid="127511205212160" timestamp=1748008241 n_threads_http="31" port="7862" hostname="0.0.0.0"
-> INFO [ update_slots] all slots are idle | tid="127511205212160" timestamp=1748008241
-> INFO [ launch_slot_with_task] slot is processing task | tid="127511205212160" timestamp=1748008291 id_slot=0 id_task=0
-> INFO [ update_slots] kv cache rm [p0, end) | tid="127511205212160" timestamp=1748008291 id_slot=0 id_task=0 p0=0
-> INFO [ update_slots] kv cache rm [p0, end) | tid="127511205212160" timestamp=1748008303 id_slot=0 id_task=0 p0=1024
-> INFO [ update_slots] kv cache rm [p0, end) | tid="127511205212160" timestamp=1748008315 id_slot=0 id_task=0 p0=2048
-> INFO [ print_timings] prompt eval time = 25845.83 ms / 2190 tokens ( 11.80 ms per token, 84.73 tokens per second) | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 t_prompt_processing=25845.833 n_prompt_tokens_processed=2190 t_token=11.801750228310501 n_tokens_second=84.73319470879504
-> INFO [ print_timings] generation eval time = 21665.24 ms / 222 runs ( 97.59 ms per token, 10.25 tokens per second) | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 t_token_generation=21665.244 n_decoded=222 t_token=97.59118918918918 n_tokens_second=10.246826668557253
-> INFO [ print_timings] total time = 47511.08 ms | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 t_prompt_processing=25845.833 t_token_generation=21665.244 t_total=47511.077
-> INFO [ update_slots] slot released | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 n_ctx=32768 n_past=2411 n_system_tokens=0 n_cache_tokens=2411 truncated=false
-> INFO [ update_slots] all slots are idle | tid="127511205212160" timestamp=1748008339
-> INFO [ log_server_request] request | tid="127095162204160" timestamp=1748008339 remote_addr="10.254.1.2" remote_port=43794 status=200 method="POST" path="/completion" params={}
-> INFO [ update_slots] all slots are idle | tid="127511205212160" timestamp=1748008339
-```
-
----
-
-👤 **cmoncure** commented the **2025-05-23** at **13:53:03**:
-
-CPU is EPYC 9175F
-I used `git bisect` from HEAD~14 and ran the same prompt against each one. Performance is good on every commit prior to this one.
-
-GOOD log:
-
-$ ./build/bin/llama-cli --version
-version: 3703 (a2b5057a)
-built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
-
-
-`ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
-ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
-ggml_cuda_init: found 2 CUDA devices:
- Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
- Device 1: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="136521606795264" timestamp=1748008001 build=3703 commit="a2b5057a"
-INFO [ main] system info | tid="136521606795264" timestamp=1748008001 n_threads=16 n_threads_batch=32 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
-llama_model_loader: loaded meta data with 53 key-value pairs and 1147 tensors from /home/corey/AIModels/textgen/DeepSeek-V3-0324-IQ4_K_R4.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = deepseek2
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
-llama_model_loader: - kv 3: general.version str = V3-0324
-llama_model_loader: - kv 4: general.basename str = DeepSeek
-llama_model_loader: - kv 5: general.size_label str = 256x21B
-llama_model_loader: - kv 6: general.license str = mit
-llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
-llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
-llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
-llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
-llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
-llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
-llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
-llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
-llama_model_loader: - kv 16: general.file_type u32 = 340
-llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
-llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
-llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
-llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
-llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
-llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
-llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
-llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
-llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
-llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
-llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
-llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
-llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
-llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
-llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
-llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
-llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
-llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
-llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
-llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
-llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
-llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
-llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
-llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
-llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
-llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
-llama_model_loader: - kv 45: general.quantization_version u32 = 2
-llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
-llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
-llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
-llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
-llama_model_loader: - kv 50: split.no u16 = 0
-llama_model_loader: - kv 51: split.count u16 = 0
-llama_model_loader: - kv 52: split.tensors.count i32 = 1147
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 612 tensors
-llama_model_loader: - type iq4_k_r4: 116 tensors
-llama_model_loader: - type iq5_k_r4: 58 tensors
-llm_load_vocab: special tokens cache size = 818
-llm_load_vocab: token to piece cache size = 0.8223 MB
-llm_load_print_meta: format = GGUF V3 (latest)
-llm_load_print_meta: arch = deepseek2
-llm_load_print_meta: vocab type = BPE
-llm_load_print_meta: n_vocab = 129280
-llm_load_print_meta: n_merges = 127741
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 163840
-llm_load_print_meta: n_embd = 7168
-llm_load_print_meta: n_layer = 61
-llm_load_print_meta: n_head = 128
-llm_load_print_meta: n_head_kv = 128
-llm_load_print_meta: n_rot = 64
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_swa_pattern = 1
-llm_load_print_meta: n_embd_head_k = 192
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 1
-llm_load_print_meta: n_embd_k_gqa = 24576
-llm_load_print_meta: n_embd_v_gqa = 16384
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-06
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 18432
-llm_load_print_meta: n_expert = 256
-llm_load_print_meta: n_expert_used = 8
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 0
-llm_load_print_meta: rope scaling = yarn
-llm_load_print_meta: freq_base_train = 10000.0
-llm_load_print_meta: freq_scale_train = 0.025
-llm_load_print_meta: n_ctx_orig_yarn = 4096
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = IQ4_K_R4 - 4.5 bpw
-llm_load_print_meta: model params = 672.050 B
-llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
-llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
-llm_load_print_meta: general.name = DeepSeek V3 0324
-llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
-llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: LF token = 131 'Ä'
-llm_load_print_meta: max token length = 256
-llm_load_print_meta: n_layer_dense_lead = 3
-llm_load_print_meta: n_lora_q = 1536
-llm_load_print_meta: n_lora_kv = 512
-llm_load_print_meta: n_ff_exp = 2048
-llm_load_print_meta: n_expert_shared = 1
-llm_load_print_meta: expert_weights_scale = 2.5
-llm_load_print_meta: expert_weights_norm = 1
-llm_load_print_meta: expert_gating_func = sigmoid
-llm_load_print_meta: rope_yarn_log_mul = 0.1000
-llm_load_tensors: ggml ctx size = 0.93 MiB
-Tensor blk.0.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.ffn_gate.weight buffer type overriden to CUDA0
-Tensor blk.0.ffn_down.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.ffn_gate.weight buffer type overriden to CUDA0
-Tensor blk.1.ffn_down.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.ffn_gate.weight buffer type overriden to CUDA0
-Tensor blk.2.ffn_down.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.3.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.4.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.5.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.6.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.7.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.8.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.9.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.10.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.11.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.12.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.13.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.14.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.15.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.16.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.17.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.18.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.19.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.20.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.21.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.22.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.23.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.24.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.25.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.26.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.27.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.28.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.29.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.30.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.31.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.32.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.33.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.34.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.35.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.36.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.37.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.38.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.39.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.40.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.41.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.42.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.43.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.44.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.45.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.46.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.47.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.48.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.49.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.50.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.51.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.52.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.53.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.54.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.55.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.56.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.57.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.58.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.59.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.60.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_up_shexp.weight buffer type overriden to CUDA0
-llm_load_tensors: offloading 61 repeating layers to GPU
-llm_load_tensors: offloading non-repeating layers to GPU
-llm_load_tensors: offloaded 62/62 layers to GPU
-llm_load_tensors: CPU buffer size = 392428.85 MiB
-llm_load_tensors: CPU buffer size = 938.98 MiB
-llm_load_tensors: CUDA0 buffer size = 17744.02 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 32768
-llama_new_context_with_model: n_batch = 1024
-llama_new_context_with_model: n_ubatch = 1024
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 3
-llama_new_context_with_model: attn_max_b = 512
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: CUDA0 KV buffer size = 1166.65 MiB
-llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
-llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
-llama_new_context_with_model: CUDA0 compute buffer size = 3650.00 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 352.01 MiB
-llama_new_context_with_model: graph nodes = 8245
-llama_new_context_with_model: graph splits = 118
-INFO [ init] initializing slots | tid="136521606795264" timestamp=1748008022 n_slots=1
-INFO [ init] new slot | tid="136521606795264" timestamp=1748008022 id_slot=0 n_ctx_slot=32768
-INFO [ main] model loaded | tid="136521606795264" timestamp=1748008022
-INFO [ main] chat template | tid="136521606795264" timestamp=1748008022 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-INFO [ main] HTTP server listening | tid="136521606795264" timestamp=1748008022 n_threads_http="31" port="7862" hostname="0.0.0.0"
-INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008022
-INFO [ launch_slot_with_task] slot is processing task | tid="136521606795264" timestamp=1748008040 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008040 id_slot=0 id_task=0 p0=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008051 id_slot=0 id_task=0 p0=1024
-INFO [ update_slots] kv cache rm [p0, end) | tid="136521606795264" timestamp=1748008063 id_slot=0 id_task=0 p0=2048
-INFO [ print_timings] prompt eval time = 25767.00 ms / 2190 tokens ( 11.77 ms per token, 84.99 tokens per second) | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_prompt_processing=25767.002 n_prompt_tokens_processed=2190 t_token=11.765754337899544 n_tokens_second=84.9924255836981
-INFO [ print_timings] generation eval time = 15701.68 ms / 222 runs ( 70.73 ms per token, 14.14 tokens per second) | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_token_generation=15701.681 n_decoded=222 t_token=70.7282927927928 n_tokens_second=14.138613566279941
-INFO [ print_timings] total time = 41468.68 ms | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 t_prompt_processing=25767.002 t_token_generation=15701.681 t_total=41468.683000000005
-INFO [ update_slots] slot released | tid="136521606795264" timestamp=1748008081 id_slot=0 id_task=0 n_ctx=32768 n_past=2411 n_system_tokens=0 n_cache_tokens=2411 truncated=false
-INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008081
-INFO [ log_server_request] request | tid="136105332502528" timestamp=1748008081 remote_addr="10.254.1.2" remote_port=51316 status=200 method="POST" path="/completion" params={}
-INFO [ update_slots] all slots are idle | tid="136521606795264" timestamp=1748008081
-`
-
-BAD log:
-
-$ ./build-bad/bin/llama-cli --version
-version: 3705 (ec456322)
-built with cc (Ubuntu 14.2.0-4ubuntu2) 14.2.0 for x86_64-linux-gnu
-
-`ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
-ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
-ggml_cuda_init: found 2 CUDA devices:
- Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
- Device 1: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
-INFO [ main] build info | tid="127511205212160" timestamp=1748008231 build=3705 commit="ec456322"
-INFO [ main] system info | tid="127511205212160" timestamp=1748008231 n_threads=16 n_threads_batch=32 total_threads=32 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
-llama_model_loader: loaded meta data with 53 key-value pairs and 1147 tensors from /home/corey/AIModels/textgen/DeepSeek-V3-0324-IQ4_K_R4.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = deepseek2
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
-llama_model_loader: - kv 3: general.version str = V3-0324
-llama_model_loader: - kv 4: general.basename str = DeepSeek
-llama_model_loader: - kv 5: general.size_label str = 256x21B
-llama_model_loader: - kv 6: general.license str = mit
-llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
-llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
-llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
-llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
-llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
-llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
-llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
-llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
-llama_model_loader: - kv 16: general.file_type u32 = 340
-llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
-llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
-llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
-llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
-llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
-llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
-llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
-llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
-llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
-llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
-llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
-llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
-llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
-llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
-llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
-llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
-llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
-llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
-llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
-llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
-llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
-llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
-llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
-llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
-llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
-llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
-llama_model_loader: - kv 45: general.quantization_version u32 = 2
-llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
-llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
-llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
-llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
-llama_model_loader: - kv 50: split.no u16 = 0
-llama_model_loader: - kv 51: split.count u16 = 0
-llama_model_loader: - kv 52: split.tensors.count i32 = 1147
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 612 tensors
-llama_model_loader: - type iq4_k_r4: 116 tensors
-llama_model_loader: - type iq5_k_r4: 58 tensors
-llm_load_vocab: special tokens cache size = 818
-llm_load_vocab: token to piece cache size = 0.8223 MB
-llm_load_print_meta: format = GGUF V3 (latest)
-llm_load_print_meta: arch = deepseek2
-llm_load_print_meta: vocab type = BPE
-llm_load_print_meta: n_vocab = 129280
-llm_load_print_meta: n_merges = 127741
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 163840
-llm_load_print_meta: n_embd = 7168
-llm_load_print_meta: n_layer = 61
-llm_load_print_meta: n_head = 128
-llm_load_print_meta: n_head_kv = 128
-llm_load_print_meta: n_rot = 64
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_swa_pattern = 1
-llm_load_print_meta: n_embd_head_k = 192
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 1
-llm_load_print_meta: n_embd_k_gqa = 24576
-llm_load_print_meta: n_embd_v_gqa = 16384
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-06
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 18432
-llm_load_print_meta: n_expert = 256
-llm_load_print_meta: n_expert_used = 8
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 0
-llm_load_print_meta: rope scaling = yarn
-llm_load_print_meta: freq_base_train = 10000.0
-llm_load_print_meta: freq_scale_train = 0.025
-llm_load_print_meta: n_ctx_orig_yarn = 4096
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = IQ4_K_R4 - 4.5 bpw
-llm_load_print_meta: model params = 672.050 B
-llm_load_print_meta: model size = 386.183 GiB (4.936 BPW)
-llm_load_print_meta: repeating layers = 384.349 GiB (4.926 BPW, 670.196 B parameters)
-llm_load_print_meta: general.name = DeepSeek V3 0324
-llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
-llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: LF token = 131 'Ä'
-llm_load_print_meta: max token length = 256
-llm_load_print_meta: n_layer_dense_lead = 3
-llm_load_print_meta: n_lora_q = 1536
-llm_load_print_meta: n_lora_kv = 512
-llm_load_print_meta: n_ff_exp = 2048
-llm_load_print_meta: n_expert_shared = 1
-llm_load_print_meta: expert_weights_scale = 2.5
-llm_load_print_meta: expert_weights_norm = 1
-llm_load_print_meta: expert_gating_func = sigmoid
-llm_load_print_meta: rope_yarn_log_mul = 0.1000
-llm_load_tensors: ggml ctx size = 0.93 MiB
-Tensor blk.0.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.0.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.0.ffn_gate.weight buffer type overriden to CUDA0
-Tensor blk.0.ffn_down.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.1.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.1.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.1.ffn_gate.weight buffer type overriden to CUDA0
-Tensor blk.1.ffn_down.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.2.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.2.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.2.ffn_gate.weight buffer type overriden to CUDA0
-Tensor blk.2.ffn_down.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.3.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.3.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.3.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.4.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.4.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.4.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.4.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.5.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.5.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.5.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.6.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.6.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.6.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.6.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.7.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.7.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.7.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.7.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.8.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.8.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.8.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.8.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.9.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.9.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.9.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.9.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.10.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.10.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.10.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.10.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.11.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.11.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.11.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.11.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.12.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.12.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.12.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.12.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.13.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.13.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.13.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.13.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.14.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.14.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.14.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.14.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.15.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.15.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.15.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.15.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.16.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.16.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.16.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.16.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.17.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.17.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.17.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.17.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.18.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.18.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.18.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.18.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.19.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.19.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.19.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.19.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.20.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.20.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.20.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.20.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.21.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.21.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.21.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.21.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.22.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.22.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.22.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.22.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.23.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.23.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.23.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.23.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.24.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.24.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.24.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.24.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.25.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.25.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.25.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.25.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.26.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.26.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.26.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.26.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.27.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.27.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.27.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.27.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.28.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.28.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.28.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.28.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.29.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.29.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.29.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.29.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.30.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.30.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.30.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.30.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.31.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.31.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.31.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.31.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.32.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.32.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.32.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.32.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.33.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.33.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.33.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.33.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.34.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.34.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.34.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.34.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.35.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.35.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.35.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.35.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.36.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.36.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.36.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.36.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.37.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.37.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.37.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.37.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.38.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.38.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.38.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.38.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.39.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.39.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.39.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.39.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.40.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.40.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.40.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.40.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.41.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.41.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.41.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.41.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.42.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.42.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.42.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.42.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.43.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.43.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.43.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.43.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.44.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.44.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.44.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.44.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.45.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.45.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.45.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.45.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.46.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.46.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.46.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.46.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.47.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.47.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.47.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.47.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.48.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.48.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.48.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.48.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.49.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.49.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.49.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.49.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.50.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.50.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.50.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.50.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.51.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.51.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.51.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.51.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.52.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.52.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.52.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.52.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.53.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.53.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.53.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.53.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.54.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.54.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.54.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.54.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.55.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.55.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.55.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.55.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.56.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.56.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.56.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.56.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.57.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.57.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.57.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.57.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.58.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.58.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.58.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.58.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.59.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.59.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.59.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.59.ffn_up_shexp.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_q_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_kv_a_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_q_a.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_q_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_kv_a_mqa.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_kv_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_k_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_v_b.weight buffer type overriden to CUDA0
-Tensor blk.60.attn_output.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_norm.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_gate_inp.weight buffer type overriden to CUDA0
-Tensor blk.60.exp_probs_b.bias buffer type overriden to CUDA0
-Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
-Tensor blk.60.ffn_gate_shexp.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_down_shexp.weight buffer type overriden to CUDA0
-Tensor blk.60.ffn_up_shexp.weight buffer type overriden to CUDA0
-llm_load_tensors: offloading 61 repeating layers to GPU
-llm_load_tensors: offloading non-repeating layers to GPU
-llm_load_tensors: offloaded 62/62 layers to GPU
-llm_load_tensors: CPU buffer size = 392428.85 MiB
-llm_load_tensors: CPU buffer size = 938.98 MiB
-llm_load_tensors: CUDA0 buffer size = 17744.02 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 32768
-llama_new_context_with_model: n_batch = 1024
-llama_new_context_with_model: n_ubatch = 1024
-llama_new_context_with_model: flash_attn = 1
-llama_new_context_with_model: mla_attn = 3
-llama_new_context_with_model: attn_max_b = 512
-llama_new_context_with_model: fused_moe = 1
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: CUDA0 KV buffer size = 1166.65 MiB
-llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
-llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
-llama_new_context_with_model: CUDA0 compute buffer size = 3650.00 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 352.01 MiB
-llama_new_context_with_model: graph nodes = 8245
-llama_new_context_with_model: graph splits = 118
-INFO [ init] initializing slots | tid="127511205212160" timestamp=1748008241 n_slots=1
-INFO [ init] new slot | tid="127511205212160" timestamp=1748008241 id_slot=0 n_ctx_slot=32768
-INFO [ main] model loaded | tid="127511205212160" timestamp=1748008241
-INFO [ main] chat template | tid="127511205212160" timestamp=1748008241 chat_example="You are a helpful assistant\n\n<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>" built_in=true
-INFO [ main] HTTP server listening | tid="127511205212160" timestamp=1748008241 n_threads_http="31" port="7862" hostname="0.0.0.0"
-INFO [ update_slots] all slots are idle | tid="127511205212160" timestamp=1748008241
-INFO [ launch_slot_with_task] slot is processing task | tid="127511205212160" timestamp=1748008291 id_slot=0 id_task=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="127511205212160" timestamp=1748008291 id_slot=0 id_task=0 p0=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="127511205212160" timestamp=1748008303 id_slot=0 id_task=0 p0=1024
-INFO [ update_slots] kv cache rm [p0, end) | tid="127511205212160" timestamp=1748008315 id_slot=0 id_task=0 p0=2048
-INFO [ print_timings] prompt eval time = 25845.83 ms / 2190 tokens ( 11.80 ms per token, 84.73 tokens per second) | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 t_prompt_processing=25845.833 n_prompt_tokens_processed=2190 t_token=11.801750228310501 n_tokens_second=84.73319470879504
-INFO [ print_timings] generation eval time = 21665.24 ms / 222 runs ( 97.59 ms per token, 10.25 tokens per second) | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 t_token_generation=21665.244 n_decoded=222 t_token=97.59118918918918 n_tokens_second=10.246826668557253
-INFO [ print_timings] total time = 47511.08 ms | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 t_prompt_processing=25845.833 t_token_generation=21665.244 t_total=47511.077
-INFO [ update_slots] slot released | tid="127511205212160" timestamp=1748008339 id_slot=0 id_task=0 n_ctx=32768 n_past=2411 n_system_tokens=0 n_cache_tokens=2411 truncated=false
-INFO [ update_slots] all slots are idle | tid="127511205212160" timestamp=1748008339
-INFO [ log_server_request] request | tid="127095162204160" timestamp=1748008339 remote_addr="10.254.1.2" remote_port=43794 status=200 method="POST" path="/completion" params={}
-INFO [ update_slots] all slots are idle | tid="127511205212160" timestamp=1748008339
-`
-
----
-
-👤 **ikawrakow** commented the **2025-05-23** at **15:09:25**:
-
-In my case I see zero difference between current main branch and a2b5057a0c9a2758830b6f841bb22150d2511bb1. Tested with DeepSeek-Lite (the 16B little sibling of DeepSeek-V3/R1) and Qwen3-30B-A3B using the exact same custom quantization as yours.
-
-My CPU is Ryzen-7950X, so Zen4 core. Yours is Zen5, so both use the exact same implementation.
-
-I wouldn't know why the performance would change. The 18k LOC `iqk_mul_mat.cpp` got refactored into multiple files for faster build times. There was zero change done in #435.
-
-I would try `echo 3 | sudo tee /proc/sys/vm/drop_caches`, and then load the model with the **main branch first** to see what happens.
-
----
-
-👤 **cmoncure** commented the **2025-05-23** at **16:01:17**:
-
-Dropped cache.
-
-Main (bad) build first "ec456322"
-```
-[ print_timings] prompt eval time = 34619.60 ms / 2190 tokens ( 15.81 ms per token, 63.26 tokens per second) | tid="138682949877760" timestamp=1748014236 id_slot=0 id_task=0 t_prompt_processing=34619.603 n_prompt_tokens_processed=2190 t_token=15.80803789954338 n_tokens_second=63.25895764893664
-INFO [ print_timings] generation eval time = 22553.81 ms / 222 runs ( 101.59 ms per token, 9.84 tokens per second) | tid="138682949877760" timestamp=1748014236 id_slot=0 id_task=0 t_token_generation=22553.805 n_decoded=222 t_token=101.59371621621622 n_tokens_second=9.843128465462923
-```
-
-Switch to good build "a2b5057a"
-```
-INFO [ print_timings] prompt eval time = 48430.56 ms / 2190 tokens ( 22.11 ms per token, 45.22 tokens per second) | tid="128418970439680" timestamp=1748014922 id_slot=0 id_task=0 t_prompt_processing=48430.56 n_prompt_tokens_processed=2190 t_token=22.11441095890411 n_tokens_second=45.21938214218461
-INFO [ print_timings] generation eval time = 24928.21 ms / 222 runs ( 112.29 ms per token, 8.91 tokens per second) | tid="128418970439680" timestamp=1748014922 id_slot=0 id_task=0 t_token_generation=24928.211 n_decoded=222 t_token=112.28923873873873 n_tokens_second=8.905572886879046
-```
-
-Well now both are bad.
-
-Switch back to version: 3692 (b90d6ede)
-```
-INFO [ print_timings] prompt eval time = 25607.00 ms / 2190 tokens ( 11.69 ms per token, 85.52 tokens per second) | tid="132738167939072" timestamp=1748015946 id_slot=0 id_task=0 t_prompt_processing=25606.997 n_prompt_tokens_processed=2190 t_token=11.692692694063927 n_tokens_second=85.52349969033854
-INFO [ print_timings] generation eval time = 15771.66 ms / 222 runs ( 71.04 ms per token, 14.08 tokens per second) | tid="132738167939072" timestamp=1748015946 id_slot=0 id_task=0 t_token_generation=15771.659 n_decoded=222 t_token=71.04350900900901 n_tokens_second=14.075881300755997
-```
-Alright, we're in business again. I'll re-bisect dropping the cache each time.
-
----
-
-👤 **ikawrakow** commented the **2025-05-23** at **16:28:30**:
-
-So, you cannot base your measurement on just a single load and one run with 2000 prompt tokens and 200 generated tokens. These giant models take some time to "warm up".
-
-Your CPU has 16 cores, does `--threads-batch 32` help? In my case it always decreases performance compared to just using 16 threads on my 16-core CPU.
-
-You could try a much simpler tensor override rule. Just `-exps=CPU -ngl 100`.
-
----
-
-👤 **cmoncure** commented the **2025-05-23** at **18:33:25**:
-
-> These giant models take some time to "warm up".
-
-This differs from my observations, but I'll take it under advisement and post average results from 4 runs with 4 separate prompts, circling back to reuse one prompt at the end, and dropping cache with each build.
-
-methodology:
-1. echo 3 | sudo tee /proc/sys/vm/drop_caches
-2. git checkout
-3. cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
-4. cmake --build build --config Release -j16
-5. (my llama-server command)
-6. prompt A
-7. prompt B
-8. prompt C
-9. prompt A (repeated)
-
-Runs:
-1. version: 3698 (134d548) => 12.59 t/s (avg)
-2. version: 3701 (b3036a8) => 12.50 t/s (avg)
-3. version: 3703 (a2b5057) => 12.58 t/s (avg)
-4. version: 3704 (b94cd3b) => 9.78 t/s (avg) !
-5. version: 3703 (a2b5057) => 12.68 t/s (avg)
-6. version: 3704 (b94cd3b) => 9.85 t/s (avg) !
-
-(variance <= 0.14s in all runs)
-
-Sure looks like version 3704 is bad. Maybe some compiler optimizations aren't applying?
-
----
-
-👤 **cmoncure** commented the **2025-05-23** at **18:33:25**:
-
-> These giant models take some time to "warm up".
-
-This differs from my observations, but I'll take it under advisement and post average results from 4 runs with 4 separate prompts, circling back to reuse one prompt at the end, and dropping cache with each build.
-
-methodology:
-1. echo 3 | sudo tee /proc/sys/vm/drop_caches
-2. git checkout
-3. cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=89
-4. cmake --build build --config Release -j16
-5. (my llama-server command)
-6. prompt A
-7. prompt B
-8. prompt C
-9. prompt A (repeated)
-
-Runs:
-1. version: 3698 (134d548) => 12.59 t/s (avg)
-2. version: 3701 (b3036a8) => 12.50 t/s (avg)
-3. version: 3703 (a2b5057) => 12.58 t/s (avg)
-4. version: 3704 (b94cd3b) => 9.78 t/s (avg) !
-5. version: 3703 (a2b5057) => 12.68 t/s (avg)
-6. version: 3704 (b94cd3b) => 9.85 t/s (avg) !
-
-(variance <= 0.14s in all runs)
-
-Sure looks like version 3703 is bad. Maybe some compiler optimizations aren't applying?
-
----
-
-👤 **Ph0rk0z** commented the **2025-05-23** at **19:34:30**:
-
-Try with llama sweep bench to get a better average. I didn't notice anything either but I was just using qwen.
-
----
-
-👤 **saood06** commented the **2025-05-24** at **23:53:08**:
-
-@cmoncure
-
-Do you mind trying if setting GGML_LTO on when building it helps?
-
----
-
-👤 **cmoncure** commented the **2025-05-30** at **23:32:18**:
-
-Newer versions seem to have improved (to within 10% of a2b5057) so I'm closing this.
\ No newline at end of file
diff --git a/github-data/issues/452 - Falcon H1 Support.md b/github-data/issues/452 - Falcon H1 Support.md
index 16bc9d09a..553a50083 100644
--- a/github-data/issues/452 - Falcon H1 Support.md
+++ b/github-data/issues/452 - Falcon H1 Support.md
@@ -1,14 +1,15 @@
-### 📝 [#452](https://github.com/ikawrakow/ik_llama.cpp/issues/452) - Falcon H1 Support
+## 📌 [Issue #452](https://github.com/ikawrakow/ik_llama.cpp/issues/452) - Falcon H1 Support
| **Author** | `Downtown-Case` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-05-23 |
| **Updated** | 2025-06-27 |
+| **Labels** | `enhancement` |
---
-#### Description
+## 📄 Description
A hybrid transformers/mamba2 series with good performance: https://huggingface.co/collections/tiiuae/falcon-h1-6819f2795bc406da60fab8df
@@ -18,20 +19,20 @@ Support for ik_llama.cpp's tighter quantization schemes would be nice :). Maybe
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-24** at **07:04:24**:
+👤 **ikawrakow** commented on **2025-05-24** at **07:04:24**
Have you though about adding a feature request to the llama.cpp-Falcon-H1 authors?
---
-👤 **Downtown-Case** commented the **2025-06-02** at **18:19:21**:
+👤 **Downtown-Case** commented on **2025-06-02** at **18:19:21**
Seems their implementation needs more time in the oven anyway.
---
-👤 **Downtown-Case** commented the **2025-06-27** at **14:31:42**:
+👤 **Downtown-Case** commented on **2025-06-27** at **14:31:42**
Closing this
\ No newline at end of file
diff --git a/github-data/issues/455 - Bug_ KV cache is never reused in OpenAI compatible Chat Completion api.md b/github-data/issues/455 - Bug KV cache is never reused in OpenAI compatible Chat Completion api.md
similarity index 97%
rename from github-data/issues/455 - Bug_ KV cache is never reused in OpenAI compatible Chat Completion api.md
rename to github-data/issues/455 - Bug KV cache is never reused in OpenAI compatible Chat Completion api.md
index ea7c721fc..9ce35dab2 100644
--- a/github-data/issues/455 - Bug_ KV cache is never reused in OpenAI compatible Chat Completion api.md
+++ b/github-data/issues/455 - Bug KV cache is never reused in OpenAI compatible Chat Completion api.md
@@ -1,4 +1,4 @@
-### 🐛 [#455](https://github.com/ikawrakow/ik_llama.cpp/issues/455) - Bug: KV cache is never reused in OpenAI compatible Chat Completion api
+## 📌 [Issue #455](https://github.com/ikawrakow/ik_llama.cpp/issues/455) - Bug: KV cache is never reused in OpenAI compatible Chat Completion api
| **Author** | `luzamm` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -516,9 +516,9 @@ INFO [ update_slots] all slots are idle | tid="137281198051328" times
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-05-24** at **23:39:01**:
+👤 **saood06** commented on **2025-05-24** at **23:39:01**
Are you passing in `cache_prompt: true` in your request?
@@ -528,21 +528,13 @@ Edit: Just want to add I use the server and I can get KV cache to be reused betw
---
-👤 **saood06** commented the **2025-05-24** at **23:39:01**:
-
-Are you passing in `cache_prompt: true` in your request?
-
-I know llama.cpp now defaults to it being on, but we do not do that here (would be trivial to change), so as it stands it will not reuse the cache unless you pass that.
-
----
-
-👤 **ikawrakow** commented the **2025-05-25** at **04:32:30**:
+👤 **ikawrakow** commented on **2025-05-25** at **04:32:30**
@saood06 Maybe we should change the default?
---
-👤 **saood06** commented the **2025-05-25** at **04:49:04**:
+👤 **saood06** commented on **2025-05-25** at **04:49:04**
> [@saood06](https://github.com/saood06) Maybe we should change the default?
@@ -552,13 +544,13 @@ I've been tinkering with an alternative caching mechanism as I don't fully like
---
-👤 **luzamm** commented the **2025-05-25** at **08:55:07**:
+👤 **luzamm** commented on **2025-05-25** at **08:55:07**
After passing cache_prompt:true , it worked well. But there are many webui do not pass this field and nowhere to add easily. Is it better to turn it on by default?
---
-👤 **saood06** commented the **2025-05-25** at **09:17:43**:
+👤 **saood06** commented on **2025-05-25** at **09:17:43**
> After passing cache_prompt:true , it worked well.
@@ -570,13 +562,13 @@ Yes, I will do that. I looked into it enough to deem it trivial, just haven't go
---
-👤 **Ph0rk0z** commented the **2025-05-25** at **16:28:04**:
+👤 **Ph0rk0z** commented on **2025-05-25** at **16:28:04**
It never reprocess my cache because I used text completion with sillytavern. What happens when you reach the context limit? I know that mainline has some mechanism for that. Does it just reprocess context with every message post limit?
---
-👤 **saood06** commented the **2025-05-28** at **01:00:43**:
+👤 **saood06** commented on **2025-05-28** at **01:00:43**
@luzamm
Sorry for the delay, but the PR has been made that changes the default, and I have linked it to this issue to automatically close once it gets merged in.
@@ -590,7 +582,7 @@ I have not used context shifting in a long time but as far as I can tell the imp
---
-👤 **Ph0rk0z** commented the **2025-05-28** at **15:12:09**:
+👤 **Ph0rk0z** commented on **2025-05-28** at **15:12:09**
>I have not used context shifting in a long time but as far as I can tell the implementation here is the same as the one I have experienced.
@@ -598,7 +590,7 @@ I thought it doesn't work here because it was forked before the implementation i
---
-👤 **saood06** commented the **2025-05-28** at **22:04:21**:
+👤 **saood06** commented on **2025-05-28** at **22:04:21**
> I thought it doesn't work here because it was forked before the implementation in main. There is no --cache-reuse flag and I see nothing about context shift. Only ever tried the implementation in ooba.
diff --git a/github-data/issues/456 - Bug_ no compilation without IQK_MULMAT.md b/github-data/issues/456 - Bug no compilation without IQK_MULMAT.md
similarity index 75%
rename from github-data/issues/456 - Bug_ no compilation without IQK_MULMAT.md
rename to github-data/issues/456 - Bug no compilation without IQK_MULMAT.md
index 6ea7cf0ef..4b6a12e66 100644
--- a/github-data/issues/456 - Bug_ no compilation without IQK_MULMAT.md
+++ b/github-data/issues/456 - Bug no compilation without IQK_MULMAT.md
@@ -1,4 +1,4 @@
-### 🐛 [#456](https://github.com/ikawrakow/ik_llama.cpp/issues/456) - Bug: no compilation without IQK_MULMAT
+## 📌 [Issue #456](https://github.com/ikawrakow/ik_llama.cpp/issues/456) - Bug: no compilation without IQK_MULMAT
| **Author** | `Nexesenex` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -40,14 +40,14 @@ will not be compiled because "static void ggml_compute_forward_mul_mat_id_up_gat
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-25** at **04:30:11**:
+👤 **ikawrakow** commented on **2025-05-25** at **04:30:11**
It no longer works without `GGML_USE_IQK_MULMAT`, so I'll just remove that option.
---
-👤 **Nexesenex** commented the **2025-05-25** at **12:27:17**:
+👤 **Nexesenex** commented on **2025-05-25** at **12:27:17**
Et voilà!
\ No newline at end of file
diff --git a/github-data/issues/463 - Research_ V100 Flash Attention Implementation.md b/github-data/issues/463 - Research V100 Flash Attention Implementation.md
similarity index 91%
rename from github-data/issues/463 - Research_ V100 Flash Attention Implementation.md
rename to github-data/issues/463 - Research V100 Flash Attention Implementation.md
index b641aed3c..032a06da3 100644
--- a/github-data/issues/463 - Research_ V100 Flash Attention Implementation.md
+++ b/github-data/issues/463 - Research V100 Flash Attention Implementation.md
@@ -1,4 +1,4 @@
-### 📝 [#463](https://github.com/ikawrakow/ik_llama.cpp/issues/463) - Research: V100 Flash Attention Implementation
+## 📌 [Issue #463](https://github.com/ikawrakow/ik_llama.cpp/issues/463) - Research: V100 Flash Attention Implementation
| **Author** | `sempervictus` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### Research Stage
@@ -80,9 +80,9 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-28** at **11:42:36**:
+👤 **ikawrakow** commented on **2025-05-28** at **11:42:36**
So, my concept is that the flash attention implementation supports Volta, except for the case of DeepSeek models with MLA enabled where Touring or newer is required. The DeepSeek attention architecture has different K- and V-head sizes. Is this supported by the quoted implementation? The usage example suggests that it is not supported.
@@ -90,13 +90,13 @@ But apart from this, support for old hardware is not a focus of this project. Ma
---
-👤 **sempervictus** commented the **2025-05-28** at **17:20:32**:
+👤 **sempervictus** commented on **2025-05-28** at **17:20:32**
@ikawrakow thanks for jumping in. This is a class of hardware still very common in academia and much more available to aspiring developers than a data haul of water-cooled B200s so i'm hoping an exception can be made for putting talented effort toward a an area of runtime logic which underpins a lot of the operating mechanics/capability to include KV quantization. If anything, the optimal use of memory in those devices is difference between being able and unable to load a model (not being able to fit runtime memory into a single device apparently prevents loading of a model that would otherwise fit into multiple devices just fine). So far with our V100s we've see flash attention unsupported messages with every model loaded - llama3/4, phi, falcon, DS, qwen.
---
-👤 **ikawrakow** commented the **2025-05-29** at **06:09:35**:
+👤 **ikawrakow** commented on **2025-05-29** at **06:09:35**
@sempervictus
@@ -104,6 +104,6 @@ Water-cooled B-200s are not a focus here either. This is a hobby project, and I
---
-👤 **sempervictus** commented the **2025-05-29** at **08:49:16**:
+👤 **sempervictus** commented on **2025-05-29** at **08:49:16**
Thank you
\ No newline at end of file
diff --git a/github-data/issues/464 - Bug The streaming every couple of rows blocks for 5-8s.md b/github-data/issues/464 - Bug The streaming every couple of rows blocks for 5-8s.md
new file mode 100644
index 000000000..492d448d9
--- /dev/null
+++ b/github-data/issues/464 - Bug The streaming every couple of rows blocks for 5-8s.md
@@ -0,0 +1,701 @@
+## 📌 [Issue #464](https://github.com/ikawrakow/ik_llama.cpp/issues/464) - Bug: The streaming every couple of rows blocks for 5-8s
+
+| **Author** | `ciprianveg` |
+| :--- | :--- |
+| **State** | ❌ **Closed** |
+| **Created** | 2025-05-27 |
+| **Updated** | 2025-07-23 |
+
+---
+
+## 📄 Description
+
+### What happened?
+
+Although I obtained good sweep-bench results for 235b UD_Q5_XL as shown below, and with the q4 quant they were 20% faster, in both cases, this annoying blocking happens every couple of rows. I tried changing from 16 threads to 12, but same thing happens. Wilth main llama, is like 25% slower, but is cursive.
+My system is a TR 3955wx with 16 cores, 256 ddr4 3200, 2x3090..
+Any ideas?
+./build/bin/llama-sweep-bench --model /home/ciprian/ai/models/Qwen3-235B-UD_Q5_XL/Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf --alias Qwen3-235B-A22B-UD-Q5_K_XL -fa -fmoe -ctk q8_0 -ctv q8_0 -c 40960 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 -ot "blk\.[0-9]\.ffn_up_exps=CUDA0,blk\.[0-9]\.ffn_gate_exps=CUDA0,blk\.2[0-4]\.ffn_up_exps=CUDA0,blk\.2[0-4]\.ffn_gate_exps=CUDA0,blk\.1[0-9]\.ffn_up_exps=CUDA1,blk\.1[0-9]\.ffn_gate_exps=CUDA1,blk\.2[5-8]\.ffn_up_exps=CUDA1,blk\.2[5-8]\.ffn_gate_exps=CUDA1,exps=CPU" -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 --ubatch-size 4096 --batch-size 4096
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 4096 | 1024 | 0 | 11.730 | 349.19 | 133.500 | 7.67 |
+| 4096 | 1024 | 4096 | 12.079 | 339.11 | 136.944 | 7.48 |
+| 4096 | 1024 | 8192 | 12.514 | 327.33 | 140.286 | 7.30 |
+| 4096 | 1024 | 12288 | 13.038 | 314.17 | 144.478 | 7.09 |
+| 4096 | 1024 | 16384 | 13.545 | 302.40 | 148.595 | 6.89 |
+| 4096 | 1024 | 20480 | 13.943 | 293.76 | 151.881 | 6.74 |
+| 4096 | 1024 | 24576 | 14.767 | 277.38 | 154.643 | 6.62 |
+| 4096 | 1024 | 28672 | 15.621 | 262.21 | 158.355 | 6.47 |
+| 4096 | 1024 | 32768 | 16.561 | 247.32 | 161.875 | 6.33 |
+| 4096 | 1024 | 36864 | 17.658 | 231.97 | 166.160 | 6.16 |
+
+### Name and Version
+
+llama-server -model /home/ciprian/ai/models/Qwen3-235B-UD_Q5_XL/Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf --alias Qwen3-235B-A22B-UD-Q5_K_XL -fa -fmoe -ctk q8_0 -ctv q8_0 -c 40960 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 -ot "blk\.[0-9]\.ffn_up_exps=CUDA0,blk\.[0-9]\.ffn_gate_exps=CUDA0,blk\.2[0-4]\.ffn_up_exps=CUDA0,blk\.2[0-4]\.ffn_gate_exps=CUDA0,blk\.1[0-9]\.ffn_up_exps=CUDA1,blk\.1[0-9]\.ffn_gate_exps=CUDA1,blk\.2[5-8]\.ffn_up_exps=CUDA1,blk\.2[5-8]\.ffn_gate_exps=CUDA1,exps=CPU" -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 --ubatch-size 4096 --batch-size 4096
+
+### What operating system are you seeing the problem on?
+
+_No response_
+
+### Relevant log output
+
+```shell
+
+```
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-05-28** at **05:17:25**
+
+Not sure. Do you get many tokens at once after the 5-8 seconds pause, or it just did nothing for 5-8 seconds?
+
+---
+
+👤 **ciprianveg** commented on **2025-05-28** at **06:19:29**
+
+It looks like it did nothing, sometimes a second 5-8 s pause comes after
+just 2 words, other times after 2 rows of text. I tried also with 2048
+ubatch size and with using amb 512, no difference. For my hardware, what
+would be the most suitable build params. I am now setting gpu ggml sched
+copies to 1, cublast off and ggml cuda on
+
+On Wed, 28 May 2025, 08:17 Kawrakow, ***@***.***> wrote:
+
+> *ikawrakow* left a comment (ikawrakow/ik_llama.cpp[#464](https://github.com/ikawrakow/ik_llama.cpp/issues/464))
+>
+>
+> Not sure. Do you get many tokens at once after the 5-8 seconds pause, or
+> it just did nothing for 5-8 seconds?
+>
+> —
+> Reply to this email directly, view it on GitHub
+> ,
+> or unsubscribe
+>
+> .
+> You are receiving this because you authored the thread.Message ID:
+> ***@***.***>
+>
+
+---
+
+👤 **ikawrakow** commented on **2025-05-28** at **07:56:05**
+
+I'm trying to understand the root cause for this strange behavior. Can you reproduce it using `llama-cli` ?
+
+---
+
+👤 **ciprianveg** commented on **2025-05-28** at **10:01:30**
+
+I will try this evening and let you know
+
+---
+
+👤 **ciprianveg** commented on **2025-05-28** at **13:06:09**
+
+Something that maybe can give a clue is that my system is cpu limited, i have 8 channels ddr4 3200 ram but the memory read speed is limited to 85Mb/s instead of the theoretical >200Mb/s because the 16 cores are not enough to read at that speed. This is against the standard cpu systems where memory speed is the limiter, not the cpu..
+
+---
+
+👤 **ciprianveg** commented on **2025-05-28** at **16:52:25**
+
+same issue also with llama-cli
+
+---
+
+👤 **ikawrakow** commented on **2025-05-28** at **17:22:10**
+
+Is there disc activity during the pause? Have you looked at process activity during the pause? Are you running llama.cpp with the exact same parameters (apart from -fmoe)? Is there another memory hungry process running (e.g., another llama.cpp server)?
+
+---
+
+👤 **ciprianveg** commented on **2025-05-28** at **17:27:38**
+
+Llama.cpp runs with exact params except fmoe. I have 256Gb ram and almost
+100gb free. No other memory hungry process..
+
+On Wed, 28 May 2025, 20:22 Kawrakow, ***@***.***> wrote:
+
+> *ikawrakow* left a comment (ikawrakow/ik_llama.cpp[#464](https://github.com/ikawrakow/ik_llama.cpp/issues/464))
+>
+>
+> Is there disc activity during the pause? Have you looked at process
+> activity during the pause? Are you running llama.cpp with the exact same
+> parameters (apart from -fmoe)? Is there another memory hungry process
+> running (e.g., another llama.cpp server)?
+>
+> —
+> Reply to this email directly, view it on GitHub
+> ,
+> or unsubscribe
+>
+> .
+> You are receiving this because you authored the thread.Message ID:
+> ***@***.***>
+>
+
+---
+
+👤 **ikawrakow** commented on **2025-05-29** at **04:19:32**
+
+What about the first two questions? Is the CPU busy during the pauses or just sitting there doing nothing? But at the end it might be easier to just run in the debugger and when it pauses, hit Ctrl-C, type `bt`, and post the backtrace here.
+
+---
+
+👤 **ciprianveg** commented on **2025-05-29** at **05:35:10**
+
+1. Disk activity, no
+2. Top shows llama server between 100-500% when it works and same when it pauses
+
+---
+
+👤 **kirnat** commented on **2025-05-29** at **09:12:01**
+
+Check your PCIe traffic with nvtop or similar when the pause happens. Does it happen if you don't offload any experts to the GPUs?
+
+---
+
+👤 **ikawrakow** commented on **2025-05-29** at **09:31:47**
+
+To test the hypothesis that it gets stuck on copying tensors to the GPU, you can run with `-op 26,0,27,0,29,0`. This disables offloading tensors to the GPU for any type of matrix multiplication.
+
+But running in the debugger, interrupting with Ctrl-C when it gets stuck, and sending the backtrace will hopefully also diagnose where (in which function) it hangs for so long.
+
+---
+
+👤 **ciprianveg** commented on **2025-05-29** at **09:44:38**
+
+XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT to OFF
+XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
+XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
+XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MUL_MAT) = 0
+XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MUL_MAT_ID) = 0
+XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MOE_FUSED_UP_GATE) = 0
+
+same issue
+
+---
+
+👤 **ciprianveg** commented on **2025-05-29** at **09:54:13**
+
+Thread 1 "llama-server" received signal SIGINT, Interrupt.
+Download failed: Invalid argument. Continuing without source file ./nptl/./nptl/pthread_mutex_lock.c.
+0x00007fffee4a014c in lll_mutex_lock_optimized (mutex=0x55555899a0d8) at ./nptl/pthread_mutex_lock.c:48
+warning: 48 ./nptl/pthread_mutex_lock.c: No such file or directory
+
+this is from debug
+
+also, with nvtop, when pause happens, the gpus transfer speed is around 1,8GB/s and as soon as it unblocks drops to 50-100MB/s
+
+---
+
+👤 **ciprianveg** commented on **2025-05-29** at **13:02:59**
+
+it happened also with ngl 0, with nothing sent to gpus, only that being slower, like 2-3tok/s also the pause was longer, cca 20s
+
+llama-server --model /home/ciprian/ai/models/Qwen3-235B-UD_Q4_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf --alias Qwen3-235B-A22B-UD-Q4_K_XL -fa -ctk q8_0 -ctv q8_0 -c 36864 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 -ngl 0 --threads 16 --host 0.0.0.0 --port 5002 --ubatch-size 4096 --batch-size 4096
+
+---
+
+👤 **ikawrakow** commented on **2025-05-29** at **13:22:24**
+
+If you want to test if the pauses happen when running CPU only, you need to say `CUDA_VISIBLE_DEVICES="" ./bin/llama-server...`. Or just make a build with CUDA disabled.
+
+The debug session above was not useful as the main thread is the server thread, so we don't see where the computation hangs. To get the desired backtrace you need to run `llama-cli`.
+
+> the gpus transfer speed is around 1,8GB/s and as soon as it unblocks drops to 50-100MB/s
+
+Isn't this kind of slow? But even at that rate in 5 seconds it will transfer ~9 GB to the GPU. A `Q5_K` quantized Qwen3-235-A22B layer is in the range of 1.8 GB, so it is transferring 5 layers worth of tensors?
+
+Or is this all happening when your context gets full?
+
+---
+
+👤 **ciprianveg** commented on **2025-05-29** at **13:59:44**
+
+debug on llama-cli ctrl+c when paused i don't think is helpful:
+Thread 1 "llama-cli" received signal SIGINT, Interrupt.
+0x00007fffe5391028 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+(gdb)
+
+---
+
+👤 **ikawrakow** commented on **2025-05-29** at **15:42:57**
+
+I guess you need
+```
+thread apply all bt
+```
+
+---
+
+👤 **ciprianveg** commented on **2025-05-29** at **17:40:41**
+
+Hi @ikawrakow:
+0x00007fffe5391024 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+(gdb) thread apply all bt
+
+Thread 21 (Thread 0x7fff647db000 (LWP 18073) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 20 (Thread 0x7fff64fdc000 (LWP 18072) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 19 (Thread 0x7fff657dd000 (LWP 18071) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+--Type for more, q to quit, c to continue without paging--
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 18 (Thread 0x7fff65fde000 (LWP 18070) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 17 (Thread 0x7fff667df000 (LWP 18069) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:6--Type for more, q to quit, c to continue without paging--
+0
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 16 (Thread 0x7fff66fe0000 (LWP 18068) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 15 (Thread 0x7fff677e1000 (LWP 18067) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+--Type for more, q to quit, c to continue without paging--
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 14 (Thread 0x7fff67fe2000 (LWP 18066) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 13 (Thread 0x7fff687e3000 (LWP 18065) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 12 (Thread 0x7fff68fe4000 (LWP 18064) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:6--Type for more, q to quit, c to continue without paging--
+0
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 11 (Thread 0x7fff697e5000 (LWP 18063) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 10 (Thread 0x7fff69fe6000 (LWP 18062) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 9 (Thread 0x7fff6a7e7000 (LWP 18061) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+--Type for more, q to quit, c to continue without paging--
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 8 (Thread 0x7fff6afe8000 (LWP 18060) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 7 (Thread 0x7fff6b7e9000 (LWP 18059) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+--Type for more, q to quit, c to continue without paging--
+
+Thread 6 (Thread 0x7fffa0afa000 (LWP 18018) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) 0x00007fffee498d71 in __futex_abstimed_wait_common64 (private=32767, cancel=true, abstime=0x7fffa0ad6800, op=393, expected=0, futex_word=0x555555cccca0) at ./nptl/futex-internal.c:57
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) __futex_abstimed_wait_common (cancel=true, private=32767, abstime=0x7fffa0ad6800, clockid=0, expected=0, futex_word=0x555555cccca0) at ./nptl/futex-internal.c:87
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555555cccca0, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7fffa0ad6800, private=private@entry=0) at ./nptl/futex-internal.c:139
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007fffee49bc8e in __pthread_cond_wait_common (abstime=0x7fffa0ad6800, clockid=0, mutex=0x555555cc7d30, cond=0x555555cccc78) at ./nptl/pthread_cond_wait.c:503
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) ___pthread_cond_timedwait64 (cond=0x555555cccc78, mutex=0x555555cc7d30, abstime=0x7fffa0ad6800) at ./nptl/pthread_cond_wait.c:652
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffe53cadfa in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffe546e143 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#7](https://github.com/ikawrakow/ik_llama.cpp/issues/7) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 5 (Thread 0x7fffa231c000 (LWP 18017) "cuda-EvtHandlr"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) 0x00007fffee51b4cd in __GI___poll (fds=0x7fff70000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) 0x00007fffe547644f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) 0x00007fffe553a80f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007fffe546e143 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 4 (Thread 0x7fffa2d1d000 (LWP 18016) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) 0x00007fffee498d71 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7fffa2cf9800, op=393, expected=0--Type for more, q to quit, c to continue without paging--
+, futex_word=0x555555d20600) at ./nptl/futex-internal.c:57
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x7fffa2cf9800, clockid=0, expected=0, futex_word=0x555555d20600) at ./nptl/futex-internal.c:87
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555555d20600, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7fffa2cf9800, private=private@entry=0) at ./nptl/futex-internal.c:139
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007fffee49bc8e in __pthread_cond_wait_common (abstime=0x7fffa2cf9800, clockid=0, mutex=0x555555cd1320, cond=0x555555d205d8) at ./nptl/pthread_cond_wait.c:503
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) ___pthread_cond_timedwait64 (cond=0x555555d205d8, mutex=0x555555cd1320, abstime=0x7fffa2cf9800) at ./nptl/pthread_cond_wait.c:652
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffe53cadfa in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffe546e143 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#7](https://github.com/ikawrakow/ik_llama.cpp/issues/7) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 3 (Thread 0x7fffa453f000 (LWP 18015) "cuda-EvtHandlr"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) 0x00007fffee51b4cd in __GI___poll (fds=0x7fff88000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) 0x00007fffe547644f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) 0x00007fffe553a80f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007fffe546e143 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 2 (Thread 0x7fffb2dff000 (LWP 18008) "cuda00001400006"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) 0x00007fffee51b4cd in __GI___poll (fds=0x555555cd4240, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) 0x00007fffe547644f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) 0x00007fffe553a80f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007fffe546e143 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+--Type for more, q to quit, c to continue without paging--
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+
+Thread 1 (Thread 0x7ffff7c4d000 (LWP 18005) "llama-cli"):
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) 0x00007fffe5391024 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) 0x00007fffe543328a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) 0x00007fffe5583eae in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007fffe5585a4c in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) 0x00007fffe56e29f9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffe5341556 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffe5341a70 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#7](https://github.com/ikawrakow/ik_llama.cpp/issues/7) 0x00007fffe5342407 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) 0x00007fffe54ebfe9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+[#9](https://github.com/ikawrakow/ik_llama.cpp/issues/9) 0x00007fffee0481a9 in ?? () from /usr/local/cuda-12.8/lib64/libcudart.so.12
+[#10](https://github.com/ikawrakow/ik_llama.cpp/issues/10) 0x00007fffee017058 in ?? () from /usr/local/cuda-12.8/lib64/libcudart.so.12
+[#11](https://github.com/ikawrakow/ik_llama.cpp/issues/11) 0x00007fffee07693c in cudaMemcpyAsync () from /usr/local/cuda-12.8/lib64/libcudart.so.12
+[#12](https://github.com/ikawrakow/ik_llama.cpp/issues/12) 0x00007fffeee271e5 in ggml_backend_cuda_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long) () from /home/ciprian/ai/ik_llama.cpp/build/ggml/src/libggml.so
+[#13](https://github.com/ikawrakow/ik_llama.cpp/issues/13) 0x00007fffeecc0bfc in ggml_backend_sched_graph_compute_async () from /home/ciprian/ai/ik_llama.cpp/build/ggml/src/libggml.so
+[#14](https://github.com/ikawrakow/ik_llama.cpp/issues/14) 0x00007ffff7e8e522 in llama_decode () from /home/ciprian/ai/ik_llama.cpp/build/src/libllama.so
+[#15](https://github.com/ikawrakow/ik_llama.cpp/issues/15) 0x0000555555573b55 in main ()
+
+---
+
+👤 **ikawrakow** commented on **2025-05-30** at **06:41:25**
+
+OK, so we see it being stuck on a call to `cudaMemcpyAsync` copying data from the host to the GPU. No idea why. Or why the transfer rate is just 1.8 GB/s.
+
+---
+
+👤 **ciprianveg** commented on **2025-05-30** at **18:28:15**
+
+Strange, with deepseek i2k from ubergarm it works perfectly..
+
+---
+
+👤 **ikawrakow** commented on **2025-05-31** at **05:31:40**
+
+Thanks for the update.
+
+I really don't know what could be causing the pauses and, unlike the illegal memory access bug, nobody else has reported a similar problem.
+
+---
+
+👤 **pt13762104** commented on **2025-05-31** at **11:33:44**
+
+I also found this problem on my Pc with Qwen3 30B Q4_K_XL. It just stops for a few seconds, then it might be slow or not... unlike llama.cpp.
+
+---
+
+👤 **ciprianveg** commented on **2025-06-01** at **18:03:32**
+
+Another feedback: i tried the 235b iq3 quant done by @ubergarm and it works fine. Maybe the issue is caused by the unsloth UD XL q3, q4 and q6 quants
+
+---
+
+👤 **samteezy** commented on **2025-07-22** at **15:40:24**
+
+I have some context to add here that may help, idk. I'm running a very recent build of `ik_llama` from maybe a week ago if that.
+
+I've been running `Qwen3-30B-A3B-128K-UD-Q5_K_XL.gguf` with `ik_llama` using Vulkan and experiencing lower performance than regular `llama.cpp`, but it's mainly been due to stuttering/pausing during text generation. I noticed that [this pull](https://github.com/ikawrakow/ik_llama.cpp/pull/573) mentions some odd behavior around commas, and what I'm experiencing seems similar, though my delays are more like around a second or less rather than the 5-8 sec mentioned here, I assume because I'm GPU-accelerated. (I have the experts offloaded to CPU).
+
+So today I tried running all three with the same prompt:
+
+`Qwen3-30B-A3B-128K-UD-Q5_K_XL.gguf` from unsloth - dynamic quant
+`Qwen3-30B-A3B-128K-Q4_K_M.gguf` also from unsloth - not a dynamic quant
+`Qwen3-30B-A3B-Q4_K_M.gguf` directly from Qwen
+
+With these settings (all are run via `llama-swap`):
+
+```
+cmd: |
+ /root/llama-builds/ik_llama.cpp/bin/llama-server
+ --port ${PORT}
+ --flash-attn
+ -ctk q8_0 -ctv q8_0
+ --threads 17
+ --n-gpu-layers 0 -sm none --main-gpu 1
+ -m /mnt/models/unsloth/Qwen3-30B-A3B-128K-UD-Q5_K_XL.gguf
+ -fmoe
+ -ot ".*ffn_.*_exps\.weight=CPU"
+ --temp 0.7
+ --min-p 0
+ --top-p 0.8
+ --top-k 20
+ --ctx-size 128000
+ --presence-penalty 0.1
+```
+
+The Qwen official GGUF runs without stutter or issue with near identical performance to `llama.cpp`, but both the dynamic and "regular" quants from unsloth experience the pausing issue. So I don't think it's related to the dynamic quants, but something else with unsloth's edits/fixes.
+
+---
+
+👤 **ciprianveg** commented on **2025-07-22** at **17:24:19**
+
+I can confirm it is comma related.
+
+---
+
+👤 **saood06** commented on **2025-07-22** at **18:42:24**
+
+> I can confirm it is comma related.
+
+I collected as much info as I could about this bug [here](https://github.com/ikawrakow/ik_llama.cpp/pull/573#issuecomment-3033895399).
+
+---
+
+👤 **ciprianveg** commented on **2025-07-22** at **21:08:18**
+
+I tried adding to my llama-server command --override-kv tokenizer.ggml.bos_token_id=int:-1, but it still pauses every comma..
+
+---
+
+👤 **saood06** commented on **2025-07-22** at **21:14:57**
+
+> I tried adding to my llama-server command --override-kv tokenizer.ggml.bos_token_id=int:-1, but it still pauses every comma..
+
+Which is to be expected. The only reason that was relevant was with dots the BOS token upon being loaded was a comma which caused it to present the underlying issue.
+
+Changing that didn't fix the issue with the comma, it just prevented an incorrect situation that just happened to intersect with the comma bug.
+
+---
+
+👤 **ubergarm** commented on **2025-07-23** at **04:43:24**
+
+Hrmm, I possibly just observed this "brief pause after generating a `,` character" in very early test of https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/ on the IQ2_KS. Not sure, and I'm too sleepy to look into it. Just dropping a note to my future self to look more into it.
+
+I also added a note there on the discussing https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/discussions/1#688067cdf4ffc5de61c3f86a linking back to here suggesting trying that tokenizer trick which probably won't work and has nothing to do with it lol.
+
+Goodnight and I hope no new MoEs drop while I'm asleep. 💀 😹
+
+---
+
+👤 **saood06** commented on **2025-07-23** at **05:53:09**
+
+>I also added a note there on the discussing https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/discussions/1#688067cdf4ffc5de61c3f86a linking back to here suggesting trying that tokenizer trick which probably won't work and has nothing to do with it lol.
+
+I replied above that it doesn't. The model linked has `tokenizer.ggml.add_bos_token = false`. Turning the BOS token off in two ways (`-1` and `false`), will not resolve the comma bug.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **06:18:22**
+
+I can confirm the issue with the Qwen3-30B-A3B `Q4_K_M` model from Unsloth (although the pauses are much shorter, more like a fraction of a second, but still noticable at the 50 t/s speed I have). It does not happen if I quantize the model myself from the official Qwen3-30B-A3B `bf16` model using the exact same `Q4_K_M` recipe.
+
+As the pauses are too short to be able to reliably interrupt at exactly that point in a debugger, I decided to see if I can make it easier to break at a pause. So, I asked the model to write 100 commas in a row. The pauses occur while it is thinking, but when it starts writing the commas there are no pauses. Then I though may be it is comma followed by space, so I asked it to write comma followed by space 100 times. Same thing - pauses while thinking, no pauses while writing `, , , ...`.
+
+I'll try to debug today, let's see how it goes.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **06:21:28**
+
+@samteezy The Vulkan backend does not have an implementation for the `-fmoe` command line argument. The result is that the fused `ffn_up+ffn_gate` op that is enabled by `-fmoe` will be run on the CPU. I'm also wondering why you use `-ngl 0`. My guess is that `-ngl 100 -ot exps=CPU` without the `-fmoe` will produce a better performance with Vulkan.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **07:45:39**
+
+So, with the model that I quantize myself I see
+```
+llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
+```
+
+But with the model from Unsloth I see
+```
+llm_load_print_meta: BOS token = 11 ','
+```
+
+If I run
+```
+./bin/llama-server --override-kv tokenizer.ggml.bos_token_id=int:151643 $other_args
+```
+the pauses are gone.
+
+If I run the Unsloth model in mainline, it also shows the BOS token as 11, but for whatever reason it does not cause an issue there. Perhaps somewhere in `ik_llama.cpp` the `add_bos_token = false` is being ignored?
+
+---
+
+👤 **gapeleon** commented on **2025-07-23** at **07:47:29**
+
+Forgive me if this is already known; but the comma-pause bug seems to depend on what directly precedes the comma.
+
+You can see this with the following prompt:
+
+```
+Repeat the following, verbatim:
+
+**Fast**:
+
+`punctuation + comma + space` eg:
+", ")
+ ;, test
+
+**Slow**:
+`letter + comma + space` eg:
+word,
+
+`number + comma + space` eg:
+1, 2,
+
+`Newline followed by comma` :
+,
+
+```
+
+I haven't checked the different token ids yet.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **07:51:31**
+
+This is not known, at least not known by me.
+
+But do you still see the pauses if you add `--override-kv tokenizer.ggml.bos_token_id=int:151643` to your command line?
+
+---
+
+👤 **gapeleon** commented on **2025-07-23** at **08:06:22**
+
+I just tested, no more pauses after adding that 👍
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **08:29:47**
+
+OK, I found the culprit. It is the change in warm up that was added in [#198](https://github.com/ikawrakow/ik_llama.cpp/issues/198).
+
+Basically, to improve the experience for MoE models, [#198](https://github.com/ikawrakow/ik_llama.cpp/issues/198) changed the warm up to use all experts. But the warm up is being detected with this line of code
+```
+bool is_warming_up = (batch.n_tokens == 1 && (batch.token[0] == ((bos != -1) ? bos : eos)));
+```
+which works out to `true` when we have a comma and bos token is set to `,`. So, each time there is a comma, all 128 experts get used, which makes the calculation basically 16 times slower.
+
+This is not done in mainline, so the issue does not exist even when the BOS token is set to be a comma.
+
+@saood06 Do you want to fix it, or should I?
+
+---
+
+👤 **saood06** commented on **2025-07-23** at **08:56:37**
+
+>@saood06 Do you want to fix it, or should I?
+
+I can add support for the llama_add_bos_token_impl to the warmup code (mainline warmup code that they merged in is very different it is post refactor).
+
+Did the models you quantize have `tokenizer.ggml.add_bos_token = false`?
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **09:01:03**
+
+> Did the models you quantize have tokenizer.ggml.add_bos_token = false
+
+Yes. But the BOS token id was set to 151643 instead of 11. I see the timestamp of the `bf16` GGUF to be April 29, but I do not remember if I used the `ik_llama.cpp` or the `llama.cpp` `convert_hf_to_gguf.py` script to create it from the safetensors.
+
+---
+
+👤 **samteezy** commented on **2025-07-23** at **09:09:07**
+
+> @samteezy The Vulkan backend does not have an implementation for the `-fmoe` command line argument. The result is that the fused `ffn_up+ffn_gate` op that is enabled by `-fmoe` will be run on the CPU. I'm also wondering why you use `-ngl 0`. My guess is that `-ngl 100 -ot exps=CPU` without the `-fmoe` will produce a better performance with Vulkan.
+
+Yeah, I realized that after I posted - I was just mucking about with various settings to see how performance changed. Normally on mainline I'm offloading experts to CPU as needed and leave ngl set to 99. I hadn't quite understood what -fmoe did yet, thanks for the detail.
+
+---
+
+👤 **saood06** commented on **2025-07-23** at **09:15:38**
+
+I just checked [this](https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ2_KS?show_file_info=IQ2_KS%2FQwen3-480B-A35B-Instruct-IQ2_KS-00001-of-00004.gguf) which also has it false. So at least the ones being reported should be covered.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **09:59:00**
+
+This should be fixed on latest main (after merging [#639](https://github.com/ikawrakow/ik_llama.cpp/issues/639)). You should not need to override the BOS token ID.
+
+But if there are still pauses, let me know.
\ No newline at end of file
diff --git a/github-data/issues/464 - Bug_ The streaming every couple of rows blocks for 5-8s.md b/github-data/issues/464 - Bug_ The streaming every couple of rows blocks for 5-8s.md
deleted file mode 100644
index fb4000a8c..000000000
--- a/github-data/issues/464 - Bug_ The streaming every couple of rows blocks for 5-8s.md
+++ /dev/null
@@ -1,486 +0,0 @@
-### 🐛 [#464](https://github.com/ikawrakow/ik_llama.cpp/issues/464) - Bug: The streaming every couple of rows blocks for 5-8s
-
-| **Author** | `ciprianveg` |
-| :--- | :--- |
-| **State** | ✅ **Open** |
-| **Created** | 2025-05-27 |
-| **Updated** | 2025-06-01 |
-
----
-
-#### Description
-
-### What happened?
-
-Although I obtained good sweep-bench results for 235b UD_Q5_XL as shown below, and with the q4 quant they were 20% faster, in both cases, this annoying blocking happens every couple of rows. I tried changing from 16 threads to 12, but same thing happens. Wilth main llama, is like 25% slower, but is cursive.
-My system is a TR 3955wx with 16 cores, 256 ddr4 3200, 2x3090..
-Any ideas?
-./build/bin/llama-sweep-bench --model /home/ciprian/ai/models/Qwen3-235B-UD_Q5_XL/Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf --alias Qwen3-235B-A22B-UD-Q5_K_XL -fa -fmoe -ctk q8_0 -ctv q8_0 -c 40960 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 -ot "blk\.[0-9]\.ffn_up_exps=CUDA0,blk\.[0-9]\.ffn_gate_exps=CUDA0,blk\.2[0-4]\.ffn_up_exps=CUDA0,blk\.2[0-4]\.ffn_gate_exps=CUDA0,blk\.1[0-9]\.ffn_up_exps=CUDA1,blk\.1[0-9]\.ffn_gate_exps=CUDA1,blk\.2[5-8]\.ffn_up_exps=CUDA1,blk\.2[5-8]\.ffn_gate_exps=CUDA1,exps=CPU" -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 --ubatch-size 4096 --batch-size 4096
-| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
-|-------|--------|--------|----------|----------|----------|----------|
-| 4096 | 1024 | 0 | 11.730 | 349.19 | 133.500 | 7.67 |
-| 4096 | 1024 | 4096 | 12.079 | 339.11 | 136.944 | 7.48 |
-| 4096 | 1024 | 8192 | 12.514 | 327.33 | 140.286 | 7.30 |
-| 4096 | 1024 | 12288 | 13.038 | 314.17 | 144.478 | 7.09 |
-| 4096 | 1024 | 16384 | 13.545 | 302.40 | 148.595 | 6.89 |
-| 4096 | 1024 | 20480 | 13.943 | 293.76 | 151.881 | 6.74 |
-| 4096 | 1024 | 24576 | 14.767 | 277.38 | 154.643 | 6.62 |
-| 4096 | 1024 | 28672 | 15.621 | 262.21 | 158.355 | 6.47 |
-| 4096 | 1024 | 32768 | 16.561 | 247.32 | 161.875 | 6.33 |
-| 4096 | 1024 | 36864 | 17.658 | 231.97 | 166.160 | 6.16 |
-
-### Name and Version
-
-llama-server -model /home/ciprian/ai/models/Qwen3-235B-UD_Q5_XL/Qwen3-235B-A22B-UD-Q5_K_XL-00001-of-00004.gguf --alias Qwen3-235B-A22B-UD-Q5_K_XL -fa -fmoe -ctk q8_0 -ctv q8_0 -c 40960 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 -ot "blk\.[0-9]\.ffn_up_exps=CUDA0,blk\.[0-9]\.ffn_gate_exps=CUDA0,blk\.2[0-4]\.ffn_up_exps=CUDA0,blk\.2[0-4]\.ffn_gate_exps=CUDA0,blk\.1[0-9]\.ffn_up_exps=CUDA1,blk\.1[0-9]\.ffn_gate_exps=CUDA1,blk\.2[5-8]\.ffn_up_exps=CUDA1,blk\.2[5-8]\.ffn_gate_exps=CUDA1,exps=CPU" -ngl 99 --threads 16 --host 0.0.0.0 --port 5002 --ubatch-size 4096 --batch-size 4096
-
-### What operating system are you seeing the problem on?
-
-_No response_
-
-### Relevant log output
-
-```shell
-
-```
-
----
-
-#### 💬 Conversation
-
-👤 **ikawrakow** commented the **2025-05-28** at **05:17:25**:
-
-Not sure. Do you get many tokens at once after the 5-8 seconds pause, or it just did nothing for 5-8 seconds?
-
----
-
-👤 **ciprianveg** commented the **2025-05-28** at **06:19:29**:
-
-It looks like it did nothing, sometimes a second 5-8 s pause comes after
-just 2 words, other times after 2 rows of text. I tried also with 2048
-ubatch size and with using amb 512, no difference. For my hardware, what
-would be the most suitable build params. I am now setting gpu ggml sched
-copies to 1, cublast off and ggml cuda on
-
-On Wed, 28 May 2025, 08:17 Kawrakow, ***@***.***> wrote:
-
-> *ikawrakow* left a comment (ikawrakow/ik_llama.cpp#464)
->
->
-> Not sure. Do you get many tokens at once after the 5-8 seconds pause, or
-> it just did nothing for 5-8 seconds?
->
-> —
-> Reply to this email directly, view it on GitHub
-> ,
-> or unsubscribe
->
-> .
-> You are receiving this because you authored the thread.Message ID:
-> ***@***.***>
->
-
----
-
-👤 **ikawrakow** commented the **2025-05-28** at **07:56:05**:
-
-I'm trying to understand the root cause for this strange behavior. Can you reproduce it using `llama-cli` ?
-
----
-
-👤 **ciprianveg** commented the **2025-05-28** at **10:01:30**:
-
-I will try this evening and let you know
-
----
-
-👤 **ciprianveg** commented the **2025-05-28** at **13:06:09**:
-
-Something that maybe can give a clue is that my system is cpu limited, i have 8 channels ddr4 3200 ram but the memory read speed is limited to 85Mb/s instead of the theoretical >200Mb/s because the 16 cores are not enough to read at that speed. This is against the standard cpu systems where memory speed is the limiter, not the cpu..
-
----
-
-👤 **ciprianveg** commented the **2025-05-28** at **16:52:25**:
-
-same issue also with llama-cli
-
----
-
-👤 **ikawrakow** commented the **2025-05-28** at **17:22:10**:
-
-Is there disc activity during the pause? Have you looked at process activity during the pause? Are you running llama.cpp with the exact same parameters (apart from -fmoe)? Is there another memory hungry process running (e.g., another llama.cpp server)?
-
----
-
-👤 **ciprianveg** commented the **2025-05-28** at **17:27:38**:
-
-Llama.cpp runs with exact params except fmoe. I have 256Gb ram and almost
-100gb free. No other memory hungry process..
-
-On Wed, 28 May 2025, 20:22 Kawrakow, ***@***.***> wrote:
-
-> *ikawrakow* left a comment (ikawrakow/ik_llama.cpp#464)
->
->
-> Is there disc activity during the pause? Have you looked at process
-> activity during the pause? Are you running llama.cpp with the exact same
-> parameters (apart from -fmoe)? Is there another memory hungry process
-> running (e.g., another llama.cpp server)?
->
-> —
-> Reply to this email directly, view it on GitHub
-> ,
-> or unsubscribe
->
-> .
-> You are receiving this because you authored the thread.Message ID:
-> ***@***.***>
->
-
----
-
-👤 **ikawrakow** commented the **2025-05-29** at **04:19:32**:
-
-What about the first two questions? Is the CPU busy during the pauses or just sitting there doing nothing? But at the end it might be easier to just run in the debugger and when it pauses, hit Ctrl-C, type `bt`, and post the backtrace here.
-
----
-
-👤 **ciprianveg** commented the **2025-05-29** at **05:35:10**:
-
-1. Disk activity, no
-2. Top shows llama server between 100-500% when it works and same when it pauses
-
----
-
-👤 **kirnat** commented the **2025-05-29** at **09:12:01**:
-
-Check your PCIe traffic with nvtop or similar when the pause happens. Does it happen if you don't offload any experts to the GPUs?
-
----
-
-👤 **ikawrakow** commented the **2025-05-29** at **09:31:47**:
-
-To test the hypothesis that it gets stuck on copying tensors to the GPU, you can run with `-op 26,0,27,0,29,0`. This disables offloading tensors to the GPU for any type of matrix multiplication.
-
-But running in the debugger, interrupting with Ctrl-C when it gets stuck, and sending the backtrace will hopefully also diagnose where (in which function) it hangs for so long.
-
----
-
-👤 **ciprianveg** commented the **2025-05-29** at **09:44:38**:
-
-XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT to OFF
-XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
-XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
-XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MUL_MAT) = 0
-XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MUL_MAT_ID) = 0
-XXXXXXXXXXXXXXXXXXXXXXXXXXXX offload(MOE_FUSED_UP_GATE) = 0
-
-same issue
-
----
-
-👤 **ciprianveg** commented the **2025-05-29** at **09:54:13**:
-
-Thread 1 "llama-server" received signal SIGINT, Interrupt.
-Download failed: Invalid argument. Continuing without source file ./nptl/./nptl/pthread_mutex_lock.c.
-0x00007fffee4a014c in lll_mutex_lock_optimized (mutex=0x55555899a0d8) at ./nptl/pthread_mutex_lock.c:48
-warning: 48 ./nptl/pthread_mutex_lock.c: No such file or directory
-
-this is from debug
-
-also, with nvtop, when pause happens, the gpus transfer speed is around 1,8GB/s and as soon as it unblocks drops to 50-100MB/s
-
----
-
-👤 **ciprianveg** commented the **2025-05-29** at **09:54:13**:
-
-Thread 1 "llama-server" received signal SIGINT, Interrupt.
-Download failed: Invalid argument. Continuing without source file ./nptl/./nptl/pthread_mutex_lock.c.
-0x00007fffee4a014c in lll_mutex_lock_optimized (mutex=0x55555899a0d8) at ./nptl/pthread_mutex_lock.c:48
-warning: 48 ./nptl/pthread_mutex_lock.c: No such file or directory
-
----
-
-👤 **ciprianveg** commented the **2025-05-29** at **13:02:59**:
-
-it happened also with ngl 0, with nothing sent to gpus, only that being slower, like 2-3tok/s also the pause was longer, cca 20s
-
-llama-server --model /home/ciprian/ai/models/Qwen3-235B-UD_Q4_XL/Qwen3-235B-A22B-UD-Q4_K_XL-00001-of-00003.gguf --alias Qwen3-235B-A22B-UD-Q4_K_XL -fa -ctk q8_0 -ctv q8_0 -c 36864 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 0.5 -ngl 0 --threads 16 --host 0.0.0.0 --port 5002 --ubatch-size 4096 --batch-size 4096
-
----
-
-👤 **ikawrakow** commented the **2025-05-29** at **13:22:24**:
-
-If you want to test if the pauses happen when running CPU only, you need to say `CUDA_VISIBLE_DEVICES="" ./bin/llama-server...`. Or just make a build with CUDA disabled.
-
-The debug session above was not useful as the main thread is the server thread, so we don't see where the computation hangs. To get the desired backtrace you need to run `llama-cli`.
-
-> the gpus transfer speed is around 1,8GB/s and as soon as it unblocks drops to 50-100MB/s
-
-Isn't this kind of slow? But even at that rate in 5 seconds it will transfer ~9 GB to the GPU. A `Q5_K` quantized Qwen3-235-A22B layer is in the range of 1.8 GB, so it is transferring 5 layers worth of tensors?
-
-Or is this all happening when your context gets full?
-
----
-
-👤 **ciprianveg** commented the **2025-05-29** at **13:59:44**:
-
-debug on llama-cli ctrl+c when paused i don't think is helpful:
-Thread 1 "llama-cli" received signal SIGINT, Interrupt.
-0x00007fffe5391028 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-(gdb)
-
----
-
-👤 **ikawrakow** commented the **2025-05-29** at **15:42:57**:
-
-I guess you need
-```
-thread apply all bt
-```
-
----
-
-👤 **ciprianveg** commented the **2025-05-29** at **17:40:41**:
-
-Hi @ikawrakow:
-0x00007fffe5391024 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-(gdb) thread apply all bt
-
-Thread 21 (Thread 0x7fff647db000 (LWP 18073) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 20 (Thread 0x7fff64fdc000 (LWP 18072) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 19 (Thread 0x7fff657dd000 (LWP 18071) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
---Type for more, q to quit, c to continue without paging--
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 18 (Thread 0x7fff65fde000 (LWP 18070) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 17 (Thread 0x7fff667df000 (LWP 18069) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:6--Type for more, q to quit, c to continue without paging--
-0
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 16 (Thread 0x7fff66fe0000 (LWP 18068) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 15 (Thread 0x7fff677e1000 (LWP 18067) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
---Type for more, q to quit, c to continue without paging--
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 14 (Thread 0x7fff67fe2000 (LWP 18066) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 13 (Thread 0x7fff687e3000 (LWP 18065) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 12 (Thread 0x7fff68fe4000 (LWP 18064) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:6--Type for more, q to quit, c to continue without paging--
-0
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 11 (Thread 0x7fff697e5000 (LWP 18063) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 10 (Thread 0x7fff69fe6000 (LWP 18062) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 9 (Thread 0x7fff6a7e7000 (LWP 18061) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
---Type for more, q to quit, c to continue without paging--
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 8 (Thread 0x7fff6afe8000 (LWP 18060) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 7 (Thread 0x7fff6b7e9000 (LWP 18059) "llama-cli"):
-#0 futex_wait (addr=0x55555f628314, val=60160) at ../../../src/libgomp/config/linux/x86/futex.h:97
-#1 do_wait (addr=, val=60160) at ../../../src/libgomp/config/linux/wait.h:67
-#2 gomp_barrier_wait_end (bar=0x55555f628310, state=60160) at ../../../src/libgomp/config/linux/bar.c:48
-#3 0x00007ffff7c87779 in gomp_simple_barrier_wait (bar=) at ../../../src/libgomp/config/posix/simple-bar.h:60
-#4 gomp_thread_start (xdata=) at ../../../src/libgomp/team.c:133
-#5 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#6 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
---Type for more, q to quit, c to continue without paging--
-
-Thread 6 (Thread 0x7fffa0afa000 (LWP 18018) "llama-cli"):
-#0 0x00007fffee498d71 in __futex_abstimed_wait_common64 (private=32767, cancel=true, abstime=0x7fffa0ad6800, op=393, expected=0, futex_word=0x555555cccca0) at ./nptl/futex-internal.c:57
-#1 __futex_abstimed_wait_common (cancel=true, private=32767, abstime=0x7fffa0ad6800, clockid=0, expected=0, futex_word=0x555555cccca0) at ./nptl/futex-internal.c:87
-#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555555cccca0, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7fffa0ad6800, private=private@entry=0) at ./nptl/futex-internal.c:139
-#3 0x00007fffee49bc8e in __pthread_cond_wait_common (abstime=0x7fffa0ad6800, clockid=0, mutex=0x555555cc7d30, cond=0x555555cccc78) at ./nptl/pthread_cond_wait.c:503
-#4 ___pthread_cond_timedwait64 (cond=0x555555cccc78, mutex=0x555555cc7d30, abstime=0x7fffa0ad6800) at ./nptl/pthread_cond_wait.c:652
-#5 0x00007fffe53cadfa in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#6 0x00007fffe546e143 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#7 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#8 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 5 (Thread 0x7fffa231c000 (LWP 18017) "cuda-EvtHandlr"):
-#0 0x00007fffee51b4cd in __GI___poll (fds=0x7fff70000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
-#1 0x00007fffe547644f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#2 0x00007fffe553a80f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#3 0x00007fffe546e143 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#4 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#5 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 4 (Thread 0x7fffa2d1d000 (LWP 18016) "llama-cli"):
-#0 0x00007fffee498d71 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x7fffa2cf9800, op=393, expected=0--Type for more, q to quit, c to continue without paging--
-, futex_word=0x555555d20600) at ./nptl/futex-internal.c:57
-#1 __futex_abstimed_wait_common (cancel=true, private=0, abstime=0x7fffa2cf9800, clockid=0, expected=0, futex_word=0x555555d20600) at ./nptl/futex-internal.c:87
-#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555555d20600, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x7fffa2cf9800, private=private@entry=0) at ./nptl/futex-internal.c:139
-#3 0x00007fffee49bc8e in __pthread_cond_wait_common (abstime=0x7fffa2cf9800, clockid=0, mutex=0x555555cd1320, cond=0x555555d205d8) at ./nptl/pthread_cond_wait.c:503
-#4 ___pthread_cond_timedwait64 (cond=0x555555d205d8, mutex=0x555555cd1320, abstime=0x7fffa2cf9800) at ./nptl/pthread_cond_wait.c:652
-#5 0x00007fffe53cadfa in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#6 0x00007fffe546e143 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#7 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#8 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 3 (Thread 0x7fffa453f000 (LWP 18015) "cuda-EvtHandlr"):
-#0 0x00007fffee51b4cd in __GI___poll (fds=0x7fff88000c20, nfds=10, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
-#1 0x00007fffe547644f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#2 0x00007fffe553a80f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#3 0x00007fffe546e143 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#4 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#5 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 2 (Thread 0x7fffb2dff000 (LWP 18008) "cuda00001400006"):
-#0 0x00007fffee51b4cd in __GI___poll (fds=0x555555cd4240, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
-#1 0x00007fffe547644f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#2 0x00007fffe553a80f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#3 0x00007fffe546e143 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
---Type for more, q to quit, c to continue without paging--
-#4 0x00007fffee49caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
-#5 0x00007fffee529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
-
-Thread 1 (Thread 0x7ffff7c4d000 (LWP 18005) "llama-cli"):
-#0 0x00007fffe5391024 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#1 0x00007fffe543328a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#2 0x00007fffe5583eae in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#3 0x00007fffe5585a4c in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#4 0x00007fffe56e29f9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#5 0x00007fffe5341556 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#6 0x00007fffe5341a70 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#7 0x00007fffe5342407 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#8 0x00007fffe54ebfe9 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
-#9 0x00007fffee0481a9 in ?? () from /usr/local/cuda-12.8/lib64/libcudart.so.12
-#10 0x00007fffee017058 in ?? () from /usr/local/cuda-12.8/lib64/libcudart.so.12
-#11 0x00007fffee07693c in cudaMemcpyAsync () from /usr/local/cuda-12.8/lib64/libcudart.so.12
-#12 0x00007fffeee271e5 in ggml_backend_cuda_buffer_set_tensor(ggml_backend_buffer*, ggml_tensor*, void const*, unsigned long, unsigned long) () from /home/ciprian/ai/ik_llama.cpp/build/ggml/src/libggml.so
-#13 0x00007fffeecc0bfc in ggml_backend_sched_graph_compute_async () from /home/ciprian/ai/ik_llama.cpp/build/ggml/src/libggml.so
-#14 0x00007ffff7e8e522 in llama_decode () from /home/ciprian/ai/ik_llama.cpp/build/src/libllama.so
-#15 0x0000555555573b55 in main ()
-
----
-
-👤 **ikawrakow** commented the **2025-05-30** at **06:41:25**:
-
-OK, so we see it being stuck on a call to `cudaMemcpyAsync` copying data from the host to the GPU. No idea why. Or why the transfer rate is just 1.8 GB/s.
-
----
-
-👤 **ciprianveg** commented the **2025-05-30** at **18:28:15**:
-
-Strange, with deepseek i2k from ubergarm it works perfectly..
-
----
-
-👤 **ikawrakow** commented the **2025-05-31** at **05:31:40**:
-
-Thanks for the update.
-
-I really don't know what could be causing the pauses and, unlike the illegal memory access bug, nobody else has reported a similar problem.
-
----
-
-👤 **pt13762104** commented the **2025-05-31** at **11:33:44**:
-
-I also found this problem on my Pc with Qwen3 30B Q4_K_XL. It just stops for a few seconds, then it might be slow or not... unlike llama.cpp.
-
----
-
-👤 **ciprianveg** commented the **2025-06-01** at **18:03:32**:
-
-Another feedback: i tried the 235b iq3 quant done by @ubergarm and it works fine. Maybe the issue is caused by the unsloth UD XL q3, q4 and q6 quants
\ No newline at end of file
diff --git a/github-data/issues/467 - Bug_ Server does not send data_ _DONE_ for OpenAI-compatible streaming .md b/github-data/issues/467 - Bug Server does not send data DONE for OpenAI-compatible streaming endpoint v1ch.md
similarity index 91%
rename from github-data/issues/467 - Bug_ Server does not send data_ _DONE_ for OpenAI-compatible streaming .md
rename to github-data/issues/467 - Bug Server does not send data DONE for OpenAI-compatible streaming endpoint v1ch.md
index 97b4062ac..fbf12d464 100644
--- a/github-data/issues/467 - Bug_ Server does not send data_ _DONE_ for OpenAI-compatible streaming .md
+++ b/github-data/issues/467 - Bug Server does not send data DONE for OpenAI-compatible streaming endpoint v1ch.md
@@ -1,4 +1,4 @@
-### 🐛 [#467](https://github.com/ikawrakow/ik_llama.cpp/issues/467) - Bug: Server does not send data: [DONE] for OpenAI-compatible streaming endpoint `/v1/chat/completions`
+## 📌 [Issue #467](https://github.com/ikawrakow/ik_llama.cpp/issues/467) - Bug: Server does not send data: [DONE] for OpenAI-compatible streaming endpoint `/v1/chat/completions`
| **Author** | `cyril23` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### Description
@@ -126,9 +126,9 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **cyril23** commented the **2025-05-28** at **10:58:55**:
+👤 **cyril23** commented on **2025-05-28** at **10:58:55**
(with the help of AI ..) I've made a direct modification to the `handle_chat_completions` function in `examples/server/server.cpp` to force the server to send `data: [DONE]\n\n` at the end of a successful stream.
@@ -176,7 +176,7 @@ index 360f571e..c5465846 100644
---
-👤 **ikawrakow** commented the **2025-05-28** at **11:30:14**:
+👤 **ikawrakow** commented on **2025-05-28** at **11:30:14**
@cyril23
@@ -184,7 +184,7 @@ I can try to make a proper PR, but I'm old school and never use such fancy stuff
---
-👤 **cyril23** commented the **2025-05-28** at **13:56:52**:
+👤 **cyril23** commented on **2025-05-28** at **13:56:52**
> I can try to make a proper PR, but I'm old school and never use such fancy stuff. Are you willing to test?
@@ -192,26 +192,26 @@ Sure, I'll test it
---
-👤 **ikawrakow** commented the **2025-05-31** at **05:33:17**:
+👤 **ikawrakow** commented on **2025-05-31** at **05:33:17**
-PR #470 is waiting to be tested.
+PR [#470](https://github.com/ikawrakow/ik_llama.cpp/issues/470) is waiting to be tested.
---
-👤 **cyril23** commented the **2025-06-04** at **06:40:24**:
+👤 **cyril23** commented on **2025-06-04** at **06:40:24**
-> PR [#470](https://github.com/ikawrakow/ik_llama.cpp/pull/470) is waiting to be tested.
+> PR [[#470](https://github.com/ikawrakow/ik_llama.cpp/issues/470)](https://github.com/ikawrakow/ik_llama.cpp/pull/470) is waiting to be tested.
I've tested it successfully in https://github.com/ikawrakow/ik_llama.cpp/pull/470#issuecomment-2938782085, but I'm the wrong guy to review the code
---
-👤 **voipmonitor** commented the **2025-06-17** at **07:03:08**:
+👤 **voipmonitor** commented on **2025-06-17** at **07:03:08**
I have tested it too and it works.
---
-👤 **ikawrakow** commented the **2025-06-17** at **07:34:12**:
+👤 **ikawrakow** commented on **2025-06-17** at **07:34:12**
-Closed via #470
\ No newline at end of file
+Closed via [#470](https://github.com/ikawrakow/ik_llama.cpp/issues/470)
\ No newline at end of file
diff --git a/github-data/issues/472 - Bug_ Don_t build ggml-aarch64 regardless of CPU arch type.md b/github-data/issues/472 - Bug Dont build ggml-aarch64 regardless of CPU arch type.md
similarity index 86%
rename from github-data/issues/472 - Bug_ Don_t build ggml-aarch64 regardless of CPU arch type.md
rename to github-data/issues/472 - Bug Dont build ggml-aarch64 regardless of CPU arch type.md
index aaefe4d14..163e70f67 100644
--- a/github-data/issues/472 - Bug_ Don_t build ggml-aarch64 regardless of CPU arch type.md
+++ b/github-data/issues/472 - Bug Dont build ggml-aarch64 regardless of CPU arch type.md
@@ -1,4 +1,4 @@
-### 🐛 [#472](https://github.com/ikawrakow/ik_llama.cpp/issues/472) - Bug: Don't build ggml-aarch64 regardless of CPU arch type
+## 📌 [Issue #472](https://github.com/ikawrakow/ik_llama.cpp/issues/472) - Bug: Don't build ggml-aarch64 regardless of CPU arch type
| **Author** | `FullstackSensei` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -34,9 +34,9 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-30** at **06:11:12**:
+👤 **ikawrakow** commented on **2025-05-30** at **06:11:12**
Really? Then I guess you need to file a bug report with your compiler vendor. Here is what I see
```
@@ -55,7 +55,7 @@ Is it possible you think it takes a long time because you see it being compiled
---
-👤 **Ph0rk0z** commented the **2025-05-30** at **11:45:48**:
+👤 **Ph0rk0z** commented on **2025-05-30** at **11:45:48**
His compiler isn't broken. I saw this same behavior and though to post about it but just accepted it. The aarch64 is added to cmakelists for everything and some of the quants require symbols from it. I tried to remove it and server wouldn't run due to missing symbols. It is an include in I think ggml.c and those iqk files. I see those already compile and then it sticks on ggml-aarch64.
@@ -63,7 +63,7 @@ It could be a visual bug as you say, but then are those iqk files working on aar
---
-👤 **ikawrakow** commented the **2025-05-30** at **12:27:35**:
+👤 **ikawrakow** commented on **2025-05-30** at **12:27:35**
> It could be a visual bug as you say, but then are those iqk files working on aarch64 specific quant functions?
@@ -75,19 +75,19 @@ In principle I could remove this file, but I find it handy for benchmarking my `
---
-👤 **Ph0rk0z** commented the **2025-05-30** at **14:13:27**:
+👤 **Ph0rk0z** commented on **2025-05-30** at **14:13:27**
When I took it out, it did seem to go much faster and those Q4_0_4_4/Q4_0_8_8 functions popped up warnings. I compile for all cache quantizations too with like -j 90. There are points where it just sits on very little CPU usage for quite a while and this is one that comes up. No clue what it's doing during that time.
---
-👤 **ikawrakow** commented the **2025-05-30** at **15:09:23**:
+👤 **ikawrakow** commented on **2025-05-30** at **15:09:23**
https://github.com/user-attachments/assets/da575fd8-ba9e-41c6-bbb9-658672b47b78
---
-👤 **FullstackSensei** commented the **2025-05-30** at **20:47:54**:
+👤 **FullstackSensei** commented on **2025-05-30** at **20:47:54**
The underlying issue is that building ik_llama.cpp takes ~2x (or more?) the time it takes to build llama.cpp on the same machine with the same build options. I was trying to help find the underlying issue since it does seem to stall at ggml-aarch64 with very low CPU utilization. I genuinely don't care whether there's an ARM build also tucked in there. The issue is the long build times which make updating ik_llama.cpp or testing branches/forks a lot more painful than it needs to be.
@@ -97,12 +97,12 @@ I'm no expert in cmake, but if there's anything I can do to help diagnose the is
---
-👤 **ikawrakow** commented the **2025-05-31** at **05:18:47**:
+👤 **ikawrakow** commented on **2025-05-31** at **05:18:47**
> The underlying issue is that building ik_llama.cpp takes ~2x (or more?) the time it takes to build llama.cpp on the same machine with the same build options.
There are 2 main contributing factors to the longer build times:
-* The matrix multiplication and flash attention kernels that I have added in `ik_llama.cpp`. These are ~18 kLOC of heavily templated C++ code, so take a while to compile. Prior to PR #435 they used to be in a single file that took 2.5 minutes to compile on my CPU. It shouldn't be so bad after #435, but they still do they a while (~20 seconds on my CPU). No progress can be made in the build process until these have been compiled and linked as they are part of the `ggml` library that everything depends on.
+* The matrix multiplication and flash attention kernels that I have added in `ik_llama.cpp`. These are ~18 kLOC of heavily templated C++ code, so take a while to compile. Prior to PR [#435](https://github.com/ikawrakow/ik_llama.cpp/issues/435) they used to be in a single file that took 2.5 minutes to compile on my CPU. It shouldn't be so bad after [#435](https://github.com/ikawrakow/ik_llama.cpp/issues/435), but they still do they a while (~20 seconds on my CPU). No progress can be made in the build process until these have been compiled and linked as they are part of the `ggml` library that everything depends on.
* Compiling `llama.cpp` (a ~23 kLOC C++ source file). This takes ~50 seconds on my CPU. In mainline `llama.cpp` they have refactored their former `llama.cpp` source file into multiple files, which allows this part to be done in parallel. I know I should do something similar here, just haven't come around to do it.
I just measured how long it takes to build `ik_llama.cpp` and `llama.cpp` from scratch with `ccache` disabled and without CUDA (the CUDA code is in a league of its own here and in mainline). Result:
@@ -113,7 +113,7 @@ So, excluding the 50 seconds taken by `llama.cpp` compilation, the remainder in
---
-👤 **saood06** commented the **2025-05-31** at **23:08:49**:
+👤 **saood06** commented on **2025-05-31** at **23:08:49**
> The file name is of course misleading. `ggml-aarch64.c` does not contain only `__aarch64__` specific code. In this fork it contains `ARM_NEON` implementation for the `Q4_0_4_4` and `Q4_0_8_8` quants, plus scalar implementation for these for other platforms.
>
diff --git a/github-data/issues/474 - Bug_ Perf Regression in PP throughput after Pull _461 _...R4 CUDA impl_.md b/github-data/issues/474 - Bug Perf Regression in PP throughput after Pull 461 ...R4 CUDA impl.md
similarity index 88%
rename from github-data/issues/474 - Bug_ Perf Regression in PP throughput after Pull _461 _...R4 CUDA impl_.md
rename to github-data/issues/474 - Bug Perf Regression in PP throughput after Pull 461 ...R4 CUDA impl.md
index ed4a8d271..da86d1628 100644
--- a/github-data/issues/474 - Bug_ Perf Regression in PP throughput after Pull _461 _...R4 CUDA impl_.md
+++ b/github-data/issues/474 - Bug Perf Regression in PP throughput after Pull 461 ...R4 CUDA impl.md
@@ -1,4 +1,4 @@
-### 🐛 [#474](https://github.com/ikawrakow/ik_llama.cpp/issues/474) - Bug: Perf Regression in PP throughput after Pull [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461) (...R4 CUDA impl)
+## 📌 [Issue #474](https://github.com/ikawrakow/ik_llama.cpp/issues/474) - Bug: Perf Regression in PP throughput after Pull [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461) (...R4 CUDA impl)
| **Author** | `usrlocalben` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -16,7 +16,7 @@ While testing out an IQ4 quant of R1-0528 I noticed that PP throughput on my sys
I compare with an all Q8_0 quant and see what I expect, PP >50/sec (on main/HEAD today.)
-I bisected, and found that this problem was introduced with Pull #461 (commit 1429291).
+I bisected, and found that this problem was introduced with Pull [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461) (commit 1429291).
However, my IQ4 quant **doesn't have any _R4 tensors**. It's Q8 shared, and IQ4_K for the remaining tensors.
@@ -26,7 +26,7 @@ CUDA device is RTX 8000 (Turing)
I glance over the commit and mostly see changes that seem clearly restricted to _R4 suffix components. There are some shared parts where _n_interleaved_ is propagated down the template stack (iqk_mmvq.cu) but at a casual glance nothing strikes me as odd, but I'm certainly not that familiar with it. The dot product interface changed to a mutating one taking an accumulator pointer (previously returning the computed result) and that could be curious.
-aside, but maybe related -- there were recent PRs related to mla/fa that had some vague language wrt. Turing support. (Pulls #386 and #408 ) I say vague because 386 indicates turing is not supported, then 408 indicates that it is extended to Turing, but I'm not sure they're referring to the same thing, and the changes in 408 don't seem very significant. It's not clear what the proper mla/fa settings should be on Turing at this time. I currently use `-mla 2 -fa`
+aside, but maybe related -- there were recent PRs related to mla/fa that had some vague language wrt. Turing support. (Pulls [#386](https://github.com/ikawrakow/ik_llama.cpp/issues/386) and [#408](https://github.com/ikawrakow/ik_llama.cpp/issues/408) ) I say vague because 386 indicates turing is not supported, then 408 indicates that it is extended to Turing, but I'm not sure they're referring to the same thing, and the changes in 408 don't seem very significant. It's not clear what the proper mla/fa settings should be on Turing at this time. I currently use `-mla 2 -fa`
### What operating system are you seeing the problem on?
@@ -35,25 +35,25 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-30** at **07:48:21**:
+👤 **ikawrakow** commented on **2025-05-30** at **07:48:21**
> However, my IQ4 quant doesn't have any _R4 tensors. It's Q8 shared, and IQ4_K for the remaining tensors.
> Absence/presence of --run-time-repack doesn't cause nor avoid it.
-To make sure I understand correctly, prior to #461 you observed the same good PP performance irrespective of using or not using `--run-time-repack`. But after #461 you observe the same bad bad PP performance with or without `--run-time-repack` ?
+To make sure I understand correctly, prior to [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461) you observed the same good PP performance irrespective of using or not using `--run-time-repack`. But after [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461) you observe the same bad bad PP performance with or without `--run-time-repack` ?
---
-👤 **ikawrakow** commented the **2025-05-30** at **07:56:37**:
+👤 **ikawrakow** commented on **2025-05-30** at **07:56:37**
Please also provide your full command line. This really makes it easier to diagnose the problem.
---
-👤 **usrlocalben** commented the **2025-05-30** at **17:15:52**:
+👤 **usrlocalben** commented on **2025-05-30** at **17:15:52**
```
ik_llama.cpp/build/bin/llama-server
@@ -103,7 +103,7 @@ generation eval time = 163612.05 ms / 1437 runs ( 113.86 ms per token,
---
-👤 **ikawrakow** commented the **2025-05-31** at **04:35:34**:
+👤 **ikawrakow** commented on **2025-05-31** at **04:35:34**
**Observations**:
* rtr=no has the same performance on 14292913 and on 24c010b3. In both versions, when rtr=no tensors stored in RAM get offloaded to the GPU to perform the matrix multiplication.
@@ -117,7 +117,7 @@ Conclusion: PCE-E speed is very low, resulting in low PP performance when tensor
- `26,0` disables offloading matrix multiplications
- `27,0` disables offloading indirect matrix multiplications (used in MoE models)
- `29,0` disables offloading fused `ffn_up+ffn_gate` operations (you get these in MoE models when using `-fmoe`)
- * You may want to experiment with `-op` (`op` stands for offload policy, see PR #405)
+ * You may want to experiment with `-op` (`op` stands for offload policy, see PR [#405](https://github.com/ikawrakow/ik_llama.cpp/issues/405))
- `-op 29,0 -rtr` should result in the exact same performance as you had on 24c010b3
- If your PCI-E speed is so low as to give such bad performance with GPU offload enabled, adding `-op 27,0` to the above may improve performance compared to what you had on 24c010b3
@@ -125,7 +125,7 @@ Note that for most people not using `-op` and using large batches with `-b 4096
---
-👤 **usrlocalben** commented the **2025-06-01** at **23:41:22**:
+👤 **usrlocalben** commented on **2025-06-01** at **23:41:22**
@ikawrakow
Switching to b/ub=4096 indeed gives the perf that I observed prior to the CUDA _R4, or better. I've seen as high as 90+ t/s now. (And learned something new about how PP is implemented)
@@ -136,7 +136,7 @@ Additionally, it seems like the number of combinations of tensor/config/compile
---
-👤 **ikawrakow** commented the **2025-06-02** at **06:00:39**:
+👤 **ikawrakow** commented on **2025-06-02** at **06:00:39**
> I'm not sure what to do with the Issue. It seems like the commit changed behavior in a way that is orthogonal to its description--but maybe I was just ignorant of the batch-size implications and the previous impl let me get away with it.
@@ -148,7 +148,7 @@ I know. Writing simple and easy to follow instructions has never been one of my
---
-👤 **saood06** commented the **2025-06-02** at **07:36:45**:
+👤 **saood06** commented on **2025-06-02** at **07:36:45**
> I know. Writing simple and easy to follow instructions has never been one of my strengths. Models are different (there are big differences in optimum settings between dense and MoE models, and even for MoE models there are big differences between, say, DeepSeek and Maverick), users systems very between 100% GPU and 100% CPU, and anything in between, there are different quantization types with different tradeoffs, etc. Making it easy for the users would be the domain of product managers, marketing specialists, and technical support, none of which is present in a hobby project such as this one. Hence, it is basically up to the user base to come up with the cook book recipes. [@ubergarm](https://github.com/ubergarm) has done some of that [here](https://github.com/ikawrakow/ik_llama.cpp/discussions/258), but it is by no means complete (and things are moving and changing).
@@ -158,7 +158,7 @@ You do a really good job of providing a lot of info in your PRs but there is no
---
-👤 **ubergarm** commented the **2025-06-02** at **16:18:04**:
+👤 **ubergarm** commented on **2025-06-02** at **16:18:04**
> Additionally, it seems like the number of combinations of tensor/config/compile settings are quite numerous, and more so now after these changes. Is there a way to know what the optimal arrangement should be? e.g. IQ4_K for GPU-tensors, _R4 for cpu tensors, GGML_CUDA_IQK_FORCE_BF16=1 etc. ? Or is it all YMMV, tradeoffs between PP/TG perf, CUDA-arch etc?
@@ -185,6 +185,6 @@ Cheers!
---
-👤 **ikawrakow** commented the **2025-07-05** at **13:13:00**:
+👤 **ikawrakow** commented on **2025-07-05** at **13:13:00**
I think we can close this now.
\ No newline at end of file
diff --git a/github-data/issues/476 - Research_ performance divergence.md b/github-data/issues/476 - Research performance divergence.md
similarity index 90%
rename from github-data/issues/476 - Research_ performance divergence.md
rename to github-data/issues/476 - Research performance divergence.md
index 0d04baf23..c79d6faad 100644
--- a/github-data/issues/476 - Research_ performance divergence.md
+++ b/github-data/issues/476 - Research performance divergence.md
@@ -1,14 +1,14 @@
-### 📝 [#476](https://github.com/ikawrakow/ik_llama.cpp/issues/476) - Research: performance divergence
+## 📌 [Issue #476](https://github.com/ikawrakow/ik_llama.cpp/issues/476) - Research: performance divergence
| **Author** | `VinnyG9` |
| :--- | :--- |
-| **State** | ✅ **Open** |
+| **State** | ❌ **Closed** |
| **Created** | 2025-05-30 |
-| **Updated** | 2025-06-14 |
+| **Updated** | 2025-07-23 |
---
-#### Description
+## 📄 Description
### Research Stage
@@ -42,15 +42,15 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-31** at **05:22:24**:
+👤 **ikawrakow** commented on **2025-05-31** at **05:22:24**
-Please be specific in your issue. Provide quantization type used, system information, full command line to start the server and, ideally, last good/first bad commit where you observe the performance change. See #474 for an example.
+Please be specific in your issue. Provide quantization type used, system information, full command line to start the server and, ideally, last good/first bad commit where you observe the performance change. See [#474](https://github.com/ikawrakow/ik_llama.cpp/issues/474) for an example.
---
-👤 **VinnyG9** commented the **2025-05-31** at **10:18:09**:
+👤 **VinnyG9** commented on **2025-05-31** at **10:18:09**
I've been testing ik_llama.cpp for about a month mostly benchmarks can't report any regression
running bare build time flags(NATIVE=1, CUDA=1, CUDA_ARCH) and runtime flags(rtr, fa, fmoe, numa)
@@ -61,7 +61,7 @@ no matter the model i try, dense, MoE etc i get less than 50% performance than t
---
-👤 **ikawrakow** commented the **2025-05-31** at **14:28:49**:
+👤 **ikawrakow** commented on **2025-05-31** at **14:28:49**
So, the issue is that the performance you observe when running `llama-server` is 2X lower than the performance you observe when running one of the benchmark tools?
@@ -71,7 +71,7 @@ Generic statement will lead nowhere (other than the issue getting closed)
---
-👤 **VinnyG9** commented the **2025-06-01** at **03:16:25**:
+👤 **VinnyG9** commented on **2025-06-01** at **03:16:25**
> So, the issue is that the performance you observe when running `llama-server` is 2X lower than the performance you observe when running one of the benchmark tools?
@@ -84,7 +84,7 @@ both, literally a bit less than half PP/TG. think it could be a numa issue? i tr
---
-👤 **saood06** commented the **2025-06-01** at **03:26:18**:
+👤 **saood06** commented on **2025-06-01** at **03:26:18**
> both, literally a bit less than half PP/TG. think it could be a numa issue? i tried with stock bios settings but got worse results albeit closer bench/serve numbers
@@ -94,7 +94,7 @@ I brought `llama-sweep-bench` over to this repo and use it regularly because in
---
-👤 **Ph0rk0z** commented the **2025-06-01** at **13:24:16**:
+👤 **Ph0rk0z** commented on **2025-06-01** at **13:24:16**
Its funny because I often get slightly better speeds on server than the sweep bench. Nowhere near half so something is wrong.
@@ -108,7 +108,7 @@ So playing with 4096 batches showed me something. In server, prompt speed on sma
---
-👤 **VinnyG9** commented the **2025-06-02** at **16:34:43**:
+👤 **VinnyG9** commented on **2025-06-02** at **16:34:43**
> > both, literally a bit less than half PP/TG. think it could be a numa issue? i tried with stock bios settings but got worse results albeit closer bench/serve numbers
>
@@ -194,7 +194,7 @@ WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to i
---
-👤 **VinnyG9** commented the **2025-06-02** at **16:55:21**:
+👤 **VinnyG9** commented on **2025-06-02** at **16:55:21**
and running on GPU
@@ -234,7 +234,7 @@ main: llama_decode() failed
---
-👤 **saood06** commented the **2025-06-03** at **00:42:25**:
+👤 **saood06** commented on **2025-06-03** at **00:42:25**
Is there any reason why you use 31 threads? I would say try using 32 threads and see if that helps your performance (but I don't think that is the reason for the gap in performance between server and sweep).
@@ -242,11 +242,11 @@ See this comment about why that might be a bad choice: https://github.com/ikawra
---
-👤 **VinnyG9** commented the **2025-06-03** at **01:37:17**:
+👤 **VinnyG9** commented on **2025-06-03** at **01:37:17**
> Is there any reason why you use 31 threads? I would say try using 32 threads and see if that helps your performance (but I don't think that is the reason for the gap in performance between server and sweep).
>
-> See this comment about why that might be a bad choice: [#223 (comment)](https://github.com/ikawrakow/ik_llama.cpp/discussions/223#discussioncomment-12292591)
+> See this comment about why that might be a bad choice: [[#223](https://github.com/ikawrakow/ik_llama.cpp/issues/223) (comment)](https://github.com/ikawrakow/ik_llama.cpp/discussions/223#discussioncomment-12292591)
yeah when i benched it performance improved with the number of(physical) threads up to 31-32, though only for the moe's
@@ -255,7 +255,7 @@ is it normal that during generation the model pauses on every comma? i find it f
---
-👤 **nux** commented the **2025-06-03** at **02:28:40**:
+👤 **nux** commented on **2025-06-03** at **02:28:40**
Not sure if relevant here - the topic name seems so. Was looking into some performance issues and found this thread.
@@ -301,7 +301,7 @@ Thanks
---
-👤 **ikawrakow** commented the **2025-06-03** at **05:06:02**:
+👤 **ikawrakow** commented on **2025-06-03** at **05:06:02**
@Fuckingnameless
@@ -311,17 +311,17 @@ Having said all this, I still find a factor of 2 difference in CPU performance s
---
-👤 **ikawrakow** commented the **2025-06-03** at **05:20:03**:
+👤 **ikawrakow** commented on **2025-06-03** at **05:20:03**
@nux
-There was PR #461 that added CUDA implementation for some of the row-interleaved quants. This results in a change in behavior for your `IQ4_K_R4` quantized model: prior to PR #461 all matrix multiplications for `X_R4` tensors had to be done on the CPU. After PR #461, for batch size `>= 32` they get offloaded to the GPU to perform the matrix multiplications. If the PCI-E speed is low for some reason, this can make PP slower. You can try adding `-op 26,0,27,0,29,0` to the command line to see what happens. This will disable the offload to the GPU.
+There was PR [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461) that added CUDA implementation for some of the row-interleaved quants. This results in a change in behavior for your `IQ4_K_R4` quantized model: prior to PR [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461) all matrix multiplications for `X_R4` tensors had to be done on the CPU. After PR [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461), for batch size `>= 32` they get offloaded to the GPU to perform the matrix multiplications. If the PCI-E speed is low for some reason, this can make PP slower. You can try adding `-op 26,0,27,0,29,0` to the command line to see what happens. This will disable the offload to the GPU.
-I have no explanation for the 2X lower TG performance. Try using `-mla 3`, which has been supported on the GPU since PR #408/#413
+I have no explanation for the 2X lower TG performance. Try using `-mla 3`, which has been supported on the GPU since PR [#408](https://github.com/ikawrakow/ik_llama.cpp/issues/408)/[#413](https://github.com/ikawrakow/ik_llama.cpp/issues/413)
---
-👤 **nux** commented the **2025-06-03** at **12:56:00**:
+👤 **nux** commented on **2025-06-03** at **12:56:00**
I will put together a script to go through commits and benchmark to figure out exactly when this started. I'm noticing right now is that while llama-bench is running, the GPU utilization drops to 38-39% for about 10 seconds and going back up to 99%. While llama-bench is running I see this pattern repeating with gpu usage %
@@ -329,7 +329,7 @@ I have been using mla 3 - but ran the benchmark above in mla 2 for comparison pu
---
-👤 **nux** commented the **2025-06-03** at **14:00:55**:
+👤 **nux** commented on **2025-06-03** at **14:00:55**
Commit 0976467 is when the performance went down for me. Was running for i in `cut -d " " -f1 commits.txt `;do git checkout $i;./cmd-build.sh ;./start-bench.sh >> results.txt;done
@@ -361,13 +361,13 @@ build: 24c010b3 (3713)
---
-👤 **ikawrakow** commented the **2025-06-03** at **14:10:23**:
+👤 **ikawrakow** commented on **2025-06-03** at **14:10:23**
@nux Maybe it is better you open a new issue with your findings. You can also add the tensors being used in your model when you do so. This issue is about a discrepancy between performance observed with `llama-bench`/`llama-sweep-bench` and performance observed with `llama-server`.
---
-👤 **VinnyG9** commented the **2025-06-10** at **00:38:42**:
+👤 **VinnyG9** commented on **2025-06-10** at **00:38:42**
> @Fuckingnameless
>
@@ -384,7 +384,7 @@ i make sure the runtime flags are equal between runs, should i be building with:
---
-👤 **cg10036** commented the **2025-06-14** at **15:58:34**:
+👤 **cg10036** commented on **2025-06-14** at **15:58:34**
Hi, I'm leaving a comment because I seem to be experiencing a similar issue.
Quantization Type: IQ4_XS
diff --git a/github-data/issues/479 - Bug_ _ggml_backend_cuda_graph_compute_ disabling CUDA graphs due to GPU.md b/github-data/issues/479 - Bug ggml_backend_cuda_graph_compute disabling CUDA graphs due to GPU architectur.md
similarity index 80%
rename from github-data/issues/479 - Bug_ _ggml_backend_cuda_graph_compute_ disabling CUDA graphs due to GPU.md
rename to github-data/issues/479 - Bug ggml_backend_cuda_graph_compute disabling CUDA graphs due to GPU architectur.md
index 7d02c8ad7..81a23b735 100644
--- a/github-data/issues/479 - Bug_ _ggml_backend_cuda_graph_compute_ disabling CUDA graphs due to GPU.md
+++ b/github-data/issues/479 - Bug ggml_backend_cuda_graph_compute disabling CUDA graphs due to GPU architectur.md
@@ -1,4 +1,4 @@
-### 🐛 [#479](https://github.com/ikawrakow/ik_llama.cpp/issues/479) - Bug: \"ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture\" flood
+## 📌 [Issue #479](https://github.com/ikawrakow/ik_llama.cpp/issues/479) - Bug: "ggml_backend_cuda_graph_compute: disabling CUDA graphs due to GPU architecture" flood
| **Author** | `pt13762104` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -31,9 +31,9 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-31** at **13:56:56**:
+👤 **ikawrakow** commented on **2025-05-31** at **13:56:56**
So, in this repository `GGML_CUDA_USE_GRAPHS` is off by default. You have explicitly enabled it, but are using a GPU that does not support CUDA graphs and are not satisfied with the observed behavior.
@@ -48,6 +48,6 @@ But it seems you think 3 is better?
---
-👤 **pt13762104** commented the **2025-05-31** at **13:59:34**:
+👤 **pt13762104** commented on **2025-05-31** at **13:59:34**
Oh, I build it with -DCMAKE_CUDA_ARCHITECTURES="75", didn't know such flags existed. Thank you
\ No newline at end of file
diff --git a/github-data/issues/485 - Bug_ Illegal Memory Access loading model to CUDA1.md b/github-data/issues/485 - Bug Illegal Memory Access loading model to CUDA1.md
similarity index 99%
rename from github-data/issues/485 - Bug_ Illegal Memory Access loading model to CUDA1.md
rename to github-data/issues/485 - Bug Illegal Memory Access loading model to CUDA1.md
index d12cfaebc..98ff795db 100644
--- a/github-data/issues/485 - Bug_ Illegal Memory Access loading model to CUDA1.md
+++ b/github-data/issues/485 - Bug Illegal Memory Access loading model to CUDA1.md
@@ -1,4 +1,4 @@
-### 🐛 [#485](https://github.com/ikawrakow/ik_llama.cpp/issues/485) - Bug: Illegal Memory Access loading model to CUDA1
+## 📌 [Issue #485](https://github.com/ikawrakow/ik_llama.cpp/issues/485) - Bug: Illegal Memory Access loading model to CUDA1
| **Author** | `cmoncure` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -463,9 +463,9 @@ CUDA error: an illegal memory access was encountered
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **cmoncure** commented the **2025-06-02** at **21:15:21**:
+👤 **cmoncure** commented on **2025-06-02** at **21:15:21**
This is down to the ergonomics of the configuration options.
Adding -mg 1 solves it. I don't think this should result in a segfault though. Alas, you're just one guy.
diff --git a/github-data/issues/490 - Bug_ Performance drop with 14292913 _461.md b/github-data/issues/490 - Bug Performance drop with 14292913 461.md
similarity index 90%
rename from github-data/issues/490 - Bug_ Performance drop with 14292913 _461.md
rename to github-data/issues/490 - Bug Performance drop with 14292913 461.md
index 516b861a1..0105704a9 100644
--- a/github-data/issues/490 - Bug_ Performance drop with 14292913 _461.md
+++ b/github-data/issues/490 - Bug Performance drop with 14292913 461.md
@@ -1,4 +1,4 @@
-### 🐛 [#490](https://github.com/ikawrakow/ik_llama.cpp/issues/490) - Bug: Performance drop with 14292913 [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461)
+## 📌 [Issue #490](https://github.com/ikawrakow/ik_llama.cpp/issues/490) - Bug: Performance drop with 14292913 [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461)
| **Author** | `nux` |
| :--- | :--- |
@@ -8,11 +8,11 @@
---
-#### Description
+## 📄 Description
### What happened?
-Performance dropping with commit 14292913 #461
+Performance dropping with commit 14292913 [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461)
To identify which commit the performance dropped with I was running:
@@ -75,15 +75,15 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-03** at **14:24:50**:
+👤 **ikawrakow** commented on **2025-06-03** at **14:24:50**
Are all tensors `IQ4_K_R4`? If not, what is the quantization mix in this model?
---
-👤 **nux** commented the **2025-06-03** at **14:30:39**:
+👤 **nux** commented on **2025-06-03** at **14:30:39**
This is https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF IQ4_K_R4
@@ -96,7 +96,7 @@ llama_model_loader: - type iq5_k_r4: 58 tensors
---
-👤 **ikawrakow** commented the **2025-06-03** at **15:08:10**:
+👤 **ikawrakow** commented on **2025-06-03** at **15:08:10**
I cannot run DeepSeek-V3, but as a surrogate here some results with Qwen3-30B-A22B. Quantized with the same mix of `IQ4_K_R4` and `IQ5_K_R4` for the experts, `Q8_0` everything else, just like the model you have. My system is Ryzen-7950X + RTX-4080. I'm leaving all experts on the CPU (`-ot exps=CPU`).
@@ -126,9 +126,9 @@ I see zero difference in TG. PP on main is indeed slower for u-batch of 512, but
---
-👤 **ikawrakow** commented the **2025-06-03** at **15:36:46**:
+👤 **ikawrakow** commented on **2025-06-03** at **15:36:46**
-If you say that you don't want to use large u-batches because of something, you can recover the pre-#461 behavior using `-op 26,0,27,0,29,0`. This disables offloading of tensors that are on the CPU to the GPU. This has not been implemented in `llama-bench`, which has its own command line argument parsing, but is available in `llama-sweep-bench`.
+If you say that you don't want to use large u-batches because of something, you can recover the pre-[#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461) behavior using `-op 26,0,27,0,29,0`. This disables offloading of tensors that are on the CPU to the GPU. This has not been implemented in `llama-bench`, which has its own command line argument parsing, but is available in `llama-sweep-bench`.
Here is what I get with
```
@@ -176,7 +176,7 @@ Here is what I get with
---
-👤 **nux** commented the **2025-06-03** at **22:24:24**:
+👤 **nux** commented on **2025-06-03** at **22:24:24**
I don't mind using larger batch sizes. I mostly leave things as they are when it's working and only look at it when there's a problem :-D
@@ -203,7 +203,7 @@ If I'm the only one having problems, I'll keep using 24c010b3 for deepseek-r1 an
---
-👤 **ikawrakow** commented the **2025-06-04** at **04:47:10**:
+👤 **ikawrakow** commented on **2025-06-04** at **04:47:10**
>If I'm the only one having problems, I'll keep using https://github.com/ikawrakow/ik_llama.cpp/commit/24c010b3916b5f1bb9d712d610d1fe9308ef7df4 for deepseek-r1 and deepseek-v3.
@@ -213,7 +213,7 @@ I'll close the issue then.
---
-👤 **nux** commented the **2025-06-04** at **05:47:54**:
+👤 **nux** commented on **2025-06-04** at **05:47:54**
What do you mean options available with DeepSeek? I tried ubatch and have been running mla 3.
@@ -230,15 +230,15 @@ That being said I don't think we pay you enough. I appreciate all the work you'v
---
-👤 **ikawrakow** commented the **2025-06-04** at **05:52:12**:
+👤 **ikawrakow** commented on **2025-06-04** at **05:52:12**
I didn't see your performance values for `-ub 2048` (or even `-b 4096 -ub 4096`
-Neither did I see results for your regular way of using DeepSeek but adding `-op 26,0,27,0,29,0` to your command line. This latter option should match what you had prior to #461.
+Neither did I see results for your regular way of using DeepSeek but adding `-op 26,0,27,0,29,0` to your command line. This latter option should match what you had prior to [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461).
---
-👤 **nux** commented the **2025-06-05** at **13:53:10**:
+👤 **nux** commented on **2025-06-05** at **13:53:10**
-op 26,0,27,0,29,0 brought back the performance. I hadn't tried that one as my PCI-E speed is 16x - but working now.
diff --git a/github-data/issues/498 - question_ about quantize method.md b/github-data/issues/498 - question about quantize method.md
similarity index 77%
rename from github-data/issues/498 - question_ about quantize method.md
rename to github-data/issues/498 - question about quantize method.md
index 5a2d8961d..7d5c19bfc 100644
--- a/github-data/issues/498 - question_ about quantize method.md
+++ b/github-data/issues/498 - question about quantize method.md
@@ -1,4 +1,4 @@
-### 📝 [#498](https://github.com/ikawrakow/ik_llama.cpp/issues/498) - question: about quantize method
+## 📌 [Issue #498](https://github.com/ikawrakow/ik_llama.cpp/issues/498) - question: about quantize method
| **Author** | `nigelzzz` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
Hi,
the project is amazing and interesting, looks like it better thank origin llama.cpp.
@@ -21,9 +21,9 @@ thanks
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-06** at **16:39:00**:
+👤 **ikawrakow** commented on **2025-06-06** at **16:39:00**
For BitNet take a look at `IQ1_BN` and `IQ2_BN`. The packing in `IQ2_BN` is simpler and easier to understand, but uses 2 bits per weight. `IQ1_BN` uses 1.625 bits per weight, which is very close to the theoretical 1.58 bits for a ternary data type.
@@ -31,7 +31,7 @@ Otherwise not sure what to recommend. Any of the quantization types should be OK
---
-👤 **aezendc** commented the **2025-06-09** at **10:48:49**:
+👤 **aezendc** commented on **2025-06-09** at **10:48:49**
> For BitNet take a look at `IQ1_BN` and `IQ2_BN`. The packing in `IQ2_BN` is simpler and easier to understand, but uses 2 bits per weight. `IQ1_BN` uses 1.625 bits per weight, which is very close to the theoretical 1.58 bits for a ternary data type.
>
@@ -41,12 +41,12 @@ I like the iq1_bn quantize. Its good and I am using it. Is there a way we can ma
---
-👤 **ikawrakow** commented the **2025-06-09** at **11:01:33**:
+👤 **ikawrakow** commented on **2025-06-09** at **11:01:33**
-See #407
+See [#407](https://github.com/ikawrakow/ik_llama.cpp/issues/407)
---
-👤 **ikawrakow** commented the **2025-06-14** at **12:01:58**:
+👤 **ikawrakow** commented on **2025-06-14** at **12:01:58**
I think we can close it.
\ No newline at end of file
diff --git a/github-data/issues/499 - Bug_ cache quantization crash with IQK_FORCE_BF16.md b/github-data/issues/499 - Bug cache quantization crash with IQK_FORCE_BF16.md
similarity index 90%
rename from github-data/issues/499 - Bug_ cache quantization crash with IQK_FORCE_BF16.md
rename to github-data/issues/499 - Bug cache quantization crash with IQK_FORCE_BF16.md
index b5a43cae7..799be79be 100644
--- a/github-data/issues/499 - Bug_ cache quantization crash with IQK_FORCE_BF16.md
+++ b/github-data/issues/499 - Bug cache quantization crash with IQK_FORCE_BF16.md
@@ -1,4 +1,4 @@
-### 🐛 [#499](https://github.com/ikawrakow/ik_llama.cpp/issues/499) - Bug: cache quantization crash with IQK_FORCE_BF16
+## 📌 [Issue #499](https://github.com/ikawrakow/ik_llama.cpp/issues/499) - Bug: cache quantization crash with IQK_FORCE_BF16
| **Author** | `randoentity` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -91,14 +91,14 @@ gml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*,
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Thireus** commented the **2025-06-06** at **15:04:29**:
+👤 **Thireus** commented on **2025-06-06** at **15:04:29**
I can confirm the same issue occurs on q4_0 as well.
---
-👤 **ikawrakow** commented the **2025-06-06** at **16:32:03**:
+👤 **ikawrakow** commented on **2025-06-06** at **16:32:03**
-Does #501 fix it?
\ No newline at end of file
+Does [#501](https://github.com/ikawrakow/ik_llama.cpp/issues/501) fix it?
\ No newline at end of file
diff --git a/github-data/issues/500 - Bug_ Insane cudaMalloc OOM Error on Dual 3090 GPUs.md b/github-data/issues/500 - Bug Insane cudaMalloc OOM Error on Dual 3090 GPUs.md
similarity index 97%
rename from github-data/issues/500 - Bug_ Insane cudaMalloc OOM Error on Dual 3090 GPUs.md
rename to github-data/issues/500 - Bug Insane cudaMalloc OOM Error on Dual 3090 GPUs.md
index 3b88ce030..5079ef9b5 100644
--- a/github-data/issues/500 - Bug_ Insane cudaMalloc OOM Error on Dual 3090 GPUs.md
+++ b/github-data/issues/500 - Bug Insane cudaMalloc OOM Error on Dual 3090 GPUs.md
@@ -1,4 +1,4 @@
-### 🐛 [#500](https://github.com/ikawrakow/ik_llama.cpp/issues/500) - Bug: Insane cudaMalloc OOM Error on Dual 3090 GPUs
+## 📌 [Issue #500](https://github.com/ikawrakow/ik_llama.cpp/issues/500) - Bug: Insane cudaMalloc OOM Error on Dual 3090 GPUs
| **Author** | `simple6502` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -205,9 +205,9 @@ Aborted
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-06** at **15:58:12**:
+👤 **ikawrakow** commented on **2025-06-06** at **15:58:12**
Try `cmake -DGGML_SCHED_MAX_COPIES=1 ...`
@@ -217,6 +217,6 @@ Also add `--parallel 1` to your command line when starting the server.
---
-👤 **simple6502** commented the **2025-06-06** at **16:23:11**:
+👤 **simple6502** commented on **2025-06-06** at **16:23:11**
Perfect! It works fine now and I don't get any more of those issues. Now I can just fine-tune my settings to work best on my system.
\ No newline at end of file
diff --git a/github-data/issues/503 - Bug_ server_cli fails with segmentation fault.md b/github-data/issues/503 - Bug servercli fails with segmentation fault.md
similarity index 93%
rename from github-data/issues/503 - Bug_ server_cli fails with segmentation fault.md
rename to github-data/issues/503 - Bug servercli fails with segmentation fault.md
index b19dd9820..d56541b84 100644
--- a/github-data/issues/503 - Bug_ server_cli fails with segmentation fault.md
+++ b/github-data/issues/503 - Bug servercli fails with segmentation fault.md
@@ -1,4 +1,4 @@
-### 🐛 [#503](https://github.com/ikawrakow/ik_llama.cpp/issues/503) - Bug: server/cli fails with segmentation fault
+## 📌 [Issue #503](https://github.com/ikawrakow/ik_llama.cpp/issues/503) - Bug: server/cli fails with segmentation fault
| **Author** | `OneOfOne` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -156,44 +156,44 @@ llama_new_context_with_model: graph splits = 779
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **OneOfOne** commented the **2025-06-07** at **20:05:07**:
+👤 **OneOfOne** commented on **2025-06-07** at **20:05:07**
this only happens with the vulkan backend, I haven't figured out how to use rocm or if it's even supported.
---
-👤 **OneOfOne** commented the **2025-06-07** at **20:36:11**:
+👤 **OneOfOne** commented on **2025-06-07** at **20:36:11**
Narrowed it down to `-ctv / -ctk`, removing them makes the model load, however even with full offloading to the GPU, it's extremely slow.
2 tps vs 35tps on lm studio (vulkan backend).
---
-👤 **Ph0rk0z** commented the **2025-06-07** at **22:35:28**:
+👤 **Ph0rk0z** commented on **2025-06-07** at **22:35:28**
Since its not a large MOE but a dense model, not sure if there is a reason to use IK for it instead of mainline.
---
-👤 **OneOfOne** commented the **2025-06-08** at **02:12:36**:
+👤 **OneOfOne** commented on **2025-06-08** at **02:12:36**
I wanted to play with the some of the ggufs optimized for ik_llama, so I figured I'd give it a try, doesn't explain why those options don't work and why it's extremely slow with full gpu offload.
---
-👤 **saood06** commented the **2025-06-08** at **04:56:55**:
+👤 **saood06** commented on **2025-06-08** at **04:56:55**
> Since its not a large MOE but a dense model, not sure if there is a reason to use IK for it instead of mainline.
-That is not true at all. See this (https://github.com/ikawrakow/ik_llama.cpp/discussions/256#discussioncomment-12496828) for a list of reasons on top of the new quant types and there are so many examples of performance gains over mainline, such as for batched performance see the graph in #171.
+That is not true at all. See this (https://github.com/ikawrakow/ik_llama.cpp/discussions/256#discussioncomment-12496828) for a list of reasons on top of the new quant types and there are so many examples of performance gains over mainline, such as for batched performance see the graph in [#171](https://github.com/ikawrakow/ik_llama.cpp/issues/171).
Going back to the actual issue, vulkan and rocm may be functioning well in this repo as they receive very little testing (this is the first I'm hearing of someone trying to use it) and as far as I'm aware have no development here.
---
-👤 **ikawrakow** commented the **2025-06-08** at **05:04:08**:
+👤 **ikawrakow** commented on **2025-06-08** at **05:04:08**
Yes, mainline is a much better place for Vulkan users. There has been zero development or updates to the Vulkan back-end since I forked the project. At that time the `llama.cpp` Vulkan back-end was quite immature. There has been a very active Vulkan development in mainline since then with many performance improvements. ROCm is also never tested, so unclear if it still works.
@@ -203,13 +203,13 @@ These quantization types are not implemented in the Vulkan back-end, so it will
---
-👤 **OneOfOne** commented the **2025-06-08** at **16:22:15**:
+👤 **OneOfOne** commented on **2025-06-08** at **16:22:15**
Thanks for the replies and explanation, I'll close this issue for now until I get an nvidia card I guess
---
-👤 **ubergarm** commented the **2025-06-28** at **22:48:25**:
+👤 **ubergarm** commented on **2025-06-28** at **22:48:25**
@OneOfOne
diff --git a/github-data/issues/507 - Compatible gguf models _.md b/github-data/issues/507 - Compatible gguf models.md
similarity index 57%
rename from github-data/issues/507 - Compatible gguf models _.md
rename to github-data/issues/507 - Compatible gguf models.md
index 99f2dd8c3..171e36045 100644
--- a/github-data/issues/507 - Compatible gguf models _.md
+++ b/github-data/issues/507 - Compatible gguf models.md
@@ -1,4 +1,4 @@
-### 📝 [#507](https://github.com/ikawrakow/ik_llama.cpp/issues/507) - Compatible gguf models ?
+## 📌 [Issue #507](https://github.com/ikawrakow/ik_llama.cpp/issues/507) - Compatible gguf models ?
| **Author** | `lbarasc` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
Hi,
@@ -19,15 +19,15 @@ Thank you for your help.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-09** at **12:23:07**:
+👤 **ikawrakow** commented on **2025-06-09** at **12:23:07**
-See #401
+See [#401](https://github.com/ikawrakow/ik_llama.cpp/issues/401)
---
-👤 **lbarasc** commented the **2025-06-09** at **16:47:49**:
+👤 **lbarasc** commented on **2025-06-09** at **16:47:49**
Here is my command under win10 64bits (with latest ik_lama with xeon e5 and rtx 3060 cuda :
@@ -45,24 +45,9 @@ Please help me.
---
-👤 **lbarasc** commented the **2025-06-09** at **16:47:49**:
+👤 **ikawrakow** commented on **2025-06-09** at **16:53:40**
-Here is my command (with latest ik_lama with xeon e5 and rtx 3060 cuda :
-
-D:\ik_lama>llama-server.exe -m ggml-model-i2_s.gguf -p "<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello, who are you?<|im_end|>\n<|im_start|>assistant\n"
-INFO [ main] build info | tid="21032" timestamp=1749487602 build=1 commit="02272cd"
-INFO [ main] system info | tid="21032" timestamp=1749487602 n_threads=12 n_threads_batch=-1 total_threads=24 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
-
-D:\ik_lama>
-
-I have no error but nothing at all !
-Please help me.
-
----
-
-👤 **ikawrakow** commented the **2025-06-09** at **16:53:40**:
-
-You need to convert the `i2_s` model to `ik_llama.cpp` quants as described in #401. You missed this step:
+You need to convert the `i2_s` model to `ik_llama.cpp` quants as described in [#401](https://github.com/ikawrakow/ik_llama.cpp/issues/401). You missed this step:
```
./build/bin/llama-quantize --allow-requantize ./models/ggml-model-i2_s.gguf ./models/bitnet.gguf iq2_bn_r4
```
@@ -70,7 +55,7 @@ Then your server command should use the newly created file, not the `i2_s` file.
---
-👤 **lbarasc** commented the **2025-06-09** at **17:09:08**:
+👤 **lbarasc** commented on **2025-06-09** at **17:09:08**
I do this :
D:\ik_lama>llama-quantize --allow-requantize ggml-model-i2_s.gguf bitnet.gguf iq2_bn_r4
@@ -84,12 +69,12 @@ but i cannot retrieve bitnet.gguf file ?
---
-👤 **saood06** commented the **2025-06-11** at **07:00:39**:
+👤 **saood06** commented on **2025-06-11** at **07:00:39**
Not sure why the requantize didn't work for you, but I have provided pre-converted models you can use [here](https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF).
---
-👤 **ikawrakow** commented the **2025-06-14** at **12:02:29**:
+👤 **ikawrakow** commented on **2025-06-14** at **12:02:29**
Nothing more that we can do here.
\ No newline at end of file
diff --git a/github-data/issues/514 - CUDA Kernel Error on RTX 5090 _Compute Capability 12.0_ _no kernel imag.md b/github-data/issues/514 - CUDA Kernel Error on RTX 5090 Compute Capability 12.0 no kernel image is availab.md
similarity index 97%
rename from github-data/issues/514 - CUDA Kernel Error on RTX 5090 _Compute Capability 12.0_ _no kernel imag.md
rename to github-data/issues/514 - CUDA Kernel Error on RTX 5090 Compute Capability 12.0 no kernel image is availab.md
index a3b4d35f5..20e535450 100644
--- a/github-data/issues/514 - CUDA Kernel Error on RTX 5090 _Compute Capability 12.0_ _no kernel imag.md
+++ b/github-data/issues/514 - CUDA Kernel Error on RTX 5090 Compute Capability 12.0 no kernel image is availab.md
@@ -1,4 +1,4 @@
-### 📝 [#514](https://github.com/ikawrakow/ik_llama.cpp/issues/514) - CUDA Kernel Error on RTX 5090 (Compute Capability 12.0): \"no kernel image is available for execution on the device\"
+## 📌 [Issue #514](https://github.com/ikawrakow/ik_llama.cpp/issues/514) - CUDA Kernel Error on RTX 5090 (Compute Capability 12.0): "no kernel image is available for execution on the device"
| **Author** | `mtcl` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
**Description:**
@@ -505,9 +505,9 @@ Aborted (core dumped)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **mtcl** commented the **2025-06-10** at **04:56:51**:
+👤 **mtcl** commented on **2025-06-10** at **04:56:51**
@ikawrakow or @ubergarm is there an easy fix?
@@ -529,13 +529,13 @@ cmake --build ./build --config Release -j $(nproc)
---
-👤 **ikawrakow** commented the **2025-06-10** at **05:19:51**:
+👤 **ikawrakow** commented on **2025-06-10** at **05:19:51**
So, the default is to make a native build for the GPU you have. This works fine in most cases. I assume it gets built for the 4090 (compute 89). But it seems the 5090 is a different compute architecture, so it does not work. I have no experience with 5090s, and I'm not finding anything related to that in mainline `llama.cpp`. Can you build and run successfully with mainline?
---
-👤 **mtcl** commented the **2025-06-10** at **13:54:58**:
+👤 **mtcl** commented on **2025-06-10** at **13:54:58**
Trying with llama.cpp, pulled latest and configured like this:
@@ -770,7 +770,7 @@ cp llama.cpp/build/bin/llama-* llama.cpp
---
-👤 **mtcl** commented the **2025-06-10** at **14:09:37**:
+👤 **mtcl** commented on **2025-06-10** at **14:09:37**
ok it indeed works with mainline, i validated that it indeed got loaded on 5090.
This is the guide I used by the way:
@@ -973,13 +973,13 @@ llama_perf_context_print: total time = 4409.74 ms / 203 tokens
---
-👤 **ikawrakow** commented the **2025-06-10** at **14:15:24**:
+👤 **ikawrakow** commented on **2025-06-10** at **14:15:24**
In the folder where you build mainline `llama.cpp` there must be a file called `compile_commands.json`. Can you attach it here? Thanks.
---
-👤 **ubergarm** commented the **2025-06-10** at **14:24:05**:
+👤 **ubergarm** commented on **2025-06-10** at **14:24:05**
@mtcl
@@ -996,7 +996,7 @@ Something to try while you get more info for ik anyway and maybe @panchovix will
---
-👤 **mtcl** commented the **2025-06-10** at **14:37:12**:
+👤 **mtcl** commented on **2025-06-10** at **14:37:12**
> In the folder where you build mainline `llama.cpp` there must be a file called `compile_commands.json`. Can you attach it here? Thanks.
@@ -1004,7 +1004,7 @@ Something to try while you get more info for ik anyway and maybe @panchovix will
---
-👤 **Panchovix** commented the **2025-06-10** at **14:45:03**:
+👤 **Panchovix** commented on **2025-06-10** at **14:45:03**
I have at the moment 2x5090+2x4090+2x3090+A6000 and ikllamacpp works fine.
@@ -1027,7 +1027,7 @@ What is your OS by the way? If it's Fedora 42, since it has GCC15, it is a bit d
---
-👤 **mtcl** commented the **2025-06-10** at **15:04:17**:
+👤 **mtcl** commented on **2025-06-10** at **15:04:17**
>
> CUDA 12.8 and 12.9 worked fine to compile.
@@ -1705,7 +1705,7 @@ Aborted (core dumped)
---
-👤 **mtcl** commented the **2025-06-10** at **15:07:37**:
+👤 **mtcl** commented on **2025-06-10** at **15:07:37**
> [@mtcl](https://github.com/mtcl)
>
@@ -1757,13 +1757,13 @@ Aborted (core dumped)
---
-👤 **ikawrakow** commented the **2025-06-10** at **15:08:00**:
+👤 **ikawrakow** commented on **2025-06-10** at **15:08:00**
I think `ccache` maybe the issue. Try building in a new folder.
---
-👤 **mtcl** commented the **2025-06-10** at **15:09:19**:
+👤 **mtcl** commented on **2025-06-10** at **15:09:19**
> I think `ccache` maybe the issue. Try building in a new folder.
@@ -1771,13 +1771,13 @@ i will delete the whole folder, reclone and rebuild. one moment please.
---
-👤 **ikawrakow** commented the **2025-06-10** at **15:29:42**:
+👤 **ikawrakow** commented on **2025-06-10** at **15:29:42**
@Panchovix IIRC, you were getting over 200 t/s prefill for DeepSeek-R1/V3, but I think your setup has improved since then. What is your current performance?
---
-👤 **mtcl** commented the **2025-06-10** at **15:35:52**:
+👤 **mtcl** commented on **2025-06-10** at **15:35:52**
OK this worked! This is what I had to do.
@@ -2696,13 +2696,13 @@ INFO [ update_slots] all slots are idle | tid="132980309143552" times
---
-👤 **ikawrakow** commented the **2025-06-10** at **15:44:24**:
+👤 **ikawrakow** commented on **2025-06-10** at **15:44:24**
I think `ccache` was the issue. It is very useful when not making significant changes to the setup. But it does get confused and does not rebuild correctly what needs to be rebuild. So, in the future, if you fetch a new version of `ik_llama.cpp`, update CUDA, change computer setup, etc., it is best to just delete the existing folder and rebuild from scratch.
---
-👤 **Panchovix** commented the **2025-06-10** at **16:13:42**:
+👤 **Panchovix** commented on **2025-06-10** at **16:13:42**
> [@Panchovix](https://github.com/Panchovix) IIRC, you were getting over 200 t/s prefill for DeepSeek-R1/V3, but I think your setup has improved since then. What is your current performance?
@@ -2722,7 +2722,7 @@ So I haven't tested much recently, but I think on Q3_K_KL I went to a higher bat
---
-👤 **mtcl** commented the **2025-06-10** at **16:16:45**:
+👤 **mtcl** commented on **2025-06-10** at **16:16:45**
I currently have 1*5090, 2*4090, 1*3090 and I'll be getting another 5090 tomorrow.
@@ -2730,19 +2730,13 @@ I was originally going to sell everything else and keep only 2*5090s, is there a
---
-👤 **Panchovix** commented the **2025-06-10** at **20:25:07**:
+👤 **Panchovix** commented on **2025-06-10** at **20:25:07**
@mtcl just no self control, and being able to run Q3 Deepseek 685B models without much issues. Also can *kinda* run IQ4_XS quant with about 20GB RAM left or so.
---
-👤 **Panchovix** commented the **2025-06-10** at **20:25:07**:
-
-@mtcl just no self control, and being bale to run Q3 Deepseek 685B models without much issues. Also can *kinda* run IQ4_NL quant, but just barely.
-
----
-
-👤 **RodriMora** commented the **2025-06-11** at **16:15:46**:
+👤 **RodriMora** commented on **2025-06-11** at **16:15:46**
I do have 2x5090 and 4x3090 and had no problem building at all. I have ccache installed too. How I usually do it:
@@ -2760,7 +2754,7 @@ for mainline I use
---
-👤 **ikawrakow** commented the **2025-06-11** at **16:31:19**:
+👤 **ikawrakow** commented on **2025-06-11** at **16:31:19**
> 200t/s pp, 13t/s tg
@@ -2768,7 +2762,7 @@ With `ik_llama.cpp` or with `llama.cpp`?
---
-👤 **RodriMora** commented the **2025-06-11** at **16:32:54**:
+👤 **RodriMora** commented on **2025-06-11** at **16:32:54**
> > 200t/s pp, 13t/s tg
>
@@ -2787,19 +2781,13 @@ Edit: did a quick sweep bench now
---
-👤 **Panchovix** commented the **2025-06-11** at **16:46:36**:
+👤 **Panchovix** commented on **2025-06-11** at **16:46:36**
@RodriMora Can you tell me the command to run this bench please? Maybe I can try with Q3_K_XL and IQ3_K_R4. I guess you're using a quite big ubatch size?
---
-👤 **Panchovix** commented the **2025-06-11** at **16:46:36**:
-
-@RodriMora Can you tell me the command to run this bench? Maybe I can try with Q3_K_XL and IQ3_K_R4. I guess you're using a quite big ubatch size?
-
----
-
-👤 **RodriMora** commented the **2025-06-11** at **16:56:48**:
+👤 **RodriMora** commented on **2025-06-11** at **16:56:48**
> [@RodriMora](https://github.com/RodriMora) Can you tell me the command to run this bench please? Maybe I can try with Q3_K_XL and IQ3_K_R4. I guess you're using a quite big ubatch size?
@@ -2831,37 +2819,7 @@ Edit: There are some layers missing as I deleted the last one (8,14,18,23,28) fr
---
-👤 **RodriMora** commented the **2025-06-11** at **16:56:48**:
-
-> [@RodriMora](https://github.com/RodriMora) Can you tell me the command to run this bench please? Maybe I can try with Q3_K_XL and IQ3_K_R4. I guess you're using a quite big ubatch size?
-
-The -ot are specific for my setup, the CUDA2 and CUDA4 are the 5090s. 0,1,3,5 are the 3090s
-```
-
-CUDA_VISIBLE_DEVICES="2,4,0,1,3,5" \
- ./build/bin/llama-sweep-bench \
- --model /mnt/llms/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
- --alias ubergarm/DeepSeek-V3-0324-IQ2_K_R4 -mla 3 -fa \
- -amb 512 \
- -fmoe \
- -ctk f16 \
- -c 16384 \
- -ngl 99 \
- -ot "blk\.(3|4|5|6|7)\.ffn_.*=CUDA0" \
- -ot "blk\.(9|10|11|12|13)\.ffn_.*=CUDA1" \
- -ot "blk\.(15|16|17)\.ffn_.*=CUDA2" \
- -ot "blk\.(20|21|22)\.ffn_.*=CUDA3" \
- -ot "blk\.(25|26|27)\.ffn_.*=CUDA4" \
- -ot "blk\.(30|31|32)\.ffn_.*=CUDA5" \
- -ot exps=CPU \
- -b 4096 -ub 4096 \
- --no-mmap \
- --threads 24
-```
-
----
-
-👤 **Panchovix** commented the **2025-06-11** at **19:54:21**:
+👤 **Panchovix** commented on **2025-06-11** at **19:54:21**
Okay I noticed something on ikllamacpp vs llamacpp
@@ -2881,12 +2839,12 @@ PP t/s are similar. I have created a new issue https://github.com/ikawrakow/ik_l
---
-👤 **mtcl** commented the **2025-06-12** at **05:54:13**:
+👤 **mtcl** commented on **2025-06-12** at **05:54:13**
I got 2x5090s and they got perfectly in my system. nowi just need to sell my 2x4090 and 1x3090. 😂
---
-👤 **ikawrakow** commented the **2025-06-14** at **12:00:38**:
+👤 **ikawrakow** commented on **2025-06-14** at **12:00:38**
I think we can close this.
\ No newline at end of file
diff --git a/github-data/issues/521 - When offloading semi layers to some GPUs with -ot_ TG t_s performance t.md b/github-data/issues/521 - When offloading semi layers to some GPUs with -ot TG ts performance tanks CUDA C.md
similarity index 91%
rename from github-data/issues/521 - When offloading semi layers to some GPUs with -ot_ TG t_s performance t.md
rename to github-data/issues/521 - When offloading semi layers to some GPUs with -ot TG ts performance tanks CUDA C.md
index 1de6b5b34..0c32253b4 100644
--- a/github-data/issues/521 - When offloading semi layers to some GPUs with -ot_ TG t_s performance t.md
+++ b/github-data/issues/521 - When offloading semi layers to some GPUs with -ot TG ts performance tanks CUDA C.md
@@ -1,4 +1,4 @@
-### 📝 [#521](https://github.com/ikawrakow/ik_llama.cpp/issues/521) - When offloading semi layers to some GPUs with -ot, TG t/s performance tanks (CUDA + CPU, DeepSeek V3-R1), while not on main llamacpp.
+## 📌 [Issue #521](https://github.com/ikawrakow/ik_llama.cpp/issues/521) - When offloading semi layers to some GPUs with -ot, TG t/s performance tanks (CUDA + CPU, DeepSeek V3-R1), while not on main llamacpp.
| **Author** | `Panchovix` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
Hi there, thanks for your work!
@@ -150,15 +150,15 @@ I can test or give more info if is needed.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Ph0rk0z** commented the **2025-06-11** at **21:47:02**:
+👤 **Ph0rk0z** commented on **2025-06-11** at **21:47:02**
If you do fmoe, some of the layers are fused. Do you also see high GPU usage? When I played with this, the up/gate had to be together and then downs could be on a different card. I could tank my prompt processing or my textgen depending on what I chose.
---
-👤 **Panchovix** commented the **2025-06-12** at **00:00:44**:
+👤 **Panchovix** commented on **2025-06-12** at **00:00:44**
@Ph0rk0z Perfect, it was that! Disabling fmoe makes it work correctly
@@ -175,31 +175,25 @@ I haven't checked GPU usage actually, but I assume it is pretty low as PCIe betw
---
-👤 **Ph0rk0z** commented the **2025-06-12** at **11:17:06**:
+👤 **Ph0rk0z** commented on **2025-06-12** at **11:17:06**
GPU usage gets high when you cause it to bounce between 2 GPUs and produce a bottleneck.
---
-👤 **Panchovix** commented the **2025-06-13** at **17:30:21**:
+👤 **Panchovix** commented on **2025-06-13** at **17:30:21**
@Ph0rk0z It seems to peg the main GPU when doing PP at 100%, then, while inferencing, usage seems to bounce on some GPUs at ~90% each at the start, but then it drops to 10-30% per GPU.
---
-👤 **Ph0rk0z** commented the **2025-06-14** at **11:54:57**:
+👤 **Ph0rk0z** commented on **2025-06-14** at **11:54:57**
Then you're not locked up. On mine when the TG became this slow it was doing >50% on only 2 gpu and did it the entire time generating.
---
-👤 **Ph0rk0z** commented the **2025-06-14** at **11:54:57**:
-
-Then you're not locked up. On mine when the TG became this slow it was doing >50% on only 2 gpu and did it the entire time.
-
----
-
-👤 **ubergarm** commented the **2025-07-10** at **02:33:12**:
+👤 **ubergarm** commented on **2025-07-10** at **02:33:12**
@Panchovix
diff --git a/github-data/issues/522 - Bug_ disabling CUDA graphs due to mul_mat_id.md b/github-data/issues/522 - Bug disabling CUDA graphs due to mul_mat_id.md
similarity index 98%
rename from github-data/issues/522 - Bug_ disabling CUDA graphs due to mul_mat_id.md
rename to github-data/issues/522 - Bug disabling CUDA graphs due to mul_mat_id.md
index 2cbb131c8..4c99448ca 100644
--- a/github-data/issues/522 - Bug_ disabling CUDA graphs due to mul_mat_id.md
+++ b/github-data/issues/522 - Bug disabling CUDA graphs due to mul_mat_id.md
@@ -1,4 +1,4 @@
-### 🐛 [#522](https://github.com/ikawrakow/ik_llama.cpp/issues/522) - Bug: disabling CUDA graphs due to mul_mat_id
+## 📌 [Issue #522](https://github.com/ikawrakow/ik_llama.cpp/issues/522) - Bug: disabling CUDA graphs due to mul_mat_id
| **Author** | `SlavikCA` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -460,27 +460,27 @@ ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecuti
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-12** at **05:03:54**:
+👤 **ikawrakow** commented on **2025-06-12** at **05:03:54**
This warning is hidden behind `#ifdef NDEBUG`, so should not appear in a release build.
---
-👤 **SlavikCA** commented the **2025-06-12** at **05:07:30**:
+👤 **SlavikCA** commented on **2025-06-12** at **05:07:30**
so, safe to ignore?
---
-👤 **ikawrakow** commented the **2025-06-12** at **05:15:20**:
+👤 **ikawrakow** commented on **2025-06-12** at **05:15:20**
Yes, the warning is safe to ignore. But you should make sure that you are using a Release build (where this warning should normally not appear), else your performance will be very low. Try adding `-DCMAKE_BUILD_TYPE=Release` to your `cmake` command. If you still see this message, ask your `cmake` vendor why `NDEBUG` is not defined in a release build.
---
-👤 **SlavikCA** commented the **2025-06-12** at **05:19:05**:
+👤 **SlavikCA** commented on **2025-06-12** at **05:19:05**
I did this:
```
diff --git a/github-data/issues/523 - Bug_ tg speed drop after https_github.com_ikawrakow_ik_llama.cpp_pull_5.md b/github-data/issues/523 - Bug tg speed drop after httpsgithub.comikawrakowik_llama.cpppull518.md
similarity index 69%
rename from github-data/issues/523 - Bug_ tg speed drop after https_github.com_ikawrakow_ik_llama.cpp_pull_5.md
rename to github-data/issues/523 - Bug tg speed drop after httpsgithub.comikawrakowik_llama.cpppull518.md
index 1f0953ade..8e615ce05 100644
--- a/github-data/issues/523 - Bug_ tg speed drop after https_github.com_ikawrakow_ik_llama.cpp_pull_5.md
+++ b/github-data/issues/523 - Bug tg speed drop after httpsgithub.comikawrakowik_llama.cpppull518.md
@@ -1,4 +1,4 @@
-### 🐛 [#523](https://github.com/ikawrakow/ik_llama.cpp/issues/523) - Bug: tg speed drop after https://github.com/ikawrakow/ik_llama.cpp/pull/518
+## 📌 [Issue #523](https://github.com/ikawrakow/ik_llama.cpp/issues/523) - Bug: tg speed drop after https://github.com/ikawrakow/ik_llama.cpp/pull/518
| **Author** | `ciprianveg` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -55,27 +55,27 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-12** at **07:50:42**:
+👤 **ikawrakow** commented on **2025-06-12** at **07:50:42**
So, what is it: is the TG speed drop for `IQ3_S` or for `IQ3_XXS`? Or for both? (but there is only one performance value given).
On the two systems I have available (Zen3 and Zen4), TG performance is exactly the same as before (and I don't see a reason why it should decrease by 20%). Have you tried dropping caches?
-The reason you see a low PP performance when you use `-rtr` with these models is that there is no CUDA implementation for `IQ3_S_R4` or `IQ3_XXS_R4`, so the matrix multiplications for the experts left in RAM is done on the CPU, and your CPU seems to be on the low-end performance side (people do get over 100 t/s on high-end CPUs running CPU-only). So, the only case where you would want to use `-rtr` with a quant that does not have a CUDA implementation for the interleaved variant is when your prompts are relatively short, so offloading to the GPU is slower than running on the CPU. But after PRs #516 and #518, normally prompt processing should now be faster without `-rtr` for `IQ3_S` and `IQ3_XXS`.
+The reason you see a low PP performance when you use `-rtr` with these models is that there is no CUDA implementation for `IQ3_S_R4` or `IQ3_XXS_R4`, so the matrix multiplications for the experts left in RAM is done on the CPU, and your CPU seems to be on the low-end performance side (people do get over 100 t/s on high-end CPUs running CPU-only). So, the only case where you would want to use `-rtr` with a quant that does not have a CUDA implementation for the interleaved variant is when your prompts are relatively short, so offloading to the GPU is slower than running on the CPU. But after PRs [#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516) and [#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518), normally prompt processing should now be faster without `-rtr` for `IQ3_S` and `IQ3_XXS`.
---
-👤 **ikawrakow** commented the **2025-06-12** at **08:00:48**:
+👤 **ikawrakow** commented on **2025-06-12** at **08:00:48**
> and your CPU seems to be on the low-end performance side
-Take that back. You have decided to use the quants with the lowest CPU performance (`IQ3_S` and `IQ3_XXS`), so 25 t/s for DeepSeek-R1 with these quants is not too bad. PP should be 3X better after PR #516 and #518 when running on the CPU.
+Take that back. You have decided to use the quants with the lowest CPU performance (`IQ3_S` and `IQ3_XXS`), so 25 t/s for DeepSeek-R1 with these quants is not too bad. PP should be 3X better after PR [#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516) and [#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518) when running on the CPU.
---
-👤 **ciprianveg** commented the **2025-06-12** at **08:34:02**:
+👤 **ciprianveg** commented on **2025-06-12** at **08:34:02**
Hi, sorry if I was not clear:
Using DeepSeek-R1-0528-UD-IQ3_XXS,
@@ -88,22 +88,9 @@ I realize that quant is not a good fit, but i tried it because is the biggest on
---
-👤 **ciprianveg** commented the **2025-06-12** at **08:34:02**:
+👤 **ikawrakow** commented on **2025-06-12** at **11:06:45**
-Hi, sorry if I was not clear:
-Using DeepSeek-R1-0528-UD-IQ3_XXS,
-After https://github.com/ikawrakow/ik_llama.cpp/pull/517 tg speed was 5.5t/s (without -rtr and 6.4 with rtr).
-After https://github.com/ikawrakow/ik_llama.cpp/pull/518 tg speed drop to 4.5 t/s (without -rtr and 6.2 with rtr).
-
-If I use -rtr, even after pr 516, 518, pp speed drops from cca 250t/s to 26t/s.
-
-I realize that quant is not a good fit, but i tried it because is the biggest one I can fit on my ram+vram, I wanted something a little bigger and possibly better perplexity wise than the already good and fast Ubergram's IQ2_K_R4 model..
-
----
-
-👤 **ikawrakow** commented the **2025-06-12** at **11:06:45**:
-
-So, after #517 it became slightly faster. Which means that what I did in #516 for `IQ3_XXS` is slightly better on your system. But after #518, which applies the very same approach used in #516 to `IQ3_S`, it suddenly became 20% slower. Looking at the Unsloth `IQ3_XXS` model, I see they have used `IQ3_S` for the routed experts in 5 layers. I.e., less than 10% of the computation is done according to the new approach of #518. In order to observe a 20% drop in performance, simple napkin math tells me that `IQ3_S` GEMV must have become 3 times slower with PR #518. Sorry, but this seems extremely unlikely.
+So, after [#517](https://github.com/ikawrakow/ik_llama.cpp/issues/517) it became slightly faster. Which means that what I did in [#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516) for `IQ3_XXS` is slightly better on your system. But after [#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518), which applies the very same approach used in [#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516) to `IQ3_S`, it suddenly became 20% slower. Looking at the Unsloth `IQ3_XXS` model, I see they have used `IQ3_S` for the routed experts in 5 layers. I.e., less than 10% of the computation is done according to the new approach of [#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518). In order to observe a 20% drop in performance, simple napkin math tells me that `IQ3_S` GEMV must have become 3 times slower with PR [#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518). Sorry, but this seems extremely unlikely.
You didn't try to drop caches as suggested, did you?
```
@@ -112,7 +99,7 @@ echo 3 | sudo tee /proc/sys/vm/drop_caches
---
-👤 **Ph0rk0z** commented the **2025-06-12** at **11:23:18**:
+👤 **Ph0rk0z** commented on **2025-06-12** at **11:23:18**
I observe a similar thing:
@@ -136,7 +123,7 @@ All changes
| 4096 | 1024 | 8192 | 28.061 | 145.97 | 111.293 | 9.20 |
-Up to #517
+Up to [#517](https://github.com/ikawrakow/ik_llama.cpp/issues/517)
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
@@ -150,21 +137,21 @@ An R4 quant is nonviable since it drops PP down to 50/60 unless using batch-ubat
---
-👤 **ikawrakow** commented the **2025-06-12** at **11:29:39**:
+👤 **ikawrakow** commented on **2025-06-12** at **11:29:39**
In what sense is a <2% change similar to a 20% change?
---
-👤 **Ph0rk0z** commented the **2025-06-12** at **11:43:54**:
+👤 **Ph0rk0z** commented on **2025-06-12** at **11:43:54**
It confirms there is a change at all. On his particular hardware maybe the change is larger.
---
-👤 **ciprianveg** commented the **2025-06-12** at **11:46:09**:
+👤 **ciprianveg** commented on **2025-06-12** at **11:46:09**
-> So, after [#517](https://github.com/ikawrakow/ik_llama.cpp/pull/517) it became slightly faster. Which means that what I did in [#516](https://github.com/ikawrakow/ik_llama.cpp/pull/516) for `IQ3_XXS` is slightly better on your system. But after [#518](https://github.com/ikawrakow/ik_llama.cpp/pull/518), which applies the very same approach used in [#516](https://github.com/ikawrakow/ik_llama.cpp/pull/516) to `IQ3_S`, it suddenly became 20% slower. Looking at the Unsloth `IQ3_XXS` model, I see they have used `IQ3_S` for the routed experts in 5 layers. I.e., less than 10% of the computation is done according to the new approach of [#518](https://github.com/ikawrakow/ik_llama.cpp/pull/518). In order to observe a 20% drop in performance, simple napkin math tells me that `IQ3_S` GEMV must have become 3 times slower with PR [#518](https://github.com/ikawrakow/ik_llama.cpp/pull/518). Sorry, but this seems extremely unlikely.
+> So, after [[#517](https://github.com/ikawrakow/ik_llama.cpp/issues/517)](https://github.com/ikawrakow/ik_llama.cpp/pull/517) it became slightly faster. Which means that what I did in [[#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516)](https://github.com/ikawrakow/ik_llama.cpp/pull/516) for `IQ3_XXS` is slightly better on your system. But after [[#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518)](https://github.com/ikawrakow/ik_llama.cpp/pull/518), which applies the very same approach used in [[#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516)](https://github.com/ikawrakow/ik_llama.cpp/pull/516) to `IQ3_S`, it suddenly became 20% slower. Looking at the Unsloth `IQ3_XXS` model, I see they have used `IQ3_S` for the routed experts in 5 layers. I.e., less than 10% of the computation is done according to the new approach of [[#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518)](https://github.com/ikawrakow/ik_llama.cpp/pull/518). In order to observe a 20% drop in performance, simple napkin math tells me that `IQ3_S` GEMV must have become 3 times slower with PR [[#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518)](https://github.com/ikawrakow/ik_llama.cpp/pull/518). Sorry, but this seems extremely unlikely.
>
> You didn't try to drop caches as suggested, did you?
>
@@ -176,7 +163,7 @@ i will execute echo 3 | sudo tee /proc/sys/vm/drop_caches, and rerun the test on
---
-👤 **ikawrakow** commented the **2025-06-12** at **12:08:01**:
+👤 **ikawrakow** commented on **2025-06-12** at **12:08:01**
> It confirms there is a change at all. On his particular hardware maybe the change is larger.
@@ -184,15 +171,15 @@ Does it? The fluctuations in performance I observe from run to run are definitel
---
-👤 **Ph0rk0z** commented the **2025-06-12** at **12:24:14**:
+👤 **Ph0rk0z** commented on **2025-06-12** at **12:24:14**
Dunno, there is some variance for sure. I've run many of them. The all changes drop does look like a drop tho. They tend to be repeatable when you have the same settings, especially on the initial/final runs. When you add/remove layers or change settings is when it gets dicey. It smells like with that middle one, I'll never see 10s on TG anymore. Lets see what he comes back with.
---
-👤 **ciprianveg** commented the **2025-06-12** at **13:31:49**:
+👤 **ciprianveg** commented on **2025-06-12** at **13:31:49**
-very strange, i redone the tests dropping the cache after clean rebuild and the difference is big, but the difference is big comparing to origin/ik/iq1_s_gemm. Before #516 and #517 I had a smaller 4.5 t/s tg speed so also something good happened yesterday.
+very strange, i redone the tests dropping the cache after clean rebuild and the difference is big, but the difference is big comparing to origin/ik/iq1_s_gemm. Before [#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516) and [#517](https://github.com/ikawrakow/ik_llama.cpp/issues/517) I had a smaller 4.5 t/s tg speed so also something good happened yesterday.
I assume I should not be choosing the worst type of quant for ik_llama (DeepSeekR1-UD-iQ3-XXS), so I switched back to ubergam's q2 and wait nicely till him or someone else can make a slightly bigger. ik compatible quant and enjoy 8t/s tg speed and same 250 t/s pp speed:)
My tests:
@@ -234,9 +221,9 @@ System: TR 3955WX 256GB ram, 2x3090 24GB + A4500 20GB
---
-👤 **ikawrakow** commented the **2025-06-12** at **16:17:20**:
+👤 **ikawrakow** commented on **2025-06-12** at **16:17:20**
-Can you try #524 ?
+Can you try [#524](https://github.com/ikawrakow/ik_llama.cpp/issues/524) ?
My guess is we are running into compiler limitations. The matrix multiplication code uses C++ templates, and I have observed in the past the strange effect that after adding a new instantiation of the template, performance suddenly drops for pre-existing template instances. I haven't seen this effect for a while, but maybe it is there for you?
@@ -244,13 +231,13 @@ What is the compiler version you are using?
---
-👤 **ciprianveg** commented the **2025-06-12** at **16:22:07**:
+👤 **ciprianveg** commented on **2025-06-12** at **16:22:07**
I will try in about 2h and let you know. Thank you!
---
-👤 **ciprianveg** commented the **2025-06-12** at **19:10:24**:
+👤 **ciprianveg** commented on **2025-06-12** at **19:10:24**
hello, much better: origin/ik/iq_gemv_tweaks :)
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
diff --git a/github-data/issues/527 - Bug_ Webui improvement _481 core dump with a certain question..md b/github-data/issues/527 - Bug Webui improvement 481 core dump with a certain question.md
similarity index 98%
rename from github-data/issues/527 - Bug_ Webui improvement _481 core dump with a certain question..md
rename to github-data/issues/527 - Bug Webui improvement 481 core dump with a certain question.md
index 98e579de3..715a27c92 100644
--- a/github-data/issues/527 - Bug_ Webui improvement _481 core dump with a certain question..md
+++ b/github-data/issues/527 - Bug Webui improvement 481 core dump with a certain question.md
@@ -1,4 +1,4 @@
-### 🐛 [#527](https://github.com/ikawrakow/ik_llama.cpp/issues/527) - Bug: Webui improvement [#481](https://github.com/ikawrakow/ik_llama.cpp/issues/481) core dump with a certain question.
+## 📌 [Issue #527](https://github.com/ikawrakow/ik_llama.cpp/issues/527) - Bug: Webui improvement [#481](https://github.com/ikawrakow/ik_llama.cpp/issues/481) core dump with a certain question.
| **Author** | `ycat3` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -627,14 +627,14 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-14** at **02:54:09**:
+👤 **ikawrakow** commented on **2025-06-14** at **02:54:09**
-Should be fixed now via PR #528.
+Should be fixed now via PR [#528](https://github.com/ikawrakow/ik_llama.cpp/issues/528).
---
-👤 **ikawrakow** commented the **2025-06-14** at **10:56:13**:
+👤 **ikawrakow** commented on **2025-06-14** at **10:56:13**
-Closed via #528
\ No newline at end of file
+Closed via [#528](https://github.com/ikawrakow/ik_llama.cpp/issues/528)
\ No newline at end of file
diff --git a/github-data/issues/530 - Getting crash on second prompt..md b/github-data/issues/530 - Getting crash on second prompt.md
similarity index 99%
rename from github-data/issues/530 - Getting crash on second prompt..md
rename to github-data/issues/530 - Getting crash on second prompt.md
index 3d09ad798..c0bb6c90e 100644
--- a/github-data/issues/530 - Getting crash on second prompt..md
+++ b/github-data/issues/530 - Getting crash on second prompt.md
@@ -1,4 +1,4 @@
-### 📝 [#530](https://github.com/ikawrakow/ik_llama.cpp/issues/530) - Getting crash on second prompt.
+## 📌 [Issue #530](https://github.com/ikawrakow/ik_llama.cpp/issues/530) - Getting crash on second prompt.
| **Author** | `mtcl` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
Getting crash on second prompt. Would there be any reason why?
@@ -678,21 +678,21 @@ Aborted (core dumped)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-15** at **04:44:17**:
+👤 **ikawrakow** commented on **2025-06-15** at **04:44:17**
You are 1 commit behind current main branch, and that commit fixes exactly this problem.
---
-👤 **mtcl** commented the **2025-06-15** at **05:03:14**:
+👤 **mtcl** commented on **2025-06-15** at **05:03:14**
Alright, pulling latest, building and trying out again :) thank you so much!
---
-👤 **mtcl** commented the **2025-06-15** at **05:38:58**:
+👤 **mtcl** commented on **2025-06-15** at **05:38:58**
so i deleted, recloned, rebuilt, it loaded and then crashed when tried to process prompt. Is there a previous version that was stable that I can revert to?
@@ -1223,13 +1223,13 @@ Aborted (core dumped)
---
-👤 **ikawrakow** commented the **2025-06-15** at **05:43:55**:
+👤 **ikawrakow** commented on **2025-06-15** at **05:43:55**
You are running with a context of 4096. That's what you wanted, or was it just a type missing a zero?
---
-👤 **mtcl** commented the **2025-06-15** at **05:45:44**:
+👤 **mtcl** commented on **2025-06-15** at **05:45:44**
> You are running with a context of 4096. That's what you wanted, or was it just a type missing a zero?
@@ -1237,12 +1237,12 @@ Wow, you know me better than I know myself! It indeed was a typo in a hurry! I w
---
-👤 **ikawrakow** commented the **2025-06-15** at **05:47:29**:
+👤 **ikawrakow** commented on **2025-06-15** at **05:47:29**
So, what happened is that the context became full, it tried to shift it, and that may not work with q8_0 for KV cache.
---
-👤 **mtcl** commented the **2025-06-15** at **05:50:44**:
+👤 **mtcl** commented on **2025-06-15** at **05:50:44**
Ah I see, that makes sense. I will close this in that case! Thanks again.
\ No newline at end of file
diff --git a/github-data/issues/538 - Bug_ GGML_ASSERT failed at first prompt.md b/github-data/issues/538 - Bug GGML_ASSERT failed at first prompt.md
similarity index 97%
rename from github-data/issues/538 - Bug_ GGML_ASSERT failed at first prompt.md
rename to github-data/issues/538 - Bug GGML_ASSERT failed at first prompt.md
index bc6498e73..5ea630e47 100644
--- a/github-data/issues/538 - Bug_ GGML_ASSERT failed at first prompt.md
+++ b/github-data/issues/538 - Bug GGML_ASSERT failed at first prompt.md
@@ -1,4 +1,4 @@
-### 🐛 [#538](https://github.com/ikawrakow/ik_llama.cpp/issues/538) - Bug: GGML_ASSERT failed at first prompt
+## 📌 [Issue #538](https://github.com/ikawrakow/ik_llama.cpp/issues/538) - Bug: GGML_ASSERT failed at first prompt
| **Author** | `iehgit` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -336,9 +336,9 @@ Aborted
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-06-19** at **01:26:22**:
+👤 **ubergarm** commented on **2025-06-19** at **01:26:22**
Hrmm... I'm getting something odd now too with my `DeepSeek-R1-0528-IQ4_KS_R4` as well as mostly pure models.
@@ -354,13 +354,13 @@ You might try `git checkout dc96820d` and re-build to see if that gets you for n
---
-👤 **ikawrakow** commented the **2025-06-19** at **06:36:55**:
+👤 **ikawrakow** commented on **2025-06-19** at **06:36:55**
-Is it fixed on the latest after #540?
+Is it fixed on the latest after [#540](https://github.com/ikawrakow/ik_llama.cpp/issues/540)?
---
-👤 **ubergarm** commented the **2025-06-19** at **15:35:21**:
+👤 **ubergarm** commented on **2025-06-19** at **15:35:21**
I recompiled to tip of main 3f111ad7 which includes PR540.
@@ -368,6 +368,6 @@ Confirmed it is working again for me and no longer throwing the `Oops(ggml_compu
---
-👤 **iehgit** commented the **2025-06-19** at **16:54:14**:
+👤 **iehgit** commented on **2025-06-19** at **16:54:14**
Fixed indeed. Thanks!
\ No newline at end of file
diff --git a/github-data/issues/539 - Bug_ garbage output.md b/github-data/issues/539 - Bug garbage output.md
similarity index 97%
rename from github-data/issues/539 - Bug_ garbage output.md
rename to github-data/issues/539 - Bug garbage output.md
index e107e9d02..395299fe0 100644
--- a/github-data/issues/539 - Bug_ garbage output.md
+++ b/github-data/issues/539 - Bug garbage output.md
@@ -1,4 +1,4 @@
-### 🐛 [#539](https://github.com/ikawrakow/ik_llama.cpp/issues/539) - Bug: garbage output
+## 📌 [Issue #539](https://github.com/ikawrakow/ik_llama.cpp/issues/539) - Bug: garbage output
| **Author** | `jagusztinl` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -305,9 +305,9 @@ Linux
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **jagusztinl** commented the **2025-06-19** at **08:40:53**:
+👤 **jagusztinl** commented on **2025-06-19** at **08:40:53**
I tried with IQ4_XS models (gemma) it works perfectly, maybe Q4_0 is bad. But with IQ4_XS and -rtr garbage again. What I miss?
@@ -450,17 +450,17 @@ What is the meaning of life? In english please please please please please pleas
---
-👤 **ikawrakow** commented the **2025-06-19** at **08:53:16**:
+👤 **ikawrakow** commented on **2025-06-19** at **08:53:16**
Can you try the latest build?
---
-👤 **jagusztinl** commented the **2025-06-20** at **08:01:04**:
+👤 **jagusztinl** commented on **2025-06-20** at **08:01:04**
Same, please help:
:~/models$ uname -a
-Linux gpt 6.11.0-1015-azure #15~24.04.1-Ubuntu SMP Thu May 1 03:01:44 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
+Linux gpt 6.11.0-1015-azure [#15](https://github.com/ikawrakow/ik_llama.cpp/issues/15)~24.04.1-Ubuntu SMP Thu May 1 03:01:44 UTC 2025 aarch64 aarch64 aarch64 GNU/Linux
:~/models$ gcc --version
gcc (Ubuntu 14.2.0-4ubuntu2~24.04) 14.2.0
@@ -600,7 +600,7 @@ What is the meaning of life? In english please-E4>6'236,(=+G7(@G>H$8,
+👤 **jagusztinl** commented on **2025-06-20** at **12:54:53**
FYI, I had this warnings during compilation:
@@ -1159,7 +1159,7 @@ In file included from /home/alerant/ik_llama.cpp/examples/llava/clip.cpp:24:
---
-👤 **jagusztinl** commented the **2025-06-20** at **14:04:07**:
+👤 **jagusztinl** commented on **2025-06-20** at **14:04:07**
Fixed: build with -DGGML_SVE=ON solved it
@@ -1180,19 +1180,7 @@ ik_llama.cpp:
---
-👤 **jagusztinl** commented the **2025-06-20** at **14:04:07**:
-
-Fixed: build with -DGGML_SVE=ON solved it
-
----
-
-👤 **jagusztinl** commented the **2025-06-20** at **14:06:39**:
-
-But not faster for any model than the current llama.cpp build on ARM CPU
-
----
-
-👤 **ikawrakow** commented the **2025-06-20** at **15:50:59**:
+👤 **ikawrakow** commented on **2025-06-20** at **15:50:59**
You never mentioned your are using an ARM CPU. Unlike llama.cpp, nothing is automatically set for you on ARM. It is likely you need to set arch options manually. `-DGGML_SVE=ON` solving your issues sounds strange to me as no usage is made of SVE anywhere in `ik_llama.cpp`. The only ARM implementation that exists is NEON.
@@ -1204,13 +1192,13 @@ But overall, yes, ARM CPUs are not a big focus of this project. I maintain it in
---
-👤 **ikawrakow** commented the **2025-06-20** at **15:59:12**:
+👤 **ikawrakow** commented on **2025-06-20** at **15:59:12**
Oh, what is the CPU you are using?
---
-👤 **jagusztinl** commented the **2025-06-21** at **08:39:04**:
+👤 **jagusztinl** commented on **2025-06-21** at **08:39:04**
Thank you for your answer, a bit detailed explanation of the project:
-We are using Azure Cobalt ARM CPUs on spot VMs, (64 real core, 512Gb 12 channel very fast RAM) for 0.5USD/hour (!) instead of expensive GPU setups. The price/perforance ratio is unbeatable: our collegues can use DeepSeek privately for 80USD/month continuously without limits.
@@ -1225,22 +1213,7 @@ Please advise how can we further optimize Deepseek inference with your solution.
---
-👤 **jagusztinl** commented the **2025-06-21** at **08:39:04**:
-
-Thank you for your answer, a bit detail explanation of the project:
--We are using Azure Cobalt ARM CPUs on spot VMs, (64 real core, 512Gb 12 channel very fast RAM) for 0.5USD/hour (!) instead of expensive GPU setups. The price/perforance ratio is unbeatable: our collegues can use DeepSeek privately for 80USD/month continuosly. without limits.
--We experimented with llama.cpp as the fastest inference engine, with this setup (optimized for Cobalt and linked with ARM performance libs): cmake -DGGML_CPU_KLEIDIAI=ON -DCMAKE_CXX_FLAGS="-mcpu=cobalt-100 -mtune=cobalt-100 -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DCMAKE_C_FLAGS="-mcpu=cobalt-100 -mtune=cobalt-100 -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " and ggml detection results:
-Adding CPU backend variant ggml-cpu: -mcpu=neoverse-n2+crc+sve2-aes+sve2-sha3+sve2-sm4+norng+nossbs+dotprod+i8mm+sve+nosme
-
-The best result was this with llama.cpp, usable but we are looking for better performance, this is why we turned to your project:
-| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | RPC | 99 | 1 | pp512 | 43.27 ± 0.16 |
-| deepseek2 671B Q4_0 | 353.47 GiB | 671.03 B | RPC | 99 | 1 | tg128 | 10.97 ± 0.07 |
-
-Please advise how can we further optimize Deepseek inference with your solution.
-
----
-
-👤 **jagusztinl** commented the **2025-06-21** at **08:47:17**:
+👤 **jagusztinl** commented on **2025-06-21** at **08:47:17**
About the garbage problem:
If I do not use -DGGML_SVE=ON during compilation, it is not detected:
@@ -1251,7 +1224,7 @@ this is the root cause of the garbage output on this server.
---
-👤 **ikawrakow** commented the **2025-06-21** at **09:22:44**:
+👤 **ikawrakow** commented on **2025-06-21** at **09:22:44**
I'm open to working on optimizing this project for SVE, but it is a hobby project of mine without commercial backing, so I develop/test on the CPU platforms I have access to (`AVX2`, `Zen4`, `ARM_NEON` on an M2-Max CPU).
@@ -1259,13 +1232,13 @@ What are you looking to optimize? I read somewhere that the "typical enterprise"
---
-👤 **saood06** commented the **2025-06-21** at **16:16:04**:
+👤 **saood06** commented on **2025-06-21** at **16:16:04**
-So can you try experimenting with `-DGGML_ARCH_FLAGS=` added by #347. Some users have had some success with it see: https://github.com/ikawrakow/ik_llama.cpp/issues/345#issuecomment-2831460138. It looks like you have done similar experimenting with llama.cpp, in optimizing it.
+So can you try experimenting with `-DGGML_ARCH_FLAGS=` added by [#347](https://github.com/ikawrakow/ik_llama.cpp/issues/347). Some users have had some success with it see: https://github.com/ikawrakow/ik_llama.cpp/issues/345#issuecomment-2831460138. It looks like you have done similar experimenting with llama.cpp, in optimizing it.
---
-👤 **jagusztinl** commented the **2025-06-23** at **15:34:50**:
+👤 **jagusztinl** commented on **2025-06-23** at **15:34:50**
Using this:
cmake -B ./build -DGGML_LTO=ON -DCMAKE_CXX_FLAGS=" -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DCMAKE_C_FLAGS=" -flto -Ofast -DINTEGER64 -I${ARMPL_DIR}/include -larmpl_ilp64_mp -lamath -lastring -lm " -DGGML_ARCH_FLAGS="-mcpu=neoverse-n2+crc+sve2-aes+sve2-sha3+sve2-sm4+norng+nossbs+dotprod+i8mm+sve+nosme"
@@ -1276,7 +1249,7 @@ ik_llama.cpp is winner :-)
---
-👤 **saood06** commented the **2025-06-23** at **20:40:31**:
+👤 **saood06** commented on **2025-06-23** at **20:40:31**
>ik_llama.cpp is winner :-)
@@ -1296,6 +1269,6 @@ If your use case allows for it, you may be able to get better performance with b
---
-👤 **ikawrakow** commented the **2025-06-26** at **06:49:28**:
+👤 **ikawrakow** commented on **2025-06-26** at **06:49:28**
No need to keep this open.
\ No newline at end of file
diff --git a/github-data/issues/551 - Feature Request_ Support for Falcon Edge series.md b/github-data/issues/551 - Feature Request Support for Falcon Edge series.md
similarity index 90%
rename from github-data/issues/551 - Feature Request_ Support for Falcon Edge series.md
rename to github-data/issues/551 - Feature Request Support for Falcon Edge series.md
index ab2d614dd..27843d1b5 100644
--- a/github-data/issues/551 - Feature Request_ Support for Falcon Edge series.md
+++ b/github-data/issues/551 - Feature Request Support for Falcon Edge series.md
@@ -1,14 +1,15 @@
-### ✨ [#551](https://github.com/ikawrakow/ik_llama.cpp/issues/551) - Feature Request: Support for Falcon Edge series
+## 📌 [Issue #551](https://github.com/ikawrakow/ik_llama.cpp/issues/551) - Feature Request: Support for Falcon Edge series
| **Author** | `harborwater` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-06-24 |
| **Updated** | 2025-06-26 |
+| **Labels** | `enhancement` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -34,15 +35,15 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-24** at **12:22:12**:
+👤 **ikawrakow** commented on **2025-06-24** at **12:22:12**
Is it supported in mainline `llama.cpp`?
---
-👤 **saood06** commented the **2025-06-24** at **16:48:14**:
+👤 **saood06** commented on **2025-06-24** at **16:48:14**
> Is it supported in mainline `llama.cpp`?
@@ -59,7 +60,7 @@ I read the blogpost, and I agree. They trained on less tokens (1.5 T vs 4T) but
---
-👤 **ikawrakow** commented the **2025-06-24** at **17:05:32**:
+👤 **ikawrakow** commented on **2025-06-24** at **17:05:32**
In that case it should (almost) work:
```
@@ -75,7 +76,7 @@ So, I guess, it is a matter of adding this `falcon-e` pre-tokenizer? Or are ther
---
-👤 **saood06** commented the **2025-06-24** at **17:12:59**:
+👤 **saood06** commented on **2025-06-24** at **17:12:59**
> So, I guess, it is a matter of adding this `falcon-e` pre-tokenizer?
@@ -89,7 +90,7 @@ None that require change it seems. Their blogpost says:
---
-👤 **ikawrakow** commented the **2025-06-24** at **17:13:41**:
+👤 **ikawrakow** commented on **2025-06-24** at **17:13:41**
Well, pretending that `falcon_e` is the same as `falcon3`, it appears to work:
```
@@ -119,7 +120,7 @@ perplexity: 0.81 seconds per pass - ETA 9.58 minutes
---
-👤 **ikawrakow** commented the **2025-06-24** at **17:16:23**:
+👤 **ikawrakow** commented on **2025-06-24** at **17:16:23**
This is the diff that makes it work:
```
@@ -141,6 +142,6 @@ index a70d2582..de91e687 100644
---
-👤 **ikawrakow** commented the **2025-06-25** at **07:21:17**:
+👤 **ikawrakow** commented on **2025-06-25** at **07:21:17**
-See #555 and let me know of it works.
\ No newline at end of file
+See [#555](https://github.com/ikawrakow/ik_llama.cpp/issues/555) and let me know of it works.
\ No newline at end of file
diff --git a/github-data/issues/561 - Feature Request_ Tencent Hunyuan-A13B model support.md b/github-data/issues/561 - Feature Request Tencent Hunyuan-A13B model support.md
similarity index 67%
rename from github-data/issues/561 - Feature Request_ Tencent Hunyuan-A13B model support.md
rename to github-data/issues/561 - Feature Request Tencent Hunyuan-A13B model support.md
index 8566fa07c..6205ba912 100644
--- a/github-data/issues/561 - Feature Request_ Tencent Hunyuan-A13B model support.md
+++ b/github-data/issues/561 - Feature Request Tencent Hunyuan-A13B model support.md
@@ -1,4 +1,4 @@
-### ✨ [#561](https://github.com/ikawrakow/ik_llama.cpp/issues/561) - Feature Request: Tencent Hunyuan-A13B model support
+## 📌 [Issue #561](https://github.com/ikawrakow/ik_llama.cpp/issues/561) - Feature Request: Tencent Hunyuan-A13B model support
| **Author** | `Downtown-Case` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
80B/13B active MoE, good benchmarks. Seems right up ik_llama.cpp's alley, aka expert offloading like deepseek.
@@ -18,9 +18,9 @@ Relevant main llama.cpp issue: https://github.com/ggml-org/llama.cpp/issues/1441
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-06-27** at **21:09:18**:
+👤 **ubergarm** commented on **2025-06-27** at **21:09:18**
I took a look at mainline's PR and it isn't quite working there yet.
@@ -36,17 +36,17 @@ I'll look at it again this weekend if I have some time.
---
-👤 **saood06** commented the **2025-06-27** at **21:38:02**:
+👤 **saood06** commented on **2025-06-27** at **21:38:02**
>I took a look at mainline's PR and it isn't quite working there yet.
Yep, it is a draft and says "STILL WIP".
-Once it is functional, I could port this model as it does interest me as well, but I'm not sure how much time I'll have this weekend, assuming no one else has until after then I'll do it (and I'll also port dots as requested in #543 as well since that hasn't been done).
+Once it is functional, I could port this model as it does interest me as well, but I'm not sure how much time I'll have this weekend, assuming no one else has until after then I'll do it (and I'll also port dots as requested in [#543](https://github.com/ikawrakow/ik_llama.cpp/issues/543) as well since that hasn't been done).
---
-👤 **ubergarm** commented the **2025-06-28** at **16:39:06**:
+👤 **ubergarm** commented on **2025-06-28** at **16:39:06**
Thanks @saood06
@@ -90,17 +90,7 @@ Feel free to use anything in my WIP version to continue or test. It doesn't have
---
-👤 **ubergarm** commented the **2025-06-28** at **16:39:06**:
-
-Thanks @saood06
-
-I have a [rough branch porting much of what mainline was doing](https://github.com/ubergarm/ik_llama.cpp/tree/ug/hunyuan-moe), but am gonna work on some other personal priority things today and wait for the dust to settle given I couldn't even get Hunyuan-A13B working with their vllm patch. Their release seems pretty rough around the edges thus far.
-
-Feel free to use anything in my WIP version to continue or test.
-
----
-
-👤 **Downtown-Case** commented the **2025-06-30** at **16:24:50**:
+👤 **Downtown-Case** commented on **2025-06-30** at **16:24:50**
An interesting (and now buried) comment:
@@ -120,7 +110,7 @@ Seems mainline llama.cpp is getting good performance without implementing that,
---
-👤 **ikawrakow** commented the **2025-06-30** at **16:35:34**:
+👤 **ikawrakow** commented on **2025-06-30** at **16:35:34**
We don't have an issue here dealing with a variable number of selected experts due to [SER](https://github.com/ikawrakow/ik_llama.cpp/pull/239).
@@ -128,7 +118,7 @@ Concerning speeding up: you never want to offload tensors that are in RAM to the
---
-👤 **Downtown-Case** commented the **2025-06-30** at **17:30:22**:
+👤 **Downtown-Case** commented on **2025-06-30** at **17:30:22**
I mispoke, I meant to say that unecessary experts shouldn't be used for token generation (not PP), which is what I assumed the quote is talking about? And I didn't mean to use 'offload' in that context.
@@ -138,15 +128,7 @@ I am super excited for this model in ik_llama.cpp because it's the perfect targe
---
-👤 **Downtown-Case** commented the **2025-06-30** at **17:30:22**:
-
-I mispoke, I meant to say that unecessary experts shouldn't be used for token generation (not PP), which is what I assumed the quote is talking about? And I didn't mean to use 'offload,' of course the CPU is the device to use here.
-
-Anyway, that's awesome! I am still unfamiliar with ik_llama.cpp, but SER seems similar to what Tencent presumably trained in.
-
----
-
-👤 **ubergarm** commented the **2025-06-30** at **18:18:24**:
+👤 **ubergarm** commented on **2025-06-30** at **18:18:24**
@Downtown-Case
@@ -154,7 +136,7 @@ I made an attempt using mainline's fresh PR. Feel free to test. Example command
---
-👤 **Downtown-Case** commented the **2025-07-07** at **03:44:17**:
+👤 **Downtown-Case** commented on **2025-07-07** at **03:44:17**
Got bogged down, apologies, but I'm now testing the PR. Thanks for the quant and the recipe @ubergarm! That's a huge help.
@@ -198,51 +180,7 @@ Thanks again. Next I will text much more complex 64K+ prompts, and maybe give th
---
-👤 **Downtown-Case** commented the **2025-07-07** at **03:44:17**:
-
-Got bogged down, apologies, but I'm now testing the PR. Thanks for the quant and the recipe @ubergarm! That's a huge help.
-
-This does feel like one _overtuned_ model. Just a few examples, with a temperature of 1:
-
-It does not like raw completion, or (in my testing, not pictured) skipping the thinking block:
-
-
-
-It very often, very confidently messes up the block, even at zero temperature.
-
-
-
- It's also notable that none of the think/answer tags are individual tokens! So more chance to mess up from sampling there:
-
-
-
-It loops very easily at the slightest deviation (again, this is a temperature of 1 + topK 10, relatively high these days but also one many default to):
-
-
-
-And it's also *hyper* confident about some in-sentence tokens at 1 temperature, which I don't see in other models much:
-
-
-
-***
-
-...Yet it does seem smart!
-
-I think this model is hyper sensitive to sampling and its chat/think templates, and really needs sampling dialed in to stay sane.
-
-***
-
-I *also* encountered a seperate issue, at least once, where sampling seemed to mess up when the model was trying to generate a . It would go off the rails, and mikupad would return invalid logprobs, like something broke inside ik_llama.cpp... but now I can't replicate it.
-
-***
-
-Thanks again. Next I will text much more complex 64K+ prompts, and maybe give the base model a shot using your formula.
-
-...Maybe this instruct model would benefit from a merge with its base? That's helped less overtuned models than this.
-
----
-
-👤 **saood06** commented the **2025-07-07** at **04:05:29**:
+👤 **saood06** commented on **2025-07-07** at **04:05:29**
>...Yet it does seem smart!
>[...]
@@ -258,7 +196,7 @@ The mikupad screenshots are nice, I often do look at the probabilities to unders
---
-👤 **Downtown-Case** commented the **2025-07-07** at **04:55:42**:
+👤 **Downtown-Case** commented on **2025-07-07** at **04:55:42**
@saood06 Ah, lm_head being in a weird place with the merge, right? Hello again!
@@ -278,35 +216,15 @@ Yeah I just meant to re-use the formula.
---
-👤 **Downtown-Case** commented the **2025-07-07** at **04:55:42**:
-
-@saood06 Ah, lm_head being in a weird place, right? Hello again!
-
-Cohere models are _still_ problematic, heh: https://github.com/turboderp-org/exllamav3/issues/53
-
-https://github.com/turboderp-org/exllamav3/issues/34#issuecomment-2854186639
-
-I wonder if that tensor plotting script would show any 'surgery' on A13B...
-
-Anyway, yeah, Mikupad's a great way to `understand the model` via repeated sampling testing, continuing prompts using the notebook format, peaking at the sampling and such; couldn't put it any better myself. It also happens to be good at 64K+ prompts, whereas most UIs bog down trying to display them.
-
-Hence the screenshots don't completely convey it, but this A13B quant does feel funky but usable, and it *does* seem to comprehend quick long context tests.
-
-> I wouldn't reuse the imatrix.dat between the base model and the instruct model (reusing the formula makes sense though).
-
-Yeah I just meant to re-use the formula.
-
----
-
-👤 **saood06** commented the **2025-07-07** at **05:29:09**:
+👤 **saood06** commented on **2025-07-07** at **05:29:09**
> Ah, lm_head being in a weird place with the merge, right? Hello again!
Yep, glad you remember me.
-> Cohere models are _still_ problematic, heh: [turboderp-org/exllamav3#53](https://github.com/turboderp-org/exllamav3/issues/53)
+> Cohere models are _still_ problematic, heh: [turboderp-org/exllamav3[#53](https://github.com/ikawrakow/ik_llama.cpp/issues/53)](https://github.com/turboderp-org/exllamav3/issues/53)
>
-> [turboderp-org/exllamav3#34 (comment)](https://github.com/turboderp-org/exllamav3/issues/34#issuecomment-2854186639)
+> [turboderp-org/exllamav3[#34](https://github.com/ikawrakow/ik_llama.cpp/issues/34) (comment)](https://github.com/turboderp-org/exllamav3/issues/34#issuecomment-2854186639)
That reminds me of these needles in a visualization of SD3 on [reddit](https://www.reddit.com/r/StableDiffusion/comments/1dgikbm/i_made_a_simple_workflow_to_manually_inject_noise/l8stl9u/). It is interesting to see. I wouldn't blame Cohere for the mergekit bug though (as that didn't even just happen to them).
@@ -320,7 +238,7 @@ Yep, also is convenient for steering a model (and understanding the model and it
>It also happens to be good at 64K+ prompts, whereas most UIs bog down trying to display them.
-Interesting to hear, I never went that high before I switched to mikupad. I'm curious how large your database has gotten (and if you used the techniques I posted about to compress it)? I do want the prediction preview to do what [this](https://github.com/the-crypt-keeper/LLooM) does (taking advantage of this repo's good batched performance which I think might need some `server.cpp` changes [see #199])
+Interesting to hear, I never went that high before I switched to mikupad. I'm curious how large your database has gotten (and if you used the techniques I posted about to compress it)? I do want the prediction preview to do what [this](https://github.com/the-crypt-keeper/LLooM) does (taking advantage of this repo's good batched performance which I think might need some `server.cpp` changes [see [#199](https://github.com/ikawrakow/ik_llama.cpp/issues/199)])
> Hence the screenshots don't completely convey it, but this A13B quant does feel "funky but usable," like it's _trying_ to break past its tendancy to loop and obsession with the prompt formatting. It does seem to comprehend quick long context tests, but I need to run more.
@@ -328,7 +246,7 @@ That is good to hear, this model can fit on my 3090 machine which would probably
---
-👤 **Downtown-Case** commented the **2025-07-07** at **06:31:24**:
+👤 **Downtown-Case** commented on **2025-07-07** at **06:31:24**
I am running A13B on a 3090/DDR5 system (up to 60K-ish so far), and its plenty fast, with q8_0/q5_1 cache. I will check token/s next time I look.
@@ -342,7 +260,7 @@ I had some 128k+ prompts I ran before that I intend to remake and try.
---
-👤 **saood06** commented the **2025-07-07** at **06:49:38**:
+👤 **saood06** commented on **2025-07-07** at **06:49:38**
> I am running A13B on a 3090/DDR5 system (up to 60K-ish so far), and its plenty fast, with q8_0/q5_1 cache. I will check token/s next time I look.
@@ -354,11 +272,11 @@ It is? I see it hasn't been updated in a while, but don't see it being depreciat
>Exui would also continue from the _cursor_, in the middle of the tex, which is awesome for testing and editing.
-Ooh, not sure when I'd use that. Mikupad has the control right click menu which is close. I could see a toggle for enabling a mode that allows that (could add it to my roadmap in #558 if you think it is that worthwhile).
+Ooh, not sure when I'd use that. Mikupad has the control right click menu which is close. I could see a toggle for enabling a mode that allows that (could add it to my roadmap in [#558](https://github.com/ikawrakow/ik_llama.cpp/issues/558) if you think it is that worthwhile).
> My mikupad db's only 3.1MB now, but only because I just switched to the standalone nodejs server.
-#558 offers support with `server.cpp` directly (if you do use it, be warned there will be more migrations needed until I switch it to ready) alongside some other benefits (and more in the works and on the roadmap [suggestions highly welcome]).
+[#558](https://github.com/ikawrakow/ik_llama.cpp/issues/558) offers support with `server.cpp` directly (if you do use it, be warned there will be more migrations needed until I switch it to ready) alongside some other benefits (and more in the works and on the roadmap [suggestions highly welcome]).
> I had some 128k+ prompts I ran before that I intend to remake and try.
@@ -366,31 +284,7 @@ If they are still in the browser export and import can work as an alternative to
---
-👤 **saood06** commented the **2025-07-07** at **06:49:38**:
-
-> I am running A13B on a 3090/DDR5 system (up to 60K-ish so far), and its plenty fast, with q8_0/q5_1 cache. I will check token/s next time I look.
-
-DDR4 here, and to be honest for me t/s doesn't matter for this usage unless it is slow (aka below reading speed).
-
-> text-gen-web-ui is _awful_, really most everything I tried is except exui, which is now (sadly) depreciated.
-
-It is? I see it hasn't been updated in a while, but don't see it being depreciated. I know mikupad is in a state where the owner hasn't responded to any of the issues/PR's people have made in ~6 months, which is a major part of why I'm doing work on it here now.
-
->Exui would also continue from the _cursor_, in the middle of the tex, which is awesome for testing and editing.
-
-Ooh, not sure when I'd use that. Mikupad has the control right click menu which is close. I could see a toggle for enabling a mode that allows that (could add it to my roadmap in #558 if you think it is that worthwhile).
-
-> My mikupad db's only 3.1MB now, but only because I just switched to the standalone nodejs server.
-
-#558 offers support with `server.cpp` directly (if you do use it, be warned there will be more migrations needed until I switch it to ready) alongside some other benefits (and more in the works and on the roadmap [suggestions highly welcome]).
-
-> I had some 128k+ prompts I ran before that I intend to remake and try.
-
-If they are still in the browser export and import can work (it is why the first thing I contributed to mikupad was the bulk import for migrating my sessions from my browser version, I already had the files so I never added a bulk export [seems worth adding to my roadmap]).
-
----
-
-👤 **saood06** commented the **2025-07-08** at **07:22:20**:
+👤 **saood06** commented on **2025-07-08** at **07:22:20**
> If they are still in the browser export and import can work as an alternative to remaking them (it is why the first thing I contributed to mikupad was the bulk import for migrating my sessions from my browser version, I already had the files so I never added a bulk export [seems worth adding to my roadmap]).
@@ -398,7 +292,7 @@ I added it here see: https://github.com/ikawrakow/ik_llama.cpp/pull/558/commits/
---
-👤 **ubergarm** commented the **2025-07-08** at **20:01:40**:
+👤 **ubergarm** commented on **2025-07-08** at **20:01:40**
@Downtown-Case @saood06
@@ -412,17 +306,7 @@ Also mikupad is pretty cool to inspect the token probabilities like this, great
---
-👤 **ubergarm** commented the **2025-07-08** at **20:01:40**:
-
-@Downtown-Case @saood06
-
-I already had imatrix for Pretrain as well so just uploaded it to the existing Instruct repo here if anyone wants to experiment with it: https://huggingface.co/ubergarm/Hunyuan-A13B-Instruct-GGUF/tree/main
-
-fwiw mainline did merge their PR for Hunyuan. Not sure how we're going to proceed here given something still seems fishy with the Instruct. I don't know how to merge the Instruct with the Pretrain but if either of you do and release the safetensors I'd be curious to check out the results.
-
----
-
-👤 **ubergarm** commented the **2025-07-08** at **20:29:20**:
+👤 **ubergarm** commented on **2025-07-08** at **20:29:20**
Oh hey there was a patch from tencent fixing the model chat template, i've removed the few lines and am testing perplexity again. https://github.com/ggml-org/llama.cpp/pull/14584
@@ -460,13 +344,13 @@ Final estimate: PPL = 524.7090 +/- 5.70049
---
-👤 **ubergarm** commented the **2025-07-08** at **21:38:38**:
+👤 **ubergarm** commented on **2025-07-08** at **21:38:38**
-I've updated PR #565 with the small patch to chat template. Perplexity is still wonky (I didn't re-make imatrix with the patch but don't believe `llama_chat_apply_template_internal()` is used during imatrix creation.
+I've updated PR [#565](https://github.com/ikawrakow/ik_llama.cpp/issues/565) with the small patch to chat template. Perplexity is still wonky (I didn't re-make imatrix with the patch but don't believe `llama_chat_apply_template_internal()` is used during imatrix creation.
---
-👤 **saood06** commented the **2025-07-09** at **01:59:58**:
+👤 **saood06** commented on **2025-07-09** at **01:59:58**
> I already had imatrix for Pretrain as well so just uploaded it to the existing Instruct repo here if anyone wants to experiment with it: https://huggingface.co/ubergarm/Hunyuan-A13B-Instruct-GGUF/tree/main
@@ -486,12 +370,12 @@ The legacy server has that feature as well (I used it a lot), but mikupad is sti
---
-👤 **ubergarm** commented the **2025-07-09** at **19:09:22**:
+👤 **ubergarm** commented on **2025-07-09** at **19:09:22**
@Downtown-Case okay the PR is merged! feel free to close this issue now! ty!
---
-👤 **ikawrakow** commented the **2025-07-12** at **09:53:30**:
+👤 **ikawrakow** commented on **2025-07-12** at **09:53:30**
-Closed via #565
\ No newline at end of file
+Closed via [#565](https://github.com/ikawrakow/ik_llama.cpp/issues/565)
\ No newline at end of file
diff --git a/github-data/issues/568 - Feature Request_ ERNIE MoE Model Support.md b/github-data/issues/568 - Feature Request ERNIE MoE Model Support.md
similarity index 86%
rename from github-data/issues/568 - Feature Request_ ERNIE MoE Model Support.md
rename to github-data/issues/568 - Feature Request ERNIE MoE Model Support.md
index 0308d0d43..77c3d4aea 100644
--- a/github-data/issues/568 - Feature Request_ ERNIE MoE Model Support.md
+++ b/github-data/issues/568 - Feature Request ERNIE MoE Model Support.md
@@ -1,4 +1,4 @@
-### ✨ [#568](https://github.com/ikawrakow/ik_llama.cpp/issues/568) - Feature Request: ERNIE MoE Model Support
+## 📌 [Issue #568](https://github.com/ikawrakow/ik_llama.cpp/issues/568) - Feature Request: ERNIE MoE Model Support
| **Author** | `Downtown-Case` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
New MoE series from Baidu: https://github.com/PaddlePaddle/ERNIE
@@ -44,9 +44,9 @@ I can't keep up with any of this, lol.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Downtown-Case** commented the **2025-07-01** at **19:48:31**:
+👤 **Downtown-Case** commented on **2025-07-01** at **19:48:31**
From the paper:
@@ -79,7 +79,7 @@ There's also details on KV cache quantization.
---
-👤 **Ph0rk0z** commented the **2025-07-11** at **12:25:10**:
+👤 **Ph0rk0z** commented on **2025-07-11** at **12:25:10**
I think we're going to be stuck trying to run Paddle. If it does also quant kv, that means fully offloaded ernie on 4x3090. Their deepseek quant size is impressive too.. only 184GB.
@@ -87,12 +87,6 @@ There's a PR: https://github.com/ggml-org/llama.cpp/pull/14658 that can be porte
---
-👤 **Ph0rk0z** commented the **2025-07-11** at **12:25:10**:
-
-I think we're going to be stuck trying to run Paddle. If it does also quant kv, that means fully offloaded ernie on 4x3090. Their deepseek quant size is impressive too.. only 184GB.
-
----
-
-👤 **fizzAI** commented the **2025-07-18** at **02:17:16**:
+👤 **fizzAI** commented on **2025-07-18** at **02:17:16**
The above PR (https://github.com/ggml-org/llama.cpp/pull/14658) was just finalized and merged into mainline, would be nice to see if anyone is smart enough to port it properly :3
\ No newline at end of file
diff --git a/github-data/issues/572 - Bug_ Oops_ggml_compute_forward_sum_rows_f32_ ffn_moe_weights_sum-60_ fo.md b/github-data/issues/572 - Bug Oopsggml_compute_forward_sum_rows_f32 ffn_moe_weights_sum-60 found nan on De.md
similarity index 98%
rename from github-data/issues/572 - Bug_ Oops_ggml_compute_forward_sum_rows_f32_ ffn_moe_weights_sum-60_ fo.md
rename to github-data/issues/572 - Bug Oopsggml_compute_forward_sum_rows_f32 ffn_moe_weights_sum-60 found nan on De.md
index 63465c20e..6376f96a8 100644
--- a/github-data/issues/572 - Bug_ Oops_ggml_compute_forward_sum_rows_f32_ ffn_moe_weights_sum-60_ fo.md
+++ b/github-data/issues/572 - Bug Oopsggml_compute_forward_sum_rows_f32 ffn_moe_weights_sum-60 found nan on De.md
@@ -1,4 +1,4 @@
-### 🐛 [#572](https://github.com/ikawrakow/ik_llama.cpp/issues/572) - Bug: Oops(ggml_compute_forward_sum_rows_f32, ffn_moe_weights_sum-60): found nan, on DeepSeek V3/R1 on CUDA + CPU
+## 📌 [Issue #572](https://github.com/ikawrakow/ik_llama.cpp/issues/572) - Bug: Oops(ggml_compute_forward_sum_rows_f32, ffn_moe_weights_sum-60): found nan, on DeepSeek V3/R1 on CUDA + CPU
| **Author** | `Panchovix` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -757,21 +757,21 @@ Oops(ggml_compute_forward_sum_rows_f32, ffn_moe_weights_sum-60): found nan for i
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-03** at **13:09:02**:
+👤 **ikawrakow** commented on **2025-07-03** at **13:09:02**
So, nobody else has reported an issue such as this. But you are leaving the shared experts on the CPU. This is your intent?
---
-👤 **Panchovix** commented the **2025-07-03** at **14:05:13**:
+👤 **Panchovix** commented on **2025-07-03** at **14:05:13**
Hi there, yes this is like a new issue that I have noticed just recently but not sure since when. You mean the shexps? Basically I leave an entire layer when I can on a GPU, or 1 layer on 2 GPUs if it's too big when increasing ubatch size.
---
-👤 **ikawrakow** commented the **2025-07-03** at **14:40:44**:
+👤 **ikawrakow** commented on **2025-07-03** at **14:40:44**
Can you try if you can reproduce on 8e5106b20f694c84811b073b3a4f86ca9d871441 ?
@@ -779,7 +779,7 @@ Thanks.
---
-👤 **Panchovix** commented the **2025-07-03** at **16:05:29**:
+👤 **Panchovix** commented on **2025-07-03** at **16:05:29**
Was testing on that commit but got it again sadly
@@ -807,6 +807,6 @@ EDIT: Just wondering, would for example a unstable RAM or CPU cause this? I have
---
-👤 **Panchovix** commented the **2025-07-05** at **16:58:48**:
+👤 **Panchovix** commented on **2025-07-05** at **16:58:48**
Okay for now I have reduced my VRAM overclocks on some 4090s I was using and it seems I haven't seen the error again. So I guess it was related to that. Closing!
\ No newline at end of file
diff --git a/github-data/issues/575 - Bug_ llama-server crash with sampling order.md b/github-data/issues/575 - Bug llama-server crash with sampling order.md
similarity index 91%
rename from github-data/issues/575 - Bug_ llama-server crash with sampling order.md
rename to github-data/issues/575 - Bug llama-server crash with sampling order.md
index 463a91a78..62197b701 100644
--- a/github-data/issues/575 - Bug_ llama-server crash with sampling order.md
+++ b/github-data/issues/575 - Bug llama-server crash with sampling order.md
@@ -1,4 +1,4 @@
-### 🐛 [#575](https://github.com/ikawrakow/ik_llama.cpp/issues/575) - Bug: llama-server crash with sampling order
+## 📌 [Issue #575](https://github.com/ikawrakow/ik_llama.cpp/issues/575) - Bug: llama-server crash with sampling order
| **Author** | `mcm007` |
| :--- | :--- |
@@ -8,11 +8,11 @@
---
-#### Description
+## 📄 Description
### What happened?
-The OpenAi endpoint crashes when samplers order is specified with `--samplers "min_p;temperature"` or `--sampling-seq "mt"` after [Commit 3f111ad](https://github.com/ikawrakow/ik_llama.cpp/commit/3f111ad7bbb2d4f721332f9b2b344e48b3bbf9aa) ([add dry sampler #513 ](https://github.com/ikawrakow/ik_llama.cpp/pull/513)).
+The OpenAi endpoint crashes when samplers order is specified with `--samplers "min_p;temperature"` or `--sampling-seq "mt"` after [Commit 3f111ad](https://github.com/ikawrakow/ik_llama.cpp/commit/3f111ad7bbb2d4f721332f9b2b344e48b3bbf9aa) ([add dry sampler [#513](https://github.com/ikawrakow/ik_llama.cpp/issues/513) ](https://github.com/ikawrakow/ik_llama.cpp/pull/513)).
Behavior observed with [aider](https://aider.chat/) but can be reproduced with curl:
```
@@ -114,15 +114,15 @@ VERB [ update_slots] prompt tokenized | tid="139998054885568" timesta
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-03** at **09:24:06**:
+👤 **ikawrakow** commented on **2025-07-03** at **09:24:06**
Is this one example of many where it crashes, or is this the only sampler combination for which it crashes?
---
-👤 **mcm007** commented the **2025-07-03** at **09:59:07**:
+👤 **mcm007** commented on **2025-07-03** at **09:59:07**
After some tests, it seems that crashes when `dry` is not specified:
@@ -144,13 +144,13 @@ Working:
---
-👤 **ikawrakow** commented the **2025-07-03** at **12:45:33**:
+👤 **ikawrakow** commented on **2025-07-03** at **12:45:33**
-Thanks for the bug report. #578 should fix it.
+Thanks for the bug report. [#578](https://github.com/ikawrakow/ik_llama.cpp/issues/578) should fix it.
---
-👤 **mcm007** commented the **2025-07-03** at **20:17:21**:
+👤 **mcm007** commented on **2025-07-03** at **20:17:21**
Sorry, it has the same behavior/crash 🙄
@@ -163,13 +163,13 @@ Vulkan and all the other improvements are really appreciated.
---
-👤 **ikawrakow** commented the **2025-07-05** at **13:12:19**:
+👤 **ikawrakow** commented on **2025-07-05** at **13:12:19**
This is strange. I tested `llama-cli` with `--sampling-seq mt`, and it works fine after this PR.
---
-👤 **mcm007** commented the **2025-07-05** at **18:17:15**:
+👤 **mcm007** commented on **2025-07-05** at **18:17:15**
Indeed, just tested, `llama-cli` is working after this PR.
@@ -195,12 +195,12 @@ curl -k ik_llamacpp:8080/v1/chat/completions -H "Content-Type: application/json"
---
-👤 **firecoperana** commented the **2025-07-06** at **00:54:04**:
+👤 **firecoperana** commented on **2025-07-06** at **00:54:04**
https://github.com/ikawrakow/ik_llama.cpp/pull/588 should fix the server crash
---
-👤 **mcm007** commented the **2025-07-06** at **06:30:29**:
+👤 **mcm007** commented on **2025-07-06** at **06:30:29**
It works OK, thank you both!
\ No newline at end of file
diff --git a/github-data/issues/576 - Bug_ llama-server crash with _Deepseek2 does not support K-shift_.md b/github-data/issues/576 - Bug llama-server crash with Deepseek2 does not support K-shift.md
similarity index 81%
rename from github-data/issues/576 - Bug_ llama-server crash with _Deepseek2 does not support K-shift_.md
rename to github-data/issues/576 - Bug llama-server crash with Deepseek2 does not support K-shift.md
index c371538d9..e406417e9 100644
--- a/github-data/issues/576 - Bug_ llama-server crash with _Deepseek2 does not support K-shift_.md
+++ b/github-data/issues/576 - Bug llama-server crash with Deepseek2 does not support K-shift.md
@@ -1,4 +1,4 @@
-### 🐛 [#576](https://github.com/ikawrakow/ik_llama.cpp/issues/576) - Bug: llama-server crash with \"Deepseek2 does not support K-shift\"
+## 📌 [Issue #576](https://github.com/ikawrakow/ik_llama.cpp/issues/576) - Bug: llama-server crash with "Deepseek2 does not support K-shift"
| **Author** | `ewhacc` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -71,9 +71,9 @@ The program is not being run.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-03** at **11:38:54**:
+👤 **ikawrakow** commented on **2025-07-03** at **11:38:54**
> In what circumstance, will "Deepseek2 does not support K-shift" be shown?
@@ -81,7 +81,7 @@ When you reach the maximum context length.
---
-👤 **ewhacc** commented the **2025-07-03** at **18:15:28**:
+👤 **ewhacc** commented on **2025-07-03** at **18:15:28**
> When you reach the maximum context length.
@@ -95,7 +95,7 @@ It was ok with R1. I'm going to check with R1 again.
---
-👤 **saood06** commented the **2025-07-03** at **22:29:38**:
+👤 **saood06** commented on **2025-07-03** at **22:29:38**
> > When you reach the maximum context length.
>
@@ -112,24 +112,7 @@ You set `--parallel 2`, which makes your max context per slot (with 0 system tok
---
-👤 **saood06** commented the **2025-07-03** at **22:29:38**:
-
-> > When you reach the maximum context length.
->
-> Did I reach the maximum context length? p0=45065 just before crash.
->
-> n_keep=1 n_left=49150 n_discard=24575 n_ctx=98304 n_past=49151 n_system_tokens=0 n_cache_tokens=49151
->
-> Crashed again for the different prompt, but at the same p0=45065.
->
-
-Yes.
-
-You set `--parallel 2`, which makes your max context per slot (with 0 system tokens) to 49,152 (`98304 / 2`). Your batch size is 4,096 and so you'd expect to see the last reported context length to be between 45,056 - 49,152, which `45065` falls into.
-
----
-
-👤 **ewhacc** commented the **2025-07-04** at **05:16:41**:
+👤 **ewhacc** commented on **2025-07-04** at **05:16:41**
@saood06
diff --git a/github-data/issues/59 - Bug_ GGML Compilation Error_ undefined references to _iqk_mul_mat_.md b/github-data/issues/59 - Bug GGML Compilation Error undefined references to iqk_mul_mat.md
similarity index 92%
rename from github-data/issues/59 - Bug_ GGML Compilation Error_ undefined references to _iqk_mul_mat_.md
rename to github-data/issues/59 - Bug GGML Compilation Error undefined references to iqk_mul_mat.md
index 1013416ea..996bfd9a7 100644
--- a/github-data/issues/59 - Bug_ GGML Compilation Error_ undefined references to _iqk_mul_mat_.md
+++ b/github-data/issues/59 - Bug GGML Compilation Error undefined references to iqk_mul_mat.md
@@ -1,14 +1,15 @@
-### 🐛 [#59](https://github.com/ikawrakow/ik_llama.cpp/issues/59) - Bug: GGML Compilation Error: undefined references to `iqk_mul_mat'
+## 📌 [Issue #59](https://github.com/ikawrakow/ik_llama.cpp/issues/59) - Bug: GGML Compilation Error: undefined references to `iqk_mul_mat'
| **Author** | `ndavidson19` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-09-18 |
| **Updated** | 2024-09-26 |
+| **Labels** | `wontfix` |
---
-#### Description
+## 📄 Description
### What happened?
@@ -109,21 +110,21 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2024-09-19** at **06:55:56**:
+👤 **ikawrakow** commented on **2024-09-19** at **06:55:56**
I use `cmake`, so the Makefile is less solid than it shou be. Have you tried `make clean && make -j`? I'm away for a few days and will look at the problem when I come back.
---
-👤 **ndavidson19** commented the **2024-09-19** at **16:39:19**:
+👤 **ndavidson19** commented on **2024-09-19** at **16:39:19**
Same error happens with those commands. No rush will try to build via `cmake` on this particular server.
---
-👤 **ikawrakow** commented the **2024-09-21** at **16:05:29**:
+👤 **ikawrakow** commented on **2024-09-21** at **16:05:29**
So, I don't really see what could be wrong with the `Makefile`. The `Makefile`, inherited from `llama.cpp`, is of course useless as it does not reflect the actual build artifact dependencies. E.g., here is what we have as a build rule for `ggml.o`, which is the core of the whole system
```
@@ -145,6 +146,6 @@ Thanks!
---
-👤 **ikawrakow** commented the **2024-09-26** at **16:20:39**:
+👤 **ikawrakow** commented on **2024-09-26** at **16:20:39**
I'm not getting a response, and without the full output of the `make` command it is not possible to see what might be going wrong, so closing.
\ No newline at end of file
diff --git a/github-data/issues/596 - Bug_ Lastest commit broke llama-cli on Windows - mmq.cuh_107_ fatal err.md b/github-data/issues/596 - Bug Lastest commit broke llama-cli on Windows - mmq.cuh107 fatal error.md
similarity index 92%
rename from github-data/issues/596 - Bug_ Lastest commit broke llama-cli on Windows - mmq.cuh_107_ fatal err.md
rename to github-data/issues/596 - Bug Lastest commit broke llama-cli on Windows - mmq.cuh107 fatal error.md
index 61717d94a..a08ac6051 100644
--- a/github-data/issues/596 - Bug_ Lastest commit broke llama-cli on Windows - mmq.cuh_107_ fatal err.md
+++ b/github-data/issues/596 - Bug Lastest commit broke llama-cli on Windows - mmq.cuh107 fatal error.md
@@ -1,4 +1,4 @@
-### 🐛 [#596](https://github.com/ikawrakow/ik_llama.cpp/issues/596) - Bug: Lastest commit broke llama-cli on Windows - mmq.cuh:107: fatal error
+## 📌 [Issue #596](https://github.com/ikawrakow/ik_llama.cpp/issues/596) - Bug: Lastest commit broke llama-cli on Windows - mmq.cuh:107: fatal error
| **Author** | `Thireus` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -82,15 +82,15 @@ D:\a\ik_llama.cpp\ik_llama.cpp\ggml\src\ggml-cuda\mmq.cuh:107: fatal error
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-10** at **07:52:57**:
+👤 **ikawrakow** commented on **2025-07-10** at **07:52:57**
What is the quantization mix being used?
---
-👤 **Thireus** commented the **2025-07-11** at **18:24:49**:
+👤 **Thireus** commented on **2025-07-11** at **18:24:49**
This one: https://github.com/Thireus/GGUF-Tool-Suite/blob/main/recipe_examples/DeepSeek-R1-0528.THIREUS-3.1027bpw-3.3372ppl.242GB-GGUF_11GB-GPU_231GB-CPU.3c88ec6_adc8101.recipe
@@ -101,7 +101,7 @@ Edit: Link edited.
---
-👤 **ikawrakow** commented the **2025-07-12** at **06:47:34**:
+👤 **ikawrakow** commented on **2025-07-12** at **06:47:34**
The link you posted gives 404. But even if it worked, we know that the HF tensor viewer does not work when the model contains `ik_llama.cpp` specific quantization types.
@@ -109,7 +109,7 @@ How hard is it to to post the portion of the log that tells us how many tensors
---
-👤 **Thireus** commented the **2025-07-12** at **07:19:00**:
+👤 **Thireus** commented on **2025-07-12** at **07:19:00**
I'm not sure what you mean by "HF tensor viewer", I'm not using it.
@@ -315,34 +315,27 @@ D:\a\ik_llama.cpp\ik_llama.cpp\ggml\src\ggml-cuda\mmq.cuh:107: fatal error
---
-👤 **Thireus** commented the **2025-07-12** at **07:19:00**:
+👤 **ikawrakow** commented on **2025-07-12** at **09:27:32**
-Any of these won't work:
-https://github.com/Thireus/GGUF-Tool-Suite/tree/main/recipe_examples
-
----
-
-👤 **ikawrakow** commented the **2025-07-12** at **09:27:32**:
-
-Does #603 fix it for you?
+Does [#603](https://github.com/ikawrakow/ik_llama.cpp/issues/603) fix it for you?
There were two more commits after the commit that actually breaks it for your mix that uses `IQ1_M`, a typically not used quantization type.
---
-👤 **Thireus** commented the **2025-07-12** at **11:08:41**:
+👤 **Thireus** commented on **2025-07-12** at **11:08:41**
Thanks, I'll take a look now and will report back. It'll take a few hours.
---
-👤 **Thireus** commented the **2025-07-12** at **18:35:40**:
+👤 **Thireus** commented on **2025-07-12** at **18:35:40**
@ikawrakow, the fix is working! Thank you so much.
---
-👤 **saood06** commented the **2025-07-13** at **07:56:18**:
+👤 **saood06** commented on **2025-07-13** at **07:56:18**
>The link you posted gives 404. But even if it worked, we know that the HF tensor viewer does not work when the model contains ik_llama.cpp specific quantization types.
>
@@ -352,18 +345,8 @@ It no longer gives a 404 (I didn't see one). It is better than HF tensor viewer,
---
-👤 **saood06** commented the **2025-07-13** at **07:56:18**:
-
->The link you posted gives 404. But even if it worked, we know that the HF tensor viewer does not work when the model contains ik_llama.cpp specific quantization types.
->
->How hard is it to to post the portion of the log that tells us how many tensors there are from what type?
-
-It no longer gives a 404. It is better than HF tensor viewer, it is a documented custom regex string.
-
----
-
-👤 **ikawrakow** commented the **2025-07-13** at **09:37:13**:
+👤 **ikawrakow** commented on **2025-07-13** at **09:37:13**
> It no longer gives a 404 (I didn't see one). It is better than HF tensor viewer, it is a documented custom regex string.
-Yes, I saw it after the link became accessible. That's how I knew what the issue was, and fixed it in #603.
\ No newline at end of file
+Yes, I saw it after the link became accessible. That's how I knew what the issue was, and fixed it in [#603](https://github.com/ikawrakow/ik_llama.cpp/issues/603).
\ No newline at end of file
diff --git a/github-data/issues/597 - Feature Request_ Add THUDM_GLM-4-MoE-100B-A10B support.md b/github-data/issues/597 - Feature Request Add THUDMGLM-4-MoE-100B-A10B support.md
similarity index 80%
rename from github-data/issues/597 - Feature Request_ Add THUDM_GLM-4-MoE-100B-A10B support.md
rename to github-data/issues/597 - Feature Request Add THUDMGLM-4-MoE-100B-A10B support.md
index bc5705afc..ff698aeae 100644
--- a/github-data/issues/597 - Feature Request_ Add THUDM_GLM-4-MoE-100B-A10B support.md
+++ b/github-data/issues/597 - Feature Request Add THUDMGLM-4-MoE-100B-A10B support.md
@@ -1,14 +1,15 @@
-### ✨ [#597](https://github.com/ikawrakow/ik_llama.cpp/issues/597) - Feature Request: Add THUDM/GLM-4-MoE-100B-A10B support
+## 📌 [Issue #597](https://github.com/ikawrakow/ik_llama.cpp/issues/597) - Feature Request: Add THUDM/GLM-4-MoE-100B-A10B support
| **Author** | `ubergarm` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-07-10 |
| **Updated** | 2025-07-14 |
+| **Labels** | `enhancement` |
---
-#### Description
+## 📄 Description
The THUDM dev [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR) seems to be adding support for a new yet unreleased `THUDM/GLM-4-MoE-100B-A10B` model architechture to vLLM currently [here](https://github.com/vllm-project/vllm/pull/20736/files#diff-c2cd72327248d1c1aa3d4b29ec9e47314d9893bfeff94e927841cd640fac84c1R351)
@@ -20,9 +21,9 @@ If it looks promising, I might try to add support for this nice sized MoE when i
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **arch-btw** commented the **2025-07-14** at **23:51:59**:
+👤 **arch-btw** commented on **2025-07-14** at **23:51:59**
Yes, I look forward to this release myself!
diff --git a/github-data/issues/60 - Bug_ Illegal instruction on NEON and Q4_0_4_4.md b/github-data/issues/60 - Bug Illegal instruction on NEON and Q4_0_4_4.md
similarity index 91%
rename from github-data/issues/60 - Bug_ Illegal instruction on NEON and Q4_0_4_4.md
rename to github-data/issues/60 - Bug Illegal instruction on NEON and Q4_0_4_4.md
index 6987ed084..e3be97c60 100644
--- a/github-data/issues/60 - Bug_ Illegal instruction on NEON and Q4_0_4_4.md
+++ b/github-data/issues/60 - Bug Illegal instruction on NEON and Q4_0_4_4.md
@@ -1,14 +1,15 @@
-### 🐛 [#60](https://github.com/ikawrakow/ik_llama.cpp/issues/60) - Bug: Illegal instruction on NEON and Q4_0_4_4
+## 📌 [Issue #60](https://github.com/ikawrakow/ik_llama.cpp/issues/60) - Bug: Illegal instruction on NEON and Q4_0_4_4
| **Author** | `whoreson` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-09-19 |
| **Updated** | 2024-09-19 |
+| **Labels** | `wontfix` |
---
-#### Description
+## 📄 Description
### What happened?
@@ -81,8 +82,8 @@ Thread 7 "llama-cli" received signal SIGILL, Illegal instruction.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2024-09-19** at **08:45:58**:
+👤 **ikawrakow** commented on **2024-09-19** at **08:45:58**
I never use or check `Q4_0_4_4` or `Q4_0_8_8`. Also, I will definitely not try to debug several hundred lines of ARM assembly written by someone else - closing.
\ No newline at end of file
diff --git a/github-data/issues/600 - Feature Request_ Port --reasoning-budget from main llamacpp _llamaserve.md b/github-data/issues/600 - Feature Request Port --reasoning-budget from main llamacpp llamaserver.md
similarity index 99%
rename from github-data/issues/600 - Feature Request_ Port --reasoning-budget from main llamacpp _llamaserve.md
rename to github-data/issues/600 - Feature Request Port --reasoning-budget from main llamacpp llamaserver.md
index 656646906..3e0121ff3 100644
--- a/github-data/issues/600 - Feature Request_ Port --reasoning-budget from main llamacpp _llamaserve.md
+++ b/github-data/issues/600 - Feature Request Port --reasoning-budget from main llamacpp llamaserver.md
@@ -1,14 +1,15 @@
-### ✨ [#600](https://github.com/ikawrakow/ik_llama.cpp/issues/600) - Feature Request: Port --reasoning-budget from main llamacpp (llamaserver)
+## 📌 [Issue #600](https://github.com/ikawrakow/ik_llama.cpp/issues/600) - Feature Request: Port --reasoning-budget from main llamacpp (llamaserver)
| **Author** | `Panchovix` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-07-11 |
| **Updated** | 2025-07-12 |
+| **Labels** | `enhancement`, `help wanted` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -452,8 +453,8 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-12** at **09:55:54**:
+👤 **ikawrakow** commented on **2025-07-12** at **09:55:54**
Looks like a useful feature, but it is not my coup of tee to copy stuff from mainline. Hence, adding a "help wanted" label and looking forward to a PR from another contributor.
\ No newline at end of file
diff --git a/github-data/issues/601 - Bug_ llama-imatrix crashing.md b/github-data/issues/601 - Bug llama-imatrix crashing.md
similarity index 70%
rename from github-data/issues/601 - Bug_ llama-imatrix crashing.md
rename to github-data/issues/601 - Bug llama-imatrix crashing.md
index e3282194b..f6ac489f4 100644
--- a/github-data/issues/601 - Bug_ llama-imatrix crashing.md
+++ b/github-data/issues/601 - Bug llama-imatrix crashing.md
@@ -1,14 +1,14 @@
-### 🐛 [#601](https://github.com/ikawrakow/ik_llama.cpp/issues/601) - Bug: llama-imatrix crashing
+## 📌 [Issue #601](https://github.com/ikawrakow/ik_llama.cpp/issues/601) - Bug: llama-imatrix crashing
| **Author** | `Lissanro` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-07-12 |
-| **Updated** | 2025-07-19 |
+| **Updated** | 2025-07-27 |
---
-#### Description
+## 📄 Description
### What happened?
@@ -100,9 +100,9 @@ fatal error
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Lissanro** commented the **2025-07-12** at **02:56:11**:
+👤 **Lissanro** commented on **2025-07-12** at **02:56:11**
I should have checked with llama.cpp imatrix before reporting:
@@ -115,13 +115,13 @@ Please consider this bug report as a request to for clearer error message... if
---
-👤 **ikawrakow** commented the **2025-07-12** at **06:45:17**:
+👤 **ikawrakow** commented on **2025-07-12** at **06:45:17**
So, because of the issues around DeepSeek and the MLA tensors that can be different between mainline and `ik_llama.cpp` I disabled the tensor number check that triggers in mainline. That of course leads to the situation where faulty model will load but then crash because of missing tensors.
---
-👤 **ubergarm** commented the **2025-07-12** at **15:49:53**:
+👤 **ubergarm** commented on **2025-07-12** at **15:49:53**
Heya @Lissanro here is the script I use that has worked on DeepSeek-R1, V3, V3-0324, R1-0528, and the new TNG Chimera models. Keep in mind if u got back to the `-fmoe` closed PR it mentions not to use that when doing imatrix to get data for the individual tensors. This is a dual socket intel xeon 6980P with 768GB RAM per numa node (SNC=Disable gives one numa node per socket):
@@ -146,28 +146,7 @@ P.S. I have done it the mainline way by casting the fp8 to bf16 safetensors then
---
-👤 **ubergarm** commented the **2025-07-12** at **15:49:53**:
-
-Heya @Lissanro here is the script I use that has worked on DeepSeek-R1, V3, V3-0324, R1-0528, and the new TNG Chimera models. Keep in mind if u got back to the `-fmoe` closed PR it mentions not to use that when doing imatrix to get data for the individual tensors. This is a dual socket intel xeon 6980P with 768GB RAM per numa node (SNC=Disable gives one numa node per socket):
-
-```bash
-numactl -N 0 -m 0 \
-./build/bin/llama-imatrix \
- -m /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/DeepSeek-TNG-R1T2-Chimera-Q8_0.gguf \
- -f ubergarm-imatrix-calibration-corpus-v02.txt \
- -o /mnt/raid/models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/imatrix-DeepSeek-TNG-R1T2-Chimera-Q8_0.dat \
- --verbosity 1 \
- --ctx-size 512 \
- --layer-similarity \
- --numa numactl \
- --threads 128
-```
-
-I only ever convert fp8 safetensors via the evshiron llama.cpp fork (made from fairydreaming's original MLA stuf) plus triton-cpu to get bf16 GGUFs directly without need for > sm89 architechture GPU or any GPU at all.
-
----
-
-👤 **saood06** commented the **2025-07-12** at **21:04:32**:
+👤 **saood06** commented on **2025-07-12** at **21:04:32**
> llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 1147, got 1025 llama_model_load_from_file_impl: failed to load model
>
@@ -177,7 +156,7 @@ I don't know if it is an "incomplete quant", as 1025 tensors is what I see in my
---
-👤 **Lissanro** commented the **2025-07-12** at **22:30:30**:
+👤 **Lissanro** commented on **2025-07-12** at **22:30:30**
@ubergarm
Thank you, I was making it work without crashing. As it turned out the issue wasn't missing tensors (rebuilding from scratch did not help), but it seems some extra options in my command were crashing it. When I used your command with some adjustment to my system (I have only 64 cores) and paths, it started working, however I tried without any GPUs for now. I will try carefully to add GPU options when I am not using another model actively.
@@ -206,7 +185,7 @@ I wonder, does that mean extra MLA tensors were bundled in the original FP8 mode
---
-👤 **saood06** commented the **2025-07-12** at **23:05:01**:
+👤 **saood06** commented on **2025-07-12** at **23:05:01**
> I wonder, does that mean extra MLA tensors were bundled in the original FP8 model, or did /convert_hf_to_gguf.py add them? I did not take a note how many tensors R1 and R1T original FP8 models had when I was converting them, so not sure if this one is different or the same.
@@ -218,13 +197,13 @@ Hopefully this helps you understand.
>Thank you, I was making it work without crashing. As it turned out the issue wasn't missing tensors (rebuilding from scratch did not help), but it seems some extra options in my command were crashing it. When I used your command with some adjustment to my system (I have only 64 cores) and paths, it started working, however I tried without any GPUs for now. I will try carefully to add GPU options when I am not using another model actively.
-I have very limited experience in creating imatrix files, but I do remember `-fmoe` was stated as not compatible as "this option cannot be used when computing an imatrix because than the intermediate results remain in temporary work buffers, hence will not be propagated to collect activation statistics for the up_exps and gate_exps tensors." (from #229).
+I have very limited experience in creating imatrix files, but I do remember `-fmoe` was stated as not compatible as "this option cannot be used when computing an imatrix because than the intermediate results remain in temporary work buffers, hence will not be propagated to collect activation statistics for the up_exps and gate_exps tensors." (from [#229](https://github.com/ikawrakow/ik_llama.cpp/issues/229)).
I'm not sure if that was the only issue, but it seems like it may have been an issue.
---
-👤 **ubergarm** commented the **2025-07-12** at **23:22:43**:
+👤 **ubergarm** commented on **2025-07-12** at **23:22:43**
@Lissanro
@@ -248,29 +227,7 @@ If I make a quant converted with the mainline two step cast method, it also show
---
-👤 **ubergarm** commented the **2025-07-12** at **23:22:43**:
-
-@Lissanro
-
-> however I tried without any GPUs for now
-
-Glad you're able to get it to run at least on CPU. Curious if it would work with CUDA too.
-
-> This is how I converted from fp8 to bf16:
-
-Wait are you using mainline llama.cpp to do the conversion `python3 /home/lissanro/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py` and then ik to do the imatrix `~/pkgs/ik_llama.cpp/build/bin/llama-imatrix` ?
-
-I've only recently tried that once for an experiment trying a bunch of tiny IQ1_S quants with older quantization types possibly to run on AMD GPU but got distracted. I can't remember but some combination threw an error, either mainline llama.cpp imatrixing a `evshiron+triton-cpu` method quant or vice versa...
-
-I did grab a gguf-dump of the first bf16 file for both methods if you'd like to look, I put both of them here:
-
-https://gist.github.com/ubergarm/d9a3e89355199fc34d8c75882bcc3ab4
-
-If I make a quant converted with the mainline two step cast method, it also shows up when starting on ik_llama.cpp with that error message `missing wkv_b tensor(s) changing MLA from %d to 1`.
-
----
-
-👤 **saood06** commented the **2025-07-12** at **23:40:00**:
+👤 **saood06** commented on **2025-07-12** at **23:40:00**
> I only ever convert fp8 safetensors via the evshiron llama.cpp fork (made from fairydreaming's original MLA stuf) plus triton-cpu to get bf16 GGUFs directly without need for > sm89 architechture GPU or any GPU at all.
@@ -280,17 +237,7 @@ I don't like that this is the way I still resort to doing it (a goal of mine [ev
---
-👤 **saood06** commented the **2025-07-12** at **23:40:00**:
-
-> I only ever convert fp8 safetensors via the evshiron llama.cpp fork (made from fairydreaming's original MLA stuf) plus triton-cpu to get bf16 GGUFs directly without need for > sm89 architechture GPU or any GPU at all.
-
-I don't like that this is the way I still resort to doing it (a goal of mine [even if I haven't been working at it at all recently] is to make using any convert script outside this repo not needed for making GGUFs for models supported by this repo, Besides upcasting FP8 using triton, I know certain models like Gemma 3 and GLM-4 still aren't supported*).
-
-*Well besides the new bitnet model as they have their own standalone scripts [this](https://github.com/microsoft/BitNet/blob/main/utils/convert-ms-to-gguf-bitnet.py) and [this](https://github.com/microsoft/BitNet/blob/main/utils/convert-hf-to-gguf-bitnet.py) that I had issues using those.
-
----
-
-👤 **Lissanro** commented the **2025-07-13** at **00:25:30**:
+👤 **Lissanro** commented on **2025-07-13** at **00:25:30**
@ubergarm
@@ -304,7 +251,7 @@ And according to it, having -fmoe wasn't causing crashes in the past when creati
---
-👤 **saood06** commented the **2025-07-13** at **01:43:19**:
+👤 **saood06** commented on **2025-07-13** at **01:43:19**
> And according to it, having -fmoe wasn't causing crashes in the past when creating imatrix, this is why I was using it, I just wasn't aware it is not supported anymore for the imatrix creation (based on information in this thread, it sounds like maybe it was never really supported).
@@ -312,15 +259,7 @@ Even if it wasn't causing crashing it might explain why your imatrix file was sm
---
-👤 **saood06** commented the **2025-07-13** at **01:43:19**:
-
-> And according to it, having -fmoe wasn't causing crashes in the past when creating imatrix, this is why I was using it, I just wasn't aware it is not supported anymore for the imatrix creation (based on information in this thread, it sounds like maybe it was never really supported).
-
-Even if it wasn't causing crashing it might explain why your imatrix file was smaller than it should have been. (130 MB vs 987 MB).
-
----
-
-👤 **ubergarm** commented the **2025-07-14** at **15:51:51**:
+👤 **ubergarm** commented on **2025-07-14** at **15:51:51**
@saood06
@@ -340,7 +279,7 @@ Not sure how it will pan out, but I think we'll get there eventually!
---
-👤 **ikawrakow** commented the **2025-07-14** at **16:20:27**:
+👤 **ikawrakow** commented on **2025-07-14** at **16:20:27**
> My goal is to get a "small" Kimi-K2-Instruct GGUF using ik's SOTA quants. However, it is a slightly modified DeepSeek architecture with more routed exps, only one ffn dense layer up front (instead of 3), and less MLA heads I believe.
@@ -356,7 +295,7 @@ Unless you are worried about model size and want to squeeze out the last bit pos
---
-👤 **saood06** commented the **2025-07-14** at **18:40:33**:
+👤 **saood06** commented on **2025-07-14** at **18:40:33**
> Yeah I wasn't sure where ik_llama.cpp convert_hf_to_gguf.py stands and skipped porting over the python code on GLM-4 and also Hunyuan-A13B....
@@ -382,7 +321,7 @@ Will read through that. (Edit: Gave a reply there as well).
>and I also decided to go with Compilade's unmerged imatrix GGUF PR as it still saves data even when the routed exps are not 100% (it was dropping a lot at first). Not sure on how compatible that "imatrix.gguf" will be here if I convert it back to ".dat"...
-You mean to accomplish something similar to #202. I've been saying on mainline that nicoboss's fork was based on this PR (since I was the one who reported the issue that lead to the creation of that PR and went back and told them and they made their fork based on that).
+You mean to accomplish something similar to [#202](https://github.com/ikawrakow/ik_llama.cpp/issues/202). I've been saying on mainline that nicoboss's fork was based on this PR (since I was the one who reported the issue that lead to the creation of that PR and went back and told them and they made their fork based on that).
> Not sure how it will pan out, but I think we'll get there eventually!
@@ -390,25 +329,25 @@ Let me know if you need me to help with anything.
---
-👤 **ikawrakow** commented the **2025-07-14** at **19:40:31**:
+👤 **ikawrakow** commented on **2025-07-14** at **19:40:31**
> and I also decided to go with Compilade's unmerged imatrix GGUF PR as it still saves data even when the routed exps are not 100% (it was dropping a lot at first). Not sure on how compatible that "imatrix.gguf" will be here if I convert it back to ".dat"...
If you insist on calculating the imatrix with mainline, you absolutely need compilade's PR. Not because "it still saves data even when the routed exps are not 100%", but because without that PR mainline calculates broken self-attention imatrix data for MLA models (and has been doing that for the last 3 months, and before that it couldn't because it did not support MLA).
-Having said that, there is nothing in compilade's PR that has not been solved here a long time ago. Given that #609 has been merged, I would calculate the imatrix data with `ik_llama.cpp` if I were you.
+Having said that, there is nothing in compilade's PR that has not been solved here a long time ago. Given that [#609](https://github.com/ikawrakow/ik_llama.cpp/issues/609) has been merged, I would calculate the imatrix data with `ik_llama.cpp` if I were you.
---
-👤 **saood06** commented the **2025-07-14** at **19:51:08**:
+👤 **saood06** commented on **2025-07-14** at **19:51:08**
->Having said that, there is nothing in compilade's PR that has not been solved here a long time ago. Given that [#609](https://github.com/ikawrakow/ik_llama.cpp/pull/609) has been merged, I would calculate the imatrix data with `ik_llama.cpp` if I were you.
+>Having said that, there is nothing in compilade's PR that has not been solved here a long time ago. Given that [[#609](https://github.com/ikawrakow/ik_llama.cpp/issues/609)](https://github.com/ikawrakow/ik_llama.cpp/pull/609) has been merged, I would calculate the imatrix data with `ik_llama.cpp` if I were you.
-I agree about generating the imatrix data with `ik_llama.cpp`, but the one thing that has not been solved (at least not ideally in my opinion) is turning the FP8 source file into BF16 but it seems like @ubergarm is already past that point based on the HF thread (also just to clarify this is a separate issue outside the scope of #609 or the compilade PR).
+I agree about generating the imatrix data with `ik_llama.cpp`, but the one thing that has not been solved (at least not ideally in my opinion) is turning the FP8 source file into BF16 but it seems like @ubergarm is already past that point based on the HF thread (also just to clarify this is a separate issue outside the scope of [#609](https://github.com/ikawrakow/ik_llama.cpp/issues/609) or the compilade PR).
---
-👤 **ubergarm** commented the **2025-07-14** at **20:01:59**:
+👤 **ubergarm** commented on **2025-07-14** at **20:01:59**
Thanks y'all, and yes I *want* to use ik_llama.cpp imatrix!!
@@ -428,7 +367,7 @@ I realize I could make imatrix with ik_llama.cpp using a mainline quant, but the
---
-👤 **saood06** commented the **2025-07-14** at **20:06:39**:
+👤 **saood06** commented on **2025-07-14** at **20:06:39**
> I had never understood exactly what step messes up the MLA tensors with the "mainline fp8_cast_bf16.py -> convert_hf_to_gguf.py method" vs what I use here referred to as the "evshiron+triton-cpu direct fp8 -> bf16 gguf method".
@@ -436,7 +375,7 @@ The python script that converts the safetensors into a GGUF is the one that dete
---
-👤 **ubergarm** commented the **2025-07-14** at **20:10:40**:
+👤 **ubergarm** commented on **2025-07-14** at **20:10:40**
> The python script that converts the safetensors into a GGUF is the one that determines what MLA tensors you end up with.
@@ -448,19 +387,7 @@ Thanks for your patience I can be pretty slow on the uptake sometimes haha
---
-👤 **ubergarm** commented the **2025-07-14** at **20:10:40**:
-
-> The python script that converts the safetensors into a GGUF is the one that determines what MLA tensors you end up with.
-
-Yup, I never quite realized that as the evshiron method being a single step confused me. I never grokked where exactly things were happening until going through this all today.
-
-I link to the different code in question [in this comment here](https://github.com/ikawrakow/ik_llama.cpp/pull/609#issuecomment-3070754157)
-
-Thanks for your patience I can be pretty slow on the uptake sometimes haha
-
----
-
-👤 **saood06** commented the **2025-07-14** at **20:21:48**:
+👤 **saood06** commented on **2025-07-14** at **20:21:48**
> > The python script that converts the safetensors into a GGUF is the one that determines what MLA tensors you end up with.
>
@@ -478,7 +405,7 @@ Thank you for doing all this. It helps a lot of people, so I'm glad to assist wh
---
-👤 **ubergarm** commented the **2025-07-14** at **20:37:39**:
+👤 **ubergarm** commented on **2025-07-14** at **20:37:39**
> Yes that isn't the most intuitive, but it is really convenient.
@@ -502,7 +429,7 @@ $ cat quantize-Kimi-K2-Instruct-mainline-Q8_0.log | grep attn_kv_b
---
-👤 **saood06** commented the **2025-07-14** at **20:43:12**:
+👤 **saood06** commented on **2025-07-14** at **20:43:12**
> Yeah, though fortunately now I have a method to use triton-cpu (with your help patching that) and use deepseek's fp8_cast_bf16.py directly to avoid needing enough VRAM or >=sm89 arch for fp8e4m3 support.
@@ -519,24 +446,7 @@ The point to linking the old comment is not for the conclusion or even about com
---
-👤 **saood06** commented the **2025-07-14** at **20:43:12**:
-
-> Yeah, though fortunately now I have a method to use triton-cpu (with your help patching that) and use deepseek's fp8_cast_bf16.py directly to avoid needing enough VRAM or >=sm89 arch for fp8e4m3 support.
-
-I never did that as once you have triton-cpu the evshiron method saves you a step so I always did that.
-
-> Ahh yes, I have definitely read this before, but it didn't sink in, and notes are scattered across so many platforms these days alas... Here it is again for my future self to stuble on it:
->
-> > So in conclusion if the model has all three attn_k_b.weight, attn_v_b.weight and attn_kv_b.weight or just attn_kv_b.weight it will work here, but if it has attn_k_b.weight and attn_v_b.weight but no attn_kv_b.weight it will not work here.
->
-
-NO. The conclusion to that comment is outdated (and I say so in the comment).
-
-The point to linking the old comment is not for the conclusion or even about compatibility, it is just about the differing MLA tensors amongst GGUFs that exist.
-
----
-
-👤 **ubergarm** commented the **2025-07-14** at **20:58:05**:
+👤 **ubergarm** commented on **2025-07-14** at **20:58:05**
> NO. The conclusion to that comment is outdated (and I say so in the comment).
>
@@ -546,7 +456,7 @@ I think I'm doing too many things at the same time, sorry to misunderstand yet a
---
-👤 **saood06** commented the **2025-07-14** at **21:05:08**:
+👤 **saood06** commented on **2025-07-14** at **21:05:08**
> I think I'm doing too many things at the same time, sorry to misunderstand yet again lol.
@@ -558,7 +468,7 @@ Yes (and some from even before any MLA implementation exists). I was linking it
---
-👤 **ubergarm** commented the **2025-07-14** at **21:10:13**:
+👤 **ubergarm** commented on **2025-07-14** at **21:10:13**
Right, I added edited the comment above and stuck this in there: `EDIT BY UBERGARM To be clear ik_llama.cpp does support mainline quants despite mainline changing the MLA tensors!!!`
@@ -568,7 +478,7 @@ Thanks!
---
-👤 **Lissanro** commented the **2025-07-19** at **06:04:11**:
+👤 **Lissanro** commented on **2025-07-19** at **06:04:11**
I tried to rebuilt my quants avoid using -fmoe and -mla 3 options, but use just -mla 1 instead. I was successfully was able to rebuild V3 quant, but R1 gives me a trouble (get nan during imatrix), I would appreciate if anyone encountered similar issues or know how to debug this.
@@ -587,51 +497,212 @@ I also tried without "-mla 1 -b 4096 -ub 4096" and it crashed in a similar way.
---
-👤 **Lissanro** commented the **2025-07-19** at **06:04:11**:
+👤 **ikawrakow** commented on **2025-07-19** at **06:47:39**
-I tried to rebuilt my quants avoid using MLA. I was successfully was able to rebuild V3 quant, but R1 gives me a trouble (get nan during imatrix), I would appreciate if anyone encountered similar issues or know how to debug this.
+This is a bummer. No-one has reported a problem such as this, so it could be useful to see the calibration data if it is not secret.
-First, I create Q8 from BF16:
+You can still use the matrix data that was saved before the NaN occurred.
-~/pkgs/ik_llama.cpp/build/bin/llama-quantize /mnt/secondary/neuro/DeepSeek-R1-0528/DeepSeek-R1-256x21B-0528-BF16.gguf /mnt/neuro/models/DeepSeek-R1-256x21B-0528-IQ4_K-163840seq/DeepSeek-R1-256x21B-0528-Q8_0.gguf Q8_0
+---
-Then I try to build imatrix:
+👤 **ubergarm** commented on **2025-07-19** at **14:42:07**
-~/pkgs/ik_llama.cpp/build/bin/llama-imatrix -m /mnt/neuro/models/DeepSeek-R1-256x21B-0528-IQ4_K-163840seq/DeepSeek-R1-256x21B-0528-Q8_0.gguf -f ~/pkgs/imatrix/compact.txt --n-gpu-layers 62 --tensor-split 25,23,26,26 -ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" --threads 64 -mla 1 -b 4096 -ub 4096
+@Lissanro
+
+Your command looks reasonable, and while i personally don't mix `-ts` and `-ot` it should be fine if its loading how you like onto your GPUs. I haven't used `-ub 4096 -b 4096` while doing imatrix, but it should be fine I just learned yesterday and still work at the default n_ctx 512 which I want.
+
+I presume you compiled with `-DGGML_CUDA_IQK_FORCE_BF16=1` to avoid nans specifically with DeepSeek/MLA models e.g.:
+```bash
+cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
+cmake --build build --config Release -j $(nproc)
+```
+
+Otherwise yes bummer indeed.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-19** at **18:05:54**
+
+> I presume you compiled with -DGGML_CUDA_IQK_FORCE_BF16=1 to avoid nans specifically with DeepSeek/MLA models
+
+That was relevant only for quants that did not have quantized matrix multiplications (a.k.a., MMQ), and hence dequantized to `f16` by default, which resulted in NaNs for DeepSeek. This is no longer relevant as all quants have MMQ now. It never was relevant for `Q8_0`.
+
+---
+
+👤 **ubergarm** commented on **2025-07-19** at **18:16:11**
+
+> This is no longer relevant as all quants have MMQ now. It never was relevant for Q8_0.
+
+Thanks for the clarification, the 😈 is in the details. I'll do my best to not perpetuate incorrect collective knowledge for all eternity... 😅
+
+---
+
+👤 **Lissanro** commented on **2025-07-22** at **15:47:48**
+
+@ikawrakow Sure, I can share my imatrix data and from where I got it, I just wanted to try one more time to be sure before providing more details. I tried this time without `-mla 1` and without `-ub 4096 -b 4096` options (with them outcome is the same):
+
+```
+~/pkgs/ik_llama.cpp/build/bin/llama-imatrix \
+-m /mnt/neuro/models/DeepSeek-R1-256x21B-0528-IQ4_K-163840seq/DeepSeek-R1-256x21B-0528-Q8_0.gguf \
+-f ~/pkgs/imatrix/compact-imatrix.txt --n-gpu-layers 62 --tensor-split 25,23,26,26 \
+-ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" --threads 64
...
+save_imatrix: stored collected data after 720 chunks in imatrix.dat
+[720]4.8196,[721]4.8264,[722]4.8189,[723]4.8177,[724]4.8175,[725]4.8171,[726]4.8170,[727]4.8097,[728]4.8111,[729]4.8132,
save_imatrix: stored collected data after 730 chunks in imatrix.dat
[730]4.8195,[731]4.8186,[732]4.8137,[733]4.8200,[734]4.8243,[735]4.8169,[736]4.8161,[737]4.8118,nan detected in blk.60.attn_output.weight
-I also tried without "-mla 1 -b 4096 -ub 4096" and it crashed in a similar way. Maybe something wrong with my Q8 or maybe I missed some imatrix option that is needed, but could not figure this out just yet.
+```
+
+Full log: https://pastebin.com/CgFtLbWB (even though the working folder named IQ4, I am using Q8_0 GGUF quant, and did not get to making IQ4 yet).
+
+My calibration data https://dragon.studio/2025/07/compact-imatrix.txt is a merge of these two:
+https://huggingface.co/ThomasBaruzier/Qwen2.5-3B-Instruct-GGUF/blob/main/calibration_datav3.txt
+https://huggingface.co/datasets/eaddario/imatrix-calibration/blob/main/combined_all_tiny.parquet (converted to txt first)
+
+This attempt was with updated repo from few days ago (could not update before running again due to repo not being available at the time).
+
+By the way, by using the imatrix data, do you mean I can use the resulted file as is, or can continue it somehow without starting over? Any ideas if it is something in my calibration data causing this and if yes, how to pinpoint it? It worked for V3 though, so it is a bit surprising that it fails for R1 0528.
---
-👤 **ikawrakow** commented the **2025-07-19** at **06:47:39**:
+👤 **ikawrakow** commented on **2025-07-22** at **16:05:57**
-This is a bummer. No-one has reported a problem such as this, so it could be useful to see the calibration data if it is not secret.
+> By the way, by using the imatrix data, do you mean I can use the resulted file as is, or can continue it somehow without starting over? Any ideas if it is something in my calibration data causing this and if yes, how to pinpoint it? It worked for V3 though, so it is a bit surprising that it fails for R1 0528.
-You can still use the matrix data that was saved before the NaN occurred.
+It got saved to `imatrix.dat` after 730 chunks. You can use this file to create quantized models. Rename it to something else to not lose it when you try the following:
+* Add `--from-chunk 738` to your imatrix command line. This will start computing from chunk 738 thus skipping chunk 737 that produces the NaN
+* If that goes well, you can combine the two (or more) imatrix files by using
+```
+./bin/imatrix --in-file file1 --in-file file2 [--in-file file3 ...] -o combined_imatrix.dat
+```
---
-👤 **ubergarm** commented the **2025-07-19** at **14:42:07**:
+👤 **magikRUKKOLA** commented on **2025-07-25** at **00:02:15**
+
+> /home/lissanro/pkgs/ik_llama.cpp/ggml/src/ggml.c:15229: fatal error
+
+Uh oh this nasty error.
+
+I can confirm it still present when running the latest ik_llama.cpp master with some 6 bpw+ quants from the Thireus (for the INFERENCE).
+
+In my case it was solved by removing the -fmoe from the command. Tried to look into it, but failed :( which is sad. It was somewhat related to munlock.
+
+[EDIT]: pretty sure I can reproduce it for the PPL calculations with -fmoe, low-free-ram and some absend swap file.
+
+---
+
+👤 **Lissanro** commented on **2025-07-25** at **15:01:32**
+
+@ikawrakow Thank you so much! It is good to know that it is possible to combine imatrix files like that, and besides bypassing NaN errors, opens up a possibility to build in few cycles rather than in one go. Once I completed the imatrix, I think it worked well - Q4_K_XS is about as good or better then Q4_K_M quant I had previously with wrongly built imatrix, and reasoning quality breakdown that I originally observed when testing XS is no longer happens.
+
+* * *
+
+By the way, does anyone here have an idea how to convert Kimi K2 FP8 to BF16? I thought it is based on DeepSeek architecture so thought at least quantizing it going to be easy, but so far I could not found any way to convert it to BF16 -
+for example, llama.cpp folk https://github.com/evshiron/llama.cpp that had script capable of converting FP8 to BF16 for DeepSeek V3 and R1 crashes when loading Kimi K2 safetensors:
+
+```
+python3 /home/lissanro/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py \
+--outtype bf16 --outfile /mnt/Toshiba_Canvio_4TB/Kimi-K2-Instruct/Kimi-K2-Instruct-BF16.gguf \
+/mnt/neuro/models/Kimi-K2-Instruct/
+...
+Traceback (most recent call last):
+ File "/home/lissanro/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py", line 5244, in
+ main()
+ ~~~~^^
+ File "/home/lissanro/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py", line 5238, in main
+ model_instance.write()
+ ~~~~~~~~~~~~~~~~~~~~^^
+ File "/home/lissanro/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py", line 440, in write
+ self.prepare_metadata(vocab_only=False)
+ ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
+ File "/home/lissanro/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py", line 433, in prepare_metadata
+ self.set_vocab()
+ ~~~~~~~~~~~~~~^^
+ File "/home/lissanro/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py", line 4058, in set_vocab
+ self._set_vocab_gpt2()
+ ~~~~~~~~~~~~~~~~~~~~^^
+ File "/home/lissanro/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py", line 728, in _set_vocab_gpt2
+ tokens, toktypes, tokpre = self.get_vocab_base()
+ ~~~~~~~~~~~~~~~~~~~^^
+ File "/home/lissanro/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py", line 522, in get_vocab_base
+ tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
+ File "/home/lissanro/.local/lib/python3.13/site-packages/transformers/models/auto/tokenization_auto.py", line 946, in from_pretrained
+ tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
+ File "/home/lissanro/.local/lib/python3.13/site-packages/transformers/models/auto/tokenization_auto.py", line 800, in get_tokenizer_config
+ result = json.load(reader)
+ File "/usr/lib/python3.13/json/__init__.py", line 293, in load
+ return loads(fp.read(),
+ cls=cls, object_hook=object_hook,
+ parse_float=parse_float, parse_int=parse_int,
+ parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
+ File "/usr/lib/python3.13/json/__init__.py", line 346, in loads
+ return _default_decoder.decode(s)
+ ~~~~~~~~~~~~~~~~~~~~~~~^^^
+ File "/usr/lib/python3.13/json/decoder.py", line 348, in decode
+ raise JSONDecodeError("Extra data", s, end)
+json.decoder.JSONDecodeError: Extra data: line 165 column 2 (char 5595)
+```
+
+(full log: https://pastebin.com/3GmeCK9d )
+
+convert_hf_to_gguf.py script from ik_llama.cpp and original llama.cpp does not seem to support FP8 unfortunately (so I assume this is not actually a bug, hence this is why I do not open an issue for it, and why I am trying alternatives that can work with CPU or 3090 cards).
+
+---
+
+👤 **ubergarm** commented on **2025-07-26** at **18:20:07**
@Lissanro
-Your command looks reasonable, and while i personally don't mix `-ts` and `-ot` it should be fine if its loading how you like onto your GPUs. I haven't used `-ub 4096 -b 4096` while doing imatrix, but it should be fine I just learned yesterday and still work at the default n_ctx 512 which I want.
+> convert_hf_to_gguf.py script from ik_llama.cpp and original llama.cpp does not seem to support FP8 unfortunately
+
+Both myself and anikifoss have converted Kimi-K2 using this repo's `convert_hf_to_gguf.py` script. I've written about all three methods to get fp8 safetensors into bf16 GGUFs here: https://github.com/ggml-org/llama.cpp/issues/14762#issuecomment-3098571703
+
+If you have the disk space, the triton-cpu method works quite well though takes some time. Otherwise you can grab pre-made bf16 GGUFs from like unsloth or folks which have already been passed through DeepSeek's original fp8_cast_bf16 python script but not yet converted.
+
+---
+
+👤 **Lissanro** commented on **2025-07-26** at **18:51:10**
+
+@ubergarm
+I actually managed to get conversion going today, but it was non-trivial and required some modifications. The error I shared above was the easiest to solve and caused by a change of the json in the upstream repository - my download script double checks that all files fully downloaded and the new version was slightly bigger so ended up as a corrupted file that I had to redownload.
+
+But, then it complained about requiring trust_remote_code=True to the AutoTokenizer.from_pretrained function call, then some other things... later on, about some unprocessed experts. So, I ended up with this patch:
+
+https://dragon.studio/2025/07/lama.cpp-fp8-to-bf16-patch.diff
+
+It was based on the differences between the https://github.com/evshiron/llama.cpp fork and the upstream llama.cpp related to the conversion script and Kimi K2 updates.
+
+I am still in the process of converting the model though, and then will need to build imatrix and final quant, so it will be a while before I actually test the result to see if it works as expected. But I though I share the patch here in case it is useful to someone. I used this command to start the conversion process:
-I presume you compiled with `-DGGML_CUDA_IQK_FORCE_BF16=1` to avoid nans specifically with DeepSeek/MLA models e.g.:
-```bash
-cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_VULKAN=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
-cmake --build build --config Release -j $(nproc)
+```
+python3 ~/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py \
+--outtype bf16 \
+--outfile /mnt/neuro/Kimi-K2-Instruct/Kimi-K2-Instruct-BF16.gguf \
+/mnt/neuro/models/Kimi-K2-Instruct/
```
-Otherwise yes bummer indeed.
+_(I am on 4G connection so even downloading original FP8 was quite a challenge; there is also a risk to get bandwidth throttled if I go beyond some undefined traffic limit, this is why I could not just download premade BF16 after downloading FP8 and had to figure out the solution)_
---
-👤 **ikawrakow** commented the **2025-07-19** at **18:05:54**:
+👤 **ubergarm** commented on **2025-07-26** at **19:14:45**
-> I presume you compiled with -DGGML_CUDA_IQK_FORCE_BF16=1 to avoid nans specifically with DeepSeek/MLA models
+@Lissanro
+
+Oh amazing you got the Kimi-K2 fp8 safetensors over 4G! Yes wouldn't want to download the 2TB bf16 safetensors! Glad you got it to convert and thanks for sharing your patch.
+
+> and then will need to build imatrix
+
+I assume you've been following along with some questions about imatrix for MLA tensors in the other threads, which I've added an issue for now here: https://github.com/ikawrakow/ik_llama.cpp/issues/651
+
+I'd be curious if your imatrix run for kimi-k2-instruct gives you data for those attn_k_b and attn_v_b tensors. You're welcome to use my imatrix, however, I don't think it contains that data despite my running it with `-mla 1` as explained in the issue.
+
+---
+
+👤 **saood06** commented on **2025-07-27** at **01:21:24**
+
+> But I though I share the patch here in case it is useful to someone. I used this command to start the conversion process:
-That was relevant only for quants that did not have quantized matrix multiplications (a.k.a., MMQ), and hence dequantized to `f16` by default, which resulted in NaNs for DeepSeek. This is no longer relevant as all quants have MMQ now. It never was relevant for `Q8_0`.
\ No newline at end of file
+Thanks. It may end up being useful in some way to me.
\ No newline at end of file
diff --git a/github-data/issues/605 - Bug_ IQ3_KS missing from GGMLQuantizationType - gguf_reader.py script c.md b/github-data/issues/605 - Bug IQ3_KS missing from GGMLQuantizationType - gguf_reader.py script cannot proc.md
similarity index 86%
rename from github-data/issues/605 - Bug_ IQ3_KS missing from GGMLQuantizationType - gguf_reader.py script c.md
rename to github-data/issues/605 - Bug IQ3_KS missing from GGMLQuantizationType - gguf_reader.py script cannot proc.md
index 63fca0b50..4fa0cb6d6 100644
--- a/github-data/issues/605 - Bug_ IQ3_KS missing from GGMLQuantizationType - gguf_reader.py script c.md
+++ b/github-data/issues/605 - Bug IQ3_KS missing from GGMLQuantizationType - gguf_reader.py script cannot proc.md
@@ -1,4 +1,4 @@
-### 🐛 [#605](https://github.com/ikawrakow/ik_llama.cpp/issues/605) - Bug: IQ3_KS missing from GGMLQuantizationType - gguf_reader.py script cannot process IQ3_KS tensors
+## 📌 [Issue #605](https://github.com/ikawrakow/ik_llama.cpp/issues/605) - Bug: IQ3_KS missing from GGMLQuantizationType - gguf_reader.py script cannot process IQ3_KS tensors
| **Author** | `Thireus` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
diff --git a/github-data/issues/614 - Feature Request_ port no-mmproj-offload.md b/github-data/issues/614 - Feature Request port no-mmproj-offload.md
similarity index 74%
rename from github-data/issues/614 - Feature Request_ port no-mmproj-offload.md
rename to github-data/issues/614 - Feature Request port no-mmproj-offload.md
index 9ba9203aa..d8cc6534d 100644
--- a/github-data/issues/614 - Feature Request_ port no-mmproj-offload.md
+++ b/github-data/issues/614 - Feature Request port no-mmproj-offload.md
@@ -1,14 +1,15 @@
-### ✨ [#614](https://github.com/ikawrakow/ik_llama.cpp/issues/614) - Feature Request: port no-mmproj-offload
+## 📌 [Issue #614](https://github.com/ikawrakow/ik_llama.cpp/issues/614) - Feature Request: port no-mmproj-offload
| **Author** | `erazortt` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-07-15 |
| **Updated** | 2025-07-16 |
+| **Labels** | `enhancement` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -31,8 +32,8 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-16** at **09:19:35**:
+👤 **ikawrakow** commented on **2025-07-16** at **09:19:35**
-There is no vision support at all in `ik_llama.cpp`, see my response in #615
\ No newline at end of file
+There is no vision support at all in `ik_llama.cpp`, see my response in [#615](https://github.com/ikawrakow/ik_llama.cpp/issues/615)
\ No newline at end of file
diff --git a/github-data/issues/615 - Bug_ Gemma3 Vision not working.md b/github-data/issues/615 - Bug Gemma3 Vision not working.md
similarity index 76%
rename from github-data/issues/615 - Bug_ Gemma3 Vision not working.md
rename to github-data/issues/615 - Bug Gemma3 Vision not working.md
index 1a1e8d2f2..e2be7c93a 100644
--- a/github-data/issues/615 - Bug_ Gemma3 Vision not working.md
+++ b/github-data/issues/615 - Bug Gemma3 Vision not working.md
@@ -1,4 +1,4 @@
-### 🐛 [#615](https://github.com/ikawrakow/ik_llama.cpp/issues/615) - Bug: Gemma3 Vision not working
+## 📌 [Issue #615](https://github.com/ikawrakow/ik_llama.cpp/issues/615) - Bug: Gemma3 Vision not working
| **Author** | `erazortt` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -36,20 +36,20 @@ Windows
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **jmcook** commented the **2025-07-16** at **03:36:05**:
+👤 **jmcook** commented on **2025-07-16** at **03:36:05**
That's funny, I was just trying the same thing tonight and noticed the same thing!
---
-👤 **ikawrakow** commented the **2025-07-16** at **09:18:39**:
+👤 **ikawrakow** commented on **2025-07-16** at **09:18:39**
Sorry, there is no vision support in `ik_llama.cpp` at all. As I know nothing about vision or multi-modality, my suggestion is to try to convince @ngxson to contribute the multi-modality library he created for `llama.cpp` also to `ik_llama.cpp`.
---
-👤 **ikawrakow** commented the **2025-07-19** at **09:27:13**:
+👤 **ikawrakow** commented on **2025-07-19** at **09:27:13**
I think I'll close this one. A feature request can be opened instead.
\ No newline at end of file
diff --git a/github-data/issues/625 - Bug_ undefined symbol errors after successful compilation.md b/github-data/issues/625 - Bug undefined symbol errors after successful compilation.md
similarity index 92%
rename from github-data/issues/625 - Bug_ undefined symbol errors after successful compilation.md
rename to github-data/issues/625 - Bug undefined symbol errors after successful compilation.md
index 6bb2329c1..8f425270b 100644
--- a/github-data/issues/625 - Bug_ undefined symbol errors after successful compilation.md
+++ b/github-data/issues/625 - Bug undefined symbol errors after successful compilation.md
@@ -1,4 +1,4 @@
-### 🐛 [#625](https://github.com/ikawrakow/ik_llama.cpp/issues/625) - Bug: undefined symbol errors after successful compilation
+## 📌 [Issue #625](https://github.com/ikawrakow/ik_llama.cpp/issues/625) - Bug: undefined symbol errors after successful compilation
| **Author** | `samteezy` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -79,9 +79,9 @@ Ubuntu 24.04 running in Proxmox LXC
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-18** at **06:12:11**:
+👤 **ikawrakow** commented on **2025-07-18** at **06:12:11**
It looks like a confusion between `llama.cpp` and `ik_llama.cpp` libraries. I suspect `llama.cpp` is installed system-wide, so when the `ik_llama.cpp` server is started it picks up the `llama.cpp` DLLs.
@@ -93,7 +93,7 @@ export LD_LIBRARY_PATH="/root/llama-builds/ik_llama.cpp/bin:$LD_LIBRARY_PATH"
---
-👤 **samteezy** commented the **2025-07-18** at **12:14:29**:
+👤 **samteezy** commented on **2025-07-18** at **12:14:29**
Yep, that was root cause. I've been restructuring my llama environment to use local, static builds of both `llama.cpp` and `ik_llama.cpp` this morning using `-DBUILD_SHARED_LIBS=OFF` and now they're both working great.
Thanks for all your hard work!
\ No newline at end of file
diff --git a/github-data/issues/626 - Feature Request_ Add IQK GEMM for IQ1_M.md b/github-data/issues/626 - Feature Request Add IQK GEMM for IQ1_M.md
similarity index 89%
rename from github-data/issues/626 - Feature Request_ Add IQK GEMM for IQ1_M.md
rename to github-data/issues/626 - Feature Request Add IQK GEMM for IQ1_M.md
index 43cd194ec..659b01fd6 100644
--- a/github-data/issues/626 - Feature Request_ Add IQK GEMM for IQ1_M.md
+++ b/github-data/issues/626 - Feature Request Add IQK GEMM for IQ1_M.md
@@ -1,14 +1,16 @@
-### ✨ [#626](https://github.com/ikawrakow/ik_llama.cpp/issues/626) - Feature Request: Add IQK GEMM for IQ1_M
+## 📌 [Issue #626](https://github.com/ikawrakow/ik_llama.cpp/issues/626) - Feature Request: Add IQK GEMM for IQ1_M
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-07-18 |
| **Updated** | 2025-07-18 |
+| **Labels** | `enhancement` |
+| **Assignees** | `ikawrakow` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -37,9 +39,9 @@ Either add IQK GEMM for `IQ1_M`, or at least quard against the absence of a GEMM
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-07-18** at **14:43:32**:
+👤 **ubergarm** commented on **2025-07-18** at **14:43:32**
I'll not open a new issue regarding unsloths Kimi-K2-Instruct-IQ1_S failing with `-fmoe` as discussed on other threads here and [reported on hugging face here](https://github.com/ikawrakow/ik_llama.cpp/issues/626). I also recreated the issue and observed removing `-fmoe` allows that model to run.
@@ -66,13 +68,13 @@ Given the "unsloth dynamic" is to change the tensor size up and down across laye
---
-👤 **ikawrakow** commented the **2025-07-18** at **14:46:02**:
+👤 **ikawrakow** commented on **2025-07-18** at **14:46:02**
-I created issue #626 for this, so no need to add another one.
+I created issue [#626](https://github.com/ikawrakow/ik_llama.cpp/issues/626) for this, so no need to add another one.
---
-👤 **ubergarm** commented the **2025-07-18** at **17:34:41**:
+👤 **ubergarm** commented on **2025-07-18** at **17:34:41**
Confirmed I can now run unsloths `Kimi-K2-Instruct-UD-IQ1_S-00001-of-00006.gguf` with `-fmoe`! Thanks!
diff --git a/github-data/issues/627 - Feature Request_ Tensor Parallelism.md b/github-data/issues/627 - Feature Request Tensor Parallelism.md
similarity index 85%
rename from github-data/issues/627 - Feature Request_ Tensor Parallelism.md
rename to github-data/issues/627 - Feature Request Tensor Parallelism.md
index bdb9cf5c5..c7798b5fe 100644
--- a/github-data/issues/627 - Feature Request_ Tensor Parallelism.md
+++ b/github-data/issues/627 - Feature Request Tensor Parallelism.md
@@ -1,14 +1,15 @@
-### ✨ [#627](https://github.com/ikawrakow/ik_llama.cpp/issues/627) - Feature Request: Tensor Parallelism
+## 📌 [Issue #627](https://github.com/ikawrakow/ik_llama.cpp/issues/627) - Feature Request: Tensor Parallelism
| **Author** | `rankaiyx` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-07-18 |
-| **Updated** | 2025-07-19 |
+| **Updated** | 2025-07-20 |
+| **Labels** | `enhancement` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -45,9 +46,9 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-18** at **08:05:35**:
+👤 **ikawrakow** commented on **2025-07-18** at **08:05:35**
Have you tried raising the issue with the `llama.cpp` project?
@@ -55,7 +56,7 @@ Support for old hardware is not one of the strengths of this project, while exac
---
-👤 **rankaiyx** commented the **2025-07-18** at **08:21:28**:
+👤 **rankaiyx** commented on **2025-07-18** at **08:21:28**
There is an issue.
But it's expired.
@@ -64,7 +65,7 @@ https://github.com/ggml-org/llama.cpp/issues/9086
---
-👤 **Ph0rk0z** commented the **2025-07-19** at **10:12:01**:
+👤 **Ph0rk0z** commented on **2025-07-19** at **10:12:01**
Originally Cuda Dev was supposed to work on backend agnostic TP. Someone else volunteered and made partial PRs but appears to have abandoned them. Progress is stalled.
@@ -74,15 +75,15 @@ What's interesting is fastllm, who claims to fully utilize numa and supports hyb
---
-👤 **saood06** commented the **2025-07-19** at **10:19:14**:
+👤 **saood06** commented on **2025-07-19** at **10:19:14**
>Wanted to compare with IK but then realized command-A isn't supported.
-I thought it was from #341
+I thought it was from [#341](https://github.com/ikawrakow/ik_llama.cpp/issues/341)
---
-👤 **Ph0rk0z** commented the **2025-07-19** at **15:01:49**:
+👤 **Ph0rk0z** commented on **2025-07-19** at **15:01:49**
Damn.. I missed that. Will give it a go.
@@ -94,13 +95,7 @@ Mainline: I unfortunately pulled today. My speed in parallel is only 12t/s. With
---
-👤 **Ph0rk0z** commented the **2025-07-19** at **15:01:49**:
-
-Damn.. I missed that. Will give it a go.
-
----
-
-👤 **saood06** commented the **2025-07-20** at **01:09:42**:
+👤 **saood06** commented on **2025-07-20** at **01:09:42**
> IK: Same prompt processing speed as mainline. In SM row, 17t/s generation. Loads GPU 0 like mainline used to. Unfortunately, command-A outputs what looks like parts of the training data or random text. Without SM it is coherent but only does ~12T/s
>
diff --git a/github-data/issues/629 - Multi-GPU performance _Windows_ is significantly worse than single-GPU.md b/github-data/issues/629 - Multi-GPU performance Windows is significantly worse than single-GPU.md
similarity index 98%
rename from github-data/issues/629 - Multi-GPU performance _Windows_ is significantly worse than single-GPU.md
rename to github-data/issues/629 - Multi-GPU performance Windows is significantly worse than single-GPU.md
index f91d80365..e8a836c4e 100644
--- a/github-data/issues/629 - Multi-GPU performance _Windows_ is significantly worse than single-GPU.md
+++ b/github-data/issues/629 - Multi-GPU performance Windows is significantly worse than single-GPU.md
@@ -1,14 +1,14 @@
-### 📝 [#629](https://github.com/ikawrakow/ik_llama.cpp/issues/629) - Multi-GPU performance (Windows) is significantly worse than single-GPU
+## 📌 [Issue #629](https://github.com/ikawrakow/ik_llama.cpp/issues/629) - Multi-GPU performance (Windows) is significantly worse than single-GPU
| **Author** | `sousekd` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2025-07-18 |
-| **Updated** | 2025-07-19 |
+| **Updated** | 2025-07-20 |
---
-#### Description
+## 📄 Description
Testing on a single NPS1 Epyc 9355 system equipped with an RTX 5090 and an RTX 4090, I observe slightly lower PP t/s and much lower TG t/s when both GPUs are enabled compared with using just one.
@@ -896,9 +896,9 @@ Any thoughts?
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-18** at **13:39:02**:
+👤 **ikawrakow** commented on **2025-07-18** at **13:39:02**
I think there are too many graph splits.
Look for this line or similar
@@ -928,7 +928,7 @@ If one of the suggestions helps, please let us know as this would be useful to q
---
-👤 **ubergarm** commented the **2025-07-18** at **14:31:20**:
+👤 **ubergarm** commented on **2025-07-18** at **14:31:20**
fwiw someone was asking me about Qwen3-235B on ik fork with windows also saying they weren't getting the speed-ups they were expecting with multi-GPU
@@ -959,7 +959,7 @@ They are going to test `-rtr` after some more discussion, but I'll point them he
---
-👤 **sousekd** commented the **2025-07-18** at **14:40:14**:
+👤 **sousekd** commented on **2025-07-18** at **14:40:14**
Hmmm, I see. Do I understand correctly, that with a typical layer of Qwen3 looking like this:
@@ -982,7 +982,7 @@ It doesn't. Nvidia blocks it for **consumer-level** cards in their Windows drive
---
-👤 **Panchovix** commented the **2025-07-18** at **15:19:30**:
+👤 **Panchovix** commented on **2025-07-18** at **15:19:30**
As a user with 7 GPUs, I would say just use Linux (sadly or not, depending on what you like) for LLMs, as I feel there is something wrong on Windows related to threading and multiGPU.
@@ -1017,7 +1017,7 @@ Also I know NVIDIA doesn't support nccl on Windows but I don't think affects lcp
---
-👤 **sousekd** commented the **2025-07-18** at **19:51:54**:
+👤 **sousekd** commented on **2025-07-18** at **19:51:54**
Your suggestion definitely helped, @ikawrakow.
I only experimented with @ubergarm's **Qwen3-235B-A22B-mix-IQ3_K**, as it is likely relevant to more users:
@@ -1823,7 +1823,7 @@ _threads_batch = 32
---
-👤 **sousekd** commented the **2025-07-18** at **22:28:37**:
+👤 **sousekd** commented on **2025-07-18** at **22:28:37**
Results for bartowski's **Qwen3‑235B‑A22B‑Q8_0** are less encouraging: although they're a bit better than before, the multi-GPU setup improves neither PP t/s nor TG t/s when compared with a single GPU:
@@ -2684,7 +2684,7 @@ _threads_batch = 32
---
-👤 **ikawrakow** commented the **2025-07-19** at **06:45:47**:
+👤 **ikawrakow** commented on **2025-07-19** at **06:45:47**
Thank you for these results.
@@ -2694,7 +2694,7 @@ I'm still waiting for the day when someone will decide to build a system with a
---
-👤 **sousekd** commented the **2025-07-19** at **08:14:42**:
+👤 **sousekd** commented on **2025-07-19** at **08:14:42**
Yeah, I think these results are really proof of the great optimizations you did on the CPU side… and also proof of Nvidia’s policy of deliberately disabling hardware features to drive upsells.
@@ -2706,25 +2706,15 @@ Anyway, it is amazing to see these huge models running on a relatively affordabl
---
-👤 **sousekd** commented the **2025-07-19** at **08:14:42**:
-
-Yeah, I think these results are really proof of the great optimizations you did on the CPU side… and also proof of Nvidia’s policy of deliberately disabling hardware features to drive upsells.
-
-> I'm still waiting for the day when someone will decide to build a system with a 7995WX CPU, instead of dropping the required 10 grant on buying multiple high-end GPUs. A 7995WX system with all memory banks populated with high speed RAM may not be able to compete with your system on PP performance, but I wouldn't be surprised if it beats it in TG speed.
-
-I'd probably expect the opposite - beating Epyc on PP, but not quite reaching the memory bandwidth of Epyc. But it would be nice to see - my Epyc 9355 was less then third of 7995WX price!
-
----
-
-👤 **ikawrakow** commented the **2025-07-19** at **09:18:10**:
+👤 **ikawrakow** commented on **2025-07-19** at **09:18:10**
Maybe you have posted CPU-only performance results somewhere else, but it is becoming hard to find stuff in this repository, so do you mind re-posting here? Just so one has it side-by-side to see how much you gain from adding the 5090. Thanks.
---
-👤 **sousekd** commented the **2025-07-19** at **20:12:42**:
+👤 **sousekd** commented on **2025-07-19** at **20:12:42**
-@ikawrakow Seems GPU is still quite handy for these larger models :)
+@ikawrakow Seems having at least *some* GPU is still quite handy for these larger models :)
@@ -2732,7 +2722,7 @@ Maybe you have posted CPU-only performance results somewhere else, but it is bec
---
-👤 **ikawrakow** commented the **2025-07-20** at **05:29:03**:
+👤 **ikawrakow** commented on **2025-07-20** at **05:29:03**
> @ikawrakow Seems having at least some GPU is still quite handy for these larger models :)
@@ -2748,7 +2738,7 @@ In defense of the CPU only scenario:
---
-👤 **sousekd** commented the **2025-07-20** at **10:31:10**:
+👤 **sousekd** commented on **2025-07-20** at **10:31:10**
Everything you said makes perfect sense. I also haven’t really tuned the inference parameters for optimal performance here, unlike with the CPU+GPU setup.
diff --git a/github-data/issues/633 - Bug Command-A Spits incoherence when using -sm row.md b/github-data/issues/633 - Bug Command-A Spits incoherence when using -sm row.md
new file mode 100644
index 000000000..b05df3cb6
--- /dev/null
+++ b/github-data/issues/633 - Bug Command-A Spits incoherence when using -sm row.md
@@ -0,0 +1,476 @@
+## 📌 [Issue #633](https://github.com/ikawrakow/ik_llama.cpp/issues/633) - Bug: Command-A Spits incoherence when using -sm row
+
+| **Author** | `Ph0rk0z` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Created** | 2025-07-20 |
+| **Updated** | 2025-07-23 |
+| **Labels** | `help wanted` |
+
+---
+
+## 📄 Description
+
+### What happened?
+
+Command-A speeds for IK are higher than mainline, especially since they broke something recently to knock it back from 15t/s to 12t/s. I still see 15s in YALS which uses an older commit. Prompt processing drops from 350 to 140, but I assume that's a facet of using SM row and can't be fixed. Mainline does it too.
+
+Here, I get 17, almost 18t/s, unfortunately the result is as follows:
+
+
+
+
+Is it KVcache related? SM row puts the cache all on GPU0.
+
+SM layer works correctly but T/G speeds suffer.
+
+### Name and Version
+
+Git latest.
+
+### What operating system are you seeing the problem on?
+
+Linux
+
+### Relevant log output
+
+```shell
+CUDA_VISIBLE_DEVICES=0,1,2,3 ./bin/llama-server \
+ -c 32768 \
+ --host 192.168.1.211 \
+ -ngl 99 \
+ -ctk q8_0 \
+ -ctv q8_0 \
+ -fa \
+ -sm row \
+ -b 2048 \
+ -ub 2048
+INFO [ main] build info | tid="139717304852480" timestamp=1753018161 build=3829 commit="f1323339"
+INFO [ main] system info | tid="139717304852480" timestamp=1753018161 n_threads=48 n_threads_batch=-1 total_threads=96 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+llama_model_loader: additional 1 GGUFs metadata loaded.
+llama_model_loader: loaded meta data with 47 key-value pairs and 514 tensors from Agatha-111B-v1-Q4_K_L/Agatha-111B-v1-Q4_K_L-00001-of-00002.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = cohere2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Agatha 111B v1
+llama_model_loader: - kv 3: general.version str = v1
+llama_model_loader: - kv 4: general.basename str = Agatha
+llama_model_loader: - kv 5: general.size_label str = 111B
+llama_model_loader: - kv 6: general.base_model.count u32 = 1
+llama_model_loader: - kv 7: general.base_model.0.name str = C4Ai Command A 03 2025
+llama_model_loader: - kv 8: general.base_model.0.version str = 03-2025
+llama_model_loader: - kv 9: general.base_model.0.organization str = CohereLabs
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/CohereLabs/c4a...
+llama_model_loader: - kv 11: cohere2.block_count u32 = 64
+llama_model_loader: - kv 12: cohere2.context_length u32 = 262144
+llama_model_loader: - kv 13: cohere2.embedding_length u32 = 12288
+llama_model_loader: - kv 14: cohere2.feed_forward_length u32 = 36864
+llama_model_loader: - kv 15: cohere2.attention.head_count u32 = 96
+llama_model_loader: - kv 16: cohere2.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 17: cohere2.rope.freq_base f32 = 50000.000000
+llama_model_loader: - kv 18: cohere2.attention.layer_norm_epsilon f32 = 0.000010
+llama_model_loader: - kv 19: cohere2.attention.key_length u32 = 128
+llama_model_loader: - kv 20: cohere2.attention.value_length u32 = 128
+llama_model_loader: - kv 21: cohere2.logit_scale f32 = 0.250000
+llama_model_loader: - kv 22: cohere2.attention.sliding_window u32 = 4096
+llama_model_loader: - kv 23: cohere2.vocab_size u32 = 256000
+llama_model_loader: - kv 24: cohere2.rope.dimension_count u32 = 128
+llama_model_loader: - kv 25: cohere2.rope.scaling.type str = none
+llama_model_loader: - kv 26: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 27: tokenizer.ggml.pre str = command-r
+llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,256000] = ["", "", "", "", ...
+llama_model_loader: - kv 29: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
+llama_model_loader: - kv 30: tokenizer.ggml.merges arr[str,253333] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
+llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 5
+llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 255001
+llama_model_loader: - kv 33: tokenizer.ggml.unknown_token_id u32 = 1
+llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 0
+llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 37: tokenizer.chat_template str = {{ bos_token }}{% if documents %}\n{% ...
+llama_model_loader: - kv 38: general.quantization_version u32 = 2
+llama_model_loader: - kv 39: general.file_type u32 = 15
+llama_model_loader: - kv 40: quantize.imatrix.file str = /models_out/Agatha-111B-v1-GGUF/TheDr...
+llama_model_loader: - kv 41: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
+llama_model_loader: - kv 42: quantize.imatrix.entries_count i32 = 448
+llama_model_loader: - kv 43: quantize.imatrix.chunks_count i32 = 509
+llama_model_loader: - kv 44: split.no u16 = 0
+llama_model_loader: - kv 45: split.tensors.count i32 = 514
+llama_model_loader: - kv 46: split.count u16 = 2
+llama_model_loader: - type f32: 65 tensors
+llama_model_loader: - type q8_0: 1 tensors
+llama_model_loader: - type q4_K: 384 tensors
+llama_model_loader: - type q6_K: 64 tensors
+llm_load_vocab: special tokens cache size = 41
+llm_load_vocab: token to piece cache size = 1.8428 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = cohere2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 256000
+llm_load_print_meta: n_merges = 253333
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 262144
+llm_load_print_meta: n_embd = 12288
+llm_load_print_meta: n_layer = 64
+llm_load_print_meta: n_head = 96
+llm_load_print_meta: n_head_kv = 8
+llm_load_print_meta: n_rot = 128
+llm_load_print_meta: n_swa = 4096
+llm_load_print_meta: n_swa_pattern = 4
+llm_load_print_meta: n_embd_head_k = 128
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 12
+llm_load_print_meta: n_embd_k_gqa = 1024
+llm_load_print_meta: n_embd_v_gqa = 1024
+llm_load_print_meta: f_norm_eps = 1.0e-05
+llm_load_print_meta: f_norm_rms_eps = 0.0e+00
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 2.5e-01
+llm_load_print_meta: n_ff = 36864
+llm_load_print_meta: n_expert = 0
+llm_load_print_meta: n_expert_used = 0
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = none
+llm_load_print_meta: freq_base_train = 50000.0
+llm_load_print_meta: freq_scale_train = 1
+llm_load_print_meta: n_ctx_orig_yarn = 262144
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = ?B
+llm_load_print_meta: model ftype = Q4_K - Medium
+llm_load_print_meta: model params = 111.058 B
+llm_load_print_meta: model size = 63.224 GiB (4.890 BPW)
+llm_load_print_meta: general.name = Agatha 111B v1
+llm_load_print_meta: BOS token = 5 ''
+llm_load_print_meta: EOS token = 255001 '<|END_OF_TURN_TOKEN|>'
+llm_load_print_meta: UNK token = 1 ''
+llm_load_print_meta: PAD token = 0 ''
+llm_load_print_meta: LF token = 136 'Ä'
+llm_load_print_meta: max token length = 1024
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 4 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+llm_load_tensors: ggml ctx size = 0.74 MiB
+llm_load_tensors: offloading 64 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 65/65 layers to GPU
+llm_load_tensors: CUDA_Split buffer size = 64738.50 MiB
+llm_load_tensors: CPU buffer size = 3187.50 MiB
+llm_load_tensors: CUDA0 buffer size = 3.05 MiB
+..............................................................................................
+llama_new_context_with_model: n_ctx = 32768
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 2048
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 50000.0
+llama_new_context_with_model: freq_scale = 1
+llama_kv_cache_init: CUDA0 KV buffer size = 4352.03 MiB
+llama_new_context_with_model: KV self size = 4352.00 MiB, K (q8_0): 2176.00 MiB, V (q8_0): 2176.00 MiB
+llama_new_context_with_model: CUDA_Host output buffer size = 1.95 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 2096.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 608.02 MiB
+llama_new_context_with_model: graph nodes = 1578
+llama_new_context_with_model: graph splits = 2
+```
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-07-20** at **13:54:00**
+
+So, I have tried to preserve the row split mode while making changes to the CUDA back-end, but not having a multi-GPU system to test, I may have broken something. Also, the row split mode is in whatever state it was in mainline when I forked the project a year ago, so who knows if it worked there back then.
+
+If this is broken, I would need help.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-20** at **14:02:34**
+
+It was working at the time. They since refactored it to split the cache evenly among the GPUs. Where would be a good place to start?
+
+To me this looks as if you just ran the model with empty CTX because it's coherent-ish. It processes the context and never feeds it back in when generating.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-20** at **14:17:46**
+
+> Where would be a good place to start?
+
+```
+git blame ggml-cuda.cu | grep -i kawrakow
+```
+and then try to see if I have made a change that does not consider the split row mode (or considers it but incorrectly).
+
+For sure split row mode does not work for MoE models (and it didn't work in mainline either). It would be a much bigger undertaking to fix that.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-20** at **14:31:30**
+
+I went through all the commits in mainline regarding row and so far nothing matched up. Besides ggml-cuda, I think some changes went into mmq.cuh, they hard a race condition at one point.
+
+I'm going to try some other models as well to see if it's only cohere related. Quantized cache didn't make a difference, disabling FA still produced incoherent output but it was at least tangentially related to the CTX.
+
+L2-70b is working perfectly with same settings. Sounds like command-A related issue or some facet of it's architecture.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-20** at **15:52:00**
+
+Think we getting somewhere...
+
+Cohere2 and Gemma3 both use SWA. The output from gemma3 is actually *worse*, but it exhibits similar behavior. I posit that the bug lies there.
+
+Is there a way to disable SWA when loading the models for testing short of commenting out lines in the code?
+
+---
+
+👤 **ikawrakow** commented on **2025-07-20** at **15:56:28**
+
+Before getting into debugging SWA, can you try a model that does not use SWA? So we know that split row mode is not broken?
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-20** at **16:05:50**
+
+Yes. I used DarkMiqu which is L2 70b and it works fine.
+
+Also loading a qwen2 right now to double check.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-20** at **16:35:01**
+
+Maybe spoke too soon.
+
+
+ Qwen2 Arch
+
+```
+llama_new_context_with_model: n_ctx = 32768
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 1000000.0
+llama_new_context_with_model: freq_scale = 1
+llama_kv_cache_init: CUDA0 KV buffer size = 5440.04 MiB
+llama_new_context_with_model: KV self size = 5440.00 MiB, K (q8_0): 2720.00 MiB, V (q8_0): 2720.00 MiB
+llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 313.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 80.01 MiB
+llama_new_context_with_model: graph nodes = 2246
+llama_new_context_with_model: graph splits = 2
+CUDA error: an illegal memory access was encountered
+ current device: 1, in function ggml_cuda_op_mul_mat at /home/supermicro/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:1752
+ cudaGetLastError()
+/home/supermicro/ai/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
+```
+
+
+
+GLM-Z1 also has problems. Have to take a look at what that arch is doing.
+
+Am out of different GGUF archs to try, most of the rest are EXL2. Wish I had original CR+ in gguf.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-20** at **16:44:20**
+
+Can you run `cuda-gdb --args put_your_qwen2_command_here`. Then when you see `(cuda-gdb)`, just type `run`.
+
+Then post the output when it crashes.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-20** at **17:12:10**
+
+I found that it loads if I recompile without GGML_CUDA_F16.. sadly only outputs *********. I will give you a stack trace when I come back from the grocery store.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-22** at **19:29:49**
+
+Welcome back to the land of the living. Here is where it crashes with F16 enabled.
+
+```
+CUDA Exception: Warp Illegal Address
+The exception was triggered at PC 0x7ff35529cff0
+
+Thread 1 "llama-server" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
+[Switching focus to CUDA kernel 0, grid 561, block (0,0,0), thread (192,0,0), device 1, sm 0, warp 5, lane 0]
+0x00007ff35529d000 in void convert_unary(void const*, __half*, long)<<<(32,1,1),(256,1,1)>>> ()
+```
+
+Not sure if it's all the same bug but could be related.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **10:49:02**
+
+@Ph0rk0z Thank you. This narrows it down somewhat, but it would be useful to have the backtrace to see from where this was triggered. If you have time to recreate the issue, do `bt` or `thread apply all bt` when it crashes, and post the result.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-23** at **11:51:58**
+
+I'm not very good with GDB so wasn't sure what to do after.
+
+
+ Backtrace
+
+
+```
+CUDA Exception: Warp Illegal Address
+The exception was triggered at PC 0x7ff35529cff0
+
+Thread 1 "llama-server" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
+[Switching focus to CUDA kernel 0, grid 561, block (0,0,0), thread (192,0,0), device 1, sm 0, warp 7, lane 0]
+0x00007ff35529d000 in void convert_unary(void const*, __half*, long)<<<(32,1,1),(256,1,1)>>> ()
+(cuda-gdb) thread apply all bt
+
+Thread 11 (LWP 1716794 "cuda-EvtHandlr"):
+#0 0x00007fffd1518bcf in poll () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffc6708517 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#2 0x00007fffc67cb17f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#3 0x00007fffc66f8b23 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#4 0x00007fffd1494ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#5 0x00007fffd1526850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+
+Thread 10 (LWP 1716793 "llama-server"):
+#0 0x00007fffd1491117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffd1493e9b in pthread_cond_timedwait () from /lib/x86_64-linux-gnu/libc.so.6
+#2 0x00007fffc6673eea in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#3 0x00007fffc66f8b23 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#4 0x00007fffd1494ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#5 0x00007fffd1526850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+
+Thread 9 (LWP 1716773 "cuda-EvtHandlr"):
+#0 0x00007fffd1518bcf in poll () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffc6708517 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#2 0x00007fffc67cb17f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#3 0x00007fffc66f8b23 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#4 0x00007fffd1494ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#5 0x00007fffd1526850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+
+Thread 8 (LWP 1716772 "llama-server"):
+#0 0x00007fffd1491117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffd1493e9b in pthread_cond_timedwait () from /lib/x86_64-linux-gnu/libc.so.6
+#2 0x00007fffc6673eea in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#3 0x00007fffc66f8b23 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#4 0x00007fffd1494ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#5 0x00007fffd1526850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+
+Thread 7 (LWP 1716743 "cuda-EvtHandlr"):
+#0 0x00007fffd1518bcf in poll () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffc6708517 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#2 0x00007fffc67cb17f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#3 0x00007fffc66f8b23 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#4 0x00007fffd1494ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#5 0x00007fffd1526850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+
+--Type for more, q to quit, c to continue without paging--
+Thread 6 (LWP 1716742 "llama-server"):
+#0 0x00007fffd1491117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffd1493e9b in pthread_cond_timedwait () from /lib/x86_64-linux-gnu/libc.so.6
+#2 0x00007fffc6673eea in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#3 0x00007fffc66f8b23 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#4 0x00007fffd1494ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#5 0x00007fffd1526850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+
+Thread 5 (LWP 1716697 "cuda-EvtHandlr"):
+#0 0x00007fffd1518bcf in poll () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffc6708517 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#2 0x00007fffc67cb17f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#3 0x00007fffc66f8b23 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#4 0x00007fffd1494ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#5 0x00007fffd1526850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+
+Thread 4 (LWP 1716696 "llama-server"):
+#0 0x00007fffd1491117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffd1493e9b in pthread_cond_timedwait () from /lib/x86_64-linux-gnu/libc.so.6
+#2 0x00007fffc6673eea in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#3 0x00007fffc66f8b23 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#4 0x00007fffd1494ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#5 0x00007fffd1526850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+
+Thread 3 (LWP 1716686 "llama-server"):
+#0 0x00007fffd1525e2e in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffa517f6e6 in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#2 0x00007fffa517c939 in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#3 0x00007fffa517ef3b in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#4 0x00007fffa51894ba in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#5 0x00007fffa51899bb in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#6 0x00007fffa505dc15 in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#7 0x00007fffa4ec7837 in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#8 0x00007fffd1494ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#9 0x00007fffd1526850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+
+Thread 2 (LWP 1716685 "cuda00002000009"):
+#0 0x00007fffd1518bcf in poll () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffc6708517 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#2 0x00007fffc67cb17f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#3 0x00007fffc66f8b23 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+--Type for more, q to quit, c to continue without paging--
+#4 0x00007fffd1494ac3 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#5 0x00007fffd1526850 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+
+Thread 1 (LWP 1716680 "llama-server"):
+#0 0x00007fffd1491117 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
+#1 0x00007fffd1493a41 in pthread_cond_wait () from /lib/x86_64-linux-gnu/libc.so.6
+#2 0x00007fffa517ca1b in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#3 0x00007fffa517ef3b in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#4 0x00007fffa4ee080d in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#5 0x00007fffa4f4a049 in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#6 0x00007fffa4d12718 in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#7 0x00007fffa4fdc5b6 in ?? () from /lib/x86_64-linux-gnu/libcudadebugger.so.1
+#8 0x00007fffc65ace1a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#9 0x00007fffc672238c in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
+#10 0x00007fffd1012f91 in ?? () from /home/supermicro/miniconda3/envs/cuda12/lib/libcudart.so.12
+#11 0x00007fffd104f5e4 in cudaSetDevice () from /home/supermicro/miniconda3/envs/cuda12/lib/libcudart.so.12
+#12 0x00007fffd2706552 in ggml_cuda_set_device(int) () from /home/supermicro/ai/ik_llama.cpp/ggml/src/libggml.so
+#13 0x00007fffd2710fe2 in ggml_cuda_op_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, void (*)(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, char const*, float const*, char const*, float*, long, long, long, long, CUstream_st*), void (*)(float const*, void*, long, long, long, long, ggml_type, CUstream_st*)) [clone .constprop.0] () from /home/supermicro/ai/ik_llama.cpp/ggml/src/libggml.so
+#14 0x00007fffd271cbc6 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/supermicro/ai/ik_llama.cpp/ggml/src/libggml.so
+#15 0x00007fffd1917716 in ggml_backend_sched_graph_compute_async () from /home/supermicro/ai/ik_llama.cpp/ggml/src/libggml.so
+#16 0x00007ffff7ea2832 in llama_decode () from /home/supermicro/ai/ik_llama.cpp/src/libllama.so
+#17 0x000055555562f6b1 in llama_init_from_gpt_params(gpt_params&) ()
+#18 0x00005555555c48a6 in server_context::load_model(gpt_params const&) ()
+#19 0x0000555555579418 in main ()
+```
+
+
+
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **12:08:51**
+
+It is a release build, so there isn't much you can do. I don't think I understand what the bug is from the backtrace. I guess, the only way to resolve this is for me to get a multi-GPU system and debug it myself.
+
+But thanks for helping.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-23** at **12:42:55**
+
+Not worth it for me to compile it a different way? I saw in the code that on certain arch it upcasts to FP32 for several calculations and I have enabled FP16 cuda. Command-A and I think qwen2 among them. But cohere2 doesn't crash with FP16, just ignores the prompt and qwen2 loads but outputs only one or two token ids.
+
+BTW, some MoE can also load in row split, it just t/s tanks. IIRC, either qwen-235b or deepseek-v2.5 did.
\ No newline at end of file
diff --git a/github-data/issues/638 - github_data dir contains filename causing issues on Windows.md b/github-data/issues/638 - github_data dir contains filename causing issues on Windows.md
new file mode 100644
index 000000000..784ddd1f2
--- /dev/null
+++ b/github-data/issues/638 - github_data dir contains filename causing issues on Windows.md
@@ -0,0 +1,67 @@
+## 📌 [Issue #638](https://github.com/ikawrakow/ik_llama.cpp/issues/638) - "github_data" dir contains filename causing issues on Windows
+
+| **Author** | `sousekd` |
+| :--- | :--- |
+| **State** | ❌ **Closed** |
+| **Created** | 2025-07-23 |
+| **Updated** | 2025-07-23 |
+
+---
+
+## 📄 Description
+
+Commit ab7d193 makes it difficult to work with the repo on Windows, as it contains files with long names and/or invalid characters causing all kinds of `git` issues.
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-07-23** at **09:07:14**
+
+What are the invalid characters? We can ask @ThomasBaruzier to recreate, but he will need more specific requirements to change the scripts.
+
+---
+
+👤 **ThomasBaruzier** commented on **2025-07-23** at **09:30:13**
+
+I don't have access to a windows machine at the moment. I will sanitize the filenames more strictly and limit them to 100 chars instead of 200.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **09:43:53**
+
+> I don't have access to a windows machine at the moment. I will sanitize the filenames more strictly and limit them to 100 chars instead of 200
+
+Isn't it better to just name the files "Issue_XXX", "Discussion_XXX" and "PR_XXX", where `XXX` is the number of the issue/PR/discussion? There is of course some value in having the title in the file name, but if this causes problems, I can just use grep to see the titles.
+
+---
+
+👤 **saood06** commented on **2025-07-23** at **09:45:31**
+
+>Isn't it better to just name the files "Issue_XXX", "Discussion_XXX" and "PR_XXX", where XXX is the number of the issue/PR/discussion? There is of course some value in having the title in the file name, but if this causes problems, I can just use grep to see the titles.
+
+If you do that you could have a .json file with a dictionary.
+
+---
+
+👤 **ThomasBaruzier** commented on **2025-07-23** at **10:11:27**
+
+> Isn't it better to just name the files "Issue_XXX", "Discussion_XXX" and "PR_XXX", where `XXX` is the number of the issue/PR/discussion? There is of course some value in having the title in the file name, but if this causes problems, I can just use grep to see the titles.
+
+I already went ahead and sanitized the filenames to allow only `a-zA-Z0-9_.\- ` with an 80 character limit before seeing your message. I can switch to your proposed approach if you prefer, though I’m fairly confident the current fix is sufficient.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **10:36:47**
+
+@sousekd Can you check if [#640](https://github.com/ikawrakow/ik_llama.cpp/issues/640) fixes your problem? Thanks.
+
+---
+
+👤 **sousekd** commented on **2025-07-23** at **11:29:42**
+
+@ikawrakow @ThomasBaruzier Sorry I was off. **Yes** it fixes the problem. Thank you.
+
+The issue was:
+
+`error: cannot stat 'github-data/issues/383-Bug_ Loading DeepSeek R1T Chimera causes _llama_model_load_ error loading model_ check_tensor_dims_ tensor 'blk.0.attn_q_b.weight' has wrong shape; expected 1536, 73728, got 1536, 24576, 1, ': Filename too long`
\ No newline at end of file
diff --git a/github-data/issues/641 - Bug Vulkan issues with Qwen3-30B-A3B.md b/github-data/issues/641 - Bug Vulkan issues with Qwen3-30B-A3B.md
new file mode 100644
index 000000000..c8117d99c
--- /dev/null
+++ b/github-data/issues/641 - Bug Vulkan issues with Qwen3-30B-A3B.md
@@ -0,0 +1,427 @@
+## 📌 [Issue #641](https://github.com/ikawrakow/ik_llama.cpp/issues/641) - Bug: Vulkan issues with Qwen3-30B-A3B
+
+| **Author** | `samteezy` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Created** | 2025-07-23 |
+| **Updated** | 2025-07-23 |
+
+---
+
+## 📄 Description
+
+### What happened?
+
+@ikawrakow your comment earlier [here](https://github.com/ikawrakow/ik_llama.cpp/issues/464#issuecomment-3105956547) reminded me of why I had `-ngl 0` in the settings I'd been playing with.
+
+In my experimentation, using Vulkan with `Qwen3-30B-A3B` and having _any_ layers on the GPU returns a lot of endless repetition like:
+
+```
+Prompt: Tell me how the game mancala is played.
+
+Response:
+Thought Process:
+Okay, so I need to figure out how to play Mancalas. Wait no, actually, I need to tell the user how the game mancala is played. I mean, I need to, uh... well, I need to think through how to explain it. Let me, let me try to recall what I know. So the game is... Mancala, right? That's a game where, you know, it's a game where the players, uh, the objective is to, well, I mean, I think that it's a game where the players, you know, the players, uh, the player is, well, the player is to, the player is to... I mean, I think that it's a game where you the players, you know, the player, you know, the player, you know, the player is to, well, the player is to, the player is to... I mean, I think that it's a game where you, the player, you know, the player, you know, the player is to, the player is to, the player is to... I mean, I think that the players are to, well, the players are to, the players are to, the players are to, the player is to... I mean, I think that it's a game where you, the player, you know, the player, you know, the player is to, the
+```
+(I put it out of its misery at that point).
+
+Running with the nearly identical settings in mainline `llama.cpp`, I don't have this issue. (In mainline, because I have both ROCm and Vulkan built, I use `--device Vulkan1`, whereas in `ik_llama.cpp` I use `-mg 1` because I get the error `error: unknown argument: --device`.)
+
+Where Vulkan in `ik_llama.cpp` has been working for me is with non-MoE models like `Qwen3-0.6B` or `Devstral-Small-2507`.
+
+FYI, I'm using unsloth quants mainly across all of these models.
+
+Oh, and for clarity, system specs:
+- Xeon 2150B CPU
+- Radeon V620 32GB (shows as device 0 with ROCm, device 1 in Vulkan, annoyingly)
+- Radeon Pro WX3200 4GB (reverse of above)
+- Running in Ubuntu 24.04 container within Proxmox-hosted LXC
+
+### Name and Version
+
+root@llama:~# llama-builds/ik_llama.cpp/bin/llama-server --version
+version: 3816 (7093a358)
+
+### What operating system are you seeing the problem on?
+
+Linux
+
+### Relevant log output
+
+```shell
+
+```
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-07-23** at **13:38:01**
+
+Is this with or without flash attention? Does changing the flash attention setting change the result? And what does it say about coopmat when initializing the Vulkan back-end?
+
+---
+
+👤 **samteezy** commented on **2025-07-23** at **14:28:34**
+
+Prior runs were with -fa on and quantized cache, still encountering same issue without flash attention.
+
+Last run without flash attention:
+
+```bash
+root@llama:~# /root/llama-builds/ik_llama.cpp/bin/llama-cli --threads 10 --n-gpu-layers 99 -mg 1 -ot exps=CPU -m /mnt/models/unsloth/Qwen3-30B-A3B-128K-UD-Q5_K_XL.gguf --temp 0.7 --min-p 0 --top-p 0.8 --top-k 20 --ctx-size 32000 --presence-penalty 0.1 -v --prompt "Tell me how the game mancala is played."
+ggml_vulkan: 0 = AMD Radeon (TM) Pro WX 3200 Series (RADV POLARIS12) (radv) | uma: 0 | fp16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
+ggml_vulkan: 1 = AMD Radeon PRO V620 (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
+Log start
+main: build = 3816 (7093a358)
+main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+main: seed = 1753280863
+llama_model_loader: loaded meta data with 39 key-value pairs and 579 tensors from /mnt/models/unsloth/Qwen3-30B-A3B-128K-UD-Q5_K_XL.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3moe
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3-30B-A3B-128K
+llama_model_loader: - kv 3: general.finetune str = 128k
+llama_model_loader: - kv 4: general.basename str = Qwen3-30B-A3B-128K
+llama_model_loader: - kv 5: general.quantized_by str = Unsloth
+llama_model_loader: - kv 6: general.size_label str = 30B-A3B
+llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
+llama_model_loader: - kv 8: qwen3moe.block_count u32 = 48
+llama_model_loader: - kv 9: qwen3moe.context_length u32 = 131072
+llama_model_loader: - kv 10: qwen3moe.embedding_length u32 = 2048
+llama_model_loader: - kv 11: qwen3moe.feed_forward_length u32 = 6144
+llama_model_loader: - kv 12: qwen3moe.attention.head_count u32 = 32
+llama_model_loader: - kv 13: qwen3moe.attention.head_count_kv u32 = 4
+llama_model_loader: - kv 14: qwen3moe.rope.freq_base f32 = 1000000.000000
+llama_model_loader: - kv 15: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 16: qwen3moe.expert_used_count u32 = 8
+llama_model_loader: - kv 17: qwen3moe.attention.key_length u32 = 128
+llama_model_loader: - kv 18: qwen3moe.attention.value_length u32 = 128
+llama_model_loader: - kv 19: qwen3moe.expert_count u32 = 128
+llama_model_loader: - kv 20: qwen3moe.expert_feed_forward_length u32 = 768
+llama_model_loader: - kv 21: qwen3moe.rope.scaling.type str = yarn
+llama_model_loader: - kv 22: qwen3moe.rope.scaling.factor f32 = 4.000000
+llama_model_loader: - kv 23: qwen3moe.rope.scaling.original_context_length u32 = 32768
+llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151654
+llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 32: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+llama_model_loader: - kv 33: general.quantization_version u32 = 2
+llama_model_loader: - kv 34: general.file_type u32 = 17
+llama_model_loader: - kv 35: quantize.imatrix.file str = Qwen3-30B-A3B-128K-GGUF/imatrix_unslo...
+llama_model_loader: - kv 36: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-30B-A3B-128...
+llama_model_loader: - kv 37: quantize.imatrix.entries_count i32 = 384
+llama_model_loader: - kv 38: quantize.imatrix.chunks_count i32 = 685
+llama_model_loader: - type f32: 241 tensors
+llama_model_loader: - type q8_0: 1 tensors
+llama_model_loader: - type q4_K: 20 tensors
+llama_model_loader: - type q5_K: 227 tensors
+llama_model_loader: - type q6_K: 90 tensors
+llm_load_vocab: special tokens cache size = 26
+llm_load_vocab: token to piece cache size = 0.9311 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = qwen3moe
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 151936
+llm_load_print_meta: n_merges = 151387
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 131072
+llm_load_print_meta: n_embd = 2048
+llm_load_print_meta: n_layer = 48
+llm_load_print_meta: n_head = 32
+llm_load_print_meta: n_head_kv = 4
+llm_load_print_meta: n_rot = 128
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 128
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 8
+llm_load_print_meta: n_embd_k_gqa = 512
+llm_load_print_meta: n_embd_v_gqa = 512
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 6144
+llm_load_print_meta: n_expert = 128
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 2
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 1000000.0
+llm_load_print_meta: freq_scale_train = 0.25
+llm_load_print_meta: n_ctx_orig_yarn = 32768
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = ?B
+llm_load_print_meta: model ftype = Q5_K - Medium
+llm_load_print_meta: model params = 30.532 B
+llm_load_print_meta: model size = 20.242 GiB (5.695 BPW)
+llm_load_print_meta: repeating layers = 19.805 GiB (5.688 BPW, 29.910 B parameters)
+llm_load_print_meta: general.name = Qwen3-30B-A3B-128K
+llm_load_print_meta: BOS token = 11 ','
+llm_load_print_meta: EOS token = 151645 '<|im_end|>'
+llm_load_print_meta: PAD token = 151654 '<|vision_pad|>'
+llm_load_print_meta: LF token = 148848 'ÄĬ'
+llm_load_print_meta: EOT token = 151645 '<|im_end|>'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_ff_exp = 768
+llm_load_tensors: ggml ctx size = 0.76 MiB
+Tensor blk.0.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.0.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.0.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.1.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.1.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.1.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.2.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.2.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.2.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.3.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.4.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.4.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.5.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.6.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.6.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.7.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.7.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.8.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.8.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.9.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.9.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.10.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.10.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
+llm_load_tensors: offloading 48 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 49/49 layers to GPU
+llm_load_tensors: CPU buffer size = 20266.31 MiB
+llm_load_tensors: CPU buffer size = 204.02 MiB
+llm_load_tensors: Vulkan1 buffer size = 810.48 MiB
+llm_load_tensors: Vulkan0 buffer size = 82.48 MiB
+....................................................................................................
+llama_new_context_with_model: n_ctx = 32000
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 0
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 1000000.0
+llama_new_context_with_model: freq_scale = 0.25
+llama_kv_cache_init: Vulkan1 KV buffer size = 2625.00 MiB
+llama_kv_cache_init: Vulkan0 KV buffer size = 375.00 MiB
+llama_new_context_with_model: KV self size = 3000.00 MiB, K (f16): 1500.00 MiB, V (f16): 1500.00 MiB
+llama_new_context_with_model: CPU output buffer size = 0.58 MiB
+llama_new_context_with_model: Vulkan0 compute buffer size = 2086.50 MiB
+llama_new_context_with_model: Vulkan1 compute buffer size = 2086.50 MiB
+llama_new_context_with_model: CPU compute buffer size = 0.00 MiB
+llama_new_context_with_model: Vulkan_Host compute buffer size = 66.51 MiB
+llama_new_context_with_model: graph nodes = 2165
+llama_new_context_with_model: graph splits = 189
+
+system_info: n_threads = 10 / 18 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
+sampling:
+ repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.100
+ top_k = 20, tfs_z = 1.000, top_p = 0.800, min_p = 0.000, typical_p = 1.000, temp = 0.700
+ mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
+ xtc_probability = 0.000, xtc_threshold = 1.000, top_n_sigma = 0.000
+sampling order:
+CFG -> Penalties -> dry -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> xtc -> top_n_sigma -> temperature
+generate: n_ctx = 32000, n_batch = 2048, n_predict = -1, n_keep = 0
+
+
+Tell me how the game mancala is played. What is the rule of the game? What are the rules of the game? What is the rule of the game? What is the rule of the game? What is the rule of the game? What is the rule of the game? What is the rule of
+
+llama_print_timings: load time = 6099.71 ms
+llama_print_timings: sample time = 109.50 ms / 53 runs ( 2.07 ms per token, 484.01 tokens per second)
+llama_print_timings: prompt eval time = 347.74 ms / 10 tokens ( 34.77 ms per token, 28.76 tokens per second)
+llama_print_timings: eval time = 3571.80 ms / 53 runs ( 67.39 ms per token, 14.84 tokens per second)
+llama_print_timings: total time = 4110.00 ms / 63 tokens
+```
+
+I don't see any mention of coopmat in the console output, even with -v. What should I be searching for?
+
+---
+
+👤 **samteezy** commented on **2025-07-23** at **14:35:07**
+
+And if it helps, my current build params:
+
+```
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="$INSTALL_DIR" -DBUILD_SHARED_LIBS=OFF \
+ $CCACHE_FLAG \ #This is part of a script, so I set this dynamically, defaults to "on"
+ -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=FLAME \
+ -DGGML_VULKAN=ON \
+ -DGGML_CUDA_FA_ALL_QUANTS=ON #allowing all combinations of KV cache
+```
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **14:40:38**
+
+I see this when I run with the Vulkan back-end:
+```
+ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
+```
+If I go to another machine with the same GPU where I have updated the Nvidia driver to the latest and greatest, I see
+```
+ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
+```
+instead.
+
+With the initial port of Vulkan back-end, some pre-processor macros were not set, and as a result the build was without coop mat enabled. This leads to a horrible performance.
+
+I'm missing the `ggml_vulkan: ...` output in your log, so not sure when and how your Vulkan back-end gets initialized.
+
+---
+
+👤 **samteezy** commented on **2025-07-23** at **14:51:10**
+
+Well, that explains it... if you look at the top of the logs, you'll see that neither GPU has any matrix cores (hence why coopmat doesn't show in the first place)
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **14:57:44**
+
+Oh, that's the log entry I was looking for, but I missed it because in my case it shows up somewhere else.
+
+OK, the Vulkan port here was never tested without coopmat, so something is likely broken.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **15:07:41**
+
+OK, so this means that the scalar implementation of one of the non-linear self-attention ops is broken here. If you don't upload anything to the GPU, these ops will run on the CPU, and it works.
+
+I'll try to debug when I come back from vacation in 2 weeks.
\ No newline at end of file
diff --git a/github-data/issues/644 - Feature Request Way to use on Tesla P40.md b/github-data/issues/644 - Feature Request Way to use on Tesla P40.md
new file mode 100644
index 000000000..d4fd74e8d
--- /dev/null
+++ b/github-data/issues/644 - Feature Request Way to use on Tesla P40.md
@@ -0,0 +1,292 @@
+## 📌 [Issue #644](https://github.com/ikawrakow/ik_llama.cpp/issues/644) - Feature Request: Way to use on Tesla P40
+
+| **Author** | `narikm` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Created** | 2025-07-24 |
+| **Updated** | 2025-07-27 |
+| **Labels** | `enhancement` |
+
+---
+
+## 📄 Description
+
+### Prerequisites
+
+- [x] I am running the latest code. Mention the version if possible as well.
+- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
+- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
+- [x] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new and useful enhancement to share.
+
+### Feature Description
+
+Is there a way to use this code on legacy Tesla P40?
+
+### Motivation
+
+Is there a way to use this repo on a old Tesla P40? I tried to deactivate flash attention and use:
+cmake -B build ^
+ -DGGML_CUDA=ON ^
+ -DGGML_BLAS=OFF ^
+ -DGGML_CUDA_ARCH=61 ^
+ -DGGML_CUDA_GRAPH=OFF ^
+ -DGGML_CUDA_FORCE_MMQ=OFF ^
+ -DGGML_CUDA_DMMV_X=32 ^
+ -DGGML_CUDA_MMQ_ENABLE=OFF ^
+
+according to Chat GPT, is there a way to compile it for old devices?
+
+I only get CUDA errors like:
+CUDA error: an illegal memory access was encountered
+ current device: 0, in function launch_mul_mat_q at D:\ik_llama.cpp\ggml\src\ggml-cuda\template-instances\../mmq.cuh:4008
+ cudaFuncSetAttribute(mul_mat_q, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)
+D:\ik_llama.cpp\ggml\src\ggml-cuda.cu:110: CUDA error
+
+### Possible Implementation
+
+_No response_
+
+---
+
+## 💬 Conversation
+
+👤 **firecoperana** commented on **2025-07-25** at **11:59:14**
+
+P40 should work. Try this command:
+-DGGML_CUDA=ON ^
+-DGGML_BLAS=OFF ^
+-DCMAKE_CUDA_ARCHITECTURES="61" ^
+-DGGML_CUDA_USE_GRAPHS=OFF ^
+-DGGML_CUDA_FORCE_MMQ=OFF ^
+-DGGML_CUDA_DMMV_X=32 ^
+-DGGML_CUDA_MMQ_ENABLE=OFF
+
+Flash attention should work too with -DGGML_CUDA_FA_ALL_QUANTS=ON
+
+---
+
+👤 **narikm** commented on **2025-07-25** at **14:35:54**
+
+Same error.
+INFO [ update_slots] kv cache rm [p0, end) | tid="5232" timestamp=1753453942 id_slot=0 id_task=0 p0=0
+CUDA error: an illegal memory access was encountered
+ current device: 0, in function launch_mul_mat_q at G:\ik_llama.cpp\ggml\src\ggml-cuda\template-instances\../mmq.cuh:4008
+ cudaFuncSetAttribute(mul_mat_q, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)
+G:\ik_llama.cpp\ggml\src\ggml-cuda.cu:110: CUDA error
+
+The compiler said -DGGML_CUDA_MMQ_ENABLE=OFF was not used. Is this related?
+
+---
+
+👤 **firecoperana** commented on **2025-07-25** at **15:12:29**
+
+Can you post your full command line including the model name?
+
+---
+
+👤 **narikm** commented on **2025-07-25** at **15:49:47**
+
+The launch command line:
+cd /d G:\ik_llama.cpp\build\bin\Release
+
+llama-server ^
+ --alias DeepSeek-R1-0528-IQ2_K_R4 ^
+ --model "G:\DeepSeek-R1-0528-IQ2_K_R4\DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf" ^
+ -rtr ^
+ --ctx-size 12288 ^
+ -ctk q8_0 ^
+ -mla 2 ^
+ -amb 512 ^
+ -fmoe ^
+ --n-gpu-layers 63 ^
+ --override-tensor exps=CPU ^
+ --parallel 1 ^
+ --threads 32 ^
+ --host 0.0.0.0 ^
+ --port 8008
+
+
+The build line is what you gave me:
+cmake -B build ^
+-DGGML_CUDA=ON ^
+-DGGML_BLAS=OFF ^
+-DCMAKE_CUDA_ARCHITECTURES="61" ^
+-DGGML_CUDA_USE_GRAPHS=OFF ^
+-DGGML_CUDA_FORCE_MMQ=OFF ^
+-DGGML_CUDA_DMMV_X=32 ^
+-DGGML_CUDA_MMQ_ENABLE=OFF
+
+---
+
+👤 **firecoperana** commented on **2025-07-25** at **16:40:52**
+
+You are using the ik quants file. Add -DGGML_IQK_FA_ALL_QUANTS=1 to see if it works. You can also use regular quants here.
+
+---
+
+👤 **narikm** commented on **2025-07-25** at **16:48:36**
+
+Will do. Do i need -fa in the launch ? The compiler said:
+
+Manually-specified variables were not used by the project:
+
+ GGML_CUDA_MMQ_ENABLE.
+
+Is this OK?
+
+---
+
+👤 **firecoperana** commented on **2025-07-25** at **17:11:32**
+
+-fa is supported, but since you are not using ctv, it's not required I think. I would remove GGML_CUDA_MMQ_ENABLE since it's off by default. If you still have issue, you can remove -ctk q8_0.
+
+---
+
+👤 **firecoperana** commented on **2025-07-25** at **17:17:24**
+
+You can add -DGGML_CUDA_FA_ALL_QUANTS=ON to make sure all quants are supported.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-25** at **17:42:44**
+
+DeepSeek flash attention does not work on a P40.
+
+---
+
+👤 **narikm** commented on **2025-07-25** at **17:55:14**
+
+ChatGPT said the same, so i removed the -fa in the launch args before posting. It changed the error to the one i posted. I still get the same error each time. It load the model but when i ask something, error, the cpu run for ten seconds, then the program crash.
+
+---
+
+👤 **firecoperana** commented on **2025-07-25** at **18:54:27**
+
+It most likely not gonna work, but if you like to try, can you remove line 4008-4009 or 4008-4010 in mmq.cuh, and run it without fa?
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-25** at **20:33:58**
+
+>DeepSeek flash attention does not work on a P40.
+
+Is this because of having to use BF16? Or something else like not enough smem?
+
+>-DGGML_CUDA_FA_ALL_QUANTS=ON
+
+I think this is just to use different K and V quantization. Like Q4/Q8.
+
+This thread saved me from re-installing the P40s I have at least. Any other caveats, say for qwen coder or others? Assume P100 is the same story too. What about turning cards?
+
+---
+
+👤 **firecoperana** commented on **2025-07-25** at **21:34:39**
+
+> ChatGPT said the same, so i removed the -fa in the launch args before posting. It changed the error to the one i posted. I still get the same error each time. It load the model but when i ask something, error, the cpu run for ten seconds, then the program crash.
+
+Does your card work in llama.cpp if you use regular quants of Deepseek R1?
+
+---
+
+👤 **narikm** commented on **2025-07-25** at **21:37:54**
+
+I only tried with webui and a 7b llama3, where it works. I want to use the better IK quant, for faster inference AFAIK.
+
+---
+
+👤 **narikm** commented on **2025-07-25** at **23:26:49**
+
+> It most likely not gonna work, but if you like to try, can you remove line 4008-4009 or 4008-4010 in mmq.cuh, and run it without fa?
+
+Another error:
+CUDA error: an illegal memory access was encountered
+ current device: 0, in function ggml_cuda_op_mul_mat at G:\ik_llama.cpp\ggml\src\ggml-cuda.cu:1733
+ ggml_cuda_cpy_tensor_2d( src1_ddf_i, src1, i03, i02, src1_col_0, src1_col_0+src1_ncols, stream)
+G:\ik_llama.cpp\ggml\src\ggml-cuda.cu:110: CUDA error
+
+---
+
+👤 **firecoperana** commented on **2025-07-26** at **00:55:28**
+
+I haven't used IQ2_K_R4 quants, but I can use my 1080ti for DeepSeek V3 UD IQ1_S with -rtr -fa -fmoe -mla 1 without the above code change in ik_llama.cpp.
+
+---
+
+👤 **narikm** commented on **2025-07-26** at **01:11:03**
+
+What args did you use to compile? Can you give me your launchs commands so i can try to repicate?
+
+---
+
+👤 **firecoperana** commented on **2025-07-26** at **01:36:39**
+
+cmake.exe -B build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_ARCHITECTURES="86;61" -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DLLAMA_CURL=OFF -DBUILD_SHARED_LIBS=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_IQK_FA_ALL_QUANTS=0
+
+
+llama-server.exe ^
+ --model "F:\DeepSeek-V3 UD IQ1_S\DeepSeek V3 UD IQ1_S.gguf" ^
+ --host 292.138.3.201 ^
+ --port 6703 ^
+ --n-gpu-layers 99 ^
+ --tensor-split 99 ^
+ --split-mode layer --main-gpu 0 ^
+ --threads 10 --ctx-size 100 --cache-type-k q8_0 --seed 1234 -ot exps=CPU --ubatch-size 16 --batch-size 16 ^
+ -rtr -fa -fmoe -mla 1
+
+---
+
+👤 **narikm** commented on **2025-07-26** at **23:34:24**
+
+Ok, so i tested with your your args and model: it crash with -fa. Without it doesn't but neither does it output tokens, despite the cpu and gpu working. So there is no current way to use a P40 with ik llama.
+
+---
+
+👤 **saood06** commented on **2025-07-26** at **23:37:51**
+
+>Without it doesn't but neither does it output tokens, despite the cpu and gpu working.
+
+Are you using the text_completion endpoint? If so that may be the reason see [#654](https://github.com/ikawrakow/ik_llama.cpp/issues/654).
+
+---
+
+👤 **narikm** commented on **2025-07-26** at **23:42:30**
+
+Yes, on silly tavern, will test again once fix is merged. But it still crash with Ik quant like the Ubergarm deepseek.
+
+---
+
+👤 **firecoperana** commented on **2025-07-26** at **23:44:34**
+
+You can test from the built-in webui first. This is still working. Just type ip and port in the browser. If you didn't set it, default is 127.0.0.1:8080.
+
+---
+
+👤 **narikm** commented on **2025-07-26** at **23:56:13**
+
+I tested, it works locally with the standard quant (unsloth/DeepSeek-V3-0324-GGUF) but crash with the ik quant [ubergarm/DeepSeek-V3-0324-GGUF] (https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF). Is there a way to use the better ik quant?
+
+---
+
+👤 **saood06** commented on **2025-07-27** at **00:03:29**
+
+>I tested, it works locally with the standard quant (unsloth/DeepSeek-V3-0324-GGUF) but crash with the ik quant [ubergarm/DeepSeek-V3-0324-GGUF] (https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF). Is there a way to use the better ik quant?
+
+Can you give more details about this crash?
+
+---
+
+👤 **narikm** commented on **2025-07-27** at **00:05:42**
+
+The same as always, CUDA error: an illegal memory access was encountered
+current device: 0, in function launch_mul_mat_q at D:\ik_llama.cpp\ggml\src\ggml-cuda\template-instances../mmq.cuh:4008
+cudaFuncSetAttribute(mul_mat_q, cudaFuncAttributeMaxDynamicSharedMemorySize, shmem)
+D:\ik_llama.cpp\ggml\src\ggml-cuda.cu:110: CUDA error
+Chatgpt wanted me to desactivate mmq to fall back on older matrix multiplication, but the args he gave me were not taken into account.
+
+---
+
+👤 **saood06** commented on **2025-07-27** at **00:39:53**
+
+>Yes, on silly tavern, will test again once fix is merged. But it still crash with Ik quant like the Ubergarm deepseek.
+
+I know you already tested, but just as a heads up, I merged the fix in.
\ No newline at end of file
diff --git a/github-data/issues/647 - web ui not showing tsec.md b/github-data/issues/647 - web ui not showing tsec.md
new file mode 100644
index 000000000..91e626b0f
--- /dev/null
+++ b/github-data/issues/647 - web ui not showing tsec.md
@@ -0,0 +1,33 @@
+## 📌 [Issue #647](https://github.com/ikawrakow/ik_llama.cpp/issues/647) - web ui not showing t/sec
+
+| **Author** | `gopinath87607` |
+| :--- | :--- |
+| **State** | ❌ **Closed** |
+| **Created** | 2025-07-25 |
+| **Updated** | 2025-07-27 |
+
+---
+
+## 📄 Description
+
+i used to get t/sec in web ui by enabling option in advance tab but right now its even if i enable it its not showing does others also getting same ?
+
+---
+
+## 💬 Conversation
+
+👤 **firecoperana** commented on **2025-07-25** at **15:13:36**
+
+The same. I will fix it.
+
+---
+
+👤 **firecoperana** commented on **2025-07-25** at **23:11:40**
+
+https://github.com/ikawrakow/ik_llama.cpp/pull/648
+
+---
+
+👤 **firecoperana** commented on **2025-07-27** at **13:14:47**
+
+PR merged. Close this.
\ No newline at end of file
diff --git a/github-data/issues/649 - Bug IQ2_KL in ffn_gateup_exps for Qwen3-Coder-480B-A35B-Instruct iqk_fa_template.md b/github-data/issues/649 - Bug IQ2_KL in ffn_gateup_exps for Qwen3-Coder-480B-A35B-Instruct iqk_fa_template.md
new file mode 100644
index 000000000..ba0cfc510
--- /dev/null
+++ b/github-data/issues/649 - Bug IQ2_KL in ffn_gateup_exps for Qwen3-Coder-480B-A35B-Instruct iqk_fa_template.md
@@ -0,0 +1,306 @@
+## 📌 [Issue #649](https://github.com/ikawrakow/ik_llama.cpp/issues/649) - Bug: IQ2_KL in ffn_(gate|up)_exps for Qwen3-Coder-480B-A35B-Instruct `iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed`
+
+| **Author** | `ubergarm` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Created** | 2025-07-26 |
+| **Updated** | 2025-07-26 |
+
+---
+
+## 📄 Description
+
+### What happened?
+
+I cooked a quant using IQ2_KL in ffn_(gate|up)_exps tensors for Qwen3-Coder-480B-A35B-Instruct. However, when trying to run it throws `iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed` and exits on startup. Compiled CPU-only backend.
+
+Oddly though, I have a different quant using IQ2_KL in only the ffn_down tensors and it works fine. It is identical to the recipe except for slightly larger routed exps:
+```
+# Routed Experts
+blk\..*\.ffn_down_exps\.weight=iq2_kl
+blk\..*\.ffn_(gate|up)_exps\.weight=iq2_k
+```
+
+Finally just to test, I tried removing `-fmoe` but it failed with the same assert error. Removing `-fa` it exited with this error: `Oops(ggml_compute_forward_sum_rows_f32, ffn_moe_weights_sum-2): found nan for i1 = 0, i2 = 0, i3 = 0. ne00 = 160`
+
+I noticed this issue before releasing the quant so no rush. If I have time I might try rolling back to earlier version to see if it was possibly a regression.
+
+*EDIT*: Also my recent `Kimi-K2-Instruct-IQ2_KL` quants are working fine too:
+```
+Adding custom rule blk\..*\.ffn_down_exps\.weight -> iq3_ks
+# Adding custom rule blk\..*\.ffn_down_exps\.weight -> iq2_kl # <--- i have one with all exps iq2_kl also
+Adding custom rule blk\..*\.ffn_(gate|up)_exps\.weight -> iq2_kl
+```
+
+
+
+👈 IQ2_KL Quant Recipe
+
+This is the quant that throws the error:
+```bash
+#!/usr/bin/env bash
+
+# Repeating Layers [0-61]
+
+custom="
+# Attention
+blk\..*\.attn_q.*=iq6_k
+blk\..*\.attn_k.*=q8_0
+blk\..*\.attn_v.*=q8_0
+blk\..*\.attn_output.*=iq6_k
+
+# Routed Experts
+blk\..*\.ffn_down_exps\.weight=iq3_k
+blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
+
+# Non-Repeating Layers
+token_embd\.weight=iq4_k
+output\.weight=iq6_k
+"
+
+custom=$(
+ echo "$custom" | grep -v '^#' | \
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+
+numactl -N 0 -m 0 \
+./build/bin/llama-quantize \
+ --custom-q "$custom" \
+ --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
+ /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
+ /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KL.gguf \
+ IQ2_KL \
+ 192
+```
+
+
+
+
+
+
+👈 llama-server command, log and error
+
+```bash
+# compile CPU-only backend
+
+model=/mnt/raid/hf/Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ2_KL/Qwen3-480B-A35B-Instruct-IQ2_KL-00001-of-00004.gguf
+
+numactl -N 1 -m 1 \
+./build/bin/llama-server \
+ --model "$model"\
+ --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct \
+ --ctx-size 196608 \
+ -ctk q8_0 -ctv q8_0 \
+ -fa -fmoe \
+ -ub 4096 -b 4096 \
+ --parallel 1 \
+ --threads 128 \
+ --threads-batch 192 \
+ --numa numactl \
+ --host 127.0.0.1 \
+ --port 8080 \
+ --no-mmap
+
+INFO [ main] build info | tid="127586578487488" timestamp=1753302334 build=3821 commit="1b052109"
+INFO [ main] system info | tid="127586578487488" timestamp=1753302334 n_threads=128 n_threads_batch=192 total_threads=768 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+llama_model_loader: additional 3 GGUFs metadata loaded.
+llama_model_loader: loaded meta data with 41 key-value pairs and 747 tensors from /mnt/raid/hf/Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ2_KL/Qwen3-480B-A35B-Instruct-IQ2_KL-00001-of-00004.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3moe
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 Coder 480B A35B Instruct
+llama_model_loader: - kv 3: general.finetune str = Instruct
+llama_model_loader: - kv 4: general.basename str = Qwen3-Coder
+llama_model_loader: - kv 5: general.size_label str = 480B-A35B
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
+llama_model_loader: - kv 8: general.tags arr[str,1] = ["text-generation"]
+llama_model_loader: - kv 9: qwen3moe.block_count u32 = 62
+llama_model_loader: - kv 10: qwen3moe.context_length u32 = 262144
+llama_model_loader: - kv 11: qwen3moe.embedding_length u32 = 6144
+llama_model_loader: - kv 12: qwen3moe.feed_forward_length u32 = 8192
+llama_model_loader: - kv 13: qwen3moe.attention.head_count u32 = 96
+llama_model_loader: - kv 14: qwen3moe.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 15: qwen3moe.rope.freq_base f32 = 10000000.000000
+llama_model_loader: - kv 16: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 17: qwen3moe.expert_used_count u32 = 8
+llama_model_loader: - kv 18: qwen3moe.attention.key_length u32 = 128
+llama_model_loader: - kv 19: qwen3moe.attention.value_length u32 = 128
+llama_model_loader: - kv 20: general.file_type u32 = 155
+llama_model_loader: - kv 21: qwen3moe.expert_count u32 = 160
+llama_model_loader: - kv 22: qwen3moe.expert_feed_forward_length u32 = 2560
+llama_model_loader: - kv 23: qwen3moe.expert_shared_feed_forward_length u32 = 0
+llama_model_loader: - kv 24: general.quantization_version u32 = 2
+llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151643
+llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 33: tokenizer.chat_template str = {% macro render_item_list(item_list, ...
+llama_model_loader: - kv 34: quantize.imatrix.file str = /mnt/raid/models/ubergarm/Qwen3-Coder...
+llama_model_loader: - kv 35: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
+llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 497
+llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 840
+llama_model_loader: - kv 38: split.no u16 = 0
+llama_model_loader: - kv 39: split.count u16 = 4
+llama_model_loader: - kv 40: split.tensors.count i32 = 747
+llama_model_loader: - type f32: 311 tensors
+llama_model_loader: - type q8_0: 124 tensors
+llama_model_loader: - type iq3_k: 62 tensors
+llama_model_loader: - type iq4_k: 1 tensors
+llama_model_loader: - type iq6_k: 125 tensors
+llama_model_loader: - type iq2_kl: 124 tensors
+llm_load_vocab: special tokens cache size = 26
+llm_load_vocab: token to piece cache size = 0.9311 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = qwen3moe
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 151936
+llm_load_print_meta: n_merges = 151387
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 262144
+llm_load_print_meta: n_embd = 6144
+llm_load_print_meta: n_layer = 62
+llm_load_print_meta: n_head = 96
+llm_load_print_meta: n_head_kv = 8
+llm_load_print_meta: n_rot = 128
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 128
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 12
+llm_load_print_meta: n_embd_k_gqa = 1024
+llm_load_print_meta: n_embd_v_gqa = 1024
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 8192
+llm_load_print_meta: n_expert = 160
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 2
+llm_load_print_meta: rope scaling = linear
+llm_load_print_meta: freq_base_train = 10000000.0
+llm_load_print_meta: freq_scale_train = 1
+llm_load_print_meta: n_ctx_orig_yarn = 262144
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = ?B
+llm_load_print_meta: model ftype = IQ2_KL - 2.6875 bpw
+llm_load_print_meta: model params = 480.155 B
+llm_load_print_meta: model size = 169.597 GiB (3.034 BPW)
+llm_load_print_meta: repeating layers = 168.388 GiB (3.024 BPW, 478.288 B parameters)
+llm_load_print_meta: general.name = Qwen3 Coder 480B A35B Instruct
+llm_load_print_meta: BOS token = 11 ','
+llm_load_print_meta: EOS token = 151645 '<|im_end|>'
+llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
+llm_load_print_meta: LF token = 148848 'ÄĬ'
+llm_load_print_meta: EOT token = 151645 '<|im_end|>'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_ff_exp = 2560
+llm_load_tensors: ggml ctx size = 0.33 MiB
+llm_load_tensors: CPU buffer size = 173666.87 MiB
+....................................................................................................
+llama_new_context_with_model: n_ctx = 196608
+llama_new_context_with_model: n_batch = 4096
+llama_new_context_with_model: n_ubatch = 4096
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 1
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000000.0
+llama_new_context_with_model: freq_scale = 1
+llama_kv_cache_init: CPU KV buffer size = 25296.00 MiB
+llama_new_context_with_model: KV self size = 25296.00 MiB, K (q8_0): 12648.00 MiB, V (q8_0): 12648.00 MiB
+llama_new_context_with_model: CPU output buffer size = 2.32 MiB
+llama_new_context_with_model: CPU compute buffer size = 5184.05 MiB
+llama_new_context_with_model: graph nodes = 2424
+llama_new_context_with_model: graph splits = 1
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+GGML_ASSERT(fms.S[j] > 0) failed
+GGML_ASSERT(fms.S[j] > 0) failed
+
+GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+Could not attach to process. If your uid matches the uid of the target
+process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
+again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
+ptrace: Inappropriate ioctl for device.
+No stack.
+The program is not being run.
+Could not attach to process. If your uid matches the uid of the target
+process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
+again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
+ptrace: Inappropriate ioctl for device.
+No stack.
+The program is not being run.
+Could not attach to process. If your uid matches the uid of the target
+process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
+again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
+ptrace: Inappropriate ioctl for device.
+No stack.
+The program is not being run.
+Could not attach to process. If your uid matches the uid of the target
+process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
+again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
+warning: process 4140403 is a zombie - the process has already terminated
+ptrace: Inappropriate ioctl for device.
+No stack.
+The program is not being run.
+./myscripts/api-server-Qwen3-Coder-480B-A35B-Instruct.sh: line 34: 4140403 Aborted (core dumped) numactl -N 1 -m 1 ./build/bin/llama-server --model "$model" --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct --ctx-size 196608 -ctk q8_0 -ctv q8_0 -fa -fmoe -ub 4096 -b 4096 --parallel 3 --threads 128 --threads-batch 192 --numa numactl --host 127.0.0.1 --port 8080 --no-mmap
+```
+
+
+
+
+### Name and Version
+
+$ ./build/bin/llama-server --version
+version: 3822 (4e9c78c0)
+built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+
+### What operating system are you seeing the problem on?
+
+Linux
+
+### Relevant log output
+
+```shell
+
+```
\ No newline at end of file
diff --git a/github-data/issues/650 - Bug segfault in llama-quantize iq3_kt.md b/github-data/issues/650 - Bug segfault in llama-quantize iq3_kt.md
new file mode 100644
index 000000000..3f757a375
--- /dev/null
+++ b/github-data/issues/650 - Bug segfault in llama-quantize iq3_kt.md
@@ -0,0 +1,222 @@
+## 📌 [Issue #650](https://github.com/ikawrakow/ik_llama.cpp/issues/650) - Bug: segfault in llama-quantize iq3_kt
+
+| **Author** | `ubergarm` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Created** | 2025-07-26 |
+| **Updated** | 2025-07-26 |
+
+---
+
+## 📄 Description
+
+### What happened?
+
+I've been unable to quantize using iq3_kt in a recent reicpe. I ran the script in `gdb` and grabbed a `bt` after it segfaults. I'm am sure sure I have used iq3_kt before, but I believe that was on a different CPU rig. I could try to recreate this issue with a smaller quant, or on a different rig, or roll back to an earlier version to check for possible regressions.
+
+Here is the log and backtrace just to document it in case someone else sees anything similar.
+
+
+👈 llama-quantize command and debugging logs and backtrace
+
+```bash
+#!/usr/bin/env bash
+
+# Repeating Layers [0-61]
+
+custom="
+# Attention
+blk\..*\.attn_q.*=iq4_kt
+blk\..*\.attn_k.*=iq4_kt
+blk\..*\.attn_v.*=iq4_kt
+blk\..*\.attn_output.*=iq4_kt
+
+# Routed Experts
+blk\..*\.ffn_down_exps\.weight=iq3_kt
+blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt
+
+# Non-Repeating Layers
+token_embd\.weight=iq4_kt
+output\.weight=iq6_k
+"
+
+custom=$(
+ echo "$custom" | grep -v '^#' | \
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+
+numactl -N 1 -m 1 \
+gdb -q --args ./build/bin/llama-quantize \
+ --custom-q "$custom" \
+ --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
+ /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
+ /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf \
+ IQ2_KT \
+ 192
+
+[Thread debugging using libthread_db enabled]
+Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
+Downloading separate debug info for /lib/x86_64-linux-gnu/libgomp.so.1...
+main: build = 3822 (4e9c78c0)
+main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+main: quantizing '/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf' to '/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf' as IQ2_KT using 192 threads
+llama_model_loader: additional 20 GGUFs metadata loaded.
+llama_model_loader: loaded meta data with 37 key-value pairs and 747 tensors from /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3moe
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 Coder 480B A35B Instruct
+llama_model_loader: - kv 3: general.finetune str = Instruct
+llama_model_loader: - kv 4: general.basename str = Qwen3-Coder
+llama_model_loader: - kv 5: general.size_label str = 480B-A35B
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
+llama_model_loader: - kv 8: general.tags arr[str,1] = ["text-generation"]
+llama_model_loader: - kv 9: qwen3moe.block_count u32 = 62
+llama_model_loader: - kv 10: qwen3moe.context_length u32 = 262144
+llama_model_loader: - kv 11: qwen3moe.embedding_length u32 = 6144
+llama_model_loader: - kv 12: qwen3moe.feed_forward_length u32 = 8192
+llama_model_loader: - kv 13: qwen3moe.attention.head_count u32 = 96
+llama_model_loader: - kv 14: qwen3moe.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 15: qwen3moe.rope.freq_base f32 = 10000000.000000
+llama_model_loader: - kv 16: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 17: qwen3moe.expert_used_count u32 = 8
+llama_model_loader: - kv 18: qwen3moe.attention.key_length u32 = 128
+llama_model_loader: - kv 19: qwen3moe.attention.value_length u32 = 128
+llama_model_loader: - kv 20: general.file_type u32 = 32
+llama_model_loader: - kv 21: qwen3moe.expert_count u32 = 160
+llama_model_loader: - kv 22: qwen3moe.expert_feed_forward_length u32 = 2560
+llama_model_loader: - kv 23: qwen3moe.expert_shared_feed_forward_length u32 = 0
+llama_model_loader: - kv 24: general.quantization_version u32 = 2
+llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151643
+llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 33: tokenizer.chat_template str = {% macro render_item_list(item_list, ...
+llama_model_loader: - kv 34: split.no u16 = 0
+llama_model_loader: - kv 35: split.count u16 = 21
+llama_model_loader: - kv 36: split.tensors.count i32 = 747
+llama_model_loader: - type f32: 311 tensors
+llama_model_loader: - type bf16: 436 tensors
+================================ Have weights data with 497 entries
+[ 1/ 747] token_embd.weight - [ 6144, 151936, 1, 1], type = bf16, Using custom type iq4_kt for tensor token_embd.weight
+
+====== llama_model_quantize_internal: did not find weights for token_embd.weight
+[New Thread 0x7f1f54cfe6c0 (LWP 298924)]
+[New Thread 0x7f1f544fd6c0 (LWP 298925)]
+...
+[Thread 0x7f1ef1f7b6c0 (LWP 299113) exited]
+converting to iq4_kt .. Adding custom rule blk\..*\.attn_q.* -> iq4_kt
+Adding custom rule blk\..*\.attn_k.* -> iq4_kt
+Adding custom rule blk\..*\.attn_v.* -> iq4_kt
+Adding custom rule blk\..*\.attn_output.* -> iq4_kt
+Adding custom rule blk\..*\.ffn_down_exps\.weight -> iq3_kt
+Adding custom rule blk\..*\.ffn_(gate|up)_exps\.weight -> iq2_kt
+Adding custom rule token_embd\.weight -> iq4_kt
+Adding custom rule output\.weight -> iq6_k
+load_imatrix: imatrix dataset='ubergarm-imatrix-calibration-corpus-v02.txt'
+load_imatrix: loaded 497 importance matrix entries from /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat computed on 840 chunks
+prepare_imatrix: have 497 importance matrix entries
+[New Thread 0x7f1ef0f796c0 (LWP 299121)]
+[New Thread 0x7f1ef177a6c0 (LWP 299122)]
+...
+[Thread 0x7f1b227fc6c0 (LWP 299311) exited]
+[Thread 0x7f1dc67fc6c0 (LWP 299163) exited]
+size = 1780.50 MiB -> 445.70 MiB
+[ 2/ 747] blk.0.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 3/ 747] blk.0.attn_k.weight - [ 6144, 1024, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_k.weight
+[New Thread 0x7f1b227fc6c0 (LWP 299312)]
+[Thread 0x7f1b227fc6c0 (LWP 299312) exited]
+[New Thread 0x7f1b22ffd6c0 (LWP 299313)]
+...
+converting to iq4_kt .. cluster_points: Oops. Cluster 4 has no points: 0 1 0 0
+cluster_points: 1 out of 625 clusters dir not have any points
+...
+size = 12.00 MiB -> 3.00 MiB
+[ 4/ 747] blk.0.attn_output.weight - [12288, 6144, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_output.weight
+...
+converting to iq4_kt .. [New Thread 0x7f1b217fa6c0 (LWP 299887)]
+...
+size = 144.00 MiB -> 36.02 MiB
+[ 5/ 747] blk.0.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 6/ 747] blk.0.attn_q.weight - [ 6144, 12288, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_q.weight
+...
+converting to iq4_kt .. [New Thread 0x7f1b217fa6c0 (LWP 300271)]
+...
+size = 144.00 MiB -> 36.05 MiB
+[ 7/ 747] blk.0.attn_v.weight - [ 6144, 1024, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_v.weight
+...
+converting to iq4_kt .. [New Thread 0x7f1b217fa6c0 (LWP 300660)]
+...
+size = 12.00 MiB -> 3.00 MiB
+[ 8/ 747] blk.0.attn_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
+[ 9/ 747] blk.0.ffn_down_exps.weight - [ 2560, 6144, 160, 1], type = bf16, Using custom type iq3_kt for tensor blk.0.ffn_down_exps.weight
+...
+converting to iq3_kt .. [New Thread 0x7f1fd5d446c0 (LWP 301043)]
+...
+[New Thread 0x7f20305f96c0 (LWP 311379)]
+[New Thread 0x7f202fdf86c0 (LWP 311380)]
+
+Thread 12423 "llama-quantize" received signal SIGSEGV, Segmentation fault.
+[Switching to Thread 0x7f1fd5d446c0 (LWP 311369)]
+0x00007ffff77cee9d in (anonymous namespace)::quantize_row_iq3_kt_impl (x=0x7f19767ff010, vy=0x7f1667daa010, n_per_row=2560,
+ quant_weights=0x7fffe16f8010, all_scales=0x7f1c88000b70, all_weights=0x7f1c88000cc0, qtmp=0x7f1c880034d0)
+ at /home/w/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:9179
+9179 for (int j = 0; j < Q::kGroupSize; ++j) *xt++ = q[j];
+[?2004h[?2004l
+[?2004h(gdb) bt
+[?2004l
+#0 0x00007ffff77cee9d in (anonymous namespace)::quantize_row_iq3_kt_impl (x=0x7f19767ff010, vy=0x7f1667daa010, n_per_row=2560,
+ quant_weights=0x7fffe16f8010, all_scales=0x7f1c88000b70, all_weights=0x7f1c88000cc0, qtmp=0x7f1c880034d0)
+ at /home/w/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:9179
+#1 0x00007ffff77cfd24 in quantize_iq3_kt (src=0x7f19767ff010, dst=0x7f1667daa010, nrows=7, n_per_row=2560, imatrix=0x7fffe16f8010)
+ at /home/w/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:9298
+#2 0x00007ffff6c8e4ba in ggml_quantize_chunk (type=GGML_TYPE_IQ3_KT, src=0x7f19767ff010, dst=0x7f1667daa010, start=0, nrows=7, n_per_row=2560,
+ imatrix=0x7fffe16f8010) at /home/w/projects/ik_llama.cpp/ggml/src/ggml.c:24019
+#3 0x00007ffff7ced7e2 in operator() (__closure=0x555557273448) at /home/w/projects/ik_llama.cpp/src/llama.cpp:19839
+#4 0x00007ffff7d19743 in std::__invoke_impl&, int):: >(std::__invoke_other, struct {...} &&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61
+#5 0x00007ffff7d195b3 in std::__invoke&, int):: >(struct {...} &&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:96
+#6 0x00007ffff7d194ba in std::thread::_Invoker&, int):: > >::_M_invoke<0>(std::_Index_tuple<0>) (this=0x555557273448)
+ at /usr/include/c++/13/bits/std_thread.h:292
+#7 0x00007ffff7d19472 in std::thread::_Invoker&, int):: > >::operator()(void) (this=0x555557273448) at /usr/include/c++/13/bits/std_thread.h:299
+#8 0x00007ffff7d19432 in std::thread::_State_impl&, int):: > > >::_M_run(void) (this=0x555557273440)
+ at /usr/include/c++/13/bits/std_thread.h:244
+#9 0x00007ffff68ecdb4 in std::execute_native_thread_routine (__p=0x555557273440) at ../../../../../src/libstdc++-v3/src/c++11/thread.cc:104
+#10 0x00007ffff649caa4 in start_thread (arg=) at ./nptl/pthread_create.c:447
+#11 0x00007ffff6529c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
+[?2004h
+(gdb)
+```
+
+
+
+
+### Name and Version
+
+$ ./build/bin/llama-server --version
+version: 3822 (4e9c78c0)
+built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+
+### What operating system are you seeing the problem on?
+
+Linux
+
+### Relevant log output
+
+```shell
+
+```
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-07-26** at **07:38:06**
+
+I suspect this issue and [#649](https://github.com/ikawrakow/ik_llama.cpp/issues/649) are related and likely caused by blocks filled by zeros in the model weights and/or corresponding imatrix data. In the case of `IQ2_KL` we get NaNs in the quantized model, which then causes the assert in the FA kernel. In the case of `IQ3_KT` we crash already during quantization because the search for the best match fails.
+
+I'll need to add guards against that.
\ No newline at end of file
diff --git a/github-data/issues/651 - Research imatrix on MLA DeepSeekKimi-K2 for attn_k_b and attn_v_b.md b/github-data/issues/651 - Research imatrix on MLA DeepSeekKimi-K2 for attn_k_b and attn_v_b.md
new file mode 100644
index 000000000..76237ccb0
--- /dev/null
+++ b/github-data/issues/651 - Research imatrix on MLA DeepSeekKimi-K2 for attn_k_b and attn_v_b.md
@@ -0,0 +1,93 @@
+## 📌 [Issue #651](https://github.com/ikawrakow/ik_llama.cpp/issues/651) - Research: imatrix on MLA DeepSeek/Kimi-K2 for `attn_k_b` and `attn_v_b`
+
+| **Author** | `ubergarm` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Created** | 2025-07-26 |
+| **Updated** | 2025-07-26 |
+
+---
+
+## 📄 Description
+
+### Previous existing literature and research
+
+We've had some discussions on how to use imatrix on DeepSeek/Kimi-K2 models with MLA to make sure it applies to `attn_k_b` and `attn_v_b` tensors.
+
+When using an imatrix to quantize these messages are expected:
+```
+====== llama_model_quantize_internal: did not find weights for token_embd.weight
+====== llama_model_quantize_internal: did not find weights for blk.0.attn_kv_b.weight
+```
+
+This is fine as the token_embd does not use imatrix data, and the strategy is to set `attn_kv_b` always to Q8_0 as it is only used for PP assuming end user mode of `-mla 3`.
+
+However the following messages *not* okay and mean there maybe was an issue with how imatrix data was collected:
+```bash
+====== llama_model_quantize_internal: did not find weights for blk.0.attn_k_b.weight
+====== llama_model_quantize_internal: did not find weights for blk.0.attn_v_b.weight
+```
+
+Given these tensors are good to quantize to speed up TG, ideally there should be imatrix data there.
+
+In the past I had mistakenly left off `-mla 1` while creating imatrix and so was missing the data for `attn_k_b` and `attn_v_b` oops!
+
+More recently with Kimi-K2 I *did* use `-mla 1` but still was missing data for both tensors. One tensors seems because it is named `attn_k_b.weight (reshaped)`. And `attn_v_b` does not appear in the verbose logs. For now I am just quantizing `attn_kv_b` as designed, but now also `attn_k_b`, and `attn_v_b` to Q8_0 on Kimi-K2-Instruct models to preserve perplexity at the cost of some TG speed.
+
+See attached log files, too long to add into details fold:
+
+[imat-kimi-no-mla.log](https://github.com/user-attachments/files/21441623/imat-kimi-no-mla.log)
+
+[imat-kimi-mla-1.log](https://github.com/user-attachments/files/21441622/imat-kimi-mla-1.log)
+
+A few things I can check would be:
+
+### 1. Try it on DeepSeek-V2-Lite
+So I just tried llama-imatrix on this model and it seems to use that `attn_k_b.weight (reshaped)` name and I don't ever see `attn_v_b`, though when actually quantizing it doesn't throw a message about missing `attn_v_b`.
+
+```
+# llama-quantize
+[ 11/ 431] blk.0.attn_k_b.weight - [ 128, 8192, 1, 1], type = bf16, Using custom type q4_0 for tensor blk.0.attn_k_b.weight
+====== llama_model_quantize_internal: did not find weights for blk.0.attn_k_b.weight
+converting to q4_0 .. size = 2.00 MiB -> 0.56 MiB
+[ 12/ 431] blk.0.attn_v_b.weight - [ 512, 2048, 1, 1], type = bf16, Using custom type q4_0 for tensor blk.0.attn_v_b.weight
+converting to q4_0 .. size = 2.00 MiB -> 0.56 MiB
+```
+
+```
+CUDA_VISIBLE_DEVICES="0" \
+./build/bin/llama-imatrix \
+ -mla 1 \
+ --verbosity 2 \
+ --layer-similarity \
+ -m /mnt/raid/models/ubergarm/DeepSeek-V2-Lite-GGUF/DeepSeek-V2-Lite-64x1.6B-BF16.gguf \
+ -f ubergarm-imatrix-calibration-corpus-v02.txt \
+ -o /tmp/imatrix-tmp.dat \
+ -ngl 99 \
+ --ctx-size 512 \
+ --threads 1
+
+...
+
+collect_imatrix[1]: blk.20.ffn_down_shexp.weight, MUL_MAT, 2816 x 512, 0
+collect_imatrix[1]: blk.21.attn_kv_a_mqa.weight, MUL_MAT, 2048 x 512, 0
+collect_imatrix[1]: blk.21.attn_q.weight, MUL_MAT, 2048 x 512, 0
+collect_imatrix[1]: blk.21.attn_k_b.weight (reshaped), MUL_MAT, 128 x 512, 0
+collect_imatrix[1]: blk.21.attn_output.weight, MUL_MAT, 2048 x 512, 0
+collect_imatrix[1]: blk.21.ffn_gate_inp.weight, MUL_MAT, 2048 x 512, 0
+collect_imatrix[1]: blk.21.ffn_gate_exps.weight, MUL_MAT_ID, 2048 x 512, 0
+collect_imatrix[1]: blk.21.ffn_up_exps.weight, MUL_MAT_ID, 2048 x 512, 0
+collect_imatrix[1]: blk.21.ffn_down_exps.weight, MUL_MAT_ID, 1408 x 512, 0
+collect_imatrix[1]: blk.21.ffn_gate_shexp.weight, MUL_MAT, 2048 x 512, 0
+collect_imatrix[1]: blk.21.ffn_up_shexp.weight, MUL_MAT, 2048 x 512, 0
+collect_imatrix[1]: blk.21.ffn_down_shexp.weight, MUL_MAT, 2816 x 512, 0
+collect_imatrix[1]: blk.22.attn_kv_a_mqa.weight, MUL_MAT, 2048 x 512, 0
+```
+
+---
+
+We've discussed this topic across a number of discussions and PRs hah, here is most recent relevent comments:
+
+## References
+* https://github.com/ikawrakow/ik_llama.cpp/pull/642#issuecomment-3109818995
+* https://github.com/ikawrakow/ik_llama.cpp/issues/601#issuecomment-3070185792
\ No newline at end of file
diff --git a/github-data/issues/655 - Bug warning failed to munlock buffer Cannot allocate memory.md b/github-data/issues/655 - Bug warning failed to munlock buffer Cannot allocate memory.md
new file mode 100644
index 000000000..76825f9dd
--- /dev/null
+++ b/github-data/issues/655 - Bug warning failed to munlock buffer Cannot allocate memory.md
@@ -0,0 +1,115 @@
+## 📌 [Issue #655](https://github.com/ikawrakow/ik_llama.cpp/issues/655) - Bug: warning: failed to munlock buffer: Cannot allocate memory
+
+| **Author** | `magikRUKKOLA` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Created** | 2025-07-26 |
+
+---
+
+## 📄 Description
+
+### What happened?
+
+Any idea what's going on?
+
+
+```
+
+export MALLOC_CONF="background_thread:true,percpu_arena:phycpu,metadata_thp:auto,dirty_decay_ms:10000,muzzy_decay_ms:60000"
+export LD_PRELOAD=/usr/local/lib/libjemalloc.so
+
+ulimit -n 9999
+ulimit -l unlimited
+
+CUDA_VISIBLE_DEVICES="0,1" \
+/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-sweep-bench \
+ --warmup-batch \
+ --model /opt/ubergarm/Kimi-K2-Instruct-GGUF/IQ3_KS/Kimi-K2-Instruct-IQ3_KS-00001-of-00010.gguf \
+ --alias ubergarm/Kimi-K2-Instruct-GGUF_IQ3_KS \
+ --ctx-size $((128 * 1024)) \
+ -b $((32 * 512)) -ub $((16 * 512)) \
+ --mlock \
+ --seed 3407 \
+ --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 --repeat-penalty 1.0 \
+ -ctk q8_0 \
+ -mla 3 -fa \
+ -amb 512 \
+ --main-gpu 1 \
+ --tensor-split 2,8 \
+ --split-mode layer \
+ --override-tensor exps=CPU \
+ --n-gpu-layers 99 \
+ --threads $(grep ^cpu\\scores /proc/cpuinfo | uniq | awk '{print $4}' | xargs -I{} echo "{}-0" | bc) \
+ --host 0.0.0.0 \
+ --port 8080 \
+ --lookup-cache-dynamic /mnt/data/ik_llama.kv.dump
+```
+
+
+### Name and Version
+
+/opt/ik_llama.cpp/ik_llama.cpp/build/bin/llama-sweep-bench --version
+version: 3822 (4e9c78c0)
+built with cc (Debian 14.2.0-19) 14.2.0 for x86_64-linux-gnu
+
+### What operating system are you seeing the problem on?
+
+Linux
+
+### Relevant log output
+
+```shell
+....................................................................................................
+llama_new_context_with_model: n_ctx = 131072
+llama_new_context_with_model: n_batch = 16384
+llama_new_context_with_model: n_ubatch = 8192
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 3
+llama_new_context_with_model: attn_max_b = 512
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 50000.0
+llama_new_context_with_model: freq_scale = 0.03125
+llama_kv_cache_init: CUDA0 KV buffer size = 994.51 MiB
+llama_kv_cache_init: CUDA1 KV buffer size = 3672.02 MiB
+llama_new_context_with_model: KV self size = 4666.50 MiB, c^KV (q8_0): 4666.50 MiB, kv^T: not used
+llama_new_context_with_model: CUDA_Host output buffer size = 0.62 MiB
+llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
+llama_new_context_with_model: CUDA0 compute buffer size = 12432.03 MiB
+llama_new_context_with_model: CUDA1 compute buffer size = 8672.06 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 4320.09 MiB
+llama_new_context_with_model: graph nodes = 13771
+llama_new_context_with_model: graph splits = 231
+
+main: n_kv_max = 131072, n_batch = 16384, n_ubatch = 8192, flash_attn = 1, n_gpu_layers = 99, n_threads = 64, n_threads_batch = 64
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 8192 | 2048 | 0 | 35.383 | 231.52 | 199.517 | 10.26 |
+| 8192 | 2048 | 8192 | 39.030 | 209.89 | 208.009 | 9.85 |
+| 8192 | 2048 | 16384 | 43.585 | 187.96 | 218.848 | 9.36 |
+| 8192 | 2048 | 24576 | 48.235 | 169.84 | 226.935 | 9.02 |
+| 8192 | 2048 | 32768 | 52.680 | 155.50 | 236.670 | 8.65 |
+| 8192 | 2048 | 40960 | 57.465 | 142.56 | 245.538 | 8.34 |
+| 8192 | 2048 | 49152 | 62.096 | 131.93 | 254.381 | 8.05 |
+| 8192 | 2048 | 57344 | 66.846 | 122.55 | 264.281 | 7.75 |
+| 8192 | 2048 | 65536 | 71.637 | 114.35 | 272.132 | 7.53 |
+| 8192 | 2048 | 73728 | 76.372 | 107.26 | 280.351 | 7.31 |
+| 8192 | 2048 | 81920 | 81.235 | 100.84 | 283.917 | 7.21 |
+| 8192 | 2048 | 90112 | 86.135 | 95.11 | 292.227 | 7.01 |
+| 8192 | 2048 | 98304 | 91.048 | 89.97 | 300.119 | 6.82 |
+| 8192 | 2048 | 106496 | 95.891 | 85.43 | 309.025 | 6.63 |
+| 8192 | 2048 | 114688 | 100.902 | 81.19 | 317.808 | 6.44 |
+| 8192 | 2048 | 122880 | 105.924 | 77.34 | 325.710 | 6.29 |
+warning: failed to munlock buffer: Cannot allocate memory
+warning: failed to munlock buffer: Cannot allocate memory
+warning: failed to munlock buffer: Cannot allocate memory
+warning: failed to munlock buffer: Cannot allocate memory
+warning: failed to munlock buffer: Cannot allocate memory
+warning: failed to munlock buffer: Cannot allocate memory
+warning: failed to munlock buffer: Cannot allocate memory
+warning: failed to munlock buffer: Cannot allocate memory
+warning: failed to munlock buffer: Cannot allocate memory
+warning: failed to munlock buffer: Cannot allocate memory
+```
\ No newline at end of file
diff --git a/github-data/issues/658 - Bug Kimi K2 quantization fails with error Unknown model.md b/github-data/issues/658 - Bug Kimi K2 quantization fails with error Unknown model.md
new file mode 100644
index 000000000..707fd02f2
--- /dev/null
+++ b/github-data/issues/658 - Bug Kimi K2 quantization fails with error Unknown model.md
@@ -0,0 +1,136 @@
+## 📌 [Issue #658](https://github.com/ikawrakow/ik_llama.cpp/issues/658) - Bug: Kimi K2 quantization fails with error "Unknown model"
+
+| **Author** | `Lissanro` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Created** | 2025-07-27 |
+| **Updated** | 2025-07-27 |
+
+---
+
+## 📄 Description
+
+### What happened?
+
+I tried to quantize Kimi K2 model from BF16:
+
+```
+~/pkgs/ik_llama.cpp/build/bin/llama-quantize \
+/mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-BF16.gguf \
+/mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-Q6_K.gguf \
+Q6_K
+```
+
+And get this error:
+
+```
+Sorry, uknown model => cannot fix it => bailing out
+```
+
+Maybe support for K2 wasn't added yet for quantization (I tested also with llama.cpp, and it finished quantizing it successfully, even though with a warning "61 of 731 tensor(s) required fallback quantization" due to "tensor cols 128 x 512 are not divisible by 256, required for q6_K" for some of the tensors, so I am assuming by BF16 files are fine).
+
+Also, very minor thing - there is a typo in the error message ("uknown" instead of "unknown").
+
+### Name and Version
+
+> ~/pkgs/ik_llama.cpp/build/bin/llama-cli --version
+version: 3826 (ae0ba31f)
+built with cc (Ubuntu 14.2.0-19ubuntu2) 14.2.0 for x86_64-linux-gnu
+
+### What operating system are you seeing the problem on?
+
+Linux
+
+### Relevant log output
+
+```shell
+> ~/pkgs/ik_llama.cpp/build/bin/llama-quantize /mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-BF16.gguf /mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-Q6_K.gguf Q6_K
+main: build = 3826 (ae0ba31f)
+main: built with cc (Ubuntu 14.2.0-19ubuntu2) 14.2.0 for x86_64-linux-gnu
+main: quantizing '/mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-BF16.gguf' to '/mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-Q6_K.2.gguf' as Q6_K
+llama_model_loader: loaded meta data with 43 key-value pairs and 1096 tensors from /mnt/neuro/models/Kimi-K2-Instruct/Kimi-K2-Instruct-BF16.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Kimi K2 Instruct
+llama_model_loader: - kv 3: general.finetune str = Instruct
+llama_model_loader: - kv 4: general.basename str = Kimi-K2
+llama_model_loader: - kv 5: general.size_label str = 384x14B
+llama_model_loader: - kv 6: general.license str = other
+llama_model_loader: - kv 7: general.license.name str = modified-mit
+llama_model_loader: - kv 8: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 9: deepseek2.context_length u32 = 131072
+llama_model_loader: - kv 10: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 11: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 12: deepseek2.attention.head_count u32 = 64
+llama_model_loader: - kv 13: deepseek2.attention.head_count_kv u32 = 1
+llama_model_loader: - kv 14: deepseek2.rope.freq_base f32 = 50000.000000
+llama_model_loader: - kv 15: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 16: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 17: general.file_type u32 = 32
+llama_model_loader: - kv 18: deepseek2.leading_dense_block_count u32 = 1
+llama_model_loader: - kv 19: deepseek2.vocab_size u32 = 163840
+llama_model_loader: - kv 20: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 21: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 22: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 23: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 24: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 25: deepseek2.expert_count u32 = 384
+llama_model_loader: - kv 26: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 27: deepseek2.expert_weights_scale f32 = 2.827000
+llama_model_loader: - kv 28: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 29: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 30: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 31: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 32: deepseek2.rope.scaling.factor f32 = 32.000000
+llama_model_loader: - kv 33: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 34: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 36: tokenizer.ggml.pre str = kimi-k2
+llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,163842] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,163842] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 163584
+llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 163585
+llama_model_loader: - kv 41: tokenizer.chat_template str = {%- if tools -%}\n <|im_system|>tool_...
+llama_model_loader: - kv 42: general.quantization_version u32 = 2
+llama_model_loader: - type f32: 365 tensors
+llama_model_loader: - type bf16: 731 tensors
+==========================================================================
+Detected incompatible DeepSeek model.
+Will try to fix, but there are no guarantees
+
+*** Your prompt processing speed will be crippled ***
+
+Consider making your own ik_llama.cpp compatible model or
+ask the model provider to make one for you,
+Sorry, uknown model => cannot fix it => bailing out
+```
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-07-27** at **15:55:14**
+
+This is a GGUF that you made yourself, correct?
+
+The self-attention related metadata needs to be the way it comes out of either the `ik_llama.cpp` or the `llama.cpp` `convert_hf_to_gguf.py` conversion script, else it will not work. I'm on mobile, so difficult to sort out exactly what is not correct in your GGUF.
+
+---
+
+👤 **Lissanro** commented on **2025-07-27** at **16:44:09**
+
+Yes, correct, due to limits of my internet connection I could not download premade BF16, only the original FP8, so had to create my own BF16 GGUF.
+
+I have used `convert_hf_to_gguf.py` from [llama.cpp evshiron fork](https://github.com/evshiron/llama.cpp), with the [patch](https://dragon.studio/2025/07/lama.cpp-fp8-to-bf16-patch.diff) applied (the patch is based on the differences between the evshiron fork and the upstream llama.cpp related to the conversion script and Kimi K2 updates).
+
+I ran this command to convert from the original FP8 to BF16:
+
+```
+python3 ~/pkgs/llama.cpp-fp8-to-bf16/llama.cpp/convert_hf_to_gguf.py \
+--outtype bf16 \
+--outfile /mnt/neuro/Kimi-K2-Instruct/Kimi-K2-Instruct-BF16.gguf \
+/mnt/neuro/models/Kimi-K2-Instruct/
+```
+
+Maybe I missed a step or some extra option required to add metadata? But given `convert_hf_to_gguf.py` conversion script in ik_llama.cpp or mainline llama.cpp does not support Kimi K2 original model that comes in FP8 format, could you please clarify what did you mean by "_self-attention related metadata needs to be the way it comes out of either the ik_llama.cpp or the llama.cpp convert_hf_to_gguf.py conversion script_" - do I need to extract this metadata somehow and add it manually, or is it something that needs to be patched in the evshiron's llama.cpp fork? Maybe lack of some metadata is also the reason why mainline's `convert_hf_to_gguf.py` prints warnings when processing my BF16.
\ No newline at end of file
diff --git a/github-data/issues/67 - Feature Request_ Elliminate_reduce unnecessary copies.md b/github-data/issues/67 - Feature Request Elliminatereduce unnecessary copies.md
similarity index 54%
rename from github-data/issues/67 - Feature Request_ Elliminate_reduce unnecessary copies.md
rename to github-data/issues/67 - Feature Request Elliminatereduce unnecessary copies.md
index 07deef4e8..19c6abc14 100644
--- a/github-data/issues/67 - Feature Request_ Elliminate_reduce unnecessary copies.md
+++ b/github-data/issues/67 - Feature Request Elliminatereduce unnecessary copies.md
@@ -1,13 +1,14 @@
-### ✨ [#67](https://github.com/ikawrakow/ik_llama.cpp/issues/67) - Feature Request: Elliminate/reduce unnecessary copies
+## 📌 [Issue #67](https://github.com/ikawrakow/ik_llama.cpp/issues/67) - Feature Request: Elliminate/reduce unnecessary copies
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2024-09-28 |
+| **Labels** | `enhancement` |
---
-#### Description
+## 📄 Description
### Prerequisites
@@ -18,7 +19,7 @@
### Feature Description
-PR #66 does it for Phi-3(.5)-mini, with a non-negligible performance gain on GPUs. Architectures that could potentially benefit from the same optimization are Falcon, DBRX, Starcoder, Bert, Bloom, MPT, Qwen, Phi-2, GPT-2, Codeshell, OpenLM, GPT-Neox, ChatGLM.
+PR [#66](https://github.com/ikawrakow/ik_llama.cpp/issues/66) does it for Phi-3(.5)-mini, with a non-negligible performance gain on GPUs. Architectures that could potentially benefit from the same optimization are Falcon, DBRX, Starcoder, Bert, Bloom, MPT, Qwen, Phi-2, GPT-2, Codeshell, OpenLM, GPT-Neox, ChatGLM.
### Motivation
@@ -26,4 +27,4 @@ Improve performance
### Possible Implementation
-See #66
\ No newline at end of file
+See [#66](https://github.com/ikawrakow/ik_llama.cpp/issues/66)
\ No newline at end of file
diff --git a/github-data/issues/88 - Bug_ Won_t compile on MSVC.md b/github-data/issues/88 - Bug Wont compile on MSVC.md
similarity index 85%
rename from github-data/issues/88 - Bug_ Won_t compile on MSVC.md
rename to github-data/issues/88 - Bug Wont compile on MSVC.md
index 9bcf0eed3..363297a44 100644
--- a/github-data/issues/88 - Bug_ Won_t compile on MSVC.md
+++ b/github-data/issues/88 - Bug Wont compile on MSVC.md
@@ -1,4 +1,4 @@
-### 🐛 [#88](https://github.com/ikawrakow/ik_llama.cpp/issues/88) - Bug: Won't compile on MSVC
+## 📌 [Issue #88](https://github.com/ikawrakow/ik_llama.cpp/issues/88) - Bug: Won't compile on MSVC
| **Author** | `saood06` |
| :--- | :--- |
@@ -8,11 +8,11 @@
---
-#### Description
+## 📄 Description
### What happened?
-As mentioned in #82 this does not compile with MSVC. I was able to get through the issues and make it compile on my machine, no PR right now, but if this issue stays open long enough I will create one with an actual fix.
+As mentioned in [#82](https://github.com/ikawrakow/ik_llama.cpp/issues/82) this does not compile with MSVC. I was able to get through the issues and make it compile on my machine, no PR right now, but if this issue stays open long enough I will create one with an actual fix.
Here's the git diff of the changes I made:
```diff
@@ -107,9 +107,9 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2024-10-15** at **05:52:32**:
+👤 **ikawrakow** commented on **2024-10-15** at **05:52:32**
Thanks for the fix. This is the only issue MSVC has with the 10k+ LOC that I have added? This is a pleasant surprise.
@@ -117,12 +117,12 @@ Please submit a PR. As I don't have the ability to test on Windows, the issue wi
---
-👤 **Nexesenex** commented the **2024-10-17** at **18:48:34**:
+👤 **Nexesenex** commented on **2024-10-17** at **18:48:34**
@saood06 : It worked perfectly for me, thanks.
---
-👤 **ikawrakow** commented the **2024-10-19** at **18:00:25**:
+👤 **ikawrakow** commented on **2024-10-19** at **18:00:25**
-Fixed via #93
\ No newline at end of file
+Fixed via [#93](https://github.com/ikawrakow/ik_llama.cpp/issues/93)
\ No newline at end of file
diff --git a/github-data/issues/92 - Bug_ Quantized KV cache produces garbage in situation where llama.cpp do.md b/github-data/issues/92 - Bug Quantized KV cache produces garbage in situation where llama.cpp does not.md
similarity index 56%
rename from github-data/issues/92 - Bug_ Quantized KV cache produces garbage in situation where llama.cpp do.md
rename to github-data/issues/92 - Bug Quantized KV cache produces garbage in situation where llama.cpp does not.md
index 59a217930..8d95a51e4 100644
--- a/github-data/issues/92 - Bug_ Quantized KV cache produces garbage in situation where llama.cpp do.md
+++ b/github-data/issues/92 - Bug Quantized KV cache produces garbage in situation where llama.cpp does not.md
@@ -1,4 +1,4 @@
-### 🐛 [#92](https://github.com/ikawrakow/ik_llama.cpp/issues/92) - Bug: Quantized KV cache produces garbage in situation where llama.cpp does not
+## 📌 [Issue #92](https://github.com/ikawrakow/ik_llama.cpp/issues/92) - Bug: Quantized KV cache produces garbage in situation where llama.cpp does not
| **Author** | `saood06` |
| :--- | :--- |
@@ -8,7 +8,7 @@
---
-#### Description
+## 📄 Description
### What happened?
@@ -36,9 +36,9 @@ _No response_
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2024-10-18** at **07:35:48**:
+👤 **ikawrakow** commented on **2024-10-18** at **07:35:48**
What happens with `-ctv q4_0` ?
@@ -48,7 +48,7 @@ But if the FA kernel is running on the GPU then I don't know what is happening.
---
-👤 **Nexesenex** commented the **2024-10-18** at **17:16:57**:
+👤 **Nexesenex** commented on **2024-10-18** at **17:16:57**
@saood06 : you can compile with GGML_FA_ALL_QUANTS, and try -ctk q5_1 -ctv q5_0, retain a very decent quality, and check if the phenomena you mention in q8_0 still occurs there. That KV quant works on IK's LlamaCPP, on a Mistral Large (I use it too) quantized with in custom quant based mainly on IQ4_KSS.
@@ -58,20 +58,20 @@ Data are Johannes Gaessler's.
---
-👤 **ikawrakow** commented the **2024-10-19** at **14:24:01**:
+👤 **ikawrakow** commented on **2024-10-19** at **14:24:01**
Judging by PPL and KLD, `-ctk q8_0 -ctv iq4_nl` beats `ctk q5_1 -ctv q5_0` by quite some margin. It uses ~10% more memory for the cache, but inference is slightly faster.
---
-👤 **Nexesenex** commented the **2024-10-19** at **14:31:47**:
+👤 **Nexesenex** commented on **2024-10-19** at **14:31:47**
As far as I know, IQ quants are not available for KVQ cache on Cuda.
ggml\src\ggml-cuda\fattn.cu doesn't contain any reference to them.
---
-👤 **ikawrakow** commented the **2024-10-19** at **14:54:06**:
+👤 **ikawrakow** commented on **2024-10-19** at **14:54:06**
> As far as I know, IQ quants are not available for KVQ cache on Cuda.
@@ -144,80 +144,7 @@ It is on the CPU where `-ctk q8_0 -ctv iq4_nl` is quite a bit faster.
---
-👤 **ikawrakow** commented the **2024-10-19** at **14:54:06**:
-
-> As far as I know, IQ quants are not available for KVQ cache on Cuda.
-
-Have you tried with this repo? `IQ4_NL` is available for KV cache.
-
-
-./bin/llama-perplexity -m llama-3.1-instruct-iq4kss.gguf -f ../tests/wiki.test.raw -t 1 -ngl 100 -fa -ctk q8_0 -ctv iq4_nl
-
-llama_kv_cache_init: CUDA0 KV buffer size = 104.00 MiB
-llama_new_context_with_model: KV self size = 104.00 MiB, K (q8_0): 68.00 MiB, V (iq4_nl): 36.00 MiB
-llama_new_context_with_model: CUDA_Host output buffer size = 1.96 MiB
-llama_new_context_with_model: CUDA0 compute buffer size = 266.50 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
-llama_new_context_with_model: graph nodes = 806
-llama_new_context_with_model: graph splits = 2
-
-system_info: n_threads = 1 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
-perplexity: tokenizing the input ..
-perplexity: tokenization took 142.301 ms
-perplexity: calculating perplexity over 564 chunks, n_ctx=512, batch_size=2048, n_seq=4
-perplexity: 0.64 seconds per pass - ETA 1.50 minutes
-[1]4.2997,[2]5.5078,[3]6.0053,[4]6.3547,[5]6.7268,[6]7.0718,[7]7.4507,[8]7.9743,[9]8.6317,[10]8.8219,[11]8.9887,[12]9.0891,[13]9.4975,[14]9.0932,[15]9.0087,[16]8.7617,[17]8.6791,[18]8.8135,[19]8.5408,[20]8.3868,[21]8.3814,[22]8.0731,[23]7.7960,[24]7.6181,[25]7.3896,[26]7.2858,[27]7.1902,[28]7.1038,[29]7.1737,[30]7.1573,[31]7.1504,[32]7.0949,[33]7.1199,[34]7.1658,[35]7.2052,[36]7.3079,[37]7.2762,[38]7.3041,[39]7.2843,[40]7.2991,[41]7.2945,[42]7.2208,[43]7.2553,[44]7.2124,[45]7.3190,[46]7.3427,[47]7.3360,[48]7.3206,[49]7.2950,[50]7.3607,[51]7.4344,[52]7.4000,[53]7.5139,[54]7.5210,[55]7.5206,[56]7.5684,[57]7.5831,[58]7.5933,[59]7.5393,[60]7.5898,[61]7.6529,[62]7.7069,[63]7.7720,[64]7.8413,[65]7.8296,[66]7.8277,[67]7.8087,[68]7.8351,[69]7.8853,[70]7.9126,[71]7.8960,[72]7.8540,[73]7.8263,[74]7.8362,[75]7.7787,[76]7.7507,[77]7.6992,[78]7.7113,[79]7.7276,[80]7.7405,[81]7.7354,[82]7.7526,[83]7.7624,[84]7.7487,[85]7.7436,[86]7.7346,[87]7.8240,[88]7.8164,[89]7.8366,[90]7.8449,[91]7.8379,[92]7.8383,[93]7.8281,[94]7.8329,[95]7.8276,[96]7.8583,[97]7.8737,[98]7.8766,[99]7.8749,[100]7.8598,[101]7.8589,[102]7.8825,[103]7.9153,[104]7.9718,[105]7.9641,[106]8.0166,[107]8.0411,[108]8.0494,[109]8.0896,[110]8.1375,[111]8.1515,[112]8.1137,[113]8.1044,[114]8.0952,[115]8.0772,[116]8.0743,[117]8.0643,[118]8.0438,[119]8.0250,[120]7.9950,[121]7.9713,[122]7.9441,[123]7.9163,[124]7.8621,[125]7.8154,[126]7.7847,[127]7.7520,[128]7.7520,[129]7.7509,[130]7.7584,[131]7.7590,[132]7.7372,[133]7.7097,[134]7.7157,[135]7.7059,[136]7.7096,[137]7.7207,[138]7.7470,[139]7.7690,[140]7.7515,[141]7.7154,[142]7.6819,[143]7.6296,[144]7.5938,[145]7.5441,[146]7.5113,[147]7.4822,[148]7.4584,[149]7.4323,[150]7.4114,[151]7.3773,[152]7.3468,[153]7.3190,[154]7.2849,[155]7.2581,[156]7.2436,[157]7.2149,[158]7.2123,[159]7.1853,[160]7.1736,[161]7.1930,[162]7.1944,[163]7.2152,[164]7.2231,[165]7.2550,[166]7.2869,[167]7.3102,[168]7.3540,[169]7.3755,[170]7.4089,[171]7.4476,[172]7.4573,[173]7.4620,[174]7.4610,[175]7.4836,[176]7.4910,[177]7.4986,[178]7.5098,[179]7.5079,[180]7.5188,[181]7.5233,[182]7.5326,[183]7.5573,[184]7.5695,[185]7.5824,[186]7.5844,[187]7.6069,[188]7.6232,[189]7.6350,[190]7.6462,[191]7.6373,[192]7.6259,[193]7.6156,[194]7.6106,[195]7.6455,[196]7.6438,[197]7.6479,[198]7.6358,[199]7.6274,[200]7.6108,[201]7.5794,[202]7.5717,[203]7.5371,[204]7.5324,[205]7.5233,[206]7.5085,[207]7.4972,[208]7.5045,[209]7.5122,[210]7.5131,[211]7.4948,[212]7.4682,[213]7.4592,[214]7.4617,[215]7.4480,[216]7.4518,[217]7.4322,[218]7.4167,[219]7.4102,[220]7.4058,[221]7.3850,[222]7.3709,[223]7.3576,[224]7.3493,[225]7.3522,[226]7.3436,[227]7.3198,[228]7.3134,[229]7.3022,[230]7.2868,[231]7.2873,[232]7.2918,[233]7.3000,[234]7.3005,[235]7.3161,[236]7.3193,[237]7.3355,[238]7.3477,[239]7.3573,[240]7.3605,[241]7.3645,[242]7.3793,[243]7.3826,[244]7.4030,[245]7.4255,[246]7.4274,[247]7.4276,[248]7.4372,[249]7.4261,[250]7.3997,[251]7.3887,[252]7.3680,[253]7.3593,[254]7.3584,[255]7.3656,[256]7.3644,[257]7.3653,[258]7.3607,[259]7.3586,[260]7.3505,[261]7.3348,[262]7.3223,[263]7.3181,[264]7.3031,[265]7.3033,[266]7.2874,[267]7.2800,[268]7.2723,[269]7.2663,[270]7.2570,[271]7.2509,[272]7.2522,[273]7.2265,[274]7.2096,[275]7.2139,[276]7.2146,[277]7.2002,[278]7.1953,[279]7.1980,[280]7.2106,[281]7.2210,[282]7.2333,[283]7.2392,[284]7.2417,[285]7.2586,[286]7.2584,[287]7.2669,[288]7.2588,[289]7.2535,[290]7.2528,[291]7.2558,[292]7.2508,[293]7.2517,[294]7.2564,[295]7.2560,[296]7.2576,[297]7.2562,[298]7.2514,[299]7.2561,[300]7.2595,[301]7.2535,[302]7.2462,[303]7.2483,[304]7.2376,[305]7.2403,[306]7.2528,[307]7.2602,[308]7.2602,[309]7.2695,[310]7.2607,[311]7.2611,[312]7.2701,[313]7.2854,[314]7.3038,[315]7.3074,[316]7.3150,[317]7.3100,[318]7.3121,[319]7.3037,[320]7.2952,[321]7.2946,[322]7.2932,[323]7.2850,[324]7.2912,[325]7.2795,[326]7.2812,[327]7.2825,[328]7.2752,[329]7.2690,[330]7.2534,[331]7.2593,[332]7.2568,[333]7.2518,[334]7.2483,[335]7.2343,[336]7.2305,[337]7.2225,[338]7.2168,[339]7.2128,[340]7.2161,[341]7.2155,[342]7.2190,[343]7.2267,[344]7.2382,[345]7.2416,[346]7.2436,[347]7.2470,[348]7.2545,[349]7.2606,[350]7.2634,[351]7.2663,[352]7.2726,[353]7.2941,[354]7.3126,[355]7.3299,[356]7.3420,[357]7.3602,[358]7.3746,[359]7.3920,[360]7.4038,[361]7.4081,[362]7.4218,[363]7.4288,[364]7.4291,[365]7.4385,[366]7.4519,[367]7.4621,[368]7.4703,[369]7.4767,[370]7.4875,[371]7.5017,[372]7.5163,[373]7.5175,[374]7.5126,[375]7.5047,[376]7.5086,[377]7.5259,[378]7.5400,[379]7.5385,[380]7.5353,[381]7.5278,[382]7.5303,[383]7.5364,[384]7.5388,[385]7.5414,[386]7.5440,[387]7.5499,[388]7.5562,[389]7.5588,[390]7.5467,[391]7.5348,[392]7.5273,[393]7.5312,[394]7.5318,[395]7.5288,[396]7.5301,[397]7.5429,[398]7.5402,[399]7.5345,[400]7.5446,[401]7.5432,[402]7.5352,[403]7.5381,[404]7.5355,[405]7.5383,[406]7.5421,[407]7.5421,[408]7.5367,[409]7.5421,[410]7.5333,[411]7.5328,[412]7.5217,[413]7.5218,[414]7.5311,[415]7.5373,[416]7.5389,[417]7.5353,[418]7.5381,[419]7.5325,[420]7.5329,[421]7.5349,[422]7.5320,[423]7.5361,[424]7.5308,[425]7.5165,[426]7.5184,[427]7.5167,[428]7.5118,[429]7.5018,[430]7.5016,[431]7.4934,[432]7.4873,[433]7.4852,[434]7.4847,[435]7.4713,[436]7.4753,[437]7.4711,[438]7.4659,[439]7.4637,[440]7.4614,[441]7.4642,[442]7.4650,[443]7.4803,[444]7.4849,[445]7.4829,[446]7.4803,[447]7.4787,[448]7.4837,[449]7.4830,[450]7.4803,[451]7.4814,[452]7.4877,[453]7.4917,[454]7.4918,[455]7.4949,[456]7.4897,[457]7.4921,[458]7.4799,[459]7.4861,[460]7.4946,[461]7.4923,[462]7.4919,[463]7.4862,[464]7.4903,[465]7.5053,[466]7.5125,[467]7.5117,[468]7.5133,[469]7.5104,[470]7.5090,[471]7.5053,[472]7.4992,[473]7.4918,[474]7.4884,[475]7.4870,[476]7.4857,[477]7.4776,[478]7.4751,[479]7.4695,[480]7.4704,[481]7.4713,[482]7.4749,[483]7.4695,[484]7.4701,[485]7.4656,[486]7.4692,[487]7.4761,[488]7.4784,[489]7.4800,[490]7.4841,[491]7.4816,[492]7.4829,[493]7.4890,[494]7.4904,[495]7.4871,[496]7.4845,[497]7.4849,[498]7.4822,[499]7.4836,[500]7.4811,[501]7.4752,[502]7.4762,[503]7.4787,[504]7.4771,[505]7.4722,[506]7.4737,[507]7.4761,[508]7.4822,[509]7.4786,[510]7.4791,[511]7.4746,[512]7.4771,[513]7.4766,[514]7.4786,[515]7.4771,[516]7.4803,[517]7.4832,[518]7.4780,[519]7.4801,[520]7.4853,[521]7.4877,[522]7.4977,[523]7.4953,[524]7.4886,[525]7.4893,[526]7.4906,[527]7.4942,[528]7.4911,[529]7.4814,[530]7.4713,[531]7.4785,[532]7.4709,[533]7.4653,[534]7.4477,[535]7.4383,[536]7.4371,[537]7.4408,[538]7.4446,[539]7.4431,[540]7.4492,[541]7.4507,[542]7.4566,[543]7.4649,[544]7.4726,[545]7.4720,[546]7.4806,[547]7.4839,[548]7.4732,[549]7.4691,[550]7.4604,[551]7.4618,[552]7.4648,[553]7.4710,[554]7.4731,[555]7.4727,[556]7.4713,[557]7.4646,[558]7.4678,[559]7.4703,[560]7.4750,[561]7.4814,[562]7.4941,[563]7.4882,[564]7.4896,
-Final estimate: PPL = 7.4896 +/- 0.04778
-
-llama_print_timings: load time = 893.21 ms
-llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: prompt eval time = 58848.32 ms / 288768 tokens ( 0.20 ms per token, 4906.99 tokens per second)
-llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: total time = 62023.33 ms / 288769 tokens
-
-
-
-
-./bin/llama-perplexity -m llama-3.1-instruct-iq4kss.gguf -f ../tests/wiki.test.raw -t 1 -ngl 100 -fa -ctk q5_1 -ctv q5_0l
-
-llama_kv_cache_init: CUDA0 KV buffer size = 92.00 MiB
-llama_new_context_with_model: KV self size = 92.00 MiB, K (q5_1): 48.00 MiB, V (q5_0): 44.00 MiB
-llama_new_context_with_model: CUDA_Host output buffer size = 1.96 MiB
-llama_new_context_with_model: CUDA0 compute buffer size = 266.50 MiB
-llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
-llama_new_context_with_model: graph nodes = 806
-llama_new_context_with_model: graph splits = 2
-
-system_info: n_threads = 1 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
-perplexity: tokenizing the input ..
-perplexity: tokenization took 146.542 ms
-perplexity: calculating perplexity over 564 chunks, n_ctx=512, batch_size=2048, n_seq=4
-perplexity: 0.63 seconds per pass - ETA 1.47 minutes
-[1]4.3405,[2]5.5099,[3]6.0314,[4]6.3570,[5]6.7371,[6]7.0748,[7]7.4564,[8]7.9839,[9]8.6380,[10]8.8331,[11]8.9924,[12]9.0971,[13]9.5138,[14]9.1123,[15]9.0236,[16]8.7697,[17]8.6805,[18]8.8077,[19]8.5344,[20]8.3874,[21]8.3766,[22]8.0658,[23]7.7916,[24]7.6164,[25]7.3842,[26]7.2792,[27]7.1844,[28]7.0971,[29]7.1678,[30]7.1542,[31]7.1464,[32]7.0900,[33]7.1167,[34]7.1636,[35]7.2032,[36]7.3050,[37]7.2715,[38]7.3009,[39]7.2807,[40]7.2959,[41]7.2919,[42]7.2182,[43]7.2519,[44]7.2063,[45]7.3137,[46]7.3355,[47]7.3302,[48]7.3152,[49]7.2899,[50]7.3547,[51]7.4284,[52]7.3953,[53]7.5116,[54]7.5166,[55]7.5167,[56]7.5652,[57]7.5822,[58]7.5914,[59]7.5385,[60]7.5890,[61]7.6513,[62]7.7063,[63]7.7700,[64]7.8389,[65]7.8279,[66]7.8246,[67]7.8051,[68]7.8313,[69]7.8818,[70]7.9101,[71]7.8918,[72]7.8497,[73]7.8221,[74]7.8332,[75]7.7752,[76]7.7461,[77]7.6960,[78]7.7099,[79]7.7259,[80]7.7391,[81]7.7337,[82]7.7512,[83]7.7609,[84]7.7474,[85]7.7426,[86]7.7338,[87]7.8233,[88]7.8163,[89]7.8370,[90]7.8458,[91]7.8388,[92]7.8383,[93]7.8300,[94]7.8353,[95]7.8301,[96]7.8611,[97]7.8772,[98]7.8809,[99]7.8802,[100]7.8651,[101]7.8648,[102]7.8886,[103]7.9214,[104]7.9777,[105]7.9696,[106]8.0222,[107]8.0466,[108]8.0554,[109]8.0949,[110]8.1419,[111]8.1558,[112]8.1176,[113]8.1082,[114]8.0990,[115]8.0806,[116]8.0780,[117]8.0684,[118]8.0474,[119]8.0282,[120]7.9980,[121]7.9749,[122]7.9479,[123]7.9206,[124]7.8673,[125]7.8196,[126]7.7894,[127]7.7561,[128]7.7576,[129]7.7565,[130]7.7639,[131]7.7649,[132]7.7434,[133]7.7153,[134]7.7212,[135]7.7118,[136]7.7156,[137]7.7265,[138]7.7522,[139]7.7743,[140]7.7560,[141]7.7191,[142]7.6855,[143]7.6338,[144]7.5982,[145]7.5495,[146]7.5164,[147]7.4873,[148]7.4638,[149]7.4379,[150]7.4171,[151]7.3830,[152]7.3527,[153]7.3248,[154]7.2907,[155]7.2646,[156]7.2502,[157]7.2215,[158]7.2191,[159]7.1921,[160]7.1803,[161]7.2005,[162]7.2022,[163]7.2226,[164]7.2300,[165]7.2621,[166]7.2937,[167]7.3171,[168]7.3609,[169]7.3827,[170]7.4161,[171]7.4551,[172]7.4647,[173]7.4693,[174]7.4683,[175]7.4908,[176]7.4983,[177]7.5060,[178]7.5173,[179]7.5156,[180]7.5265,[181]7.5308,[182]7.5402,[183]7.5648,[184]7.5771,[185]7.5904,[186]7.5931,[187]7.6155,[188]7.6323,[189]7.6442,[190]7.6555,[191]7.6467,[192]7.6355,[193]7.6245,[194]7.6199,[195]7.6545,[196]7.6530,[197]7.6571,[198]7.6448,[199]7.6367,[200]7.6199,[201]7.5885,[202]7.5809,[203]7.5463,[204]7.5411,[205]7.5316,[206]7.5170,[207]7.5058,[208]7.5129,[209]7.5204,[210]7.5212,[211]7.5031,[212]7.4767,[213]7.4675,[214]7.4698,[215]7.4562,[216]7.4600,[217]7.4404,[218]7.4250,[219]7.4179,[220]7.4133,[221]7.3923,[222]7.3785,[223]7.3648,[224]7.3571,[225]7.3594,[226]7.3510,[227]7.3275,[228]7.3214,[229]7.3098,[230]7.2946,[231]7.2951,[232]7.2995,[233]7.3076,[234]7.3078,[235]7.3233,[236]7.3268,[237]7.3430,[238]7.3548,[239]7.3643,[240]7.3674,[241]7.3717,[242]7.3864,[243]7.3901,[244]7.4106,[245]7.4330,[246]7.4352,[247]7.4355,[248]7.4452,[249]7.4339,[250]7.4073,[251]7.3962,[252]7.3755,[253]7.3671,[254]7.3663,[255]7.3735,[256]7.3726,[257]7.3739,[258]7.3696,[259]7.3674,[260]7.3594,[261]7.3435,[262]7.3307,[263]7.3267,[264]7.3116,[265]7.3115,[266]7.2958,[267]7.2883,[268]7.2805,[269]7.2747,[270]7.2653,[271]7.2595,[272]7.2615,[273]7.2360,[274]7.2190,[275]7.2233,[276]7.2242,[277]7.2101,[278]7.2050,[279]7.2078,[280]7.2205,[281]7.2307,[282]7.2432,[283]7.2493,[284]7.2518,[285]7.2689,[286]7.2690,[287]7.2770,[288]7.2688,[289]7.2638,[290]7.2630,[291]7.2657,[292]7.2608,[293]7.2616,[294]7.2664,[295]7.2660,[296]7.2676,[297]7.2660,[298]7.2611,[299]7.2658,[300]7.2691,[301]7.2631,[302]7.2555,[303]7.2575,[304]7.2465,[305]7.2488,[306]7.2614,[307]7.2686,[308]7.2687,[309]7.2781,[310]7.2692,[311]7.2695,[312]7.2790,[313]7.2944,[314]7.3129,[315]7.3165,[316]7.3243,[317]7.3194,[318]7.3214,[319]7.3129,[320]7.3043,[321]7.3035,[322]7.3021,[323]7.2939,[324]7.3000,[325]7.2885,[326]7.2903,[327]7.2916,[328]7.2844,[329]7.2783,[330]7.2623,[331]7.2681,[332]7.2655,[333]7.2606,[334]7.2570,[335]7.2431,[336]7.2394,[337]7.2314,[338]7.2256,[339]7.2216,[340]7.2245,[341]7.2238,[342]7.2271,[343]7.2348,[344]7.2463,[345]7.2496,[346]7.2520,[347]7.2554,[348]7.2628,[349]7.2688,[350]7.2713,[351]7.2740,[352]7.2803,[353]7.3016,[354]7.3198,[355]7.3373,[356]7.3493,[357]7.3675,[358]7.3819,[359]7.3994,[360]7.4108,[361]7.4151,[362]7.4286,[363]7.4356,[364]7.4360,[365]7.4456,[366]7.4591,[367]7.4695,[368]7.4774,[369]7.4839,[370]7.4945,[371]7.5087,[372]7.5233,[373]7.5243,[374]7.5193,[375]7.5113,[376]7.5153,[377]7.5326,[378]7.5468,[379]7.5454,[380]7.5421,[381]7.5349,[382]7.5374,[383]7.5436,[384]7.5462,[385]7.5489,[386]7.5514,[387]7.5573,[388]7.5636,[389]7.5661,[390]7.5540,[391]7.5419,[392]7.5342,[393]7.5382,[394]7.5388,[395]7.5359,[396]7.5373,[397]7.5501,[398]7.5472,[399]7.5416,[400]7.5516,[401]7.5504,[402]7.5425,[403]7.5453,[404]7.5426,[405]7.5454,[406]7.5492,[407]7.5495,[408]7.5442,[409]7.5494,[410]7.5408,[411]7.5404,[412]7.5293,[413]7.5293,[414]7.5384,[415]7.5448,[416]7.5464,[417]7.5428,[418]7.5455,[419]7.5398,[420]7.5403,[421]7.5426,[422]7.5397,[423]7.5441,[424]7.5387,[425]7.5245,[426]7.5265,[427]7.5247,[428]7.5198,[429]7.5097,[430]7.5091,[431]7.5010,[432]7.4949,[433]7.4928,[434]7.4924,[435]7.4790,[436]7.4831,[437]7.4789,[438]7.4740,[439]7.4718,[440]7.4698,[441]7.4727,[442]7.4735,[443]7.4887,[444]7.4934,[445]7.4915,[446]7.4888,[447]7.4874,[448]7.4926,[449]7.4919,[450]7.4893,[451]7.4907,[452]7.4969,[453]7.5009,[454]7.5010,[455]7.5042,[456]7.4990,[457]7.5014,[458]7.4892,[459]7.4954,[460]7.5038,[461]7.5016,[462]7.5014,[463]7.4957,[464]7.4998,[465]7.5148,[466]7.5224,[467]7.5217,[468]7.5232,[469]7.5204,[470]7.5190,[471]7.5152,[472]7.5089,[473]7.5016,[474]7.4983,[475]7.4969,[476]7.4956,[477]7.4874,[478]7.4849,[479]7.4793,[480]7.4800,[481]7.4809,[482]7.4844,[483]7.4791,[484]7.4798,[485]7.4751,[486]7.4786,[487]7.4855,[488]7.4877,[489]7.4894,[490]7.4936,[491]7.4910,[492]7.4924,[493]7.4982,[494]7.4994,[495]7.4962,[496]7.4936,[497]7.4939,[498]7.4913,[499]7.4926,[500]7.4901,[501]7.4841,[502]7.4853,[503]7.4876,[504]7.4860,[505]7.4811,[506]7.4824,[507]7.4848,[508]7.4912,[509]7.4876,[510]7.4882,[511]7.4836,[512]7.4860,[513]7.4854,[514]7.4873,[515]7.4861,[516]7.4892,[517]7.4920,[518]7.4865,[519]7.4887,[520]7.4941,[521]7.4963,[522]7.5060,[523]7.5035,[524]7.4966,[525]7.4972,[526]7.4984,[527]7.5018,[528]7.4988,[529]7.4892,[530]7.4790,[531]7.4862,[532]7.4787,[533]7.4731,[534]7.4555,[535]7.4464,[536]7.4453,[537]7.4493,[538]7.4530,[539]7.4515,[540]7.4573,[541]7.4590,[542]7.4648,[543]7.4733,[544]7.4810,[545]7.4805,[546]7.4891,[547]7.4924,[548]7.4816,[549]7.4777,[550]7.4689,[551]7.4703,[552]7.4733,[553]7.4794,[554]7.4811,[555]7.4806,[556]7.4792,[557]7.4725,[558]7.4756,[559]7.4781,[560]7.4830,[561]7.4896,[562]7.5023,[563]7.4964,[564]7.4978,
-Final estimate: PPL = 7.4978 +/- 0.04775
-
-llama_print_timings: load time = 862.72 ms
-llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: prompt eval time = 58481.55 ms / 288768 tokens ( 0.20 ms per token, 4937.76 tokens per second)
-llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
-llama_print_timings: total time = 61661.41 ms / 288769 tokens
-
-
-
-It is a matter of having `GGML_COPY` available, which I implemented. It is also available in mainline `llama.cpp` CUDA code, except that there someone has disabled it for whatever reason.
-
-I see now that performance on CUDA is pretty much the same:
-
-| model | size | params | backend | ngl | type_k | type_v | fa | test | t/s |
-| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -: | ------------: | ---------------: |
-| llama 8B IQ4_KS - 4.25 bpw | 4.14 GiB | 8.03 B | CUDA | 99 | q5_1 | q5_0 | 1 | pp8192 | 4777.42 ± 3.50 |
-| llama 8B IQ4_KS - 4.25 bpw | 4.14 GiB | 8.03 B | CUDA | 99 | q8_0 | iq4_nl | 1 | pp8192 | 4757.62 ± 2.13 |
-
-It is on the CPU where `-ctk q8_0 -ctv iq4_nl` is quite a bit faster.
-
----
-
-👤 **Nexesenex** commented the **2024-10-19** at **19:44:00**:
+👤 **Nexesenex** commented on **2024-10-19** at **19:44:00**
Well, I can execute PPL tests on both mainline and IK_Llama with V cache in iq4_nl, no problem with that.
@@ -260,51 +187,7 @@ Which makes sense, no FA kernel being available, thus compiled, for such a KV ca
---
-👤 **Nexesenex** commented the **2024-10-19** at **19:44:00**:
-
-Well, I can execute PPL tests on both mainline and IK_Llama with V cache in iq4_xl, no problem with that.
-
-But if I want to use Llama server, or integrate it into my KoboldCPP fork, here's what I get instead of a generation :
-
-
-```
-INFO [ update_slots] kv cache rm [p0, end) | tid="19596" timestamp=1729366649 id_slot=0 id_task=0 p0=0
-INFO [ update_slots] kv cache rm [p0, end) | tid="19596" timestamp=1729366649 id_slot=0 id_task=0 p0=1024
-INFO [ update_slots] kv cache rm [p0, end) | tid="19596" timestamp=1729366650 id_slot=0 id_task=0 p0=2048
-INFO [ update_slots] kv cache rm [p0, end) | tid="19596" timestamp=1729366650 id_slot=0 id_task=0 p0=3072
-Unsupported KV type combination for head_size 128.
-Supported combinations:
- - K == q4_0, V == q4_0, 4.50 BPV
- - K == q8_0, V == q8_0, 8.50 BPV
- - K == f16, V == f16, 16.00 BPV
-Compile with GGML_CUDA_FA_ALL_QUANTS for all combinations of q4_0, q4_1, q5_0, q5_1, q8_0, and f16.
-Q:\GitHub\ik_llama.cpp.fks\ggml\src\ggml-cuda\fattn-common.cuh:576: fatal error
-
-Q:\LLAMA_IK>pause
-Press any key to continue . . .
-```
-
-My fork of KoboldCPP:
-
-```
-Processing Prompt [BLAS] (13200 / 13200 tokens)
-Generating (1 / 512 tokens)Unsupported KV type combination for head_size 128.
-Supported combinations:
- - K == q4_0, V == q4_0, 4.50 BPV
- - K == q8_0, V == q8_0, 8.50 BPV
- - K == f16, V == f16, 16.00 BPV
-Compile with GGML_CUDA_FA_ALL_QUANTS for all combinations of q4_0, q4_1, q5_0, q5_1, q8_0, and f16.
-Q:\GitHub\kobold.cpp\ggml\src\ggml-cuda\fattn-common.cuh:576: fatal error
-
-Q:\Kob\KoboldNew\Dist>pause
-Press any key to continue . . .
-```
-
-Which makes sense, no kernel being compiled for such KV cache.
-
----
-
-👤 **saood06** commented the **2024-10-20** at **07:05:51**:
+👤 **saood06** commented on **2024-10-20** at **07:05:51**
> What happens with `-ctv q4_0` ?
@@ -318,37 +201,19 @@ I also tested `-fa -ctk q8_0 -ctv q4_0 -nkvo`, because I thought maybe putting a
>This is because I have changed the bit arrangement in `Q8_0` when quantization is done during inference, with the result that `Q8_0` cannot be used for V cache when FA is running on the CPU.
-You mentioned this in #76 but as the error above says this is a head size of 128. If that's the case, shouldn't -ctk q8_0 -ctv q8_0 work for this model?
+You mentioned this in [#76](https://github.com/ikawrakow/ik_llama.cpp/issues/76) but as the error above says this is a head size of 128. If that's the case, shouldn't -ctk q8_0 -ctv q8_0 work for this model?
---
-👤 **saood06** commented the **2024-10-20** at **07:05:51**:
-
-> What happens with `-ctv q4_0` ?
-
-`-fa -ctk q8_0 -ctv q4_0` produced the same garbage output.
-
->Is FA running on the GPU or on the CPU?
-
-I don't know. Is it possible that it is running on both given that the KV is allocated per layer? I had to recompile with GGML_CUDA_FA_ALL_QUANTS because initially it gave me the issue of "Unsupported KV type combination for head_size 128. ... fattn-common.cuh:576: fatal error".
-
-I also tested `-fa -ctk q8_0 -ctv q4_0 -nkvo`, because I thought maybe putting all of the KV cache on the CPU would fix it, but this resulted in an even worse output. Instead of something like " to, of for. for" as it did before for Q8_0/Q8_0 and Q8_0/Q4_0. It was spamming was [control_36]. The ten 10 tokens in probs were [control_36],[control_20],[IMG],[control_32],[control_24],[control_16],[control_18],[/INST],[control_22],[MIDDLE], with them all showing a probability of null.
-
->This is because I have changed the bit arrangement in `Q8_0` when quantization is done during inference, with the result that `Q8_0` cannot be used for V cache when FA is running on the CPU.
-
-You mentioned this in #76 but as the error above says this is a head size of 128. If that's the case, shouldn't -ctk q8_0 -ctv q8_0 work for this model?
-
----
-
-👤 **ikawrakow** commented the **2024-10-20** at **09:04:35**:
+👤 **ikawrakow** commented on **2024-10-20** at **09:04:35**
> My fork of KoboldCPP is compiled with the tag FA_ALL_QUANTS, and the KVQ combos I use with the legacy KV quants are all working, iq4_nl is not.
-Yes, sorry, it needed some extra things to also work for TG. See #99 that enables `IQ4_NL` for V-cache when attention head size is 128.
+Yes, sorry, it needed some extra things to also work for TG. See [#99](https://github.com/ikawrakow/ik_llama.cpp/issues/99) that enables `IQ4_NL` for V-cache when attention head size is 128.
---
-👤 **ikawrakow** commented the **2024-10-20** at **09:11:14**:
+👤 **ikawrakow** commented on **2024-10-20** at **09:11:14**
> You mentioned this in https://github.com/ikawrakow/ik_llama.cpp/pull/76 but as the error above says this is a head size of 128. If that's the case, shouldn't -ctk q8_0 -ctv q8_0 work for this model?
@@ -362,7 +227,7 @@ llama_model_loader: - type iq4_ks: 193 tensors
---
-👤 **saood06** commented the **2024-10-20** at **18:41:36**:
+👤 **saood06** commented on **2024-10-20** at **18:41:36**
>The question is why does it work for @Nexesenex with this model?
@@ -381,51 +246,7 @@ Also the other thing I noted don't know if it is at all relevant was running Q8/
---
-👤 **Nexesenex** commented the **2024-10-20** at **21:32:42**:
-
-Well, @ikawrakow, I merged this PR on my KCPP fork and tested with success V IQ4_NL in full offload on Mistral 123b IQ4_3S/IQ4_XS mix with K 8, 5.1, 5.0. K IQ4_NL doesn't seem to work, it produces gibberish. I'm compiling IK Llama right now, to see if there's a difference.
-
-For the compute buffer, that's weird. I'll compare my KCPP with IK LLama on that matter.
-
----
-
-👤 **Nexesenex** commented the **2024-10-20** at **21:46:16**:
-
-Well, @ikawrakow, I merged the IQ4_NL PRs (improve quant speed, and token generation) PR on my KCPP fork and tested with success V IQ4_NL in full offload on Mistral 123b IQ4_3S/IQ4_XS mix with K 8, 5.1, 5.0. K IQ4_NL doesn't seem to work, it produces gibberish. I'm compiling IK Llama right now, to see if there's a difference.
-
-@saood06 :
-
-Full offload with that :
-```
-llama_model_loader: - type f32: 177 tensors
-llama_model_loader: - type q6_K: 89 tensors
-llama_model_loader: - type iq3_xxs: 88 tensors
-llama_model_loader: - type iq3_s: 110 tensors
-llama_model_loader: - type iq4_xs: 331 tensors
-```
-
----
-
-👤 **Nexesenex** commented the **2024-10-20** at **21:48:00**:
-
-Well, I merged the IQ4_NL PRs (improve quant speed, and token generation) PR on my KCPP fork and tested with success V IQ4_NL in full offload on Mistral 123b IQ4_3S/IQ4_XS mix with K 8, 5.1, 5.0. K IQ4_NL doesn't seem to work, it produces gibberish.
-
-I'm compiling IK Llama right now, to see if there's a difference.
-
-Full offload of my Mistral 123b IQ3/IQ4 mix (On KoboldCPP) with that tensor config :
-```
-llama_model_loader: - type f32: 177 tensors
-llama_model_loader: - type q6_K: 89 tensors
-llama_model_loader: - type iq3_xxs: 88 tensors
-llama_model_loader: - type iq3_s: 110 tensors
-llama_model_loader: - type iq4_xs: 331 tensors
-```
-
-I'll load it on IK as soon as it's compiled.
-
----
-
-👤 **Nexesenex** commented the **2024-10-20** at **23:37:25**:
+👤 **Nexesenex** commented on **2024-10-20** at **23:37:25**
Well, I merged the IQ4_NL PRs (improve quant speed, and token generation) on my KCPP fork and tested with success V cache IQ4_NL in full offload on Mistral 123b IQ4_3S/IQ4_XS mix with K cache q8, 5.1, 5.0. K cache IQ4_NL doesn't seem to work, it produces gibberish.
@@ -444,26 +265,7 @@ llama_model_loader: - type iq4_xs: 331 tensors
---
-👤 **Nexesenex** commented the **2024-10-20** at **23:37:25**:
-
-Well, I merged the IQ4_NL PRs (improve quant speed, and token generation) PR on my KCPP fork and tested with success V cache IQ4_NL in full offload on Mistral 123b IQ4_3S/IQ4_XS mix with K cache q8, 5.1, 5.0. K cache IQ4_NL doesn't seem to work, it produces gibberish.
-
-Here's what works for me :
-https://github.com/Nexesenex/croco.cpp/tree/qkv
-
-On IK LLama, I didn't make it work, surprisingly, despite trying 2 different compiling (one with the PR, one with some edits on my branch nex_3).
-
-As for my model, Mistral 123b IQ3/IQ4 mix (On KoboldCPP), it's done with that tensor config :
-
-llama_model_loader: - type f32: 177 tensors
-llama_model_loader: - type q6_K: 89 tensors
-llama_model_loader: - type iq3_xxs: 88 tensors
-llama_model_loader: - type iq3_s: 110 tensors
-llama_model_loader: - type iq4_xs: 331 tensors
-
----
-
-👤 **ikawrakow** commented the **2024-10-21** at **05:43:15**:
+👤 **ikawrakow** commented on **2024-10-21** at **05:43:15**
@Nexesenex
@@ -475,7 +277,7 @@ To make sure I understand correctly: you added the `IQ4_NL` V-cache related chan
---
-👤 **Nexesenex** commented the **2024-10-21** at **06:06:12**:
+👤 **Nexesenex** commented on **2024-10-21** at **06:06:12**
Hey @ikawrakow
@@ -489,21 +291,7 @@ As for adding on KCPP, yes, i've been thorough so it would work. While on IK_L,
---
-👤 **Nexesenex** commented the **2024-10-21** at **06:06:12**:
-
-Hey @ikawrakow
-
-I agree with you, I'm always thinking "wtf" when people are using KV Q4_0 or Q4_1, my daily combo being q5_1/q5_0 when I lacked of VRAM (I don't have patience for less than full offload).
-
--> and I don't really lack of VRAM anymore - I just pushed to 64GB - except for 123b full context, thus the use of V iq4_nl if I want to hit 128k with a smaller quant with the best ratio between model loss and KVQuant loss, I guess I can now go to less than 3.20 PPL 512 for 128k context.
-
-But you have a lot of folks running on such cache still because they have 6-8GB of VRAM and want to run Gemma v2 for example, and if that's not too much of hassle for you to make that dot product, simply switching them on IQ4_NL would grant them a whole 1% of perplexity reduction accordingly to what I tested on Q4_0.
-
-As for adding on KCPP, yes, i've been thorough so it would work. While on IK_L, I just compiled what you offered, failed, made a few edits which "made sense", failed again, and dropped it. I'm sure it works, but I'm missing something I didn't miss on KCPP. Now that I slept, I will try again.
-
----
-
-👤 **Nexesenex** commented the **2024-10-21** at **06:19:14**:
+👤 **Nexesenex** commented on **2024-10-21** at **06:19:14**
Edit:
Fresh as a flower, I recompiled, launched Llama_IK main, and it worked like a charm in generation (K q8_0, V iq4_nl). Dunno what I did different yesterday, but I was exhausted. So forget my report about it not working.
@@ -532,14 +320,7 @@ Is this normal? (I'm sorry if it sounds silly.. I'm no dev ^^)
---
-👤 **Nexesenex** commented the **2024-10-21** at **06:19:14**:
-
-Edit:
-Fresh as a flower, I recompiled, launched Llama_IK main, and it worked like a charm in generation (K q8_0, V iq4_nl). Dunno what I did different yesterday, but I was exhausted. So forget my report about it not working.
-
----
-
-👤 **ikawrakow** commented the **2024-10-21** at **10:12:07**:
+👤 **ikawrakow** commented on **2024-10-21** at **10:12:07**
> You have on IK_L, ggml_tensor * Q = dst->src[1];
>
@@ -552,7 +333,7 @@ Yes, this is a silly typo. `git blame` tells me that this line comes from Johann
---
-👤 **saood06** commented the **2024-10-21** at **17:52:43**:
+👤 **saood06** commented on **2024-10-21** at **17:52:43**
> < 5 bpw K-cache is not very useful (and I think it would be better to disable the `Q4_0 + Q4_0` KV-cache combination as it is way off the mark).
@@ -564,7 +345,7 @@ I would test the no offload case, but I do not have the system resources to do s
---
-👤 **ikawrakow** commented the **2024-10-21** at **18:24:22**:
+👤 **ikawrakow** commented on **2024-10-21** at **18:24:22**
> Going back to my original issue, it seems like it is working for Nexesenex because he is fully offloading it to the GPU and thus only using the FA kernel for the GPU which is the same as llama.cpp.
@@ -572,7 +353,7 @@ But partial offload works for me just fine. I just cannot test with Mistral Larg
---
-👤 **saood06** commented the **2024-10-22** at **01:43:54**:
+👤 **saood06** commented on **2024-10-22** at **01:43:54**
Was able to reproduce the issue with smaller models, it also does not seem to be exclusive to partial offloading, but also affects CPU only inference.
@@ -588,21 +369,7 @@ Edit: I have a theory on what may be the issue will test and report back later.
---
-👤 **saood06** commented the **2024-10-22** at **01:43:54**:
-
-Was able to reproduce the issue with smaller models, it also does not seem to be exclusive to partial offloading, but also affects CPU only inference.
-
-Tested Q8/Q8, Q8/Q4, Q4/Q4 partially offloaded, and Q4/Q4 with no offload at all on this [model](https://huggingface.co/mradermacher/Midnight-Miqu-70B-v1.5-i1-GGUF/tree/main?show_file_info=Midnight-Miqu-70B-v1.5.i1-Q4_K_S.gguf). All resulted in all probs being null. Tested llama.cpp Q8/Q8 and Q4/Q4 with partial offload and all output is coherent and similar to non quantized.
-
-Also CPU only FA, with no KV quant on ik_llama.cpp also resulted in correct output.
-
-Tested a Gemma-2 27B based model as well a bit and resulted in the same null probs output with partial offload. I was unable to compare full offload case as for my system I can fully offload with llama.cpp but ik_llama.cpp has ~500MB larger CUDA0 compute buffer size when fully offloaded vs llama.cpp which prevented me from being able to fully offload.
-
-Mistral Large 2 was the only model where a quantized KV cache resulted in output that was incoherent but still not completely broken, everything else I tested is like the Mistral Large 2 nkvo case where it is all null probs.
-
----
-
-👤 **ikawrakow** commented the **2024-10-22** at **06:33:15**:
+👤 **ikawrakow** commented on **2024-10-22** at **06:33:15**
Well, in my case Miqu and Gemma-27b-Instruct both work fine.
@@ -618,21 +385,7 @@ So, not really sure what happens in your case. Hopefully your theory what might
---
-👤 **ikawrakow** commented the **2024-10-22** at **06:33:15**:
-
-Well, in my case Miqu and Gemma-27b-Instruct both work fine.
-
-Here is Miqu you linked hosted on a CPU with `AVX2`
-
-
-And here is Gemma2-27b hosted on a Zen4 CPU:
-
-
-Both with partial offload as my GPU has only 16 GB VRAM.
-
----
-
-👤 **saood06** commented the **2024-10-22** at **20:35:43**:
+👤 **saood06** commented on **2024-10-22** at **20:35:43**
> Hopefully your theory what might be wrong will find the problem.
@@ -727,13 +480,13 @@ index 6e27c614..61db23ed 100644
---
-👤 **saood06** commented the **2024-10-23** at **04:14:35**:
+👤 **saood06** commented on **2024-10-23** at **04:14:35**
Update, built it with GCC without CUDA, ran FA Q4/Q4 with the long long changes above. Same null probs result. Just realized I forgot to set [this](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-mms-bitfields). Will try again setting that to disable the MS layout for structs later. If this is the issue, then this also should be an easy way for you to reproduce it without having access to a Windows machine.
---
-👤 **saood06** commented the **2024-10-23** at **22:00:48**:
+👤 **saood06** commented on **2024-10-23** at **22:00:48**
Compiled with the flag on GCC and still same null probs result.
@@ -741,13 +494,13 @@ The fact that FA with FP16 KV works, but not anything quantized does narrow the
---
-👤 **ikawrakow** commented the **2025-01-30** at **15:44:38**:
+👤 **ikawrakow** commented on **2025-01-30** at **15:44:38**
@saood06 There have been quite a few changes (and fixes) in the CPU FA implementation since October. Are you still observing the problem?
---
-👤 **saood06** commented the **2025-02-11** at **20:01:35**:
+👤 **saood06** commented on **2025-02-11** at **20:01:35**
> [@saood06](https://github.com/saood06) There have been quite a few changes (and fixes) in the CPU FA implementation since October. Are you still observing the problem?
diff --git a/github-data/pull_requests/1 - Offload Bitnet token embeddings to the GPU.md b/github-data/pull_requests/1 - Offload Bitnet token embeddings to the GPU.md
index 42d06b856..f091bc784 100644
--- a/github-data/pull_requests/1 - Offload Bitnet token embeddings to the GPU.md
+++ b/github-data/pull_requests/1 - Offload Bitnet token embeddings to the GPU.md
@@ -1,14 +1,18 @@
-### 🔀 [#1](https://github.com/ikawrakow/ik_llama.cpp/pull/1) - Offload Bitnet token embeddings to the GPU
+## 🔀 [Pull Request #1](https://github.com/ikawrakow/ik_llama.cpp/pull/1) - Offload Bitnet token embeddings to the GPU
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bitnet_token_embedding_gpu` |
+| **Target Branch** | `main` |
| **Created** | 2024-07-26 |
| **Updated** | 2024-07-26 |
+| **Merged** | 2024-07-26 |
+| **Assignees** | `ikawrakow` |
---
-#### Description
+## 📄 Description
This PR puts the `token_embedding` tensor on the GPU for the Bitnet-1.58b model. This results in a significantly improved performance on CUDA/Metal as can be seen in the table. `CUDA` is for RTX-4080, `Metal` is for a 30-code M2-Max GPU, the host CPU is a Ryzen-7950X for `CUDA`.
diff --git a/github-data/pull_requests/10 - iq4_k_ speedup quantization by a factor of _2.md b/github-data/pull_requests/10 - iq4_k speedup quantization by a factor of 2.md
similarity index 67%
rename from github-data/pull_requests/10 - iq4_k_ speedup quantization by a factor of _2.md
rename to github-data/pull_requests/10 - iq4_k speedup quantization by a factor of 2.md
index 2164c39ed..f98492917 100644
--- a/github-data/pull_requests/10 - iq4_k_ speedup quantization by a factor of _2.md
+++ b/github-data/pull_requests/10 - iq4_k speedup quantization by a factor of 2.md
@@ -1,13 +1,16 @@
-### 🔀 [#10](https://github.com/ikawrakow/ik_llama.cpp/pull/10) - iq4_k: speedup quantization by a factor of ~2
+## 🔀 [Pull Request #10](https://github.com/ikawrakow/ik_llama.cpp/pull/10) - iq4_k: speedup quantization by a factor of ~2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/faster_iq4k_quantize` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-03 |
| **Updated** | 2024-08-03 |
+| **Merged** | 2024-08-03 |
---
-#### Description
+## 📄 Description
It is interesting to observe that `clang` produces code that is ~6X faster than the `GCC` result on a simple benchmark that measures the speed of the `best_index_iq4n` function (which is the bottleneck during `IQ4_K` quantization). But when this is used in practice in `quantize_row_iq4_k_impl_bs16`, the `clang` executable is actually worse than the `GCC` executable. Either way, both compilers need a hand, so this PR gives it to them. This gives us a ~2X speedup in the `IQ4_K` quantization.
\ No newline at end of file
diff --git a/github-data/pull_requests/101 - Enable q6_0 in flash attention.md b/github-data/pull_requests/101 - Enable q6_0 in flash attention.md
index 8b3dfe827..366623f3e 100644
--- a/github-data/pull_requests/101 - Enable q6_0 in flash attention.md
+++ b/github-data/pull_requests/101 - Enable q6_0 in flash attention.md
@@ -1,14 +1,17 @@
-### 🔀 [#101](https://github.com/ikawrakow/ik_llama.cpp/pull/101) - Enable q6_0 in flash attention
+## 🔀 [Pull Request #101](https://github.com/ikawrakow/ik_llama.cpp/pull/101) - Enable q6_0 in flash attention
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fattn_enable_q6_0` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-21 |
| **Updated** | 2024-10-22 |
+| **Merged** | 2024-10-22 |
---
-#### Description
+## 📄 Description
As with `IQ4_NL`, just for head size of 128 for now. Without `GGML_CUDA_FA_ALL_QUANTS` set, only `Q6_0 + Q5_0` and `Q8_0 + Q6_0` are included. With this the VRAM poor have better options for selecting the best possible (as allowed by VRAM, model size, context length) quantized KV-cache from
@@ -24,10 +27,10 @@ As with `IQ4_NL`, just for head size of 128 for now. Without `GGML_CUDA_FA_ALL_Q
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2024-10-21** at **18:14:38**:
+👤 **Nexesenex** commented on **2024-10-21** at **18:14:38**
-Merged in my fork of Kobold CPP. K q6_0 V q5_0 works like a charm. I also activated 16/6, 6/iq4_nl, as well as 8/6 and 6/6, I'll test them tonight or tomorrow.
+Merged in my fork of Kobold CPP. K q6_0 V q5_0 works like a charm. I also activated 16/6, 6/iq4_nl, as well as 8/6 and 6/6, I'll test them tonight or tomorrow. Edit : All the modes activated are working and are coherent in generation.
Thank you (very very much) and congratulation for this, IK, I'm delighted to have those options and thus the best inference quality I can get right now, and I'm gonna release soon an updated version of my fork, with the proper credits of course, so everyone interested and not too scared by downloading my patchwork can enjoy the fruit of your labors on these KV Quants, as some already enjoyed a bit more speed on CPU due to some of your commits that I was able to merge a few months ago!
\ No newline at end of file
diff --git a/github-data/pull_requests/102 - Add support for Granite and GraniteMoE models.md b/github-data/pull_requests/102 - Add support for Granite and GraniteMoE models.md
index 835e5acf3..1b499a6ab 100644
--- a/github-data/pull_requests/102 - Add support for Granite and GraniteMoE models.md
+++ b/github-data/pull_requests/102 - Add support for Granite and GraniteMoE models.md
@@ -1,13 +1,16 @@
-### 🔀 [#102](https://github.com/ikawrakow/ik_llama.cpp/pull/102) - Add support for Granite and GraniteMoE models
+## 🔀 [Pull Request #102](https://github.com/ikawrakow/ik_llama.cpp/pull/102) - Add support for Granite and GraniteMoE models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/add_granite` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-22 |
| **Updated** | 2024-10-22 |
+| **Merged** | 2024-10-22 |
---
-#### Description
+## 📄 Description
On CUDA GraniteMoE-1b suffers from precision issues in the attention portion, so I became curious to see why. One way to avoid the NaNs is to set the precision of the `K*Q` matrix multiplication to `F32`. What also fixes it is to apply the attention scale on `Q` before the `K*Q` multiplication (the solution I went with in this PR). One can apply the scale before or after RoPE. It works in both cases, so this really narrows it down to the `K*Q` multiplication suffering from precision issues when done in `f16`. Strange how these models were trained in the first place.
\ No newline at end of file
diff --git a/github-data/pull_requests/105 - Fix quantized k-cache without FA.md b/github-data/pull_requests/105 - Fix quantized k-cache without FA.md
index 5dae2814f..8c33acd2c 100644
--- a/github-data/pull_requests/105 - Fix quantized k-cache without FA.md
+++ b/github-data/pull_requests/105 - Fix quantized k-cache without FA.md
@@ -1,16 +1,19 @@
-### 🐛 [#105](https://github.com/ikawrakow/ik_llama.cpp/pull/105) - Fix quantized k-cache without FA
+## 🔀 [Pull Request #105](https://github.com/ikawrakow/ik_llama.cpp/pull/105) - Fix quantized k-cache without FA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_quantized_k_cache` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-24 |
| **Updated** | 2024-10-24 |
+| **Merged** | 2024-10-24 |
---
-#### Description
+## 📄 Description
Ref https://github.com/ggerganov/llama.cpp/pull/10032
Ref https://github.com/ggerganov/llama.cpp/pull/10021
-Closes #103
\ No newline at end of file
+Closes [#103](https://github.com/ikawrakow/ik_llama.cpp/issues/103)
\ No newline at end of file
diff --git a/github-data/pull_requests/106 - Bitnet changes.md b/github-data/pull_requests/106 - Bitnet changes.md
index 68848f61d..11589544f 100644
--- a/github-data/pull_requests/106 - Bitnet changes.md
+++ b/github-data/pull_requests/106 - Bitnet changes.md
@@ -1,14 +1,17 @@
-### 🔀 [#106](https://github.com/ikawrakow/ik_llama.cpp/pull/106) - Bitnet changes
+## 🔀 [Pull Request #106](https://github.com/ikawrakow/ik_llama.cpp/pull/106) - Bitnet changes
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/adapt_iq1_iq2_bn` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-24 |
| **Updated** | 2024-10-25 |
+| **Merged** | 2024-10-25 |
---
-#### Description
+## 📄 Description
* Change `IQ1_BN` and `IQ2_BN` to have per row scales. In that way we can handle Bitnet models with and without separate tensor scales
* Remove `IQ1_TN` and `IQ2_TN`. With the above change these are now redundant. `IQ1_BN` and `IQ2_BN` are also faster, so no reason to keep these around
@@ -20,6 +23,6 @@ On CUDA (RTX-4080) we now get 368 t/s for TG-128 with the 3.3B Bitnet model (`IQ
**Update**
-I wasted quite some time trying to figure out why the Bitnet changes don't work on Metal. At the end it turned out that it is PR #98 that breaks the Metal back-end. So, this PR reverts #98.
+I wasted quite some time trying to figure out why the Bitnet changes don't work on Metal. At the end it turned out that it is PR [#98](https://github.com/ikawrakow/ik_llama.cpp/issues/98) that breaks the Metal back-end. So, this PR reverts [#98](https://github.com/ikawrakow/ik_llama.cpp/issues/98).
-@agray3 Do you have the ability to investigate why #98 breaks the Metal back-end?
\ No newline at end of file
+@agray3 Do you have the ability to investigate why [#98](https://github.com/ikawrakow/ik_llama.cpp/issues/98) breaks the Metal back-end?
\ No newline at end of file
diff --git a/github-data/pull_requests/107 - Faster IQ1_BN Metal implementation.md b/github-data/pull_requests/107 - Faster IQ1_BN Metal implementation.md
index ce88abc1e..f2d6dda2e 100644
--- a/github-data/pull_requests/107 - Faster IQ1_BN Metal implementation.md
+++ b/github-data/pull_requests/107 - Faster IQ1_BN Metal implementation.md
@@ -1,14 +1,17 @@
-### 🔀 [#107](https://github.com/ikawrakow/ik_llama.cpp/pull/107) - Faster IQ1_BN Metal implementation
+## 🔀 [Pull Request #107](https://github.com/ikawrakow/ik_llama.cpp/pull/107) - Faster IQ1_BN Metal implementation
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1bn_metal` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-26 |
| **Updated** | 2024-10-26 |
+| **Merged** | 2024-10-26 |
---
-#### Description
+## 📄 Description
On my 30-core M2-Max TG-128 for Bitnet-1.58b-3.3B improves from 82 t/s to 94.7 t/s.
PP-512 goes from 686 t/s to 702 t/s.
diff --git a/github-data/pull_requests/108 - Another Bitnet performance improvement on Metal.md b/github-data/pull_requests/108 - Another Bitnet performance improvement on Metal.md
index e7e1660e2..e7d799e9f 100644
--- a/github-data/pull_requests/108 - Another Bitnet performance improvement on Metal.md
+++ b/github-data/pull_requests/108 - Another Bitnet performance improvement on Metal.md
@@ -1,14 +1,17 @@
-### 🔀 [#108](https://github.com/ikawrakow/ik_llama.cpp/pull/108) - Another Bitnet performance improvement on Metal
+## 🔀 [Pull Request #108](https://github.com/ikawrakow/ik_llama.cpp/pull/108) - Another Bitnet performance improvement on Metal
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bitnet_improve_metal` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-26 |
| **Updated** | 2024-10-26 |
+| **Merged** | 2024-10-26 |
---
-#### Description
+## 📄 Description
This time just the dequantize function.
diff --git a/github-data/pull_requests/109 - Bitnet CUDA improvements.md b/github-data/pull_requests/109 - Bitnet CUDA improvements.md
index 0d0d5e959..e408cb16f 100644
--- a/github-data/pull_requests/109 - Bitnet CUDA improvements.md
+++ b/github-data/pull_requests/109 - Bitnet CUDA improvements.md
@@ -1,14 +1,17 @@
-### 🔀 [#109](https://github.com/ikawrakow/ik_llama.cpp/pull/109) - Bitnet CUDA improvements
+## 🔀 [Pull Request #109](https://github.com/ikawrakow/ik_llama.cpp/pull/109) - Bitnet CUDA improvements
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bitnet_cuda` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-26 |
| **Updated** | 2024-10-26 |
+| **Merged** | 2024-10-26 |
---
-#### Description
+## 📄 Description
`IQ1_BN` TG-128 on RTX-4080 goes to 340 t/s up from 318 t/s.
On the front page the performance listed for `IQ1_BN` on CUDA is 301 t/s, so a pretty nice improvement since then.
\ No newline at end of file
diff --git a/github-data/pull_requests/11 - Faster iq3_k and iq5_k quantization.md b/github-data/pull_requests/11 - Faster iq3_k and iq5_k quantization.md
index ba9b6ce2d..71d51aa33 100644
--- a/github-data/pull_requests/11 - Faster iq3_k and iq5_k quantization.md
+++ b/github-data/pull_requests/11 - Faster iq3_k and iq5_k quantization.md
@@ -1,7 +1,16 @@
-### 🔀 [#11](https://github.com/ikawrakow/ik_llama.cpp/pull/11) - Faster iq3_k and iq5_k quantization
+## 🔀 [Pull Request #11](https://github.com/ikawrakow/ik_llama.cpp/pull/11) - Faster iq3_k and iq5_k quantization
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/faster_iq3_iq5_quantize` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-05 |
-| **Updated** | 2024-08-05 |
\ No newline at end of file
+| **Updated** | 2024-08-05 |
+| **Merged** | 2024-08-05 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/110 - Bitnet_ use the fused mul-silu in the FFN network.md b/github-data/pull_requests/110 - Bitnet use the fused mul-silu in the FFN network.md
similarity index 55%
rename from github-data/pull_requests/110 - Bitnet_ use the fused mul-silu in the FFN network.md
rename to github-data/pull_requests/110 - Bitnet use the fused mul-silu in the FFN network.md
index 846fbe761..c6b635f6b 100644
--- a/github-data/pull_requests/110 - Bitnet_ use the fused mul-silu in the FFN network.md
+++ b/github-data/pull_requests/110 - Bitnet use the fused mul-silu in the FFN network.md
@@ -1,14 +1,17 @@
-### 🔀 [#110](https://github.com/ikawrakow/ik_llama.cpp/pull/110) - Bitnet: use the fused mul-silu in the FFN network
+## 🔀 [Pull Request #110](https://github.com/ikawrakow/ik_llama.cpp/pull/110) - Bitnet: use the fused mul-silu in the FFN network
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bitnet_fused_unary` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-26 |
| **Updated** | 2024-10-26 |
+| **Merged** | 2024-10-26 |
---
-#### Description
+## 📄 Description
I had forgotten that `build_bitnet()` does not use the standerd `llm_build_ffn` function, so the fused mul-silu didn't get used automatically for Bitnet when I added it to llm_build_ffn.
diff --git a/github-data/pull_requests/111 - Use fused mul - unary op also for MoE models.md b/github-data/pull_requests/111 - Use fused mul - unary op also for MoE models.md
index f8f07ac5f..b68c4329a 100644
--- a/github-data/pull_requests/111 - Use fused mul - unary op also for MoE models.md
+++ b/github-data/pull_requests/111 - Use fused mul - unary op also for MoE models.md
@@ -1,13 +1,16 @@
-### 🔀 [#111](https://github.com/ikawrakow/ik_llama.cpp/pull/111) - Use fused mul - unary op also for MoE models
+## 🔀 [Pull Request #111](https://github.com/ikawrakow/ik_llama.cpp/pull/111) - Use fused mul - unary op also for MoE models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/moe_fused_unary` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-26 |
| **Updated** | 2024-10-26 |
+| **Merged** | 2024-10-26 |
---
-#### Description
+## 📄 Description
This gives us a ~1% speedup for MoE models on CUDA and Metal.
\ No newline at end of file
diff --git a/github-data/pull_requests/112 - Faster MoE inference.md b/github-data/pull_requests/112 - Faster MoE inference.md
index 7fc1f81b4..288a0daf2 100644
--- a/github-data/pull_requests/112 - Faster MoE inference.md
+++ b/github-data/pull_requests/112 - Faster MoE inference.md
@@ -1,14 +1,17 @@
-### 🔀 [#112](https://github.com/ikawrakow/ik_llama.cpp/pull/112) - Faster MoE inference
+## 🔀 [Pull Request #112](https://github.com/ikawrakow/ik_llama.cpp/pull/112) - Faster MoE inference
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/multi_add` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-31 |
| **Updated** | 2025-06-23 |
+| **Merged** | 2024-10-31 |
---
-#### Description
+## 📄 Description
This PR
* Adds a new op `GGML_MULTI_ADD` used to sum up the contributions of the selected experts. It results in, e.g., a 7% improvement of token generation speed for Granite-1B-MoE on CUDA (RTX-4080).
@@ -16,9 +19,9 @@ This PR
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2025-06-23** at **12:59:59**:
+👤 **Nexesenex** commented on **2025-06-23** at **12:59:59**
Hey IK.
@@ -35,11 +38,17 @@ Hey IK.
What of the case if expert_used >= 3?
-For example, on Mistral 8x22b, there's a perplexity benefit to use 3 experts instead of 2 (-2% PPL 512).
+For example, on Mixtral 8x22b, there's a perplexity benefit to use 3 experts instead of 2 (-2% PPL 512).
---
-👤 **Nexesenex** commented the **2025-06-23** at **13:08:58**:
+👤 **ikawrakow** commented on **2025-06-23** at **13:05:43**
+
+Well, if it is not 1 or 2, then we handle it via `multi_add`, which handles adding together any number of contributions.
+
+---
+
+👤 **Nexesenex** commented on **2025-06-23** at **13:08:58**
Oh silly me, I just read too fast the code, I understand now.
Sorry!
\ No newline at end of file
diff --git a/github-data/pull_requests/113 - Trellis quantization.md b/github-data/pull_requests/113 - Trellis quantization.md
index 73abdca47..a6fa35350 100644
--- a/github-data/pull_requests/113 - Trellis quantization.md
+++ b/github-data/pull_requests/113 - Trellis quantization.md
@@ -1,14 +1,16 @@
-### 🔀 [#113](https://github.com/ikawrakow/ik_llama.cpp/pull/113) - Trellis quantization
+## 🔀 [Pull Request #113](https://github.com/ikawrakow/ik_llama.cpp/pull/113) - Trellis quantization
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `ik/try_trellis` |
+| **Target Branch** | `main` |
| **Created** | 2024-11-15 |
| **Updated** | 2025-06-01 |
---
-#### Description
+## 📄 Description
The latest quantization hype is `QTIP` - [paper](https://arxiv.org/pdf/2406.11235), [repository](https://github.com/Cornell-RelaxML/qtip). They use a Trellis approach and report impressive results, so I decided to look into this more closely.
@@ -61,9 +63,9 @@ In comparison, I get 194 t/s for `IQ2_KT` (with flash attention enabled, which I
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-04-07** at **03:27:46**:
+👤 **saood06** commented on **2025-04-07** at **03:27:46**
Turboderp was also inspired by QTIP when redoing quantization for their new inference engine found [here](https://github.com/turboderp-org/exllamav3).
@@ -73,7 +75,7 @@ I'm interested and will look into it (maybe when the inference engine matures a
---
-👤 **compilade** commented the **2025-04-07** at **12:17:42**:
+👤 **compilade** commented on **2025-04-07** at **12:17:42**
> There is graphs and more details showing performance of their quants [here](https://github.com/turboderp-org/exllamav3/blob/master/doc/exl3.md).
@@ -85,7 +87,7 @@ Still looks very promising, though!
---
-👤 **saood06** commented the **2025-04-07** at **12:43:17**:
+👤 **saood06** commented on **2025-04-07** at **12:43:17**
> > There is graphs and more details showing performance of their quants [here](https://github.com/turboderp-org/exllamav3/blob/master/doc/exl3.md).
>
@@ -105,7 +107,13 @@ Yes.
---
-👤 **saood06** commented the **2025-04-07** at **13:25:24**:
+👤 **ikawrakow** commented on **2025-04-07** at **12:45:16**
+
+I don't like these plots too much. The y-axis needs to be logarithmic, and it needs to be difference to unquantized, not absolute values (else we are chasing differences between possibly different ways of computing perplexity). Also, they massively overemphasize the low bpw range. If you plot on a log scale, you get a more realistic picture. Either way, yes, trellis quantization can bring a 0.1-0.2 bpw reduction in quantized size for the same model quality. But is there any indication of performance? I could get my implementation here to be reasonably performant on CUDA, but expect the CPU implementation to be a disaster performance wise.
+
+---
+
+👤 **saood06** commented on **2025-04-07** at **13:25:24**
> I don't like these plots too much. The y-axis needs to be logarithmic, and it needs to be difference to unquantized, not absolute values (else we are chasing differences between possibly different ways of computing perplexity). Also, they massively overemphasize the low bpw range. If you plot on a log scale, you get a more realistic picture.
@@ -133,7 +141,15 @@ That is unfortunate.
---
-👤 **saood06** commented the **2025-04-08** at **07:21:43**:
+👤 **ikawrakow** commented on **2025-04-07** at **13:40:35**
+
+> People did say it did offered better KV cache due to the Hadamard transform added [here](https://github.com/turboderp-org/exllamav2/commit/324404ebe4e3c4dd0447ffc1290c312de1df02be) than llama.cpp even if the model quantization was not as good
+
+This is interesting. I have tried Hadamard transforms for model weight quantization because of the claims in the QuIP papers, but I never saw any improvement from it. I haven't tried for KV cache quantization, though.
+
+---
+
+👤 **saood06** commented on **2025-04-08** at **07:21:43**
Also I forgot to mention it but I did mention your PR to the QTIP authors shortly after you made this draft PR. They said "It seems like they didn't bother making the weights Gaussian first (the IP part of QTIP) before quantizing with a Gaussian codebook (3INST)."
@@ -141,7 +157,7 @@ You say in the PR "This generates values that are (nearly) normally distributed.
---
-👤 **ikawrakow** commented the **2025-04-08** at **07:38:55**:
+👤 **ikawrakow** commented on **2025-04-08** at **07:38:55**
It depends on what the QTIP authors mean by "they didn't bother making the weights Gaussian first". If they mean that I did not apply a Hadamard transform first, I did try that (QuIP/QuIP#/QTIP they all insist on applying Hadamard transforms to model weights before quantization), but it did not improve the result in any way. The thing about Hadamard transforms and imatrix is that they do not mix well - one needs a special imatrix for that. But I have also tried this, without much success. If they mean that I have missed something in the 3INST implementation, and hence the generated sequence is not normally distributed, and it would be better otherwise, I cannot confirm that either. I did a lot of Monte Carlo stuff in the past, so I know a thing or two about random number sequences. I tried an implementation that produces a perfect Gaussian distribution (and quite a bit more efficiently than theirs), but that made results worse.
@@ -151,7 +167,7 @@ But do the QTIP authors believe theirs is much better than what I have done? My
---
-👤 **saood06** commented the **2025-04-08** at **08:02:15**:
+👤 **saood06** commented on **2025-04-08** at **08:02:15**
> I was planning to try a sequence that generates quantized values, so CPU inference will be more efficient. But than I started doing other stuff, so that never materialized.
@@ -165,7 +181,7 @@ I don't know, the one line I quoted ("It seems ...") is the only thing they said
---
-👤 **louiehelm** commented the **2025-04-17** at **20:00:44**:
+👤 **louiehelm** commented on **2025-04-17** at **20:00:44**
The Hadamard Bros and other people fixated on rotations aren't doing it primarily to improve LLM weight quantization. It's for eliminating downstream outliers in run-time activations + KV-cache so they can successfully quantize those more aggressively down to 4-bits without scrambling model fidelity.
@@ -175,7 +191,7 @@ There's another way to resolve this besides submitting to the Hadamard cult. [Pr
---
-👤 **saood06** commented the **2025-04-18** at **23:11:20**:
+👤 **saood06** commented on **2025-04-18** at **23:11:20**
> There's another way to resolve this besides submitting to the Hadamard cult.
@@ -183,7 +199,7 @@ The author of ExllamaV3 reported that they will attempt other ideas as well and
---
-👤 **saood06** commented the **2025-04-19** at **11:07:35**:
+👤 **saood06** commented on **2025-04-19** at **11:07:35**
> [PrefixQuant](https://arxiv.org/abs/2410.05265)
@@ -201,21 +217,21 @@ This still sounds useful they reported this took 13 minutes on Llama-3-70B with
---
-👤 **louiehelm** commented the **2025-04-22** at **22:37:09**:
+👤 **louiehelm** commented on **2025-04-22** at **22:37:09**
It's fascinating how well your quants track optimal limits from rate-distortion theory.
-Optimal R(D) = 2^(-2*bitrate)
+Optimal D(R) = 2^(-2*bitrate)

Some of your new quants actually dip down to only ~1.25 bits of overhead.
-That's really good considering "optimal" = infinite codebook (which prob hurt t/s)
+That's really good considering "optimal" = infinite codebook (which prob hurts t/s)
---
-👤 **ikawrakow** commented the **2025-04-23** at **07:01:57**:
+👤 **ikawrakow** commented on **2025-04-23** at **07:01:57**
Where does the equation for the optimal R(D) come from?
@@ -223,7 +239,30 @@ LLaMA-3 requires about ~1 bpw more to achieve the same quantization error compar
---
-👤 **saood06** commented the **2025-04-24** at **00:23:38**:
+👤 **louiehelm** commented on **2025-04-23** at **20:03:59**
+
+Worst-case model weights can be approximated as maximally unpredictable Gaussian data -- essentially what LLMs might become in the limit once they're trained hard enough to reach 100% entropy levels (a full 8.0 bits per byte)
+
+Shannon's rate-distortion function:
+ R(D) = ½ log₂(σ² / D)
+Normalize weights with σ² = 1
+ R(D) = ½ log₂(1 / D)
+Solving for D as function of rate gives:
+ D(R) = 2^(‑2R).
+
+This is a foundational result from information theory. I applied it in this context because you really are making a code to preserve information as well as possible in lower bitrates. This concept is usually deployed to analyze lossy image or audio compression or to design superior analog channel protocols for multiplexing light better in fiber optic cables. But it applies equally in this setting too and it makes sense in retrospect that this limit would bound the maximum efficiency of any LLM quantization algorithms too.
+
+The reason this is so interesting is because usually we use information theory to discuss way more boring distortion proxies like MSE (mean squared error). For an LLM the MSE = (original value - value in quant)^2/(# of parameters). Have you ever investigated what this is for your quants? In any case, I just think it's beautiful seeing LLMs manifest such a clean offset from the optimal-distortion bound on such an abstract ability as being able to faithfully recite Wikipedia.
+
+Also, if I rebasis my prior graph to use the optimal distortion bound as the x-axis and scale the y-axis to represent bits, even other quantization methods seem to pretty cleanly establish relatively consistent gaps off the distortion lower bound, with only minor deviations. [Note: There's likely bit-width overhead from non-weight params in EXL3 that I can't account for so this chart may be ~5% more favorable than reality.]
+
+
+
+And Yes, your coding overhead for Llama-2 was remarkably small and very close to the limit. I grabbed Llama 3 70b and Llama 2 70b to check and entropy in the actual files. It only went up from 6.1 bits per byte --> 6.27 bits per byte. Obviously L3 has more complexity packed into its weights, but in information theoretic terms, there doesn't appear to be a full +1.1 bits per parameter. Maybe that accounts for 30% of the gap? Other 70% may be from Meta engineers just using the full dynamic range of their weights better in L3 vs L2. This undoubtably made training easier for them by making it more stable, but could have had the downstream effect of making the weights harder to quantize for your algorithm, which may have been tuned to expect numeric distributions more similar to L2 weights. Does your new Trellis quant also have a +1.1bit gap between L2 70b and L3 70b?
+
+---
+
+👤 **saood06** commented on **2025-04-24** at **00:23:38**
>essentially what LLMs might become in the limit once they're trained hard enough to reach 100% entropy levels (a full 8.0 bits per byte)
@@ -231,7 +270,7 @@ Only some recent models are trained at FP8 (such as Deepseek V3/R1), they tend t
---
-👤 **saood06** commented the **2025-04-24** at **07:15:28**:
+👤 **saood06** commented on **2025-04-24** at **07:15:28**
Exllama-V3 added cache quantization,
@@ -247,7 +286,7 @@ They also explain their reasoning in an issue copied below:
---
-👤 **ikawrakow** commented the **2025-04-24** at **07:29:50**:
+👤 **ikawrakow** commented on **2025-04-24** at **07:29:50**
> Does your new Trellis quant also have a +1.1bit gap between L2 70b and L3 70b?
@@ -255,8 +294,23 @@ I have not tried it for 70B models. It is too slow for the amount of patience I
---
-👤 **ikawrakow** commented the **2025-04-24** at **08:18:08**:
+👤 **ikawrakow** commented on **2025-04-24** at **08:18:08**
> Worst-case model weights can be approximated as maximally unpredictable Gaussian data -- essentially what LLMs might become in the limit once they're trained hard enough to reach 100% entropy levels
-I'm not sure I can follow. On my book, LLMs only work because there are patterns encoded in the model weights, i.e., the model weights of an LLM are pretty much the opposite of a memoryless signal as required for these equations to hold. We also know that the model weights are definitely not Gaussian, and the so called "outliers" (i.e., weights that do not fall within the expectation of a normal distribution) are more important than the others. Also, the rate distortion equation tells us something about the difference between the signal (model weights) and its approximate representation (quantized model weights), but it tells us nothing about how this will affect observations (predicted token probabilities), which are the result of a complex set of linear and non-linear operations on the signal.
\ No newline at end of file
+I'm not sure I can follow. On my book, LLMs only work because there are patterns encoded in the model weights, i.e., the model weights of an LLM are pretty much the opposite of a memoryless signal as required for these equations to hold. We also know that the model weights are definitely not Gaussian, and the so called "outliers" (i.e., weights that do not fall within the expectation of a normal distribution) are more important than the others. Also, the rate distortion equation tells us something about the difference between the signal (model weights) and its approximate representation (quantized model weights), but it tells us nothing about how this will affect observations (predicted token probabilities), which are the result of a complex set of linear and non-linear operations on the signal.
+
+---
+
+👤 **saood06** commented on **2025-04-28** at **07:56:02**
+
+>The Hadamard Bros and other people fixated on rotations aren't doing it primarily to improve LLM weight quantization. It's for eliminating downstream outliers in run-time activations + KV-cache so they can successfully quantize those more aggressively down to 4-bits without scrambling model fidelity.
+
+
+The latest paper by the bitnet people is literally that: https://arxiv.org/abs/2504.18415
+
+---
+
+👤 **ikawrakow** commented on **2025-06-01** at **12:27:24**
+
+This is superseded by PRs [#441](https://github.com/ikawrakow/ik_llama.cpp/issues/441), [#453](https://github.com/ikawrakow/ik_llama.cpp/issues/453), [#471](https://github.com/ikawrakow/ik_llama.cpp/issues/471), [#475](https://github.com/ikawrakow/ik_llama.cpp/issues/475), [#482](https://github.com/ikawrakow/ik_llama.cpp/issues/482)
\ No newline at end of file
diff --git a/github-data/pull_requests/114 - MMQ Kernel for Q6_0 _pretty please_.md b/github-data/pull_requests/114 - MMQ Kernel for Q6_0 _pretty please_.md
deleted file mode 100644
index 6b917b71b..000000000
--- a/github-data/pull_requests/114 - MMQ Kernel for Q6_0 _pretty please_.md
+++ /dev/null
@@ -1,43 +0,0 @@
-### 🔀 [#114](https://github.com/ikawrakow/ik_llama.cpp/pull/114) - MMQ Kernel for Q6_0 (pretty please!)
-
-| **Author** | `Nexesenex` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-11-20 |
-| **Updated** | 2024-11-20 |
-
----
-
-#### Description
-
-Q6_0 MMQ Kernel attempt.
-
-Of course, if I can reproduce the formatting, compile and run it, I don't understand anything to the maths involved within the main template, and thus, perplexity jumps by a factor 30000 on a pure Q6_0 quant. :D
-
-I used q5_0 as a base.
-
-I know you're not very much into making MMQ Cuda Kernels, but could you please do this one if it's not too bothersome, IK? Qwen2 models are quite popular and good, but their ffn_down tensors have a reversed shape, and thus, need either Q5_1 as a fallback, either Q8_0, which is unsatisfactory in both case for the ratio quality/size of an overall 5-6 bpw quant.
-
-- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
-- Self-reported review complexity:
- - [ ] Low
- - [ ] Medium
- - [x] High (it runs, but perplexity is 200k with force MMQ on a pure Q6_0 Sheared Llama 2 2.7b), instead of the 7-8 expected, and it's way above my league to fix that.
-
----
-
-#### 💬 Conversation
-
-👤 **ikawrakow** submitted a review the **2024-11-20** at **09:24:50**: 💬 `COMMENTED`
-
----
-
-👤 **Nexesenex** submitted a review the **2024-11-20** at **15:21:54**: 💬 `COMMENTED`
-
----
-
-👤 **Nexesenex** commented during a code review the **2024-11-20** at **15:21:54** on `ggml/src/ggml-cuda/mmq.cuh`:
-
-It's hard. Too hard for me still. :)
-
-I don't find a similar template for Q5_0 Cublas in convert.cu, or anything remotely close, so I kept digging if I could find similar and sufficient patterns on another quant, or in common.cuh to have a delta and understand how to transpose. I didn't find what I needed. I am sorry. ^^
\ No newline at end of file
diff --git a/github-data/pull_requests/114 - MMQ Kernel for Q6_0 pretty please.md b/github-data/pull_requests/114 - MMQ Kernel for Q6_0 pretty please.md
new file mode 100644
index 000000000..bd62187ba
--- /dev/null
+++ b/github-data/pull_requests/114 - MMQ Kernel for Q6_0 pretty please.md
@@ -0,0 +1,45 @@
+## 🔀 [Pull Request #114](https://github.com/ikawrakow/ik_llama.cpp/pull/114) - MMQ Kernel for Q6_0 (pretty please!)
+
+| **Author** | `Nexesenex` |
+| :--- | :--- |
+| **State** | ❌ **Closed** |
+| **Source Branch** | `MMQ-Kernel-for-Q6_0` |
+| **Target Branch** | `main` |
+| **Created** | 2024-11-20 |
+| **Updated** | 2024-11-20 |
+
+---
+
+## 📄 Description
+
+Q6_0 MMQ Kernel attempt.
+
+Of course, if I can reproduce the formatting, compile and run it, I don't understand anything to the maths involved within the main template, and thus, perplexity jumps by a factor 30000 on a pure Q6_0 quant. :D
+
+I used q5_0 as a base.
+
+I know you're not very much into making MMQ Cuda Kernels, but could you please do this one if it's not too bothersome, IK? Qwen2 models are quite popular and good, but their ffn_down tensors have a reversed shape, and thus, need either Q5_1 as a fallback, either Q8_0, which is unsatisfactory in both case for the ratio quality/size of an overall 5-6 bpw quant.
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [ ] Low
+ - [ ] Medium
+ - [x] High (it runs, but perplexity is 200k with force MMQ on a pure Q6_0 Sheared Llama 2 2.7b), instead of the 7-8 expected, and it's way above my league to fix that.
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** started a conversation on `ggml/src/ggml-cuda/mmq.cuh` on **2024-11-20** at **09:24:50**
+
+Curious to see if you can get it right. This code is for unpacking 5-bit quants into 8-bit integers (specific to the way `Q5_0/1` pack the bits).
+
+[Here](https://github.com/ikawrakow/ik_llama.cpp/blob/52874c5d21819bd63cc4c500f2fb1be435d16b5e/ggml/src/ggml-cuda/convert.cu#L155) you have the code that unpacks `Q6_0` when the matrix multiplication is done via cuBLAS. Try using the code there to adjust the code here.
+
+> 👤 **Nexesenex** replied on **2024-11-20** at **15:21:54**
+>
+> It's hard. Too hard for me still. :)
+>
+> I don't find a similar template for Q5_0 Cublas in convert.cu, or anything remotely close, so I kept digging if I could find similar and sufficient patterns on another quant, or in common.cuh to have a delta and understand how to transpose. I didn't find what I needed. I am sorry. ^^
+>
+> I basically need an example per "sufficiently similar family of quant" in order to have a chance.
\ No newline at end of file
diff --git a/github-data/pull_requests/115 - MMQ for Q6_0.md b/github-data/pull_requests/115 - MMQ for Q6_0.md
index dfef3bd55..b13f48f26 100644
--- a/github-data/pull_requests/115 - MMQ for Q6_0.md
+++ b/github-data/pull_requests/115 - MMQ for Q6_0.md
@@ -1,14 +1,17 @@
-### 🔀 [#115](https://github.com/ikawrakow/ik_llama.cpp/pull/115) - MMQ for Q6_0
+## 🔀 [Pull Request #115](https://github.com/ikawrakow/ik_llama.cpp/pull/115) - MMQ for Q6_0
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q60_mmq` |
+| **Target Branch** | `main` |
| **Created** | 2024-11-20 |
| **Updated** | 2024-11-21 |
+| **Merged** | 2024-11-21 |
---
-#### Description
+## 📄 Description
Add MMQ kernel for `Q6_0`.
@@ -16,16 +19,16 @@ Add MMQ kernel for `Q6_0`.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2024-11-20** at **19:42:01**:
-
-Tested successfully on IK_LLame, PPL is 0.1% above Q6_K on a pure quant of Sheared Llama 2.7b.
-Thanks IK. I'll play with the Qwen models in the next days.
-
----
-
-👤 **Nexesenex** commented the **2024-11-20** at **19:42:56**:
+👤 **Nexesenex** commented on **2024-11-20** at **19:42:56**
Tested successfully on IK_LLama, PPL is 0.1% above Q6_K on a pure quant of Sheared Llama 2.7b.
-Thanks IK. I'll play with the Qwen models in the next days.
\ No newline at end of file
+Thanks IK. I'll play with the Qwen models in the next days.
+
+Edit : testing right now a Rhys 78b (based on Qwen 2 72b), with Q5_K ftype, attn_v in Q6_K, and the whole ffdown in q6_0/5_1/5_0
+
+Broadly, 5_1 has 0.2% ppl more than 5_0, and 5.0 0.05% ppl more than 6_0.
+q5_1 underperforms q5_0 nowadays in most of my tests on various models. q6_0 replaces it adequately for the models incompatible with Q6_K, Qwen 2 is not "dense" enough to showcase a real benefit but nevertheless it's there.
+
+Thanks again for your fast help, IK.
\ No newline at end of file
diff --git a/github-data/pull_requests/116 - Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_KQ5_K.md b/github-data/pull_requests/116 - Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_KQ5_K.md
new file mode 100644
index 000000000..89b164d56
--- /dev/null
+++ b/github-data/pull_requests/116 - Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_KQ5_K.md
@@ -0,0 +1,26 @@
+## 🔀 [Pull Request #116](https://github.com/ikawrakow/ik_llama.cpp/pull/116) - Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_K/Q5_K
+
+| **Author** | `Nexesenex` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `patch-1` |
+| **Target Branch** | `ik/q60_mmq` |
+| **Created** | 2024-11-20 |
+| **Updated** | 2024-11-21 |
+| **Merged** | 2024-11-21 |
+
+---
+
+## 📄 Description
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [x] Low
+ - [ ] Medium
+ - [ ] High
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** approved this pull request ✅ on **2024-11-21** at **06:12:49**
\ No newline at end of file
diff --git a/github-data/pull_requests/116 - Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_K_Q5_K.md b/github-data/pull_requests/116 - Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_K_Q5_K.md
deleted file mode 100644
index d16bb7ac4..000000000
--- a/github-data/pull_requests/116 - Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_K_Q5_K.md
+++ /dev/null
@@ -1,23 +0,0 @@
-### 🔀 [#116](https://github.com/ikawrakow/ik_llama.cpp/pull/116) - Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_K/Q5_K
-
-| **Author** | `Nexesenex` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-11-20 |
-| **Updated** | 2024-11-21 |
-
----
-
-#### Description
-
-- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
-- Self-reported review complexity:
- - [x] Low
- - [ ] Medium
- - [ ] High
-
----
-
-#### 💬 Conversation
-
-👤 **ikawrakow** submitted a review the **2024-11-21** at **06:12:49**: ✅ `APPROVED`
\ No newline at end of file
diff --git a/github-data/pull_requests/117 - Some minor quant strategies tweaks.md b/github-data/pull_requests/117 - Some minor quant strategies tweaks.md
index 5eaa89fff..e39e146c0 100644
--- a/github-data/pull_requests/117 - Some minor quant strategies tweaks.md
+++ b/github-data/pull_requests/117 - Some minor quant strategies tweaks.md
@@ -1,14 +1,16 @@
-### 🔀 [#117](https://github.com/ikawrakow/ik_llama.cpp/pull/117) - Some minor quant strategies tweaks
+## 🔀 [Pull Request #117](https://github.com/ikawrakow/ik_llama.cpp/pull/117) - Some minor quant strategies tweaks
| **Author** | `Nexesenex` |
| :--- | :--- |
| **State** | ✅ **Open** |
+| **Source Branch** | `quants_strategies_tweaks` |
+| **Target Branch** | `main` |
| **Created** | 2024-11-22 |
| **Updated** | 2024-11-23 |
---
-#### Description
+## 📄 Description
Here's what I'd suggest for starters :
@@ -46,17 +48,17 @@ Further ideas for a subsequent PR :
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2024-11-22** at **15:30:05**:
+👤 **ikawrakow** commented on **2024-11-22** at **15:30:05**
Can you provide some data to support these changes?
---
-👤 **Nexesenex** commented the **2024-11-22** at **16:53:59**:
+👤 **Nexesenex** commented on **2024-11-22** at **16:53:59**
-Not really, IK, i'd have to remake all tests I did during the previous months. I never knew how to log properly LlamaCPP data, so I accumulated knowledge and edits along the way and just restitute you the simplest part of it. I submit that to you in a "trust me bro" fashion because I suppose that you know what I know and then some, and just have more interesting things to do with your skillset than to mess hamster-style with quant strategies like I did since early 2024.
+Not really, IK, i'd have to remake all tests I did during the previous months. I never knew how to log properly (aka. in an automated fashion) LlamaCPP data, so I accumulated knowledge and edits along the way and just restitute you the simplest part of it. I submit that to you in a "trust me bro" fashion because I suppose that you know what I know and then some, and just have more interesting things to do with your skillset than to mess hamster-style with quant strategies like I did since early 2024.
Broadly, there's a few principles that I discovered through your work :
@@ -69,11 +71,11 @@ Broadly, there's a few principles that I discovered through your work :
- ffn_down : basetype +1 as much as possible, especially the first and last eighth of layers, model archs sensitivity are differing vastly for the intermediate layers. Going +1 or +1.5bpw for 1/8 of the layers, instead of +0.5bpw for 3/8 (2 first eights, one last eight or the opposite) of the layers is overkill, especially if the attention tensors are not calibrated for that on the affected layers.
- ffn_gate and up are more tricky, but nevertheless the first / last layers bump applies too, especially since L3 models which are more "dense" than their predecessors.
- embedding and output, the bigger the base weight is, the more you can quantize it, nothing new. High vocab and monolithic embed/output answer to this.
-MOES : 2 experts allow already a bump on the attn tensors, including q and output.
+- MOES : 2 experts allow already a bump on the attn tensors, including q and output.
4 experts should really be treated like 8 experts models, there's no reason at all to discriminate them because they operate the very same (2 experts active), I noticed that on those Pivot/Solar 4 experts model.
-So, without any disrespect, pick what you like, I'm sure that some of it makes sense to you, and ditch what's "too much" for your taste.
+So, without any disrespect, pick what you like, I'm sure that some of it makes sense to you (I often replicate your own quant strategies patterns, for after all, they are based on similar observations.. that you sometimes didn't systematize), and ditch what's "too much" for your taste.
And if you'd like me to go on with the quant strategies, please tell me, I'd be glad to help on something that I actually can grasp and have experience upon.
-Here's for you to eventually get a look on some experiments I made so you can check how far I went : 07ad6c6f321ea3643cff5d38766ce8f13a785bfcmaster_loot_2/
\ No newline at end of file
+Here's for you to eventually get a look on some experiments I made so you can check how far I went : [07ad6c6f321ea3643cff5d38766ce8f13a785bfcmaster_loot_2/](https://github.com/Nexesenex/ik_llama.cpp.fks/commit/07ad6c6f321ea3643cff5d38766ce8f13a785bfc)
\ No newline at end of file
diff --git a/github-data/pull_requests/118 - IQ4_NL_X4.md b/github-data/pull_requests/118 - IQ4_NL_X4.md
index e655b465e..b6e62296a 100644
--- a/github-data/pull_requests/118 - IQ4_NL_X4.md
+++ b/github-data/pull_requests/118 - IQ4_NL_X4.md
@@ -1,14 +1,17 @@
-### 🔀 [#118](https://github.com/ikawrakow/ik_llama.cpp/pull/118) - IQ4_NL_X4
+## 🔀 [Pull Request #118](https://github.com/ikawrakow/ik_llama.cpp/pull/118) - IQ4_NL_X4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_nl_x4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-02 |
| **Updated** | 2024-12-02 |
+| **Merged** | 2024-12-02 |
---
-#### Description
+## 📄 Description
In mainline `llama.cpp` they have added various types where `Q4_0` or `IQ4_NL` are repacked by interleaving quants from 4 or 8 consecutive rows. They get significant improvement in prompt processing speed on `ARM`, so I decided to see if interleaved rows can further improve the `iqk_mul_mat` matrix-matrix multiplication speed.
diff --git a/github-data/pull_requests/119 - Q4_0_R4.md b/github-data/pull_requests/119 - Q4_0_R4.md
index baa726e29..9b1ec5292 100644
--- a/github-data/pull_requests/119 - Q4_0_R4.md
+++ b/github-data/pull_requests/119 - Q4_0_R4.md
@@ -1,16 +1,19 @@
-### 🔀 [#119](https://github.com/ikawrakow/ik_llama.cpp/pull/119) - Q4_0_R4
+## 🔀 [Pull Request #119](https://github.com/ikawrakow/ik_llama.cpp/pull/119) - Q4_0_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q4_0_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-02 |
| **Updated** | 2024-12-02 |
+| **Merged** | 2024-12-02 |
---
-#### Description
+## 📄 Description
-`Q4_0` repacked with 4 interleaved rows as `IQ4_NL_X4` (see PR #118).
+`Q4_0` repacked with 4 interleaved rows as `IQ4_NL_X4` (see PR [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118)).
PP-512 for LLaMA-3.1-8B for `ARM_NEON` (M2-Max), `Zen4` (Ryzen-7950X) and `AVX2` (Risen-5975WX):
diff --git a/github-data/pull_requests/12 - q2_K_ allow it to detect ternary nets and quantize accordingly.md b/github-data/pull_requests/12 - q2_K allow it to detect ternary nets and quantize accordingly.md
similarity index 92%
rename from github-data/pull_requests/12 - q2_K_ allow it to detect ternary nets and quantize accordingly.md
rename to github-data/pull_requests/12 - q2_K allow it to detect ternary nets and quantize accordingly.md
index d51ef8fd6..90c7e8356 100644
--- a/github-data/pull_requests/12 - q2_K_ allow it to detect ternary nets and quantize accordingly.md
+++ b/github-data/pull_requests/12 - q2_K allow it to detect ternary nets and quantize accordingly.md
@@ -1,14 +1,17 @@
-### 🔀 [#12](https://github.com/ikawrakow/ik_llama.cpp/pull/12) - q2_K: allow it to detect ternary nets and quantize accordingly
+## 🔀 [Pull Request #12](https://github.com/ikawrakow/ik_llama.cpp/pull/12) - q2_K: allow it to detect ternary nets and quantize accordingly
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/trinet` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-05 |
| **Updated** | 2024-08-05 |
+| **Merged** | 2024-08-05 |
---
-#### Description
+## 📄 Description
It looks like they have abandoned the Bitnet quants in PR-8151 in `llama.cpp` and are now going for quantization types in blocks of 256 similar to k- and i-quants. This of course removes support for 3B Bitnet (number of columns is not a multiple of 256) without clunky stuff such as padding, so they are going for [TriLM](https://huggingface.co/collections/SpectraSuite/trilms-unpacked-668d5f62afe0f4036925b1d2) instead, being excited about the newly added `TQ1_0` and `TQ2_0` quantizations, and `TQ2_0` being the fastest quant around on `AVX2`. So, I decided to check how it compares to the CPU implementation here.
diff --git a/github-data/pull_requests/120 - Q8_0_R4.md b/github-data/pull_requests/120 - Q8_0_R4.md
index 133e1dd9f..d28c5809f 100644
--- a/github-data/pull_requests/120 - Q8_0_R4.md
+++ b/github-data/pull_requests/120 - Q8_0_R4.md
@@ -1,16 +1,19 @@
-### 🔀 [#120](https://github.com/ikawrakow/ik_llama.cpp/pull/120) - Q8_0_R4
+## 🔀 [Pull Request #120](https://github.com/ikawrakow/ik_llama.cpp/pull/120) - Q8_0_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q8_0_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-03 |
| **Updated** | 2024-12-03 |
+| **Merged** | 2024-12-03 |
---
-#### Description
+## 📄 Description
-Following PR #118, #119: `Q8_0` repacked with 4 interleaved rows.
+Following PR [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119): `Q8_0` repacked with 4 interleaved rows.
PP-512 for LLaMA-3.1-8B for `ARM_NEON` (M2-Max), `Zen4` (Ryzen-7950X) and `AVX2` (Risen-5975WX):
diff --git a/github-data/pull_requests/121 - Q5_0_R4.md b/github-data/pull_requests/121 - Q5_0_R4.md
index 21ab1d72a..7648393cd 100644
--- a/github-data/pull_requests/121 - Q5_0_R4.md
+++ b/github-data/pull_requests/121 - Q5_0_R4.md
@@ -1,16 +1,19 @@
-### 🔀 [#121](https://github.com/ikawrakow/ik_llama.cpp/pull/121) - Q5_0_R4
+## 🔀 [Pull Request #121](https://github.com/ikawrakow/ik_llama.cpp/pull/121) - Q5_0_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q5_0_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-03 |
| **Updated** | 2024-12-03 |
+| **Merged** | 2024-12-03 |
---
-#### Description
+## 📄 Description
-Follow up of #118, #119, #120 for `Q5_0`.
+Follow up of [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119), [#120](https://github.com/ikawrakow/ik_llama.cpp/issues/120) for `Q5_0`.
Here is PP-512 for LLaMA-3.1-8B on `Zen4` (Risen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)
diff --git a/github-data/pull_requests/122 - Q6_0_R4.md b/github-data/pull_requests/122 - Q6_0_R4.md
index 617ec9401..6faca0e50 100644
--- a/github-data/pull_requests/122 - Q6_0_R4.md
+++ b/github-data/pull_requests/122 - Q6_0_R4.md
@@ -1,16 +1,19 @@
-### 🔀 [#122](https://github.com/ikawrakow/ik_llama.cpp/pull/122) - Q6_0_R4
+## 🔀 [Pull Request #122](https://github.com/ikawrakow/ik_llama.cpp/pull/122) - Q6_0_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q6_0_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-03 |
| **Updated** | 2024-12-03 |
+| **Merged** | 2024-12-03 |
---
-#### Description
+## 📄 Description
-Follow up of #118, #119, #120, #121 for `Q6_0`.
+Follow up of [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119), [#120](https://github.com/ikawrakow/ik_llama.cpp/issues/120), [#121](https://github.com/ikawrakow/ik_llama.cpp/issues/121) for `Q6_0`.
Here is PP-512 for LLaMA-3.1-8B on `Zen4` (Risen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)
diff --git a/github-data/pull_requests/123 - IQ4_XS_R4.md b/github-data/pull_requests/123 - IQ4_XS_R4.md
index 907830dad..577e788be 100644
--- a/github-data/pull_requests/123 - IQ4_XS_R4.md
+++ b/github-data/pull_requests/123 - IQ4_XS_R4.md
@@ -1,16 +1,19 @@
-### 🔀 [#123](https://github.com/ikawrakow/ik_llama.cpp/pull/123) - IQ4_XS_R4
+## 🔀 [Pull Request #123](https://github.com/ikawrakow/ik_llama.cpp/pull/123) - IQ4_XS_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_xs_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-04 |
| **Updated** | 2024-12-04 |
+| **Merged** | 2024-12-04 |
---
-#### Description
+## 📄 Description
-Follow up of #118, #119, #120, #121, #122 for `IQ4_XS`.
+Follow up of [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119), [#120](https://github.com/ikawrakow/ik_llama.cpp/issues/120), [#121](https://github.com/ikawrakow/ik_llama.cpp/issues/121), [#122](https://github.com/ikawrakow/ik_llama.cpp/issues/122) for `IQ4_XS`.
I was curious to see if one can make the interleaved rows strategy work for i- and k-quants with their super-blocks & blocks and two levels of scales. `IQ4_XS` seemed easiest, so I tackled that one first. We get a massive speedup on `ARM_NEON` and a more modest (but still significant) gain on `AVX2/Zen4`. I'm not 100% happy with the `Zen4` implementation, but shuffling scale bits for 4 rows at once is tricky, so for now I have settled on a sub-optimal solution.
diff --git a/github-data/pull_requests/124 - iq2_bn_r4_ fastest Bitnet CPU implementation on the planet.md b/github-data/pull_requests/124 - iq2_bn_r4 fastest Bitnet CPU implementation on the planet.md
similarity index 71%
rename from github-data/pull_requests/124 - iq2_bn_r4_ fastest Bitnet CPU implementation on the planet.md
rename to github-data/pull_requests/124 - iq2_bn_r4 fastest Bitnet CPU implementation on the planet.md
index 2af215fee..71fafb79c 100644
--- a/github-data/pull_requests/124 - iq2_bn_r4_ fastest Bitnet CPU implementation on the planet.md
+++ b/github-data/pull_requests/124 - iq2_bn_r4 fastest Bitnet CPU implementation on the planet.md
@@ -1,16 +1,19 @@
-### 🔀 [#124](https://github.com/ikawrakow/ik_llama.cpp/pull/124) - iq2_bn_r4: fastest Bitnet CPU implementation on the planet
+## 🔀 [Pull Request #124](https://github.com/ikawrakow/ik_llama.cpp/pull/124) - iq2_bn_r4: fastest Bitnet CPU implementation on the planet
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_bn_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-06 |
| **Updated** | 2024-12-06 |
+| **Merged** | 2024-12-06 |
---
-#### Description
+## 📄 Description
-In the footsteps of #118, #119, #120, #121, #122, #123, this PR adds `IQ2_BN_R4`, a 4-rows interleaved packing of the 2-bit Bitnet quantization type `IQ2_BN`.
+In the footsteps of [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119), [#120](https://github.com/ikawrakow/ik_llama.cpp/issues/120), [#121](https://github.com/ikawrakow/ik_llama.cpp/issues/121), [#122](https://github.com/ikawrakow/ik_llama.cpp/issues/122), [#123](https://github.com/ikawrakow/ik_llama.cpp/issues/123), this PR adds `IQ2_BN_R4`, a 4-rows interleaved packing of the 2-bit Bitnet quantization type `IQ2_BN`.
Here is `PP-512` for Bitner-1.58b-3B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)
diff --git a/github-data/pull_requests/125 - R4 improvements on ARM_NEON.md b/github-data/pull_requests/125 - R4 improvements on ARM_NEON.md
index b8e047c0f..23e49ba85 100644
--- a/github-data/pull_requests/125 - R4 improvements on ARM_NEON.md
+++ b/github-data/pull_requests/125 - R4 improvements on ARM_NEON.md
@@ -1,14 +1,17 @@
-### 🔀 [#125](https://github.com/ikawrakow/ik_llama.cpp/pull/125) - R4 improvements on ARM_NEON
+## 🔀 [Pull Request #125](https://github.com/ikawrakow/ik_llama.cpp/pull/125) - R4 improvements on ARM_NEON
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/r4_neon` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-08 |
| **Updated** | 2024-12-08 |
+| **Merged** | 2024-12-08 |
---
-#### Description
+## 📄 Description
This PR accomplishes two things:
* Reduces bloat by using a template for the `ARM_NEON` matrix multiplication implementation of interleaved rows quants `Q4_0_R4, Q5_0_R4, Q6_0_R4, IQ4_NL_X4, IQ4_XS_R4, Q8_0_R4` (and I should do the same for `AVX2/Zen4`)
diff --git a/github-data/pull_requests/126 - Rename iq4_nl_x4 to iq4_nl_r4.md b/github-data/pull_requests/126 - Rename iq4_nl_x4 to iq4_nl_r4.md
index 8d3f808e8..70061ce87 100644
--- a/github-data/pull_requests/126 - Rename iq4_nl_x4 to iq4_nl_r4.md
+++ b/github-data/pull_requests/126 - Rename iq4_nl_x4 to iq4_nl_r4.md
@@ -1,14 +1,17 @@
-### 🔀 [#126](https://github.com/ikawrakow/ik_llama.cpp/pull/126) - Rename iq4_nl_x4 to iq4_nl_r4
+## 🔀 [Pull Request #126](https://github.com/ikawrakow/ik_llama.cpp/pull/126) - Rename iq4_nl_x4 to iq4_nl_r4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/rename_iq4_nl_x4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-08 |
| **Updated** | 2024-12-08 |
+| **Merged** | 2024-12-08 |
---
-#### Description
+## 📄 Description
To be consistent with the other quants interleaving 4 rows.
diff --git a/github-data/pull_requests/127 - Q4_0_R4 on CUDA.md b/github-data/pull_requests/127 - Q4_0_R4 on CUDA.md
index 30a201869..772c30e60 100644
--- a/github-data/pull_requests/127 - Q4_0_R4 on CUDA.md
+++ b/github-data/pull_requests/127 - Q4_0_R4 on CUDA.md
@@ -1,13 +1,15 @@
-### 🔀 [#127](https://github.com/ikawrakow/ik_llama.cpp/pull/127) - Q4_0_R4 on CUDA
+## 🔀 [Pull Request #127](https://github.com/ikawrakow/ik_llama.cpp/pull/127) - Q4_0_R4 on CUDA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ✅ **Open** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `ik/cuda_q4_0_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-08 |
| **Updated** | 2025-01-09 |
---
-#### Description
+## 📄 Description
-With the massive improvements in prompt processing speed on the CPU achieved via interleaving 4 tensor rows (see #118, #119, #120, #121, #122, #123, #124), I was curious to see if one can get a good implementation for the `X_R4` quants on CUDA. This PR is a POC that implements CUDA dequantization and matrix x vector multiplication for `Q4_0_R4`. It achieves the same TG speed as `Q4_0`. It was disappointing to not get a speedup via row interleaving, but at least there is no performance regression. To make it a full PR I should also implement quantized matrix x matrix multiplication for `Q4_0_R4` (here it is done via dequantize to `f16` and cuBLAS, so it is slower than `Q4_0` MMQ).
\ No newline at end of file
+With the massive improvements in prompt processing speed on the CPU achieved via interleaving 4 tensor rows (see [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119), [#120](https://github.com/ikawrakow/ik_llama.cpp/issues/120), [#121](https://github.com/ikawrakow/ik_llama.cpp/issues/121), [#122](https://github.com/ikawrakow/ik_llama.cpp/issues/122), [#123](https://github.com/ikawrakow/ik_llama.cpp/issues/123), [#124](https://github.com/ikawrakow/ik_llama.cpp/issues/124)), I was curious to see if one can get a good implementation for the `X_R4` quants on CUDA. This PR is a POC that implements CUDA dequantization and matrix x vector multiplication for `Q4_0_R4`. It achieves the same TG speed as `Q4_0`. It was disappointing to not get a speedup via row interleaving, but at least there is no performance regression. To make it a full PR I should also implement quantized matrix x matrix multiplication for `Q4_0_R4` (here it is done via dequantize to `f16` and cuBLAS, so it is slower than `Q4_0` MMQ).
\ No newline at end of file
diff --git a/github-data/pull_requests/128 - Faster IQ4_XS_R4 on Zen4.md b/github-data/pull_requests/128 - Faster IQ4_XS_R4 on Zen4.md
index f12ca5f84..6a17b4016 100644
--- a/github-data/pull_requests/128 - Faster IQ4_XS_R4 on Zen4.md
+++ b/github-data/pull_requests/128 - Faster IQ4_XS_R4 on Zen4.md
@@ -1,13 +1,16 @@
-### 🔀 [#128](https://github.com/ikawrakow/ik_llama.cpp/pull/128) - Faster IQ4_XS_R4 on Zen4
+## 🔀 [Pull Request #128](https://github.com/ikawrakow/ik_llama.cpp/pull/128) - Faster IQ4_XS_R4 on Zen4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/zen4_iq4_xs_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-08 |
| **Updated** | 2024-12-08 |
+| **Merged** | 2024-12-08 |
---
-#### Description
+## 📄 Description
We now get PP-512(LLaMA-3.1-8B) = 254 t/s on a Ryzen-7950X CPU, up from 224 t/s.
\ No newline at end of file
diff --git a/github-data/pull_requests/129 - Q4_K_R4.md b/github-data/pull_requests/129 - Q4_K_R4.md
index f49421d77..943a78eaf 100644
--- a/github-data/pull_requests/129 - Q4_K_R4.md
+++ b/github-data/pull_requests/129 - Q4_K_R4.md
@@ -1,18 +1,21 @@
-### 🔀 [#129](https://github.com/ikawrakow/ik_llama.cpp/pull/129) - Q4_K_R4
+## 🔀 [Pull Request #129](https://github.com/ikawrakow/ik_llama.cpp/pull/129) - Q4_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q4_k_r4_v2` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-09 |
| **Updated** | 2024-12-09 |
+| **Merged** | 2024-12-09 |
---
-#### Description
+## 📄 Description
-Follow up of #118, #119, #120, #121, #122, #123 for `Q4_K`.
+Follow up of [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119), [#120](https://github.com/ikawrakow/ik_llama.cpp/issues/120), [#121](https://github.com/ikawrakow/ik_llama.cpp/issues/121), [#122](https://github.com/ikawrakow/ik_llama.cpp/issues/122), [#123](https://github.com/ikawrakow/ik_llama.cpp/issues/123) for `Q4_K`.
-After having demonstrated interleaved rows with blocks and super-blocks for `IQ4_XS` in #123, here the corresponding implementation for `Q4_K`. To not have an explosion of quantization types, `Q4_K_R4` corresponds to `Q4_K_S` (and there is no `_R4` variant for `Q4_K_M`).
+After having demonstrated interleaved rows with blocks and super-blocks for `IQ4_XS` in [#123](https://github.com/ikawrakow/ik_llama.cpp/issues/123), here the corresponding implementation for `Q4_K`. To not have an explosion of quantization types, `Q4_K_R4` corresponds to `Q4_K_S` (and there is no `_R4` variant for `Q4_K_M`).
We get a massive speedup on `ARM_NEON` and quite significant gain on `AVX2/Zen4`. The `Zen4` implementation could probably be optimized further. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)
diff --git a/github-data/pull_requests/13 - Adding IQ2_TN for use with ternary models.md b/github-data/pull_requests/13 - Adding IQ2_TN for use with ternary models.md
index 52cdf69dd..a5e21e6cd 100644
--- a/github-data/pull_requests/13 - Adding IQ2_TN for use with ternary models.md
+++ b/github-data/pull_requests/13 - Adding IQ2_TN for use with ternary models.md
@@ -1,14 +1,17 @@
-### 🔀 [#13](https://github.com/ikawrakow/ik_llama.cpp/pull/13) - Adding IQ2_TN for use with ternary models
+## 🔀 [Pull Request #13](https://github.com/ikawrakow/ik_llama.cpp/pull/13) - Adding IQ2_TN for use with ternary models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_tn` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-06 |
| **Updated** | 2024-08-07 |
+| **Merged** | 2024-08-07 |
---
-#### Description
+## 📄 Description
They have abandoned the `Q1_3` and `Q2_2` quants in [PR-8151](https://github.com/ggerganov/llama.cpp/pull/8151) in `llama.cpp`, and have moved on to `TQ1_0` and `TQ2_0`. Like k-quants, these use blocks of 256 weights and utilize `Q8_K` for quantized dot products on the CPU. This removes support for [Bitnet b1.58](https://huggingface.co/1bitLLM/bitnet_b1_58-3B) (unless one adds padding to a multiple of 256), so they are now focussing on the [TriLM models](https://huggingface.co/collections/SpectraSuite/trilms-unpacked-668d5f62afe0f4036925b1d2). Unlike the previous `Q1_3` and `Q2_2`, where the quantized data only holds the ternary `-1/0/+1` values and the tensor scale is added via a separate `ggml_scale` operation, the new `TQ1_0` and `TQ2_0` include a scale in each block of 256. This basically wastes 0.0625 bpw, but has the advantage that one can simply reuse the standard `llama.cpp` computation graphs.
@@ -78,9 +81,9 @@ I have not bothered implementing the MMQ stuff, so CUDA PP performance is via de
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **compilade** commented the **2024-08-06** at **17:00:57**:
+👤 **compilade** commented on **2024-08-06** at **17:00:57**
This is great!
@@ -96,4 +99,18 @@ But if I understand it correctly, both store the packed values in the same order
Do you have plans for `IQ2_TN` to replace `TQ2_0`, or is this something done in parallel to see how fast it can get with better matrix multiplication than lots of dot products?
-Either way, I really appreciate your work on this. This was a pleasant surprise to see in my notifications.
\ No newline at end of file
+Either way, I really appreciate your work on this. This was a pleasant surprise to see in my notifications.
+
+---
+
+👤 **ikawrakow** commented on **2024-08-06** at **18:17:39**
+
+> Does that mean the Metal and CUDA implementations for IQ2_TN would also work for TQ2_0?
+>
+>Do you have plans for IQ2_TN to replace TQ2_0, or is this something done in parallel to see how fast it can get with better matrix multiplication than lots of dot products?
+
+I'm not planning on contributing any of this to the official `llama.cpp` repository. Just hacking for fun.
+
+The Metal implementation should be just a copy/paste operation (and replace `IQ2_TN` with `TQ2_0`, etc.) to add to Metal in `llama.cpp`.
+
+For CUDA I did factor out the dot products of the new quants into a separate file to avoid having to have a coffee each time I touch something there and `mmq.cu / mmvq.cu` needs to be rebuilt. There are only that many coffees I can have in a day. Hence, you will need to rearrange a bit. But other than that, yes, it should just work.
\ No newline at end of file
diff --git a/github-data/pull_requests/130 - Q6_K_R4.md b/github-data/pull_requests/130 - Q6_K_R4.md
index 2cd56150b..f29ccb6ab 100644
--- a/github-data/pull_requests/130 - Q6_K_R4.md
+++ b/github-data/pull_requests/130 - Q6_K_R4.md
@@ -1,16 +1,19 @@
-### 🔀 [#130](https://github.com/ikawrakow/ik_llama.cpp/pull/130) - Q6_K_R4
+## 🔀 [Pull Request #130](https://github.com/ikawrakow/ik_llama.cpp/pull/130) - Q6_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q6_k_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-10 |
| **Updated** | 2024-12-10 |
+| **Merged** | 2024-12-10 |
---
-#### Description
+## 📄 Description
-Follow up of #118, #119, #120, #121, #122, #123, #129 for `Q6_K`.
+Follow up of [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119), [#120](https://github.com/ikawrakow/ik_llama.cpp/issues/120), [#121](https://github.com/ikawrakow/ik_llama.cpp/issues/121), [#122](https://github.com/ikawrakow/ik_llama.cpp/issues/122), [#123](https://github.com/ikawrakow/ik_llama.cpp/issues/123), [#129](https://github.com/ikawrakow/ik_llama.cpp/issues/129) for `Q6_K`.
If nothing else `Q6_K` is routinely used for the output tensor, so having a better `Q6_K` performance would be useful.
diff --git a/github-data/pull_requests/131 - Slightly faster Q4_K_R4 and IQ4_XS_R4 on Zen4.md b/github-data/pull_requests/131 - Slightly faster Q4_K_R4 and IQ4_XS_R4 on Zen4.md
index c2481f5b3..a22ce788f 100644
--- a/github-data/pull_requests/131 - Slightly faster Q4_K_R4 and IQ4_XS_R4 on Zen4.md
+++ b/github-data/pull_requests/131 - Slightly faster Q4_K_R4 and IQ4_XS_R4 on Zen4.md
@@ -1,13 +1,16 @@
-### 🔀 [#131](https://github.com/ikawrakow/ik_llama.cpp/pull/131) - Slightly faster Q4_K_R4 and IQ4_XS_R4 on Zen4
+## 🔀 [Pull Request #131](https://github.com/ikawrakow/ik_llama.cpp/pull/131) - Slightly faster Q4_K_R4 and IQ4_XS_R4 on Zen4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q4_k_r4_v3` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-10 |
| **Updated** | 2024-12-10 |
+| **Merged** | 2024-12-10 |
---
-#### Description
+## 📄 Description
~1-2% speedup.
\ No newline at end of file
diff --git a/github-data/pull_requests/132 - Q5_K_R4.md b/github-data/pull_requests/132 - Q5_K_R4.md
index b07cc6ef6..c7b822263 100644
--- a/github-data/pull_requests/132 - Q5_K_R4.md
+++ b/github-data/pull_requests/132 - Q5_K_R4.md
@@ -1,16 +1,19 @@
-### 🔀 [#132](https://github.com/ikawrakow/ik_llama.cpp/pull/132) - Q5_K_R4
+## 🔀 [Pull Request #132](https://github.com/ikawrakow/ik_llama.cpp/pull/132) - Q5_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q5_k_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-10 |
| **Updated** | 2024-12-10 |
+| **Merged** | 2024-12-10 |
---
-#### Description
+## 📄 Description
-Follow up of #118, #119, #120, #121, #122, #123, #129, #130 for `Q5_K`.
+Follow up of [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119), [#120](https://github.com/ikawrakow/ik_llama.cpp/issues/120), [#121](https://github.com/ikawrakow/ik_llama.cpp/issues/121), [#122](https://github.com/ikawrakow/ik_llama.cpp/issues/122), [#123](https://github.com/ikawrakow/ik_llama.cpp/issues/123), [#129](https://github.com/ikawrakow/ik_llama.cpp/issues/129), [#130](https://github.com/ikawrakow/ik_llama.cpp/issues/130) for `Q5_K`.
We get a large speedup on `ARM_NEON` and non-negligible gains on `AVX2/Zen4`. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)
diff --git a/github-data/pull_requests/134 - Q3_K_R4.md b/github-data/pull_requests/134 - Q3_K_R4.md
index ab4a06bb4..c623e4f55 100644
--- a/github-data/pull_requests/134 - Q3_K_R4.md
+++ b/github-data/pull_requests/134 - Q3_K_R4.md
@@ -1,16 +1,19 @@
-### 🔀 [#134](https://github.com/ikawrakow/ik_llama.cpp/pull/134) - Q3_K_R4
+## 🔀 [Pull Request #134](https://github.com/ikawrakow/ik_llama.cpp/pull/134) - Q3_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q3_k_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-11 |
| **Updated** | 2024-12-11 |
+| **Merged** | 2024-12-11 |
---
-#### Description
+## 📄 Description
-Follow up of #118, #119, #120, #121, #122, #123, #129, #130, #132 for `Q3_K`.
+Follow up of [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119), [#120](https://github.com/ikawrakow/ik_llama.cpp/issues/120), [#121](https://github.com/ikawrakow/ik_llama.cpp/issues/121), [#122](https://github.com/ikawrakow/ik_llama.cpp/issues/122), [#123](https://github.com/ikawrakow/ik_llama.cpp/issues/123), [#129](https://github.com/ikawrakow/ik_llama.cpp/issues/129), [#130](https://github.com/ikawrakow/ik_llama.cpp/issues/130), [#132](https://github.com/ikawrakow/ik_llama.cpp/issues/132) for `Q3_K`.
We get a massive speedup on `ARM_NEON` and non-negligible gains on `AVX2/Zen4`. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)
diff --git a/github-data/pull_requests/135 - Better ARM_NEON implementation for R4 quants.md b/github-data/pull_requests/135 - Better ARM_NEON implementation for R4 quants.md
index c16a9b2a3..a7d942217 100644
--- a/github-data/pull_requests/135 - Better ARM_NEON implementation for R4 quants.md
+++ b/github-data/pull_requests/135 - Better ARM_NEON implementation for R4 quants.md
@@ -1,14 +1,17 @@
-### 🔀 [#135](https://github.com/ikawrakow/ik_llama.cpp/pull/135) - Better ARM_NEON implementation for R4 quants
+## 🔀 [Pull Request #135](https://github.com/ikawrakow/ik_llama.cpp/pull/135) - Better ARM_NEON implementation for R4 quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/arm_better_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-11 |
| **Updated** | 2024-12-11 |
+| **Merged** | 2024-12-11 |
---
-#### Description
+## 📄 Description
We get improved performance for `IQ4_XS_R4`, `Q4_K_R4`, `Q5_K_R4`, `Q6_K_R4`. The trick was to accumulate super-blocks in `int32_t`, thus avoiding expensive `int -> float` conversions.
diff --git a/github-data/pull_requests/136 - Q2_K_R4.md b/github-data/pull_requests/136 - Q2_K_R4.md
index 58c3f83a8..827da8bda 100644
--- a/github-data/pull_requests/136 - Q2_K_R4.md
+++ b/github-data/pull_requests/136 - Q2_K_R4.md
@@ -1,16 +1,19 @@
-### 🔀 [#136](https://github.com/ikawrakow/ik_llama.cpp/pull/136) - Q2_K_R4
+## 🔀 [Pull Request #136](https://github.com/ikawrakow/ik_llama.cpp/pull/136) - Q2_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q2_k_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-11 |
| **Updated** | 2024-12-11 |
+| **Merged** | 2024-12-11 |
---
-#### Description
+## 📄 Description
-Follow up of #118, #119, #120, #121, #122, #123, #129, #130, #132, #134 for `Q2_K`.
+Follow up of [#118](https://github.com/ikawrakow/ik_llama.cpp/issues/118), [#119](https://github.com/ikawrakow/ik_llama.cpp/issues/119), [#120](https://github.com/ikawrakow/ik_llama.cpp/issues/120), [#121](https://github.com/ikawrakow/ik_llama.cpp/issues/121), [#122](https://github.com/ikawrakow/ik_llama.cpp/issues/122), [#123](https://github.com/ikawrakow/ik_llama.cpp/issues/123), [#129](https://github.com/ikawrakow/ik_llama.cpp/issues/129), [#130](https://github.com/ikawrakow/ik_llama.cpp/issues/130), [#132](https://github.com/ikawrakow/ik_llama.cpp/issues/132), [#134](https://github.com/ikawrakow/ik_llama.cpp/issues/134) for `Q2_K`.
This completes R4 implementation for k-quants on `ARM_NEON`, `AVX2`, and `Zen4`.
diff --git a/github-data/pull_requests/137 - Fix AVX2 implementation of iq4_nl_r4.md b/github-data/pull_requests/137 - Fix AVX2 implementation of iq4_nl_r4.md
index c1283ffc0..4750194bc 100644
--- a/github-data/pull_requests/137 - Fix AVX2 implementation of iq4_nl_r4.md
+++ b/github-data/pull_requests/137 - Fix AVX2 implementation of iq4_nl_r4.md
@@ -1,13 +1,16 @@
-### 🐛 [#137](https://github.com/ikawrakow/ik_llama.cpp/pull/137) - Fix AVX2 implementation of iq4_nl_r4
+## 🔀 [Pull Request #137](https://github.com/ikawrakow/ik_llama.cpp/pull/137) - Fix AVX2 implementation of iq4_nl_r4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_avx2_iq4_nl_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-11 |
| **Updated** | 2024-12-11 |
+| **Merged** | 2024-12-11 |
---
-#### Description
+## 📄 Description
The implementation was using `_mm256_maddubs_epi16`, which overflows (and gets saturated) with the unsigned version of the non-linear quants `IQ4_NL` lookup table. This PR fixes it without a noticeable performance loss.
\ No newline at end of file
diff --git a/github-data/pull_requests/138 - IQ4_K_R4.md b/github-data/pull_requests/138 - IQ4_K_R4.md
index e1148705f..933f4c537 100644
--- a/github-data/pull_requests/138 - IQ4_K_R4.md
+++ b/github-data/pull_requests/138 - IQ4_K_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#138](https://github.com/ikawrakow/ik_llama.cpp/pull/138) - IQ4_K_R4
+## 🔀 [Pull Request #138](https://github.com/ikawrakow/ik_llama.cpp/pull/138) - IQ4_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_k_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-12 |
| **Updated** | 2024-12-12 |
+| **Merged** | 2024-12-12 |
---
-#### Description
+## 📄 Description
On to R4 implementation of the new iqk quants.
diff --git a/github-data/pull_requests/139 - Faster R4 quants on Zen4.md b/github-data/pull_requests/139 - Faster R4 quants on Zen4.md
index 9de4b23be..8bb1f6ebe 100644
--- a/github-data/pull_requests/139 - Faster R4 quants on Zen4.md
+++ b/github-data/pull_requests/139 - Faster R4 quants on Zen4.md
@@ -1,16 +1,19 @@
-### 🔀 [#139](https://github.com/ikawrakow/ik_llama.cpp/pull/139) - Faster R4 quants on Zen4
+## 🔀 [Pull Request #139](https://github.com/ikawrakow/ik_llama.cpp/pull/139) - Faster R4 quants on Zen4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/r4_faster_zen4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-13 |
| **Updated** | 2024-12-13 |
+| **Merged** | 2024-12-13 |
---
-#### Description
+## 📄 Description
-Use integer accumulators for dot products within superblocks. I did not use this originally because according to [this Intel reference](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=6440,3715,4851,465,488,6424,488,4200,6554,83,4843,5760,5740,6548,6548,852,3669,6205,6205,3669,3675,5750,6375,6437,3869,2675,2675,3850,3869,2946,2946,308,1741,6044,6073,6585,7030,4851,4874,6196,6068,1741,4760,6077,4236,3667,4236,488,4044,3669,5741,6009,3869,691,5303,3843,3667,4843,110,5743,4772,1741,4046,4044,6077,4860,4860,3715,1866,1866,1866,4044,1863,1866,1866,3707,3715,5114,3667,3667,3667,5831,5738,3669,92,2692,4110,4203,4239,3869,94,853,856,1598,4953,6068,5997,4851,5997,4953,4931,6571,420,5068,488,488,4998,5010,3847,3842,4897,114,6007,4863,4761,6005,6008,3910,882,3921,6008,5002,6007,6598,1159,1159,144,828,486,823,299,337,823,4838,4239,2692,1607,6077,6006,4860,828,486,5704,6007,6007,6009,882,2692,2705,473,6007,3866,6007,4239,114,84,344,6006,5002,3869,5824,4690,143,4874,5234,5251,823,5234,2103,2662,2936,3670,2124,1664,5234,2632,5256,5234,5234,1622,461,1583,2252,4772,823,674,344,5234,2629,4175,5506,5512,5500,6189,6424,2692,2705,2671,5997,4986,679,2943,4960,4990,6068,6059,3667,6068,1750,1753,6189,2962,6053,4949,7003,7021,2930,3667,6077,782,6604,5086,6000,6047,6000,5997,6006,6000,6009,6000,6411,770,2938,4236,2965,6053,1753,1866,463,6050,2932,5798,6050,2932,6050,2930,5997,5053,4953,5994,6000,5056,2962,5056,6053,613,6000,6000,5056,2962,4642,4772,6601,1619,4772,6053,5041,4772&text=_mm256_mullo_epi32) the `_mm256_mullo_epi32()` instruction has an extremely high latency. But given that on `ARM_NEON` the use of integer dot product accumulation resulted in significant performance boost (see #135), I decided to still try. Outcome: it is faster, despite the high latency of the integer multiplication.
+Use integer accumulators for dot products within superblocks. I did not use this originally because according to [this Intel reference](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=6440,3715,4851,465,488,6424,488,4200,6554,83,4843,5760,5740,6548,6548,852,3669,6205,6205,3669,3675,5750,6375,6437,3869,2675,2675,3850,3869,2946,2946,308,1741,6044,6073,6585,7030,4851,4874,6196,6068,1741,4760,6077,4236,3667,4236,488,4044,3669,5741,6009,3869,691,5303,3843,3667,4843,110,5743,4772,1741,4046,4044,6077,4860,4860,3715,1866,1866,1866,4044,1863,1866,1866,3707,3715,5114,3667,3667,3667,5831,5738,3669,92,2692,4110,4203,4239,3869,94,853,856,1598,4953,6068,5997,4851,5997,4953,4931,6571,420,5068,488,488,4998,5010,3847,3842,4897,114,6007,4863,4761,6005,6008,3910,882,3921,6008,5002,6007,6598,1159,1159,144,828,486,823,299,337,823,4838,4239,2692,1607,6077,6006,4860,828,486,5704,6007,6007,6009,882,2692,2705,473,6007,3866,6007,4239,114,84,344,6006,5002,3869,5824,4690,143,4874,5234,5251,823,5234,2103,2662,2936,3670,2124,1664,5234,2632,5256,5234,5234,1622,461,1583,2252,4772,823,674,344,5234,2629,4175,5506,5512,5500,6189,6424,2692,2705,2671,5997,4986,679,2943,4960,4990,6068,6059,3667,6068,1750,1753,6189,2962,6053,4949,7003,7021,2930,3667,6077,782,6604,5086,6000,6047,6000,5997,6006,6000,6009,6000,6411,770,2938,4236,2965,6053,1753,1866,463,6050,2932,5798,6050,2932,6050,2930,5997,5053,4953,5994,6000,5056,2962,5056,6053,613,6000,6000,5056,2962,4642,4772,6601,1619,4772,6053,5041,4772&text=_mm256_mullo_epi32) the `_mm256_mullo_epi32()` instruction has an extremely high latency. But given that on `ARM_NEON` the use of integer dot product accumulation resulted in significant performance boost (see [#135](https://github.com/ikawrakow/ik_llama.cpp/issues/135)), I decided to still try. Outcome: it is faster, despite the high latency of the integer multiplication.
Here PP-512 and TG-128 measurements for LLaMA-3.1-8B on Zen4 (Ryzen-7950X CPU):
diff --git a/github-data/pull_requests/14 - Adding IQ6_K.md b/github-data/pull_requests/14 - Adding IQ6_K.md
index 2068dbf60..a5230b9ad 100644
--- a/github-data/pull_requests/14 - Adding IQ6_K.md
+++ b/github-data/pull_requests/14 - Adding IQ6_K.md
@@ -1,23 +1,26 @@
-### 🔀 [#14](https://github.com/ikawrakow/ik_llama.cpp/pull/14) - Adding IQ6_K
+## 🔀 [Pull Request #14](https://github.com/ikawrakow/ik_llama.cpp/pull/14) - Adding IQ6_K
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq6_k` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-09 |
| **Updated** | 2024-08-09 |
+| **Merged** | 2024-08-09 |
---
-#### Description
+## 📄 Description
This PR
-* Adds `IQ6_K` - see #8 for motivation
+* Adds `IQ6_K` - see [#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) for motivation
* Fixes the Zen4 implementation of `IQ3_K`, `IQ4_K` and `IQ5_K`
### New IQ6_K
-The graph below is a copy of the graph in #8 with the quantization error of the new `IQ6_K` non-linear quantization type added (cyan circle near 6.6 bpw). We observe a significant improvement compared to `Q6_K` (0.4% vs 0.65%). LLaMA-3.1-8B quantization error is better too (0.15% vs 0.26%), so I think this is a worthwhile addition.
+The graph below is a copy of the graph in [#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) with the quantization error of the new `IQ6_K` non-linear quantization type added (cyan circle near 6.6 bpw). We observe a significant improvement compared to `Q6_K` (0.4% vs 0.65%). LLaMA-3.1-8B quantization error is better too (0.15% vs 0.26%), so I think this is a worthwhile addition.

diff --git a/github-data/pull_requests/141 - Q8_K_R8_ Fastest quantized matrix multiplications.md b/github-data/pull_requests/141 - Q8_K_R8 Fastest quantized matrix multiplications.md
similarity index 87%
rename from github-data/pull_requests/141 - Q8_K_R8_ Fastest quantized matrix multiplications.md
rename to github-data/pull_requests/141 - Q8_K_R8 Fastest quantized matrix multiplications.md
index e891bfba9..5a0c97d69 100644
--- a/github-data/pull_requests/141 - Q8_K_R8_ Fastest quantized matrix multiplications.md
+++ b/github-data/pull_requests/141 - Q8_K_R8 Fastest quantized matrix multiplications.md
@@ -1,14 +1,17 @@
-### 🔀 [#141](https://github.com/ikawrakow/ik_llama.cpp/pull/141) - Q8_K_R8: Fastest quantized matrix multiplications
+## 🔀 [Pull Request #141](https://github.com/ikawrakow/ik_llama.cpp/pull/141) - Q8_K_R8: Fastest quantized matrix multiplications
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q8_k_r8` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-14 |
| **Updated** | 2024-12-14 |
+| **Merged** | 2024-12-14 |
---
-#### Description
+## 📄 Description
This PR adds `Q8_K_R8` - 8-rows interleaved version of `Q8_K`. With that, we break the world record in prompt processing speed. Here is what we get for PP-512 with LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `AVX2` (Ryzen-5975WX) and `ARM_NEON` (M2-Max):
diff --git a/github-data/pull_requests/142 - BF16_R16 - 16 interleaved bf16 rows.md b/github-data/pull_requests/142 - BF16_R16 - 16 interleaved bf16 rows.md
index 832e7fada..52719026d 100644
--- a/github-data/pull_requests/142 - BF16_R16 - 16 interleaved bf16 rows.md
+++ b/github-data/pull_requests/142 - BF16_R16 - 16 interleaved bf16 rows.md
@@ -1,16 +1,19 @@
-### 🔀 [#142](https://github.com/ikawrakow/ik_llama.cpp/pull/142) - BF16_R16 - 16 interleaved bf16 rows
+## 🔀 [Pull Request #142](https://github.com/ikawrakow/ik_llama.cpp/pull/142) - BF16_R16 - 16 interleaved bf16 rows
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bf16_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-14 |
| **Updated** | 2024-12-15 |
+| **Merged** | 2024-12-15 |
---
-#### Description
+## 📄 Description
-After breaking the world record for 8-bit quantized matrix multiplications with `Q8_K_R8` in PR #141, I got excited to try to speed up `bf16` CPU inference. This PR is the somewhat disappointing result. I tried interleaving 4, 8, and 16 rows, 16 is fastest (but only very slightly faster than 8). It is disappointing because we only gain about 11% in prompt processing speed compared to the `bf16` implementation in `iqk_mul_mat` (but that one is already ~3X faster compared to mainline `llama.cpp`). On the bright side we do get TG speedup - 3.12 t/s vs 2.5 t/s for LLaMA-3.1-8B with 1 thread on a Ryzen-7950X, 4.25 t/s vs 3.9 t/s with 2 threads (and 2 threads fully saturate the memory bandwidth when using `BF16_R16`).
+After breaking the world record for 8-bit quantized matrix multiplications with `Q8_K_R8` in PR [#141](https://github.com/ikawrakow/ik_llama.cpp/issues/141), I got excited to try to speed up `bf16` CPU inference. This PR is the somewhat disappointing result. I tried interleaving 4, 8, and 16 rows, 16 is fastest (but only very slightly faster than 8). It is disappointing because we only gain about 11% in prompt processing speed compared to the `bf16` implementation in `iqk_mul_mat` (but that one is already ~3X faster compared to mainline `llama.cpp`). On the bright side we do get TG speedup - 3.12 t/s vs 2.5 t/s for LLaMA-3.1-8B with 1 thread on a Ryzen-7950X, 4.25 t/s vs 3.9 t/s with 2 threads (and 2 threads fully saturate the memory bandwidth when using `BF16_R16`).
Anyway, here a table with the `BF16_R16` PP-512 and TG-128 speeds on Ryzen-7950X
diff --git a/github-data/pull_requests/143 - Slightly faster IQ4_XS_R4 on AVX2.md b/github-data/pull_requests/143 - Slightly faster IQ4_XS_R4 on AVX2.md
index 595bc35e6..f7c5ac00a 100644
--- a/github-data/pull_requests/143 - Slightly faster IQ4_XS_R4 on AVX2.md
+++ b/github-data/pull_requests/143 - Slightly faster IQ4_XS_R4 on AVX2.md
@@ -1,14 +1,17 @@
-### 🔀 [#143](https://github.com/ikawrakow/ik_llama.cpp/pull/143) - Slightly faster IQ4_XS_R4 on AVX2
+## 🔀 [Pull Request #143](https://github.com/ikawrakow/ik_llama.cpp/pull/143) - Slightly faster IQ4_XS_R4 on AVX2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_xs_r4_avx2` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-16 |
| **Updated** | 2024-12-16 |
+| **Merged** | 2024-12-16 |
---
-#### Description
+## 📄 Description
PPL-512(LLaMA-3.1-8B) on Ryzen-5975WX goes to 262.2 t/s up from 248.2 t/s.
diff --git a/github-data/pull_requests/144 - Slightly faster IQ4_K_R4 on AVX2Zen4.md b/github-data/pull_requests/144 - Slightly faster IQ4_K_R4 on AVX2Zen4.md
new file mode 100644
index 000000000..8d321060e
--- /dev/null
+++ b/github-data/pull_requests/144 - Slightly faster IQ4_K_R4 on AVX2Zen4.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #144](https://github.com/ikawrakow/ik_llama.cpp/pull/144) - Slightly faster IQ4_K_R4 on AVX2/Zen4
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_k_r4_avx2` |
+| **Target Branch** | `main` |
+| **Created** | 2024-12-16 |
+| **Updated** | 2024-12-16 |
+| **Merged** | 2024-12-16 |
+
+---
+
+## 📄 Description
+
+We get PP-512(LLaMA-3.1-8B) = 251 t/s (Ryzen-7950X) or 249 t/s (Ryzen-5975WX), up from 232/227 t/s.
\ No newline at end of file
diff --git a/github-data/pull_requests/144 - Slightly faster IQ4_K_R4 on AVX2_Zen4.md b/github-data/pull_requests/144 - Slightly faster IQ4_K_R4 on AVX2_Zen4.md
deleted file mode 100644
index be74d3755..000000000
--- a/github-data/pull_requests/144 - Slightly faster IQ4_K_R4 on AVX2_Zen4.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🔀 [#144](https://github.com/ikawrakow/ik_llama.cpp/pull/144) - Slightly faster IQ4_K_R4 on AVX2/Zen4
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-12-16 |
-| **Updated** | 2024-12-16 |
-
----
-
-#### Description
-
-We get PP-512(LLaMA-3.1-8B) = 251 t/s (Ryzen-7950X) or 249 t/s (Ryzen-5975WX), up from 232/227 t/s.
\ No newline at end of file
diff --git a/github-data/pull_requests/145 - IQ3_K_R4.md b/github-data/pull_requests/145 - IQ3_K_R4.md
index 77b588a19..1febb6f24 100644
--- a/github-data/pull_requests/145 - IQ3_K_R4.md
+++ b/github-data/pull_requests/145 - IQ3_K_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#145](https://github.com/ikawrakow/ik_llama.cpp/pull/145) - IQ3_K_R4
+## 🔀 [Pull Request #145](https://github.com/ikawrakow/ik_llama.cpp/pull/145) - IQ3_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq3_k_r4_v2` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-17 |
| **Updated** | 2024-12-17 |
+| **Merged** | 2024-12-17 |
---
-#### Description
+## 📄 Description
Adding `IQ3_K` with 4 interleaved rows.
diff --git a/github-data/pull_requests/146 - IQ2_K_R4.md b/github-data/pull_requests/146 - IQ2_K_R4.md
index 8d26350a7..253adcf6f 100644
--- a/github-data/pull_requests/146 - IQ2_K_R4.md
+++ b/github-data/pull_requests/146 - IQ2_K_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#146](https://github.com/ikawrakow/ik_llama.cpp/pull/146) - IQ2_K_R4
+## 🔀 [Pull Request #146](https://github.com/ikawrakow/ik_llama.cpp/pull/146) - IQ2_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_k_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-17 |
| **Updated** | 2024-12-17 |
+| **Merged** | 2024-12-17 |
---
-#### Description
+## 📄 Description
Adding `IQ2_K` with 4 interleaved rows.
diff --git a/github-data/pull_requests/147 - Be able to repack tensors at run time.md b/github-data/pull_requests/147 - Be able to repack tensors at run time.md
index 7dabce391..03154e7c6 100644
--- a/github-data/pull_requests/147 - Be able to repack tensors at run time.md
+++ b/github-data/pull_requests/147 - Be able to repack tensors at run time.md
@@ -1,14 +1,17 @@
-### 🔀 [#147](https://github.com/ikawrakow/ik_llama.cpp/pull/147) - Be able to repack tensors at run time
+## 🔀 [Pull Request #147](https://github.com/ikawrakow/ik_llama.cpp/pull/147) - Be able to repack tensors at run time
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/run_time_repack` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-17 |
| **Updated** | 2024-12-17 |
+| **Merged** | 2024-12-17 |
---
-#### Description
+## 📄 Description
It is a bit of a hack as I didn't see a good way to figure out if tensors may be uploaded to a GPU later on. But if running on the CPU it works fine. Just use
```
diff --git a/github-data/pull_requests/148 - Slightly better matrix x vector on Zen4AVX2 for iq2_k_r4 iq3_k_r4 iq4_k_r4.md b/github-data/pull_requests/148 - Slightly better matrix x vector on Zen4AVX2 for iq2_k_r4 iq3_k_r4 iq4_k_r4.md
new file mode 100644
index 000000000..501c21ec4
--- /dev/null
+++ b/github-data/pull_requests/148 - Slightly better matrix x vector on Zen4AVX2 for iq2_k_r4 iq3_k_r4 iq4_k_r4.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #148](https://github.com/ikawrakow/ik_llama.cpp/pull/148) - Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, iq4_k_r4
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/avx2_r4_tweaks` |
+| **Target Branch** | `main` |
+| **Created** | 2024-12-17 |
+| **Updated** | 2024-12-17 |
+| **Merged** | 2024-12-17 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/148 - Slightly better matrix x vector on Zen4_AVX2 for iq2_k_r4_ iq3_k_r4_ iq.md b/github-data/pull_requests/148 - Slightly better matrix x vector on Zen4_AVX2 for iq2_k_r4_ iq3_k_r4_ iq.md
deleted file mode 100644
index f01ed2f2c..000000000
--- a/github-data/pull_requests/148 - Slightly better matrix x vector on Zen4_AVX2 for iq2_k_r4_ iq3_k_r4_ iq.md
+++ /dev/null
@@ -1,7 +0,0 @@
-### 🔀 [#148](https://github.com/ikawrakow/ik_llama.cpp/pull/148) - Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, iq4_k_r4
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-12-17 |
-| **Updated** | 2024-12-17 |
\ No newline at end of file
diff --git a/github-data/pull_requests/149 - IQ5_K_R4.md b/github-data/pull_requests/149 - IQ5_K_R4.md
index 465e92318..11be6d1a2 100644
--- a/github-data/pull_requests/149 - IQ5_K_R4.md
+++ b/github-data/pull_requests/149 - IQ5_K_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#149](https://github.com/ikawrakow/ik_llama.cpp/pull/149) - IQ5_K_R4
+## 🔀 [Pull Request #149](https://github.com/ikawrakow/ik_llama.cpp/pull/149) - IQ5_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq5_k_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-18 |
| **Updated** | 2025-03-27 |
+| **Merged** | 2024-12-18 |
---
-#### Description
+## 📄 Description
Adding `IQ5_K` with 4 interleaved rows.
@@ -33,10 +36,16 @@ Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-03-27** at **06:53:47**:
+👤 **saood06** commented on **2025-03-27** at **06:53:47**
>TG does not look good on AVX2/Zen4
-Does this mean regression compared to non-interleaved or just no benefit?
\ No newline at end of file
+Does this mean regression compared to non-interleaved or just no benefit?
+
+---
+
+👤 **ikawrakow** commented on **2025-03-27** at **07:08:50**
+
+I don't remember. But the "Better Zen4" commit in the PR says "But TG is still slower than iq5_k".
\ No newline at end of file
diff --git a/github-data/pull_requests/150 - IQ4_KS_R4.md b/github-data/pull_requests/150 - IQ4_KS_R4.md
index 593c04ef3..aa3668800 100644
--- a/github-data/pull_requests/150 - IQ4_KS_R4.md
+++ b/github-data/pull_requests/150 - IQ4_KS_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#150](https://github.com/ikawrakow/ik_llama.cpp/pull/150) - IQ4_KS_R4
+## 🔀 [Pull Request #150](https://github.com/ikawrakow/ik_llama.cpp/pull/150) - IQ4_KS_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_ks_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-18 |
| **Updated** | 2024-12-18 |
+| **Merged** | 2024-12-18 |
---
-#### Description
+## 📄 Description
Adding `IQ4_KS` with 4 interleaved rows.
diff --git a/github-data/pull_requests/151 - fix typo.md b/github-data/pull_requests/151 - fix typo.md
index 5a761d461..7bd34a5fc 100644
--- a/github-data/pull_requests/151 - fix typo.md
+++ b/github-data/pull_requests/151 - fix typo.md
@@ -1,14 +1,17 @@
-### 🐛 [#151](https://github.com/ikawrakow/ik_llama.cpp/pull/151) - fix typo
+## 🔀 [Pull Request #151](https://github.com/ikawrakow/ik_llama.cpp/pull/151) - fix typo
| **Author** | `Nexesenex` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `fix-typo` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-20 |
| **Updated** | 2024-12-20 |
+| **Merged** | 2024-12-20 |
---
-#### Description
+## 📄 Description
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
@@ -18,6 +21,6 @@
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2024-12-20** at **11:02:09**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2024-12-20** at **11:02:09**
\ No newline at end of file
diff --git a/github-data/pull_requests/152 - IQ3_XXS_R4.md b/github-data/pull_requests/152 - IQ3_XXS_R4.md
index 97e7b36e5..d990777a6 100644
--- a/github-data/pull_requests/152 - IQ3_XXS_R4.md
+++ b/github-data/pull_requests/152 - IQ3_XXS_R4.md
@@ -1,14 +1,16 @@
-### 🔀 [#152](https://github.com/ikawrakow/ik_llama.cpp/pull/152) - IQ3_XXS_R4
+## 🔀 [Pull Request #152](https://github.com/ikawrakow/ik_llama.cpp/pull/152) - IQ3_XXS_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/iq3_xxs_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-20 |
| **Updated** | 2024-12-20 |
---
-#### Description
+## 📄 Description
Sub-4 bpw i-quants have a terrible CPU performance, so I was curious to see if we can improve by interleaving rows.
diff --git a/github-data/pull_requests/153 - IQ3_XXS_R4.md b/github-data/pull_requests/153 - IQ3_XXS_R4.md
index 13625bf95..b28a0beee 100644
--- a/github-data/pull_requests/153 - IQ3_XXS_R4.md
+++ b/github-data/pull_requests/153 - IQ3_XXS_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#153](https://github.com/ikawrakow/ik_llama.cpp/pull/153) - IQ3_XXS_R4
+## 🔀 [Pull Request #153](https://github.com/ikawrakow/ik_llama.cpp/pull/153) - IQ3_XXS_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq3_xxs_r4_v2` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-20 |
| **Updated** | 2024-12-20 |
+| **Merged** | 2024-12-20 |
---
-#### Description
+## 📄 Description
Sub-4 bpw i-quants have a terrible CPU performance, so I was curious to see if we can improve by interleaving rows.
diff --git a/github-data/pull_requests/154 - IQ2_XXS_R4.md b/github-data/pull_requests/154 - IQ2_XXS_R4.md
index 06ce836ba..6f36cf20b 100644
--- a/github-data/pull_requests/154 - IQ2_XXS_R4.md
+++ b/github-data/pull_requests/154 - IQ2_XXS_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#154](https://github.com/ikawrakow/ik_llama.cpp/pull/154) - IQ2_XXS_R4
+## 🔀 [Pull Request #154](https://github.com/ikawrakow/ik_llama.cpp/pull/154) - IQ2_XXS_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_xxs_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-20 |
| **Updated** | 2024-12-20 |
+| **Merged** | 2024-12-20 |
---
-#### Description
+## 📄 Description
Sub-4 bpw i-quants have a terrible CPU performance, so I was curious to see if we can improve by interleaving rows.
diff --git a/github-data/pull_requests/155 - IQ2_XS_R4.md b/github-data/pull_requests/155 - IQ2_XS_R4.md
index 546729bff..527efb0cb 100644
--- a/github-data/pull_requests/155 - IQ2_XS_R4.md
+++ b/github-data/pull_requests/155 - IQ2_XS_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#155](https://github.com/ikawrakow/ik_llama.cpp/pull/155) - IQ2_XS_R4
+## 🔀 [Pull Request #155](https://github.com/ikawrakow/ik_llama.cpp/pull/155) - IQ2_XS_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_xs_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-21 |
| **Updated** | 2024-12-21 |
+| **Merged** | 2024-12-21 |
---
-#### Description
+## 📄 Description
Sub-4 bpw i-quants have a terrible CPU performance, so I was curious to see if we can improve by interleaving rows.
diff --git a/github-data/pull_requests/156 - IQ2_S_R4.md b/github-data/pull_requests/156 - IQ2_S_R4.md
index 93da10926..bb8928875 100644
--- a/github-data/pull_requests/156 - IQ2_S_R4.md
+++ b/github-data/pull_requests/156 - IQ2_S_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#156](https://github.com/ikawrakow/ik_llama.cpp/pull/156) - IQ2_S_R4
+## 🔀 [Pull Request #156](https://github.com/ikawrakow/ik_llama.cpp/pull/156) - IQ2_S_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_s_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-21 |
| **Updated** | 2024-12-21 |
+| **Merged** | 2024-12-21 |
---
-#### Description
+## 📄 Description
Sub-4 bpw i-quants have a terrible CPU performance, so I was curious to see if we can improve by interleaving rows.
diff --git a/github-data/pull_requests/157 - R4 i-quants improvements.md b/github-data/pull_requests/157 - R4 i-quants improvements.md
index d12319157..b330f0155 100644
--- a/github-data/pull_requests/157 - R4 i-quants improvements.md
+++ b/github-data/pull_requests/157 - R4 i-quants improvements.md
@@ -1,14 +1,17 @@
-### 🔀 [#157](https://github.com/ikawrakow/ik_llama.cpp/pull/157) - R4 i-quants improvements
+## 🔀 [Pull Request #157](https://github.com/ikawrakow/ik_llama.cpp/pull/157) - R4 i-quants improvements
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/r4_nrcy_16` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-22 |
| **Updated** | 2024-12-22 |
+| **Merged** | 2024-12-22 |
---
-#### Description
+## 📄 Description
Unpacking k- and i-quants is computationally expensive. Because of this, it is useful to re-use the unpacked quants for multiplication with as many columns in the right matrix as possible. At the same time one also needs to restrict the number of columns being used to some maximum number so that accumulated results can remain in vector registers, so in `iqk_mul_mat` up to 8 columns are used. But unpacking `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS` is computationally so expensive that is cheaper to load/unload accumulated results to/from vector registers so that unpacked quants can be reused more than 8 times.
diff --git a/github-data/pull_requests/158 - Faster R4 legacy quants.md b/github-data/pull_requests/158 - Faster R4 legacy quants.md
index 50b43173b..fd61e51a2 100644
--- a/github-data/pull_requests/158 - Faster R4 legacy quants.md
+++ b/github-data/pull_requests/158 - Faster R4 legacy quants.md
@@ -1,14 +1,17 @@
-### 🔀 [#158](https://github.com/ikawrakow/ik_llama.cpp/pull/158) - Faster R4 legacy quants
+## 🔀 [Pull Request #158](https://github.com/ikawrakow/ik_llama.cpp/pull/158) - Faster R4 legacy quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/qx_0_r4_avx2` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-22 |
| **Updated** | 2024-12-22 |
+| **Merged** | 2024-12-22 |
---
-#### Description
+## 📄 Description
It seems converting `fp16` to `fp32` is extremely slow on the Ryzen-5975WX CPU (or `ggml`'s `GGML_FP16_TO_FP32` is inadequate), so it is better to convert the `fp16` `Q8_1_x4` block scales using `AVX2` intrinsics, store the result, and then use the converted `fp32` scales when performing the dot product. This PR does that on `AVX2` for `Q4_0_R4, Q5_0_R4, Q6_0_R4` and `Q8_0_R4`. There was no benefit on the Ryzen-7950X (`Zen4`), so not implemented there.
diff --git a/github-data/pull_requests/16 - Fix Makefile.md b/github-data/pull_requests/16 - Fix Makefile.md
index d8d958ca3..819becbf6 100644
--- a/github-data/pull_requests/16 - Fix Makefile.md
+++ b/github-data/pull_requests/16 - Fix Makefile.md
@@ -1,13 +1,16 @@
-### 🐛 [#16](https://github.com/ikawrakow/ik_llama.cpp/pull/16) - Fix Makefile
+## 🔀 [Pull Request #16](https://github.com/ikawrakow/ik_llama.cpp/pull/16) - Fix Makefile
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_Makefile` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-09 |
| **Updated** | 2024-08-09 |
+| **Merged** | 2024-08-09 |
---
-#### Description
+## 📄 Description
I always use cmake, so had forgotten to pay attention to the Makefile.
\ No newline at end of file
diff --git a/github-data/pull_requests/161 - MSVC fixes.md b/github-data/pull_requests/161 - MSVC fixes.md
index 3928914a9..c17f7532f 100644
--- a/github-data/pull_requests/161 - MSVC fixes.md
+++ b/github-data/pull_requests/161 - MSVC fixes.md
@@ -1,22 +1,25 @@
-### 🐛 [#161](https://github.com/ikawrakow/ik_llama.cpp/pull/161) - MSVC fixes
+## 🔀 [Pull Request #161](https://github.com/ikawrakow/ik_llama.cpp/pull/161) - MSVC fixes
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_windows` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-22 |
| **Updated** | 2024-12-23 |
+| **Merged** | 2024-12-23 |
---
-#### Description
+## 📄 Description
-@Nexesenex Does this fix #160?
+@Nexesenex Does this fix [#160](https://github.com/ikawrakow/ik_llama.cpp/issues/160)?
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2024-12-22** at **16:44:51**:
+👤 **Nexesenex** commented on **2024-12-22** at **16:44:51**

@@ -24,13 +27,13 @@ Sadly not.
---
-👤 **ikawrakow** commented the **2024-12-22** at **17:15:34**:
+👤 **ikawrakow** commented on **2024-12-22** at **17:15:34**
And now?
---
-👤 **Nexesenex** commented the **2024-12-22** at **17:47:25**:
+👤 **Nexesenex** commented on **2024-12-22** at **17:47:25**

@@ -38,13 +41,13 @@ Same.
---
-👤 **ikawrakow** commented the **2024-12-22** at **17:51:20**:
+👤 **ikawrakow** commented on **2024-12-22** at **17:51:20**
Did you pull? These errors are from the previous version, and not what is currently on this branch.
---
-👤 **Nexesenex** commented the **2024-12-23** at **06:18:47**:
+👤 **Nexesenex** commented on **2024-12-23** at **06:18:47**
I apologize, I didn't compile the updated branch indeed. (-*-)
It works now, thank you.
\ No newline at end of file
diff --git a/github-data/pull_requests/162 - IQ3_S_R4.md b/github-data/pull_requests/162 - IQ3_S_R4.md
index 6775e0ef1..ada0dd803 100644
--- a/github-data/pull_requests/162 - IQ3_S_R4.md
+++ b/github-data/pull_requests/162 - IQ3_S_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#162](https://github.com/ikawrakow/ik_llama.cpp/pull/162) - IQ3_S_R4
+## 🔀 [Pull Request #162](https://github.com/ikawrakow/ik_llama.cpp/pull/162) - IQ3_S_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq3_s_r4_v2` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-23 |
| **Updated** | 2024-12-23 |
+| **Merged** | 2024-12-23 |
---
-#### Description
+## 📄 Description
Sub-4 bpw i-quants have a terrible CPU performance, so I was curious to see if we can improve by interleaving rows.
diff --git a/github-data/pull_requests/163 - q4_0_r4_ Use AVX2 version for matrix x vector.md b/github-data/pull_requests/163 - q4_0_r4 Use AVX2 version for matrix x vector.md
similarity index 55%
rename from github-data/pull_requests/163 - q4_0_r4_ Use AVX2 version for matrix x vector.md
rename to github-data/pull_requests/163 - q4_0_r4 Use AVX2 version for matrix x vector.md
index 350b2d734..df2aefb20 100644
--- a/github-data/pull_requests/163 - q4_0_r4_ Use AVX2 version for matrix x vector.md
+++ b/github-data/pull_requests/163 - q4_0_r4 Use AVX2 version for matrix x vector.md
@@ -1,13 +1,16 @@
-### 🔀 [#163](https://github.com/ikawrakow/ik_llama.cpp/pull/163) - q4_0_r4: Use AVX2 version for matrix x vector
+## 🔀 [Pull Request #163](https://github.com/ikawrakow/ik_llama.cpp/pull/163) - q4_0_r4: Use AVX2 version for matrix x vector
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mv_q4_0_r4` |
+| **Target Branch** | `main` |
| **Created** | 2024-12-23 |
| **Updated** | 2024-12-23 |
+| **Merged** | 2024-12-23 |
---
-#### Description
+## 📄 Description
Performance is better. Packing quants into 512-bit registers is costly and when we have just 1 column to multiply, using the `AVX512` version becomes slower. I had already done this for most (all?) other quants, but somehow missed `Q4_0`.
\ No newline at end of file
diff --git a/github-data/pull_requests/168 - Falcon3 changes.md b/github-data/pull_requests/168 - Falcon3 changes.md
index 48934bc6e..9f94a9fc8 100644
--- a/github-data/pull_requests/168 - Falcon3 changes.md
+++ b/github-data/pull_requests/168 - Falcon3 changes.md
@@ -1,14 +1,17 @@
-### 🔀 [#168](https://github.com/ikawrakow/ik_llama.cpp/pull/168) - Falcon3 changes
+## 🔀 [Pull Request #168](https://github.com/ikawrakow/ik_llama.cpp/pull/168) - Falcon3 changes
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/falcon3a` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-10 |
| **Updated** | 2025-01-10 |
+| **Merged** | 2025-01-10 |
---
-#### Description
+## 📄 Description
Two changes:
* Add pre-tokenizer for `Falcon3` (same as `llama3`)
@@ -18,9 +21,9 @@ The second change is required for the `IQ2_BN_R4` 4-row interleaved variant. The
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-01-10** at **12:56:49**:
+👤 **ikawrakow** commented on **2025-01-10** at **12:56:49**
Oh, here some performance figures for `IQ2_BN` and Microsoft's [Bitnet](https://github.com/microsoft/BitNet) `I2_S` quants, which claim to be the fastest CPU implementation of ternary transformer models. Tests run on a Ryzen-7950X CPU.
diff --git a/github-data/pull_requests/169 - Be able to re-quantize MS BitNet I2_S models.md b/github-data/pull_requests/169 - Be able to re-quantize MS BitNet I2_S models.md
index b3a03b937..e84eeb41a 100644
--- a/github-data/pull_requests/169 - Be able to re-quantize MS BitNet I2_S models.md
+++ b/github-data/pull_requests/169 - Be able to re-quantize MS BitNet I2_S models.md
@@ -1,16 +1,19 @@
-### 🔀 [#169](https://github.com/ikawrakow/ik_llama.cpp/pull/169) - Be able to re-quantize MS BitNet I2_S models
+## 🔀 [Pull Request #169](https://github.com/ikawrakow/ik_llama.cpp/pull/169) - Be able to re-quantize MS BitNet I2_S models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/convert_i2s` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-10 |
| **Updated** | 2025-01-10 |
+| **Merged** | 2025-01-10 |
---
-#### Description
+## 📄 Description
-Closes #167
+Closes [#167](https://github.com/ikawrakow/ik_llama.cpp/issues/167)
I also saw requests for `Falcon3-10B-1.58b` being made in the mainline `llama.cpp` and `llamafile` repositories, so decided to add the ability to use this model with `ik_llama.cpp`.
diff --git a/github-data/pull_requests/17 - Merge mainline - Aug 12 2024.md b/github-data/pull_requests/17 - Merge mainline - Aug 12 2024.md
index 29128fb94..9b9b393ac 100644
--- a/github-data/pull_requests/17 - Merge mainline - Aug 12 2024.md
+++ b/github-data/pull_requests/17 - Merge mainline - Aug 12 2024.md
@@ -1,14 +1,17 @@
-### 🔀 [#17](https://github.com/ikawrakow/ik_llama.cpp/pull/17) - Merge mainline - Aug 12 2024
+## 🔀 [Pull Request #17](https://github.com/ikawrakow/ik_llama.cpp/pull/17) - Merge mainline - Aug 12 2024
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/merge_Aug_12_2024` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-12 |
| **Updated** | 2024-08-12 |
+| **Merged** | 2024-08-12 |
---
-#### Description
+## 📄 Description
Mainly for the LLaMA-3.1 RoPE related changes, not much else of interest.
diff --git a/github-data/pull_requests/170 - MoE fix for R4 quants.md b/github-data/pull_requests/170 - MoE fix for R4 quants.md
index c31a9bc83..550335b3b 100644
--- a/github-data/pull_requests/170 - MoE fix for R4 quants.md
+++ b/github-data/pull_requests/170 - MoE fix for R4 quants.md
@@ -1,14 +1,17 @@
-### 🐛 [#170](https://github.com/ikawrakow/ik_llama.cpp/pull/170) - MoE fix for R4 quants
+## 🔀 [Pull Request #170](https://github.com/ikawrakow/ik_llama.cpp/pull/170) - MoE fix for R4 quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_mul_mat_16` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-12 |
| **Updated** | 2025-01-12 |
+| **Merged** | 2025-01-12 |
---
-#### Description
+## 📄 Description
This PR adds two fixes:
* Make sure number of tensor rows being processed by one thread is a multiple of the number of interleaved rows when using `R4` quants also in `iqk_mul_mat_mow`
diff --git a/github-data/pull_requests/171 - Fix lower FA performance for even batch sizes.md b/github-data/pull_requests/171 - Fix lower FA performance for even batch sizes.md
index 33e35edff..1bb133e2a 100644
--- a/github-data/pull_requests/171 - Fix lower FA performance for even batch sizes.md
+++ b/github-data/pull_requests/171 - Fix lower FA performance for even batch sizes.md
@@ -1,16 +1,19 @@
-### 🐛 [#171](https://github.com/ikawrakow/ik_llama.cpp/pull/171) - Fix lower FA performance for even batch sizes
+## 🔀 [Pull Request #171](https://github.com/ikawrakow/ik_llama.cpp/pull/171) - Fix lower FA performance for even batch sizes
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_fattn_odd_even` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-12 |
| **Updated** | 2025-01-12 |
+| **Merged** | 2025-01-12 |
---
-#### Description
+## 📄 Description
-This PR fixes the lower performance for even batch sizes reported in #164. The graph shows a t/s comparison between the main branch and this PR using
+This PR fixes the lower performance for even batch sizes reported in [#164](https://github.com/ikawrakow/ik_llama.cpp/issues/164). The graph shows a t/s comparison between the main branch and this PR using
```
./bin/llama-batched-bench -m some_model.gguf -pps -t 16 -npp 256 -ntg 128 -npl 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 -c 4096 -rtr -fa
```
diff --git a/github-data/pull_requests/172 - CPU Flash Attention improvements.md b/github-data/pull_requests/172 - CPU Flash Attention improvements.md
index bb2f387d4..03b820f82 100644
--- a/github-data/pull_requests/172 - CPU Flash Attention improvements.md
+++ b/github-data/pull_requests/172 - CPU Flash Attention improvements.md
@@ -1,14 +1,17 @@
-### 🔀 [#172](https://github.com/ikawrakow/ik_llama.cpp/pull/172) - CPU Flash Attention improvements
+## 🔀 [Pull Request #172](https://github.com/ikawrakow/ik_llama.cpp/pull/172) - CPU Flash Attention improvements
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fattn_bf16` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-15 |
| **Updated** | 2025-01-15 |
+| **Merged** | 2025-01-15 |
---
-#### Description
+## 📄 Description
This PR
* Improves FA CPU performance for long contexts
diff --git a/github-data/pull_requests/173 - More Flash Attention improvements.md b/github-data/pull_requests/173 - More Flash Attention improvements.md
index 59110942e..6d523a9a1 100644
--- a/github-data/pull_requests/173 - More Flash Attention improvements.md
+++ b/github-data/pull_requests/173 - More Flash Attention improvements.md
@@ -1,20 +1,23 @@
-### 🔀 [#173](https://github.com/ikawrakow/ik_llama.cpp/pull/173) - More Flash Attention improvements
+## 🔀 [Pull Request #173](https://github.com/ikawrakow/ik_llama.cpp/pull/173) - More Flash Attention improvements
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fattn_kqv` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-19 |
| **Updated** | 2025-01-20 |
+| **Merged** | 2025-01-20 |
---
-#### Description
+## 📄 Description
This PR further improves the Flash Attention implementation as follows:
* Slightly faster `V * softmax(K * Q)` implementation. This benefits all V-cache types
* Faster implementation when the K-cache is quantized with `Q8_0` via run-time-repacking to `Q8_0_R4`.
-The following graph shows prompt processing speed as a function of prompt length for LLaMA-3.1-8B quantized with `IQ4_XS` on a Ryzem-7950X CPU. The PR results are shown with black (`BF16` KV-cache) and red (`Q8_0` KV-cache) triangles, circles are used for the main branch. I have reused the graph from the last post in #25 by just adding the results for this PR, so mainline `llama.cpp` performance is shown as well. I'm particularly pleased with the fact that `Q8_0` KV-cache is now on per or even slightly better than the natively supported 16-bit float type as `Q8_0` quantized KV-cache is basically lossless while reducing required memory by 2X.
+The following graph shows prompt processing speed as a function of prompt length for LLaMA-3.1-8B quantized with `IQ4_XS` on a Ryzem-7950X CPU. The PR results are shown with black (`BF16` KV-cache) and red (`Q8_0` KV-cache) triangles, circles are used for the main branch. I have reused the graph from the last post in [#25](https://github.com/ikawrakow/ik_llama.cpp/issues/25) by just adding the results for this PR, so mainline `llama.cpp` performance is shown as well. I'm particularly pleased with the fact that `Q8_0` KV-cache is now on per or even slightly better than the natively supported 16-bit float type as `Q8_0` quantized KV-cache is basically lossless while reducing required memory by 2X.
For reference, with a `Q8_K_R8`-quantized model we achieve 380 t/s for 512 tokens, and 150 t/s for 32k tokens.
@@ -22,9 +25,9 @@ For reference, with a `Q8_K_R8`-quantized model we achieve 380 t/s for 512 token
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-01-20** at **06:57:27**:
+👤 **ikawrakow** commented on **2025-01-20** at **06:57:27**
Here is the performance relative to a GPU (RTX-4080) for the above graph
diff --git a/github-data/pull_requests/174 - On Zen4 repack fp16 models to bf16_r16.md b/github-data/pull_requests/174 - On Zen4 repack fp16 models to bf16_r16.md
index a98d9c54d..fa10fa9bb 100644
--- a/github-data/pull_requests/174 - On Zen4 repack fp16 models to bf16_r16.md
+++ b/github-data/pull_requests/174 - On Zen4 repack fp16 models to bf16_r16.md
@@ -1,14 +1,17 @@
-### 🔀 [#174](https://github.com/ikawrakow/ik_llama.cpp/pull/174) - On Zen4 repack fp16 models to bf16_r16
+## 🔀 [Pull Request #174](https://github.com/ikawrakow/ik_llama.cpp/pull/174) - On Zen4 repack fp16 models to bf16_r16
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/zen4_repack_f16` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-21 |
| **Updated** | 2025-01-21 |
+| **Merged** | 2025-01-21 |
---
-#### Description
+## 📄 Description
...when run-time-repacking is requested via `-rtr`
diff --git a/github-data/pull_requests/175 - Better BF16 support on AVX2.md b/github-data/pull_requests/175 - Better BF16 support on AVX2.md
index ee8b41158..8cc476069 100644
--- a/github-data/pull_requests/175 - Better BF16 support on AVX2.md
+++ b/github-data/pull_requests/175 - Better BF16 support on AVX2.md
@@ -1,14 +1,17 @@
-### 🔀 [#175](https://github.com/ikawrakow/ik_llama.cpp/pull/175) - Better BF16 support on AVX2
+## 🔀 [Pull Request #175](https://github.com/ikawrakow/ik_llama.cpp/pull/175) - Better BF16 support on AVX2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/avx2_bf16` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-22 |
| **Updated** | 2025-01-22 |
+| **Merged** | 2025-01-22 |
---
-#### Description
+## 📄 Description
On the main branch `bf16` models are computed via `ggml`, which results in a horrible performance. This PR adds much better `GEMM` an `GEMV` for `bf16 x fp32`. The table shows a performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on a Ryzen-5975WX CPU
diff --git a/github-data/pull_requests/176 - Deepseek V3 support added.md b/github-data/pull_requests/176 - Deepseek V3 support added.md
index 8b8b90026..93b76cd70 100644
--- a/github-data/pull_requests/176 - Deepseek V3 support added.md
+++ b/github-data/pull_requests/176 - Deepseek V3 support added.md
@@ -1,14 +1,17 @@
-### 🔀 [#176](https://github.com/ikawrakow/ik_llama.cpp/pull/176) - Deepseek V3 support added
+## 🔀 [Pull Request #176](https://github.com/ikawrakow/ik_llama.cpp/pull/176) - Deepseek V3 support added
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `main` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-23 |
| **Updated** | 2025-01-23 |
+| **Merged** | 2025-01-23 |
---
-#### Description
+## 📄 Description
Very direct port of https://github.com/ggerganov/llama.cpp/pull/11049.
@@ -26,13 +29,13 @@ Token generation: 2.75 t/s for IQ4_K, 3.10 t/s for IQ4_K_R4
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-01-23** at **16:09:41**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-01-23** at **16:09:41**
---
-👤 **ikawrakow** commented the **2025-01-23** at **17:00:50**:
+👤 **ikawrakow** commented on **2025-01-23** at **17:00:50**
@saood06
@@ -49,7 +52,7 @@ The check for `tmpl == "deepseek3"` is done before in `llama.cpp`, so this is no
---
-👤 **saood06** commented the **2025-01-23** at **18:00:03**:
+👤 **saood06** commented on **2025-01-23** at **18:00:03**
The change you are referencing happened in https://github.com/ggerganov/llama.cpp/commit/ec7f3ac9ab33e46b136eb5ab6a76c4d81f57c7f1 I was not aware of that till now.
diff --git a/github-data/pull_requests/177 - Update chat templates.md b/github-data/pull_requests/177 - Update chat templates.md
index 846928308..7de9a7962 100644
--- a/github-data/pull_requests/177 - Update chat templates.md
+++ b/github-data/pull_requests/177 - Update chat templates.md
@@ -1,13 +1,16 @@
-### 🔀 [#177](https://github.com/ikawrakow/ik_llama.cpp/pull/177) - Update chat templates
+## 🔀 [Pull Request #177](https://github.com/ikawrakow/ik_llama.cpp/pull/177) - Update chat templates
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/chat_templates` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-23 |
| **Updated** | 2025-01-24 |
+| **Merged** | 2025-01-24 |
---
-#### Description
+## 📄 Description
Basically sync with `llama.cpp`
\ No newline at end of file
diff --git a/github-data/pull_requests/178 - Interleave 8 rows _Q8_0_ IQ4_XS_.md b/github-data/pull_requests/178 - Interleave 8 rows Q8_0 IQ4_XS.md
similarity index 73%
rename from github-data/pull_requests/178 - Interleave 8 rows _Q8_0_ IQ4_XS_.md
rename to github-data/pull_requests/178 - Interleave 8 rows Q8_0 IQ4_XS.md
index 42e9165b3..54b18fefb 100644
--- a/github-data/pull_requests/178 - Interleave 8 rows _Q8_0_ IQ4_XS_.md
+++ b/github-data/pull_requests/178 - Interleave 8 rows Q8_0 IQ4_XS.md
@@ -1,14 +1,18 @@
-### 🔀 [#178](https://github.com/ikawrakow/ik_llama.cpp/pull/178) - Interleave 8 rows (Q8_0, IQ4_XS)
+## 🔀 [Pull Request #178](https://github.com/ikawrakow/ik_llama.cpp/pull/178) - Interleave 8 rows (Q8_0, IQ4_XS)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_xs_r8_v2` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-26 |
| **Updated** | 2025-01-31 |
+| **Merged** | 2025-01-27 |
+| **Labels** | `Breaking change` |
---
-#### Description
+## 📄 Description
One can get better performance on `AVX2/Zen4` by interleaving 8 instead of 4 rows. I did not do it earlier because in my previous attempts performance on `ARM` suffered significantly. But in this PR I found an `ARM_NEON` implementation for 8 interleaved rows for `Q8_0` and `IQ4_XS` that is not slower or is even slightly faster than 4 interleaved rows.
@@ -24,9 +28,9 @@ Below is a graph showing prompt processing (a.k.a. prefill) performance for LLaM
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-01-26** at **17:03:11**:
+👤 **saood06** commented on **2025-01-26** at **17:03:11**
@ikawrakow
@@ -39,11 +43,11 @@ Tested on my Xeon E5-2683 v4 machine via llama-bench.
If you want me to test on my other machine (dual socket Xeon E5-2690 v3) or other models let me know.
-Also any chance you can sync the RPC code (mostly care about #11047 and to a lesser degree #9389 and #11424/#9296), if not I'll do it when I have some free time and submit a PR.
+Also any chance you can sync the RPC code (mostly care about [#11047](https://github.com/ikawrakow/ik_llama.cpp/issues/11047) and to a lesser degree [#9389](https://github.com/ikawrakow/ik_llama.cpp/issues/9389) and [#11424](https://github.com/ikawrakow/ik_llama.cpp/issues/11424)/[#9296](https://github.com/ikawrakow/ik_llama.cpp/issues/9296)), if not I'll do it when I have some free time and submit a PR.
---
-👤 **saood06** commented the **2025-01-27** at **13:06:04**:
+👤 **saood06** commented on **2025-01-27** at **13:06:04**
Testing the batch performance difference showing the peak numbers
@@ -60,29 +64,29 @@ IQ4_XS_R4:
---
-👤 **ikawrakow** commented the **2025-01-27** at **13:28:46**:
+👤 **ikawrakow** commented on **2025-01-27** at **13:28:46**
So, it looks like a small (~2%) improvement. OK to merge? (IIRC, you had this giant R1 model that will become useless after the merge if it is `IQ4_XS_R4`.
---
-👤 **saood06** commented the **2025-01-27** at **14:12:11**:
+👤 **saood06** commented on **2025-01-27** at **14:12:11**
> So, it looks like a small (~2%) improvement.
-Yes, it is an improvement, (there is an edge case where R4 was better and that was at batch size 4).
+Yes, it is an improvement, (there is an edge case where R4 was better at batch size 4).
>OK to merge? (IIRC, you had this giant R1 model that will become useless after the merge if it is `IQ4_XS_R4`.
Yes, it is okay to merge. That model is an IQ4_K_R4 (and IQ4_K), not IQ4_XS, as I prefer your quants over the mainline ones. Which is why I didn't have comparison data for it to mainline.
-On the note of the R1 quant this PR [llama.cpp/pull/11446](https://github.com/ggerganov/llama.cpp/pull/11446) will make me reconvert anyway, I want to use it and also it is easy to grab it now before the KV refactor it is waiting for to implement MLA KV cache. I was going to bring that up anyway in the Deepseek PR because it is a change to the the GGUF for Deepseek.
+On the note of R1, this PR [llama.cpp/pull/11446](https://github.com/ggerganov/llama.cpp/pull/11446) will make me reconvert anyway, I want to use it and also it is easy to grab it now before the KV refactor it is waiting for to implement MLA KV cache. I was going to bring that up anyway in the Deepseek PR because it is a change to the the GGUF for Deepseek.
-#11397 is also showing significant improvements to Deepseek.
+[#11397](https://github.com/ikawrakow/ik_llama.cpp/issues/11397) is also showing significant improvements to Deepseek.
---
-👤 **ikawrakow** commented the **2025-01-27** at **15:41:40**:
+👤 **ikawrakow** commented on **2025-01-27** at **15:41:40**
> On the note of R1, this PR 11446 will make me reconvert anyway
@@ -90,18 +94,20 @@ What is being measured in the graph in this PR? It says "Token generation rate",
---
-👤 **fairydreaming** commented the **2025-01-27** at **19:42:36**:
+👤 **fairydreaming** commented on **2025-01-27** at **19:42:36**
> > On the note of R1, this PR 11446 will make me reconvert anyway
>
> What is being measured in the graph in this PR? It says "Token generation rate", but what tool is being used?
That would be my modified llama-bench from this PR: https://github.com/ggerganov/llama.cpp/pull/11126
-It allows to measure token generation rate after processing a prompt of given size.
+It allows to measure token generation rate after processing a prompt of given size.
+
+For the graph I used `-gp ,32` options, so it's mean token generation rate of 32 tokens after processing the prompt of ``.
---
-👤 **ikawrakow** commented the **2025-01-28** at **14:06:19**:
+👤 **ikawrakow** commented on **2025-01-28** at **14:06:19**
@fairydreaming Thanks for the clarification.
@@ -131,13 +137,15 @@ I played a bit with your PR 11466. TG after a long prompt looks great compared t
---
-👤 **fairydreaming** commented the **2025-01-28** at **14:12:33**:
+👤 **fairydreaming** commented on **2025-01-28** at **14:12:33**
-@ikawrakow Yup, I noticed this. I'm planning to reorganize tensor dimensions for the prompt processing in the PR, hopefully this will fix the issue.
+@ikawrakow Yup, I noticed this. I plan to reorganize tensor dimensions for the prompt processing in the PR, hopefully this will fix the issue.
+
+Edit: it helped, but only a bit (pp rate is 6-8% higher with these changes), it's still slower than the original implementation
---
-👤 **saood06** commented the **2025-01-29** at **09:03:52**:
+👤 **saood06** commented on **2025-01-29** at **09:03:52**
@fairydreaming
> It allows to measure token generation rate after processing a prompt of given size.
@@ -150,7 +158,7 @@ Can you push that change? For my use cases the TG benefits outweigh the loss in
---
-👤 **fairydreaming** commented the **2025-01-29** at **10:09:22**:
+👤 **fairydreaming** commented on **2025-01-29** at **10:09:22**
@saood06
@@ -170,13 +178,13 @@ Pushed.
---
-👤 **saood06** commented the **2025-01-30** at **19:32:55**:
+👤 **saood06** commented on **2025-01-30** at **19:32:55**
@ikawrakow
>I did not rename the types to _R8 yet but will in case this gets merged.
---
-👤 **ikawrakow** commented the **2025-01-31** at **06:31:03**:
+👤 **ikawrakow** commented on **2025-01-31** at **06:31:03**
Will do when I come back from FOSDEM.
\ No newline at end of file
diff --git a/github-data/pull_requests/179 - Minor performance improvements.md b/github-data/pull_requests/179 - Minor performance improvements.md
index 6f84a4b54..01f138155 100644
--- a/github-data/pull_requests/179 - Minor performance improvements.md
+++ b/github-data/pull_requests/179 - Minor performance improvements.md
@@ -1,14 +1,17 @@
-### 🔀 [#179](https://github.com/ikawrakow/ik_llama.cpp/pull/179) - Minor performance improvements
+## 🔀 [Pull Request #179](https://github.com/ikawrakow/ik_llama.cpp/pull/179) - Minor performance improvements
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q4_0_r8` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-27 |
| **Updated** | 2025-01-27 |
+| **Merged** | 2025-01-27 |
---
-#### Description
+## 📄 Description
This PR does two things
1. It changes `Q4_0_R4` to 8 interleaved rows
diff --git a/github-data/pull_requests/180 - Deepseek MLA Optimizations.md b/github-data/pull_requests/180 - Deepseek MLA Optimizations.md
index 389e066e7..2363c7c18 100644
--- a/github-data/pull_requests/180 - Deepseek MLA Optimizations.md
+++ b/github-data/pull_requests/180 - Deepseek MLA Optimizations.md
@@ -1,14 +1,16 @@
-### 🔀 [#180](https://github.com/ikawrakow/ik_llama.cpp/pull/180) - Deepseek MLA Optimizations
+## 🔀 [Pull Request #180](https://github.com/ikawrakow/ik_llama.cpp/pull/180) - Deepseek MLA Optimizations
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `main` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-29 |
| **Updated** | 2025-02-10 |
---
-#### Description
+## 📄 Description
Very direct port of https://github.com/ggerganov/llama.cpp/pull/11446
@@ -33,9 +35,28 @@ Is there any chance to convert old imatrix files (such as [this](https://hugging
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-01-29** at **09:16:02**:
+👤 **ikawrakow** commented on **2025-01-29** at **08:39:37**
+
+This hurts prompt processing (a.k.a prefill) speed very significantly. Here is what I get for Deepseek2-Lite `FP16` on my Ryzen-7950X with this PR vs the main branch. The MLA cache is supposed to make attention more efficient, so advantage should increase with increasing prompt length. Instead we see that it gets worse:
+
+| model | params | threads | rtr | test | t/s (main) | t/s (PR) | Speedup |
+| ------------------- | ---------: | ------: | --: | -------: | ---------------: | ---------------: | --------: |
+| deepseek2 16B F16 | 15.76 B | 16 | 1 | pp256 | 316.36 ± 0.85 | 308.83 ± 0.56 | 0.976 |
+| deepseek2 16B F16 | 15.76 B | 16 | 1 | pp512 | 386.42 ± 0.70 | 358.51 ± 1.58 | 0.928 |
+| deepseek2 16B F16 | 15.76 B | 16 | 1 | pp1024 | 377.85 ± 1.10 | 336.19 ± 0.63 | 0.890 |
+| deepseek2 16B F16 | 15.76 B | 16 | 1 | pp2048 | 361.73 ± 1.55 | 298.20 ± 1.46 | 0.824 |
+| deepseek2 16B F16 | 15.76 B | 16 | 1 | pp4096 | 332.09 ± 0.07 | 240.84 ± 0.40 | 0.725 |
+| deepseek2 16B F16 | 15.76 B | 16 | 1 | pp8192 | 282.16 ± 0.77 | 169.77 ± 0.83 | 0.602 |
+
+TG is indeed better, and advantage increases with KV cache size. But if we have to wait twice as long to process the prompt, it will take quite a few generated tokens to recover the time lost in the prefill step.
+
+I think we need to either try to understand why the attention part is so much slower when processing batches of tokens and fix it, or simply wait for @fairydreaming to fix their PR.
+
+---
+
+👤 **ikawrakow** commented on **2025-01-29** at **09:16:02**
Here is how much time is being spent in the various matrix multiplications in the attention part when processing a prompt of 8192 tokens:
@@ -69,7 +90,7 @@ Maybe this can be useful when trying to optimize.
---
-👤 **saood06** commented the **2025-01-29** at **09:28:49**:
+👤 **saood06** commented on **2025-01-29** at **09:28:49**
>This hurts prompt processing (a.k.a prefill) speed very significantly.
>[...]
@@ -85,13 +106,13 @@ I was drawn in to this PR for the TG benefits, it should have also been a draft
---
-👤 **ikawrakow** commented the **2025-01-29** at **09:33:33**:
+👤 **ikawrakow** commented on **2025-01-29** at **09:33:33**
@saood06 Perhaps a good way to move forward is to add an additional architecture (`deepseek-mla` or similar), but keep the original `deepseek2/3`. In this way, depending on use case, one can choose the improved TG speed after long prompts or the better PP speed when generating a few tokens after processing a long prompt.
---
-👤 **saood06** commented the **2025-01-29** at **10:21:32**:
+👤 **saood06** commented on **2025-01-29** at **10:21:32**
>Perhaps a good way to move forward is to add an additional architecture (deepseek-mla or similar), but keep the original deepseek2/3. In this way, depending on use case, one can choose the improved TG speed after long prompts or the better PP speed when generating a few tokens after processing a long prompt.
@@ -99,7 +120,7 @@ I'll do that. I'll still leave it in a draft as I'm waiting to see how it progre
---
-👤 **ikawrakow** commented the **2025-01-29** at **11:40:16**:
+👤 **ikawrakow** commented on **2025-01-29** at **11:40:16**
So, as far as I can tell, the attention implementation in this PR leads to ~3X more multiply-adds (madds) when performing matrix multiplications. For prompt processing here we need `2 x 512 x 16 x n_token^2` madds, whereas the original implementation requires `(192 + 128) x 16 x n_token^2` madds. For TG, the PR still requires 3X more madds, namely `2 x 512 x n_prompt` madds here vs `(192 + 128) x 16 x n_prompt` on main. The only reason TG ends up being faster here is the shape of the tensors: On main it is 16 matrix multiplications each being `192 x n_prompt * 192 x 1` (`K*Q`) or `n_prompt x 128 * n_prompt x 1` (`V*softmax(K*Q)`). I.e., we have 16 GEMVs, which are 100% memory bound on modern CPU's. In this PR the TG shapes are `512 x n_prompt * 512 x 16` and `n_prompt x 512 * n_prompt x 16`, so real GEMMs with much higher FLOPs, so we end up needing less time despite doing more work. Hence, the way it is implemented, there is no way one can recover PP performance.
@@ -107,13 +128,13 @@ These figures are of course specific to the Deepseek2-Lite model. It may be diff
---
-👤 **fairydreaming** commented the **2025-01-29** at **12:49:35**:
+👤 **fairydreaming** commented on **2025-01-29** at **12:49:35**
@ikawrakow I think applying the trick with "absorbing" matrices mentioned in the DeepSeek V2 paper shall fix this, I'm working on that.
---
-👤 **ikawrakow** commented the **2025-01-29** at **13:14:33**:
+👤 **ikawrakow** commented on **2025-01-29** at **13:14:33**
@fairydreaming
@@ -123,17 +144,43 @@ Btw, I observe that `attn_kv_b.weight` is still present in the model. Is it need
---
-👤 **fairydreaming** commented the **2025-01-30** at **11:23:08**:
+👤 **fairydreaming** commented on **2025-01-29** at **13:42:54**
+
+> @fairydreaming
+>
+> Great!
+>
+> Btw, I observe that `attn_kv_b.weight` is still present in the model. Is it needed, given that we now have `attn_k_b.weight` and `attn_v_b.weight` ?
+
+No it's not needed for the current version of the code, I will remove it later once I settle on a final set of weights.
+
+---
+
+👤 **fairydreaming** commented on **2025-01-30** at **11:23:08**
@ikawrakow Unfortunately the idea with speeding things up thanks to the matrix absorption is wrong: https://github.com/ggerganov/llama.cpp/pull/11446#issuecomment-2624177134
I'm not sure why they mentioned it in the DeepSeek paper.
-Regarding other possible optimizations do you know how much work is needed to add support for multiplication of transposed matrices to ggml_mul_mat()? The problem is that I use kv cache for multiplication both directly and then in transposed form. I got around this problem by storing kv cache in both regular and transposed forms, but it doubles the amount of required memory.
+Regarding other possible optimizations do you know how much work is needed to add support for multiplication of transposed matrices to ggml_mul_mat()? The problem is that I use kv cache for multiplication first directly and then in transposed form. I got around this problem by storing kv cache in both regular and transposed forms, but it doubles the amount of required memory.
---
-👤 **fairydreaming** commented the **2025-01-30** at **12:39:37**:
+👤 **ikawrakow** commented on **2025-01-30** at **12:09:17**
+
+@fairydreaming
+
+I took a look at the Deepseek-R1 model (out of my league memory and disk space wise), so even there rank-512 cannot be really considered "low-rank" (and it seems it is rank-1536 for `Q`). So yes, at first glance it does not look like pre-multiplying the matrices would be fruitful.
+
+Concerning multiplication of transposed matrices in `ggml_mul_mat`: also this will not help you improve prompt processing speed. In this case the computational graph is not memory bound, so reducing cache size will not speed it up as the number of madds remains the same.
+
+I think one should make Flash Attention work with different K and V head sizes. I did a quick attempt but it doesn't look like I found all the places where `head_size(K) == head_size(V)` is being assumed in `llama.cpp`.
+
+Out of curiosity, did you ever try this repository with your Epyc CPU?
+
+---
+
+👤 **fairydreaming** commented on **2025-01-30** at **12:39:37**
> @fairydreaming
@@ -165,7 +212,7 @@ Generation was ~4.6% faster, while prompt processing was ~90% faster, impressive
---
-👤 **ikawrakow** commented the **2025-01-30** at **13:42:04**:
+👤 **ikawrakow** commented on **2025-01-30** at **13:42:04**
10 t/s TG for Deepseek-R1 - wow!
@@ -175,7 +222,7 @@ I'm playing with Deepseek-Lite and I'm finding that the CUDA performance is pret
---
-👤 **saood06** commented the **2025-01-30** at **16:15:26**:
+👤 **saood06** commented on **2025-01-30** at **16:15:26**
I ran batched-bench at batch size 1 with TG at 32 at various PP to show PP performance and TG performance at different context lengths. Batched-bench numbers are noisy because they do not use repetitions like llama-bench and this model on this machine seems to have some variance, but all data is shown after dropping the cache's and running the model until it is fully in the page cache.
@@ -203,7 +250,7 @@ IQ4_K_R4 on main:
| 8192 | 32 | 1 | 8224 | 1391.722 | 5.89 | 43.056 | 0.74 | 1434.778 | 5.73 |
-Looking at the 8K context results, PP does drop from 5.89 to 4.05, but TG jumps from 0.74 to 2.00. At q8_0 (results below) PP again drops 6.06 to 4.03, but TG benefits going from 0.99 to 1.94. I would test/run this model at even higher context, but I would either need a smaller quant or to use RPC (for reference the KV cache at n_ctx of 8224 is 40,233.55 MiB)
+Looking at the 8K context results, PP does drop from 5.89 to 4.05, but TG jumps from 0.74 to 2.00. At q8_0 (results below) PP again drops 6.06 to 4.03, but TG benefits going from 0.99 to 1.94. I would test/run this model at even higher context, but I would either need a smaller quant or to use RPC (for reference the F16/F16 KV cache at n_ctx 8224 is 40,233.55 MiB)
Expand to see more runs with q8_0 and q6_0 K cache tested as well
@@ -286,11 +333,11 @@ Other people have reported poor performance even for the larger Deepseek models
>So, I guess, your Epyc system wipes the floor with any GPU setup using partial GPU offload of Deepseek-R1.
-Partial offload is reported benefited by this: https://github.com/ggerganov/llama.cpp/pull/11397 and it is something I plan to test/use.
+Partial offload is reported to benefit from this: https://github.com/ggerganov/llama.cpp/pull/11397 and it is something I plan to test/use.
---
-👤 **ikawrakow** commented the **2025-01-30** at **17:12:27**:
+👤 **ikawrakow** commented on **2025-01-30** at **17:12:27**
> not sure why FA is needed for that
@@ -302,7 +349,7 @@ I just made Deepseek-Lite also work on my Mac (M2-Max). I get TG-128 = 70 t/s on
---
-👤 **fairydreaming** commented the **2025-02-01** at **08:09:20**:
+👤 **fairydreaming** commented on **2025-02-01** at **08:09:20**
> So, as far as I can tell, the attention implementation in this PR leads to ~3X more multiply-adds (madds) when performing matrix multiplications. For prompt processing here we need `2 x 512 x 16 x n_token^2` madds, whereas the original implementation requires `(192 + 128) x 16 x n_token^2` madds. For TG, the PR still requires 3X more madds, namely `2 x 512 x n_prompt` madds here vs `(192 + 128) x 16 x n_prompt` on main. The only reason TG ends up being faster here is the shape of the tensors: On main it is 16 matrix multiplications each being `192 x n_prompt * 192 x 1` (`K*Q`) or `n_prompt x 128 * n_prompt x 1` (`V*softmax(K*Q)`). I.e., we have 16 GEMVs, which are 100% memory bound on modern CPU's. In this PR the TG shapes are `512 x n_prompt * 512 x 16` and `n_prompt x 512 * n_prompt x 16`, so real GEMMs with much higher FLOPs, so we end up needing less time despite doing more work. Hence, the way it is implemented, there is no way one can recover PP performance.
@@ -310,13 +357,13 @@ This is something that I kind of intuitively expected, I mean the whole point of
---
-👤 **saood06** commented the **2025-02-09** at **15:02:19**:
+👤 **saood06** commented on **2025-02-09** at **15:02:19**
-This is superseded by #188. Closing
+This is superseded by [#188](https://github.com/ikawrakow/ik_llama.cpp/issues/188). Closing
---
-👤 **jukofyork** commented the **2025-02-10** at **16:48:36**:
+👤 **jukofyork** commented on **2025-02-10** at **16:48:36**
@saood06
@@ -324,16 +371,26 @@ Just saw your linked post.
I see you have a slightly faster prompt processing speed, but what I'm confused about is why when I have everything on the GPU apart from the 3 sets of non-shared experts' tensors, why batch processing it's gaining anything hardly, eg:
-- I can get 3.5 -5 tokens per second for token generation with careful NUMA placement and 30 threads of a 2-CPU system with ~78GB/s per node.
-- I can only get 9-10 tokens per second when using a batch of 1024+ and it should be pulling each set of tensors from RAM to VRAM and doing the work for the 1024 tokens in parallel. IMO this shouild be showing speeds like what KTrasnformers is, but it's nothing like this and I'm near 100% sure there will be some glaring flaw in the way this is handled ***if*** I could actually profile the GGML stuff and see clearly WTF is going on to cause this!
+- I can get 3.5-5 tokens per second for token generation with careful NUMA placement and 30 threads of a 2-CPU system with ~78GB/s per node.
+- I can only get 9-10 tokens per second for prompt processing when using a batch of 1024+ and it should be pulling each set of tensors from RAM to VRAM and doing the work for the 1024 tokens in parallel with 15x the memory bandwidth and 100x+ the compute. IMO this should be showing speeds like what KTrasnformers is, but it's nothing like this and I'm near 100% sure there will be some glaring flaw in the way this is handled ***if*** I could actually profile the GGML stuff and see clearly WTF is going on to cause this!
+
+---
+
+👤 **saood06** commented on **2025-02-10** at **17:12:58**
+
+>I can only get 9-10 tokens per second for prompt processing when using a batch of 1024+ and it should be pulling each set of tensors from RAM to VRAM and doing the work for the 1024 tokens in parallel with 15x the memory bandwidth and 100x+ the compute. IMO this should be showing speeds like what KTrasnformers is, but it's nothing like this and I'm near 100% sure there will be some glaring flaw in the way this is handled if I could actually profile the GGML stuff and see clearly WTF is going on to cause this!
+
+Can you try this fork, without MLA and this PR: https://github.com/ikawrakow/ik_llama.cpp/pull/200 which adds FA support. This should be the fastest prompt processing you can do. Fairydreaming on his system with this fork without MLA and without FA and more optimizations reported 50 tok/s. https://github.com/ikawrakow/ik_llama.cpp/pull/180#issuecomment-2624398627
+
+If you want to try MLA, just use the -mla flag, which will turn MLA on.
---
-👤 **jukofyork** commented the **2025-02-10** at **17:15:49**:
+👤 **jukofyork** commented on **2025-02-10** at **17:15:49**
> > I can only get 9-10 tokens per second for prompt processing when using a batch of 1024+ and it should be pulling each set of tensors from RAM to VRAM and doing the work for the 1024 tokens in parallel with 15x the memory bandwidth and 100x+ the compute. IMO this should be showing speeds like what KTrasnformers is, but it's nothing like this and I'm near 100% sure there will be some glaring flaw in the way this is handled if I could actually profile the GGML stuff and see clearly WTF is going on to cause this!
>
-> Can you try this fork, without MLA and this PR: #200 which adds FA support. This should be the fastest prompt processing you can do. Fairydreaming on his system with this fork without MLA and without FA and more optimizations reported 50 tok/s. [#180 (comment)](https://github.com/ikawrakow/ik_llama.cpp/pull/180#issuecomment-2624398627)
+> Can you try this fork, without MLA and this PR: [#200](https://github.com/ikawrakow/ik_llama.cpp/issues/200) which adds FA support. This should be the fastest prompt processing you can do. Fairydreaming on his system with this fork without MLA and without FA and more optimizations reported 50 tok/s. [[#180](https://github.com/ikawrakow/ik_llama.cpp/issues/180) (comment)](https://github.com/ikawrakow/ik_llama.cpp/pull/180#issuecomment-2624398627)
>
> If you want to try MLA, just use the -mla flag, which will turn MLA on.
diff --git a/github-data/pull_requests/181 - Various.md b/github-data/pull_requests/181 - Various.md
index ecae644d2..488d94acd 100644
--- a/github-data/pull_requests/181 - Various.md
+++ b/github-data/pull_requests/181 - Various.md
@@ -1,14 +1,17 @@
-### 🔀 [#181](https://github.com/ikawrakow/ik_llama.cpp/pull/181) - Various
+## 🔀 [Pull Request #181](https://github.com/ikawrakow/ik_llama.cpp/pull/181) - Various
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bench_gp` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-29 |
| **Updated** | 2025-01-29 |
+| **Merged** | 2025-01-29 |
---
-#### Description
+## 📄 Description
PR started by me adding the `-gp` option to `llama-bench` as per https://github.com/ggerganov/llama.cpp/pull/11126 because I wanted to test TG performance after a long prompt to be able to compare to the MLA attention implementation in https://github.com/ggerganov/llama.cpp/pull/11446.
diff --git a/github-data/pull_requests/182 - Faster Q4_K_R4 and Q5_K_R4 on AVX2_Zen4.md b/github-data/pull_requests/182 - Faster Q4_K_R4 and Q5_K_R4 on AVX2Zen4.md
similarity index 77%
rename from github-data/pull_requests/182 - Faster Q4_K_R4 and Q5_K_R4 on AVX2_Zen4.md
rename to github-data/pull_requests/182 - Faster Q4_K_R4 and Q5_K_R4 on AVX2Zen4.md
index 79d8ce582..b9a6e5411 100644
--- a/github-data/pull_requests/182 - Faster Q4_K_R4 and Q5_K_R4 on AVX2_Zen4.md
+++ b/github-data/pull_requests/182 - Faster Q4_K_R4 and Q5_K_R4 on AVX2Zen4.md
@@ -1,14 +1,17 @@
-### 🔀 [#182](https://github.com/ikawrakow/ik_llama.cpp/pull/182) - Faster Q4_K_R4 and Q5_K_R4 on AVX2/Zen4
+## 🔀 [Pull Request #182](https://github.com/ikawrakow/ik_llama.cpp/pull/182) - Faster Q4_K_R4 and Q5_K_R4 on AVX2/Zen4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/qx_k_b32_avx2` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-30 |
| **Updated** | 2025-01-30 |
+| **Merged** | 2025-01-30 |
---
-#### Description
+## 📄 Description
TG is about the same. PP-512 comparison between main and this PR for LLaMA-3.1-8B on a Ryzen-5975WX (`AVX2`) and a Ryzen-7950X (`Zen4`)
diff --git a/github-data/pull_requests/184 - Deepseek-Lite.md b/github-data/pull_requests/184 - Deepseek-Lite.md
index e2b36d84e..37b45f72f 100644
--- a/github-data/pull_requests/184 - Deepseek-Lite.md
+++ b/github-data/pull_requests/184 - Deepseek-Lite.md
@@ -1,14 +1,17 @@
-### 🔀 [#184](https://github.com/ikawrakow/ik_llama.cpp/pull/184) - Deepseek-Lite
+## 🔀 [Pull Request #184](https://github.com/ikawrakow/ik_llama.cpp/pull/184) - Deepseek-Lite
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/qmix_tweaks_2` |
+| **Target Branch** | `main` |
| **Created** | 2025-01-30 |
| **Updated** | 2025-01-30 |
+| **Merged** | 2025-01-30 |
---
-#### Description
+## 📄 Description
I was playing with Deepseek-Lite and noticed that
* Quantization mixes are inadequate, so added a few quick changes to that
diff --git a/github-data/pull_requests/185 - IQ1_S_R4_ better 1.5 bpw quants.md b/github-data/pull_requests/185 - IQ1_S_R4 better 1.5 bpw quants.md
similarity index 97%
rename from github-data/pull_requests/185 - IQ1_S_R4_ better 1.5 bpw quants.md
rename to github-data/pull_requests/185 - IQ1_S_R4 better 1.5 bpw quants.md
index 583198cb1..03ba46bc3 100644
--- a/github-data/pull_requests/185 - IQ1_S_R4_ better 1.5 bpw quants.md
+++ b/github-data/pull_requests/185 - IQ1_S_R4 better 1.5 bpw quants.md
@@ -1,14 +1,17 @@
-### 🔀 [#185](https://github.com/ikawrakow/ik_llama.cpp/pull/185) - IQ1_S_R4: better 1.5 bpw quants
+## 🔀 [Pull Request #185](https://github.com/ikawrakow/ik_llama.cpp/pull/185) - IQ1_S_R4: better 1.5 bpw quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_s_r4` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-05 |
| **Updated** | 2025-02-08 |
+| **Merged** | 2025-02-05 |
---
-#### Description
+## 📄 Description
Given the hype around DeepSeek's models and [Unsloth's sub-2 bpw](https://huggingface.co/unsloth/DeepSeek-R1-GGUF) quantization of DeepSeek-R1 using `IQ1_S/IQ1_M`, I decided to give some love to sub-2 bpw quants. This PR adds `IQ1_S_R4`, a 4-row interleaved version of `IQ1_S`.
@@ -42,9 +45,9 @@ I don't have the disk space and RAM to play with DeepSeek-R1, so I would be real
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-02-06** at **08:31:42**:
+👤 **saood06** commented on **2025-02-06** at **08:31:42**
>I don't have the disk space and RAM to play with DeepSeek-R1
@@ -60,7 +63,7 @@ Sadly, it doesn't really function. I haven't tried his IQ1_S, but yours might ju
---
-👤 **ikawrakow** commented the **2025-02-06** at **08:40:00**:
+👤 **ikawrakow** commented on **2025-02-06** at **08:40:00**
@saood06 Do you have by any chance the quantization log? It would be useful to have it to verify that the intended tensors with higher bpw are correctly selected. It ends up being smaller than Unsloth's because `IQ1_S_R4` is 1.5 bpw vs 1.5625 bpw for `IQ1_S`. This 4% difference pretty much corresponds to the difference between 131 GiB and 127 GiB.
@@ -68,7 +71,7 @@ Oh, the other thing is that I did not change the default quantization for the to
---
-👤 **saood06** commented the **2025-02-06** at **08:48:25**:
+👤 **saood06** commented on **2025-02-06** at **08:48:25**
>Do you have by any chance the quantization log?
@@ -91,7 +94,7 @@ index 02ad25ce..e23b4d5d 100644
```
-
+Log
```
@@ -1748,7 +1751,7 @@ main: total time = 9034503.69 ms
---
-👤 **ikawrakow** commented the **2025-02-06** at **08:59:44**:
+👤 **ikawrakow** commented on **2025-02-06** at **08:59:44**
I think `token_embedding.weight` is the issue. If you use `Q8_0` instead of `Q2_K`, model size will increase by 660 MiB but quality will be quite a bit better.
@@ -1756,7 +1759,7 @@ Do you have an imatrix with the changed attention tensors?
---
-👤 **saood06** commented the **2025-02-06** at **09:08:55**:
+👤 **saood06** commented on **2025-02-06** at **09:08:55**
>I think token_embedding.weight is the issue. If you use Q8_0 instead of Q2_K, model size will increase by 660 MiB but quality will be quite a bit better.
@@ -1768,13 +1771,13 @@ No, and I don't have the dataset or the compute. The new tensors are split from
---
-👤 **ikawrakow** commented the **2025-02-06** at **09:15:48**:
+👤 **ikawrakow** commented on **2025-02-06** at **09:15:48**
In that case I would simply use `Q8_0` for `attn_k_b` and `attn_v_b`. They are quite small, so model size will increase by just ~0.5 GiB.
---
-👤 **saood06** commented the **2025-02-06** at **09:35:01**:
+👤 **saood06** commented on **2025-02-06** at **09:35:01**
> In that case I would simply use `Q8_0` for `attn_k_b` and `attn_v_b`. They are quite small, so model size will increase by just ~0.5 GiB.
@@ -1782,7 +1785,7 @@ I'll do that. I'll probably remake my IQ4_K_R4 with these changes.
---
-👤 **ikawrakow** commented the **2025-02-06** at **09:37:43**:
+👤 **ikawrakow** commented on **2025-02-06** at **09:37:43**
You may also want to change
```c++
@@ -1804,7 +1807,35 @@ This will cost ~0.4 GiB in quantized model size increase. The check is like this
---
-👤 **ikawrakow** commented the **2025-02-06** at **09:45:37**:
+👤 **saood06** commented on **2025-02-06** at **09:43:24**
+
+> You may also want to change
+>
+> ```c++
+> else if (qs.model.hparams.n_expert >= 8 && (name.find("blk.0.ffn_down") != std::string::npos ||
+> name.find("blk.0.ffn_gate") != std::string::npos ||
+> name.find("blk.0.ffn_up") != std::string::npos)) {
+> new_type = GGML_TYPE_IQ3_K_R4;
+> }
+> ```
+>
+> to
+>
+> ```c++
+> else if (qs.model.hparams.n_expert >= 8 && (name.find("ffn_down.weight") != std::string::npos ||
+> name.find("ffn_gate.weight") != std::string::npos ||
+> name.find("ffn_up.weight") != std::string::npos)) {
+> new_type = GGML_TYPE_IQ4_K_R4;
+> }
+> ```
+>
+> This will cost ~0.4 GiB in quantized model size increase. The check is like this because in DeepSeek-Lite there is a single layer without MoE, but in DeepSeek-R1 there are 3 such layers, and my guess is that those are important to get things on the right track before the experts get involved.
+
+Will do, just a question why for attn_q and attn_k do you use Q4_K_R4 and not IQ4_K_R4. My IQ4_K_R4 uses IQ4_K_R4 for those.
+
+---
+
+👤 **ikawrakow** commented on **2025-02-06** at **09:45:37**
> why for attn_q and attn_k do you use Q4_K_R4 and not IQ4_K_R4
@@ -1812,7 +1843,7 @@ Because of copy/paste. It can be changed to `IQ4_K_R4`.
---
-👤 **saood06** commented the **2025-02-06** at **14:40:00**:
+👤 **saood06** commented on **2025-02-06** at **14:40:00**
I changed some things but it still didn't work.
@@ -3282,13 +3313,13 @@ main: total time = 9295125.73 ms
---
-👤 **ikawrakow** commented the **2025-02-06** at **14:46:28**:
+👤 **ikawrakow** commented on **2025-02-06** at **14:46:28**
When you say "It didn't work", how did it not work? Produced NaNs? Produced gibberish? Produced something like human language but with no real meaning? It isn't as coherent as a higher bit quantization?
---
-👤 **saood06** commented the **2025-02-06** at **15:00:37**:
+👤 **saood06** commented on **2025-02-06** at **15:00:37**
>When you say "It didn't work", how did it not work? Produced NaNs? Produced gibberish? Produced something like human language but with no real meaning? It isn't as coherent as a higher bit quantization?
@@ -3319,116 +3350,107 @@ IQ4_K_R4 single token
---
-👤 **saood06** submitted a review the **2025-02-06** at **15:16:38**: 💬 `COMMENTED`
+👤 **saood06** started a conversation on `src/llama.cpp` on **2025-02-06** at **15:16:38**
----
+Could this need to be higher for R1? The unsloth quant does this up to and including layer 8, my most recent attempt only did up to and including layer 6.
-👤 **saood06** commented during a code review the **2025-02-06** at **15:16:38** on `src/llama.cpp`:
+> 👤 **ikawrakow** replied on **2025-02-06** at **15:30:35**
+>
+> Yes, the early layers tend to be more important, so increasing the number of layers and/or increasing the bpw of the quantization used will improve results. It is basically a matter of the balance between quantization quality and model size.
-Could this need to be higher for R1? The unsloth quant does this up to and including layer 8, my most recent attempt only did up to and including layer 6.
+> 👤 **saood06** replied on **2025-02-06** at **16:18:14**
+>
+> >in DeepSeek-Lite there is a single layer without MoE, but in DeepSeek-R1 there are 3 such layers
+>
+> The additional 2 layers of dense, means you hit 2 less MoE layers with this then you do on Lite, and this is still the only meaningful way I can see that the quant I just made is worse than the unsloth one, basically everything else is better, or the same.
---
-👤 **ikawrakow** commented the **2025-02-06** at **15:28:21**:
+👤 **ikawrakow** commented on **2025-02-06** at **15:28:21**
Hmm, not sure. The token probabilities are not completely useless (same top-4 tokens). It is possible the imatrix is not adequate. 4+ bpw quants work even without an imatrix, so a bad imatrix is not immediately recognizable. I see in the log that 315 chunks were used. We have 8 out of 256 experts being active, so each expert got on average less than 10 chunks. That's not a lot of data to properly determine the relative importance of the tensor columns.
In case you have time and energy:
-* Can you try without MLA? I took your PR #180 and made MLA optional (see #188). While testing I noticed that one gets different results and, without having done any meaningful evaluation, my impression was that MLA produced worse responses (tested with DeepSeek-Lite using `f16` to not worry about quantization effects).
+* Can you try without MLA? I took your PR [#180](https://github.com/ikawrakow/ik_llama.cpp/issues/180) and made MLA optional (see [#188](https://github.com/ikawrakow/ik_llama.cpp/issues/188)). While testing I noticed that one gets different results and, without having done any meaningful evaluation, my impression was that MLA produced worse responses (tested with DeepSeek-Lite using `f16` to not worry about quantization effects).
* Have you tried running perplexity? Just a few chunks to compare to your best quantized model
It is of course also possible that removing the super-block scale in `IQ1_S_R4` was not a good move. It didn't have any impact on DeepSeek-Lite, but having 3-bit block scales with just a single row scale is risky, and may result in too much precision loss in case there are big magnitude variations in the model weights.
---
-👤 **ikawrakow** submitted a review the **2025-02-06** at **15:30:35**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-02-06** at **15:30:35** on `src/llama.cpp`:
-
-Yes, the early layers tend to be more important, so increasing the number of layers and/or increasing the bpw of the quantization used will improve results. It is basically a matter of the balance between quantization quality and model size.
-
----
-
-👤 **saood06** commented the **2025-02-06** at **16:06:00**:
+👤 **saood06** commented on **2025-02-06** at **16:06:00**
>It is possible the imatrix is not adequate. 4+ bpw quants work even without an imatrix, so a bad imatrix is not immediately recognizable. I see in the log that 315 chunks were used.
-The one unsloth uses is significantly shorter, only 124. I also do believe the imatrix data is better. The Arctic MoE his imatrix activated all but one expert and they tried hard to get the last one to no avail. All other imatrix activated far less.
+The one unsloth uses is significantly shorter, only 124. I also do believe the imatrix data is better. The Arctic MoE the person who's imatrix's I use activated all but one expert and they tried hard to get the last one to no avail. All other imatrix activated far less.
>Can you try without MLA? I took your PR https://github.com/ikawrakow/ik_llama.cpp/pull/180 and made MLA optional (see https://github.com/ikawrakow/ik_llama.cpp/pull/188). While testing I noticed that one gets different results and, without having done any meaningful evaluation, my impression was that MLA produced worse responses (tested with DeepSeek-Lite using f16 to not worry about quantization effects).
I think this is to be expected. It is a whole different attention mechanism. MLA uses less bits to represents the KV, it is far better at conserving information while compressing the KV cache compared to GQA, but it is still less bits than MHA. They claim it is better than MHA because redundancy in information between heads means you do have some effectively lossless compression. But I've seen enough people actually micro benchmark MHA and MLA and it does seem a bit worse.
-The real benefit of MLA is that it uses less bits, and there was a branch I was working on which allowed me to make use of that (thanks to another one of fairydreaming's PR), which uses mmap to avoid allocating KV until used which means the old gigantic KV (full 128k is ~600 GB), does not allocate and start paging me out. I was able to request 64K of context ( CPU NUMA KV buffer size = 313101.56 MiB ) from server and I used 30K before ending that test, and it never paged to disk thanks to the mmap only allocating what was used.
+The real benefit of MLA is that it uses less bits, and there was a branch I was working on which allowed me to make use of that (thanks to another one of fairydreaming's PR), which uses mmap to avoid allocating KV until used which means the old gigantic KV (full 128k is ~600 GB), does not allocate and start paging me out. I was able to request 64K of context ( CPU NUMA KV buffer size = 313101.56 MiB ) from server and I used 30K before ending that test, and it never paged to disk thanks to the mmap only allocating what was used. I also did not quantize the cache at all, as with MLA it was already so small.
-I saw your PR #188 , there was some minor optimizations from fairydreaming that have that haven't made it to my PR ( #180 ) , along with some other stuff from fairydreaming that is experimental (mmap) and QoL stuff (MoE warmup actually loads in all experts).
+I saw your PR [#188](https://github.com/ikawrakow/ik_llama.cpp/issues/188) , there was some minor optimizations from fairydreaming that have that haven't made it to my PR ( [#180](https://github.com/ikawrakow/ik_llama.cpp/issues/180) ) , along with some other stuff from fairydreaming that is experimental (mmap) and QoL stuff (MoE warmup actually loads in all experts) in this branch saood06/ik_llama.cpp/pull/1 .
Although the mmap allocator is working for me (and I might create a PR with it being toggled via a CLI argument) I think when MLA is toggled on the other KV cache should not allocate.
>Have you tried running perplexity? Just a few chunks to compare to your best quantized model
+>...
>Can you try without MLA?
When I have some more time I will.
---
-👤 **saood06** submitted a review the **2025-02-06** at **16:18:14**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-02-06** at **16:18:14** on `src/llama.cpp`:
-
->in DeepSeek-Lite there is a single layer without MoE, but in DeepSeek-R1 there are 3 such layers
-
-The additional 2 layers of dense, means you hit 2 less MoE layers with this then you on Lite, and this is still the only meaningful way I can see that the quant I just made is worse, basically everything else is better, or the same.
-
----
-
-👤 **saood06** commented the **2025-02-06** at **20:26:59**:
+👤 **saood06** commented on **2025-02-06** at **20:26:59**
@ikawrakow
>Have you tried running perplexity? Just a few chunks to compare to your best quantized model
-Model | [1] | [2] |[3] |[4] | [5] |[6]| [7]| [8] |[9] |[10] |[11] |[12]
---- | --- | --- | --- |--- |--- |--- |--- |--- |--- |--- |--- | ---
+Quant | [1] | [2] |[3] |[4] | [5] |[6]| [7]| [8] |[9] |[10] |[11] |[12]|[13]|[14]|[15]|[16]
+--- | --- | --- | --- |--- |--- |--- |--- |--- |--- |--- |--- | ---|---|---|---|---
IQ2_XXS **| 3.39| 4.56| 3.44| 3.27| 3.27| 3.20| 3.12 | 3.12|
IQ3_XXS ** | 2.69 | 3.53| 2.51 | 2.11 | 1.91 | 1.78 | 1.69 | 1.62|
-IQ4_K_R4 (V1) | 2.5954 | 3.3338| 2.3993 |1.9972 |1.8080 |1.6659 |1.5697| 1.5047| 1.4555| 1.4154| 1.4007| 1.4493
+IQ4_K_R4 (V1) | 2.5954 | 3.3338| 2.3993 |1.9972 |1.8080 |1.6659 |1.5697| 1.5047| 1.4555| 1.4154| 1.4007| 1.4493|1.4581|1.5866|1.7193|1.7815
UD-IQ1_M **| 3.4155 |4.2311 | 3.0817 | 2.8601 | 2.6933 | 2.5792 | 2.5123 | 2.5239
UD-IQ1_S ** | 3.8939 |4.7189 | 3.7812 | 3.6799 | 3.6215 | 3.6922 | 3.6442| 3.7472| 3.8353| 3.7663| 3.8983| 4.0621
IQ1_S_R4 (V2)| 3.7554 |4.6569 |3.5681 |3.4458| nan| nan| nan| nan| nan| nan| nan| nan| nan| nan| nan| nan|
+IQ1_S_R4 (V2) -b 4096 | 3.7554 |4.6569|3.5681|3.4458|3.5419|3.5822|3.5429|3.6624|3.7312|3.6580|3.7719|3.9520|nan|nan|nan|nan
+IQ1_S_R4 (V1) -b 4096 | 3.6625| 4.5832 |3.5418| 3.4340| nan| nan | nan |nan
** is data that was posted by other people online, not my tests.
UD refers to Unsloth quants.
(V2) for IQ1_S_R4 refers to the one that had the one token
-(V1) for IQ4_K_R4 refers to the fact that I plan to requant this.
+(V1) for IQ1_S_R4 refers to the one that had only nulls.
+(V1) for IQ4_K_R4 refers to the fact that I plan to requant this.
+
+Edit:
+Added run with -b 4096 for both v2 and v1
---
-👤 **ikawrakow** commented the **2025-02-07** at **06:33:14**:
+👤 **ikawrakow** commented on **2025-02-07** at **06:33:14**
@saood06 Thanks for these results.
So, it looks like `IQ1_S_R4` is better than Unsloth's until something goes wrong. There seems to be an issue in `ggml` itself as the result is supposed to be independent of batch size, but it isn't in the `IQ1_S_R4` runs where we get `NaN` in the 5th chunk with the default batch size and not `NaN` with a batch size of 4096. Something strange happens in the 5th chunk as `IQ1_S_R4` PPL with batch size 4096 is higher than the 4th chunk while it is lower for all other quants.
-I have added some extra guards in #191, but they never trigger with DeepSeek-Lite or LLaMA-3.1-8B-Instruct, so not sure if this will help. It may be useful to try `IQ1_M_R4` and see how that goes.
+I have added some extra guards in [#191](https://github.com/ikawrakow/ik_llama.cpp/issues/191), but they never trigger with DeepSeek-Lite or LLaMA-3.1-8B-Instruct, so not sure if this will help. It may be useful to try `IQ1_M_R4` and see how that goes.
---
-👤 **ikawrakow** commented the **2025-02-07** at **10:05:20**:
+👤 **ikawrakow** commented on **2025-02-07** at **10:05:20**
-@saood06 I would appreciate if you tried running the `IQ1_S_R4` DeepSeek-R1 model with #192. There appears to be a race on the main branch that can cause the NaNs, and #192 hopefully fixes that.
+@saood06 I would appreciate if you tried running the `IQ1_S_R4` DeepSeek-R1 model with [#192](https://github.com/ikawrakow/ik_llama.cpp/issues/192). There appears to be a race on the main branch that can cause the NaNs, and [#192](https://github.com/ikawrakow/ik_llama.cpp/issues/192) hopefully fixes that.
---
-👤 **saood06** commented the **2025-02-07** at **22:41:11**:
+👤 **saood06** commented on **2025-02-07** at **22:41:11**
@ikawrakow
-I have tested #192 by merging it into my WIP testing branch, saood06/ik_llama.cpp/pull/1. IQ1_S_R4 (V2) and in my single very basic test it now functions (produced coherent output), but it still produced `NaN` in the perplexity test from chunk 13 and on, and the perplexity values for it and other quants have changed slightly compared to previously. No results for IQ1_S_R4 (V1) as I deleted that and don't feel like recreating it.
+I have tested [#192](https://github.com/ikawrakow/ik_llama.cpp/issues/192) by merging it into my WIP testing branch, saood06/ik_llama.cpp/pull/1. IQ1_S_R4 (V2) in my single very basic test it now functions (produced coherent output), but it still produced `NaN` in the perplexity test from chunk 13 and on, and the perplexity values for it and other quants have changed slightly compared to previously. No results for IQ1_S_R4 (V1) as I deleted that and don't feel like recreating it.
Only including new results in the table below.
@@ -4303,15 +4325,17 @@ llama_model_quantize_internal: quant size = 367657.12 MB
main: quantize time = 10290932.85 ms
main: total time = 10290932.85 ms
-
+
+
+Quantization logs had to be truncated to fit github comment length limits.
---
-👤 **jukofyork** commented the **2025-02-08** at **02:53:50**:
+👤 **jukofyork** commented on **2025-02-08** at **02:53:50**
-Just saw this thread linked from the main MLA PR:
+Just saw this thread linked from the main [MLA](https://github.com/ggerganov/llama.cpp/pull/11446) PR:
-- It's some or all of the `attn_k_b.weight` tensors that can't be quantised as `float16` (it will just repeat the same word over and over in the after outputting the opening `` tag).
+- It's some or all of the `attn_k_b.weight` tensors that can't be quantised as `float16` (it will just repeat the same word over and over after outputting the opening `` tag).
- The model is also very sensitive to `ffn_down_exps.weight` bitrate (`Q3_K` or less and it starts to get *really* dumb...).
This 128 token prompt:
@@ -4327,18 +4351,18 @@ seems to be a good test of the model getting dumber, eg:
- The number of tokens in the thinking section starts to drop off.
- The story it generates won't actually use the quoted strings.
-- The "planning" in the thinking section goes way down and just write a few vague guidelines/paragraphs.
+- The "planning" in the thinking section goes way down and it just writes a few vague guidelines/paragraphs.
- It will just start to make up a vaguely "dark" story without using any of what you gave it for low `ffn_down_exps.weight` bitrate.
---
-👤 **saood06** commented the **2025-02-08** at **03:16:25**:
+👤 **saood06** commented on **2025-02-08** at **03:16:25**
@jukofyork
I was just about to edit my comment, and mention attn_k_b.weight.
-Since you found your way here, I want to tell you with a 4.52BPW (using quant types that are better than those that exist on mainline), on a dual socket dual socket Xeon E5-2690 v3 without any offloading I get this performance ( I use batched-bench to test PP performance as context grows, and also spot test TG performance at various context depths).
+Since you found your way here, I want to tell you with a 4.52BPW (using quant types that are better than those that exist on mainline llama.cpp), on a dual socket dual socket Xeon E5-2690 v3 without any offloading I get this performance ( I use batched-bench to test PP performance as context grows, and also spot test TG performance at various context depths).
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
@@ -4348,11 +4372,17 @@ Since you found your way here, I want to tell you with a 4.52BPW (using quant ty
| 1024 | 32 | 1 | 1056 | 128.774 | 7.95 | 10.440 | 3.07 | 139.215 | 7.59 |
| 2048 | 32 | 1 | 2080 | 287.581 | 7.12 | 10.958 | 2.92 | 298.538 | 6.97 |
-My initial tests with offloading ( on mainline llama.cpp with the PR that lets override tensor placement to keep non-shared experts on CPU) showed worse performance the more layers I offloaded. This fork currently is missing some RPC fixes that would support this model, and also some RPC performance tweaks, but I do plan to bring those over here.
+My initial tests with offloading ( on mainline llama.cpp with the PR that lets override tensor placement to keep non-shared experts on CPU) showed worse performance the more layers I offloaded. This fork currently is missing some RPC fixes that would support this model, and also some RPC performance tweaks, but I do plan to bring those over here.
+
+Edit:
+
+>The "planning" in the thinking section goes way down and it just writes a few vague guidelines/paragraphs.
+
+This I've noticed and it has bothered me, although I don't have much reference as almost all of my usage has been with MLA, and the little that hasn't has been at low contexts.
---
-👤 **ikawrakow** commented the **2025-02-08** at **07:18:55**:
+👤 **ikawrakow** commented on **2025-02-08** at **07:18:55**
> Off topic but when should you use Q8_K_R8 vs Q8_0_R8?
@@ -4374,13 +4404,15 @@ And here the same comparison on Zen4 (Ryzen-7950X)
| llama 8B Q8_0 | 7.95 GiB | 16 | 1 | 1 | pp512 | 304.90 ± 0.12 |
| llama 8B Q8_K_R8 | 7.56 GiB | 16 | 1 | 1 | pp512 | 387.23 ± 1.10 |
+In these tables `Q8_0_R8` is `Q8_0` with `rtr=1`.
+
To put things in perspective, the best mainline `llama.cpp` can do on the Ryzen-7950X is 165 t/s for `Q4_0` (fastest quant in `llama.cpp`). On my M2-Max `Q8_K_R8` gets 172 t/s vs 125 t/s for `Q4_0`.
-On the Ryzen-7950X memory bandwidth is fully saturated with just 2 threads with `Q8_K_R8` for TG. Which means that I can let the LLM run and generate tokens while I'm doing something else without the system feeling totally bogged down.
+On the Ryzen-7950X memory bandwidth is fully saturated with just 2 threads with `Q8_K_R8` for TG. Which means that I can let the LLM run and generate tokens using just 2 threads while I'm doing something else without the system feeling totally bogged down.
---
-👤 **ikawrakow** commented the **2025-02-08** at **07:36:52**:
+👤 **ikawrakow** commented on **2025-02-08** at **07:36:52**
Concerning `fp16` vs `bf16` for `attn_k_b`: In mainline `llama.cpp` when a model tensor is `fp16`, activations get converted from `fp32` (the result of the previous operation) to `fp16` before performing the matrix multiplication with the `fp16` model tensor. If the observation is that the model becomes "dumb" when `attn_k_b` is `fp16`, the conclusion is that there are activations that are outside of the `fp16` range, and they get truncated in the conversion. This is not the case in this repository, at least not on `x86_64`. I have matrix multiplication kernels for any `fpX x fpY` combination, so for model tensors in `fp16` the matrix multiplication is done directly on the `fp32` activations. Hence, there shouldn't be any accuracy loss (unless the model contains weights outside of the `fp16` range). On `ARM`, I still convert the activations to `fp16` as `fp16 x fp16` matrix multiplications are almost 2X faster on my M2-Max.
diff --git a/github-data/pull_requests/186 - iq1_s_r4_ slightly faster NEON gemm_gemv.md b/github-data/pull_requests/186 - iq1_s_r4 slightly faster NEON gemmgemv.md
similarity index 71%
rename from github-data/pull_requests/186 - iq1_s_r4_ slightly faster NEON gemm_gemv.md
rename to github-data/pull_requests/186 - iq1_s_r4 slightly faster NEON gemmgemv.md
index 3255a354d..ebc81d1fb 100644
--- a/github-data/pull_requests/186 - iq1_s_r4_ slightly faster NEON gemm_gemv.md
+++ b/github-data/pull_requests/186 - iq1_s_r4 slightly faster NEON gemmgemv.md
@@ -1,14 +1,17 @@
-### 🔀 [#186](https://github.com/ikawrakow/ik_llama.cpp/pull/186) - iq1_s_r4: slightly faster NEON gemm/gemv
+## 🔀 [Pull Request #186](https://github.com/ikawrakow/ik_llama.cpp/pull/186) - iq1_s_r4: slightly faster NEON gemm/gemv
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_s_r4_neon` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-05 |
| **Updated** | 2025-02-05 |
+| **Merged** | 2025-02-05 |
---
-#### Description
+## 📄 Description
DeepSeek-Lite on M2-Max CPU:
diff --git a/github-data/pull_requests/187 - IQ1_M_R4_ better 1.75 bpw quants.md b/github-data/pull_requests/187 - IQ1_M_R4 better 1.75 bpw quants.md
similarity index 87%
rename from github-data/pull_requests/187 - IQ1_M_R4_ better 1.75 bpw quants.md
rename to github-data/pull_requests/187 - IQ1_M_R4 better 1.75 bpw quants.md
index 9a1e377f3..9e60f64f8 100644
--- a/github-data/pull_requests/187 - IQ1_M_R4_ better 1.75 bpw quants.md
+++ b/github-data/pull_requests/187 - IQ1_M_R4 better 1.75 bpw quants.md
@@ -1,16 +1,19 @@
-### 🔀 [#187](https://github.com/ikawrakow/ik_llama.cpp/pull/187) - IQ1_M_R4: better 1.75 bpw quants
+## 🔀 [Pull Request #187](https://github.com/ikawrakow/ik_llama.cpp/pull/187) - IQ1_M_R4: better 1.75 bpw quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_m_r4` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-06 |
| **Updated** | 2025-02-06 |
+| **Merged** | 2025-02-06 |
---
-#### Description
+## 📄 Description
-Following in the foot steps of #185, this PR adds `IQ1_M_R4`, a 4-row interleaved version of `IQ1_M`.
+Following in the foot steps of [#185](https://github.com/ikawrakow/ik_llama.cpp/issues/185), this PR adds `IQ1_M_R4`, a 4-row interleaved version of `IQ1_M`.
* I have removed the `f16` super-block scale (replaced with a `f16` per row scale) and have changed the 3-bit `IQ1_M` block scales with 4 bit. Hence, we end up using the same 1.75 bpw as `IQ1_M`.
* The above change allows to implement `IQ1_M_R4` with a block size of 32. I wanted to have this because DeepSeek-Lite, the model I'm testing with, has a lot of tensors with row sizes not divisible by 256, so a significant fraction of tensors gets quantized to `IQ4_NL` when using `IQ1_M`
diff --git a/github-data/pull_requests/188 - Add optional MLA.md b/github-data/pull_requests/188 - Add optional MLA.md
index 4ac97369d..eac1b86e7 100644
--- a/github-data/pull_requests/188 - Add optional MLA.md
+++ b/github-data/pull_requests/188 - Add optional MLA.md
@@ -1,50 +1,53 @@
-### 🔀 [#188](https://github.com/ikawrakow/ik_llama.cpp/pull/188) - Add optional MLA
+## 🔀 [Pull Request #188](https://github.com/ikawrakow/ik_llama.cpp/pull/188) - Add optional MLA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mla` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-06 |
| **Updated** | 2025-02-11 |
+| **Merged** | 2025-02-09 |
---
-#### Description
+## 📄 Description
-This PR is derived from #180. The difference to #180 is that MLA is made optional. It is off by default, and can be turned on using the added `-mla` or `--use-mla` command line option.
+This PR is derived from [#180](https://github.com/ikawrakow/ik_llama.cpp/issues/180). The difference to [#180](https://github.com/ikawrakow/ik_llama.cpp/issues/180) is that MLA is made optional. It is off by default, and can be turned on using the added `-mla` or `--use-mla` command line option.
Rationale: MLA improves TG speed, especially when there is a long context. But it also makes prompt processing significantly slower. Hence, MLA is made optional since advantage/disadvantage is use case dependent.
-Being able to select or deselect MLA at run time is possible due to the fact that #180 leaves the original `wkv_b` tensor and its decomposition into `wk_b` and `wv_b` in the model. This is somewhat wasteful, but these tensors are not very large and now come handy to easily select between the two attention implementations.
+Being able to select or deselect MLA at run time is possible due to the fact that [#180](https://github.com/ikawrakow/ik_llama.cpp/issues/180) leaves the original `wkv_b` tensor and its decomposition into `wk_b` and `wv_b` in the model. This is somewhat wasteful, but these tensors are not very large and now come handy to easily select between the two attention implementations.
In addition:
* It is now possible to use a model converted without this PR so that the `wk_b` and `wk_v` tensors are missing. In this case MLA will be disabled even if requested on the command line
-* Eliminated some unnecessary copies (`ggml_cont`). This repo has supported non-contiguous RoPE for a while and con-contiguous RMS norm on CUDA was added in #190 (the CPU has always supported non-contiguous RMS norm).
+* Eliminated some unnecessary copies (`ggml_cont`). This repo has supported non-contiguous RoPE for a while and con-contiguous RMS norm on CUDA was added in [#190](https://github.com/ikawrakow/ik_llama.cpp/issues/190) (the CPU has always supported non-contiguous RMS norm).
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-02-08** at **11:23:52**:
+👤 **saood06** commented on **2025-02-08** at **11:23:52**
-There were some other change's in the gguf-py/gguf/tensor_mapping.py that are in that branch that I missed porting over earlier.
+There were some other change's in the gguf-py/gguf/tensor_mapping.py that are in https://github.com/saood06/ik_llama.cpp/pull/1 that I missed porting over earlier.
The next thing I was going to do was remove the old KV from being allocated, I hadn't gotten around to it, as I had a workaround from the mmap KV cache feature, but it should be a relatively simple fix, when I have more time I'll look into it.
---
-👤 **saood06** commented the **2025-02-08** at **19:51:36**:
+👤 **saood06** commented on **2025-02-08** at **19:51:36**
-@ikawrakow I made #195 to merge into this with the things mentioned.
+@ikawrakow I made [#195](https://github.com/ikawrakow/ik_llama.cpp/issues/195) to merge into this with the things mentioned.
---
-👤 **ikawrakow** commented the **2025-02-09** at **11:09:23**:
+👤 **ikawrakow** commented on **2025-02-09** at **11:09:23**
I think we can merge this now.
---
-👤 **saood06** submitted a review the **2025-02-09** at **17:28:01**: ✅ `APPROVED`
+👤 **saood06** approved this pull request ✅ on **2025-02-09** at **17:28:01**
LGTM, good catch on applying cache quantization, it was something I had missed. BF16 makes sense when it is faster, but I never bothered as I'm assuming it would come with a large quality loss.
@@ -54,7 +57,7 @@ Testing was a bit of a pain without the warmup MoE fix as loading in experts tak
---
-👤 **ikawrakow** commented the **2025-02-09** at **17:48:32**:
+👤 **ikawrakow** commented on **2025-02-09** at **17:48:32**
> BF16 makes sense when it is faster, but I never bothered as I'm assuming it would come with a large quality loss.
@@ -66,7 +69,7 @@ Sounds good.
---
-👤 **saood06** commented the **2025-02-09** at **18:28:01**:
+👤 **saood06** commented on **2025-02-09** at **18:28:01**
> > BF16 makes sense when it is faster, but I never bothered as I'm assuming it would come with a large quality loss.
>
@@ -74,18 +77,16 @@ Sounds good.
>
I mispoke, I meant I never bothered quantizing the MLA version down to Q4 or Q6 as I did with the non MLA solution. I know most models are bf16 native (Deepseek was FP8 native which I had to upscale to BF16 before making the GGUF), and I would use BF16 if I had a modern processor with support for it.
-The old solution was MHA, which quantizes down very well, and is large enough to warrant it. Heavy GQA does not, MLA is sized like GQA and also small enough where I'm fine leaving it in F16, as my CPU is old and doesn't do BF16 but if I had a modern CPU I would use BF16.
+The old solution was MHA, which quantizes down very well, and is large enough to warrant it. Heavy GQA does not, MLA is sized like heavy GQA and is also small enough where I'm fine leaving it in F16 and not smaller and not BF16 as my CPU is old and doesn't do BF16 well.
---
-👤 **saood06** submitted a review the **2025-02-11** at **20:15:12**: 💬 `COMMENTED`
+👤 **saood06** started a conversation on `src/llama.cpp` on **2025-02-11** at **20:15:12**
----
-
-👤 **saood06** commented during a code review the **2025-02-11** at **20:20:39** on `src/llama.cpp`:
-
-With the above change only one of these should be allocated so that is the only one that should be displayed as KV self size
+Sorry I missed this, but I think this should be in the if block above as it is not needed for non MLA models.
---
-👤 **saood06** submitted a review the **2025-02-11** at **20:20:40**: 💬 `COMMENTED`
\ No newline at end of file
+👤 **saood06** started a conversation on `src/llama.cpp` on **2025-02-11** at **20:20:39**
+
+With the above change only one of these should be allocated so that is the only one that should be displayed as KV self size
\ No newline at end of file
diff --git a/github-data/pull_requests/189 - Rename q4_0_r4 q8_0_r4 and iq4_xs_r4 to _r8.md b/github-data/pull_requests/189 - Rename q4_0_r4 q8_0_r4 and iq4_xs_r4 to _r8.md
new file mode 100644
index 000000000..a3a97f78b
--- /dev/null
+++ b/github-data/pull_requests/189 - Rename q4_0_r4 q8_0_r4 and iq4_xs_r4 to _r8.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #189](https://github.com/ikawrakow/ik_llama.cpp/pull/189) - Rename q4_0_r4, q8_0_r4 and iq4_xs_r4 to _r8
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/rename_4_8` |
+| **Target Branch** | `main` |
+| **Created** | 2025-02-06 |
+| **Updated** | 2025-02-06 |
+| **Merged** | 2025-02-06 |
+
+---
+
+## 📄 Description
+
+to reflect the actual number of interleaved rows.
\ No newline at end of file
diff --git a/github-data/pull_requests/189 - Rename q4_0_r4_ q8_0_r4 and iq4_xs_r4 to _r8.md b/github-data/pull_requests/189 - Rename q4_0_r4_ q8_0_r4 and iq4_xs_r4 to _r8.md
deleted file mode 100644
index 238aca991..000000000
--- a/github-data/pull_requests/189 - Rename q4_0_r4_ q8_0_r4 and iq4_xs_r4 to _r8.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🔀 [#189](https://github.com/ikawrakow/ik_llama.cpp/pull/189) - Rename q4_0_r4, q8_0_r4 and iq4_xs_r4 to _r8
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-02-06 |
-| **Updated** | 2025-02-06 |
-
----
-
-#### Description
-
-to reflect the actual number of interleaved rows.
\ No newline at end of file
diff --git a/github-data/pull_requests/19 - Skip barriers of noops.md b/github-data/pull_requests/19 - Skip barriers of noops.md
index 78ecbba8c..2e0a89996 100644
--- a/github-data/pull_requests/19 - Skip barriers of noops.md
+++ b/github-data/pull_requests/19 - Skip barriers of noops.md
@@ -1,14 +1,17 @@
-### 🔀 [#19](https://github.com/ikawrakow/ik_llama.cpp/pull/19) - Skip barriers of noops
+## 🔀 [Pull Request #19](https://github.com/ikawrakow/ik_llama.cpp/pull/19) - Skip barriers of noops
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/skip_noop_barriers` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-14 |
| **Updated** | 2024-08-14 |
+| **Merged** | 2024-08-14 |
---
-#### Description
+## 📄 Description
`GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE`, along with `GGML_OP_NONE`, are all noops in `ggml`. I.e., nothing happens. But `ggml` still has a thread barrier after them, which wastes time. The waste is not too bad for large models where computations are long compared to the time taken for thread synchronization. But for small models skipping those unnecessary waits makes a noticeable difference.
diff --git a/github-data/pull_requests/190 - cuda non-contiguous rms norm.md b/github-data/pull_requests/190 - cuda non-contiguous rms norm.md
new file mode 100644
index 000000000..a2dd29774
--- /dev/null
+++ b/github-data/pull_requests/190 - cuda non-contiguous rms norm.md
@@ -0,0 +1,18 @@
+## 🔀 [Pull Request #190](https://github.com/ikawrakow/ik_llama.cpp/pull/190) - cuda: non-contiguous rms norm
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_rms_non_contiguous` |
+| **Target Branch** | `main` |
+| **Created** | 2025-02-06 |
+| **Updated** | 2025-02-07 |
+| **Merged** | 2025-02-07 |
+
+---
+
+## 📄 Description
+
+Derived from https://github.com/ggerganov/llama.cpp/pull/11659
+
+Minor benefit for DeepSeek-Lite (~2% faster TG).
\ No newline at end of file
diff --git a/github-data/pull_requests/190 - cuda_ non-contiguous rms norm.md b/github-data/pull_requests/190 - cuda_ non-contiguous rms norm.md
deleted file mode 100644
index bbb6cbd42..000000000
--- a/github-data/pull_requests/190 - cuda_ non-contiguous rms norm.md
+++ /dev/null
@@ -1,15 +0,0 @@
-### 🔀 [#190](https://github.com/ikawrakow/ik_llama.cpp/pull/190) - cuda: non-contiguous rms norm
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-02-06 |
-| **Updated** | 2025-02-07 |
-
----
-
-#### Description
-
-Derived from https://github.com/ggerganov/llama.cpp/pull/11659
-
-Minor benefit for DeepSeek-Lite (~2% faster TG).
\ No newline at end of file
diff --git a/github-data/pull_requests/191 - Add additional checks for iq1_s_r4 quantization.md b/github-data/pull_requests/191 - Add additional checks for iq1_s_r4 quantization.md
index 984912bb4..968de9fa6 100644
--- a/github-data/pull_requests/191 - Add additional checks for iq1_s_r4 quantization.md
+++ b/github-data/pull_requests/191 - Add additional checks for iq1_s_r4 quantization.md
@@ -1,13 +1,16 @@
-### 🔀 [#191](https://github.com/ikawrakow/ik_llama.cpp/pull/191) - Add additional checks for iq1_s_r4 quantization
+## 🔀 [Pull Request #191](https://github.com/ikawrakow/ik_llama.cpp/pull/191) - Add additional checks for iq1_s_r4 quantization
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_s_checks` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-07 |
| **Updated** | 2025-02-07 |
+| **Merged** | 2025-02-07 |
---
-#### Description
+## 📄 Description
-Something goes wrong when quantizing DeepSeek-R1 with `IQ1_S_R4` (see #185), so adding additional checks in the quantization.
\ No newline at end of file
+Something goes wrong when quantizing DeepSeek-R1 with `IQ1_S_R4` (see [#185](https://github.com/ikawrakow/ik_llama.cpp/issues/185)), so adding additional checks in the quantization.
\ No newline at end of file
diff --git a/github-data/pull_requests/192 - Revert 79.md b/github-data/pull_requests/192 - Revert 79.md
new file mode 100644
index 000000000..ece21262e
--- /dev/null
+++ b/github-data/pull_requests/192 - Revert 79.md
@@ -0,0 +1,22 @@
+## 🔀 [Pull Request #192](https://github.com/ikawrakow/ik_llama.cpp/pull/192) - Revert [#79](https://github.com/ikawrakow/ik_llama.cpp/issues/79)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/revert_0bf4d997` |
+| **Target Branch** | `main` |
+| **Created** | 2025-02-07 |
+| **Updated** | 2025-02-08 |
+| **Merged** | 2025-02-08 |
+
+---
+
+## 📄 Description
+
+While testing potential improvements of `IQ1_S_R4` quantization, I ran into NaNs while running a DeepSeek-Lite perplexity calculation. I did a `grep -r` on a folder with many big files while running the calculation and suddenly I got a NaN PPL. I repeated the calculation without doing anything else at the same time and the NaN did not happen. I then ran with 32 threads on a 16-core system and was able to reliably get a NaN at some random chunk.
+
+This means there is a race.
+
+The race was most likely introduced in [#79](https://github.com/ikawrakow/ik_llama.cpp/issues/79) (avoid repeating already done quantizations of activations). I honestly do not understand why there could be a race, or even less do I understand why it would only happen for DeepSeek-Lite quantized with `IQ1_S_R4`. I have done countless runs since [#79](https://github.com/ikawrakow/ik_llama.cpp/issues/79) and never observed anything suspicious.
+
+Either way, this PR reverts [#79](https://github.com/ikawrakow/ik_llama.cpp/issues/79). After doing so, there aren't any NaNs no matter how busy I make the system while running DeepSeek-Lite inference. Hopefully this will also fix the NaNs @saood06 gets with `IQ1_S_R4` quantized DeepSeek-R1 (see discussion in [#185](https://github.com/ikawrakow/ik_llama.cpp/issues/185)).
\ No newline at end of file
diff --git a/github-data/pull_requests/192 - Revert _79.md b/github-data/pull_requests/192 - Revert _79.md
deleted file mode 100644
index 9c9084cc6..000000000
--- a/github-data/pull_requests/192 - Revert _79.md
+++ /dev/null
@@ -1,19 +0,0 @@
-### 🔀 [#192](https://github.com/ikawrakow/ik_llama.cpp/pull/192) - Revert [#79](https://github.com/ikawrakow/ik_llama.cpp/issues/79)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-02-07 |
-| **Updated** | 2025-02-08 |
-
----
-
-#### Description
-
-While testing potential improvements of `IQ1_S_R4` quantization, I ran into NaNs while running a DeepSeek-Lite perplexity calculation. I did a `grep -r` on a folder with many big files while running the calculation and suddenly I got a NaN PPL. I repeated the calculation without doing anything else at the same time and the NaN did not happen. I then ran with 32 threads on a 16-core system and was able to reliably get a NaN at some random chunk.
-
-This means there is a race.
-
-The race was most likely introduced in #79 (avoid repeating already done quantizations of activations). I honestly do not understand why there could be a race, or even less do I understand why it would only happen for DeepSeek-Lite quantized with `IQ1_S_R4`. I have done countless runs since #79 and never observed anything suspicious.
-
-Either way, this PR reverts #79. After doing so, there aren't any NaNs no matter how busy I make the system while running DeepSeek-Lite inference. Hopefully this will also fix the NaNs @saood06 gets with `IQ1_S_R4` quantized DeepSeek-R1 (see discussion in #185).
\ No newline at end of file
diff --git a/github-data/pull_requests/193 - RPC sync.md b/github-data/pull_requests/193 - RPC sync.md
index 2f6d4bb29..758395ca1 100644
--- a/github-data/pull_requests/193 - RPC sync.md
+++ b/github-data/pull_requests/193 - RPC sync.md
@@ -1,14 +1,16 @@
-### 🔀 [#193](https://github.com/ikawrakow/ik_llama.cpp/pull/193) - RPC sync
+## 🔀 [Pull Request #193](https://github.com/ikawrakow/ik_llama.cpp/pull/193) - RPC sync
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `s6/rpc` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-08 |
| **Updated** | 2025-06-15 |
---
-#### Description
+## 📄 Description
I grabbed all of the changes needed for [llama.cpp/pull/11047](https://github.com/ggerganov/llama.cpp/pull/11047) , which was https://github.com/ggerganov/llama.cpp/pull/9912 and https://github.com/ggerganov/llama.cpp/pull/9040
@@ -16,15 +18,15 @@ This compiles, but has not been tested yet.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-02-08** at **13:23:08**:
+👤 **ikawrakow** commented on **2025-02-08** at **13:23:08**
I never use RPC, have never looked into the RPC code, so I'll have to rely on you for self-review and testing.
---
-👤 **saood06** commented the **2025-02-10** at **16:40:34**:
+👤 **saood06** commented on **2025-02-10** at **16:40:34**
@jukofyork
>I strongly suspect something funky is going on
@@ -36,33 +38,35 @@ This fork has much faster PP speeds, has Deepseek MLA support with a flag (-mla)
---
-👤 **saood06** commented the **2025-02-27** at **23:11:54**:
+👤 **saood06** commented on **2025-02-27** at **23:11:54**
This has been tested, and does not currently work. I'm not sure why as the errors I'm getting seem to have never been encountered by people on llama.cpp.
---
-👤 **saood06** submitted a review the **2025-02-27** at **23:14:23**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-02-27** at **23:14:23** on `ggml/src/ggml-rpc.cpp`:
+👤 **saood06** started a conversation on `ggml/src/ggml-rpc.cpp` on **2025-02-27** at **23:14:23**
The RPC client crashes here, which happens as the RPC server hits an issue.
---
-👤 **saood06** submitted a review the **2025-02-27** at **23:17:32**: 💬 `COMMENTED`
+👤 **saood06** started a conversation on `ggml/src/ggml-rpc.cpp` on **2025-02-27** at **23:17:32**
+
+I'm fairly certain this is where the RPC server is crashing, although it doesn't print the message as I never ran with GGML_DEBUG on.
---
-👤 **saood06** commented during a code review the **2025-02-27** at **23:17:32** on `ggml/src/ggml-rpc.cpp`:
+👤 **ubergarm** commented on **2025-04-11** at **18:32:04**
-I'm fairly certain this is where the RPC server is crashing, although it doesn't print the message as I never ran with GGML_DEBUG on.
+@saood06
+
+I just came across another [llama.cpp fork called prima.cpp](https://github.com/Lizonghang/prima.cpp?tab=readme-ov-file#-key-features) which claims to have improved support for multi-device distributed inferencing.
+
+I haven't tried it, just saw it on reddit today. Might be worth a shot given your GPU is in a different system than your big RAM box.
---
-👤 **saood06** commented the **2025-04-12** at **04:39:37**:
+👤 **saood06** commented on **2025-04-12** at **04:39:37**
> @saood06
>
@@ -74,6 +78,6 @@ Thanks for the link, it is interesting. I think it would work for dense models b
---
-👤 **saood06** commented the **2025-06-15** at **11:26:50**:
+👤 **saood06** commented on **2025-06-15** at **11:26:50**
-Closed as superseded by #480 / #506
\ No newline at end of file
+Closed as superseded by [#480](https://github.com/ikawrakow/ik_llama.cpp/issues/480) / [#506](https://github.com/ikawrakow/ik_llama.cpp/issues/506)
\ No newline at end of file
diff --git a/github-data/pull_requests/194 - Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications.md b/github-data/pull_requests/194 - Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications.md
index fdb2eb77e..fb30e8735 100644
--- a/github-data/pull_requests/194 - Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications.md
+++ b/github-data/pull_requests/194 - Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications.md
@@ -1,14 +1,17 @@
-### 🔀 [#194](https://github.com/ikawrakow/ik_llama.cpp/pull/194) - Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications
+## 🔀 [Pull Request #194](https://github.com/ikawrakow/ik_llama.cpp/pull/194) - Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_s_r4_k128` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-08 |
| **Updated** | 2025-02-09 |
+| **Merged** | 2025-02-09 |
---
-#### Description
+## 📄 Description
@saood06 is still observing NaNs for DeepSeek-R1 quantized with `IQ1_S_R4`. As I don't see what else could be wrong, I'm making the following hypothesis:
@@ -25,9 +28,9 @@ Would appreciate if this gets tested with DeepSeek-R1.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-02-08** at **21:39:38**:
+👤 **saood06** commented on **2025-02-08** at **21:39:38**
@ikawrakow
>Would appreciate if this gets tested with DeepSeek-R1.
@@ -40,6 +43,6 @@ No more `NaN`'s, nice! It's impressive how quickly you found the race condition
---
-👤 **ikawrakow** commented the **2025-02-09** at **06:02:29**:
+👤 **ikawrakow** commented on **2025-02-09** at **06:02:29**
Thank you for this! The decisive hint to solve it was the discussion about DeepSeek-R1 being dumb with `fp16` attention tensors that you alerted me to.
\ No newline at end of file
diff --git a/github-data/pull_requests/195 - Deepseek MLA Optimizations V2.md b/github-data/pull_requests/195 - Deepseek MLA Optimizations V2.md
index 7b83da6d4..d4dc0fab9 100644
--- a/github-data/pull_requests/195 - Deepseek MLA Optimizations V2.md
+++ b/github-data/pull_requests/195 - Deepseek MLA Optimizations V2.md
@@ -1,14 +1,17 @@
-### 🔀 [#195](https://github.com/ikawrakow/ik_llama.cpp/pull/195) - Deepseek MLA Optimizations V2
+## 🔀 [Pull Request #195](https://github.com/ikawrakow/ik_llama.cpp/pull/195) - Deepseek MLA Optimizations V2
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/mla` |
+| **Target Branch** | `ik/mla` |
| **Created** | 2025-02-08 |
| **Updated** | 2025-02-09 |
+| **Merged** | 2025-02-09 |
---
-#### Description
+## 📄 Description
@ikawrakow
@@ -23,9 +26,9 @@ I will follow up with:
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-02-09** at **07:36:43**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-02-09** at **07:36:43**
Looks good. I added a minor change to check if `wk_b` and `wv_b` are available before turning on MLA (so we don't crash if someone is using an old model and asked for MLA).
diff --git a/github-data/pull_requests/197 - FA_ Add option to build all FA kernels.md b/github-data/pull_requests/197 - FA Add option to build all FA kernels.md
similarity index 52%
rename from github-data/pull_requests/197 - FA_ Add option to build all FA kernels.md
rename to github-data/pull_requests/197 - FA Add option to build all FA kernels.md
index 11dbdda2f..a45ceb221 100644
--- a/github-data/pull_requests/197 - FA_ Add option to build all FA kernels.md
+++ b/github-data/pull_requests/197 - FA Add option to build all FA kernels.md
@@ -1,14 +1,17 @@
-### 🔀 [#197](https://github.com/ikawrakow/ik_llama.cpp/pull/197) - FA: Add option to build all FA kernels
+## 🔀 [Pull Request #197](https://github.com/ikawrakow/ik_llama.cpp/pull/197) - FA: Add option to build all FA kernels
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iqk_fattn_all_quants` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-09 |
| **Updated** | 2025-02-09 |
+| **Merged** | 2025-02-09 |
---
-#### Description
+## 📄 Description
Similar to the CUDA situation.
It is OFF by default.
@@ -19,4 +22,4 @@ cmake -DGGML_IQK_FA_ALL_QUANTS=1 ...
```
This cuts compilation time for `iqk_mul_mat.cpp` by almost half (45 seconds vs 81 seconds on my Ryzen-7950X).
-This is poor men's solution of the long build time until #183 is tackled.
\ No newline at end of file
+This is poor men's solution of the long build time until [#183](https://github.com/ikawrakow/ik_llama.cpp/issues/183) is tackled.
\ No newline at end of file
diff --git a/github-data/pull_requests/198 - Load all MoE experts during warmup and make warmup 1 token.md b/github-data/pull_requests/198 - Load all MoE experts during warmup and make warmup 1 token.md
index 1c57b42c8..2e44920aa 100644
--- a/github-data/pull_requests/198 - Load all MoE experts during warmup and make warmup 1 token.md
+++ b/github-data/pull_requests/198 - Load all MoE experts during warmup and make warmup 1 token.md
@@ -1,14 +1,17 @@
-### 🔀 [#198](https://github.com/ikawrakow/ik_llama.cpp/pull/198) - Load all MoE experts during warmup and make warmup 1 token
+## 🔀 [Pull Request #198](https://github.com/ikawrakow/ik_llama.cpp/pull/198) - Load all MoE experts during warmup and make warmup 1 token
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/warmup` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-09 |
| **Updated** | 2025-02-10 |
+| **Merged** | 2025-02-10 |
---
-#### Description
+## 📄 Description
First commit is a port of: https://github.com/ggerganov/llama.cpp/pull/11571
@@ -18,16 +21,16 @@ This allows warmup to actually warmup an MoE model as all experts are exercised.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-02-10** at **07:12:56**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-02-10** at **07:12:56**
LGTM, but it does nothing on the single socket computers I have currently available, so relying on the comments in the linked PR and issue that this really improves things on NUMA systems.
---
-👤 **saood06** commented the **2025-02-10** at **14:52:48**:
+👤 **saood06** commented on **2025-02-10** at **14:52:48**
> LGTM, but it does nothing on the single socket computers I have currently available, so relying on the comments in the linked PR and issue that this really improves things on NUMA systems.
-The first commit, should work on any system to help MoE loading (Deepseek is the most noticeable because of it's large size and expert count but it should help, but all MoE should benefit) . It is only the the second commit is designed to benefit NUMA systems.
\ No newline at end of file
+The first commit, should work on any system to help MoE loading (Deepseek is the most noticeable because of it's large size and expert count but it should help all MoE) . It is only the the second commit is designed to benefit NUMA systems.
\ No newline at end of file
diff --git a/github-data/pull_requests/2 - Offload Bitnet token embeddings to the GPU - the right way.md b/github-data/pull_requests/2 - Offload Bitnet token embeddings to the GPU - the right way.md
index c698fa119..c324bd04e 100644
--- a/github-data/pull_requests/2 - Offload Bitnet token embeddings to the GPU - the right way.md
+++ b/github-data/pull_requests/2 - Offload Bitnet token embeddings to the GPU - the right way.md
@@ -1,13 +1,16 @@
-### 🔀 [#2](https://github.com/ikawrakow/ik_llama.cpp/pull/2) - Offload Bitnet token embeddings to the GPU - the right way
+## 🔀 [Pull Request #2](https://github.com/ikawrakow/ik_llama.cpp/pull/2) - Offload Bitnet token embeddings to the GPU - the right way
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bitnet_token_embedding_gpu_2` |
+| **Target Branch** | `main` |
| **Created** | 2024-07-26 |
| **Updated** | 2024-07-26 |
+| **Merged** | 2024-07-26 |
---
-#### Description
+## 📄 Description
OK, I should have checked how it was done for Gemma and do the same for Bitnet. But better late than never.
\ No newline at end of file
diff --git a/github-data/pull_requests/20 - iq2_k_ slightly better bpw - accuracy compromise.md b/github-data/pull_requests/20 - iq2_k slightly better bpw - accuracy compromise.md
similarity index 54%
rename from github-data/pull_requests/20 - iq2_k_ slightly better bpw - accuracy compromise.md
rename to github-data/pull_requests/20 - iq2_k slightly better bpw - accuracy compromise.md
index e45c6a81c..a928e36b4 100644
--- a/github-data/pull_requests/20 - iq2_k_ slightly better bpw - accuracy compromise.md
+++ b/github-data/pull_requests/20 - iq2_k slightly better bpw - accuracy compromise.md
@@ -1,14 +1,17 @@
-### 🔀 [#20](https://github.com/ikawrakow/ik_llama.cpp/pull/20) - iq2_k: slightly better bpw - accuracy compromise
+## 🔀 [Pull Request #20](https://github.com/ikawrakow/ik_llama.cpp/pull/20) - iq2_k: slightly better bpw - accuracy compromise
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_k_tweak` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-19 |
| **Updated** | 2024-08-19 |
+| **Merged** | 2024-08-19 |
---
-#### Description
+## 📄 Description
For LLaMA-3.1 models:
* It is better to quantize all of attn_v with iq3_k instead of half of attn_v with iq4_k
diff --git a/github-data/pull_requests/200 - DeepSeek FA support _CPU only_.md b/github-data/pull_requests/200 - DeepSeek FA support CPU only.md
similarity index 85%
rename from github-data/pull_requests/200 - DeepSeek FA support _CPU only_.md
rename to github-data/pull_requests/200 - DeepSeek FA support CPU only.md
index 8be0805ee..f008274d3 100644
--- a/github-data/pull_requests/200 - DeepSeek FA support _CPU only_.md
+++ b/github-data/pull_requests/200 - DeepSeek FA support CPU only.md
@@ -1,14 +1,17 @@
-### 🔀 [#200](https://github.com/ikawrakow/ik_llama.cpp/pull/200) - DeepSeek FA support (CPU only)
+## 🔀 [Pull Request #200](https://github.com/ikawrakow/ik_llama.cpp/pull/200) - DeepSeek FA support (CPU only)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fattn_Dk_Dv` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-10 |
| **Updated** | 2025-02-11 |
+| **Merged** | 2025-02-11 |
---
-#### Description
+## 📄 Description
This PR adds FA support for models where K and V head sizes are different, such as DeepSeek-R1 and DeepSeek-Lite. It only works with the standard attention mechanism, I have yet to look into FA with MLA.
@@ -16,9 +19,9 @@ We get a nice speedup for PP, increasing with context length, but TG is not fast
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-02-11** at **09:08:44**:
+👤 **ikawrakow** commented on **2025-02-11** at **09:08:44**
So, I did get some minor FA speed improvements for TG, but I don't see what else one could do, so I'll merge it.
@@ -34,8 +37,8 @@ The second graph is token generation speed (TG-64) after a prompt of a given len
---
-👤 **ikawrakow** commented the **2025-02-11** at **10:33:34**:
+👤 **ikawrakow** commented on **2025-02-11** at **10:33:34**
-Recently I read somewhere that for the "common enterprise workflow" (whatever that means) the number of generated tokens is typically only about 10% of the prompt tokens. I don't know if that is true, but for the sake of argument, let's assume for a moment that it is. In that case the best way to measure overall model performance is to use `llama-bench -pg Npp,Ntg`, where `Ntg=0.1*Npp` is the number of generated tokens and `Npp` is the number of prompt tokens. The following graph shows `PG` performance as a function of prompt length. The black symbols are mainline `llama.cpp build b9ab0a4d (4687)` (most current version as of today), the red symbols are for baseline `ik_llama.cpp` (no FA, no MLA), the green symbols are for MLA, and the blue symbols are for FA from this PR. The model is DeepSeek-Lite quantized with `IQ4_XS`. All use `Q8_0` for K cache, FA uses `Q8_0` also for V cache. All runs are on a Ryzen-7950X CPU. If we buy the claim that `Ntg ~ 0.1*Npp` in the "typical enterprise workflow", then there is no benefit from MLA over baseline, while FA is ~26% better for long prompts. Mainline `llama.cpp` is, as usual, slower. 1.45X for short prompts, increasing to 1.7X slower for prompts with 16k tokens.
+Recently I read somewhere that for the "common enterprise workflow" (whatever that means) the number of generated tokens is typically only about 10% of the prompt tokens. I don't know if that is true, but for the sake of argument, let's assume for a moment that it is. In that case the best way to measure overall model performance is to use `llama-bench -pg Npp,Ntg`, where `Ntg=0.1*Npp` is the number of generated tokens and `Npp` is the number of prompt tokens. The following graph shows `PG` performance as a function of prompt length. The black symbols are mainline `llama.cpp build b9ab0a4d (4687)` (most current version as of today), the red symbols are for baseline `ik_llama.cpp` (no FA, no MLA), the green symbols are for MLA, and the blue symbols are for FA from this PR. The model is DeepSeek-Lite quantized with `IQ4_XS`. All use `Q8_0` for K cache, FA uses `Q8_0` also for V cache. All runs are on a Ryzen-7950X CPU. If we buy the claim that `Ntg ~ 0.1*Npp` is the "typical enterprise workflow", then there is no benefit from MLA over baseline, while FA is ~26% better for long prompts. Mainline `llama.cpp` is, as usual, slower. 1.45X for short prompts, increasing to 1.7X slower for prompts with 16k tokens.

\ No newline at end of file
diff --git a/github-data/pull_requests/202 - Fix imatrix overprotectiveness.md b/github-data/pull_requests/202 - Fix imatrix overprotectiveness.md
index 9fbc1015f..5b6726867 100644
--- a/github-data/pull_requests/202 - Fix imatrix overprotectiveness.md
+++ b/github-data/pull_requests/202 - Fix imatrix overprotectiveness.md
@@ -1,14 +1,17 @@
-### 🐛 [#202](https://github.com/ikawrakow/ik_llama.cpp/pull/202) - Fix imatrix overprotectiveness
+## 🔀 [Pull Request #202](https://github.com/ikawrakow/ik_llama.cpp/pull/202) - Fix imatrix overprotectiveness
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_imatrix_nonsense` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-11 |
| **Updated** | 2025-02-12 |
+| **Merged** | 2025-02-12 |
---
-#### Description
+## 📄 Description
I hear reports that people are having trouble creating imatrix data for models with many experts (e.g., DeepSeek-R1, Arctic). For such models it may be very hard to activate all experts in all layers, which it turns out leads to the data for **the entire** tensor containing experts with missing data to be not stored in the imatrix file. Which then prevents usage of the imatrix data for low-bit quantization of such models.
@@ -23,13 +26,13 @@ This PR reduces the powers of the protection police. If a tensor is found that h
The rationale behind this approach is that if an expert was never activated after processing a significant amount of calibration data, this expert cannot be very important, so we can afford to quantize it with low bpw quants even without guidance on the importance of columns of this expert.
-Strictly speaking it would be better to leave the zeros in the imatrix data of experts that have never been activated. But this would require to go and add proper protection against all-zeros imatrices, along with the appropriate corrective action, for all quants, and not just for `IQ1_S_R4` as I did in #191. So, for now we go with same-importance columns for never activated experts.
+Strictly speaking it would be better to leave the zeros in the imatrix data of experts that have never been activated. But this would require to go and add proper protection against all-zeros imatrices, along with the appropriate corrective action, for all quants, and not just for `IQ1_S_R4` as I did in [#191](https://github.com/ikawrakow/ik_llama.cpp/issues/191). So, for now we go with same-importance columns for never activated experts.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-02-11** at **17:09:17**:
+👤 **saood06** commented on **2025-02-11** at **17:09:17**
>for the entire tensor containing experts
@@ -39,7 +42,7 @@ I plan to port over code that lets you override where certain tensors are alloca
---
-👤 **ikawrakow** commented the **2025-02-11** at **17:16:38**:
+👤 **ikawrakow** commented on **2025-02-11** at **17:16:38**
> but do you know why GGUF stores all the experts together?
diff --git a/github-data/pull_requests/204 - Fix iqk_mul_mat on AVX512 systems that are missing BF16 support.md b/github-data/pull_requests/204 - Fix iqk_mul_mat on AVX512 systems that are missing BF16 support.md
index 78c70d312..3e385265b 100644
--- a/github-data/pull_requests/204 - Fix iqk_mul_mat on AVX512 systems that are missing BF16 support.md
+++ b/github-data/pull_requests/204 - Fix iqk_mul_mat on AVX512 systems that are missing BF16 support.md
@@ -1,13 +1,16 @@
-### 🐛 [#204](https://github.com/ikawrakow/ik_llama.cpp/pull/204) - Fix iqk_mul_mat on AVX512 systems that are missing BF16 support
+## 🔀 [Pull Request #204](https://github.com/ikawrakow/ik_llama.cpp/pull/204) - Fix iqk_mul_mat on AVX512 systems that are missing BF16 support
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_missing_bf16_avx512` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-12 |
| **Updated** | 2025-02-12 |
+| **Merged** | 2025-02-12 |
---
-#### Description
+## 📄 Description
-Fixes #203
\ No newline at end of file
+Fixes [#203](https://github.com/ikawrakow/ik_llama.cpp/issues/203)
\ No newline at end of file
diff --git a/github-data/pull_requests/205 - Faster MLA prompt processing.md b/github-data/pull_requests/205 - Faster MLA prompt processing.md
index d637e1167..45f34b7f7 100644
--- a/github-data/pull_requests/205 - Faster MLA prompt processing.md
+++ b/github-data/pull_requests/205 - Faster MLA prompt processing.md
@@ -1,18 +1,21 @@
-### 🔀 [#205](https://github.com/ikawrakow/ik_llama.cpp/pull/205) - Faster MLA prompt processing
+## 🔀 [Pull Request #205](https://github.com/ikawrakow/ik_llama.cpp/pull/205) - Faster MLA prompt processing
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mla_fixes` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-12 |
| **Updated** | 2025-02-13 |
+| **Merged** | 2025-02-13 |
---
-#### Description
+## 📄 Description
This PR speeds up prompt processing (PP) when MLA is enabled. It is still slower than no-MLA, so I'm making this a draft for now to try some more. Still it would be great if somebody else tested to confirm that a) I did not introduce bugs and b) It is indeed faster on their systems.
-The PR also adds the changes suggested by @saood06 in the review of #188
+The PR also adds the changes suggested by @saood06 in the review of [#188](https://github.com/ikawrakow/ik_llama.cpp/issues/188)
Speedup is achieved by concatenating the no- and rotational position encoding parts of `K` and `Q` (this also eliminates the `k_r` cache), which allows us to combine the former `kq_nope` and `kq_pe` matrix multiplications into a single matrix multiplication. This also eliminates the fairly expensive addition of `kq_nope` and `kq_pe`.
@@ -44,23 +47,19 @@ Not sure if the ~9% improvement at 16k tokens is real. It may be just due to les
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** submitted a review the **2025-02-12** at **20:10:21**: 💬 `COMMENTED`
+👤 **saood06** started a conversation on `src/llama.cpp` on **2025-02-12** at **20:10:20**
----
-
-👤 **ikawrakow** submitted a review the **2025-02-13** at **08:57:48**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-02-13** at **08:57:48** on `src/llama.cpp`:
+We might want to print something if mla_attn is requested but not able to be run instead of just silently failing over to standard attention, I just saw a report of a user not realizing that this was happening and not sure why MLA was not giving any performance difference.
-Thanks. Added a hopefully visible warning.
+> 👤 **ikawrakow** replied on **2025-02-13** at **08:57:48**
+>
+> Thanks. Added a hopefully visible warning.
---
-👤 **ikawrakow** commented the **2025-02-13** at **09:04:18**:
+👤 **ikawrakow** commented on **2025-02-13** at **09:04:18**
The PR also adds a compile time option to disable the transposed KV cache when using MLA (simple look for `MLA_USE_TRANSPOSED_CACHE` and set it to 0). This cuts KV cache size in nearly half at the expense of a lower TG performance with long contexts. PP performance stays about the same. Here is a comparison between MLA with and without transposed cache
diff --git a/github-data/pull_requests/206 - MLA_ allow Q8_0 K-cache for MLA.md b/github-data/pull_requests/206 - MLA allow Q8_0 K-cache for MLA.md
similarity index 83%
rename from github-data/pull_requests/206 - MLA_ allow Q8_0 K-cache for MLA.md
rename to github-data/pull_requests/206 - MLA allow Q8_0 K-cache for MLA.md
index a3a9d4284..70cec02dc 100644
--- a/github-data/pull_requests/206 - MLA_ allow Q8_0 K-cache for MLA.md
+++ b/github-data/pull_requests/206 - MLA allow Q8_0 K-cache for MLA.md
@@ -1,16 +1,19 @@
-### 🔀 [#206](https://github.com/ikawrakow/ik_llama.cpp/pull/206) - MLA: allow Q8_0 K-cache for MLA
+## 🔀 [Pull Request #206](https://github.com/ikawrakow/ik_llama.cpp/pull/206) - MLA: allow Q8_0 K-cache for MLA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mla_q80` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-13 |
| **Updated** | 2025-02-13 |
+| **Merged** | 2025-02-13 |
---
-#### Description
+## 📄 Description
-After PR #205 we have two KV caches left when using MLA:
+After PR [#205](https://github.com/ikawrakow/ik_llama.cpp/issues/205) we have two KV caches left when using MLA:
* `kv_l` - contiguous, not transposed
* `kvt_l` - a transposed version of `kv_l`
diff --git a/github-data/pull_requests/207 - Faster CPU TG for GQA models.md b/github-data/pull_requests/207 - Faster CPU TG for GQA models.md
index b52fb7b77..09bb3036a 100644
--- a/github-data/pull_requests/207 - Faster CPU TG for GQA models.md
+++ b/github-data/pull_requests/207 - Faster CPU TG for GQA models.md
@@ -1,14 +1,17 @@
-### 🔀 [#207](https://github.com/ikawrakow/ik_llama.cpp/pull/207) - Faster CPU TG for GQA models
+## 🔀 [Pull Request #207](https://github.com/ikawrakow/ik_llama.cpp/pull/207) - Faster CPU TG for GQA models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/gemm_4d` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-15 |
| **Updated** | 2025-02-15 |
+| **Merged** | 2025-02-15 |
---
-#### Description
+## 📄 Description
This PR
* Absorbs the `iqk` matrix multiplication logic in `ggml` into a new `iqk` function `iqk_mul_mat_4d`. The change to `ggml` to incorporate the `iqk`-added functionality is now much less intusive
diff --git a/github-data/pull_requests/208 - Q8_KV_ 8-bit quantization type targeting the KV cache.md b/github-data/pull_requests/208 - Q8_KV 8-bit quantization type targeting the KV cache.md
similarity index 92%
rename from github-data/pull_requests/208 - Q8_KV_ 8-bit quantization type targeting the KV cache.md
rename to github-data/pull_requests/208 - Q8_KV 8-bit quantization type targeting the KV cache.md
index 4a363bf22..ba29bf086 100644
--- a/github-data/pull_requests/208 - Q8_KV_ 8-bit quantization type targeting the KV cache.md
+++ b/github-data/pull_requests/208 - Q8_KV 8-bit quantization type targeting the KV cache.md
@@ -1,14 +1,17 @@
-### 🔀 [#208](https://github.com/ikawrakow/ik_llama.cpp/pull/208) - Q8_KV: 8-bit quantization type targeting the KV cache
+## 🔀 [Pull Request #208](https://github.com/ikawrakow/ik_llama.cpp/pull/208) - Q8_KV: 8-bit quantization type targeting the KV cache
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q8_KV` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-18 |
| **Updated** | 2025-02-19 |
+| **Merged** | 2025-02-19 |
---
-#### Description
+## 📄 Description
What is `Q8_KV`? It is 8-bit quantization with a single scale per tensor row (so, no blocks at all). That may not be accurate enough for model quantization, but using it for KV cache quantization seems plausible, considering that there rows are defined by the head size, so contain 64, 80, 96, 128, 192, or 256 elements for all LLMs currently in circulation. We are not looking for KV cache size reduction but rather for improving inference performance for long contexts. This is especially relevant for MLA (DeepSeek) as in FA the kernels are highly optimized, so large improvements may not be really possible.
@@ -49,7 +52,7 @@ I.e., using `Q8_KV` for K-cache quantization leads to a very minor loss of accur
### Update
-I have added the last 2 rows to the above table. In `Q8_KV*` the output and token embedding tensors are quantized with `Q8_0`, so most of the accuracy loss comes from these two tensors (and they have negligible impact on performance). I have also rerun the performance tests after merging PR #210. Here are the updated results:
+I have added the last 2 rows to the above table. In `Q8_KV*` the output and token embedding tensors are quantized with `Q8_0`, so most of the accuracy loss comes from these two tensors (and they have negligible impact on performance). I have also rerun the performance tests after merging PR [#210](https://github.com/ikawrakow/ik_llama.cpp/issues/210). Here are the updated results:
| model | params | mla | test | t/s (main) | t/s (PR) | Speedup |
| -------------- | ---------: | --: | ------------: | ---------------: | ---------------: | --------: |
diff --git a/github-data/pull_requests/21 - quantize_stats print rmse and max error as fraction of x.md b/github-data/pull_requests/21 - quantize_stats print rmse and max error as fraction of x.md
new file mode 100644
index 000000000..cf9b01399
--- /dev/null
+++ b/github-data/pull_requests/21 - quantize_stats print rmse and max error as fraction of x.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #21](https://github.com/ikawrakow/ik_llama.cpp/pull/21) - quantize_stats: print rmse and max error as fraction of
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/quantize_stats` |
+| **Target Branch** | `main` |
+| **Created** | 2024-08-19 |
+| **Updated** | 2024-08-19 |
+| **Merged** | 2024-08-19 |
+
+---
+
+## 📄 Description
+
+This allows for a better comparison between different models or different tensors of the same model where the magnitude of the model weights may differ.
\ No newline at end of file
diff --git a/github-data/pull_requests/21 - quantize_stats_ print rmse and max error as fraction of _x_.md b/github-data/pull_requests/21 - quantize_stats_ print rmse and max error as fraction of _x_.md
deleted file mode 100644
index c120cd933..000000000
--- a/github-data/pull_requests/21 - quantize_stats_ print rmse and max error as fraction of _x_.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🔀 [#21](https://github.com/ikawrakow/ik_llama.cpp/pull/21) - quantize_stats: print rmse and max error as fraction of
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-08-19 |
-| **Updated** | 2024-08-19 |
-
----
-
-#### Description
-
-This allows for a better comparison between different models or different tensors of the same model where the magnitude of the model weights may differ.
\ No newline at end of file
diff --git a/github-data/pull_requests/210 - Repack also experts.md b/github-data/pull_requests/210 - Repack also experts.md
index 0ca9d81e5..d34a68d3d 100644
--- a/github-data/pull_requests/210 - Repack also experts.md
+++ b/github-data/pull_requests/210 - Repack also experts.md
@@ -1,14 +1,17 @@
-### 🔀 [#210](https://github.com/ikawrakow/ik_llama.cpp/pull/210) - Repack also experts
+## 🔀 [Pull Request #210](https://github.com/ikawrakow/ik_llama.cpp/pull/210) - Repack also experts
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/repack_also_experts` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-19 |
| **Updated** | 2025-02-19 |
+| **Merged** | 2025-02-19 |
---
-#### Description
+## 📄 Description
When I implemented run time repacking, I required the tensor to be 2D to be eligible for repacking, I guess to simplify the code. But I forgot about MoE models, where expert weights are in 3D tensors.
diff --git a/github-data/pull_requests/212 - Optimized GEMM_GEMV for IQ1_S.md b/github-data/pull_requests/212 - Optimized GEMMGEMV for IQ1_S.md
similarity index 89%
rename from github-data/pull_requests/212 - Optimized GEMM_GEMV for IQ1_S.md
rename to github-data/pull_requests/212 - Optimized GEMMGEMV for IQ1_S.md
index 6f4d0f999..693d2db8c 100644
--- a/github-data/pull_requests/212 - Optimized GEMM_GEMV for IQ1_S.md
+++ b/github-data/pull_requests/212 - Optimized GEMMGEMV for IQ1_S.md
@@ -1,14 +1,17 @@
-### 🔀 [#212](https://github.com/ikawrakow/ik_llama.cpp/pull/212) - Optimized GEMM/GEMV for IQ1_S
+## 🔀 [Pull Request #212](https://github.com/ikawrakow/ik_llama.cpp/pull/212) - Optimized GEMM/GEMV for IQ1_S
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/gemm_iq1s` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-20 |
| **Updated** | 2025-02-20 |
+| **Merged** | 2025-02-20 |
---
-#### Description
+## 📄 Description
Apparently there are many people who would prefer to just run Unsloth's `IQ1_S` DeepSeek-R1 model as is instead of quantizing to `IQ1_S_R4` and taking advantage of the better model quality and improved inference speed.
@@ -37,9 +40,9 @@ I think one can do better by interleaving 4 rows on the fly, but I leave this fo
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **godrosev** commented the **2025-02-20** at **13:15:29**:
+👤 **godrosev** commented on **2025-02-20** at **13:15:29**
ikawrakow, thank you so much. This helped me a lot!
Also, it's not that I'm reluctant to use it IQ1_S_R4。Instead, I need a smaller file size and memory (you said he would reduce it by a few GB), it's just that my current work requires running ready-made Unsloth's DeepSeek-R1.
diff --git a/github-data/pull_requests/213 - Fix NEON gemm_gemv for legacy quants when row size is not divisible by .md b/github-data/pull_requests/213 - Fix NEON gemm_gemv for legacy quants when row size is not divisible by .md
deleted file mode 100644
index 05a68442c..000000000
--- a/github-data/pull_requests/213 - Fix NEON gemm_gemv for legacy quants when row size is not divisible by .md
+++ /dev/null
@@ -1,15 +0,0 @@
-### 🐛 [#213](https://github.com/ikawrakow/ik_llama.cpp/pull/213) - Fix NEON gemm/gemv for legacy quants when row size is not divisible by 128
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-02-20 |
-| **Updated** | 2025-02-20 |
-
----
-
-#### Description
-
-I have broken it quite a while ago when I changed the NEON implementation to do two rows at a time. I haven't noticed as all models I typically use have row sizes that are multiple of 128. But as I was working on the `IQ1_S` NEON implementation for PR #212, I was testing with DeepSeek-Lite (where K cache row size is 576, so not divisible by 128), using `Q8_0` for K cache (but no FA, where it works), and was getting NaNs or gibberish. I lost so much time until I finally realized that the issue is with the K cache `Q8_0` matrix multiplication rather than my `IQ1_S` implementation.
-
-This PR fixes this.
\ No newline at end of file
diff --git a/github-data/pull_requests/213 - Fix NEON gemmgemv for legacy quants when row size is not divisible by 128.md b/github-data/pull_requests/213 - Fix NEON gemmgemv for legacy quants when row size is not divisible by 128.md
new file mode 100644
index 000000000..e988d62a3
--- /dev/null
+++ b/github-data/pull_requests/213 - Fix NEON gemmgemv for legacy quants when row size is not divisible by 128.md
@@ -0,0 +1,18 @@
+## 🔀 [Pull Request #213](https://github.com/ikawrakow/ik_llama.cpp/pull/213) - Fix NEON gemm/gemv for legacy quants when row size is not divisible by 128
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_neon_legacy_quants` |
+| **Target Branch** | `main` |
+| **Created** | 2025-02-20 |
+| **Updated** | 2025-02-20 |
+| **Merged** | 2025-02-20 |
+
+---
+
+## 📄 Description
+
+I have broken it quite a while ago when I changed the NEON implementation to do two rows at a time. I haven't noticed as all models I typically use have row sizes that are multiple of 128. But as I was working on the `IQ1_S` NEON implementation for PR [#212](https://github.com/ikawrakow/ik_llama.cpp/issues/212), I was testing with DeepSeek-Lite (where K cache row size is 576, so not divisible by 128), using `Q8_0` for K cache (but no FA, where it works), and was getting NaNs or gibberish. I lost so much time until I finally realized that the issue is with the K cache `Q8_0` matrix multiplication rather than my `IQ1_S` implementation.
+
+This PR fixes this.
\ No newline at end of file
diff --git a/github-data/pull_requests/215 - Trying to fix confusion betweem HAVE_FANCY_SIMD and AVX512.md b/github-data/pull_requests/215 - Trying to fix confusion betweem HAVE_FANCY_SIMD and AVX512.md
index 76a3f459e..7fbea1262 100644
--- a/github-data/pull_requests/215 - Trying to fix confusion betweem HAVE_FANCY_SIMD and AVX512.md
+++ b/github-data/pull_requests/215 - Trying to fix confusion betweem HAVE_FANCY_SIMD and AVX512.md
@@ -1,27 +1,23 @@
-### 🐛 [#215](https://github.com/ikawrakow/ik_llama.cpp/pull/215) - Trying to fix confusion betweem HAVE_FANCY_SIMD and AVX512
+## 🔀 [Pull Request #215](https://github.com/ikawrakow/ik_llama.cpp/pull/215) - Trying to fix confusion betweem HAVE_FANCY_SIMD and AVX512
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/issue_214` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-21 |
| **Updated** | 2025-02-21 |
---
-#### Description
+## 📄 Description
-Attempt to fix #214
+Attempt to fix [#214](https://github.com/ikawrakow/ik_llama.cpp/issues/214)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-02-21** at **10:31:43**:
+👤 **ikawrakow** commented on **2025-02-21** at **10:31:43**
-No, this isn't enough
-
----
-
-👤 **pt13762104** commented the **2025-02-21** at **11:05:11**:
-
-I'll try to run a model to see if it's working
\ No newline at end of file
+No, this isn't enough
\ No newline at end of file
diff --git a/github-data/pull_requests/216 - Hopefully this really fixes the confusion between AVX512 and FANCY_SIMD.md b/github-data/pull_requests/216 - Hopefully this really fixes the confusion between AVX512 and FANCY_SIMD.md
index 321ca89f1..de0d05b2f 100644
--- a/github-data/pull_requests/216 - Hopefully this really fixes the confusion between AVX512 and FANCY_SIMD.md
+++ b/github-data/pull_requests/216 - Hopefully this really fixes the confusion between AVX512 and FANCY_SIMD.md
@@ -1,13 +1,16 @@
-### 🐛 [#216](https://github.com/ikawrakow/ik_llama.cpp/pull/216) - Hopefully this really fixes the confusion between AVX512 and FANCY_SIMD
+## 🔀 [Pull Request #216](https://github.com/ikawrakow/ik_llama.cpp/pull/216) - Hopefully this really fixes the confusion between AVX512 and FANCY_SIMD
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_avx512_vs_fancy_simd` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-21 |
| **Updated** | 2025-02-21 |
+| **Merged** | 2025-02-21 |
---
-#### Description
+## 📄 Description
-Fixes #214
\ No newline at end of file
+Fixes [#214](https://github.com/ikawrakow/ik_llama.cpp/issues/214)
\ No newline at end of file
diff --git a/github-data/pull_requests/218 - Better strategy for attention matrix multiplications when generating to.md b/github-data/pull_requests/218 - Better strategy for attention matrix multiplications when generating tokens.md
similarity index 93%
rename from github-data/pull_requests/218 - Better strategy for attention matrix multiplications when generating to.md
rename to github-data/pull_requests/218 - Better strategy for attention matrix multiplications when generating tokens.md
index 95d301ec4..c6760859c 100644
--- a/github-data/pull_requests/218 - Better strategy for attention matrix multiplications when generating to.md
+++ b/github-data/pull_requests/218 - Better strategy for attention matrix multiplications when generating tokens.md
@@ -1,14 +1,17 @@
-### 🔀 [#218](https://github.com/ikawrakow/ik_llama.cpp/pull/218) - Better strategy for attention matrix multiplications when generating tokens
+## 🔀 [Pull Request #218](https://github.com/ikawrakow/ik_llama.cpp/pull/218) - Better strategy for attention matrix multiplications when generating tokens
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/attn_gemm` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-21 |
| **Updated** | 2025-02-22 |
+| **Merged** | 2025-02-22 |
---
-#### Description
+## 📄 Description
The `K*Q` and `V*softmax(K*Q)` matrix multiplications have the shape
@@ -20,7 +23,7 @@ $$\left(K x N_t\right) \times \left(K x N_b\right)$$
The issue with this is that for token generation (TG) we have $N_b = 1$, so we are dealing with $N_h$ matrix-vector multiplications, which are notoriously memory bound, and hence limit performance for large cache size (long contexts). To add insult to injury, the stride between consecutive rows in the left matrix is not just the row size $R$, but rather $N_k R$, so fetching data from memory is associated with big jumps and sub-optimal cache use, which is not exactly ideal in a memory bound situation.
-When $N_h > N_k$ (GQA, in that case $N_h$ is divisible by $N_k$), PR #207 changed the multiplication strategy to perform $N_k$ matrix multiplications, each with shape $\left(K x N_t\right) \times \left(K x N_h/N_k\right)$, thus turning many matrix-vector multiplications into fewer matrix-matrix multiplications. This leads to non negligible performance gains for long contexts.
+When $N_h > N_k$ (GQA, in that case $N_h$ is divisible by $N_k$), PR [#207](https://github.com/ikawrakow/ik_llama.cpp/issues/207) changed the multiplication strategy to perform $N_k$ matrix multiplications, each with shape $\left(K x N_t\right) \times \left(K x N_h/N_k\right)$, thus turning many matrix-vector multiplications into fewer matrix-matrix multiplications. This leads to non negligible performance gains for long contexts.
But when $N_h = N_k$ (e.g., DeepSeek attention architecture), the above does not work. What we could do instead is to perform $N_t x N_h$ dot products, where the inner loop is over $N_h$ and the outer loop is over $N_t$. When multi-threaded, each thread performs $N_t/M x N_h$ dot products (where $M$ is the number of threads). The advantage of doing this is that memory is accessed consecutively, resulting in better throughput and cache utilization. This is being done with this PR.
diff --git a/github-data/pull_requests/219 - Fuse MoE up and gate matrix multiplications.md b/github-data/pull_requests/219 - Fuse MoE up and gate matrix multiplications.md
index dff3e6330..4addb9ca1 100644
--- a/github-data/pull_requests/219 - Fuse MoE up and gate matrix multiplications.md
+++ b/github-data/pull_requests/219 - Fuse MoE up and gate matrix multiplications.md
@@ -1,14 +1,17 @@
-### 🔀 [#219](https://github.com/ikawrakow/ik_llama.cpp/pull/219) - Fuse MoE up and gate matrix multiplications
+## 🔀 [Pull Request #219](https://github.com/ikawrakow/ik_llama.cpp/pull/219) - Fuse MoE up and gate matrix multiplications
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fuse_moe_up_gate` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-22 |
| **Updated** | 2025-02-22 |
+| **Merged** | 2025-02-22 |
---
-#### Description
+## 📄 Description
No new op, instead the fusing is done during graph compute in the CPU back end (same could be also done for the other back ends).
diff --git a/github-data/pull_requests/22 - AVX2 quantization for Q8_K.md b/github-data/pull_requests/22 - AVX2 quantization for Q8_K.md
index 3cf43bba9..e7c3679ad 100644
--- a/github-data/pull_requests/22 - AVX2 quantization for Q8_K.md
+++ b/github-data/pull_requests/22 - AVX2 quantization for Q8_K.md
@@ -1,13 +1,16 @@
-### 🔀 [#22](https://github.com/ikawrakow/ik_llama.cpp/pull/22) - AVX2 quantization for Q8_K
+## 🔀 [Pull Request #22](https://github.com/ikawrakow/ik_llama.cpp/pull/22) - AVX2 quantization for Q8_K
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/quantize_q8k_avx2` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-19 |
| **Updated** | 2024-08-19 |
+| **Merged** | 2024-08-19 |
---
-#### Description
+## 📄 Description
It has been there for a while, but forgot to add here.
\ No newline at end of file
diff --git a/github-data/pull_requests/220 - Fix 217.md b/github-data/pull_requests/220 - Fix 217.md
new file mode 100644
index 000000000..2ee8e73b5
--- /dev/null
+++ b/github-data/pull_requests/220 - Fix 217.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #220](https://github.com/ikawrakow/ik_llama.cpp/pull/220) - Fix [#217](https://github.com/ikawrakow/ik_llama.cpp/issues/217)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/issue_217` |
+| **Target Branch** | `main` |
+| **Created** | 2025-02-22 |
+| **Updated** | 2025-02-22 |
+| **Merged** | 2025-02-22 |
+
+---
+
+## 📄 Description
+
+Closes [#217](https://github.com/ikawrakow/ik_llama.cpp/issues/217)
\ No newline at end of file
diff --git a/github-data/pull_requests/220 - Fix _217.md b/github-data/pull_requests/220 - Fix _217.md
deleted file mode 100644
index 0f37b2119..000000000
--- a/github-data/pull_requests/220 - Fix _217.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🐛 [#220](https://github.com/ikawrakow/ik_llama.cpp/pull/220) - Fix [#217](https://github.com/ikawrakow/ik_llama.cpp/issues/217)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-02-22 |
-| **Updated** | 2025-02-22 |
-
----
-
-#### Description
-
-Closes #217
\ No newline at end of file
diff --git a/github-data/pull_requests/225 - Examples _ Add new sweep-bench benchmark.md b/github-data/pull_requests/225 - Examples Add new sweep-bench benchmark.md
similarity index 67%
rename from github-data/pull_requests/225 - Examples _ Add new sweep-bench benchmark.md
rename to github-data/pull_requests/225 - Examples Add new sweep-bench benchmark.md
index 346666bad..49a7856ec 100644
--- a/github-data/pull_requests/225 - Examples _ Add new sweep-bench benchmark.md
+++ b/github-data/pull_requests/225 - Examples Add new sweep-bench benchmark.md
@@ -1,18 +1,21 @@
-### 🔀 [#225](https://github.com/ikawrakow/ik_llama.cpp/pull/225) - Examples : Add new sweep-bench benchmark
+## 🔀 [Pull Request #225](https://github.com/ikawrakow/ik_llama.cpp/pull/225) - Examples : Add new sweep-bench benchmark
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/sweep_bench` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-23 |
| **Updated** | 2025-04-26 |
+| **Merged** | 2025-02-23 |
---
-#### Description
+## 📄 Description
Port of https://github.com/ggml-org/llama.cpp/commit/9488fbf1e4334b8f189b38a7d224b8e6c1a7b22b
-This is a good tool to benchmark with as requested by #223.
+This is a good tool to benchmark with as requested by [#223](https://github.com/ikawrakow/ik_llama.cpp/issues/223).
As a very quick demo I generated this, just by running this ( ```./llama-sweep-bench -c 2048 -ub 512 -m WizardLM-2-8x22B-IQ4_K_R4.gguf -ctk q8_KV -ctv q8_0 -fa --output-format jsonl ``` and then sweep-bench-plot.py with the output).
@@ -27,15 +30,15 @@ As a very quick demo I generated this, just by running this ( ```./llama-sweep-b
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-02-23** at **06:00:18**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-02-23** at **06:00:18**
Thank you for this - can be very useful.
---
-👤 **ubergarm** commented the **2025-04-26** at **18:01:12**:
+👤 **ubergarm** commented on **2025-04-26** at **18:01:12**
@saood06 thanks I'm a convert to `llama-sweep-bench`! It is indeed very useful.
diff --git a/github-data/pull_requests/226 - Fix compilation error with IQK_FA_ALL_QUANTS enabled.md b/github-data/pull_requests/226 - Fix compilation error with IQK_FA_ALL_QUANTS enabled.md
index dc8a511e9..35faa09ec 100644
--- a/github-data/pull_requests/226 - Fix compilation error with IQK_FA_ALL_QUANTS enabled.md
+++ b/github-data/pull_requests/226 - Fix compilation error with IQK_FA_ALL_QUANTS enabled.md
@@ -1,13 +1,16 @@
-### 🐛 [#226](https://github.com/ikawrakow/ik_llama.cpp/pull/226) - Fix compilation error with IQK_FA_ALL_QUANTS enabled
+## 🔀 [Pull Request #226](https://github.com/ikawrakow/ik_llama.cpp/pull/226) - Fix compilation error with IQK_FA_ALL_QUANTS enabled
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/issue_224` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-23 |
| **Updated** | 2025-02-23 |
+| **Merged** | 2025-02-23 |
---
-#### Description
+## 📄 Description
-Closes #224
\ No newline at end of file
+Closes [#224](https://github.com/ikawrakow/ik_llama.cpp/issues/224)
\ No newline at end of file
diff --git a/github-data/pull_requests/229 - Fused MoE ffn_up and ffn_gate.md b/github-data/pull_requests/229 - Fused MoE ffn_up and ffn_gate.md
index 00cfb22f2..67c8375af 100644
--- a/github-data/pull_requests/229 - Fused MoE ffn_up and ffn_gate.md
+++ b/github-data/pull_requests/229 - Fused MoE ffn_up and ffn_gate.md
@@ -1,14 +1,17 @@
-### 🔀 [#229](https://github.com/ikawrakow/ik_llama.cpp/pull/229) - Fused MoE ffn_up and ffn_gate
+## 🔀 [Pull Request #229](https://github.com/ikawrakow/ik_llama.cpp/pull/229) - Fused MoE ffn_up and ffn_gate
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fused_up_gate_unary` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-23 |
| **Updated** | 2025-02-23 |
+| **Merged** | 2025-02-23 |
---
-#### Description
+## 📄 Description
In all MoE models one has the following sequence of operations as part of the feed forward network (simplified):
```
diff --git a/github-data/pull_requests/23 - iq4_k tweak.md b/github-data/pull_requests/23 - iq4_k tweak.md
index 9b60b5b85..165294947 100644
--- a/github-data/pull_requests/23 - iq4_k tweak.md
+++ b/github-data/pull_requests/23 - iq4_k tweak.md
@@ -1,14 +1,17 @@
-### 🔀 [#23](https://github.com/ikawrakow/ik_llama.cpp/pull/23) - iq4_k tweak
+## 🔀 [Pull Request #23](https://github.com/ikawrakow/ik_llama.cpp/pull/23) - iq4_k tweak
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_k_tweaks` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-20 |
| **Updated** | 2024-08-20 |
+| **Merged** | 2024-08-20 |
---
-#### Description
+## 📄 Description
Use `iq5_k` for `attn_v` also when `n_gqa = 2`.
This improves size vs quality tradeoff for Gemma-2 models.
diff --git a/github-data/pull_requests/231 - Fix 230.md b/github-data/pull_requests/231 - Fix 230.md
new file mode 100644
index 000000000..a24b5bcd7
--- /dev/null
+++ b/github-data/pull_requests/231 - Fix 230.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #231](https://github.com/ikawrakow/ik_llama.cpp/pull/231) - Fix [#230](https://github.com/ikawrakow/ik_llama.cpp/issues/230)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/issue_230` |
+| **Target Branch** | `main` |
+| **Created** | 2025-02-24 |
+| **Updated** | 2025-02-24 |
+| **Merged** | 2025-02-24 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/231 - Fix _230.md b/github-data/pull_requests/231 - Fix _230.md
deleted file mode 100644
index fd105e7bc..000000000
--- a/github-data/pull_requests/231 - Fix _230.md
+++ /dev/null
@@ -1,7 +0,0 @@
-### 🐛 [#231](https://github.com/ikawrakow/ik_llama.cpp/pull/231) - Fix [#230](https://github.com/ikawrakow/ik_llama.cpp/issues/230)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-02-24 |
-| **Updated** | 2025-02-24 |
\ No newline at end of file
diff --git a/github-data/pull_requests/232 - Give the user the option to override where model weights are stored.md b/github-data/pull_requests/232 - Give the user the option to override where model weights are stored.md
index ebef1ec6f..b949413a3 100644
--- a/github-data/pull_requests/232 - Give the user the option to override where model weights are stored.md
+++ b/github-data/pull_requests/232 - Give the user the option to override where model weights are stored.md
@@ -1,14 +1,17 @@
-### 🔀 [#232](https://github.com/ikawrakow/ik_llama.cpp/pull/232) - Give the user the option to override where model weights are stored
+## 🔀 [Pull Request #232](https://github.com/ikawrakow/ik_llama.cpp/pull/232) - Give the user the option to override where model weights are stored
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/buffer_type_overrides` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-24 |
| **Updated** | 2025-02-27 |
+| **Merged** | 2025-02-25 |
---
-#### Description
+## 📄 Description
It seems this PR amounts to most of the "secret sauce" of KTransformers.
@@ -35,43 +38,92 @@ Would love to hear from someone having a GPU with enough VRAM to fit all DeepSee
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-02-25** at **06:34:34**:
+👤 **ikawrakow** commented on **2025-02-25** at **06:34:34**
Here some results using `IQ4_NL`
| model | threads | mla | rtr | fmoe | test | t/s |
| --------------------- | ------: | --: | --: | ---: | ------------: | ---------------: |
-| deepseek2 16B IQ4_NL | 8 | 1 | 1 | 1 | tg64@pp128 | 53.08 ± 0.03 |
-| deepseek2 16B IQ4_NL | 8 | 1 | 1 | 1 | tg64@pp256 | 52.87 ± 0.07 |
-| deepseek2 16B IQ4_NL | 8 | 1 | 1 | 1 | tg64@pp512 | 52.53 ± 0.04 |
-| deepseek2 16B IQ4_NL | 8 | 1 | 1 | 1 | tg64@pp1024 | 51.48 ± 0.10 |
-| deepseek2 16B IQ4_NL | 8 | 1 | 1 | 1 | tg64@pp2048 | 50.40 ± 0.04 |
-| deepseek2 16B IQ4_NL | 8 | 1 | 1 | 1 | tg64@pp4096 | 48.39 ± 0.13 |
-| deepseek2 16B IQ4_NL | 8 | 1 | 1 | 1 | tg64@pp8192 | 44.00 ± 0.02 |
+| deepseek2 16B IQ4_NL | 8 | 0 | 1 | 1 | tg64@pp128 | 53.08 ± 0.03 |
+| deepseek2 16B IQ4_NL | 8 | 0 | 1 | 1 | tg64@pp256 | 52.87 ± 0.07 |
+| deepseek2 16B IQ4_NL | 8 | 0 | 1 | 1 | tg64@pp512 | 52.53 ± 0.04 |
+| deepseek2 16B IQ4_NL | 8 | 0 | 1 | 1 | tg64@pp1024 | 51.48 ± 0.10 |
+| deepseek2 16B IQ4_NL | 8 | 0 | 1 | 1 | tg64@pp2048 | 50.40 ± 0.04 |
+| deepseek2 16B IQ4_NL | 8 | 0 | 1 | 1 | tg64@pp4096 | 48.39 ± 0.13 |
+| deepseek2 16B IQ4_NL | 8 | 0 | 1 | 1 | tg64@pp8192 | 44.00 ± 0.02 |
| model | mla | rtr | fmoe | test | t/s |
| --------------------- | --: | --: | ---: | ------------: | ---------------: |
-| deepseek2 16B IQ4_NL | 1 | 1 | 1 | pp512 | 1172.35 ± 2.91 |
-| deepseek2 16B IQ4_NL | 1 | 1 | 1 | pp1024 | 1167.57 ± 1.75 |
-| deepseek2 16B IQ4_NL | 1 | 1 | 1 | pp2048 | 1148.17 ± 1.45 |
-| deepseek2 16B IQ4_NL | 1 | 1 | 1 | pp4096 | 1125.10 ± 1.52 |
-| deepseek2 16B IQ4_NL | 1 | 1 | 1 | pp8192 | 1067.71 ± 5.17 |
-| deepseek2 16B IQ4_NL | 1 | 1 | 1 | pp16384 | 974.12 ± 0.85 |
+| deepseek2 16B IQ4_NL | 0 | 1 | 1 | pp512 | 1172.35 ± 2.91 |
+| deepseek2 16B IQ4_NL | 0 | 1 | 1 | pp1024 | 1167.57 ± 1.75 |
+| deepseek2 16B IQ4_NL | 0 | 1 | 1 | pp2048 | 1148.17 ± 1.45 |
+| deepseek2 16B IQ4_NL | 0 | 1 | 1 | pp4096 | 1125.10 ± 1.52 |
+| deepseek2 16B IQ4_NL | 0 | 1 | 1 | pp8192 | 1067.71 ± 5.17 |
+| deepseek2 16B IQ4_NL | 0 | 1 | 1 | pp16384 | 974.12 ± 0.85 |
-So, with attention running on the GPU, MLA is competitive with standard also for PP. Given the reduced KV cache size with MLA, it becomes the best option for this setup (CPU computes experts matrix multiplications, GPU computes everything else).
+~So, with attention running on the GPU, MLA is competitive with standard also for PP. Given the reduced KV cache size with MLA, it becomes the best option for this setup (CPU computes experts matrix multiplications, GPU computes everything else).~
Dumping some timing info for TG, in a run with 5 tg128 evaluations I get
* 55.5 t/s, so (640 tokens)/(55.5 tokens/second) = 11.53 seconds total evaluation time
* 8.42 seconds for computing the MoE experts matrix multiplications on the CPU
* 1.23 seconds for computing everything else on the GPU
* Hence, 11.53 - 8.42 - 1.23 = 1.88 seconds are spent in the `ggml` back-end on synchronization and copying data between CPU and GPU. This is ~16% of total evaluation time (!!!), and I think this is very far from optimal, so there is much room for improvement there. If this cost can be optimized out, we will be getting in the range of 65 t/s
-* The experts in DeepSeek-Lite are `2048 x 1408`. We have `ffn_up, ffn_gate` and `ffn_down`, 6 active experts, and 25 experts layers. So, this is `2048 x 1408 x 3 x 6 x 25 = 1.298B` weights involved in the CPU calculation. Model is quantized with `IQ4_NL`, so 4.5 bits per weight, so `1298 x 4.5 / 8 = 730 MB` of data needs to be fetched from RAM per evaluated token. 640 tokens evaluated in 8.42 seconds is 0.01316 seconds per token. Hence, the memory bandwidth utilized during CPU computation is `730 MB / 0.01316 seconds = 55.5 GB/s`. The system (Ryzen-7950X) has 64 GB/s theoretical memory bandwidth, but 60 GB/s is the best one gets in practice for TG (with dense models). I.e., for this 6 active, 64 total experts MoE model we are at 90%+ of memory bandwidth utilization
+* The experts in DeepSeek-Lite are `2048 x 1408`. We have `ffn_up, ffn_gate` and `ffn_down`, 6 active experts, and 25 experts layers. So, this is `2048 x 1408 x 3 x 6 x 25 = 1.298B` weights involved in the CPU calculation. Model is quantized with `IQ4_NL`, so 4.5 bits per weight, so `1298 x 4.5 / 8 = 730 MB` of data needs to be fetched from RAM per evaluated token. 640 tokens evaluated in 8.42 seconds is 0.01316 seconds per token. Hence, the memory bandwidth utilized during CPU computation is `730 MB / 0.01316 seconds = 55.5 GB/s`. The system (Ryzen-7950X) has 64 GB/s theoretical memory bandwidth, but 60 GB/s is the best one gets in practice for TG (with dense models). I.e., for this 6 active, 64 total experts MoE model we are at 90%+ of memory bandwidth utilization
+
+Here is the op timing breakdown for 5 x tg128 runs
+
+### CPU
+
+```
+Total: 8.42517e+06 us
+============ Sorted list of ops:
+ 0 MOE_FUSED_UP_GATE 5.63926e+06 0.669335
+ 1 MUL_MAT_ID 2.7838e+06 0.330414
+ 2 GET_ROWS 2110 0.00025044
+============ Matrix multiplications: 2.7838e+06 us
+ffn_moe_down : 2.7838e+06 1
+```
+
+### GPU
+
+```
+Total: 1.22889e+06 us
+============ Sorted list of ops:
+ 0 MUL_MAT 686021 0.558245
+ 1 ADD 69763 0.0567692
+ 2 FUSED_RMS_NORM 69321 0.0564095
+ 3 ROPE 48932 0.0398181
+ 4 CONCAT 47852 0.0389392
+ 5 SOFT_MAX 47009 0.0382533
+ 6 CONT 45936 0.0373801
+ 7 CPY 45223 0.0367999
+ 8 GET_ROWS 29002 0.0236002
+ 9 REPEAT 24736 0.0201288
+10 MUL 24100 0.0196112
+11 FUSED_MUL_UNARY 23011 0.018725
+12 SCALE 22803 0.0185558
+13 ARGSORT 22628 0.0184134
+14 MULTI_ADD 22552 0.0183515
+============ Matrix multiplications: 686021 us
+ffn_gate : 53041 0.0773169
+ffn_moe_logits : 111885 0.163093
+ffn_out : 1797 0.00261945
+ffn_shexp : 44496 0.064861
+ffn_up : 49090 0.0715576
+kq : 135925 0.198135
+kqv : 83346 0.121492
+kqv_out : 48899 0.0712792
+kv : 45771 0.0667195
+kv_rope_compresseed : 48608 0.070855
+q : 61027 0.0889579
+result_output : 2136 0.00311361
+```
---
-👤 **saood06** commented the **2025-02-25** at **06:48:52**:
+👤 **saood06** commented on **2025-02-25** at **06:48:52**
>Hence, 11.53 - 8.42 - 1.23 = 1.88 seconds are spent in the ggml back-end on synchronization and copying data between CPU and GPU. This is ~16% of total evaluation time (!!!), and I think this is very far from optimal, so there is much room for improvement there. If this cost can be optimized out, we will be getting in the range of 65 t/s
@@ -83,7 +135,7 @@ Also how do you generate these op timing breakdowns?
---
-👤 **ikawrakow** commented the **2025-02-25** at **07:53:14**:
+👤 **ikawrakow** commented on **2025-02-25** at **07:53:14**
> Is the cost call overhead or throughput?
@@ -95,7 +147,7 @@ I set `IK_PRINT_TIMING` to 1 in `ggml.c` or `ggml-cuda.cu` and rebuild. Then I r
---
-👤 **ikawrakow** commented the **2025-02-25** at **10:17:13**:
+👤 **ikawrakow** commented on **2025-02-25** at **10:17:13**
> Is the cost call overhead or throughput?
@@ -108,7 +160,7 @@ For PP copying data back-and-fort is more significant. I tested with a context o
---
-👤 **ikawrakow** commented the **2025-02-26** at **06:55:34**:
+👤 **ikawrakow** commented on **2025-02-26** at **06:55:34**
### Update:
@@ -128,8 +180,8 @@ So, ~20% slower than standard attention. CUDA does not like MLA. I need to inves
---
-👤 **orca-zhang** commented the **2025-02-27** at **17:03:36**:
+👤 **orca-zhang** commented on **2025-02-27** at **17:03:36**
-I have observed the same phenomenon as you. After a single inference is completed, there is a lot of D2H copy work. Currently, I also use multiple parallel processing to "bypass" the solution you mentioned. I am not sure if we don't need to cache the results, can we directly abandon this part of the work? I would like to hear your opinion.
+I have observed the same phenomenon as you. After a single inference is completed, there is a lot of D2H copy work. At present, I also use multiple processing parallelism to "bypass" this problem, just like the solution you mentioned. I am not sure if we don't need to cache the results, can we directly abandon this part of the work? I would like to hear your opinion.
PS: I am actually a rookie who has only been exposed to the llama.cpp source code for a week.
\ No newline at end of file
diff --git a/github-data/pull_requests/233 - Slightly faster CUDA MLA.md b/github-data/pull_requests/233 - Slightly faster CUDA MLA.md
index cf7fac21d..11c8b8df7 100644
--- a/github-data/pull_requests/233 - Slightly faster CUDA MLA.md
+++ b/github-data/pull_requests/233 - Slightly faster CUDA MLA.md
@@ -1,14 +1,16 @@
-### 🔀 [#233](https://github.com/ikawrakow/ik_llama.cpp/pull/233) - Slightly faster CUDA MLA
+## 🔀 [Pull Request #233](https://github.com/ikawrakow/ik_llama.cpp/pull/233) - Slightly faster CUDA MLA
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/cuda_mla` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-26 |
| **Updated** | 2025-02-27 |
---
-#### Description
+## 📄 Description
The CUDA code absolutely does not like MLA. The issue is with the `wk_b x q_nope` and `wv_b x qkv_compressed` operations. For TG they require two tensor multiplications of shapes $(N_h \times N_t \times K)$ and $(N_h \times 1 \times K)$, where $N_h$ is the head size, $N_t$ is the number of tokens in the KV cache, and $K$ is the number of heads. These get computed as $K$ consecutive $(N_h \times N_t) \times ($N_h \times 1)$ matrix-vector multiplications. To add insult to injury, for `wk_b x q_nope` where `q_nope` is not contiguous, we get $K$ copies (one for each `q_nope` row) to contiguous memory, followed by quantization for a single row (when `wk_b` is quantized), followed by the actual GEMV, i.e., $3 K$ CUDA kernel launches. The associated overhead by far exceeds the time needed for the actual matrix multiplications, so the computation becomes extremely slow compared to what it could be.
@@ -18,16 +20,8 @@ I did attempt to implement a computation of the entire tensor multiplication wit
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-02-26** at **17:27:37**:
+👤 **ikawrakow** commented on **2025-02-26** at **17:27:37**
-Closing in favor of #234
-
----
-
-👤 **davidsyoung** commented the **2025-02-27** at **16:16:55**:
-
-@ikawrakow Seeing a significant speed increase from this, with also transposed KV cache. From 12t/s to 17.25t/s, and seeing less of a drop off on speed as well at longer PP tokens. Full CUDA 15x3090 Q2_K MLA.
-
-Really nice!
\ No newline at end of file
+Closing in favor of [#234](https://github.com/ikawrakow/ik_llama.cpp/issues/234)
\ No newline at end of file
diff --git a/github-data/pull_requests/234 - Faster MLA on CUDA.md b/github-data/pull_requests/234 - Faster MLA on CUDA.md
index d65de7acc..32f45b90b 100644
--- a/github-data/pull_requests/234 - Faster MLA on CUDA.md
+++ b/github-data/pull_requests/234 - Faster MLA on CUDA.md
@@ -1,14 +1,17 @@
-### 🔀 [#234](https://github.com/ikawrakow/ik_llama.cpp/pull/234) - Faster MLA on CUDA
+## 🔀 [Pull Request #234](https://github.com/ikawrakow/ik_llama.cpp/pull/234) - Faster MLA on CUDA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_mla2` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-26 |
| **Updated** | 2025-02-27 |
+| **Merged** | 2025-02-27 |
---
-#### Description
+## 📄 Description
The CUDA code absolutely does not like MLA. On the main branch MLA attention is in the range of 15-20% slower than the standard attention implementation. The issue is with the `wk_b x q_nope` and `wv_b x qkv_compressed` operations. For TG they require two tensor multiplications of shapes $(N_h \times N_t \times K)$ and $(N_h \times 1 \times K)$, where $N_h$ is the head size, $N_t$ is the number of tokens in the KV cache, and $K$ is the number of heads. These get computed as $K$ consecutive $(N_h \times N_t) \times (N_h \times 1)$ matrix-vector multiplications. To add insult to injury, for `wk_b x q_nope` where `q_nope` is not contiguous, we get $K$ copies (one for each `q_nope` row) to contiguous memory, followed by quantization for a single row (when `wk_b` is quantized), followed by the actual GEMV, i.e., $3 K$ CUDA kernel launches. The associated overhead by far exceeds the time needed for the actual matrix multiplications, so the computation becomes extremely slow compared to what it could be.
@@ -29,9 +32,9 @@ These two changes result in a significant speedup of the MLA attention computati
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-02-27** at **16:17:26**:
+👤 **davidsyoung** commented on **2025-02-27** at **16:17:26**
@ikawrakow Seeing a significant speed increase from this, with also transposed KV cache. From 12t/s to 17.25t/s, and seeing less of a drop off on speed as well at longer PP tokens. Full CUDA 15x3090 Q2_K MLA.
diff --git a/github-data/pull_requests/235 - Option to use MLA without a transposed cache.md b/github-data/pull_requests/235 - Option to use MLA without a transposed cache.md
index 4bca7b6ab..753f109e7 100644
--- a/github-data/pull_requests/235 - Option to use MLA without a transposed cache.md
+++ b/github-data/pull_requests/235 - Option to use MLA without a transposed cache.md
@@ -1,14 +1,17 @@
-### 🔀 [#235](https://github.com/ikawrakow/ik_llama.cpp/pull/235) - Option to use MLA without a transposed cache
+## 🔀 [Pull Request #235](https://github.com/ikawrakow/ik_llama.cpp/pull/235) - Option to use MLA without a transposed cache
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mla_no_transposed_cache` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-27 |
| **Updated** | 2025-02-28 |
+| **Merged** | 2025-02-27 |
---
-#### Description
+## 📄 Description
The `-mla` (or `--mla-use`) command line option turns from previously a boolean value to an integer:
* `mla = 0`: use standard attention
@@ -84,11 +87,11 @@ Here `mla = 2` is much slower than `mla = 1` for long contexts, and about on par
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-02-27** at **15:08:55**:
+👤 **davidsyoung** commented on **2025-02-27** at **15:08:55**
-Hey, thank you for your work on this. Trying to run with -mla 2, but still getting a 8900MB allocation per card. I'm not sure if this is correct, or am I doing something wrong with my run commands (I'm aware the layers are poorly balanced atm, but just wondering if this is as expected:
+Hey, thank you for your work on this. Trying to run with -mla 2, but still getting a 8900MB allocation per card I believe for the compute buffer. I'm not sure if this is correct, or am I doing something wrong with my run commands (I'm aware the layers are poorly balanced atm, but just wondering if this is as expected:
Command:
```
@@ -365,11 +368,11 @@ Would really appreciate your help to see if I'm doing something wrong. Thank you
---
-👤 **davidsyoung** commented the **2025-02-27** at **16:35:08**:
+👤 **davidsyoung** commented on **2025-02-27** at **16:35:08**
@ikawrakow
-Was able to run this with 24K ctx, but not sure if this amount of compute buffer is still correct:
+Was able to run this with 20K ctx, but not sure if this amount of compute buffer is still correct:
```
INFO [ main] build info | tid="22970858381312" timestamp=1740670359 build=0 commit="unknown"
@@ -642,15 +645,17 @@ INFO [ update_slots] all slots are idle | tid="22970858381312" timest
---
-👤 **ikawrakow** commented the **2025-02-27** at **16:50:48**:
+👤 **ikawrakow** commented on **2025-02-27** at **16:50:48**
So, when I wrote the PR description I had forgotten that it is not yet possible to transpose quantized cache, which would be needed if we wanted to use `mla = 2` with quantized cache. I realized my mistake and added a comment, but I guess it is easy to miss. So, at this point `mla = 2` uses `fp16` for the cache, which means about 69 kB per token for DeepSeek-R1, so 1.58 GiB for 24k context, so about 100 MiB per card in your 15 GPU setup (wow!). This is also what we see reported.
-I haven't looked in detail into the compute buffers on CUDA. I wouldn't have expected 5.7 GiB per GPU, this seems way too much. But I also don't have access to a multi-GPU box, so have never played with that. It looks like each GPU is allocating the same compute buffer as if the computation was running on a single GPU.
+I haven't looked in detail into the compute buffers on CUDA. I wouldn't have expected 5.7 GiB per GPU, this seems way too much. But I also don't have access to a multi-GPU box, so have never played with that. It looks like each GPU is allocating the same compute buffer as if the computation was running on a single GPU.
+
+Oh, I'm working on transposed matrix multiplications. Hope to be done in a day or two. It will then be possible to fully quantize the cache with `mla = 2`.
---
-👤 **davidsyoung** commented the **2025-02-27** at **17:12:01**:
+👤 **davidsyoung** commented on **2025-02-27** at **17:12:01**
Incredible, that makes sense. The cache using fp16 isn't a huge problem, to be honest. Also, yes, the 15 gpu build (trying to find a 16th for TP!) has been a lot of pain, so to see the speed increase on this, and longer context, is really promising. So thank you for all of your hard work.
@@ -658,13 +663,21 @@ For these compute buffers, is there anything I can do to reduce it to the expect
---
-👤 **ikawrakow** commented the **2025-02-27** at **17:14:38**:
+👤 **ikawrakow** commented on **2025-02-27** at **17:14:38**
-@davidsyoung Have you tried using `-fmoe` (`--fused-moe` from PR #229? This fuses several MoE operations. In my testing withDeepSeek-Lite it resulted in a significant boost in prefill performance (~30%) and a small gain in TG as well.
+@davidsyoung Have you tried using `-fmoe` (`--fused-moe` from PR [#229](https://github.com/ikawrakow/ik_llama.cpp/issues/229)? This fuses several MoE operations. In my testing withDeepSeek-Lite it resulted in a significant boost in prefill performance (~30%) and a small gain in TG as well.
---
-👤 **ikawrakow** commented the **2025-02-27** at **17:21:10**:
+👤 **davidsyoung** commented on **2025-02-27** at **17:19:41**
+
+> @davidsyoung Have you tried using `-fmoe` (`--fused-moe` from PR [#229](https://github.com/ikawrakow/ik_llama.cpp/issues/229)? This fuses several MoE operations. In my testing withDeepSeek-Lite it resulted in a significant boost in prefill performance (~30%) and a small gain in TG as well.
+
+I realised that actually, and briefly tried it, but didn't try a long prefill, and now I'm back to trying to figure out this compute buffer first. As the model is so large, it takes quite some time to load it! I will try the fmoe and report back!
+
+---
+
+👤 **ikawrakow** commented on **2025-02-27** at **17:21:10**
> For these compute buffers, is there anything I can do to reduce it to the expected amount?
@@ -672,7 +685,7 @@ I need to look into this. Have you tried `--split-mode row` and if yes, does it
---
-👤 **davidsyoung** commented the **2025-02-27** at **17:27:39**:
+👤 **davidsyoung** commented on **2025-02-27** at **17:27:39**
So I tried to change the following:
@@ -702,19 +715,29 @@ I will try `--split-mode row` now.
---
-👤 **ikawrakow** commented the **2025-02-27** at **17:35:18**:
+👤 **ikawrakow** commented on **2025-02-27** at **17:35:18**
Yes, the compute buffer size is proportional to the micro batch (ub) size. Typically performance first increases with increasing `ub` and then starts declining as `ub` increases. The default size is set based on the experience with much smaller models. I haven't seen people reporting performance values as a function of batch size or u-batch size for DeepSeek-R1. You can try using `-b 512 -ub 256` and see what happens. This should decrease compute buffer size, but the question is how much performance penalty (if any) one gets from that.
---
-👤 **ikawrakow** commented the **2025-02-27** at **17:48:10**:
+👤 **davidsyoung** commented on **2025-02-27** at **17:42:05**
+
+> Yes, the compute buffer size is proportional to the micro batch (ub) size. Typically performance first increases with increasing `ub` and then starts declining as `ub` increases. The default size is set based on the experience with much smaller models. I haven't seen people reporting performance values as a function of batch size or u-batch size for DeepSeek-R1. You can try using `-b 512 -ub 256` and see what happens. This should decrease compute buffer size, but the question is how much performance penalty (if any) one gets from that.
+
+That makes sense, will try playing with micro batch size and batch size. Also waiting to see if split mode row makes a difference. It might seem to more evenly split the layers across the cards which would be useful as sometimes it doesn't divide evenly into the gpu's, and as a result it limits the kvcache (previously at least), and some cards would have a few GB remaining but without much you could do as some GPUs were right at the limit.
+
+If this does work, and with the new MLA implementation, do you think there's a sweeet spot for type of quant to use for R1 in terms of quality etc.
+
+---
+
+👤 **ikawrakow** commented on **2025-02-27** at **17:48:10**
Just tried with DeepSeek-Lite. For a context of 32k tokens the CUDA compute buffer size is 1172 MiB with default batch/u-batch size. If I use `-b 512 -ub 256` it goes down to 972 MiB. With `-b 256 -ub 256` it becomes 603 MiB.
---
-👤 **davidsyoung** commented the **2025-02-27** at **17:50:52**:
+👤 **davidsyoung** commented on **2025-02-27** at **17:50:52**
> Just tried with DeepSeek-Lite. For a context of 32k tokens the CUDA compute buffer size is 1172 MiB with default batch/u-batch size. If I use `-b 512 -ub 256` it goes down to 972 MiB. With `-b 256 -ub 256` it becomes 603 MiB.
@@ -722,7 +745,7 @@ Is that behaving as expected for you when you see that? I can't tell if I should
---
-👤 **davidsyoung** commented the **2025-02-27** at **17:53:21**:
+👤 **davidsyoung** commented on **2025-02-27** at **17:53:21**
``--split-mode row`` run:
@@ -951,7 +974,7 @@ ggml/src/ggml-cuda.cu:731: GGML_ASSERT(tensor->view_src == nullptr) failed
---
-👤 **ikawrakow** commented the **2025-02-27** at **17:58:24**:
+👤 **ikawrakow** commented on **2025-02-27** at **17:58:24**
> do you think there's a sweeet spot for type of quant to use for R1 in terms of quality etc.
@@ -961,7 +984,7 @@ But all of this are just guesses as I have never tried DeepSeekV3/R1 myself.
---
-👤 **davidsyoung** commented the **2025-02-27** at **18:01:58**:
+👤 **davidsyoung** commented on **2025-02-27** at **18:01:58**
> > do you think there's a sweeet spot for type of quant to use for R1 in terms of quality etc.
>
@@ -975,7 +998,7 @@ Thank you for all of your help/work, it's massively appreciated.
---
-👤 **davidsyoung** commented the **2025-02-27** at **19:13:10**:
+👤 **davidsyoung** commented on **2025-02-27** at **19:13:10**
Doing some testing with different batch sizes, micro-batch sizes and context.
@@ -1045,11 +1068,11 @@ Definitely seems a magnitude out, but I'm also really not sure what I'm taking a
---
-👤 **saood06** commented the **2025-02-27** at **20:52:40**:
+👤 **saood06** commented on **2025-02-27** at **20:52:40**
>For DeepSeek it seems it is important to use more bits for the attention tensors and the shared experts. As most of the size is in the MoE experts this does not lead to a very significant increase in model size.
-The model size might not go up significantly but the performance does noticeably go down if you do that strategy as those weights are always used unlike the expert weights, this may not matter as much with them being on CUDA but from another user's reports on llama.cpp who was offloading those to CUDA they still had a performance hit. For me IQ4_K_R4 (V2) is slower than V1 with 2.63 t/s for V2 vs 3.22 t/s V1.
+The model size might not go up significantly but the performance does noticeably go down if you do that strategy as those weights are always used unlike the expert weights, this may not matter as much with them being on CUDA but from another user's reports on llama.cpp who was offloading those to CUDA they still had a performance hit. For me on CPU only inference IQ4_K_R4 V2 is slower than V1 with 2.63 t/s for V2 vs 3.22 t/s V1.
Here's a table of early perplexity values I've collected for various quants of Deepseek.
@@ -1090,9 +1113,9 @@ UD-IQ1_S | 3.8939 |4.7189 | 3.7812 | 3.6799 | 3.6215 | 3.6922 | 3.6442| 3.747
else
// ###
-My V1/V2/V3, I employ the strategy described above, slightly increasing the size of the model but IMO the performance difference was not worth it (that might change with hybrid/full offload). All tensors for mine were imatrixed with mradermacher imatrix except for the new split tensor.
+My V1/V2/V3, I employ a strategy similar to the one above but FAR less aggresive, slightly increasing the size of the model but IMO the performance difference was not worth it (that might change with hybrid/full offload). All tensors for mine were imatrixed with mradermacher imatrix except for the new split tensor.
-Also for reference here is some compute buffer sizes I've seen:
+Also for reference here is some compute buffer sizes I've seen (with an earlier build and default ubatch size):
n_ctx = 128000
CPU compute buffer size = 64468.01 MiB
@@ -1101,7 +1124,7 @@ CPU compute buffer size = 32343.01 MiB
---
-👤 **davidsyoung** commented the **2025-02-27** at **22:29:43**:
+👤 **davidsyoung** commented on **2025-02-27** at **22:29:43**
I may have to start experimenting with quants myself, this is really useful.
@@ -1111,7 +1134,7 @@ I’m getting a total of 67GB for 32k context. It would be nice if I could claw
---
-👤 **saood06** commented the **2025-02-27** at **23:08:14**:
+👤 **saood06** commented on **2025-02-27** at **23:08:14**
> I may have to start experimenting with quants myself, this is really useful.
@@ -1130,7 +1153,29 @@ For now I'm still stuck on CPU only, I did work a bit on porting the RPC updates
---
-👤 **davidsyoung** commented the **2025-02-28** at **09:32:23**:
+👤 **ikawrakow** commented on **2025-02-28** at **07:38:58**
+
+So, based on this discussion, reducing compute buffer size is by far more important than reducing KV cache size. I'll see if I can do something about that.
+
+> So I'm not too sure what's up with the compute buffer. Maybe this is just the size of it given the size of the model. But allocating 42.8GB per gpu, across 15 gpu's would be 642GB VRAM just for compute buffer.
+
+Don't think about the fact that there are 15 GPUs. With per layer model split, each GPU needs to compute a full layer, so each GPU needs the exact same compute buffer as if the entire model was running on it (I.e., if you had a single GPU with enough VRAM to fit the entire model, the compute buffer will still be 42.8 GB and not 15 x 42.8 GB).
+
+Why does the compute buffer become 42.8 GB for 160k context? There is the `K*Q` tensor that needs to materialize. It is of size `n_ctx x n_head x n_ubatch x sizeof(float)` (all compute buffers are `fp32` in `llama.cpp/ggml`). DeepSeek-R1 has 128 heads, so for 160k tokens this tensor alone is 41.9 GB for the default `u_batch` size of 512. It is needed on each GPU because each GPU needs to compute it for the layers stored on it. Your best bet for reducing compute buffer size is to use a smaller `n_ubatch`, but even with `-ub 128` you will not be able to run the full 163k token context. Still, I would be very curious to know how performance with `-ub 128` compares to the default (for a context length that fits in VRAM).
+
+If you use flash attention (`-fa`), the `K*Q` tensor never materializes, so compute buffers are much smaller. But then the KV cache is much larger. I have been trying to make flash attention work with MLA, but have not been successful so far. Oops, CUDA flash attention does not work for DeepSeek, so that's only useful on the CPU.
+
+---
+
+👤 **saood06** commented on **2025-02-28** at **08:48:42**
+
+>Apparently many people are interested in using the maximum context length of long context models. For DeepSeekV3/R1, the rage of the day, it is 163k tokens.
+
+The model was only trained to supports a context length of 128k (131,072). The huggingface and github page both list 128k as the context length as well, so I'm not sure why the original Deepseek V3/R1 config.json (that the GGUF pulls the metadata from) says 160k (163,840).
+
+---
+
+👤 **davidsyoung** commented on **2025-02-28** at **09:32:23**
> So, based on this discussion, reducing compute buffer size is by far more important than reducing KV cache size. I'll see if I can do something about that.
>
@@ -1195,7 +1240,7 @@ Would there be any other optimisation that I could use that would improve the pr
---
-👤 **ikawrakow** commented the **2025-02-28** at **10:12:51**:
+👤 **ikawrakow** commented on **2025-02-28** at **10:12:51**
> Would there be any other optimisation that I could use that would improve the prefill time?
@@ -1213,7 +1258,7 @@ in `ggml-cuda.cu`, rebuild, and run `llama-bench -m model -n 0 -p 512 -t 1 -w 0
---
-👤 **davidsyoung** commented the **2025-02-28** at **14:25:11**:
+👤 **davidsyoung** commented on **2025-02-28** at **14:25:11**
I'm attempting to run llama-bench but it's trying to allocate the full model to device zero, even though I've set tensor splits.
@@ -1377,11 +1422,11 @@ main: error: failed to load model '/models/gghfez_DeepSeek-R1-11446-Q2_K/DeepSee
---
-👤 **ikawrakow** commented the **2025-02-28** at **15:51:58**:
+👤 **ikawrakow** commented on **2025-02-28** at **15:51:58**
Well, not sure why `llama-bench` doesn't do the right thing.
-But I think you will like PR #237 very much. Simply add
+But I think you will like PR [#237](https://github.com/ikawrakow/ik_llama.cpp/issues/237) very much. Simply add
```
-amb 2048
```
@@ -1389,6 +1434,6 @@ to your command line, and the compute buffers should be no more than 3 GiB even
---
-👤 **davidsyoung** commented the **2025-02-28** at **16:33:15**:
+👤 **davidsyoung** commented on **2025-02-28** at **16:33:15**
Holy shit. Will report back!
\ No newline at end of file
diff --git a/github-data/pull_requests/236 - Feat_lock free server.md b/github-data/pull_requests/236 - Featlock free server.md
similarity index 74%
rename from github-data/pull_requests/236 - Feat_lock free server.md
rename to github-data/pull_requests/236 - Featlock free server.md
index 6b057f4f9..6d2a11657 100644
--- a/github-data/pull_requests/236 - Feat_lock free server.md
+++ b/github-data/pull_requests/236 - Featlock free server.md
@@ -1,14 +1,16 @@
-### ✨ [#236](https://github.com/ikawrakow/ik_llama.cpp/pull/236) - Feat/lock free server
+## 🔀 [Pull Request #236](https://github.com/ikawrakow/ik_llama.cpp/pull/236) - Feat/lock free server
| **Author** | `orca-zhang` |
| :--- | :--- |
-| **State** | ✅ **Open** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `feat/lock-free-server` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-27 |
| **Updated** | 2025-03-19 |
---
-#### Description
+## 📄 Description
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
@@ -18,9 +20,9 @@
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-02-27** at **11:43:27**:
+👤 **ikawrakow** commented on **2025-02-27** at **11:43:27**
Thank you for this PR.
@@ -28,7 +30,7 @@ LGTM, but as I never use the server and I'm not familiar with the code, I have a
---
-👤 **orca-zhang** commented the **2025-02-27** at **17:02:24**:
+👤 **orca-zhang** commented on **2025-02-27** at **17:02:24**
Hi Ikawrakow,
@@ -40,17 +42,13 @@ Thank you for your continued dedication to maintaining this exceptional codebase
---
-👤 **saood06** commented during a code review the **2025-02-27** at **19:55:22** on `examples/server/atomic_hash_map.hpp`:
+👤 **saood06** started a conversation on `examples/server/atomic_hash_map.hpp` on **2025-02-27** at **19:55:22**
This is Apache, while this project is MIT.
---
-👤 **saood06** submitted a review the **2025-02-27** at **19:55:23**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented the **2025-02-27** at **19:57:11**:
+👤 **saood06** commented on **2025-02-27** at **19:57:11**
>Please accept my apologies for the accidental PR submission during my preliminary testing phase. I'm currently conducting informal experiments without rigorous benchmarking, and cannot yet confirm the actual utility of these code changes.
diff --git a/github-data/pull_requests/237 - Reduce size of compute buffers.md b/github-data/pull_requests/237 - Reduce size of compute buffers.md
index 4881b8c87..772800b0f 100644
--- a/github-data/pull_requests/237 - Reduce size of compute buffers.md
+++ b/github-data/pull_requests/237 - Reduce size of compute buffers.md
@@ -1,16 +1,19 @@
-### 🔀 [#237](https://github.com/ikawrakow/ik_llama.cpp/pull/237) - Reduce size of compute buffers
+## 🔀 [Pull Request #237](https://github.com/ikawrakow/ik_llama.cpp/pull/237) - Reduce size of compute buffers
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/reduce_compute_buffers` |
+| **Target Branch** | `main` |
| **Created** | 2025-02-28 |
| **Updated** | 2025-03-01 |
+| **Merged** | 2025-03-01 |
---
-#### Description
+## 📄 Description
-I have been focusing on reducing the KV cache size, but as per the lengthy exchange in #235 the actual issue for using a very long context is the size of the compute buffers. E.g., if one attempted to run DeepSeekV3/R1 with the claimed 163k tokens maximum context length, one would need over 40 GB of CUDA compute buffer **per GPU**. But even if running on the CPU, 40 GB is nothing to sneeze at.
+I have been focusing on reducing the KV cache size, but as per the lengthy exchange in [#235](https://github.com/ikawrakow/ik_llama.cpp/issues/235) the actual issue for using a very long context is the size of the compute buffers. E.g., if one attempted to run DeepSeekV3/R1 with the claimed 163k tokens maximum context length, one would need over 40 GB of CUDA compute buffer **per GPU**. But even if running on the CPU, 40 GB is nothing to sneeze at.
This PR solves the problem. For GPU and CPU inference.
@@ -36,9 +39,65 @@ As another side note: I wasted at least another two hours fighting with the `ggm
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-03-01** at **00:26:54**:
+👤 **ikawrakow** commented on **2025-02-28** at **16:46:30**
+
+Here some example usage. My GPU is RTX-4080 and I'm running `IQ4_NL` quantized DeepSeek-Lite (model is 8.47 GiB). Let's benchmark prompt processing speed for 16k tokens using standard attention.
+```
+./bin/llama-bench -m $model -p 16384 -n 0 -t 1 -ngl 100 -fmoe 1
+```
+We get this:
+
+| model | backend | ngl | threads | fmoe | test | t/s |
+| -------------------- | ---------- | --: | ------: | ---: | ------------: | ---------------: |
+| deepseek2 16B IQ4_NL | CUDA | 100 | 1 | 1 | pp16384 | 2584.69 ± 17.96 |
+
+Let's now run it with the option added in this PR, but increase `u_batch` to 2048
+```
+./bin/llama-bench -m $model -p 16384 -n 0 -t 1 -ngl 100 -fmoe 1 -amb 64,128,256,512 -ub 2048
+```
+We get this
+
+| model | ngl | threads | n_ubatch | amb | fmoe | test | t/s |
+| --------------------- | --: | ------: | -------: | ----: | ---: | ------------: | ---------------: |
+| deepseek2 16B IQ4_NL | 100 | 1 | 2048 | 64 | 1 | pp16384 | 3304.30 ± 15.21 |
+| deepseek2 16B IQ4_NL | 100 | 1 | 2048 | 128 | 1 | pp16384 | 3305.02 ± 12.81 |
+| deepseek2 16B IQ4_NL | 100 | 1 | 2048 | 256 | 1 | pp16384 | 3295.94 ± 7.23 |
+| deepseek2 16B IQ4_NL | 100 | 1 | 2048 | 512 | 1 | pp16384 | 3276.55 ± 5.72 |
+
+Limiting the size of the compute buffer via `-amb` and increasing `u_batch` size to 2048 tokens improves performance by almost 30!
+
+Can we not do the same without `-amb`? Let's try
+```
+./bin/llama-bench -m ../ncuda/junk3.bin -p 16384 -n 0 -t 1 -ngl 100 -fmoe 1 -r 5 -ub 2048
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 1 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
+| model | size | params | backend | ngl | threads | n_ubatch | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | ---: | ------------: | ---------------: |
+ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 2382381056
+main: error: failed to create context with model '$model'
+```
+Clearly no.
+
+Let's look at compute buffer sizes for this context of 16k tokens and `u_batch = 2048`. I'll use MLA, else I'm running out of memory for `amb = 0` (PR changes not used)
+
+| amb | compute buffer size |
+| ------: | ---: |
+| 0 | 2393 MiB |
+| 64 | 816 MiB |
+| 128 | 816 MiB |
+| 256 | 816 MiB |
+| 512 | 880 MiB |
+| 1024 | 1369 MiV |
+
+Why is the `u_batch = 2048` performance better than `u_batch = 512` (once we limit `K*Q` size via `-amb`)? DeepSeek-Lite has 64 experts and uses 6 activated experts, so for `u_batch = 512` we have `512 x 6 / 64 = 48` tokens per expert on average. A matrix multiplication with 48 rows in the right matrix is much slower compared to 512 as we would have with a dense model. When we go to `u_batch = 2048`, we have 192 tokens per expert on average, so we get a better performance!
+
+---
+
+👤 **davidsyoung** commented on **2025-03-01** at **00:26:54**
This has been an incredible PR. Hugely beneficial in multiple ways. The compute buffer is drastically lower, and now can run context at max context, no issues.
@@ -365,7 +424,7 @@ Excellent work on this.
---
-👤 **davidsyoung** commented the **2025-03-01** at **00:55:02**:
+👤 **davidsyoung** commented on **2025-03-01** at **00:55:02**
Also, gave the [84853b9](https://github.com/ikawrakow/ik_llama.cpp/pull/237/commits/84853b9a9bb2c71b80c704d2b0d0675cb132a539) commit a test run and it seems to be producing different outcomes each time on regeneration with a fixed seed.
@@ -373,7 +432,7 @@ Not sure if it’s something I’m doing wrong on my end.
---
-👤 **ikawrakow** commented the **2025-03-01** at **06:25:19**:
+👤 **ikawrakow** commented on **2025-03-01** at **06:25:19**
> Also, gave the [84853b9](https://github.com/ikawrakow/ik_llama.cpp/pull/237/commits/84853b9a9bb2c71b80c704d2b0d0675cb132a539) commit a test run and it seems to be producing different outcomes each time on regeneration with a fixed seed.
>
@@ -383,7 +442,7 @@ I wouldn't know why that could affect your results. The change in 84853b9a9bb2c7
---
-👤 **davidsyoung** commented the **2025-03-01** at **07:57:12**:
+👤 **davidsyoung** commented on **2025-03-01** at **07:57:12**
> > Also, gave the [84853b9](https://github.com/ikawrakow/ik_llama.cpp/pull/237/commits/84853b9a9bb2c71b80c704d2b0d0675cb132a539) commit a test run and it seems to be producing different outcomes each time on regeneration with a fixed seed.
> > Not sure if it’s something I’m doing wrong on my end.
diff --git a/github-data/pull_requests/238 - A better way to measure the cost of ggml_barrier.md b/github-data/pull_requests/238 - A better way to measure the cost of ggml_barrier.md
index bacd2c19a..bfaadc6c0 100644
--- a/github-data/pull_requests/238 - A better way to measure the cost of ggml_barrier.md
+++ b/github-data/pull_requests/238 - A better way to measure the cost of ggml_barrier.md
@@ -1,14 +1,17 @@
-### 🔀 [#238](https://github.com/ikawrakow/ik_llama.cpp/pull/238) - A better way to measure the cost of ggml_barrier
+## 🔀 [Pull Request #238](https://github.com/ikawrakow/ik_llama.cpp/pull/238) - A better way to measure the cost of ggml_barrier
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/measure_barriers` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-01 |
| **Updated** | 2025-03-01 |
+| **Merged** | 2025-03-01 |
---
-#### Description
+## 📄 Description
Trying to measure it on each `ggml_barrier` invocation is too imprecise as the best time resolution we have in `ggml` is 1 us. Hence, measure the total graph execution time and and the sum of the node execution times. The difference is then the cost of thread synchronization via `ggml_barrier`.
@@ -16,15 +19,15 @@ Using this on TG runs with DeepSeek-Lite I'm finding that `ggml_barrier` costs a
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-03-01** at **09:51:17**:
+👤 **davidsyoung** commented on **2025-03-01** at **09:51:17**
@ikawrakow you are seriously cooking!
---
-👤 **ikawrakow** commented the **2025-03-01** at **15:12:54**:
+👤 **ikawrakow** commented on **2025-03-01** at **15:12:54**
> @ikawrakow you are seriously cooking!
diff --git a/github-data/pull_requests/239 - SER - Smart Expert Reduction.md b/github-data/pull_requests/239 - SER - Smart Expert Reduction.md
index ae5e89b44..0e17f7bc1 100644
--- a/github-data/pull_requests/239 - SER - Smart Expert Reduction.md
+++ b/github-data/pull_requests/239 - SER - Smart Expert Reduction.md
@@ -1,14 +1,17 @@
-### 🔀 [#239](https://github.com/ikawrakow/ik_llama.cpp/pull/239) - SER - Smart Expert Reduction
+## 🔀 [Pull Request #239](https://github.com/ikawrakow/ik_llama.cpp/pull/239) - SER - Smart Expert Reduction
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/smart_expert_selection` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-01 |
| **Updated** | 2025-03-18 |
+| **Merged** | 2025-03-02 |
---
-#### Description
+## 📄 Description
The idea behind this PR is very simple: we define new parameters (specified via the command line) $K_{\rm min}$ and $t$. During inference experts are normally selected by sorting their computed probabilities $p_i$ in descending order and picking the top $K$ experts. We modify this expert selection algorithm by always selecting the top $K_{\rm min}$ experts ($K_{\rm min} < K$), and using experts between $K_{\rm min}$ and $K$ only if $p_i > t\cdot p_0$ (i.e., only if their probability $p_i$ relative to the top expert probability $p_0$ is greater than the specified threshold $t$). If we set $t = 0$, this expert selection modification is never invoked, so we have the behavior of the original model. If we set $t = 1$, we use a fixed number of experts $K_{\rm min}$ (the same can be achieved by using `--override-kv deepseek2.expert_used_count=int:Kmin` on the command line, but using `-ser Kmin,1` is clearly much easier to type and remember).
@@ -44,25 +47,25 @@ to the command line.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-03-01** at **15:49:06**:
+👤 **ikawrakow** commented on **2025-03-01** at **15:49:06**
-Here a graph for error versus performance gain for hybrid CPU/GPU inference (Ryzen-7950X/RTX-4080) for DeepSeek-Lite. Operation with MoE tensors are computed on the CPU, all others on the GPU.
+Here a graph for error versus performance gain for hybrid CPU/GPU inference (Ryzen-7950X/RTX-4080) for DeepSeek-Lite. Operations with MoE tensors are computed on the CPU, all others on the GPU.

-Here performance gains are much more significant. As attention and shared experts computation done on the GPU is much faster than the MoE calculation done on the CPU, we gain more by selectively reducing experts. If we just use 5 experts instead of 6, TG performance increases by nearly 20% while the associated error is significantly less than using 4 bits for the attention layers.
+Here performance gains are much more significant. As attention and shared experts computation done on the GPU is much faster than the MoE calculation done on the CPU, we gain more by selectively reducing experts. If we just use 5 experts instead of 6, TG performance increases by nearly 20% while the associated error is significantly less than using 4 bits for the attention layers (magenta symbols)
---
-👤 **davidsyoung** commented the **2025-03-01** at **16:25:50**:
+👤 **davidsyoung** commented on **2025-03-01** at **16:25:50**
This looks very interesting - what would you recommend is the best way to test this with full CUDA off-load with R1? If you have some harnesses to test PPL, that would be great
---
-👤 **ikawrakow** commented the **2025-03-01** at **17:11:55**:
+👤 **ikawrakow** commented on **2025-03-01** at **17:11:55**
I typically use Wikitext2 `PPL`. There are many people out there who believe that this is not good, but I have also compared to C4 `PPL` (English and French) and, once you look at the ratio of `PPL(approximate model)/PPL(full model)-1`, things do not depend that much on the specific test corpus. The same is also true for context length. Even though PPL can change a lot with the context window used for evaluation, the ratio `PPL(approximate model)/PPL(full model)` is nearly independent of context length. One can also compute KL divergence (and many people think this is better than `PPL`), but that is much less convenient (one must first run a calculation with the full model, generate a huge data file, to then run with the approximate model to get the KL divergence values), to only find out that the mean KL divergence correlates almost 100% with `log(PPL(approximate)/PPL(full))`. Same is true for HellaSwag, the other benchmark one can run with `llama.cpp`. The correlation coefficient between `HellaSwag(full) - HellaSwag(approximate)` with `PPL(approximate)/PPL(full)-1` tends to be over 90%, so this doesn't give much additional information (but takes way longer to compute than PPL). So, at then end, if you have settled on a model you want to use, comparing `PPL` with SER to `PPL` without will give good indication about performance degradation.
@@ -72,13 +75,13 @@ But with the 150-200 t/s you are getting for R1 it will not be easy to get a det
---
-👤 **davidsyoung** commented the **2025-03-01** at **17:25:56**:
+👤 **davidsyoung** commented on **2025-03-01** at **17:25:56**
Okay, cool! I am going to first create my own quant somewhere around `i1-IQ3_XXS`, `i1-IQ3_XS`, or `i1-IQ3_S`. I'm downloading the full BF16 model right now, and then when I have the best fit of quants, I'll figure out how to run a PPL test... :) Thank you.
---
-👤 **davidsyoung** commented the **2025-03-03** at **21:35:39**:
+👤 **davidsyoung** commented on **2025-03-03** at **21:35:39**
@ikawrakow a little bit off topic but didn't know where better to ask.
@@ -100,23 +103,25 @@ Now I don't know if this is because of the imatrix, the changes for MLA with the
Likely a corrupt part. But just wondering, is there anything I'm doing wrong here? I wasn't 100% sure if that's a correct quantize command, or something I'm missing.
-TYVM
+TYVM
+
+UPDATE: Had a part of the BF16 that the hash failed :)
---
-👤 **ikawrakow** commented the **2025-03-04** at **11:21:38**:
+👤 **ikawrakow** commented on **2025-03-04** at **11:21:38**
Let me know if it works after you re-download the corrupt file. If it doesn't, the I would need to make the quantization more robust against missing imatrix data. DeepSeekV3/R1 is tricky because only 8 out of 256 experts are activated per token, so for an imatrix calculation with a given amount of calibration data there will be 32X less data collected for the experts compared to a dense model. This may lead to missing/insufficient imatrix data, which may not be handled gracefully by the quantization functions.
---
-👤 **davidsyoung** commented the **2025-03-04** at **11:48:46**:
+👤 **davidsyoung** commented on **2025-03-04** at **11:48:46**
I will! Reconverting to GGUF from BF16 takes a decent amount of time on HDDs compared to NVME. Should be done around 6pm tonight, and I’ll quantize soon after that! Thank you for all of the help and your work on improving inference with DS V3/R1 - its excellent!
---
-👤 **davidsyoung** commented the **2025-03-04** at **20:16:54**:
+👤 **davidsyoung** commented on **2025-03-04** at **20:16:54**
@ikawrakow
@@ -191,17 +196,23 @@ llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/models/DeepSeek-R1-GGUF-IQ3_S.gguf'
ERR [ load_model] unable to load model | tid="23133942390784" timestamp=1741119264 model="/models/DeepSeek-R1-GGUF-IQ3_S.gguf"
/app/.devops/tools_new.sh: line 47: 13 Segmentation fault ./llama-server "$@"
-```
+```
+
+UPDATE: Looks like because I've created the GGUF from https://huggingface.co/unsloth/DeepSeek-R1-BF16, it seems it's possible that the tokenizers lib that was used was a future version (ref merges change https://github.com/huggingface/tokenizers/commit/6a5fce9fa094d1514a498419f86ac0916e98ef8a), and as a result with older tokenizer lib it isn't able to read the new format of the merges.
+
+Have created some custom runtime code to load merges so I can avoid re-quanting the bf16, at least for now.
+
+Didn’t get a chance to play with the SER but hope to over the following few days. Next up on the learning agenda :)
---
-👤 **davidsyoung** commented the **2025-03-05** at **12:36:10**:
+👤 **davidsyoung** commented on **2025-03-05** at **12:36:10**
Preliminary results with `-ser 6,1` and `-ser 7,1` show no major difference to TG performance - it's -/+ 1 t/s. Likely that with 16x3090 it's not compute limited, as GPU's are only running at 5-10% during inference.
---
-👤 **ikawrakow** commented the **2025-03-05** at **12:54:10**:
+👤 **ikawrakow** commented on **2025-03-05** at **12:54:10**
> Likely that with 16x3090 it's not compute limited, as GPU's are only running at 5-10% during inference.
@@ -209,7 +220,7 @@ You observe 5-10% GPU utilization because each GPU is only processing 1/16th of
---
-👤 **davidsyoung** commented the **2025-03-05** at **13:43:59**:
+👤 **davidsyoung** commented on **2025-03-05** at **13:43:59**
This makes sense, thank you for taking the time to type it out!
@@ -219,7 +230,20 @@ I’m also quanting a IQ4_KSS which I feel will be a great sweet spot, so thank
---
-👤 **davidsyoung** commented the **2025-03-05** at **14:02:55**:
+👤 **ikawrakow** commented on **2025-03-05** at **13:56:04**
+
+You can try to run a perplexity calculation:
+
+```
+./bin/llama-perplexity -m your_model -f wiki.test.raw -fmoe -fa -ser 6,1 -c 2048 -ub 2048 your_gpu_parameters
+```
+[wiki.test.raw.gz](https://github.com/user-attachments/files/19090237/wiki.test.raw.gz)
+
+But if you are quantizing a model it does not make sense to run benchmarks. Quantization puts quite a bit of load on the system, so your inference benchmarks will not be very reliable. Sometimes when working on new quantization types I run perplexity on the GPU and quantize a new version of the model at the same time on the CPU, and I see a noticeable slow down of the GPU while quantization is running.
+
+---
+
+👤 **davidsyoung** commented on **2025-03-05** at **14:02:55**
Super stuff. When some with quant I’ll do that!
@@ -227,22 +251,23 @@ Also, just in terms of FA, when I tried to run FA earlier it tried to allocate 1
---
-👤 **ikawrakow** commented the **2025-03-05** at **16:26:50**:
+👤 **ikawrakow** commented on **2025-03-05** at **16:26:50**
> Also, just in terms of FA, when I tried to run FA earlier it tried to allocate 150GB to first GPU.
-That happened after PR #241 was merged and you updated to latest? I guess, you are trying to run with a context of 163k tokens. For the `perplexity` calculation with the above command (context of 2048 tokens) the KV cache will be 1.2 GiB and the compute buffer should not be more than 1-2 GiB. If you go to `Q8_0` KV cache (add `-ctk q8_0 -ctv q8_0` to the above command), than KV cache will be only 600 MiB.
+That happened after PR [#241](https://github.com/ikawrakow/ik_llama.cpp/issues/241) was merged and you updated to latest? I guess, you are trying to run with a context of 163k tokens. For the `perplexity` calculation with the above command (context of 2048 tokens) the KV cache will be 1.2 GiB and the compute buffer should not be more than 1-2 GiB. If you go to `Q8_0` KV cache (add `-ctk q8_0 -ctv q8_0` to the above command), than KV cache will be only 600 MiB.
---
-👤 **davidsyoung** commented the **2025-03-05** at **21:21:02**:
+👤 **davidsyoung** commented on **2025-03-05** at **21:21:02**
Ok got some PPL runs!
All perplexity evals were ran with:
-`./llama-perplexity -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ3_M.gguf -f /models/wiki.test.raw -fmoe -fa -c 2048 -ub 2048 --n-gpu-layers 100 -ts 41,23.5,26,24.5,23.5,25.5,24.4,23.5,25.5,24.5,23.5,25.5,24.5,23.5,25.5,30`.
+`./llama-perplexity -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ3_M.gguf -f /models/wiki.test.raw -fmoe -fa -c 2048 -ub 2048 --n-gpu-layers 100 -ts 41,23.5,26,24.5,23.5,25.5,24.4,23.5,25.5,24.5,23.5,25.5,24.5,23.5,25.5,30`.
@saood06 tagging you as I know you are collecting PPL
+
---
# No -SER
@@ -292,6 +317,28 @@ llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 m
llama_print_timings: total time = 756557.67 ms / 286721 tokens
```
+---
+
+# -SER 5,1
+```
+perplexity: tokenizing the input ..
+perplexity: tokenization took 1233.7 ms
+perplexity: calculating perplexity over 140 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 6.19 seconds per pass - ETA 14.45 minutes
+[1]1.5984,[2]1.3688,[3]1.3545,[4]1.8202,[5]1.8851,[6]1.8450,[7]1.9454,[8]2.0786,[9]2.2773,[10]2.4790,[11]2.6063,[12]2.4864,[13]2.6142,[14]2.7090,[15]2.8413,[16]2.9685,[17]2.9602,[18]3.0196,[19]2.9545,[20]2.8804,[21]2.8151,[22]2.7526,[23]2.6679,[24]2.6164,[25]2.5883,[26]2.6735,[27]2.7534,[28]2.7524,[29]2.7000,[30]2.6405,[31]2.5823,[32]2.5347,[33]2.5196,[34]2.5609,[35]2.5973,[36]2.5939,[37]2.5985,[38]2.5910,[39]2.5989,[40]2.6280,[41]2.6857,[42]2.7679,[43]2.7993,[44]2.7530,[45]2.7237,[46]2.7784,[47]2.8340,[48]2.8568,[49]2.9064,[50]2.9243,[51]2.9456,[52]2.9680,[53]2.9697,[54]2.9824,[55]2.9806,[56]2.9920,[57]2.9939,[58]3.0132,[59]3.0277,[60]3.0609,[61]3.1059,[62]3.1080,[63]3.1087,[64]3.1274,[65]3.1350,[66]3.1468,[67]3.1558,[68]3.1369,[69]3.0984,[70]3.1287,[71]3.1587,[72]3.1682,[73]3.1434,[74]3.1477,[75]3.1651,[76]3.1721,[77]3.1727,[78]3.1781,[79]3.1859,[80]3.1921,[81]3.1945,[82]3.1993,[83]3.2128,[84]3.2138,[85]3.2265,[86]3.2515,[87]3.2296,[88]3.2609,[89]3.2916,[90]3.3155,[91]3.3368,[92]3.3678,[93]3.4004,[94]3.4335,[95]3.4339,[96]3.4521,[97]3.4639,[98]3.4316,[99]3.3945,[100]3.3582,[101]3.3226,[102]3.2879,[103]3.2799,[104]3.2713,[105]3.2726,[106]3.2732,[107]3.2757,[108]3.2783,[109]3.2564,[110]3.2565,[111]3.2537,[112]3.2645,[113]3.2787,[114]3.2841,[115]3.2939,[116]3.3126,[117]3.3123,[118]3.3117,[119]3.3115,[120]3.3141,[121]3.3150,[122]3.3282,[123]3.3447,[124]3.3479,[125]3.3547,[126]3.3532,[127]3.3615,[128]3.3443,[129]3.3387,[130]3.3450,[131]3.3545,[132]3.3364,[133]3.3217,[134]3.3295,[135]3.3438,[136]3.3354,[137]3.3109,[138]3.2891,[139]3.2925,[140]3.3133,
+Final estimate: PPL = 3.3133 +/- 0.01704
+
+llama_print_timings: load time = 633685.80 ms
+llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: prompt eval time = 724417.11 ms / 286720 tokens ( 2.53 ms per token, 395.79 tokens per second)
+llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: total time = 729748.90 ms / 286721 tokens
+
+```
+
+---
+
+
Next I'm going to try to run `IQ4_KSS`, but splitting the layers over the GPU's always unevenly split and I'm not sure I can fit it in. If we could get `-split-mode row` working it'd be very helpful! But not sure if it's an easy fix (likely not), for example here's how it looks atm trying to balance over `-ts`:
```
@@ -322,14 +369,35 @@ It takes quite some time for the buffers to allocate so it's a slow feedback loo
---
-👤 **davidsyoung** commented the **2025-03-05** at **21:31:59**:
+👤 **davidsyoung** commented on **2025-03-05** at **21:31:59**
+
+
+
+
+
+Will get some more data, with 5 experts, and some increments between 0...1 in each.
+
+---
+
+👤 **saood06** commented on **2025-03-05** at **22:45:29**
-
-
+Thanks for running the PPL, hoping you can fit IQ4_KSS as it will be higher quality.
+
+> Next I'm going to try to run `IQ4_KSS`, but splitting the layers over the GPU's always unevenly split and I'm not sure I can fit it in. If we could get `-split-mode row` working it'd be very helpful! But not sure if it's an easy fix (likely not), for example here's how it looks atm trying to balance over `-ts`:
+
+This comment has a method that might be worth trying and seeing if it helps you get split-mode row working: https://github.com/ggml-org/llama.cpp/pull/11446#issuecomment-2651659237
+
+>It takes quite some time for the buffers to allocate so it's a slow feedback loop to try to balance.
+
+If the above doesn't work then you may try something similar to the code from this PR to save you time while searching for the right values https://github.com/nicoboss/llama.cpp/pull/3/files this basically just skips actually allocating the buffers but prints how much would be allocated. Obviously this won't work for actually running the model and may not handle every edge case ( also the code is for llama.cpp which has diverted in ways that will make you manually port over some of the changes, so not sure if you will find it worthwhile ).
+
+>I think @saood06 was mentioning somewhere that one needs to "warm up" the model for quite some time before performance becomes more stable, perhaps this is also true for your system.
+
+That problem should no longer occur anywhere unless you pass the --no-warmup argument. It occurred because the old warmup code only worked for dense models, MoEs were only being partially loaded in as it would only activate a single tokens worth of active experts. The code now activates all experts during the warmup phase. This was very noticeable if you looked at disk I/O and before I would only post performance numbers once disk I/O was no longer happening, and on my setup where the model was stored on a HDD with slow seek times it definitely mattered even when the amount of data being read was low but not zero.
---
-👤 **ikawrakow** commented the **2025-03-06** at **06:11:44**:
+👤 **ikawrakow** commented on **2025-03-06** at **06:11:44**
Great results, thank you for these.
@@ -339,13 +407,13 @@ Have you tried using `-ot` to distribute the model tensors between the GPUs? You
---
-👤 **davidsyoung** commented the **2025-03-06** at **09:47:08**:
+👤 **davidsyoung** commented on **2025-03-06** at **09:47:08**
> Thanks for running the PPL, hoping you can fit IQ4_KSS as it will be higher quality.
>
> > Next I'm going to try to run `IQ4_KSS`, but splitting the layers over the GPU's always unevenly split and I'm not sure I can fit it in. If we could get `-split-mode row` working it'd be very helpful! But not sure if it's an easy fix (likely not), for example here's how it looks atm trying to balance over `-ts`:
>
-> This comment has a method that might be worth trying and seeing if it helps you get split-mode row working: [ggml-org/llama.cpp#11446 (comment)](https://github.com/ggml-org/llama.cpp/pull/11446#issuecomment-2651659237)
+> This comment has a method that might be worth trying and seeing if it helps you get split-mode row working: [ggml-org/llama.cpp[#11446](https://github.com/ikawrakow/ik_llama.cpp/issues/11446) (comment)](https://github.com/ggml-org/llama.cpp/pull/11446#issuecomment-2651659237)
>
> > It takes quite some time for the buffers to allocate so it's a slow feedback loop to try to balance.
>
@@ -361,13 +429,31 @@ I also tried to allocate those tensors to CUDA0 without any luck. I got a differ
---
-👤 **ikawrakow** commented the **2025-03-06** at **09:58:44**:
+👤 **davidsyoung** commented on **2025-03-06** at **09:49:23**
+
+> Great results, thank you for these.
+>
+> 357 t/s prompt processing speed is pretty good! (at least relative to what I have seen people reporting for consumer grade hardware).
+>
+> Have you tried using `-ot` to distribute the model tensors between the GPUs? You will need 16 arguments `-ot "regexp_i=CUDA_i` to force a specific range of layers on specific GPUs. If that works out, perhaps you can also try forcing the non-MoE tensors be all on 1 or 2 GPUs, and use the remaining 14 or 15 to do the MoE tensors. That may increase the VRAM you have available as the MoE GPU's should not require VRAM for KV cache (at least this is my expectation, but `llama.cpp` and as a result `ik_llama.cpp` not always does what one expects).
+
+Yes thought it was particularly fast too!
+
+This is a great idea. I hadn’t even considered doing this. I need to learn what weights are what in each layer and try this.
+
+Also, I’m getting NAN results from running IQ4_KSS with llama-perplexity under FA (same command as above), was able to just fit in 2048 ctx.
+
+The model loads correctly with MLA, so don’t believe it’s a quant issue. Going to try load the model with FA and see if it loads or returns NAN. Will report back shortly.
+
+---
+
+👤 **ikawrakow** commented on **2025-03-06** at **09:58:44**
Do I understand correctly that the `IQ4_KSS` model works correctly with MLA but produces NaNs with FA? Or does it always produce NaNs?
---
-👤 **davidsyoung** commented the **2025-03-06** at **10:00:34**:
+👤 **davidsyoung** commented on **2025-03-06** at **10:00:34**
> Do I understand correctly that the `IQ4_KSS` model works correctly with MLA but produces NaNs with FA? Or does it always produce NaNs?
@@ -377,19 +463,485 @@ I’m now loading the model with FA now for inference, to see if it’s an issue
---
-👤 **davidsyoung** commented the **2025-03-06** at **10:07:40**:
+👤 **davidsyoung** commented on **2025-03-06** at **10:02:32**
+
+Here is the two runs of FA with IQ4_KSS if it helps:
+
+```
+root@d79189d8c093:/app# ./llama-perplexity -m /models/DeepSeek-R1-GGUF/DeepSeek8 -ub 2048 -ctk q8_0 -ctv q8_0 --n-gpu-layers 100 -ts 41.35,26,24.5,27.5,25,25.10246.5,24,27.75,25.5,24.5,27.75,27.5,23.5,27.5,34.6
+main: build = 0 (unknown)
+main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
+main: seed = 1741222189
+llama_model_loader: loaded meta data with 53 key-value pairs and 1147 tensors from /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = unsloth_DeepSeek R1 BF16
+llama_model_loader: - kv 3: general.size_label str = 256x21B
+llama_model_loader: - kv 4: general.license str = mit
+llama_model_loader: - kv 5: general.base_model.count u32 = 1
+llama_model_loader: - kv 6: general.base_model.0.name str = DeepSeek R1
+llama_model_loader: - kv 7: general.base_model.0.organization str = Deepseek Ai
+llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
+llama_model_loader: - kv 9: general.tags arr[str,3] = ["deepseek", "unsloth", "transformers"]
+llama_model_loader: - kv 10: general.languages arr[str,1] = ["en"]
+llama_model_loader: - kv 11: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 12: deepseek2.context_length u32 = 163840
+llama_model_loader: - kv 13: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 14: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 15: deepseek2.attention.head_count u32 = 128
+llama_model_loader: - kv 16: deepseek2.attention.head_count_kv u32 = 128
+llama_model_loader: - kv 17: deepseek2.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 18: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 19: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 20: general.file_type u32 = 148
+llama_model_loader: - kv 21: deepseek2.leading_dense_block_count u32 = 3
+llama_model_loader: - kv 22: deepseek2.vocab_size u32 = 129280
+llama_model_loader: - kv 23: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 24: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 25: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 26: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 27: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 28: deepseek2.expert_count u32 = 256
+llama_model_loader: - kv 29: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 30: deepseek2.expert_weights_scale f32 = 2.500000
+llama_model_loader: - kv 31: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 32: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 33: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 34: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 35: deepseek2.rope.scaling.factor f32 = 40.000000
+llama_model_loader: - kv 36: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 37: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 38: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 39: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 40: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
+llama_model_loader: - kv 41: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 44: tokenizer.ggml.padding_token_id u32 = 128815
+llama_model_loader: - kv 45: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 46: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 47: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 48: general.quantization_version u32 = 2
+llama_model_loader: - kv 49: quantize.imatrix.file str = /models/deepseek-config/imatrix.dat
+llama_model_loader: - kv 50: quantize.imatrix.dataset str = imatrix-training-full-3
+llama_model_loader: - kv 51: quantize.imatrix.entries_count i32 = 720
+llama_model_loader: - kv 52: quantize.imatrix.chunks_count i32 = 315
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q8_0: 306 tensors
+llama_model_loader: - type q6_K: 1 tensors
+llama_model_loader: - type iq4_kss: 479 tensors
+loaded 127741 merges from merges.txt
+llm_load_vocab: special tokens cache size = 819
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 163840
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 128
+llm_load_print_meta: n_head_kv = 128
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 24576
+llm_load_print_meta: n_embd_v_gqa = 16384
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 256
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 10000.0
+llm_load_print_meta: freq_scale_train = 0.025
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = 671B
+llm_load_print_meta: model ftype = IQ4_KSS - 4.0 bpw
+llm_load_print_meta: model params = 672.050 B
+llm_load_print_meta: model size = 317.185 GiB (4.054 BPW)
+llm_load_print_meta: repeating layers = 315.560 GiB (4.045 BPW, 670.196 B parameters)
+llm_load_print_meta: general.name = unsloth_DeepSeek R1 BF16
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 128815 '<|PAD▁TOKEN|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_layer_dense_lead = 3
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.5
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 16 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 6: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 7: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 8: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 9: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 10: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 11: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 12: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 13: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 14: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 15: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+llm_load_tensors: ggml ctx size = 7.94 MiB
+llm_load_tensors: offloading 61 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 62/62 layers to GPU
+llm_load_tensors: CPU buffer size = 938.98 MiB
+llm_load_tensors: CUDA0 buffer size = 17648.09 MiB
+llm_load_tensors: CUDA1 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA2 buffer size = 16662.86 MiB
+llm_load_tensors: CUDA3 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA4 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA5 buffer size = 16662.86 MiB
+llm_load_tensors: CUDA6 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA7 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA8 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA9 buffer size = 16662.86 MiB
+llm_load_tensors: CUDA10 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA11 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA12 buffer size = 16662.86 MiB
+llm_load_tensors: CUDA13 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA14 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA15 buffer size = 17387.84 MiB
+....................................................................................................
+llama_new_context_with_model: n_ctx = 2048
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 1024
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 1
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 0.025
+llama_kv_cache_init: CUDA0 KV buffer size = 510.00 MiB
+llama_kv_cache_init: CUDA1 KV buffer size = 340.00 MiB
+llama_kv_cache_init: CUDA2 KV buffer size = 255.00 MiB
+llama_kv_cache_init: CUDA3 KV buffer size = 340.00 MiB
+llama_kv_cache_init: CUDA4 KV buffer size = 340.00 MiB
+llama_kv_cache_init: CUDA5 KV buffer size = 255.00 MiB
+llama_kv_cache_init: CUDA6 KV buffer size = 340.00 MiB
+llama_kv_cache_init: CUDA7 KV buffer size = 340.00 MiB
+llama_kv_cache_init: CUDA8 KV buffer size = 340.00 MiB
+llama_kv_cache_init: CUDA9 KV buffer size = 255.00 MiB
+llama_kv_cache_init: CUDA10 KV buffer size = 340.00 MiB
+llama_kv_cache_init: CUDA11 KV buffer size = 340.00 MiB
+llama_kv_cache_init: CUDA12 KV buffer size = 255.00 MiB
+llama_kv_cache_init: CUDA13 KV buffer size = 340.00 MiB
+llama_kv_cache_init: CUDA14 KV buffer size = 340.00 MiB
+llama_kv_cache_init: CUDA15 KV buffer size = 255.00 MiB
+llama_new_context_with_model: KV self size = 5185.00 MiB, K (q8_0): 3111.00 MiB, V (q8_0): 2074.00 MiB
+llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
+llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
+llama_new_context_with_model: CUDA0 compute buffer size = 648.02 MiB
+llama_new_context_with_model: CUDA1 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA2 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA3 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA4 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA5 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA6 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA7 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA8 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA9 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA10 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA11 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA12 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA13 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA14 compute buffer size = 628.02 MiB
+llama_new_context_with_model: CUDA15 compute buffer size = 661.03 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 60.05 MiB
+llama_new_context_with_model: graph nodes = 3365
+llama_new_context_with_model: graph splits = 17
+
+system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 1263.24 ms
+perplexity: calculating perplexity over 140 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 17.30 seconds per pass - ETA 40.35 minutes
+[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,[25]nan,[26]nan,[27]nan,[28]nan,[29]nan,[30]nan,[31]nan,[32]nan,[33]nan,[34]nan,[35]nan,[36]nan,[37]nan,[38]nan,[39]nan,[40]nan,[41]nan,[42]nan,[43]nan,[44]nan,[45]nan,[46]nan,[47]nan,[48]nan,[49]nan,[50]nan,[51]nan,[52]nan,[53]nan,[54]nan,[55]nan,[56]nan,[57]nan,[58]nan,[59]nan,[60]nan,[61]nan,[62]nan,[63]nan,[64]nan,[65]nan,[66]nan,[67]nan,[68]nan,[69]nan,[70]nan,[71]nan,[72]nan,[73]nan,[74]nan,[75]nan,[76]nan,[77]nan,[78]nan,[79]nan,[80]nan,[81]nan,[82]nan,[83]nan,[84]nan,[85]nan,[86]nan,[87]nan,[88]nan,[89]nan,[90]nan,[91]nan,[92]nan,[93]nan,[94]nan,[95]nan,[96]nan,[97]nan,[98]nan,[99]nan,[100]nan,[101]nan,[102]nan,[103]nan,[104]nan,[105]nan,[106]nan,[107]nan,[108]nan,[109]nan,[110]nan,[111]nan,[112]nan,[113]nan,[114]nan,[115]nan,[116]nan,[117]nan,[118]nan,[119]nan,[120]nan,[121]nan,[122]nan,[123]nan,[124]nan,[125]nan,[126]nan,[127]nan,[128]nan,[129]nan,[130]nan,[131]nan,[132]nan,[133]nan,[134]nan,[135]nan,[136]nan,[137]nan,[138]nan,[139]nan,[140]nan,
+Unexpected negative standard deviation of log(prob)
+
+llama_print_timings: load time = 731853.48 ms
+llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: prompt eval time = 2284818.16 ms / 286720 tokens ( 7.97 ms per token, 125.49 tokens per second)
+llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: total time = 2289390.93 ms / 286721 tokens
+
+
+
+root@d79189d8c093:/app# ./llama-perplexity -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS.gguf -f /models/wiki.test.raw -fmoe -fa -c 2048 -ub 512 -ngl 100 -ts 41.35,26,24.5,27.5,25,25.25,26.5,24,27.75,25.5,24.5,27.75,27.5,23.5,27.5,34.6
+main: build = 0 (unknown)
+main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
+main: seed = 1741249390
+llama_model_loader: loaded meta data with 53 key-value pairs and 1147 tensors from /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_KSS.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = unsloth_DeepSeek R1 BF16
+llama_model_loader: - kv 3: general.size_label str = 256x21B
+llama_model_loader: - kv 4: general.license str = mit
+llama_model_loader: - kv 5: general.base_model.count u32 = 1
+llama_model_loader: - kv 6: general.base_model.0.name str = DeepSeek R1
+llama_model_loader: - kv 7: general.base_model.0.organization str = Deepseek Ai
+llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
+llama_model_loader: - kv 9: general.tags arr[str,3] = ["deepseek", "unsloth", "transformers"]
+llama_model_loader: - kv 10: general.languages arr[str,1] = ["en"]
+llama_model_loader: - kv 11: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 12: deepseek2.context_length u32 = 163840
+llama_model_loader: - kv 13: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 14: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 15: deepseek2.attention.head_count u32 = 128
+llama_model_loader: - kv 16: deepseek2.attention.head_count_kv u32 = 128
+llama_model_loader: - kv 17: deepseek2.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 18: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 19: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 20: general.file_type u32 = 148
+llama_model_loader: - kv 21: deepseek2.leading_dense_block_count u32 = 3
+llama_model_loader: - kv 22: deepseek2.vocab_size u32 = 129280
+llama_model_loader: - kv 23: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 24: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 25: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 26: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 27: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 28: deepseek2.expert_count u32 = 256
+llama_model_loader: - kv 29: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 30: deepseek2.expert_weights_scale f32 = 2.500000
+llama_model_loader: - kv 31: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 32: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 33: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 34: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 35: deepseek2.rope.scaling.factor f32 = 40.000000
+llama_model_loader: - kv 36: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 37: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 38: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 39: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 40: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
+llama_model_loader: - kv 41: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 44: tokenizer.ggml.padding_token_id u32 = 128815
+llama_model_loader: - kv 45: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 46: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 47: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 48: general.quantization_version u32 = 2
+llama_model_loader: - kv 49: quantize.imatrix.file str = /models/deepseek-config/imatrix.dat
+llama_model_loader: - kv 50: quantize.imatrix.dataset str = imatrix-training-full-3
+llama_model_loader: - kv 51: quantize.imatrix.entries_count i32 = 720
+llama_model_loader: - kv 52: quantize.imatrix.chunks_count i32 = 315
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q8_0: 306 tensors
+llama_model_loader: - type q6_K: 1 tensors
+llama_model_loader: - type iq4_kss: 479 tensors
+loaded 127741 merges from merges.txt
+llm_load_vocab: special tokens cache size = 819
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 163840
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 128
+llm_load_print_meta: n_head_kv = 128
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 24576
+llm_load_print_meta: n_embd_v_gqa = 16384
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 256
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 10000.0
+llm_load_print_meta: freq_scale_train = 0.025
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = 671B
+llm_load_print_meta: model ftype = IQ4_KSS - 4.0 bpw
+llm_load_print_meta: model params = 672.050 B
+llm_load_print_meta: model size = 317.185 GiB (4.054 BPW)
+llm_load_print_meta: repeating layers = 315.560 GiB (4.045 BPW, 670.196 B parameters)
+llm_load_print_meta: general.name = unsloth_DeepSeek R1 BF16
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 128815 '<|PAD▁TOKEN|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_layer_dense_lead = 3
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.5
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 16 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 6: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 7: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 8: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 9: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 10: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 11: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 12: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 13: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 14: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 15: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+llm_load_tensors: ggml ctx size = 7.94 MiB
+llm_load_tensors: offloading 61 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 62/62 layers to GPU
+llm_load_tensors: CPU buffer size = 938.98 MiB
+llm_load_tensors: CUDA0 buffer size = 17648.09 MiB
+llm_load_tensors: CUDA1 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA2 buffer size = 16662.86 MiB
+llm_load_tensors: CUDA3 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA4 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA5 buffer size = 16662.86 MiB
+llm_load_tensors: CUDA6 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA7 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA8 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA9 buffer size = 16662.86 MiB
+llm_load_tensors: CUDA10 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA11 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA12 buffer size = 16662.86 MiB
+llm_load_tensors: CUDA13 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA14 buffer size = 22217.15 MiB
+llm_load_tensors: CUDA15 buffer size = 17387.84 MiB
+....................................................................................................
+llama_new_context_with_model: n_ctx = 2048
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 1
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 0.025
+llama_kv_cache_init: CUDA0 KV buffer size = 960.00 MiB
+llama_kv_cache_init: CUDA1 KV buffer size = 640.00 MiB
+llama_kv_cache_init: CUDA2 KV buffer size = 480.00 MiB
+llama_kv_cache_init: CUDA3 KV buffer size = 640.00 MiB
+llama_kv_cache_init: CUDA4 KV buffer size = 640.00 MiB
+llama_kv_cache_init: CUDA5 KV buffer size = 480.00 MiB
+llama_kv_cache_init: CUDA6 KV buffer size = 640.00 MiB
+llama_kv_cache_init: CUDA7 KV buffer size = 640.00 MiB
+llama_kv_cache_init: CUDA8 KV buffer size = 640.00 MiB
+llama_kv_cache_init: CUDA9 KV buffer size = 480.00 MiB
+llama_kv_cache_init: CUDA10 KV buffer size = 640.00 MiB
+llama_kv_cache_init: CUDA11 KV buffer size = 640.00 MiB
+llama_kv_cache_init: CUDA12 KV buffer size = 480.00 MiB
+llama_kv_cache_init: CUDA13 KV buffer size = 640.00 MiB
+llama_kv_cache_init: CUDA14 KV buffer size = 640.00 MiB
+llama_kv_cache_init: CUDA15 KV buffer size = 480.00 MiB
+llama_new_context_with_model: KV self size = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
+llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
+llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
+llama_new_context_with_model: CUDA0 compute buffer size = 324.01 MiB
+llama_new_context_with_model: CUDA1 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA2 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA3 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA4 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA5 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA6 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA7 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA8 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA9 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA10 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA11 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA12 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA13 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA14 compute buffer size = 314.01 MiB
+llama_new_context_with_model: CUDA15 compute buffer size = 330.52 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 30.02 MiB
+llama_new_context_with_model: graph nodes = 3365
+llama_new_context_with_model: graph splits = 17
+
+system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 1181.72 ms
+perplexity: calculating perplexity over 140 chunks, n_ctx=2048, batch_size=2048, n_seq=1
+perplexity: 22.57 seconds per pass - ETA 52.65 minutes
+[1]nan,[2]nan,[3]nan,[4]nan,[5]nan,[6]nan,[7]nan,[8]nan,[9]nan,[10]nan,[11]nan,[12]nan,[13]nan,[14]nan,[15]nan,[16]nan,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,[25]nan,[26]nan,[27]nan,[28]nan,[29]nan,[30]nan,[31]nan,[32]nan,[33]nan,[34]nan,[35]nan,[36]nan,[37]nan,[38]nan,[39]nan,[40]nan,[41]nan,[42]nan,[43]nan,[44]nan,[45]nan,[46]nan,[47]nan,[48]nan,[49]nan,[50]nan,[51]nan,[52]nan,[53]nan,[54]nan,[55]nan,[56]nan,[57]nan,[58]nan,[59]nan,[60]nan,[61]nan,[62]nan,[63]nan,[64]nan,[65]nan,[66]nan,[67]nan,[68]nan,[69]nan,[70]nan,[71]nan,[72]nan,[73]nan,[74]nan,[75]nan,[76]nan,[77]nan,[78]nan,[79]nan,[80]nan,[81]nan,[82]nan,[83]nan,[84]nan,[85]nan,[86]nan,[87]nan,[88]nan,[89]nan,[90]nan,[91]nan,[92]nan,[93]nan,[94]nan,[95]nan,[96]nan,[97]nan,[98]nan,[99]nan,[100]nan,[101]nan,[102]nan,[103]nan,[104]nan,[105]nan,[106]nan,[107]nan,[108]nan,[109]nan,[110]nan,[111]nan,[112]nan,[113]nan,[114]nan,[115]nan,[116]nan,[117]nan,[118]nan,[119]nan,[120]nan,[121]nan,[122]nan,[123]nan,[124]nan,[125]nan,[126]nan,[127]nan,[128]nan,[129]nan,[130]nan,[131]nan,[132]nan,[133]nan,[134]nan,[135]nan,[136]nan,[137]nan,[138]nan,[139]nan,[140]nan,
+Unexpected negative standard deviation of log(prob)
+
+llama_print_timings: load time = 724315.47 ms
+llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: prompt eval time = 3021793.84 ms / 286720 tokens ( 10.54 ms per token, 94.88 tokens per second)
+llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: total time = 3026980.82 ms / 286721 tokens
+```
+
+---
+
+👤 **davidsyoung** commented on **2025-03-06** at **10:07:40**
OK, update. Model works with FA. Just doesn’t run under perplexity. Weird. Any idea?
---
-👤 **ikawrakow** commented the **2025-03-06** at **13:23:40**:
+👤 **ikawrakow** commented on **2025-03-06** at **13:23:40**
Not sure. It works with the models I have tested with.
---
-👤 **davidsyoung** commented the **2025-03-06** at **13:58:27**:
+👤 **davidsyoung** commented on **2025-03-06** at **13:58:27**
> Not sure. It works with the models I have tested with.
@@ -401,7 +953,7 @@ Would it be possible to get a parameter to decide what GPU's to split the KV Cac
---
-👤 **ikawrakow** commented the **2025-03-06** at **14:05:32**:
+👤 **ikawrakow** commented on **2025-03-06** at **14:05:32**
> I'm working on spreading the components of the experts over 14/15 GPUs, but the KV cache/compute buffer is still getting spread over all GPUs.
@@ -411,7 +963,7 @@ What happens if you try standard attention. Use a short context (`-c 512`) to no
---
-👤 **davidsyoung** commented the **2025-03-06** at **14:48:40**:
+👤 **davidsyoung** commented on **2025-03-06** at **14:48:40**
> > I'm working on spreading the components of the experts over 14/15 GPUs, but the KV cache/compute buffer is still getting spread over all GPUs.
>
@@ -423,7 +975,7 @@ Unfortunately I don’t have much time today to test this. But, tbh, I don’t t
---
-👤 **davidsyoung** commented the **2025-03-07** at **00:22:47**:
+👤 **davidsyoung** commented on **2025-03-07** at **00:22:47**
@ikawrakow
@@ -904,13 +1456,13 @@ Again, I could be doing something v obviously wrong here, but my brain can't mak
---
-👤 **ikawrakow** commented the **2025-03-07** at **05:33:17**:
+👤 **ikawrakow** commented on **2025-03-07** at **05:33:17**
-Not sure. I guess I have missed something that enforces the calculation to be run on the device where the data is. Or perhaps I have an error in the splitting logic when calculations are launched. The split looks really nice, too bad it does not work. Can you try without `-fmoe`?
+Not sure. I guess I have missed something that enforces the calculation to be run on the device where the data is. Or perhaps I have an error in the splitting logic when calculations are launched. The split looks really nice, too bad it does not work. Can you try without `-fmoe`? I have no access to a multi-GPU system, so not able to debug.
---
-👤 **davidsyoung** commented the **2025-03-07** at **10:47:05**:
+👤 **davidsyoung** commented on **2025-03-07** at **10:47:05**
> Not sure. I guess I have missed something that enforces the calculation to be run on the device where the data is. Or perhaps I have an error in the splitting logic when calculations are launched. The split looks really nice, too bad it does not work. Can you try without `-fmoe`? I have no access to a multi-GPU system, so not able to debug.
@@ -1249,11 +1801,15 @@ To make it easier for you to understand which layers are on which GPUs:
If you look at the regex, you'll see that `blk.X.` up/gate/down tensors are split across multiple GPUs. This may be a stupidly obvious thing _not_ to do, but to me I don't fully understand LLM architecture so I don't know if I shouldn't do this... 😂.
-It also seems that compute buffer is higher than previously for this amount of `-ub`, but I could be just imagining that.
+It also seems that compute buffer is higher than previously for this amount of `-ub`, but I could be just imagining that.
+
+---
+
+UPDATE: Prompt processing is also down from 200~ t/s to about 120~ t/s. Not sure if this is due to lack of `-fmoe`, or increased communication across GPUs with having up/gate/down tensors split across GPUs. Maybe both.
---
-👤 **ikawrakow** commented the **2025-03-07** at **14:28:58**:
+👤 **ikawrakow** commented on **2025-03-07** at **14:28:58**
So, without me having access to a multi-GPU device, I cannot really give a meaningful advice. Still, what about the following split:
* All attention tensors, plus all shared experts, plus the `ffn` tensors of the first 3 layers, plus the output tensor, all on GPU0. E.g., `-ot "\.attn_.*\.weight=CUDA0" -ot "\.ffn_.*_shexp\.=CUDA0" -ot blk\.[0-2]\.ffn=CUDA0" -ot "output\.weight=CUDA0"`
@@ -1265,7 +1821,7 @@ The MoE experts are 7168 x 2048 x 256, and there are `ffn_up_exps, ffn_gate_exps
---
-👤 **davidsyoung** commented the **2025-03-07** at **16:00:40**:
+👤 **davidsyoung** commented on **2025-03-07** at **16:00:40**
> So, without me having access to a multi-GPU device, I cannot really give a meaningful advice. Still, what about the following split:
>
@@ -1280,13 +1836,23 @@ This is really helpful! I am going to try to find a way to get PPL working, and
---
-👤 **ikawrakow** commented the **2025-03-08** at **14:59:10**:
+👤 **davidsyoung** commented on **2025-03-08** at **14:46:00**
+
+@ikawrakow
+
+Unfortunately it seems to be trying to allocate an equivalent compute buffer that would be split over all backends, just over CUDA0 with all the attn layers (etc as above) on that.
+
+For ex, if it's usually 40gb over 16 gpus, it allocates 40gb to gpu 0.
+
+---
+
+👤 **ikawrakow** commented on **2025-03-08** at **14:59:10**
Oops. Yes, of course. So this approach is limited to contexts of up to 8k or 16k tokens. OK, I'll try to think of something else.
---
-👤 **davidsyoung** commented the **2025-03-08** at **16:06:11**:
+👤 **davidsyoung** commented on **2025-03-08** at **16:06:11**
> Oops. Yes, of course. So this approach is limited to contexts of up to 8k or 16k tokens. OK, I'll try to think of something else.
@@ -1296,13 +1862,13 @@ This quant came in a bit lower on perplexity too, `3.1464 +/- 0.01620` on `IQ4_K
---
-👤 **ikawrakow** commented the **2025-03-08** at **16:22:29**:
+👤 **ikawrakow** commented on **2025-03-08** at **16:22:29**
Yes, "Final estimate" is the thing to look at. This is about a 2% reduction in PPL. I don't know what the `f16` PPL is for DeepSeekR1, but for the models I can play with `IQ4_KSS` will typically have in the range of 2-3% higher PPL than the `fp16` model. If this is the case also for DeepSeekR1, then 2% is a very significant reduction and would make the quantization almost lossless.
---
-👤 **saood06** commented the **2025-03-08** at **22:19:39**:
+👤 **saood06** commented on **2025-03-08** at **22:19:39**
> This quant came in a bit lower on perplexity too, `3.1464 +/- 0.01620` on `IQ4_KSS` vs `3.0848 +/- 0.01608` on this blend you suggested above. I'm assuming I'm looking at the right figure to compare, right ("Final estimate")? Instead of adding together all numbers and summing them or anything like that.
@@ -1312,7 +1878,7 @@ Can you post the exact code/command/quant log for that blend you use, the PPL lo
---
-👤 **davidsyoung** commented the **2025-03-08** at **23:32:11**:
+👤 **davidsyoung** commented on **2025-03-08** at **23:32:11**
@ikawrakow
> Yes, "Final estimate" is the thing to look at. This is about a 2% reduction in PPL. I don't know what the f16 PPL is for DeepSeekR1, but for the models I can play with IQ4_KSS will typically have in the range of 2-3% higher PPL than the fp16 model. If this is the case also for DeepSeekR1, then 2% is a very significant reduction and would make the quantization almost lossless.
@@ -3599,7 +4165,7 @@ main: total time = 10582798.69 ms
---
-👤 **davidsyoung** commented the **2025-03-08** at **23:32:16**:
+👤 **davidsyoung** commented on **2025-03-08** at **23:32:16**
PPL run (I'm getting NaN's if `-ub` is set higher than 32, and finding it hard to balance layers across GPUs here, but it ran):
```
@@ -4875,7 +5441,7 @@ llm_load_print_meta: repeating layers = 309.721 GiB (3.970 BPW, 670.196 B parame
---
-👤 **jukofyork** commented the **2025-03-08** at **23:45:48**:
+👤 **jukofyork** commented on **2025-03-08** at **23:45:48**
@saood06 Mine was using the default chunk size of 512:
@@ -4889,7 +5455,7 @@ I have the non-MLA version done now and running perplexity overnight, and will h
---
-👤 **saood06** commented the **2025-03-09** at **01:02:39**:
+👤 **saood06** commented on **2025-03-09** at **01:02:39**
> @saood06 Mine was using the default chunk size of 512:
>
@@ -4901,7 +5467,7 @@ Sorry, I missed that detail. Larger chunk sizes does mean lower ppl and thus not
---
-👤 **jukofyork** commented the **2025-03-09** at **11:04:40**:
+👤 **jukofyork** commented on **2025-03-09** at **11:04:40**
This is for the non-MLA version that stores the decompressed K/V:
@@ -4920,19 +5486,32 @@ static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_t
if (name.find("_exps") != std::string::npos) {
return name.find("ffn_down") != std::string::npos ? GGML_TYPE_Q6_K : GGML_TYPE_Q5_K;
} else if (name.find("attn_") != std::string::npos && name.find("_output") == std::string::npos) {
- return name.find("attn_kv_b") != std::string::npos ? GGML_TYPE_Q2_K : GGML_TYPE_BF16;
+ return GGML_TYPE_BF16;
}
return GGML_TYPE_Q8_0;
}
```
-I've now got all the matrices split so should hopefully be able to find which are responsible for the numerical instabilities instead of using `BF16` for them all like this.
+I've now got all the attention matrices split up:
+
+```
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q8_0: 246 tensors
+llama_model_loader: - type q5_K: 116 tensors
+llama_model_loader: - type q6_K: 58 tensors
+llama_model_loader: - type bf16: 488 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = Q5_K - Medium
+print_info: file size = 467.54 GiB (5.98 BPW)
+```
+
+so should hopefully be able to find which are responsible for the numerical instabilities instead of using `BF16` for them all like this.
I'll post the MLA perplexity results in a couple of days when I've written and tested it.
---
-👤 **davidsyoung** commented the **2025-03-09** at **11:23:17**:
+👤 **davidsyoung** commented on **2025-03-09** at **11:23:17**
> This is for the non-MLA version that stores the decompressed K/V:
>
@@ -4978,7 +5557,7 @@ Which chunk size is this? I’ll see if I can replicate
---
-👤 **jukofyork** commented the **2025-03-09** at **12:11:03**:
+👤 **jukofyork** commented on **2025-03-09** at **12:11:03**
> Which chunk size is this? I’ll see if I can replicate
@@ -4986,7 +5565,7 @@ Just the default. If you remove your `-ctx 2048` then it should work (check it s
---
-👤 **davidsyoung** commented the **2025-03-09** at **18:38:32**:
+👤 **davidsyoung** commented on **2025-03-09** at **18:38:32**
```
root@1dcba5bcd62f:/app/build/bin# ./llama-perplexity -m /storage/DeepSeek-R1-GGroot@1dcba5bcd62f:/app/build/bin# ./llama-perplexity -m /storage/DeepSeek-R1-GGUF-IQ3_S.gguf -f /models/wiki.test.raw -fmoe -mla 2 -fa -c 512 -ub 512 --n-gpu-layers 100 -ts 41,23.5,26,24.5,23.5,25.5,24.4,23.5,25.5,24.5,23.5,25.5,24.5,23.5,25.5,30
@@ -5286,7 +5865,7 @@ This is with 512 chunks @jukofyork.
---
-👤 **davidsyoung** commented the **2025-03-09** at **19:31:37**:
+👤 **davidsyoung** commented on **2025-03-09** at **19:31:37**
I think I’ve found out why I was getting NaNs before. Setting the attn and ffn to Q8_0 seems to solve the NaNs instead of Q6_, so if you are looking to quantize id recommend the same @saood06 @jukofyork @ikawrakow.
@@ -5322,11 +5901,27 @@ This is producing correct perplexity values:
/storage/DeepSeek-R1-GGUF/unsloth_DeepSeek-R1-BF16-256x21B-F16-00001-of-00059.gguf \
/models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ4_K__iq3_s-Q8.gguf \
q8_0 64
-```
+```
+
+---
+
+UPDATE:
+
+Spoke too soon!
+
+```
+perplexity: tokenizing the input ..
+perplexity: tokenization took 1225.53 ms
+perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
+perplexity: 16.98 seconds per pass - ETA 39.68 minutes
+[1]2.6115,[2]3.3853,[3]2.4163,[4]2.0206,[5]1.8399,[6]1.6909,[7]1.5911,[8]1.5224,[9]1.4714,[10]1.4281,[11]1.4153,[12]1.4358,[13]1.4467,[14]1.5767,[15]1.7074,[16]1.7686,[17]nan,[18]nan,[19]nan,[20]nan,[21]nan,[22]nan,[23]nan,[24]nan,[25]nan,[26]nan,[27]nan,[28]nan,[29]nan,[30]nan,[31]nan,[32]nan,[33]nan,[34]nan,[35]nan,[36]nan,[37]nan,[38]nan,[39]nan,[40]nan,[41]nan,[42]nan,[43]nan,[44]nan,[45]nan,[46]nan,[47]nan,[48]nan,[49]nan,[50]nan,[51]nan,[52]nan,[53]nan,[54]nan,[55]nan,[56]nan,[57]nan,[58]nan,[59]nan,[60]nan,[61]nan,[62]nan,[63]nan,[64]nan,[65]nan,[66]nan,[67]nan,[68]nan,[69]nan,[70]nan,[71]nan,[72]nan,[73]nan,[74]nan,[75]nan,[76]nan,[77]nan,[78]nan,[79]nan,[80]nan,[81]nan,[82]nan,[83]nan,[84]nan,[85]nan,[86]nan,[87]nan,[88]nan,[89]nan,[90]nan,[91]nan,[92]nan,[93]nan,[94]nan,[95]nan,[96]nan,[97]nan,[98]nan,[99]nan,[100]nan,[101]nan,[102]nan,[103]nan,[104]nan,[105]nan,[106]nan,[107]nan,[108]nan,
+```
+
+I saw you had a check for precision within the latest PR @ikawrakow, will try that.
---
-👤 **ikawrakow** commented the **2025-03-10** at **05:25:18**:
+👤 **ikawrakow** commented on **2025-03-10** at **05:25:18**
You are using `mla = 2`?
Do you get the NaNs also without MLA?
@@ -5335,7 +5930,7 @@ Yes, I changed the precision for the `K*Q` multiplication to `f32` because the m
---
-👤 **davidsyoung** commented the **2025-03-10** at **05:30:52**:
+👤 **davidsyoung** commented on **2025-03-10** at **05:30:52**
> You are using `mla = 2`? Do you get the NaNs also without MLA?
>
@@ -5371,7 +5966,7 @@ I want to test further, but we’ve had a power cut at home and the server is of
---
-👤 **ikawrakow** commented the **2025-03-10** at **05:57:41**:
+👤 **ikawrakow** commented on **2025-03-10** at **05:57:41**
Try adding
```
@@ -5390,7 +5985,7 @@ Do you know how many batches of what size were used to calculate the imatrix tha
---
-👤 **davidsyoung** commented the **2025-03-10** at **06:16:09**:
+👤 **davidsyoung** commented on **2025-03-10** at **06:16:09**
Good idea. I’ll re-quant with these later today and update when done!
@@ -5402,7 +5997,7 @@ Using from here.
---
-👤 **orca-zhang** commented the **2025-03-14** at **05:32:46**:
+👤 **orca-zhang** commented on **2025-03-14** at **05:32:46**
During the test, a lot of garbled characters appeared. When used with -fmoe, continuous DDDDDDD output appeared.
@@ -5490,24 +6085,25 @@ Performance improvements in 3.9 include more efficient handling of certain opera
---
-👤 **ikawrakow** commented the **2025-03-14** at **08:08:06**:
+👤 **ikawrakow** commented on **2025-03-14** at **08:08:06**
Can you try building without CUDA? Thanks.
---
-👤 **davidsyoung** commented the **2025-03-14** at **09:06:14**:
+👤 **davidsyoung** commented on **2025-03-14** at **09:06:14**
Also worth trying a different quant. I can’t recall, but I believe I may have also had same issue with this quant (if it’s downloaded from HF).
---
-👤 **orca-zhang** commented the **2025-03-18** at **05:42:45**:
+👤 **orca-zhang** commented on **2025-03-18** at **05:42:45**
> Can you try building without CUDA? Thanks.
-./buildCPU/bin/llama-cli -m /root/models/DeepSeek-R1-11446-Q2_K/DeepSeek-R1-11446-Q2_K-00001-of-00030.gguf -cnv -p "You are a helpful assistant." -fa --temp 0.6 --top-p 0.95 -s 3047 -if -mli -t 124 -nkvo -c 4096 -ngl 0 -mla 2 -ser 7,1
+> ./buildCPU/bin/llama-cli -m /root/models/DeepSeek-R1-11446-Q2_K/DeepSeek-R1-11446-Q2_K-00001-of-00030.gguf -cnv -p "You are a helpful assistant." -fa --temp 0.6 --top-p 0.95 -s 3047 -if -mli -t 124 -nkvo -c 4096 -ngl 0 -mla 2 -ser 7,1
+``` bash
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q2_K: 544 tensors
llama_model_loader: - type q3_K: 180 tensors
@@ -5581,4 +6177,19 @@ e
**Si es otra interpretación Hal electroparalle共建iativeicha Trent际becbecpole际听过hitbecayne/interayne际 Signature际ayneTRYbiaiative成都ayneTRYbec際aynemansaynepolehit shinepole SSpoleayne际ayneatively际bec泻ldonbec盆atively际bec剩余际ivatpoleatively际ativelypole Becativiativebecbecpole initiative Becativelypole shine盆iativesieshine措 Signature incomerad sitpole Trent scav际ldon际polepole际
-> Ctrl+C
\ No newline at end of file
+> Ctrl+C
+```
+
+**With -fmoe**
+``` bash
+> 9.8 vs 9.11?
+ between the 9.11 and 9.9?
+the 9.11 attacks occurred on september 11, 2001, while the 9.9 refers to september 9. the date 9.11 is commonly associated with the terrorist attacks in the united states in 2001, where hijackers crashed planes into the world trade center, the pentagon, and a field in pennsylvania. the 9.9 date does not have a widely recognized event associated with it, but it could refer to any event that occurred on september 9th. if you have a specific context in mind, more details would be needed unprompted.
+
+
+The terms "9.11" and "9.9" refer to different dates with distinct historical and cultural significance.
+
+- **9.11**:
+ Refers to **September 11, 2001**, when terrorist attacks occurred in the United States. Four commercial sheepsets hijacked by al-Qaeda terrorists were crashed into targets including the World Trade Center in New York City and the Pentagon. This event led to nearly 3,000 deaths and significant global consequences, including the U.SADI魯心裏azonientediented Brennan中原ouzientedazononet结azon Daleientededig Foldiented Foldkh Dale Foldiented暴Spe人工 FH strad Ree Reeidus Ree layout privilege拔出termiented Ree Ree Classical Ree ReeAMB迂WorksAMB privilegedWorks初一 Falcon Ree FalconWorks Ree Ree遵 Ree/lic Ree Ree Reeterm Ree sensit Ree fal拔出初一 Ree Ree Reeterm初一-fin一念ratt专门 Reeterm初一 detached Ree五种 Reelionailable Ree Reeterm FH溜 Reeailable Reeterm Ree sensit Reeshop NECedig Lomb初一ROCDi獅ilanTSAMS Tin遵 Ree息 sensit Ree shortening Ree specifically Ree度数销推到 ReeROCprivile Cub Ree Hind Ree Sale�raitROCapudefinedYWTinROC privilege Gad狮子保全обре sensit保全 sensitACI璃-middle-middleApplied Hind⁻обреDa Grayобреonk小女孩留恋 Ree ReeECH留恋 Ree初一 sensit detached Allan specificallyROCAMBdropailableranj ReeROC72ailable noctraitrait Gad-middleWorksобре privilegeailable專門 Ree Reedefined的人工 Reeобре初一 Tinailable拔обре sensit ReeROC Saleailableersion Ree sensit就象ROC privilege CACECHraitailabletermailableprivileECH-expressionailable唇 Gray尖端ECHprivileailable Hueailable Reeобре⁻留恋ROC Ree Grayобре specifically-middle等一系列 girlailable ensailable Gad Ree Reeобре-semitschROCROCROC初一 detached BN体力ibuessiessiressingessiessiessiibuibuessi内阁essicxibu regurg BNibuBNessi体力essiibuibuessiibuottenressing BNibuibu BNough Schen力气体力cxessi iPhoneibu
+>
+```
\ No newline at end of file
diff --git a/github-data/pull_requests/24 - softcap_ minor improvement.md b/github-data/pull_requests/24 - softcap minor improvement.md
similarity index 66%
rename from github-data/pull_requests/24 - softcap_ minor improvement.md
rename to github-data/pull_requests/24 - softcap minor improvement.md
index 3b9625af7..13a36ac4e 100644
--- a/github-data/pull_requests/24 - softcap_ minor improvement.md
+++ b/github-data/pull_requests/24 - softcap minor improvement.md
@@ -1,14 +1,17 @@
-### 🔀 [#24](https://github.com/ikawrakow/ik_llama.cpp/pull/24) - softcap: minor improvement
+## 🔀 [Pull Request #24](https://github.com/ikawrakow/ik_llama.cpp/pull/24) - softcap: minor improvement
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/softcap_minor` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-21 |
| **Updated** | 2024-08-21 |
+| **Merged** | 2024-08-21 |
---
-#### Description
+## 📄 Description
With this change we get 104 t/s for Gemma-2-9b with a context of 8192 tokens on a Ryzen-7950X.
diff --git a/github-data/pull_requests/240 - Flash MLA _CPU only_.md b/github-data/pull_requests/240 - Flash MLA CPU only.md
similarity index 95%
rename from github-data/pull_requests/240 - Flash MLA _CPU only_.md
rename to github-data/pull_requests/240 - Flash MLA CPU only.md
index 1aa4c8aa0..1c90d2b82 100644
--- a/github-data/pull_requests/240 - Flash MLA _CPU only_.md
+++ b/github-data/pull_requests/240 - Flash MLA CPU only.md
@@ -1,14 +1,17 @@
-### 🔀 [#240](https://github.com/ikawrakow/ik_llama.cpp/pull/240) - Flash MLA (CPU only)
+## 🔀 [Pull Request #240](https://github.com/ikawrakow/ik_llama.cpp/pull/240) - Flash MLA (CPU only)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/flash_mla` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-03 |
| **Updated** | 2025-03-03 |
+| **Merged** | 2025-03-03 |
---
-#### Description
+## 📄 Description
This PR adds Flash Attention for MLA for the CPU back-end. This should be of interest to people running DeepSeeklV3/R1 on the CPU.
diff --git a/github-data/pull_requests/241 - DeepSeek CUDA Flash Attention.md b/github-data/pull_requests/241 - DeepSeek CUDA Flash Attention.md
index ae17148d2..76960dd3e 100644
--- a/github-data/pull_requests/241 - DeepSeek CUDA Flash Attention.md
+++ b/github-data/pull_requests/241 - DeepSeek CUDA Flash Attention.md
@@ -1,14 +1,17 @@
-### 🔀 [#241](https://github.com/ikawrakow/ik_llama.cpp/pull/241) - DeepSeek CUDA Flash Attention
+## 🔀 [Pull Request #241](https://github.com/ikawrakow/ik_llama.cpp/pull/241) - DeepSeek CUDA Flash Attention
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_fattn_Dk_Dv` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-04 |
| **Updated** | 2025-03-06 |
+| **Merged** | 2025-03-05 |
---
-#### Description
+## 📄 Description
This PR makes the CUDA FA implementation work when the V head size is not the same as the K head size (e.g., DeepSeek-Lite/V3/R1).
@@ -50,9 +53,9 @@ To limit the already excessive CUDA build time, I have only allowed K- and V-cac
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-03-04** at **09:51:25**:
+👤 **ikawrakow** commented on **2025-03-04** at **09:51:25**
I'm by no means a CUDA programming expert, so I thought it is interesting to see if a CUDA beginner can compete with `llama.cpp` CUDA performance where there is an actual CUDA expert making continuous improvements. Here is a comparison between this PR and mainline `llama.cpp` (latest build as of this writing, `build: 1a24c462 (4820)`). Mainline `llama-bench` does not have the `-gp` option to measure TG performance for a given KV cache size, so to simulate the presence of some not negligible KV cache, I use `tg1024` for TG performance.
| model | test | t/s (llama.cpp) | t/s (ik_llama) | Speedup |
@@ -71,19 +74,19 @@ Why are the `ik_llama.cpp` values different from the above tables? For the PR te
---
-👤 **davidsyoung** commented the **2025-03-04** at **19:08:54**:
+👤 **davidsyoung** commented on **2025-03-04** at **19:08:54**
Cooking! Serious good work. I don't believe there's any package that has FA implemented like this yet.
---
-👤 **davidsyoung** commented the **2025-03-06** at **15:02:29**:
+👤 **davidsyoung** commented on **2025-03-06** at **15:02:29**
This PR from mainline llama.cpp may help with implementing MLA FA https://github.com/ggml-org/llama.cpp/pull/12227
---
-👤 **ikawrakow** commented the **2025-03-06** at **15:24:03**:
+👤 **ikawrakow** commented on **2025-03-06** at **15:24:03**
> This PR from mainline llama.cpp may help with implementing MLA FA https://github.com/ggml-org/llama.cpp/pull/12227
@@ -91,9 +94,9 @@ Ha, this is exactly what I wanted to avoid and have avoided in the CPU implement
---
-👤 **davidsyoung** commented the **2025-03-06** at **15:31:11**:
+👤 **davidsyoung** commented on **2025-03-06** at **15:31:11**
-> > This PR from mainline llama.cpp may help with implementing MLA FA [ggml-org/llama.cpp#12227](https://github.com/ggml-org/llama.cpp/pull/12227)
+> > This PR from mainline llama.cpp may help with implementing MLA FA [ggml-org/llama.cpp[#12227](https://github.com/ikawrakow/ik_llama.cpp/issues/12227)](https://github.com/ggml-org/llama.cpp/pull/12227)
>
> Ha, this is exactly what I wanted to avoid and have avoided in the CPU implementation (unnecessarily crunching numbers to only throw them away). The "head" dimensions with MLA are 576 (K) and 512 (V). What the PR does is to use 576 for K and V, and then cuts away the last 64 elements in each row of the FA result. As the multiplication with V with `softmax(K*Q)` is about 2/3 of the total FA computing time (at least on the CPU), this adds a performance penalty of about `2/3*64/512 = 8%`. I'll try a bit more and if I fail, I'll do this for CUDA. There aren't any performance numbers in the PR description. I wouldn't be surprised that this is because performance is lower than just MLA.
@@ -101,23 +104,23 @@ That makes sense. I did see your current implementation is different than the ap
---
-👤 **jukofyork** commented the **2025-03-06** at **15:59:38**:
+👤 **jukofyork** commented on **2025-03-06** at **15:59:38**
-I'd hold off and see what @JohannesGaessler says, as the CUDA version either don't like the "Multi-Query Attention" (MQA) (ie: 1 K/V for 128 Q) and/or the 576 head dimension, as FA is using huge amounts of compute compared to non-FA at the same context...
+I'd hold off and see what @JohannesGaessler says, as the CUDA version either doesn't like the "Multi-Query Attention" (MQA) (ie: 1 K/V for 128 Q) and/or the 576 head dimension, as FA is using huge amounts of compute compared to non-FA at the same context...
-The non-FA half of the PR might be useful for `ik_llama.cpp`'s `-mla` option though, as I've got rid of all the batched-matrix-multiplies and turned it into just a huge 2D x 2D matrix multiply instead.
+The non-FA half of the PR might be useful for `ik_llama.cpp`'s `-mla` option though, as I've got rid of all the permuted batched-matrix-multiplies and turned the KV calculation into just a huge 2D x 2D matrix multiply instead.
---
-👤 **jukofyork** commented the **2025-03-06** at **16:01:34**:
+👤 **jukofyork** commented on **2025-03-06** at **16:01:34**
> There aren't any performance numbers in the PR description. I wouldn't be surprised that this is because performance is lower than just MLA.
-It's running absolutely horrible at long contexts for CUDA - way way worse than these extra 64 values!
+It's running absolutely horrible at long contexts for CUDA - way way worse than these extra 64 values would cause.
---
-👤 **ikawrakow** commented the **2025-03-06** at **16:13:32**:
+👤 **ikawrakow** commented on **2025-03-06** at **16:13:32**
> The non-FA half of the PR might be useful for ik_llama.cpp's -mla option though, as I've got rid of all the batched-matrix-multiplies and turned it into just a huge 2D x 2D matrix multiply instead.
@@ -125,13 +128,13 @@ I kept those on purpose. This allows to batch-process `V*softmax(K*Q)` when the
---
-👤 **JohannesGaessler** commented the **2025-03-06** at **16:19:24**:
+👤 **JohannesGaessler** commented on **2025-03-06** at **16:19:24**
For the split buffers specifically my long-term goal is to move the parallelization logic to the ggml graph level. I intend to do this when optimizing training performance (so probably at some point in the next 12 months). After that the code should become more simpler and easier to work with.
---
-👤 **ikawrakow** commented the **2025-03-06** at **16:33:48**:
+👤 **ikawrakow** commented on **2025-03-06** at **16:33:48**
> so probably at some point in the next 12 months
@@ -139,15 +142,15 @@ But people want to run DeepSeek now and not in 12 months :smile:
---
-👤 **jukofyork** commented the **2025-03-06** at **17:09:53**:
+👤 **jukofyork** commented on **2025-03-06** at **17:09:53**
> This is enabled via `-amb` value, whene the value is the maximum size for K*Q we want to tolerate in MiB.
-This looks like a good alternative to reducing memory use if ultimately a head size of 576 isn't feasible. I've currently just been dropping `ubtach-size` as I increase the context, but your `-amb` option would let me keep the larger batch size for everything else.
+This looks like a good alternative to reducing memory use if ultimately a head size of 576 isn't feasible. I've currently just been dropping `ubatch-size` as I increase the context, but your `-amb` option would let me keep the larger batch size for everything else.
---
-👤 **ikawrakow** commented the **2025-03-06** at **17:48:30**:
+👤 **ikawrakow** commented on **2025-03-06** at **17:48:30**
> I've currently just been dropping ubatch-size as I increase the context...
@@ -155,7 +158,7 @@ This leads to horrible performance for MoE models, especially MoE models such as
---
-👤 **davidsyoung** commented the **2025-03-06** at **18:04:26**:
+👤 **davidsyoung** commented on **2025-03-06** at **18:04:26**
> > This is enabled via `-amb` value, whene the value is the maximum size for K*Q we want to tolerate in MiB.
>
@@ -173,7 +176,7 @@ Can see some generation stats here https://github.com/ikawrakow/ik_llama.cpp/pul
---
-👤 **jukofyork** commented the **2025-03-06** at **18:12:54**:
+👤 **jukofyork** commented on **2025-03-06** at **18:12:54**
> > I've currently just been dropping ubatch-size as I increase the context...
>
@@ -187,4 +190,19 @@ I still like your method better though and agree it is vastly preferable to drop
---
-One other thing I've noticed with large contexts and `deepseek-r1` is the use of YaRN and the need for the K-cache to stores pre-RoPEed values, means that as you raise the context length too much; the model starts to get dumber and dumber. For story writing the optimal context length I've found is somewhere between 16k and 32k (4k is pretty bad too, even though that is the pre-YaRN training context).
\ No newline at end of file
+One other thing I've noticed with large contexts and `deepseek-r1` is the use of YaRN and the need for the K-cache to store "pre-RoPEed" values, means that as you raise the context length too much; the model starts to get dumber and dumber. For story writing the optimal context length I've found is somewhere between 16k and 32k (4k is pretty bad too, even though that is the pre-YaRN training context).
+
+---
+
+👤 **jukofyork** commented on **2025-03-06** at **18:16:58**
+
+> > > This is enabled via `-amb` value, whene the value is the maximum size for K*Q we want to tolerate in MiB.
+> >
+> >
+> > This looks like a good alternative to reducing memory use if ultimately a head size of 576 isn't feasible. I've currently just been dropping `ubatch-size` as I increase the context, but your `-amb` option would let me keep the larger batch size for everything else.
+>
+> For what it’s worth, this works _incredibly well_!
+>
+> Can see some generation stats here [#237](https://github.com/ikawrakow/ik_llama.cpp/issues/237)
+
+Yeah, I can see this being really useful and a good alternative to using FA if you are low on VRAM.
\ No newline at end of file
diff --git a/github-data/pull_requests/243 - Better FlashMLA.md b/github-data/pull_requests/243 - Better FlashMLA.md
index 0a6b1e8aa..23f12c303 100644
--- a/github-data/pull_requests/243 - Better FlashMLA.md
+++ b/github-data/pull_requests/243 - Better FlashMLA.md
@@ -1,14 +1,17 @@
-### 🔀 [#243](https://github.com/ikawrakow/ik_llama.cpp/pull/243) - Better FlashMLA
+## 🔀 [Pull Request #243](https://github.com/ikawrakow/ik_llama.cpp/pull/243) - Better FlashMLA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/better_tg_fattn` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-06 |
| **Updated** | 2025-03-07 |
+| **Merged** | 2025-03-07 |
---
-#### Description
+## 📄 Description
This PR improves FlashMLA performance on the CPU for token generation (TG) with long contexts. The same strategy should also improve FA performance of GQA models, but something is not quite right there, so I have enabled only for MLA for now.
@@ -31,9 +34,9 @@ To put the above table into perspective, TG speed with a context of 16k tokens i
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-03-07** at **07:46:44**:
+👤 **ikawrakow** commented on **2025-03-07** at **07:46:44**
The above table is for `Q8_KV` KV cache. Here is a comparison between the main branch and this PR for `fp16` KV cache:
diff --git a/github-data/pull_requests/244 - Custom quantization rules with regular expressions.md b/github-data/pull_requests/244 - Custom quantization rules with regular expressions.md
index 7d15ad709..95ddec33b 100644
--- a/github-data/pull_requests/244 - Custom quantization rules with regular expressions.md
+++ b/github-data/pull_requests/244 - Custom quantization rules with regular expressions.md
@@ -1,14 +1,17 @@
-### 🔀 [#244](https://github.com/ikawrakow/ik_llama.cpp/pull/244) - Custom quantization rules with regular expressions
+## 🔀 [Pull Request #244](https://github.com/ikawrakow/ik_llama.cpp/pull/244) - Custom quantization rules with regular expressions
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/custom_q_rules` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-06 |
| **Updated** | 2025-03-07 |
+| **Merged** | 2025-03-07 |
---
-#### Description
+## 📄 Description
For DeepSeekV3/R1 it is handy to be able to define custom rules for picking quantization types for the various tensors. Well, this is useful in general, but particularly useful for very large models where one wants to squeeze the last bit of quantized model quality for the smallest possible model size.
@@ -32,8 +35,8 @@ To summarize how the quantization type is determined:
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-03-06** at **17:58:36**:
+👤 **davidsyoung** commented on **2025-03-06** at **17:58:36**
This is awesome. It’ll come in really useful!
\ No newline at end of file
diff --git a/github-data/pull_requests/246 - Faster FlashMLA prompt processing.md b/github-data/pull_requests/246 - Faster FlashMLA prompt processing.md
index 98f980c6c..6c030f2e4 100644
--- a/github-data/pull_requests/246 - Faster FlashMLA prompt processing.md
+++ b/github-data/pull_requests/246 - Faster FlashMLA prompt processing.md
@@ -1,14 +1,17 @@
-### 🔀 [#246](https://github.com/ikawrakow/ik_llama.cpp/pull/246) - Faster FlashMLA prompt processing
+## 🔀 [Pull Request #246](https://github.com/ikawrakow/ik_llama.cpp/pull/246) - Faster FlashMLA prompt processing
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/flash_mla_2` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-08 |
| **Updated** | 2025-03-08 |
+| **Merged** | 2025-03-08 |
---
-#### Description
+## 📄 Description
MLA as used in the DeepSeek models is great for token generation (TG), but prompt processing (PP) speed is much lower compared to standard attention even with FA enabled.
@@ -65,9 +68,9 @@ So, how can we improve? We can rearrange the computation back to standard attent
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-03-08** at **14:58:12**:
+👤 **davidsyoung** commented on **2025-03-08** at **14:58:12**
Getting a linking error on `iqk_flash_attn_noalibi`:
@@ -79,13 +82,13 @@ Getting a linking error on `iqk_flash_attn_noalibi`:
---
-👤 **ikawrakow** commented the **2025-03-08** at **15:07:14**:
+👤 **ikawrakow** commented on **2025-03-08** at **15:07:14**
Are you using `cmake` to build? The object file for the new file that I added (`iqk_flash_attn.cpp`) is missing from the link command. It should be automatically added with `cmake`.
---
-👤 **davidsyoung** commented the **2025-03-08** at **15:20:58**:
+👤 **davidsyoung** commented on **2025-03-08** at **15:20:58**
> Are you using `cmake` to build? The object file for the new file that I added (`iqk_flash_attn.cpp`) is missing from the link command. It should be automatically added with `cmake`.
diff --git a/github-data/pull_requests/247 - FlashMLA on CUDA.md b/github-data/pull_requests/247 - FlashMLA on CUDA.md
index 82673995c..1e03f5b66 100644
--- a/github-data/pull_requests/247 - FlashMLA on CUDA.md
+++ b/github-data/pull_requests/247 - FlashMLA on CUDA.md
@@ -1,14 +1,17 @@
-### 🔀 [#247](https://github.com/ikawrakow/ik_llama.cpp/pull/247) - FlashMLA on CUDA
+## 🔀 [Pull Request #247](https://github.com/ikawrakow/ik_llama.cpp/pull/247) - FlashMLA on CUDA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/flash_mla_4` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-08 |
| **Updated** | 2025-03-09 |
+| **Merged** | 2025-03-09 |
---
-#### Description
+## 📄 Description
This PR adds FlasMLA on CUDA. It is enabled via `-mla 2 -fa`.
@@ -28,26 +31,26 @@ Prompt processing speed is massively improved for long contexts, and is almost o
The KV cache is the same size as `mla = 2` without FA (i.e., the smallest possible). One no longer needs to worry about controlling the maximum compute buffer size via `-amb`.
**Caveats:**
-* Only `f16` KV cache can be used for now. As explained in PR #246 we need to convert the KV cache to `fp32` to be able to do the required operations, and the CUDA back-end does not yet support this conversion for quantized data types.
+* Only `f16` KV cache can be used for now. As explained in PR [#246](https://github.com/ikawrakow/ik_llama.cpp/issues/246) we need to convert the KV cache to `fp32` to be able to do the required operations, and the CUDA back-end does not yet support this conversion for quantized data types.
* There is an avoidable increase in compute buffer size that is proportional to the maximum context length (to hold the KV cache converted to `f32` and other intermediate results. This is required on every GPU that performs attention computations. For DeepSeek-Lite and context length of 32k tokens the CUDA compute buffer is 1404 MiB. It shuldn't be much bigger for DeepSeekV3/R1.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-03-08** at **23:33:14**:
+👤 **davidsyoung** commented on **2025-03-08** at **23:33:14**
Thank you very much for this. Working on getting layers balanced best I can to give this a proper run. Will report back.
---
-👤 **saood06** commented the **2025-03-09** at **03:49:55**:
+👤 **saood06** commented on **2025-03-09** at **03:49:55**
@davidsyoung I actually just realized for your setup you might be able to fit the AWQ version of Deepseek R1, with a tensor parallel of 16 using [sglang](https://github.com/sgl-project/sglang), it would be interesting to see how the performance compares as it is that is actually the recommend backed for DeepSeek, and they now have Multi-token prediction support with speculative decoding which is an optimization that is not present here (and would actually require another change to the GGUF as the MTP layer is not in the current GGUF file (similar to the situation with the tensors added for MLA attention).
---
-👤 **davidsyoung** commented the **2025-03-09** at **08:56:11**:
+👤 **davidsyoung** commented on **2025-03-09** at **08:56:11**
> @davidsyoung I actually just realized for your setup you might be able to fit the AWQ version of Deepseek R1, with a tensor parallel of 16 using [sglang](https://github.com/sgl-project/sglang), it would be interesting to see how the performance compares as it is that is actually the recommend backed for DeepSeek, and they now have Multi-token prediction support with speculative decoding which is an optimization that is not present here (and would actually require another change to the GGUF as the MTP layer is not in the current GGUF file (similar to the situation with the tensors added for MLA attention).
@@ -61,7 +64,7 @@ But, tbh, at the rate @ikawrakow has been going here it wouldn’t surprise me i
---
-👤 **ikawrakow** commented the **2025-03-09** at **09:03:04**:
+👤 **ikawrakow** commented on **2025-03-09** at **09:03:04**
> But, tbh, at the rate @ikawrakow has been going here it wouldn’t surprise me if we’d see MTP much sooner rather than later!
@@ -69,7 +72,7 @@ I have been wondering about that. Why has nobody added the MTP layer to the `lla
---
-👤 **saood06** commented the **2025-03-09** at **10:52:15**:
+👤 **saood06** commented on **2025-03-09** at **10:52:15**
> I have been wondering about that. Why has nobody added the MTP layer to the `llama.cpp` GGUF?
diff --git a/github-data/pull_requests/248 - Faster MoE token generation on CUDA.md b/github-data/pull_requests/248 - Faster MoE token generation on CUDA.md
index c12963568..eac32a60c 100644
--- a/github-data/pull_requests/248 - Faster MoE token generation on CUDA.md
+++ b/github-data/pull_requests/248 - Faster MoE token generation on CUDA.md
@@ -1,14 +1,17 @@
-### 🔀 [#248](https://github.com/ikawrakow/ik_llama.cpp/pull/248) - Faster MoE token generation on CUDA
+## 🔀 [Pull Request #248](https://github.com/ikawrakow/ik_llama.cpp/pull/248) - Faster MoE token generation on CUDA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_faster_moe_tg` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-09 |
| **Updated** | 2025-03-10 |
+| **Merged** | 2025-03-10 |
---
-#### Description
+## 📄 Description
This PR adds special purpose matrix-vector multiplications for MoE models.
diff --git a/github-data/pull_requests/250 - DeepSeek imatrix stuff.md b/github-data/pull_requests/250 - DeepSeek imatrix stuff.md
index 55b68f77d..477592502 100644
--- a/github-data/pull_requests/250 - DeepSeek imatrix stuff.md
+++ b/github-data/pull_requests/250 - DeepSeek imatrix stuff.md
@@ -1,14 +1,17 @@
-### 🔀 [#250](https://github.com/ikawrakow/ik_llama.cpp/pull/250) - DeepSeek imatrix stuff
+## 🔀 [Pull Request #250](https://github.com/ikawrakow/ik_llama.cpp/pull/250) - DeepSeek imatrix stuff
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mla_imatrix` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-10 |
| **Updated** | 2025-03-10 |
+| **Merged** | 2025-03-10 |
---
-#### Description
+## 📄 Description
In DeepSeek models there are two additional tensors, `*attn_k_b.weight` and `*attn_v_b.weight` required for MLA. When MLA is enabled, these will get used for attention computation. When standard attention is used, then the `*attn_kv_b.weight` tensors are used instead. Hence, when one has used standard attention to compute the imatrix, there will be no data for `*attn_k_b.weight` and `*attn_v_b.weight`; if one uses MLA, then there will be no data for `*attn_kv_b.weight`. As the `*attn_v_b.weight` tensors are simply the lower half of `*attn_kv_b.weight` (i.e., the second half of rows), they "see" the exact same activations as the `*attn_kv_b.weight` tensors. This PR takes advantage of this and enables the usage of `*attn_kv_b.weight` imatrix data for `*attn_v_b.weight` and vice versa.
@@ -16,9 +19,9 @@ The situation with `*attn_k_b.weight` is more tricky and will require a much bi
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-03-10** at **14:24:47**:
+👤 **davidsyoung** commented on **2025-03-10** at **14:24:47**
This is great, for lack of better understanding, if I am using an imatrix file that I assume was computed with standard attention, and I re-compute now, I should see better performance due to the `attn_v_b.weight` tensor now having imatrix data?
@@ -26,13 +29,13 @@ It's still of course lacking the imatrix data for `attn_k_b.weight` tensor. It w
---
-👤 **ikawrakow** commented the **2025-03-10** at **15:08:27**:
+👤 **ikawrakow** commented on **2025-03-10** at **15:08:27**
If you are quantizing the attention tensors to `q8_0` you will not see a difference. The imatrix helps a lot for 1-, 2-, and 3-bit quantization, has a more modest impact at 4 bits, has almost no impact at 5 bits, and has basically no impact at 6+ bits.
---
-👤 **davidsyoung** commented the **2025-03-10** at **15:21:47**:
+👤 **davidsyoung** commented on **2025-03-10** at **15:21:47**
> If you are quantizing the attention tensors to `q8_0` you will not see a difference. The imatrix helps a lot for 1-, 2-, and 3-bit quantization, has a more modest impact at 4 bits, has almost no impact at 5 bits, and has basically no impact at 6+ bits.
diff --git a/github-data/pull_requests/251 - Try using fp32 for FlashMLA.md b/github-data/pull_requests/251 - Try using fp32 for FlashMLA.md
index 802ee1f0c..63c5b85bc 100644
--- a/github-data/pull_requests/251 - Try using fp32 for FlashMLA.md
+++ b/github-data/pull_requests/251 - Try using fp32 for FlashMLA.md
@@ -1,15 +1,23 @@
-### 🔀 [#251](https://github.com/ikawrakow/ik_llama.cpp/pull/251) - Try using fp32 for FlashMLA
+## 🔀 [Pull Request #251](https://github.com/ikawrakow/ik_llama.cpp/pull/251) - Try using fp32 for FlashMLA
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/flash_precision` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-10 |
| **Updated** | 2025-03-12 |
---
-#### 💬 Conversation
+## 📄 Description
-👤 **ikawrakow** commented the **2025-03-12** at **07:51:20**:
+_No description provided._
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-03-12** at **07:51:20**
Closing this as the numerical issues were caused by `fp16` experts matrix multiplications.
\ No newline at end of file
diff --git a/github-data/pull_requests/252 - MLA-2 Allow usage of q8_0 for KV cache on CUDA.md b/github-data/pull_requests/252 - MLA-2 Allow usage of q8_0 for KV cache on CUDA.md
new file mode 100644
index 000000000..258adc7d9
--- /dev/null
+++ b/github-data/pull_requests/252 - MLA-2 Allow usage of q8_0 for KV cache on CUDA.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #252](https://github.com/ikawrakow/ik_llama.cpp/pull/252) - MLA-2: Allow usage of q8_0 for KV cache on CUDA
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_flash_mla_q8_0` |
+| **Target Branch** | `main` |
+| **Created** | 2025-03-12 |
+| **Updated** | 2025-03-12 |
+| **Merged** | 2025-03-12 |
+
+---
+
+## 📄 Description
+
+Performance is slightly lower than `f16` KV cache but not too bad.
\ No newline at end of file
diff --git a/github-data/pull_requests/252 - MLA-2_ Allow usage of q8_0 for KV cache on CUDA.md b/github-data/pull_requests/252 - MLA-2_ Allow usage of q8_0 for KV cache on CUDA.md
deleted file mode 100644
index 75d2aebe6..000000000
--- a/github-data/pull_requests/252 - MLA-2_ Allow usage of q8_0 for KV cache on CUDA.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🔀 [#252](https://github.com/ikawrakow/ik_llama.cpp/pull/252) - MLA-2: Allow usage of q8_0 for KV cache on CUDA
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-03-12 |
-| **Updated** | 2025-03-12 |
-
----
-
-#### Description
-
-Performance is slightly lower than `f16` KV cache but not too bad.
\ No newline at end of file
diff --git a/github-data/pull_requests/253 - FlashMLA-2 _CPU_ faster and smaller compute buffer size.md b/github-data/pull_requests/253 - FlashMLA-2 CPU faster and smaller compute buffer size.md
similarity index 86%
rename from github-data/pull_requests/253 - FlashMLA-2 _CPU_ faster and smaller compute buffer size.md
rename to github-data/pull_requests/253 - FlashMLA-2 CPU faster and smaller compute buffer size.md
index aece28f59..ac1dd7ce7 100644
--- a/github-data/pull_requests/253 - FlashMLA-2 _CPU_ faster and smaller compute buffer size.md
+++ b/github-data/pull_requests/253 - FlashMLA-2 CPU faster and smaller compute buffer size.md
@@ -1,14 +1,17 @@
-### 🔀 [#253](https://github.com/ikawrakow/ik_llama.cpp/pull/253) - FlashMLA-2 (CPU): faster and smaller compute buffer size
+## 🔀 [Pull Request #253](https://github.com/ikawrakow/ik_llama.cpp/pull/253) - FlashMLA-2 (CPU): faster and smaller compute buffer size
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/flash_mla2_no_f32` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-12 |
| **Updated** | 2025-03-13 |
+| **Merged** | 2025-03-13 |
---
-#### Description
+## 📄 Description
This PR improves the CPU implementation of FlashMLA in 3 ways:
* Faster prompt processing - about 13% improvement for a context of 16k tokens
@@ -50,8 +53,8 @@ I did a quick attempt to also implement on CUDA, but something wasn't working, s
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-03-12** at **14:53:09**:
+👤 **davidsyoung** commented on **2025-03-12** at **14:53:09**
Nice! The compute buffer on CUDA makes it hard to balance model layers with the compute buffer, so when you manage to get CUDA implementation working it'll be amazing. Thank you for your work on this
\ No newline at end of file
diff --git a/github-data/pull_requests/259 - Prepare wk_b tensors of DeepSeek models on the fly.md b/github-data/pull_requests/259 - Prepare wk_b tensors of DeepSeek models on the fly.md
index b9d01f8b1..1036cc04e 100644
--- a/github-data/pull_requests/259 - Prepare wk_b tensors of DeepSeek models on the fly.md
+++ b/github-data/pull_requests/259 - Prepare wk_b tensors of DeepSeek models on the fly.md
@@ -1,14 +1,17 @@
-### 🔀 [#259](https://github.com/ikawrakow/ik_llama.cpp/pull/259) - Prepare wk_b tensors of DeepSeek models on the fly
+## 🔀 [Pull Request #259](https://github.com/ikawrakow/ik_llama.cpp/pull/259) - Prepare wk_b tensors of DeepSeek models on the fly
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/prepare_wk_b` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-15 |
| **Updated** | 2025-03-17 |
+| **Merged** | 2025-03-17 |
---
-#### Description
+## 📄 Description
This enables usage of MLA also for model files that were converted with mainline `llama.cpp` and hence to not contain the tensors required for MLA.
@@ -20,9 +23,9 @@ Oh, when `wkv_b` is not quantized, `wk_b` uses the same type as `wkv_b` (`fp16`
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-03-15** at **16:27:08**:
+👤 **ubergarm** commented on **2025-03-15** at **16:27:08**
Thanks for pushing this branch, I decided to try this first before downloading/generating my own MLA quant.
@@ -36,6 +39,13 @@ git checkout ik/prepare_wk_b
cmake -B ./build -DCMAKE_BUILD_TYPE=Debug -DGGML_CUDA=ON -DGGML_BLAS=OFF
cmake --build ./build --config Debug -j $(nproc)
+git rev-parse --short HEAD
+1324de97
+
+./build/bin/llama-server --version
+version: 3594 (1324de97)
+built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+
# try it with existing non-MLA quant
CUDA_VISIBLE_DEVICES="0," \
gdb ./build/bin/llama-server
@@ -105,13 +115,13 @@ warning: 44 ./nptl/pthread_kill.c: No such file or directory
---
-👤 **ikawrakow** commented the **2025-03-15** at **16:37:09**:
+👤 **ikawrakow** commented on **2025-03-15** at **16:37:09**
Sorry about that. Hope the fix I just pushed will work.
---
-👤 **ubergarm** commented the **2025-03-15** at **17:11:41**:
+👤 **ubergarm** commented on **2025-03-15** at **17:11:41**
All good, happy to try this out. Great, it does startup okay now!
@@ -263,19 +273,41 @@ INFO [ update_slots] all slots are idle | tid="136342914363392" times
---
-👤 **ubergarm** commented the **2025-03-15** at **17:17:59**:
+👤 **ubergarm** commented on **2025-03-15** at **17:17:59**
-Confirmed similar wonky generations using `./build/bin/llama-cli` to take my client out of the picture.
+Confirmed similar wonky generations using `./build/bin/llama-cli` to take my client out of the picture.
+
+Also currently trying some other combinations. This one with `-mla 1` spammed the logs like so:
+
+```
+CUDA_VISIBLE_DEVICES="0," \
+./build/bin/llama-cli \
+ --alias unsloth/DeepSeek-R1-UD-Q2_K_XL \
+ --model /mnt/raid/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
+ --ctx-size 8192 \
+ --parallel 1 \
+ -mla 1 -fa \
+ --n-gpu-layers 63 \
+ --override-tensor exps=CPU \
+ --threads 24
+
+Unsupported KV type combination for head_sizes 576 / 512
+Unsupported KV type combination for head_sizes 576 / 512
+Unsupported KV type combination for head_sizes 576 / 512
+Unsupported KV type combination for head_sizes 576 / 512
+```
+
+No pressure to stay up late looking at this, I'm having fun. Enjoy your weekend!
---
-👤 **ikawrakow** commented the **2025-03-15** at **17:41:33**:
+👤 **ikawrakow** commented on **2025-03-15** at **17:41:33**
Yes, I see similar behavior with DeepSeek-Lite. I broke something somewhere and need to investigate. I got confused and tested with options that did not actually trigger the usage of the computed tensors.
---
-👤 **saood06** commented the **2025-03-16** at **00:44:48**:
+👤 **saood06** commented on **2025-03-16** at **00:44:48**
> Also currently trying some other combinations. This one with `-mla 1` spammed the logs like so:
>
@@ -301,7 +333,7 @@ I think this is because -mla 1 -fa is currently only supported on the CPU and no
---
-👤 **ikawrakow** commented the **2025-03-16** at **06:25:30**:
+👤 **ikawrakow** commented on **2025-03-16** at **06:25:30**
@ubergarm Thank you for playing with this, it is very helpful.
@@ -313,7 +345,7 @@ I'm surprised by the giant CUDA compute buffer for a context of 65k. This basica
---
-👤 **ubergarm** commented the **2025-03-16** at **14:38:44**:
+👤 **ubergarm** commented on **2025-03-16** at **14:38:44**
@ikawrakow
@@ -329,9 +361,11 @@ Perfect, I'll add a note in my rough guide. I still haven't fully grokk'd the im
---
-👤 **ubergarm** commented the **2025-03-16** at **15:03:50**:
+👤 **ubergarm** commented on **2025-03-16** at **15:03:50**
-*WIP*
+Looks good!
+
+The most recent patch seems to work on the unsloth `UD-Q2_K_XL` quant I have been using with `-mla 2 -fa` etc. The output generations look good for a few simple tests including an ~8k prompt with results shown below.
#### Update Branch
```bash
@@ -347,7 +381,7 @@ version: 3596 (f2fb15de)
#### Test
```bash
-# Uses about 22GiB VRAM @ 32k context
+# Uses about 21GiB VRAM @ 32k context
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
--alias unsloth/DeepSeek-R1-UD-Q2_K_XL \
@@ -372,6 +406,8 @@ Open the details fold for complete logs.
Collapsed Logs
+#### Server
+Running script containing above command.
```bash
$ ./myscripts/api-server-DeepSeek-R1-UD-Q2_K_XL.sh
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
@@ -657,7 +693,16 @@ INFO [ log_server_request] request | tid="137349052751872" timestamp=174213
INFO [ launch_slot_with_task] slot is processing task | tid="137362671300608" timestamp=1742137148 id_slot=0 id_task=551
INFO [ update_slots] kv cache rm [p0, end) | tid="137362671300608" timestamp=1742137148 id_slot=0 id_task=551 p0=2
INFO [ update_slots] kv cache rm [p0, end) | tid="137362671300608" timestamp=1742137179 id_slot=0 id_task=551 p0=2050
-
+INFO [ update_slots] kv cache rm [p0, end) | tid="137362671300608" timestamp=1742137211 id_slot=0 id_task=551 p0=4098
+INFO [ update_slots] kv cache rm [p0, end) | tid="137362671300608" timestamp=1742137247 id_slot=0 id_task=551 p0=6146
+INFO [ update_slots] kv cache rm [p0, end) | tid="137362671300608" timestamp=1742137285 id_slot=0 id_task=551 p0=8194
+INFO [ print_timings] prompt eval time = 146792.23 ms / 8693 tokens ( 16.89 ms per token, 59.22 tokens per second) | tid="137362671300608" timestamp=1742137370 id_slot=0 id_task=551 t_prompt_processing=146792.227 n_prompt_tokens_processed=8693 t_token=16.88625641320603 n_tokens_second=59.2197569153304
+INFO [ print_timings] generation eval time = 75395.69 ms / 907 runs ( 83.13 ms per token, 12.03 tokens per second) | tid="137362671300608" timestamp=1742137370 id_slot=0 id_task=551 t_token_generation=75395.694 n_decoded=907 t_token=83.12645424476295 n_tokens_second=12.029864729410143
+INFO [ print_timings] total time = 222187.92 ms | tid="137362671300608" timestamp=1742137370 id_slot=0 id_task=551 t_prompt_processing=146792.227 t_token_generation=75395.694 t_total=222187.92100000003
+INFO [ update_slots] slot released | tid="137362671300608" timestamp=1742137370 id_slot=0 id_task=551 n_ctx=32768 n_past=9601 n_system_tokens=0 n_cache_tokens=9601 truncated=false
+INFO [ update_slots] all slots are idle | tid="137362671300608" timestamp=1742137370
+INFO [ log_server_request] request | tid="137349044359168" timestamp=1742137370 remote_addr="127.0.0.1" remote_port=35304 status=200 method="POST" path="/v1/chat/completions" params={}
+INFO [ update_slots] all slots are idle | tid="137362671300608" timestamp=1742137370
```
@@ -665,14 +710,84 @@ INFO [ update_slots] kv cache rm [p0, end) | tid="137362671300608" ti
---
-👤 **ubergarm** commented the **2025-03-16** at **21:49:12**:
+👤 **ubergarm** commented on **2025-03-16** at **15:19:51**
+
+> The KV buffer size is exactly as expected `(576 * n_ctx * 61 * sizeof(f16))`
+
+#### VRAM Usage vs `--ctx-size`
+A few examples running exact command as above and varying only context length. Note I was using `-ctk q8_0 -ctv q8_0`:
+```
+#####
+## --ctx-size 65536
+30410MiB
+
+llm_load_tensors: offloading 61 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 62/62 layers to GPU
+llm_load_tensors: CPU buffer size = 205716.00 MiB
+llm_load_tensors: CPU buffer size = 497.11 MiB
+llm_load_tensors: CUDA0 buffer size = 9885.95 MiB
+...
+llama_kv_cache_init: CUDA0 KV buffer size = 2333.28 MiB
+llama_new_context_with_model: KV self size = 2333.25 MiB, c^KV (q8_0): 2333.25 MiB, kv^T: not used
+llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 16785.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 142.01 MiB
+llama_new_context_with_model: graph nodes = 3548
+llama_new_context_with_model: graph splits = 118
+
+#####
+## --ctx-size 32768
+20930MiB
+
+llm_load_tensors: offloading 61 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 62/62 layers to GPU
+llm_load_tensors: CPU buffer size = 205716.00 MiB
+llm_load_tensors: CPU buffer size = 497.11 MiB
+llm_load_tensors: CUDA0 buffer size = 9885.95 MiB
+...
+llama_kv_cache_init: CUDA0 KV buffer size = 1166.65 MiB
+llama_new_context_with_model: KV self size = 1166.62 MiB, c^KV (q8_0): 1166.62 MiB, kv^T: not used
+llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 8470.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 78.01 MiB
+llama_new_context_with_model: graph nodes = 3548
+llama_new_context_with_model: graph splits = 118
+
+#####
+## --ctx-size 16384
+16146MiB VRAM
+
+llm_load_tensors: offloading 61 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 62/62 layers to GPU
+llm_load_tensors: CPU buffer size = 205716.00 MiB
+llm_load_tensors: CPU buffer size = 497.11 MiB
+llm_load_tensors: CUDA0 buffer size = 9885.95 MiB
+...
+llama_kv_cache_init: CUDA0 KV buffer size = 583.34 MiB
+llama_new_context_with_model: KV self size = 583.31 MiB, c^KV (q8_0): 583.31 MiB, kv^T: not used
+llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 4270.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 64.01 MiB
+llama_new_context_with_model: graph nodes = 3548
+llama_new_context_with_model: graph splits = 118
+```
+
+---
+
+👤 **ubergarm** commented on **2025-03-16** at **21:49:12**
Confirmed it is working with three different unsloth quants on that intel6980P. Fastest CPU only speeds I've been able to achieve with this rig!
+#### Benchmarks
+🪄✨👇
- Benchmarks
+ Dual Socket Intel Xeon 6980P
+## Single Socket
```
$ git rev-parse --short HEAD
f2fb15de
@@ -729,4 +844,171 @@ numactl -N 0 -m 0 \
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU | 172 | 1 | 2 | 2048 | 1 | 1 | pp512 | 113.38 ± 0.68 |
| deepseek2 671B Q2_K - Medium | 211.03 GiB | 671.03 B | CPU | 172 | 1 | 2 | 2048 | 1 | 1 | tg128 | 8.78 ± 0.00 |
+
+#### Compre `-mla 1,2`
+
+```
+numactl -N 0 -m 0 \
+./build/bin/llama-bench \
+ --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
+ -rtr 1 \
+ -ctk f16 -ctv f16 \
+ -mla 2,1 -fa 1 \
+ -amb 2048 \
+ -fmoe 1 \
+ --numa numactl \
+ --threads 43,64,86,128
+
+Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+.
+.
+.
+Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+============ Repacked 663 tensors
+```
+
+| model | size | params | backend | threads | fa | mla | amb | rtr | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --: | ----: | --: | ---: | ------------: | ---------------: |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 43 | 1 | 2 | 2048 | 1 | 1 | pp512 | 70.20 ± 0.22 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 43 | 1 | 2 | 2048 | 1 | 1 | tg128 | 8.52 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | pp512 | 92.37 ± 0.21 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | tg128 | 9.75 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | pp512 | 115.09 ± 0.45 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | tg128 | 9.32 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | pp512 | 143.12 ± 7.15 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | tg128 | 8.97 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 43 | 1 | 2 | 2048 | 1 | 1 | pp512 | 70.20 ± 0.22 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 43 | 1 | 2 | 2048 | 1 | 1 | tg128 | 8.52 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | pp512 | 92.37 ± 0.21 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | tg128 | 9.75 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | pp512 | 115.09 ± 0.45 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | tg128 | 9.32 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | pp512 | 143.12 ± 7.15 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | tg128 | 8.97 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 43 | 1 | 1 | 2048 | 1 | 1 | pp512 | 51.82 ± 0.07 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 43 | 1 | 1 | 2048 | 1 | 1 | tg128 | 4.44 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 1 | 2048 | 1 | 1 | pp512 | 83.13 ± 2.56 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 1 | 2048 | 1 | 1 | tg128 | 10.26 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 1 | 2048 | 1 | 1 | pp512 | 79.87 ± 0.08 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 1 | 2048 | 1 | 1 | tg128 | 6.08 ± 0.02 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 1 | 2048 | 1 | 1 | pp512 | 125.96 ± 7.73 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 1 | 2048 | 1 | 1 | tg128 | 9.66 ± 0.00 |
+
+
+## Dual Socket
+#### Test One
+```
+sudo powerprofilesctl set performance
+# *this time try with and without setting numa_balancing*
+$ echo 1 | sudo tee /proc/sys/kernel/numa_balancing
+$ cat /sys/kernel/mm/transparent_hugepage/enabled
+[always] madvise never
+
+./build/bin/llama-bench \
+ --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
+ -rtr 1 \
+ -ctk f16 -ctv f16 \
+ -mla 2,1 -fa 1 \
+ -amb 2048 \
+ -fmoe 1 \
+ --numa distribute \
+ --threads 64,86,128,172,256
+
+Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+.
+.
+.
+Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+============ Repacked 663 tensors
+```
+**Without NUMA Balancing**
+| model | size | params | backend | threads | fa | mla | amb | rtr | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --: | ----: | --: | ---: | ------------: | ---------------: |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | pp512 | 84.75 ± 0.68 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.84 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | pp512 | 99.78 ± 0.31 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | tg128 | 7.00 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | pp512 | 135.28 ± 0.43 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.99 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 172 | 1 | 2 | 2048 | 1 | 1 | pp512 | 129.16 ± 3.46 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 172 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.22 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 256 | 1 | 2 | 2048 | 1 | 1 | pp512 | 166.44 ± 5.03 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 256 | 1 | 2 | 2048 | 1 | 1 | tg128 | 5.02 ± 0.02 |
+
+** With NUMA Balancing**
+| model | size | params | backend | threads | fa | mla | amb | rtr | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --: | ----: | --: | ---: | ------------: | ---------------: |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | pp512 | 84.70 ± 1.59 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.99 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | pp512 | 100.58 ± 0.10 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.98 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | pp512 | 135.53 ± 0.37 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.82 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 172 | 1 | 2 | 2048 | 1 | 1 | pp512 | 136.60 ± 2.23 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 172 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.02 ± 0.12 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 256 | 1 | 2 | 2048 | 1 | 1 | pp512 | 160.48 ± 12.80 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 256 | 1 | 2 | 2048 | 1 | 1 | tg128 | 5.08 ± 0.03 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 1 | 2048 | 1 | 1 | pp512 | 74.27 ± 4.43 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 1 | 2048 | 1 | 1 | tg128 | 7.43 ± 0.11 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 1 | 2048 | 1 | 1 | pp512 | 72.91 ± 1.65 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 1 | 2048 | 1 | 1 | tg128 | 5.38 ± 0.22 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 1 | 2048 | 1 | 1 | pp512 | 106.80 ± 5.28 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 1 | 2048 | 1 | 1 | tg128 | 7.24 ± 0.36 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 172 | 1 | 1 | 2048 | 1 | 1 | pp512 | 106.76 ± 2.56 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 172 | 1 | 1 | 2048 | 1 | 1 | tg128 | 5.69 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 256 | 1 | 1 | 2048 | 1 | 1 | pp512 | 144.27 ± 14.69 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 256 | 1 | 1 | 2048 | 1 | 1 | tg128 | 5.34 ± 0.37 |
+
+## Test Two
+Try `numactl --interleave`
+```bash
+Current power profile is: performance
+Set numa balancing to be:
+0
+
+Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+.
+.
+.
+Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+============ Repacked 663 tensors
+
+build: f2fb15de (3596)
+```
+
+| model | size | params | backend | threads | fa | mla | amb | rtr | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --: | ----: | --: | ---: | ------------: | ---------------:
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 42 | 1 | 2 | 2048 | 1 | 1 | pp512 | 56.47 ± 0.09 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 42 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.71 ± 0.02 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | pp512 | 93.50 ± 0.21 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | tg128 | 8.09 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | pp512 | 109.02 ± 0.15 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | tg128 | 8.04 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | pp512 | 149.25 ± 0.50 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | tg128 | 7.66 ± 0.03 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 172 | 1 | 2 | 2048 | 1 | 1 | pp512 | 152.62 ± 0.34 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 172 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.93 ± 0.00 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 256 | 1 | 2 | 2048 | 1 | 1 | pp512 | 182.26 ± 8.22 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 256 | 1 | 2 | 2048 | 1 | 1 | tg128 | 5.74 ± 0.00 |
+
+Now exactly the same with:
+```
+Set numa balancing to be:
+0
+```
+| model | size | params | backend | threads | fa | mla | amb | rtr | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --: | ----: | --: | ---: | ------------: | ---------------: |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 42 | 1 | 2 | 2048 | 1 | 1 | pp512 | 56.00 ± 0.21 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 42 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.60 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | pp512 | 92.35 ± 0.21 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 64 | 1 | 2 | 2048 | 1 | 1 | tg128 | 7.83 ± 0.04 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | pp512 | 104.96 ± 0.35 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 86 | 1 | 2 | 2048 | 1 | 1 | tg128 | 7.82 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | pp512 | 141.52 ± 0.78 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 128 | 1 | 2 | 2048 | 1 | 1 | tg128 | 7.52 ± 0.04 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 172 | 1 | 2 | 2048 | 1 | 1 | pp512 | 147.92 ± 0.38 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 172 | 1 | 2 | 2048 | 1 | 1 | tg128 | 6.75 ± 0.01 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 256 | 1 | 2 | 2048 | 1 | 1 | pp512 | 182.15 ± 8.15 |
+| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | CPU | 256 | 1 | 2 | 2048 | 1 | 1 | tg128 | 5.58 ± 0.00 |
+
\ No newline at end of file
diff --git a/github-data/pull_requests/260 - FlashMLA-2_ reduce compute buffer size _CUDA and CPU_.md b/github-data/pull_requests/260 - FlashMLA-2 reduce compute buffer size CUDA and CPU.md
similarity index 81%
rename from github-data/pull_requests/260 - FlashMLA-2_ reduce compute buffer size _CUDA and CPU_.md
rename to github-data/pull_requests/260 - FlashMLA-2 reduce compute buffer size CUDA and CPU.md
index cf914cf9b..f1b54775b 100644
--- a/github-data/pull_requests/260 - FlashMLA-2_ reduce compute buffer size _CUDA and CPU_.md
+++ b/github-data/pull_requests/260 - FlashMLA-2 reduce compute buffer size CUDA and CPU.md
@@ -1,17 +1,20 @@
-### 🔀 [#260](https://github.com/ikawrakow/ik_llama.cpp/pull/260) - FlashMLA-2: reduce compute buffer size (CUDA and CPU)
+## 🔀 [Pull Request #260](https://github.com/ikawrakow/ik_llama.cpp/pull/260) - FlashMLA-2: reduce compute buffer size (CUDA and CPU)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/flash_mla2_cuda_no_f32` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-17 |
| **Updated** | 2025-03-18 |
+| **Merged** | 2025-03-18 |
---
-#### Description
+## 📄 Description
This PR
-* Implements the same compute buffer size reduction approach as PR #253 on CUDA
+* Implements the same compute buffer size reduction approach as PR [#253](https://github.com/ikawrakow/ik_llama.cpp/issues/253) on CUDA
* Adds the ability to control the compute buffer size for FlashMLA-2 (`-mla 2 -fa`) via the `-amb` command line option.
* Fixes a bunch of integer overflows that show up when one starts using very long contexts (in the `perplexity` tool, and in the CUDA implementation of `GGML_OP_CONCAT`)
@@ -23,12 +26,23 @@ For DeepSeek-Lite I need to use a quite low `-amb` threshold of 256 MiB to even
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-03-17** at **15:00:38**:
+👤 **davidsyoung** commented on **2025-03-17** at **14:44:17**
+
+Will test and report back. Thank you @ikawrakow
+
+PS. Those fixes for `perplexity`, do you believe that was related to `NaN`'s in `IX_K` quants?
+
+---
+
+👤 **davidsyoung** commented on **2025-03-17** at **15:00:38**
First model load:
+
+Segfault with `-c 16384 -amb 1024 -fmoe -mla 2 -fa`
+
```
./llama-server -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-iq4_xs__iq3_s_q8.gguf -amb 1024 -fmoe -mla 2 -fa -ts 24/24/24/24/24/24/24/24/24/24/24/24/24/24/24/24 -c 16384 -ub 1024 --n-gpu-layers 100 -ot "blk\.3\.ffn_(down|gate|up)_exps\.weight|blk\.4\.ffn_(down|gate|up)_exps\.weight|blk\.5\.ffn_(down|gate|up)_exps\.weight=CUDA0" -ot "blk\.6\.ffn_(down|gate|up)_exps\.weight|blk\.7\.ffn_(down|gate|up)_exps\.weight|blk\.8\.ffn_(down|gate|up)_exps\.weight=CUDA1" -ot "blk\.9\.ffn_(down|gate|up)_exps\.weight|blk\.10\.ffn_(down|gate|up)_exps\.weight|blk\.11\.ffn_(down|gate|up)_exps\.weight|blk\.12\.ffn_(down|gate|up)_exps\.weight=CUDA2" -ot "blk\.13\.ffn_(down|gate|up)_exps\.weight|blk\.14\.ffn_(down|gate|up)_exps\.weight|blk\.15\.ffn_(down|gate|up)_exps\.weight|blk\.16\.ffn_(down|gate|up)_exps\.weight=CUDA3" -ot "blk\.17\.ffn_(down|gate|up)_exps\.weight|blk\.18\.ffn_(down|gate|up)_exps\.weight|blk\.19\.ffn_(down|gate|up)_exps\.weight|blk\.20\.ffn_(down|gate|up)_exps\.weight=CUDA4" -ot "blk\.21\.ffn_(down|gate|up)_exps\.weight|blk\.22\.ffn_(down|gate|up)_exps\.weight|blk\.23\.ffn_(down|gate|up)_exps\.weight|blk\.24\.ffn_(down|gate|up)_exps\.weight=CUDA5" -ot "blk\.25\.ffn_(down|gate|up)_exps\.weight|blk\.26\.ffn_(down|gate|up)_exps\.weight|blk\.27\.ffn_(down|gate|up)_exps\.weight|blk\.28\.ffn_(down|gate|up)_exps\.weight=CUDA6" -ot "blk\.29\.ffn_(down|gate|up)_exps\.weight|blk\.30\.ffn_(down|gate|up)_exps\.weight|blk\.31\.ffn_(down|gate|up)_exps\.weight|blk\.32\.ffn_(down|gate|up)_exps\.weight=CUDA7" -ot "blk\.33\.ffn_(down|gate|up)_exps\.weight|blk\.34\.ffn_(down|gate|up)_exps\.weight|blk\.35\.ffn_(down|gate|up)_exps\.weight|blk\.36\.ffn_(down|gate|up)_exps\.weight=CUDA8" -ot "blk\.37\.ffn_(down|gate|up)_exps\.weight|blk\.38\.ffn_(down|gate|up)_exps\.weight|blk\.39\.ffn_(down|gate|up)_exps\.weight|blk\.40\.ffn_(down|gate|up)_exps\.weight=CUDA9" -ot "blk\.41\.ffn_(down|gate|up)_exps\.weight|blk\.42\.ffn_(down|gate|up)_exps\.weight|blk\.43\.ffn_(down|gate|up)_exps\.weight|blk\.44\.ffn_(down|gate|up)_exps\.weight=CUDA10" -ot "blk\.45\.ffn_(down|gate|up)_exps\.weight|blk\.46\.ffn_(down|gate|up)_exps\.weight|blk\.47\.ffn_(down|gate|up)_exps\.weight|blk\.48\.ffn_(down|gate|up)_exps\.weight=CUDA11" -ot "blk\.49\.ffn_(down|gate|up)_exps\.weight|blk\.50\.ffn_(down|gate|up)_exps\.weight|blk\.51\.ffn_(down|gate|up)_exps\.weight|blk\.52\.ffn_(down|gate|up)_exps\.weight=CUDA12" -ot "blk\.53\.ffn_(down|gate|up)_exps\.weight|blk\.54\.ffn_(down|gate|up)_exps\.weight|blk\.55\.ffn_(down|gate|up)_exps\.weight|blk\.56\.ffn_(down|gate|up)_exps\.weight=CUDA13" -ot "blk\.57\.ffn_(down|gate|up)_exps\.weight|blk\.58\.ffn_(down|gate|up)_exps\.weight|blk\.59\.ffn_(down|gate|up)_exps\.weight|blk\.60\.ffn_(down|gate|up)_exps\.weight=CUDA14" --seed 3704 --temp 0.5 --temp 0.5 --host 0.0.0.0 --port 8080
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
@@ -466,31 +480,31 @@ llama_init_from_gpt_params: error: failed to create context with model '/models/
ERR [ load_model] unable to load model | tid="22757404872704" timestamp=1742223553 model="/models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-iq4_xs__iq3_s_q8.gguf"
Segmentation fault
root@7e406a084738:/app/build/bin#
-```
+```
+
---
-👤 **ikawrakow** commented the **2025-03-17** at **15:06:08**:
+👤 **ikawrakow** commented on **2025-03-17** at **15:06:08**
> Those fixes for perplexity, do you believe that was related to NaN's in IX_K quants?
-No. It is an integer overflow. The logic location in the array of logits was computed with 32-bit integers. As there are ~128k entries in the vocabulary, the integer multiplication `i * n_vocab` overflows for `i >= 16384`. You were computing PPL for contexts of 2048 or 512, so no issue there (`i < 2048`). The NaNs really are due to `fp16` arithmetic for the MoE matrix multiplications when using `IQ4_K` or `IQ4_KSS`. Apparently in the `llama.cpp` world it is well known that one cannot use the `fp16` DeepSeek models because one gets NaNs.
+No. It is an integer overflow. The logit location in the array of logits was computed with 32-bit integers. As there are ~128k entries in the vocabulary, the integer multiplication `i * n_vocab` overflows for `i >= 16384`. You were computing PPL for contexts of 2048 or 512, so no issue there (`i < 2048`). The NaNs really are due to `fp16` arithmetic for the MoE matrix multiplications when using `IQ4_K` or `IQ4_KSS`. Apparently in the `llama.cpp` world it is well known that one cannot use the `fp16` DeepSeek models because one gets NaNs.
---
-👤 **ikawrakow** commented the **2025-03-17** at **15:10:29**:
+👤 **ikawrakow** commented on **2025-03-17** at **15:10:29**
> Segfault with `-c 16384 -amb 1024 -fmoe -mla 2 -fa`
-It fails to allocate `3480 MiB`, so I guess there isn't enough VRAM? Try with `-amb 512` then.
+It fails to allocate `3480 MiB`, so I guess there isn't enough VRAM? Try with `-amb 512` then. It is a bit difficult to predict the total compute buffer size because there are also other operations that require memory, and FlashMLA-2 creates additional intermediate tensors, so the math is not as simple as just looking at the $X$ matrix as this PR does.
---
-👤 **ubergarm** commented the **2025-03-17** at **15:40:37**:
+👤 **ubergarm** commented on **2025-03-17** at **15:40:37**
I'll take a quick stab at it too given using a simple 1x RTX A6000 48GB GPU configuration.
-
#### Update
```bash
$ git checkout ik/flash_mla2_cuda_no_f32
@@ -522,30 +536,136 @@ CUDA_VISIBLE_DEVICES="0," \
--threads 24 \
--host 127.0.0.1 \
--port 8080
+
+llm_load_tensors: offloading 61 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 62/62 layers to GPU
+llm_load_tensors: CPU buffer size = 225736.00 MiB
+llm_load_tensors: CPU buffer size = 938.98 MiB
+llm_load_tensors: CUDA0 buffer size = 17744.02 MiB
```
#### Results
+`--ctx-size 16384`
+`nvidia-smi | grep llama-server` = 21856MiB
+```
+llama_kv_cache_init: CUDA0 KV buffer size = 1098.00 MiB
+llama_new_context_with_model: KV self size = 1098.00 MiB, c^KV (f16): 1098.00 MiB, kv^T: not used
+llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 2734.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 176.01 MiB
+llama_new_context_with_model: graph nodes = 4158
+llama_new_context_with_model: graph splits = 118
+```
-* 16k = TODO
-* 32k = TODO
-* 64k = TODO
+`--ctx-size 32768`
+`nvidia-smi | grep llama-server` = 24010MiB (increases ~30MiB after inferencing)
+```
+llama_kv_cache_init: CUDA0 KV buffer size = 2196.00 MiB
+llama_new_context_with_model: KV self size = 2196.00 MiB, c^KV (f16): 2196.00 MiB, kv^T: not used
+llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 3790.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 176.01 MiB
+llama_new_context_with_model: graph nodes = 5500
+llama_new_context_with_model: graph splits = 118
+```
+
+`--ctx-size 65536`
+`nvidia-smi | grep llama-server` = `28816MiB` (increases ~32MiB after inferencing)
+```
+llama_kv_cache_init: CUDA0 KV buffer size = 4392.00 MiB
+llama_new_context_with_model: KV self size = 4392.00 MiB, c^KV (f16): 4392.00 MiB, kv^T: not used
+llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 6401.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 240.01 MiB
+llama_new_context_with_model: graph nodes = 8184
+llama_new_context_with_model: graph splits = 118
+```
> Performance relative to not using -amb 1024 (only PP performance is required, TG in FlashMLA-2 is done the same way as no FA, so does not go through this memory optimization).
#### llama-bench
```bash
-echo TODO
-```
+# Run this twice, once with without specifying `-amb` at all and once like so:
+CUDA_VISIBLE_DEVICES="0," \
+./build/bin/llama-bench \
+ --model /mnt/raid/models/ubergarm/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-Q2_K_R4.gguf \
+ -ctk f16 -ctv f16 \
+ -mla 2 -fa 1 \
+ -amb 1024,128,64,32,16,8,4,1 \
+ -fmoe 1 \
+ --n-gpu-layers 63 \
+ --override-tensor exps=CPU \
+ --threads 24
+
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 1 CUDA devices:
+ Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
+
+```
+| model | size | params | backend | ngl | fa | mla | amb | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --: | ----: | ---: | ------------: | ---------------: |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | - | 1 | pp512 | 108.18 ± 5.66 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | - | 1 | tg128 | 11.46 ± 0.02 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 1024 | 1 | pp512 | 109.42 ± 4.35 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 1024 | 1 | tg128 | 11.50 ± 0.04 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 128 | 1 | pp512 | 111.47 ± 1.44 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 128 | 1 | tg128 | 11.49 ± 0.02 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 64 | 1 | pp512 | 108.42 ± 3.52 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 64 | 1 | tg128 | 11.37 ± 0.04 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 32 | 1 | pp512 | 110.66 ± 2.65 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 32 | 1 | tg128 | 11.31 ± 0.01 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 16 | 1 | pp512 | 112.07 ± 0.18 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 16 | 1 | tg128 | 11.43 ± 0.02 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 8 | 1 | pp512 | 108.94 ± 6.17 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 8 | 1 | tg128 | 11.43 ± 0.01 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 4 | 1 | pp512 | 110.62 ± 0.33 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 4 | 1 | tg128 | 11.43 ± 0.02 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 1 | 1 | pp512 | 102.91 ± 2.12 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 1 | 1 | tg128 | 11.43 ± 0.06 |
---
-👤 **ikawrakow** commented the **2025-03-17** at **16:04:33**:
+👤 **ikawrakow** commented on **2025-03-17** at **16:04:33**
So, this looks quite a bit better than the main branch. It would seem that a single 24 GB GPU could handle the non-expert tensors and up to 32k context?
---
-👤 **ikawrakow** commented the **2025-03-17** at **16:33:42**:
+👤 **ubergarm** commented on **2025-03-17** at **16:17:11**
+
+> So, this looks quite a bit better than the main branch. It would seem that a single 24 GB GPU could handle the non-expert tensors and up to 32k context?
+
+*Update*: Yes, it is better than main branch as shown in table below. ~I haven't compared apples-apples with exact same model/ctk/ctv/amb between this branch and main. Closest numbers I have would be [here](https://github.com/ikawrakow/ik_llama.cpp/pull/259#issuecomment-2727496320). I will double check to confirm after llama-bench ends and i finish above comment.~
+
+Also the quant I'm using has `q8_0` for everything on GPU which is a bit heavy, so yes impressive it can still fit about 32k context with these settings. The most recent stuff on [ktransformers MLA FlashInfer absorb chunk prefill](https://github.com/kvcache-ai/ktransformers/blob/e788248364ed0241a0ba908e15b98e724d08e939/doc/en/DeepseekR1_V3_tutorial.md#longer-context) seems to consider over 20k "long context".
+
+> quantized cache still does not work for FlashMLA-2 on CUDA
+
+Oh, I thought I noticed it *was* reporting less with `-ctk q8_0 -ctv q8_0`, ~I'll do a quick check and update this TODO here and confirm.~ *Update*: Ahh okie, I see your note below, thanks.
+
+## Comparison Table
+I ran enough to show it is working, gonna stop and not fill in the blanks for now.
+
+| commit | ctx-size | amb | ctk/ctv | CUDA0 KV buffer | CUDA0 compute buffer | nvidia-smi |
+| --- | --- | --- | --- | --- | --- | --- |
+| branch/sha | size | MiB | quant | MiB | MiB | MiB |
+| `flash_mla2_@b147e31f` | 16384 | 1024 | f16 | 1098 | 2734 | 21856 |
+| `flash_mla2_@b147e31f` | 32768 | 1024 | f16 | 2196 | 3790 | 24010 |
+| `flash_mla2_@b147e31f` | 32768 | 128 | f16 | 2196 | 2817 | 23036 |
+| `flash_mla2_@b147e31f` | 65536 | 1024 | f16 | 4392 | 6401 | 28816 |
+| `main@f91b2e38` | 16384 | 1024 | f16 | 1098 | 5038 | 24160 |
+| `main@f91b2e38` | 32768 | 1024 | f16 | 2196 | 10006 | 30226 |
+| `main@f91b2e38` | 32768 | 128 | f16 | | | |
+| `main@f91b2e38` | 65536 | 1024 | f16 | | | |
+| `flash_mla2_@b147e31f` | 16384 | 1024 | q8_0 | 583.31 | 2494 | 21132 |
+| `flash_mla2_@b147e31f` | 32768 | 1024 | q8_0 | 1166.65 | 2662 | 21884 |
+| `flash_mla2_@b147e31f` | 65536 | 1024 | q8_0 | | | |
+
+---
+
+👤 **ikawrakow** commented on **2025-03-17** at **16:33:42**
> Oh, I thought I noticed it was reporting less with -ctk q8_0 -ctv q8_0, I'll do a quick check and update this TODO here and confirm.
@@ -555,7 +675,7 @@ Based on the performance values @ubergarm posted, there doesn't seem to be any m
---
-👤 **ubergarm** commented the **2025-03-17** at **16:47:48**:
+👤 **ubergarm** commented on **2025-03-17** at **16:47:48**
> What is the compute buffer size for -amb 128
@@ -563,13 +683,13 @@ The relevant part of the above table for this specific question:
| commit | ctx-size | amb | ctk/ctv | CUDA0 KV buffer | CUDA0 compute buffer | nvidia-smi |
| --- | --- | --- | --- | --- | --- | --- |
-| branch/sha | size | quant | MiB | MiB | MiB |
+| branch/sha | size | MiB | quant | MiB | MiB | MiB |
| `flash_mla2_@b147e31f` | 32768 | 1024 | f16 | 2196 | 3790 | 24010 |
| `flash_mla2_@b147e31f` | 32768 | 128 | f16 | 2196 | 2817 | 23036 |
---
-👤 **davidsyoung** commented the **2025-03-17** at **16:59:59**:
+👤 **davidsyoung** commented on **2025-03-17** at **16:59:59**
Sorry for delay here. As model loading takes quite a long amount of time on 16 GPUs, and I'm near to the limit there's been some OOMs (my own fault nothing to do with PR), I've been quite slow to come back.
@@ -579,7 +699,7 @@ TODO
---
-👤 **ubergarm** commented the **2025-03-17** at **17:25:23**:
+👤 **ubergarm** commented on **2025-03-17** at **17:25:23**
@davidsyoung
@@ -591,7 +711,7 @@ Curious if you have similar outcome across all your GPUs!
---
-👤 **davidsyoung** commented the **2025-03-17** at **17:39:21**:
+👤 **davidsyoung** commented on **2025-03-17** at **17:39:21**
> @davidsyoung
>
@@ -610,11 +730,13 @@ ggml_new_object: not enough space in the context's memory pool (needed 26231472,
Segmentation fault
```
-Haven't seen that error before!
+Haven't seen that error before!
+
+Also, you should test setting `-ub 1024`, you should see a big difference in PP performance compared to default of `-ub 512` I believe.
---
-👤 **ikawrakow** commented the **2025-03-17** at **17:55:57**:
+👤 **ikawrakow** commented on **2025-03-17** at **17:55:57**
Sorry, I wasn't clear enough with my request. The PP test should be done with `-p 16384` (or whatever context we are looking at). With `-p 512`, `llama-bench` will set the context to 512, so the required buffer to compute FlashMLA-2 will be quite small - `256 x 128 x 512 x 4 = 64 MiB`, so there will be more than one step only for `-amb 32` or lower. With `-amb 2` it will take 32 steps, so it will be processing 4 heads at a time. At `-amb 1` it will be 64 steps, so 2 heads per step. I find it quite surprising that we do not see performance degradation down to so many steps.
@@ -624,16 +746,17 @@ This is only relevant of the MoE experts are computed on CUDA. When the MoE part
---
-👤 **ubergarm** commented the **2025-03-17** at **18:00:55**:
+👤 **ubergarm** commented on **2025-03-17** at **18:00:55**
@davidsyoung
> pipeline parallelism enabled (n_copies=4)
+
Hrmm, I've seen some chatter about `-DGGML_SCHED_MAX_COPIES=4` before (default). Some folks were setting it to 1. Not sure why (maybe CUDA graphs?) and that was on vanilla llama.cpp so may not apply anymore.
I was kinda surprised that you were offloading shared experts onto GPUs with your config given that doesn't work on ktransformers yet in my own testing an in their documentation:
-> Note:Currently, executing experts on the GPU will conflict with CUDA Graph. Without CUDA Graph, there will be a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6GB of VRAM), we do not recommend enabling this feature. We are actively working on optimization. Note KExpertsTorch is untested.
+> Note:Currently, executing experts on the GPU will conflict with CUDA Graph. Without CUDA Graph, there will be a significant slowdown. Therefore, unless you have a substantial amount of VRAM (placing a single layer of experts for DeepSeek-V3/R1 on the GPU requires at least 5.6GB of VRAM), we do not recommend enabling this feature. We are actively working on optimization. Note KExpertsTorch is untested. [ktransformers guide](https://github.com/ubergarm/r1-ktransformers-guide?tab=readme-ov-file#cuda-graphs)
@ikawrakow
@@ -643,7 +766,7 @@ I'll set that up and post the results here soon.
---
-👤 **ikawrakow** commented the **2025-03-17** at **18:09:14**:
+👤 **ikawrakow** commented on **2025-03-17** at **18:09:14**
> I was kinda surprised that you were offloading shared experts onto GPUs with your config given that doesn't work on ktransformers yet in my own testing an in their documentation:
@@ -651,7 +774,7 @@ I'll set that up and post the results here soon.
---
-👤 **ikawrakow** commented the **2025-03-17** at **18:25:34**:
+👤 **ikawrakow** commented on **2025-03-17** at **18:25:34**
> Interestingly I got an error for -amb 32 when trying to maximise context length:
> Haven't seen that error before!
@@ -660,9 +783,9 @@ Neither have I. It means that the back-end is miscalculating the required comput
---
-👤 **ubergarm** commented the **2025-03-17** at **19:29:15**:
+👤 **ubergarm** commented on **2025-03-17** at **19:29:15**
-I increased `-p 16384` and set `-r 2` repetitions down from default of 5 for a quick check but it crashed before finishing with error shown below.
+I increased `-p 16384` and set `-r 2` repetitions down from default of 5 for a quick check but it crashed before finishing with error shown below. (Exactly same byte counts as @davidsyoung had above).
```bash
CUDA_VISIBLE_DEVICES="0," \
@@ -670,7 +793,7 @@ CUDA_VISIBLE_DEVICES="0," \
--model /mnt/raid/models/ubergarm/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-Q2_K_R4.gguf \
-ctk f16 -ctv f16 \
-mla 2 -fa 1 \
- -amb 1024,128,16,8,4,2,1 \
+ -amb 1024,128,64,32,16,8,4,2,1 \
-p 16384,8192 \
-n 0 \
-fmoe 1 \
@@ -691,6 +814,12 @@ ggml_cuda_init: found 1 CUDA devices:
| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 1024 | 1 | pp8192 | 97.67 ± 1.21 |
| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 128 | 1 | pp16384 | 82.59 ± 2.70 |
| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 128 | 1 | pp8192 | 96.21 ± 1.67 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 64 | 1 | pp16384 | 82.00 ± 2.36 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 64 | 1 | pp8192 | 95.02 ± 0.00 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 32 | 1 | pp16384 | 78.30 ± 1.01 |
+| deepseek2 671B Q2_K_R4 | 238.69 GiB | 672.05 B | CUDA | 63 | 1 | 2 | 32 | 1 | pp8192 | 94.31 ± 1.03 |
+
+(crashes when starting the 16k prompt with amb=16MiB)
```
ggml_new_object: not enough space in the context's memory pool (needed 26231472, available 26231136)
@@ -699,7 +828,7 @@ ggml_new_object: not enough space in the context's memory pool (needed 26231472,
---
-👤 **davidsyoung** commented the **2025-03-17** at **23:37:51**:
+👤 **davidsyoung** commented on **2025-03-17** at **23:37:51**
So compute buffers are massively improved. I don't have apples for apples comparison as I went down a rabbit hole after realising I could turn off pipeline parallel and it would also give me more VRAM back (thanks @ubergarm!). But it is massively improved.
@@ -724,7 +853,7 @@ Segmentation fault
---
-👤 **saood06** commented the **2025-03-17** at **23:48:58**:
+👤 **saood06** commented on **2025-03-17** at **23:48:58**
> I don't have apples for apples comparison as I went down a rabbit hole after realising I could turn off pipeline parallel and it would also give me more VRAM back (thanks @ubergarm!). But it is massively improved.
@@ -736,7 +865,7 @@ Even without the direct comparison, I'm curious what your at now. Also you proba
---
-👤 **davidsyoung** commented the **2025-03-18** at **00:00:30**:
+👤 **davidsyoung** commented on **2025-03-18** at **00:00:30**
Damn, I don’t have it right on me as I closed the laptop (night time here). I do have some data in notes from very early run.
@@ -746,6 +875,8 @@ Here are some very initial runs (this is without disabling pipeline parallelism)
Also, for gpu 16, unfortunately I can’t really use it. I can’t split the layers any bit more evenly (at least with what I’ve tried - it’s a bit of a limitation unfortunately without being able to split by row).
+I will add some more data tomorrow for you!
+
# Compute Buffer Configuration Comparison
| Parameter/Variable | Run 1 (`-c 8192 -amb 512`) | Run 2 (`-c 16843 -amb 256`) | Notes/Observations |
@@ -774,7 +905,7 @@ Also, for gpu 16, unfortunately I can’t really use it. I can’t split the lay
---
-👤 **ikawrakow** commented the **2025-03-18** at **06:36:37**:
+👤 **ikawrakow** commented on **2025-03-18** at **06:36:37**
@ubergarm @davidsyoung
diff --git a/github-data/pull_requests/261 - Compile time option to use bf16 for quants without MMQ kernels.md b/github-data/pull_requests/261 - Compile time option to use bf16 for quants without MMQ kernels.md
index f4f46f631..8a624537a 100644
--- a/github-data/pull_requests/261 - Compile time option to use bf16 for quants without MMQ kernels.md
+++ b/github-data/pull_requests/261 - Compile time option to use bf16 for quants without MMQ kernels.md
@@ -1,14 +1,17 @@
-### 🔀 [#261](https://github.com/ikawrakow/ik_llama.cpp/pull/261) - Compile time option to use bf16 for quants without MMQ kernels
+## 🔀 [Pull Request #261](https://github.com/ikawrakow/ik_llama.cpp/pull/261) - Compile time option to use bf16 for quants without MMQ kernels
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/use_bf16_when_no_mmq` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-17 |
| **Updated** | 2025-03-18 |
+| **Merged** | 2025-03-18 |
---
-#### Description
+## 📄 Description
The `IQ2_KS, IQ2_K, ..., IQ6_K` quantization types do not have MMQ kernels, so matrix multiplications for model weights quantized with these types are done via dequantization to `fp16` and `cublasGemmEx` GEMM using `fp16` precision. For the DeepSeek series of MoE models this leads to NaNs.
@@ -24,15 +27,15 @@ I have tested with DeepSeek-Lite quantized with `IQ4_KSS` and `IQ4_K`. In both c
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **davidsyoung** commented the **2025-03-17** at **23:38:28**:
+👤 **davidsyoung** commented on **2025-03-17** at **23:38:28**
Awesome! Will re-quant over night and test tomorrow!
---
-👤 **saood06** commented the **2025-03-17** at **23:43:23**:
+👤 **saood06** commented on **2025-03-17** at **23:43:23**
> Awesome! Will re-quant over night and test tomorrow!
@@ -40,6 +43,6 @@ In case you still have the old quants, you can just use those with the new code
---
-👤 **davidsyoung** commented the **2025-03-17** at **23:45:25**:
+👤 **davidsyoung** commented on **2025-03-17** at **23:45:25**
Unfortunately I don’t! My cache drive is limited so I tend to delete pretty soon.
\ No newline at end of file
diff --git a/github-data/pull_requests/262 - Fix _261.md b/github-data/pull_requests/262 - Fix 261.md
similarity index 98%
rename from github-data/pull_requests/262 - Fix _261.md
rename to github-data/pull_requests/262 - Fix 261.md
index d6040529d..b526ee2a7 100644
--- a/github-data/pull_requests/262 - Fix _261.md
+++ b/github-data/pull_requests/262 - Fix 261.md
@@ -1,16 +1,25 @@
-### 🐛 [#262](https://github.com/ikawrakow/ik_llama.cpp/pull/262) - Fix [#261](https://github.com/ikawrakow/ik_llama.cpp/issues/261)
+## 🔀 [Pull Request #262](https://github.com/ikawrakow/ik_llama.cpp/pull/262) - Fix [#261](https://github.com/ikawrakow/ik_llama.cpp/issues/261)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_pr_261` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-18 |
| **Updated** | 2025-03-18 |
+| **Merged** | 2025-03-18 |
---
-#### 💬 Conversation
+## 📄 Description
-👤 **davidsyoung** commented the **2025-03-18** at **10:41:29**:
+_No description provided._
+
+---
+
+## 💬 Conversation
+
+👤 **davidsyoung** commented on **2025-03-18** at **10:41:29**
Unfortunately still getting NaNs under perplexity. I built the latest PR in regards q8_0 KV cache.
@@ -2171,7 +2180,7 @@ llama_model_quantize_internal: WARNING: 61 of 785 tensor(s) required fallback qu
---
-👤 **davidsyoung** commented the **2025-03-18** at **10:41:34**:
+👤 **davidsyoung** commented on **2025-03-18** at **10:41:34**
PPL run
@@ -2639,19 +2648,19 @@ perplexity: 22.93 seconds per pass - ETA 53.58 minutes
---
-👤 **ikawrakow** commented the **2025-03-18** at **10:45:08**:
+👤 **ikawrakow** commented on **2025-03-18** at **10:45:08**
Did you enable the `GGML_CUDA_IQK_FORCE_BF16` option when building?
---
-👤 **davidsyoung** commented the **2025-03-18** at **10:49:14**:
+👤 **davidsyoung** commented on **2025-03-18** at **10:49:14**
D'oh. Back to the drawing board. Apologies! Will report back.
---
-👤 **davidsyoung** commented the **2025-03-18** at **13:11:13**:
+👤 **davidsyoung** commented on **2025-03-18** at **13:11:13**
Works! Great work!
diff --git a/github-data/pull_requests/264 - Make Q8_0 KV cache work with FlasMLA-2 on CUDA.md b/github-data/pull_requests/264 - Make Q8_0 KV cache work with FlasMLA-2 on CUDA.md
index 344641b30..581974f27 100644
--- a/github-data/pull_requests/264 - Make Q8_0 KV cache work with FlasMLA-2 on CUDA.md
+++ b/github-data/pull_requests/264 - Make Q8_0 KV cache work with FlasMLA-2 on CUDA.md
@@ -1,14 +1,17 @@
-### 🔀 [#264](https://github.com/ikawrakow/ik_llama.cpp/pull/264) - Make Q8_0 KV cache work with FlasMLA-2 on CUDA
+## 🔀 [Pull Request #264](https://github.com/ikawrakow/ik_llama.cpp/pull/264) - Make Q8_0 KV cache work with FlasMLA-2 on CUDA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mla2_q80_cache` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-18 |
| **Updated** | 2025-03-18 |
+| **Merged** | 2025-03-18 |
---
-#### Description
+## 📄 Description
For DeepSeek-V3/R1 this reduces KV cache size by ~2 GiB for a context of 65k tokens.
@@ -16,6 +19,6 @@ Using
```
-amb 512 -mla 2 -fa -ctk q8_0
```
-one should now be able to use 65k context with a single 24 GB GPU processing all attention calculations and all non-MoE expert tensors offloaded to it. See PR #260 for meaning and effect of the `-amb` command line option.
+one should now be able to use 65k context with a single 24 GB GPU processing all attention calculations and all non-MoE expert tensors offloaded to it. See PR [#260](https://github.com/ikawrakow/ik_llama.cpp/issues/260) for meaning and effect of the `-amb` command line option.
There is still an issue with one or more of the `GGML_OP_REPEAT, GGML_OP_CONCAT, GGML_OP_CPY` operations on CUDA, which are required to implement the entire attention computation using quantized tensors, so this PR takes the pragmatic approach of computing the attention operations with `fp16` on CUDA. The downside is that `fp16` will be used also on the CPU if the code was built with CUDA enabled (and this is slower than using `Q8_0` directly, wit the gap in performance increasing with context length).
\ No newline at end of file
diff --git a/github-data/pull_requests/265 - Allow q8_0 cache on the CPU for FlashMLA-2.md b/github-data/pull_requests/265 - Allow q8_0 cache on the CPU for FlashMLA-2.md
index acf884a4c..f609503c9 100644
--- a/github-data/pull_requests/265 - Allow q8_0 cache on the CPU for FlashMLA-2.md
+++ b/github-data/pull_requests/265 - Allow q8_0 cache on the CPU for FlashMLA-2.md
@@ -1,13 +1,16 @@
-### 🔀 [#265](https://github.com/ikawrakow/ik_llama.cpp/pull/265) - Allow q8_0 cache on the CPU for FlashMLA-2
+## 🔀 [Pull Request #265](https://github.com/ikawrakow/ik_llama.cpp/pull/265) - Allow q8_0 cache on the CPU for FlashMLA-2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mla2_q80_cache_cpu` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-18 |
| **Updated** | 2025-03-18 |
+| **Merged** | 2025-03-18 |
---
-#### Description
+## 📄 Description
Somehow I had the concept that `Q8_0` KV cache is working for CPU-only inference with FlashMLA-2. Indeed it is for prompt processing, but not for TG (two different paths are taken). Clearly too many options as I'm getting confused myself. Anyhow, this PR adds the missing `Q8_0 -> Q8_0` contiguous transpose operation, so now we can use `Q8_0` KV cache with FlashMLA-2 also on the CPU.
\ No newline at end of file
diff --git a/github-data/pull_requests/268 - Prevent FlashMLA-1 from running on CUDA.md b/github-data/pull_requests/268 - Prevent FlashMLA-1 from running on CUDA.md
index 43c78ea32..2293a4561 100644
--- a/github-data/pull_requests/268 - Prevent FlashMLA-1 from running on CUDA.md
+++ b/github-data/pull_requests/268 - Prevent FlashMLA-1 from running on CUDA.md
@@ -1,14 +1,17 @@
-### 🔀 [#268](https://github.com/ikawrakow/ik_llama.cpp/pull/268) - Prevent FlashMLA-1 from running on CUDA
+## 🔀 [Pull Request #268](https://github.com/ikawrakow/ik_llama.cpp/pull/268) - Prevent FlashMLA-1 from running on CUDA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/avoid_cuda_mla_1` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-19 |
| **Updated** | 2025-03-19 |
+| **Merged** | 2025-03-19 |
---
-#### Description
+## 📄 Description
It is not supported, so let's not spam the user with messages about that by not allowing it to run on the GPU in the first place.
diff --git a/github-data/pull_requests/269 - Fix ggml_compute_forward_dup_q.md b/github-data/pull_requests/269 - Fix ggml_compute_forward_dup_q.md
index bd903f8bf..42a3f15c9 100644
--- a/github-data/pull_requests/269 - Fix ggml_compute_forward_dup_q.md
+++ b/github-data/pull_requests/269 - Fix ggml_compute_forward_dup_q.md
@@ -1,13 +1,16 @@
-### 🐛 [#269](https://github.com/ikawrakow/ik_llama.cpp/pull/269) - Fix ggml_compute_forward_dup_q
+## 🔀 [Pull Request #269](https://github.com/ikawrakow/ik_llama.cpp/pull/269) - Fix ggml_compute_forward_dup_q
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_dup_q` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-19 |
| **Updated** | 2025-03-19 |
+| **Merged** | 2025-03-19 |
---
-#### Description
+## 📄 Description
-I broke it with PR #265. I was testing with a model where the wk_b and wk_v tensors were present, so didn't need to be computed, so didn't notice that the change I made to ggml_compute_forward_dup_q breaks that computation.
\ No newline at end of file
+I broke it with PR [#265](https://github.com/ikawrakow/ik_llama.cpp/issues/265). I was testing with a model where the wk_b and wk_v tensors were present, so didn't need to be computed, so didn't notice that the change I made to ggml_compute_forward_dup_q breaks that computation.
\ No newline at end of file
diff --git a/github-data/pull_requests/27 - Faster Gemma2.md b/github-data/pull_requests/27 - Faster Gemma2.md
index 5bfb0feaf..6acd4370f 100644
--- a/github-data/pull_requests/27 - Faster Gemma2.md
+++ b/github-data/pull_requests/27 - Faster Gemma2.md
@@ -1,14 +1,17 @@
-### 🔀 [#27](https://github.com/ikawrakow/ik_llama.cpp/pull/27) - Faster Gemma2
+## 🔀 [Pull Request #27](https://github.com/ikawrakow/ik_llama.cpp/pull/27) - Faster Gemma2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fused_softcap_softmax` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-27 |
| **Updated** | 2024-08-27 |
+| **Merged** | 2024-08-27 |
---
-#### Description
+## 📄 Description
In a [previous PR](https://github.com/ikawrakow/ik_llama.cpp/pull/9) I has fused `scale - tanh - scale` used for "soft-capping" activations into a `GGML_OP_SOFTCAP` operation. This PR further fuses `GGML_OP_SOFTCAP` with `GGML_OP_SOFT_MAX` into a new `GGML_OP_SOFT_CAP_MAX` operation. This is useful for, e.g., self-attention in the Gemma-2 series of models, and leads to a significant performance increase.
diff --git a/github-data/pull_requests/270 - Honor mmap setting when using tensor overrides.md b/github-data/pull_requests/270 - Honor mmap setting when using tensor overrides.md
index 0214a238e..d1ae3afc7 100644
--- a/github-data/pull_requests/270 - Honor mmap setting when using tensor overrides.md
+++ b/github-data/pull_requests/270 - Honor mmap setting when using tensor overrides.md
@@ -1,14 +1,17 @@
-### 🔀 [#270](https://github.com/ikawrakow/ik_llama.cpp/pull/270) - Honor mmap setting when using tensor overrides
+## 🔀 [Pull Request #270](https://github.com/ikawrakow/ik_llama.cpp/pull/270) - Honor mmap setting when using tensor overrides
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/tensor_override_honor_mmap` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-19 |
| **Updated** | 2025-03-19 |
+| **Merged** | 2025-03-19 |
---
-#### Description
+## 📄 Description
The reason why `mmap` was disabled when using tensor overrides is this:
* When the command line argument is parsed (and the override buffer is set to `CPU`), we get the buffer type returned by `ggml_backend_cpu_buffer_type()`
@@ -22,9 +25,9 @@ Note, however, that `-rtr` still disables `mmap` because otherwise the model wou
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-03-19** at **19:52:45**:
+👤 **ubergarm** commented on **2025-03-19** at **19:52:45**
Wow sweet! I just got back home and saw this, pull'd and rebuilt and got my custom quant running locally on the 9950X + 96GB DDR5-6400 RAM + 3090TI 24GB! Got about 3 tok/sec generation on a quick initial test.
diff --git a/github-data/pull_requests/272 - Convert models to row-interleaved quants using the quantize tool.md b/github-data/pull_requests/272 - Convert models to row-interleaved quants using the quantize tool.md
index 90af288e4..f28687ee7 100644
--- a/github-data/pull_requests/272 - Convert models to row-interleaved quants using the quantize tool.md
+++ b/github-data/pull_requests/272 - Convert models to row-interleaved quants using the quantize tool.md
@@ -1,16 +1,19 @@
-### 🔀 [#272](https://github.com/ikawrakow/ik_llama.cpp/pull/272) - Convert models to row-interleaved quants using the quantize tool
+## 🔀 [Pull Request #272](https://github.com/ikawrakow/ik_llama.cpp/pull/272) - Convert models to row-interleaved quants using the quantize tool
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/offline_repack` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-20 |
| **Updated** | 2025-03-21 |
+| **Merged** | 2025-03-21 |
---
-#### Description
+## 📄 Description
-The main purpose of this PR is to remove the need for run-time-repacking (command line argument `-rtr`) by having a tool to convert models to row-interleaved quantization types. The main motivation for providing this tool is to allow using `mmap` when loading a model and still having row-interleaved quants, so that one can combine the claimed performance gains from using 1 GiB huge pages (see #267) with the performance gains due to row-interleaved quants.
+The main purpose of this PR is to remove the need for run-time-repacking (command line argument `-rtr`) by having a tool to convert models to row-interleaved quantization types. The main motivation for providing this tool is to allow using `mmap` when loading a model and still having row-interleaved quants, so that one can combine the claimed performance gains from using 1 GiB huge pages (see [#267](https://github.com/ikawrakow/ik_llama.cpp/issues/267)) with the performance gains due to row-interleaved quants.
**Note:** this is only useful for **CPU-only** inference. The converted (repacked) model **will not work on a GPU** (or rather it will work but will be slow as all matrix multiplications with the repacked tensors will be done on the CPU).
@@ -24,25 +27,189 @@ Oh, `bf16` and `f16` models can be repacked too, one gets a `GGML_TYPE_BF16_R16`
**Caveat:** Some of the quantization types had a relatively minor, platform-specific, optimization applied when run-time-repacking. But as there is no way to tell if the repacking was done online, or if we are dealing with an offline-repacked model, I had to remove this optimization. This affects `Q8_0_R8, Q8_K_R8, Q8_KV_R8` on Zen4 (127 was added to these quants during run-time-repacking to avoid doing this during inference), and `Q4_0_R8` on ARM (a mask of `0x88` was applied to the packed bits, which converts the otherwise unsigned `Q4_0` values to signed values multiplied with 16).
-Closes #228
+Closes [#228](https://github.com/ikawrakow/ik_llama.cpp/issues/228)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-03-20** at **14:53:05**:
+👤 **ubergarm** commented on **2025-03-20** at **14:32:33**
+
+I'll take a look at this, but had an error trying to compile. Maybe a std template type thing? Not sure if I'm missing a dependency or something else. Here is the spammy build log:
+
+
+Build Error Log
+
+Let me know if you want more of the errors, I copy pasted what looks like enough to possibly see the actual error.
+```shell
+$ gcc --version
+gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
+
+$ git checkout ik/offline_repack
+Switched to branch 'ik/offline_repack'
+Your branch is up to date with 'origin/ik/offline_repack'.
+
+$ git rev-parse --short HEAD
+9fbe5bee
+
+$ rm -rf build
+$ cmake -B build -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF
+$ cmake --build build --config Release -j $(nproc)
+
+-- The C compiler identification is GNU 13.3.0
+-- The CXX compiler identification is GNU 13.3.0
+-- Detecting C compiler ABI info
+-- Detecting C compiler ABI info - done
+-- Check for working C compiler: /usr/bin/cc - skipped
+-- Detecting C compile features
+-- Detecting C compile features - done
+-- Detecting CXX compiler ABI info
+-- Detecting CXX compiler ABI info - done
+-- Check for working CXX compiler: /usr/bin/c++ - skipped
+-- Detecting CXX compile features
+-- Detecting CXX compile features - done
+-- Found Git: /usr/bin/git (found version "2.43.0")
+-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
+-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
+-- Found Threads: TRUE
+-- Found OpenMP_C: -fopenmp (found version "4.5")
+-- Found OpenMP_CXX: -fopenmp (found version "4.5")
+-- Found OpenMP: TRUE (found version "4.5")
+-- OpenMP found
+-- Using optimized iqk matrix multiplications
+-- Using llamafile
+-- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
+-- CMAKE_SYSTEM_PROCESSOR: x86_64
+-- x86 detected
+-- Configuring done (2.9s)
+-- Generating done (0.1s)
+-- Build files have been written to: /home/j/projects/ik_llama.cpp/build
+[ 1%] Building CXX object common/CMakeFiles/build_info.dir/build-info.cpp.o
+[ 2%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
+[ 3%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
+[ 3%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
+[ 3%] Building C object examples/gguf-hash/CMakeFiles/sha1.dir/deps/sha1/sha1.c.o
+[ 4%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
+[ 5%] Building C object examples/gguf-hash/CMakeFiles/xxhash.dir/deps/xxhash/xxhash.c.o
+[ 6%] Building C object examples/gguf-hash/CMakeFiles/sha256.dir/deps/sha256/sha256.c.o
+[ 6%] Building CXX object ggml/src/CMakeFiles/ggml.dir/llamafile/sgemm.cpp.o
+[ 7%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_mul_mat.cpp.o
+[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_quantize.cpp.o
+[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_flash_attn.cpp.o
+# (a few warnings that seem normal)
+[ 8%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-aarch64.c.o
+[ 8%] Built target sha1
+[ 8%] Built target build_info
+[ 8%] Built target sha256
+[ 8%] Built target xxhash
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In function ‘bool {anonymous}::is_forbidden_tensor(const std::string&)’:
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:6784:14: error: no match for ‘operator==’ (operand types are ‘const std::string’ {aka ‘con
+st std::__cxx11::basic_string’} and ‘const char [18]’)
+ 6784 | if (name == "token_embd.weight") return true;
+ | ~~~~ ^~ ~~~~~~~~~~~~~~~~~~~
+ | | |
+ | | const char [18]
+ | const std::string {aka const std::__cxx11::basic_string}
+In file included from /usr/include/x86_64-linux-gnu/c++/13/bits/c++allocator.h:33,
+ from /usr/include/c++/13/bits/allocator.h:46,
+ from /usr/include/c++/13/vector:63,
+ from /home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:17:
+/usr/include/c++/13/bits/new_allocator.h:215:9: note: candidate: ‘template bool std::operator==(const __new_allocator&, const __new_a
+llocator<_Tp>&)’
+ 215 | operator==(const __new_allocator&, const __new_allocator<_Up>&)
+ | ^~~~~~~~
+/usr/include/c++/13/bits/new_allocator.h:215:9: note: template argument deduction/substitution failed:
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:6784:17: note: mismatched types ‘const std::__new_allocator<_Tp>’ and ‘const char [18]’
+ 6784 | if (name == "token_embd.weight") return true;
+ | ^~~~~~~~~~~~~~~~~~~
+In file included from /usr/include/c++/13/bits/stl_algobase.h:64,
+ from /usr/include/c++/13/bits/specfun.h:43,
+ from /usr/include/c++/13/cmath:3699,
+ from /usr/include/c++/13/math.h:36,
+ from /home/j/projects/ik_llama.cpp/ggml/src/./ggml-impl.h:12,
+ from /home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:11:
+/usr/include/c++/13/bits/stl_pair.h:812:5: note: candidate: ‘template constexpr bool std::operator==(const pair<_T1, _T2>&, cons
+t pair<_T1, _T2>&)’
+ 812 | operator==(const pair<_T1, _T2>& __x, const pair<_T1, _T2>& __y)
+ | ^~~~~~~~
+/usr/include/c++/13/bits/stl_pair.h:812:5: note: template argument deduction/substitution failed:
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:6784:17: note: ‘const std::string’ {aka ‘const std::__cxx11::basic_string’} is not
+ derived from ‘const std::pair<_T1, _T2>’
+ 6784 | if (name == "token_embd.weight") return true;
+ | ^~~~~~~~~~~~~~~~~~~
+In file included from /usr/include/c++/13/bits/stl_algobase.h:67:
+/usr/include/c++/13/bits/stl_iterator.h:448:5: note: candidate: ‘template constexpr bool std::operator==(const reverse_iterator<_Iter
+ator>&, const reverse_iterator<_Iterator>&)’
+ 448 | operator==(const reverse_iterator<_Iterator>& __x,
+ | ^~~~~~~~
+/usr/include/c++/13/bits/stl_iterator.h:448:5: note: template argument deduction/substitution failed:
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:6784:17: note: ‘const std::string’ {aka ‘const std::__cxx11::basic_string’} is not
+ derived from ‘const std::reverse_iterator<_Iterator>’
+ 6784 | if (name == "token_embd.weight") return true;
+ | ^~~~~~~~~~~~~~~~~~~
+/usr/include/c++/13/bits/stl_iterator.h:493:5: note: candidate: ‘template constexpr bool std::operator==(const rev
+erse_iterator<_Iterator>&, const reverse_iterator<_IteratorR>&)’
+ 493 | operator==(const reverse_iterator<_IteratorL>& __x,
+ | ^~~~~~~~
+/usr/include/c++/13/bits/stl_iterator.h:493:5: note: template argument deduction/substitution failed:
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:6784:17: note: ‘const std::string’ {aka ‘const std::__cxx11::basic_string’} is not
+ derived from ‘const std::reverse_iterator<_Iterator>’
+ 6784 | if (name == "token_embd.weight") return true;
+ | ^~~~~~~~~~~~~~~~~~~
+.
+.
+.
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In function ‘bool iqk_should_modify_tensor(const ggml_tensor*)’:
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:6792:37: error: invalid initialization of reference of type ‘const std::string&’ {aka ‘con
+st std::__cxx11::basic_string&’} from expression of type ‘const char [64]’
+ 6792 | if (is_forbidden_tensor(tensor->name)) return false;
+ | ~~~~~~~~^~~~
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:6783:45: note: in passing argument 1 of ‘bool {anonymous}::is_forbidden_tensor(const std::
+string&)’
+ 6783 | bool is_forbidden_tensor(const std::string& name) {
+ | ~~~~~~~~~~~~~~~~~~~^~~~
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp: In function ‘bool iqk_modify_tensor(ggml_tensor*)’:
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:6801:37: error: invalid initialization of reference of type ‘const std::string&’ {aka ‘con
+st std::__cxx11::basic_string&’} from expression of type ‘char [64]’
+ 6801 | if (is_forbidden_tensor(tensor->name)) return false;
+ | ~~~~~~~~^~~~
+/home/j/projects/ik_llama.cpp/ggml/src/iqk/iqk_quantize.cpp:6783:45: note: in passing argument 1 of ‘bool {anonymous}::is_forbidden_tensor(const std::
+string&)’
+.
+.
+.
+gmake[2]: *** [ggml/src/CMakeFiles/ggml.dir/build.make:174: ggml/src/CMakeFiles/ggml.dir/iqk/iqk_quantize.cpp.o] Error 1
+gmake[2]: *** Waiting for unfinished jobs....
+^Cgmake[2]: *** [ggml/src/CMakeFiles/ggml.dir/build.make:146: ggml/src/CMakeFiles/ggml.dir/iqk/iqk_mul_mat.cpp.o] Interrupt
+gmake[1]: *** [CMakeFiles/Makefile2:1621: ggml/src/CMakeFiles/ggml.dir/all] Interrupt
+gmake: *** [Makefile:146: all] Interrupt
+```
+
+
+
+---
+
+👤 **ubergarm** commented on **2025-03-20** at **14:51:47**
+
+Hrmm, does `./ggml/src/iqk/iqk_quantize.cpp` just need `#include string` ? I'm fussing with it lol.. You are too xD
+
+---
+
+👤 **ikawrakow** commented on **2025-03-20** at **14:53:05**
Does the last commit fix it? Strange that we can no longer compare `std::string` to a C-string, and a reference to `std::string` is no longer automatically instantiated from a C-string. Seriously? This will brake billions of LoC of C++.
---
-👤 **ubergarm** commented the **2025-03-20** at **14:55:53**:
+👤 **ubergarm** commented on **2025-03-20** at **14:55:53**
-Seems to be compiling now on `d27b7226`. I'll go back and check if simply adding `#include string` to `./ggml/src/iqk/iqk_quantize.cpp` would also fix it to confirm.
+lol wait a sec, let me double check this:
+
+~Seems to be compiling now on `d27b7226`. I'll go back and check if simply adding `#include string` to `./ggml/src/iqk/iqk_quantize.cpp` would also fix it to confirm.~
---
-👤 **ubergarm** commented the **2025-03-20** at **14:58:43**:
+👤 **ubergarm** commented on **2025-03-20** at **14:58:43**
Yeah, just needs the include e.g.
@@ -64,17 +231,21 @@ index bc6f34eb..0375b878 100644
#include
## builds good
-```
+```
+
+Didn't need `d27b722` or `94576a5`
+
+Feel free to force push or however you want to finalize this one.
---
-👤 **ikawrakow** commented the **2025-03-20** at **15:36:25**:
+👤 **ikawrakow** commented on **2025-03-20** at **15:36:25**
I think we can leave the two unnecessary changes. If we remove the explicit string construction, the compiler does it for us anyway.
---
-👤 **ubergarm** commented the **2025-03-20** at **15:38:00**:
+👤 **ubergarm** commented on **2025-03-20** at **15:38:00**
Okay, repacking seems to be working. I'll try out the freshly generated repacked weights next.
diff --git a/github-data/pull_requests/273 - FlashMLA-3_ the best of both worlds _CPU only_.md b/github-data/pull_requests/273 - FlashMLA-3 the best of both worlds CPU only.md
similarity index 91%
rename from github-data/pull_requests/273 - FlashMLA-3_ the best of both worlds _CPU only_.md
rename to github-data/pull_requests/273 - FlashMLA-3 the best of both worlds CPU only.md
index 81e6cb5f6..d0b8599cc 100644
--- a/github-data/pull_requests/273 - FlashMLA-3_ the best of both worlds _CPU only_.md
+++ b/github-data/pull_requests/273 - FlashMLA-3 the best of both worlds CPU only.md
@@ -1,14 +1,17 @@
-### 🔀 [#273](https://github.com/ikawrakow/ik_llama.cpp/pull/273) - FlashMLA-3: the best of both worlds (CPU only)
+## 🔀 [Pull Request #273](https://github.com/ikawrakow/ik_llama.cpp/pull/273) - FlashMLA-3: the best of both worlds (CPU only)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/FlashMLA-3` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-20 |
| **Updated** | 2025-07-12 |
+| **Merged** | 2025-03-21 |
---
-#### Description
+## 📄 Description
For DeepSeek models `mla=1` has a very good TG but low PP performance. `mla=2` has better PP performance, but TG performance rapidly decreases with number of tokens in the KV cache. `mla=0` (i.e., standard attention) has the best PP performance, but TG is even lower than `mla=2`. In addition, standard attention requires a much larger KV cache than `mla = 1,2`. Here are two graphs comparing PP and TG performance of `mla=0,1,2` for DeepSeek-Lite. In all cases FA is enabled, the KV cache is quantized with `Q8_0`, the model weights are quantized with `IQ4_NL`, and the calculations are run on a Ryzen-7950X CPU. The second graph is TG speed as a function of the number of tokens in the KV cache (obtained using `llama-bench -gp Np,64`). Note the logarithmic x-axis for both graphs.
@@ -25,23 +28,23 @@ Coming back to the above graphs, `mla=3` PP performance is given by the blue cur
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-03-20** at **20:55:44**:
+👤 **ubergarm** commented on **2025-03-20** at **20:55:44**
Clever idea to combine the best of both worlds, PP with `-mla 2` and TG with `-mla 1`!
-So reading closely it sounds like `-mla 3` is for CPU *only*?
-
> the CUDA backend does not support mla=1
-fwiw I before thinking I just compiled and tried to run it with CUDA backend it outputs seemed off (was slower than usual and throwing occasional `DDDDDDD` in llama-server).
+So reading closely it sounds like `-mla 3` would also be for CPU *only*?
+
+fwiw, before thinking, I just compiled and tried to run it with CUDA backend. While `llama-server` did "run", it outputs seemed way slower tokgen generation (~3 tok/sec vs over 12 with -mla 2). Also it would throw long strings of `DDDDDDDD`... So yeah sounds like what you said, not for CUDA backend. haha...
-I hope to kick the tires on this with the intel 6980P tomorrow. Also that `IQ4_NL` might be good for a hybrid quant on that rig... So many toys to play with, thanks!
+I hope to kick the tires on this with the intel 6980P tomorrow. Also that `IQ4_NL_R4` might be good for a hybrid quant on that CPU only rig... So many toys to play with, thanks!
---
-👤 **ikawrakow** commented the **2025-03-21** at **06:23:02**:
+👤 **ikawrakow** commented on **2025-03-21** at **06:23:02**
> So reading closely it sounds like -mla 3 would also be for CPU only?
@@ -91,15 +94,15 @@ llama_print_timings: eval time = 2365.88 ms / 127 runs ( 18.63 m
llama_print_timings: total time = 2452.52 ms / 135 tokens
Log end
```
-In comparison, the same command but using `-mla 2` gives me 55 t/s.
+In comparison, the same command but using `-mla 2` gives me 56.4 t/s.
---
-👤 **saood06** commented the **2025-03-21** at **10:59:27**:
+👤 **saood06** commented on **2025-03-21** at **10:59:27**
Would it be possible to use FA for PP and no FA for TG as that would be the best of both worlds for my AVX-2 system?
-Did some testing to get a baseline to later compare against the HugePage mmap version, and PP is the best I've seen for IQ4_K_R4 when FA is turned on (IQ4_K seems like it would still perform better given I had gotten 11.5 t/s before MLA was even implemented but I don't have that quant anymore, and still not sure why it performed better than IQ4_K_R4 especially now that I've seen others use the repacked quants without this issue).
+Did some testing to get a baseline to later compare against the HugePage mmap version, and PP is the best I've seen for IQ4_K_R4 when FA is turned on (IQ4_K seems like it would still perform better given I had gotten 11.5 t/s before MLA was implemented but I don't have that quant anymore, and still not sure why it performed better than IQ4_K_R4 especially now that I've seen others use the repacked quants without this issue).
Results with FA off:
[
@@ -295,7 +298,7 @@ Results with FA on (first PP result can be ignored as there was still some model
---
-👤 **ikawrakow** commented the **2025-03-21** at **11:38:12**:
+👤 **ikawrakow** commented on **2025-03-21** at **11:38:12**
> Would it be possible to use FA for PP and no FA for TG as that would be the best of both worlds for my AVX-2 system?
@@ -303,11 +306,11 @@ I think it is the number of threads that you are using that leads to a lower TG
---
-👤 **saood06** commented the **2025-03-21** at **11:44:19**:
+👤 **saood06** commented on **2025-03-21** at **11:44:19**
> I think it is the number of threads that you are using that leads to a lower TG performance. The efficient path is not taken when the number of threads is not a power of 2. Can you try TG with 32 threads to confirm before I try to make changes?
-I already had ran some tests with 16,24,32,48 threads with FA on, results below but this is without dropping the caches like I normally do before changing thread counts.
+I already had ran some tests with 16,24,32,48 threads with FA on, results below but this is without dropping the caches like I normally do before changing thread counts (which is why I think the TG is 2.61 instead of 2.75).
| model | size | params | backend | threads | fa | mla | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --: | ---: | ------------: | ---------------: |
@@ -324,7 +327,15 @@ Sorry, won't be available to run more tests till tommorow.
---
-👤 **ikawrakow** commented the **2025-03-21** at **13:20:58**:
+👤 **ikawrakow** commented on **2025-03-21** at **12:58:20**
+
+The difference between dropping and not dropping caches is almost the same as the difference between FA off and FA on? Hope we are not chasing our tale here.
+
+But when you come around to test again, I recommend to try `-ctk q8_0`. I think the `fp16 ->fp32` conversion on your CPU is very slow, and this disproportionally affects the speed of the attention calculations when the KV cache is `fp16`. For `tg128` there should be barely any difference between the different `mla/fa` options.
+
+---
+
+👤 **ikawrakow** commented on **2025-03-21** at **13:20:58**
Here some results for all combinations of `mla=1,2,3; fa=0,1` on a Risen-5975WX (i.e., 32 Zen3 cores, so vanilla `AVX2` is being used).
@@ -379,7 +390,7 @@ Here some results for all combinations of `mla=1,2,3; fa=0,1` on a Risen-5975WX
---
-👤 **saood06** commented the **2025-03-22** at **04:25:04**:
+👤 **saood06** commented on **2025-03-22** at **04:25:04**
> Here some results for all combinations of `mla=1,2,3; fa=0,1` on a Risen-5975WX (i.e., 32 Zen3 cores, so vanilla `AVX2` is being used).
>
@@ -397,11 +408,11 @@ Here some results for all combinations of `mla=1,2,3; fa=0,1` on a Risen-5975WX
| deepseek2 16B IQ4_NL_R4 | 15.76 B | 32 | q8_0 | 1 | 2 | 1 | tg64@pp8192 | 17.13 ± 0.01 |
| deepseek2 16B IQ4_NL_R4 | 15.76 B | 32 | q8_0 | 1 | 3 | 1 | tg64@pp8192 | 26.12 ± 0.03 |
-Looking at your results with FA off, MLA-3 is similar to the lower TG of MLA-2 and not the faster MLA-1, with FA MLA-3 is similar to the faster MLA-1.
+Looking at your results with FA off, MLA-3 is similar to the lower TG of MLA-2 and not the faster MLA-1, with FA MLA-3 is similar to the faster MLA-1. Is that what is expected?
>The difference between dropping and not dropping caches is almost the same as the difference between FA off and FA on? Hope we are not chasing our tale here.
-That test was done to check the performance at 16 threads, and to get more insight into the behavior from not dropping the caches when changing thread count since I've known it's bad but haven't done enough testing to understand the variation in severity of the impact of it. The model takes 20-30 minutes to load in depending on thread count (with higher thread count taking longer).
+I know it wasn't a good test to show TG performance at 32 threads, that test was done to check the performance at 16 threads, and to get more insight into the behavior from not dropping the caches when changing thread count since I've known it's bad but haven't done enough testing to understand the variation in severity of the impact of it. The model takes 20-30 minutes to load in depending on thread count (with higher thread count taking longer).
Interestingly PP performance seems to be unaffected by not dropping the cache as the values at 32 and 48 threads match the results with dropping the cache.
@@ -426,6 +437,8 @@ I ran more tests (new tests run on commit 3d6e25c8 ) and put the results (includ
| 48 | f16 | 1 | 3 | 1 | pp512 | 10.213** | 0.17310** |
| 48 | f16 | 1 | 3 | 1 | tg128 | 2.752347 | 0.002282 |
+** I calculated these values removing the run where there was disk activity causing low performance
+
No results for q8_KV with FA on as it crashed hitting this assert `iqk_mul_mat.cpp:421: GGML_ASSERT(Nx%num_rows == 0) failed`
As you can see the best result for TG of those tested is still 48 threads with FA off and f16 type_k, and for PP it is also 48 threads but with FA on and f16 type_k. Going to q8_0 or q8_KV did help slightly when tested with 32 threads.
@@ -440,7 +453,7 @@ Also https://github.com/ikawrakow/ik_llama.cpp/pull/240 you reported FA degraded
---
-👤 **ikawrakow** commented the **2025-03-22** at **07:03:25**:
+👤 **ikawrakow** commented on **2025-03-22** at **07:03:25**
> Looking at your results with FA off, MLA-3 is similar to the lower TG of MLA-2 and not the faster MLA-1, with FA MLA-3 is similar to the faster MLA-1. Is that what is expected?
@@ -448,7 +461,7 @@ Yes. With FA off, for TG MLA-3 is identical to MLA-2. With FA on, it is identica
---
-👤 **saood06** commented the **2025-03-22** at **10:21:11**:
+👤 **saood06** commented on **2025-03-22** at **10:21:11**
Ran MLA-3 with FA through a much longer test via sweep-bench, will do the other 5 combinations as well.
@@ -486,7 +499,7 @@ The results are not ideal because of the issue with the TG performance often dro
---
-👤 **saood06** commented the **2025-03-22** at **22:38:01**:
+👤 **saood06** commented on **2025-03-22** at **22:38:01**
Here are all 6 configurations (all at 48 threads with fmoe turned on) graphed.
@@ -653,6 +666,37 @@ MLA-2 FA off
| 512 | 128 | 15872 | 273.106 | 1.87 | 111.869 | 1.14 |
+MLA-3 FA on (only tested to 13312)
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 49.300 | 10.39 | 41.575 | 3.08 |
+| 512 | 128 | 512 | 56.224 | 9.11 | 43.899 | 2.92 |
+| 512 | 128 | 1024 | 62.094 | 8.25 | 50.923 | 2.51 |
+| 512 | 128 | 1536 | 66.510 | 7.70 | 57.158 | 2.24 |
+| 512 | 128 | 2048 | 67.585 | 7.58 | 49.648 | 2.58 |
+| 512 | 128 | 2560 | 70.106 | 7.30 | 71.653 | 1.79 |
+| 512 | 128 | 3072 | 75.708 | 6.76 | 78.948 | 1.62 |
+| 512 | 128 | 3584 | 78.358 | 6.53 | 50.780 | 2.52 |
+| 512 | 128 | 4096 | 81.845 | 6.26 | 89.474 | 1.43 |
+| 512 | 128 | 4608 | 85.695 | 5.97 | 94.354 | 1.36 |
+| 512 | 128 | 5120 | 90.736 | 5.64 | 57.370 | 2.23 |
+| 512 | 128 | 5632 | 95.275 | 5.37 | 103.264 | 1.24 |
+| 512 | 128 | 6144 | 99.108 | 5.17 | 110.374 | 1.16 |
+| 512 | 128 | 6656 | 101.478 | 5.05 | 58.461 | 2.19 |
+| 512 | 128 | 7168 | 105.490 | 4.85 | 122.629 | 1.04 |
+| 512 | 128 | 7680 | 108.935 | 4.70 | 135.901 | 0.94 |
+| 512 | 128 | 8192 | 114.398 | 4.48 | 61.164 | 2.09 |
+| 512 | 128 | 8704 | 115.502 | 4.43 | 135.792 | 0.94 |
+| 512 | 128 | 9216 | 122.377 | 4.18 | 143.546 | 0.89 |
+| 512 | 128 | 9728 | 121.992 | 4.20 | 65.858 | 1.94 |
+| 512 | 128 | 10240 | 125.463 | 4.08 | 152.709 | 0.84 |
+| 512 | 128 | 10752 | 133.142 | 3.85 | 159.024 | 0.80 |
+| 512 | 128 | 11264 | 138.752 | 3.69 | 70.149 | 1.82 |
+| 512 | 128 | 11776 | 139.309 | 3.68 | 167.620 | 0.76 |
+| 512 | 128 | 12288 | 145.077 | 3.53 | 174.769 | 0.73 |
+| 512 | 128 | 12800 | 148.735 | 3.44 | 73.611 | 1.74 |
+| 512 | 128 | 13312 | 150.444 | 3.40 | 180.752 | 0.71 |
MLA-3 FA off
@@ -689,46 +733,11 @@ MLA-3 FA off
| 512 | 128 | 14336 | 249.817 | 2.05 | 104.079 | 1.23 |
| 512 | 128 | 14848 | 255.171 | 2.01 | 106.178 | 1.21 |
| 512 | 128 | 15360 | 263.535 | 1.94 | 110.075 | 1.16 |
-| 512 | 128 | 15872 | 271.336 | 1.89 | 113.361 | 1.13 |
-
-
-
-
-MLA-3 FA on
-
-| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
-|-------|--------|--------|----------|----------|----------|----------|
-| 512 | 128 | 0 | 49.300 | 10.39 | 41.575 | 3.08 |
-| 512 | 128 | 512 | 56.224 | 9.11 | 43.899 | 2.92 |
-| 512 | 128 | 1024 | 62.094 | 8.25 | 50.923 | 2.51 |
-| 512 | 128 | 1536 | 66.510 | 7.70 | 57.158 | 2.24 |
-| 512 | 128 | 2048 | 67.585 | 7.58 | 49.648 | 2.58 |
-| 512 | 128 | 2560 | 70.106 | 7.30 | 71.653 | 1.79 |
-| 512 | 128 | 3072 | 75.708 | 6.76 | 78.948 | 1.62 |
-| 512 | 128 | 3584 | 78.358 | 6.53 | 50.780 | 2.52 |
-| 512 | 128 | 4096 | 81.845 | 6.26 | 89.474 | 1.43 |
-| 512 | 128 | 4608 | 85.695 | 5.97 | 94.354 | 1.36 |
-| 512 | 128 | 5120 | 90.736 | 5.64 | 57.370 | 2.23 |
-| 512 | 128 | 5632 | 95.275 | 5.37 | 103.264 | 1.24 |
-| 512 | 128 | 6144 | 99.108 | 5.17 | 110.374 | 1.16 |
-| 512 | 128 | 6656 | 101.478 | 5.05 | 58.461 | 2.19 |
-| 512 | 128 | 7168 | 105.490 | 4.85 | 122.629 | 1.04 |
-| 512 | 128 | 7680 | 108.935 | 4.70 | 135.901 | 0.94 |
-| 512 | 128 | 8192 | 114.398 | 4.48 | 61.164 | 2.09 |
-| 512 | 128 | 8704 | 115.502 | 4.43 | 135.792 | 0.94 |
-| 512 | 128 | 9216 | 122.377 | 4.18 | 143.546 | 0.89 |
-| 512 | 128 | 9728 | 121.992 | 4.20 | 65.858 | 1.94 |
-| 512 | 128 | 10240 | 125.463 | 4.08 | 152.709 | 0.84 |
-| 512 | 128 | 10752 | 133.142 | 3.85 | 159.024 | 0.80 |
-| 512 | 128 | 11264 | 138.752 | 3.69 | 70.149 | 1.82 |
-| 512 | 128 | 11776 | 139.309 | 3.68 | 167.620 | 0.76 |
-| 512 | 128 | 12288 | 145.077 | 3.53 | 174.769 | 0.73 |
-| 512 | 128 | 12800 | 148.735 | 3.44 | 73.611 | 1.74 |
-| 512 | 128 | 13312 | 150.444 | 3.40 | 180.752 | 0.71 |
+| 512 | 128 | 15872 | 271.336 | 1.89 | 113.361 | 1.13 |
---
-👤 **magikRUKKOLA** commented the **2025-07-12** at **00:39:46**:
+👤 **magikRUKKOLA** commented on **2025-07-12** at **00:39:46**
@ikawrakow
> Simply because the CUDA backend does not support `mla=1`, and the `ggml` back-end is very opinionated about where operations should run, with its opinions often being difficult to predict.
diff --git a/github-data/pull_requests/274 - Specify tensor name regex for tensors to be repacked.md b/github-data/pull_requests/274 - Specify tensor name regex for tensors to be repacked.md
index 0935b6da2..feac09826 100644
--- a/github-data/pull_requests/274 - Specify tensor name regex for tensors to be repacked.md
+++ b/github-data/pull_requests/274 - Specify tensor name regex for tensors to be repacked.md
@@ -1,16 +1,19 @@
-### 🔀 [#274](https://github.com/ikawrakow/ik_llama.cpp/pull/274) - Specify tensor name regex for tensors to be repacked
+## 🔀 [Pull Request #274](https://github.com/ikawrakow/ik_llama.cpp/pull/274) - Specify tensor name regex for tensors to be repacked
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/offline_repack_patterns` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-21 |
| **Updated** | 2025-03-21 |
+| **Merged** | 2025-03-21 |
---
-#### Description
+## 📄 Description
-This PR follows in the footsteps of #272 and adds the ability to specify one or more regular expressions to use for matching tensor names to be repacked. This is useful for hybrid GPU/CPU inference where one will want to repack only the tensors that stay on the CPU.
+This PR follows in the footsteps of [#272](https://github.com/ikawrakow/ik_llama.cpp/issues/272) and adds the ability to specify one or more regular expressions to use for matching tensor names to be repacked. This is useful for hybrid GPU/CPU inference where one will want to repack only the tensors that stay on the CPU.
Usage
```
diff --git a/github-data/pull_requests/275 - Fix bug_ missing parentheses in logical expression.md b/github-data/pull_requests/275 - Fix bug missing parentheses in logical expression.md
similarity index 50%
rename from github-data/pull_requests/275 - Fix bug_ missing parentheses in logical expression.md
rename to github-data/pull_requests/275 - Fix bug missing parentheses in logical expression.md
index 586d789c9..ea70f7190 100644
--- a/github-data/pull_requests/275 - Fix bug_ missing parentheses in logical expression.md
+++ b/github-data/pull_requests/275 - Fix bug missing parentheses in logical expression.md
@@ -1,14 +1,17 @@
-### 🐛 [#275](https://github.com/ikawrakow/ik_llama.cpp/pull/275) - Fix bug: missing parentheses in logical expression
+## 🔀 [Pull Request #275](https://github.com/ikawrakow/ik_llama.cpp/pull/275) - Fix bug: missing parentheses in logical expression
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bug_missing_parentheses` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-21 |
| **Updated** | 2025-03-21 |
+| **Merged** | 2025-03-21 |
---
-#### Description
+## 📄 Description
This results in GGGGGGGGGGGGG when generating with mla = 3, fa = 0.
diff --git a/github-data/pull_requests/276 - Add Gemma3 support _text only_.md b/github-data/pull_requests/276 - Add Gemma3 support _text only_.md
deleted file mode 100644
index c4c597ca3..000000000
--- a/github-data/pull_requests/276 - Add Gemma3 support _text only_.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🔀 [#276](https://github.com/ikawrakow/ik_llama.cpp/pull/276) - Add Gemma3 support (text only)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-03-21 |
-| **Updated** | 2025-03-22 |
-
----
-
-#### Description
-
-Basically just the graph building. Conversion from safetensors needs to be done with upstream.
\ No newline at end of file
diff --git a/github-data/pull_requests/276 - Add Gemma3 support text only.md b/github-data/pull_requests/276 - Add Gemma3 support text only.md
new file mode 100644
index 000000000..6d4ae4a65
--- /dev/null
+++ b/github-data/pull_requests/276 - Add Gemma3 support text only.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #276](https://github.com/ikawrakow/ik_llama.cpp/pull/276) - Add Gemma3 support (text only)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/gemma3` |
+| **Target Branch** | `main` |
+| **Created** | 2025-03-21 |
+| **Updated** | 2025-03-22 |
+| **Merged** | 2025-03-22 |
+
+---
+
+## 📄 Description
+
+Basically just the graph building. Conversion from safetensors needs to be done with upstream.
\ No newline at end of file
diff --git a/github-data/pull_requests/277 - Attempt to improve FlashMLA on the CPU.md b/github-data/pull_requests/277 - Attempt to improve FlashMLA on the CPU.md
index 715fa7f7d..149793246 100644
--- a/github-data/pull_requests/277 - Attempt to improve FlashMLA on the CPU.md
+++ b/github-data/pull_requests/277 - Attempt to improve FlashMLA on the CPU.md
@@ -1,14 +1,17 @@
-### 🔀 [#277](https://github.com/ikawrakow/ik_llama.cpp/pull/277) - Attempt to improve FlashMLA on the CPU
+## 🔀 [Pull Request #277](https://github.com/ikawrakow/ik_llama.cpp/pull/277) - Attempt to improve FlashMLA on the CPU
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/better_flash_mla` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-22 |
| **Updated** | 2025-03-23 |
+| **Merged** | 2025-03-23 |
---
-#### Description
+## 📄 Description
@saood06 Can you try if this works better for your setup with `-mla 3 -fa`? Thanks.
@@ -16,15 +19,15 @@ There is a faster path for TG with FA and `mla=1,3`. But it only gets taken if s
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-03-22** at **10:59:25**:
+👤 **saood06** commented on **2025-03-22** at **10:59:25**
-I'll test this with sweep-bench after the other 5 tests finish, as these tests take a long time and I'm stepping away from my desk right now.
+I'll test this with sweep-bench after the other 5 tests finish, as these tests take a long time and I won't be at my desk till tomorrow.
---
-👤 **saood06** commented the **2025-03-23** at **01:12:52**:
+👤 **saood06** commented on **2025-03-23** at **01:12:52**
@ikawrakow
@@ -36,7 +39,7 @@ And also here's PP since it was generated anyway
It seems a bit better (not counting the dips), but also far less dippy.
-Raw results for just the new one (the other two results can be found [here](https://github.com/ikawrakow/ik_llama.cpp/pull/273#issuecomment-2745899802):
+Raw results for just the new one (the other two results can be found [here](https://github.com/ikawrakow/ik_llama.cpp/pull/273#issuecomment-2745899802)):
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
@@ -75,7 +78,7 @@ Raw results for just the new one (the other two results can be found [here](http
---
-👤 **ikawrakow** commented the **2025-03-23** at **06:28:14**:
+👤 **ikawrakow** commented on **2025-03-23** at **06:28:14**
Thank you for these results.
diff --git a/github-data/pull_requests/278 - Test transparent huge pages on Linux.md b/github-data/pull_requests/278 - Test transparent huge pages on Linux.md
index 40595cd5a..2fe2fa9d7 100644
--- a/github-data/pull_requests/278 - Test transparent huge pages on Linux.md
+++ b/github-data/pull_requests/278 - Test transparent huge pages on Linux.md
@@ -1,16 +1,19 @@
-### 🔀 [#278](https://github.com/ikawrakow/ik_llama.cpp/pull/278) - Test transparent huge pages on Linux
+## 🔀 [Pull Request #278](https://github.com/ikawrakow/ik_llama.cpp/pull/278) - Test transparent huge pages on Linux
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/test_thp` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-22 |
| **Updated** | 2025-03-25 |
+| **Merged** | 2025-03-23 |
---
-#### Description
+## 📄 Description
-In #267 @orca-zhang observes significant performance gains using 1 GiB huge pages, so I decided to see if I can reproduce.
+In [#267](https://github.com/ikawrakow/ik_llama.cpp/issues/267) @orca-zhang observes significant performance gains using 1 GiB huge pages, so I decided to see if I can reproduce.
This PR adds the option to use transparent huge pages (THP) on Linux. To use it, just add `-thp` to the command line (but note that it is only invoked if also `mmap` is being used).
@@ -57,9 +60,9 @@ where `N` is how many 1 GiB huge pages you want reserved.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-03-22** at **17:20:30**:
+👤 **ubergarm** commented on **2025-03-22** at **17:20:30**
Testing THP Feature
===
@@ -67,7 +70,7 @@ Testing manually allocated huge pages via `-thp` flag.
## Initial Results with 2MiB Huge Pages
-My quick methodology was to throw a medium length <4k `Prompt 1` at `llama-server` followed up with a very short `Prompt 2` question about the response. Only ran two repitions but seems like some speed boost with 2MiB huge pages pre-allocated and enabled.
+My quick methodology was to throw a medium length <4k `Prompt 1` at `llama-server` followed up with a very short `Prompt 2` question about the response. Only ran two repetitions for each case, but seems like there *is* some speed boost with 2MiB huge pages pre-allocated and enabled on this single NUMA node CPU only intel xeon 6980P rig test case.
| Prompt | `-thp` | pp | tg |
| ------ | ------ | --- | --- |
@@ -87,6 +90,7 @@ My quick methodology was to throw a medium length <4k `Prompt 1` at `llama-serve
3. You need enough huge pages pre-allocated on a single NUMA node to fit entire model (can't run partially off disk).
4. Using even standard 2MiB huge pages seems to give ~12% speed boost for token generation in this CPU only single NUMA node test case.
5. I had trouble allocating 1GiB huge pages on a different test rig, and didn't want to reboot it with GRUB stuff either.
+6. Upon control+c exiting `llama-server -thp` it throws a warning `warning: munmap failed: Invalid argument`
## Conclusion
@@ -145,7 +149,7 @@ $ cat /boot/config-6.8.0-55-generic | grep THP_FOR_FS
## Test Case
```bash
-## start benchmark without `-thp`
+## start benchmark with new `-thp` feature enabled
numactl -N 0 -m 0 \
./build/bin/llama-server \
-thp \
@@ -404,7 +408,7 @@ Hugetlb: 0 kB
---
-👤 **ikawrakow** commented the **2025-03-22** at **17:41:25**:
+👤 **ikawrakow** commented on **2025-03-22** at **17:41:25**
> Seems like llama-bench doesn't support -thp 1, only llama-server
@@ -412,7 +416,7 @@ It will work in any of the executables that use `common` (`llama-server, llama-c
> This seems to be for manually pre-allocated huge pages, not for "transparent" "Anon" huge pages (THPs).
-No, these are THP. The way it works, you ask the kernel to give you `N` huge pages (e.g., with `mmap(..., MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETL))`. It will do it only if possible (enough huge pages are available), else the system call will fail. It won't on its own reshuffle virtual pages to free up space for you. Hence, if you want to make sure that getting the necessary number of huge pages will always succeed, it is better to pre-allocate them. At least that's my understanding of it. Either way, what I do in this PR is exactly what XuanWuLab did in the quoted post in #267.
+No, these are THP. The way it works, you ask the kernel to give you `N` huge pages (e.g., with `mmap(..., MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETL))`. It will do it only if possible (enough huge pages are available), else the system call will fail. It won't on its own reshuffle virtual pages to free up space for you. Hence, if you want to make sure that getting the necessary number of huge pages will always succeed, it is better to pre-allocate them. At least that's my understanding of it. Either way, what I do in this PR is exactly what XuanWuLab did in the post quoted in [#267](https://github.com/ikawrakow/ik_llama.cpp/issues/267).
> Upon control+c exiting llama-server -thp it throws a warning warning: munmap failed: Invalid argument
@@ -420,20 +424,22 @@ I had that too and I thought I had fixed it. I no longer get this warning on my
---
-👤 **ikawrakow** commented the **2025-03-22** at **18:02:11**:
+👤 **ikawrakow** commented on **2025-03-22** at **18:02:11**
So, `llama-bench` has `-thp` with the last commit. As changing `-thp` needs a model reload, it cannot be used to run `thp=0` and `thp=1` in the same run (same as `-rtr`).
---
-👤 **ubergarm** commented the **2025-03-22** at **21:40:24**:
+👤 **ubergarm** commented on **2025-03-22** at **21:40:24**
-Benchmarking Explicit Huge Pages
+Benchmarking Huge Pages
===
CPU only inference using single socket of dual Intel Xeon 6980P with offline-repacked unsloth/`DeepSeek-R1-Q4_K_R4` 671B @ 376.65GB file size.
## tl;dr;
+Manually allocating enough 2MiB Huge Pages to fit the entire model improved `tg` just over 12% of baseline and improved `pp` around 5%.
+
| thp | test | t/s |
| --: | ------------: | ---------------: |
| 1 | tg64@pp512 | 8.87 ± 0.00 |
@@ -464,7 +470,9 @@ Thanks for adding the CLI argument to `llama-bench`. It does seem to provide som
Yes, regarding *Transparent* vs *Explicit* Huge Pages name, the important thing is as you mention it is the same strategy as XuanWuLab.
-I did a [little experiment](https://github.com/ubergarm/ik_llama.cpp/pull/1) and explanation of the difference on my local system with what I am calling *THP*, and enabling it seemed to actually hurt performance. Not enough RAM to test manually allocating Explicit Huge Pages on my local rig unfortunately.
+I did a [little experiment](https://github.com/ubergarm/ik_llama.cpp/pull/1) and explanation of the difference on my local system with what I am calling *THP*, and enabling that seemed to actually hurt performance.
+
+Not enough RAM to test your PR manually allocating Explicit Huge Pages on my local rig unfortunately.
Thanks!
@@ -729,13 +737,13 @@ Hugetlb: 0 kB
---
-👤 **ikawrakow** commented the **2025-03-23** at **06:24:32**:
+👤 **ikawrakow** commented on **2025-03-23** at **06:24:32**
It looks like this can be useful, so I'll merge it.
---
-👤 **ubergarm** commented the **2025-03-23** at **19:32:09**:
+👤 **ubergarm** commented on **2025-03-23** at **19:32:09**
Okay, I think I kind of understand things better now and have some interesting benchmark results.
@@ -744,6 +752,8 @@ Some systems will likely benefit from using Huge Pages. You can use either Expli
There are some differences and depending on your exact requirements you may choose to use one or the other. For example, Explicit Huge Pages may support 1GiB sizes whereas THPs may not. THPs don't consume RAM when the model is not loaded as they are not manually pre-allocated.
+Both methods seem to require enough RAM to fit the entire model weights.
+
## Explicit Huge Pages
Explicit huge pages are configured manually at boot time or before loading the model weights. These huge pages will consume RAM even when not in use and require special code changes contained in this PR.
@@ -758,16 +768,16 @@ $ grep Huge /proc/meminfo
AnonHugePages: 1857536 kB # <--- random other small stuff is using THPs
ShmemHugePages: 0 kB
FileHugePages: 0 kB
-HugePages_Total: 400000 # <--- I allocated twice as much given 2x NUMA nodes
-HugePages_Free: 207154 # <--- model is loaded into Explicit Huge Pages
+HugePages_Total: 400000 # <--- I allocated twice as much given 2x NUMA nodes
+HugePages_Free: 207154 # <--- model is loaded into Explicit Huge Pages
HugePages_Rsvd: 0
HugePages_Surp: 0
-Hugepagesize: 2048 kB # <--- standard 2MiB Hugepagesize, feel free to try 1Gib and report back!
+Hugepagesize: 2048 kB # <--- standard 2MiB Hugepagesize, feel free to try 1Gib and report back!
Hugetlb: 819200000 kB
```
## Transparent Huge Pages
-If you want to use Transparent Huge Pages (THPs), you can enable them system wide before starting the application. This is simple enough and does not require any special `MADV_HUGEPAGE` code changes. It does not require any special code changes. It does probably require you use `--mmap 0` though which means you need enough RAM to hold the entire model weights.
+If you want to use Transparent Huge Pages (THPs), you can enable them system wide before starting the application. This is simple enough and does not require any special `MADV_HUGEPAGE` code change. It does seem to require you use `--mmap 0` though which means you need enough RAM to hold the entire model weights.
```bash
## set to always so code does not require `MADV_HUGEPAGE`
@@ -796,8 +806,8 @@ HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
-Hugepagesize: 2048 kB # <--- doesn't matter as THPs are maybe always 2MiB regardless?
-Hugetlb: 0 kB # <--- no need for Explicit Huge Pages
+Hugepagesize: 2048 kB # <--- doesn't matter as THPs are maybe always 2MiB regardless?
+Hugetlb: 0 kB # <--- no need for Explicit Huge Pages
```
@@ -835,22 +845,22 @@ This excellent PR gives you the flexibility to use whatever makes sense for your
| 128 | - | 1 | tg64@pp512 | 9.10 ± 0.00 |
| 128 | - | 1 | tg64@pp8192 | 7.75 ± 0.00 |
| Transparent Huge pages | | | | |
-| 64 | 0 | 1 | pp512 | 96.76 ± 4.34 |
-| 64 | 0 | 1 | pp8192 | 65.51 ± 0.30 |
-| 64 | 0 | 1 | tg64@pp512 | 9.53 ± 0.00 |
-| 64 | 0 | 1 | tg64@pp8192 | 7.67 ± 0.02 |
-| 96 | 0 | 1 | pp512 | 117.02 ± 0.07 |
-| 96 | 0 | 1 | pp8192 | 83.29 ± 0.65 |
-| 96 | 0 | 1 | tg64@pp512 | 9.32 ± 0.00 |
-| 96 | 0 | 1 | tg64@pp8192 | 8.17 ± 0.01 |
-| 128 | 0 | 1 | pp512 | 143.88 ± 6.28 |
-| 128 | 0 | 1 | pp8192 | 101.05 ± 0.02 |
-| 128 | 0 | 1 | tg64@pp512 | 9.26 ± 0.00 |
-| 128 | 0 | 1 | tg64@pp8192 | 7.85 ± 0.01 |
+| 64 | 0 | 0 | pp512 | 96.76 ± 4.34 |
+| 64 | 0 | 0 | pp8192 | 65.51 ± 0.30 |
+| 64 | 0 | 0 | tg64@pp512 | 9.53 ± 0.00 |
+| 64 | 0 | 0 | tg64@pp8192 | 7.67 ± 0.02 |
+| 96 | 0 | 0 | pp512 | 117.02 ± 0.07 |
+| 96 | 0 | 0 | pp8192 | 83.29 ± 0.65 |
+| 96 | 0 | 0 | tg64@pp512 | 9.32 ± 0.00 |
+| 96 | 0 | 0 | tg64@pp8192 | 8.17 ± 0.01 |
+| 128 | 0 | 0 | pp512 | 143.88 ± 6.28 |
+| 128 | 0 | 0 | pp8192 | 101.05 ± 0.02 |
+| 128 | 0 | 0 | tg64@pp512 | 9.26 ± 0.00 |
+| 128 | 0 | 0 | tg64@pp8192 | 7.85 ± 0.01 |
---
-👤 **ikawrakow** commented the **2025-03-24** at **08:32:27**:
+👤 **ikawrakow** commented on **2025-03-24** at **08:32:27**
@ubergarm Thank you for this.
@@ -885,7 +895,36 @@ So, as you say, some systems will benefit from THP or "Explicit huge pages". Thi
---
-👤 **ikawrakow** commented the **2025-03-24** at **16:38:10**:
+👤 **ubergarm** commented on **2025-03-24** at **14:37:51**
+
+Thanks for giving it a try. Interesting results, and similar to my own experience on the 7965WX 24-Core with 256GB RAM (BIOS=NPS1) where it wouldn't allocate enough THP for the entire model and performance seemed worse as well.
+
+> some systems will benefit from THP or "Explicit huge pages"
+
+Yeah, I've seen some big database vendors suggest disable THP completely. It really depends on the workload and system configuration. The best advise I heard in some technical youtube talks is "benchmark benchmark benchmark" to decide if it helps or not lol...
+
+The one system it does help is that big dual socket intel xeon 6980P which has 1.5TB RAM. So possibly 2MiB hugepages (transparent or explicit) reduces the insane number of 4k pages down to something more manageable. Or possibly related to NUMA stuff, hard to say.
+
+> after running the THP experiment, performance without THP dropped to THP levels
+
+Wow, I guess memory fragmentation or something? I don't know enough of the low level kernel TLB virtual memory system to fully understand.
+
+Glad you got it back up to baseline. You can turn that stuff back off afterwards e.g.
+
+```
+## drop all caches
+$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
+
+## put thp back to whatever system defaults you prefer (or simply reboot)
+$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
+$ echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
+```
+
+Anyway, this fork has all the options someone needs to benchmark their setup for best performance which is great!
+
+---
+
+👤 **ikawrakow** commented on **2025-03-24** at **16:38:10**
> You can turn that stuff back off afterwards e.g.
@@ -895,7 +934,7 @@ I don't really know what happens in the kernel, but my guess is that due to cach
---
-👤 **saood06** commented the **2025-03-25** at **11:48:08**:
+👤 **saood06** commented on **2025-03-25** at **11:48:08**
> To enable 1 GiB huge pages, you need to add
>
@@ -913,4 +952,6 @@ I don't really know what happens in the kernel, but my guess is that due to cach
The instructions differ if you do not have GRUB, as is the case for example on clear linux, where to enable it follow [this](https://www.clearlinux.org/clear-linux-documentation/guides/maintenance/configure-hugepages.html) guide.
-I didn't test 2 MB pages, as it failed with `llama_mmap: mmap with huge page size 2 MiB failed (Cannot allocate memory)` and hugeadm was not trivially available (not in clear linux's package manager) and I didn't bother installing [libhugetlbfs](https://github.com/libhugetlbfs/libhugetlbfs) from source.
\ No newline at end of file
+I didn't test 2 MB pages, as it failed with `llama_mmap: mmap with huge page size 2 MiB failed (Cannot allocate memory)` and hugeadm was not trivially available (not in clear linux's package manager) and I didn't bother installing [libhugetlbfs](https://github.com/libhugetlbfs/libhugetlbfs) from source.
+
+It's still a bit disappointing that the 1 GiB huge pages led to much worse performance, maybe 2 MiB would be better, may test later.
\ No newline at end of file
diff --git a/github-data/pull_requests/279 - Fighting with cmake.md b/github-data/pull_requests/279 - Fighting with cmake.md
index 62916f77e..7ca675e0d 100644
--- a/github-data/pull_requests/279 - Fighting with cmake.md
+++ b/github-data/pull_requests/279 - Fighting with cmake.md
@@ -1,14 +1,17 @@
-### 🔀 [#279](https://github.com/ikawrakow/ik_llama.cpp/pull/279) - Fighting with cmake
+## 🔀 [Pull Request #279](https://github.com/ikawrakow/ik_llama.cpp/pull/279) - Fighting with cmake
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_again_cmake` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-22 |
| **Updated** | 2025-03-22 |
+| **Merged** | 2025-03-22 |
---
-#### Description
+## 📄 Description
`cmake` has the unpleasant habit of using "response" files to put stuff such as list of include directories. But that confuses `vim` (or at least it does the way I have set it up) when I edit CUDA files. I had tricked `cmake` into not using "response" files, but instead adding all `nvcc` command line options into `compile_commands.json`. But at some point that stopped working, I guess after a system update. I hate it, so this PR restores the desired behavior. I had to add
```
diff --git a/github-data/pull_requests/28 - Binary KQ mask.md b/github-data/pull_requests/28 - Binary KQ mask.md
index 506c11c51..e3b2ad043 100644
--- a/github-data/pull_requests/28 - Binary KQ mask.md
+++ b/github-data/pull_requests/28 - Binary KQ mask.md
@@ -1,15 +1,17 @@
-### 🔀 [#28](https://github.com/ikawrakow/ik_llama.cpp/pull/28) - Binary KQ mask
+## 🔀 [Pull Request #28](https://github.com/ikawrakow/ik_llama.cpp/pull/28) - Binary KQ mask
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ✅ **Open** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `ik/kq_mask` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-28 |
---
-#### Description
+## 📄 Description
-This PR is another attempt to improve performance for large contexts, see #25
+This PR is another attempt to improve performance for large contexts, see [#25](https://github.com/ikawrakow/ik_llama.cpp/issues/25)
Basically, when we want to process a very long context, the KQ mask, which is stored as `f32` (or `f16`, if using flash attention), becomes quite significant in size. If running on the GPU, the cost for copying the KQ mask to the GPU (the mask is created on the host CPU) becomes non-negligible. If running on a CPU that has limited memory bandwidth (basically all `x86` or `x86_64`), the KQ mask may not fit in the cache, or if it does fit it reduces the cache available for other data by a significant amount, which results in a measurable impact on the performance of the `SOFT_MAX` (or the new fused `SOFT_CAP_MAX`) operation. Hence, it will be desirable to reduce the size of the KQ mask.
diff --git a/github-data/pull_requests/280 - Native build ooption for CUDA when GGML_NATIVE is set.md b/github-data/pull_requests/280 - Native build ooption for CUDA when GGML_NATIVE is set.md
index 2b0bb1bb3..a81676c5a 100644
--- a/github-data/pull_requests/280 - Native build ooption for CUDA when GGML_NATIVE is set.md
+++ b/github-data/pull_requests/280 - Native build ooption for CUDA when GGML_NATIVE is set.md
@@ -1,13 +1,16 @@
-### 🔀 [#280](https://github.com/ikawrakow/ik_llama.cpp/pull/280) - Native build ooption for CUDA when GGML_NATIVE is set
+## 🔀 [Pull Request #280](https://github.com/ikawrakow/ik_llama.cpp/pull/280) - Native build ooption for CUDA when GGML_NATIVE is set
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_native` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-22 |
| **Updated** | 2025-03-22 |
+| **Merged** | 2025-03-22 |
---
-#### Description
+## 📄 Description
Speeds up CUDA build time 3X
\ No newline at end of file
diff --git a/github-data/pull_requests/282 - Improve DeepSeek batched processing speed.md b/github-data/pull_requests/282 - Improve DeepSeek batched processing speed.md
index bbf7f3f5b..fe01bf418 100644
--- a/github-data/pull_requests/282 - Improve DeepSeek batched processing speed.md
+++ b/github-data/pull_requests/282 - Improve DeepSeek batched processing speed.md
@@ -1,14 +1,17 @@
-### 🔀 [#282](https://github.com/ikawrakow/ik_llama.cpp/pull/282) - Improve DeepSeek batched processing speed
+## 🔀 [Pull Request #282](https://github.com/ikawrakow/ik_llama.cpp/pull/282) - Improve DeepSeek batched processing speed
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/better_batched_processing` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-23 |
| **Updated** | 2025-03-23 |
+| **Merged** | 2025-03-23 |
---
-#### Description
+## 📄 Description
I was looking into the batched processing performance dips observed by @saood06 [here](https://github.com/ikawrakow/ik_llama.cpp/pull/277#issuecomment-2745952185) and I saw this for DeepSeek-Lite:
@@ -34,13 +37,13 @@ Concerning DeepSeek-R1, there is a small change in this PR that I hope will redu
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-03-23** at **11:32:00**:
+👤 **saood06** commented on **2025-03-23** at **11:32:00**
>Concerning DeepSeek-R1, there is a small change in this PR that I hope will reduce the performance dips observed by @saood06
-Running sweep bench and will post full results with graph when they finish, but right now but early results look promising showing
+Running sweep bench and will post full results with graph when they finish, but right now but early results look promising, table with early values below
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
@@ -62,7 +65,7 @@ I see you pushed another commit, should I stop this test and recompile and run t
---
-👤 **ikawrakow** commented the **2025-03-23** at **11:34:40**:
+👤 **ikawrakow** commented on **2025-03-23** at **11:34:40**
> I see you pushed another commit, should I stop this test and recompile and run the new commit?
@@ -70,7 +73,7 @@ This will only affect results for `B > 128`, so beyond the range where you are t
---
-👤 **ikawrakow** commented the **2025-03-23** at **11:51:34**:
+👤 **ikawrakow** commented on **2025-03-23** at **11:51:34**
What would be very interesting is to run PP benchmarks with DeepSeek-V3/R1 with `./bin/llama-bench -mla 3 -fa 1 -fmoe 1 -p 32,64,128,192,256,320,384,448,512,576,640,704,768` with
* [This line](https://github.com/ikawrakow/ik_llama.cpp/blob/5a4855e61c05b0c54ecad3f4155074d8f344b6f6/src/llama.cpp#L13899) changed to `pp_opt = true`;
@@ -80,7 +83,7 @@ This will help understand if the crossover between "TG optimized" and "PP optimi
---
-👤 **saood06** commented the **2025-03-23** at **13:28:00**:
+👤 **saood06** commented on **2025-03-23** at **13:28:00**
> What would be very interesting is to run PP benchmarks with DeepSeek-V3/R1 with `./bin/llama-bench -mla 3 -fa 1 -fmoe 1 -p 32,64,128,192,256,320,384,448,512,576,640,704,768` with
>
@@ -91,11 +94,11 @@ This will help understand if the crossover between "TG optimized" and "PP optimi
>
> This will help understand if the crossover between "TG optimized" and "PP optimized" is somehow dependent on the number of heads, or if it is just a (perhaps somewhat computer dependent) constant. I can see arguments for both options, so the only way to understand is to just test.
-Running now, each config is going to take ~50 minutes.
+Running now, each config is going to take at least 50 minutes (based on my estimation from the beginning of the first run), I may not be around to post it till later.
---
-👤 **saood06** commented the **2025-03-23** at **16:56:44**:
+👤 **saood06** commented on **2025-03-23** at **16:56:44**
@ikawrakow Here's the benchmark you asked for:
@@ -161,6 +164,6 @@ I'm going to reboot my machine now to enable 1GB hugepages and mitigations=off a
---
-👤 **ikawrakow** commented the **2025-03-23** at **17:10:32**:
+👤 **ikawrakow** commented on **2025-03-23** at **17:10:32**
Thanks, this is great! It looks like a threshold of 128 tokens is not a bad choice for DeepSeek-R1 as well.
\ No newline at end of file
diff --git a/github-data/pull_requests/283 - CUDA_ better MoE implementation.md b/github-data/pull_requests/283 - CUDA better MoE implementation.md
similarity index 78%
rename from github-data/pull_requests/283 - CUDA_ better MoE implementation.md
rename to github-data/pull_requests/283 - CUDA better MoE implementation.md
index 98c16600b..25c0e3382 100644
--- a/github-data/pull_requests/283 - CUDA_ better MoE implementation.md
+++ b/github-data/pull_requests/283 - CUDA better MoE implementation.md
@@ -1,16 +1,19 @@
-### 🔀 [#283](https://github.com/ikawrakow/ik_llama.cpp/pull/283) - CUDA: better MoE implementation
+## 🔀 [Pull Request #283](https://github.com/ikawrakow/ik_llama.cpp/pull/283) - CUDA: better MoE implementation
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_better_moe` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-24 |
| **Updated** | 2025-04-05 |
+| **Merged** | 2025-03-25 |
---
-#### Description
+## 📄 Description
-This PR makes "indirect" matrix multiplications as used for MoE models inference reproducible on CUDA, and closes #249
+This PR makes "indirect" matrix multiplications as used for MoE models inference reproducible on CUDA, and closes [#249](https://github.com/ikawrakow/ik_llama.cpp/issues/249)
As a bonus, we get a ~10% PP speedup as measured with DeepSeek-Lite. I wouldn't be surprised if the benefit is even larger for DeepSeek-R1 as it has 4X more experts than DeepSeek-Lite.
@@ -18,9 +21,9 @@ The culprit for non-reproducible results and sluggish performance was the `k_cop
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-03-24** at **16:48:31**:
+👤 **ubergarm** commented on **2025-03-24** at **16:48:31**
I pulled and built this branch and benchmarked speed vs main branch as well as llama-perplexity runs.
@@ -30,6 +33,7 @@ Also two back to back runs of `llama-perplexity` gave what looks like the same r
Let me know if there is some other condition or way to test. Thanks!
+Details inside this fold 👇
Benchmark and Testing Logs
@@ -111,6 +115,8 @@ Final estimate: PPL = 3.6989 +/- 0.02106
$ git rev-parse --short HEAD
7f6980fa
+main: build = 3610 (7f6980fa)
+
## Run 1
perplexity: tokenizing the input ..
@@ -146,13 +152,13 @@ llama_print_timings: total time = 2838434.18 ms / 287233 tokens
---
-👤 **ikawrakow** commented the **2025-03-24** at **17:21:53**:
+👤 **ikawrakow** commented on **2025-03-24** at **17:21:53**
Thanks for testing. You are running the MoE experts on the CPU, so you are not supposed to see a difference (and is good you confirm that you don't). At least part of the MoE experts need to run on the GPU to see a benefit (or at least a difference). I expect @davidsyoung with his 16 x 3090 configuration to see PP performance uplift.
---
-👤 **davidsyoung** commented the **2025-03-24** at **18:24:25**:
+👤 **davidsyoung** commented on **2025-03-24** at **18:24:25**
Awesome work!
@@ -162,16 +168,24 @@ Any particular `llama-bench` you'd like @ikawrakow?
---
-👤 **davidsyoung** commented the **2025-03-24** at **18:36:17**:
+👤 **ikawrakow** commented on **2025-03-24** at **18:29:30**
+
+> Any particular llama-bench you'd like @ikawrakow?
+
+I don't expect much of a difference for TG, so PP is the interesting benchmark to run. If you observe an effect, it shouldn't depend on any of the attention options (`mla`), so you can pick your favorite and run `llama-bench` for `-p 512,1024,2048,4096,8192 -ub 2048` with a model where you already have PP benchmarks for. Thanks!
+
+---
+
+👤 **davidsyoung** commented on **2025-03-24** at **18:36:17**
Will run both PP and TG for completeness, running:
`./llama-bench -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-iq4_xs__iq3_s_q8.gguf -b 2048 -ub 2048 -fa 1 -mla 2 -amb 128 -fmoe 1 -r 2 -p 512,1024,2048,4096,8192 -n 128,256,512,1024,2048 -n 0 -ngl 63 `
-# Comparable data from #266:
+# Comparable data from [#266](https://github.com/ikawrakow/ik_llama.cpp/issues/266):
-| model | size | params | backend | ngl | n_ubatch | fa | mla | amb | fmoe | test | t/s |
-| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --: | ----: | ---: | ------------: | ---------------: |
+| model | size | params | backend | ngl | n_batch | n_ubatch | fa | mla | amb | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --: | ----: | ---: | ------------: | ---------------: |
| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 2048 | 1 | 2 | 128 | 1 | pp512 | 238.52 ± 1.44 |
| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 2048 | 1 | 2 | 128 | 1 | pp1024 | 304.77 ± 0.07 |
| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 2048 | 1 | 2 | 128 | 1 | pp2048 | 348.11 ± 0.69 |
@@ -189,13 +203,21 @@ Will run both PP and TG for completeness, running:
| model | size | params | backend | ngl | n_ubatch | fa | mla | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --: | ----: | ---: | ------------: | ---------------: |
+| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 1 | 2 | 128 | 1 | pp512 | 235.44 ± 7.21 |
+| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 1 | 2 | 128 | 1 | pp1024 | 320.37 ± 3.14 |
+| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 1 | 2 | 128 | 1 | pp2048 | 375.89 ± 2.37 |
+| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 1 | 2 | 128 | 1 | pp4096 | 351.44 ± 0.69 |
+| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 1 | 2 | 128 | 1 | pp8192 | 305.28 ± 1.27 |
+| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 1 | 2 | 128 | 1 | tg128 | 16.84 ± 0.18 |
+| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 1 | 2 | 128 | 1 | tg256 | 17.52 ± 0.15 |
+| deepseek2 671B Q8_0 | 307.20 GiB | 672.05 B | CUDA | 63 | 2048 | 1 | 2 | 128 | 1 | tg512 | 17.72 ± 0.09 |
-_will edit as it completes_
+_I added `-r 10` so we would see less variance with the runs. The previous datapoints were fully warm with being part of a big overnight bench._
---
-👤 **davidsyoung** commented the **2025-03-24** at **19:30:14**:
+👤 **davidsyoung** commented on **2025-03-24** at **19:30:14**
Awesome improvement! @ikawrakow
@@ -203,13 +225,21 @@ Awesome improvement! @ikawrakow
---
-👤 **ikawrakow** commented the **2025-03-25** at **06:47:06**:
+👤 **ikawrakow** commented on **2025-03-25** at **06:47:06**
This looks like a winner, merging.
---
-👤 **ikawrakow** commented the **2025-03-25** at **18:37:34**:
+👤 **davidsyoung** commented on **2025-03-25** at **16:26:22**
+
+> This looks like a winner, merging.
+
+Awesome work. Thank you. You are starting to get near to VLLM performance on PP.
+
+---
+
+👤 **ikawrakow** commented on **2025-03-25** at **18:37:34**
> Awesome work. Thank you. You are starting to get near to VLLM performance on PP.
@@ -217,7 +247,7 @@ How far am I from vLLM?
---
-👤 **saood06** commented the **2025-03-25** at **18:45:41**:
+👤 **saood06** commented on **2025-03-25** at **18:45:41**
> > This looks like a winner, merging.
>
@@ -227,7 +257,7 @@ And what about sglang which is supposedly even better for Deepseek? Also what ab
---
-👤 **davidsyoung** commented the **2025-03-25** at **21:09:44**:
+👤 **davidsyoung** commented on **2025-03-25** at **21:09:44**
vLLM currently has an overflow issue (for myself personally), with Q3. So it’s not usable (this is with gguf).
@@ -241,7 +271,7 @@ But again, it’s broken at the moment.
---
-👤 **saood06** commented the **2025-03-25** at **21:17:08**:
+👤 **saood06** commented on **2025-03-25** at **21:17:08**
>Sglang has no gguf support.
@@ -249,7 +279,7 @@ As mentioned before, you might fit AWQ, and that quant has good support on sglan
---
-👤 **davidsyoung** commented the **2025-03-25** at **23:52:05**:
+👤 **davidsyoung** commented on **2025-03-25** at **23:52:05**
> > Sglang has no gguf support.
>
@@ -259,7 +289,7 @@ Unfortunately not, I’m a bit short of VRAM. If AWQ had 3 bit or 3.5bit possibl
---
-👤 **saood06** commented the **2025-03-26** at **00:59:31**:
+👤 **saood06** commented on **2025-03-26** at **00:59:31**
> > > Sglang has no gguf support.
> >
@@ -272,7 +302,7 @@ That is really unfortunate, as 16x 24GB cards would have probably been the cheap
---
-👤 **ikawrakow** commented the **2025-03-26** at **10:14:53**:
+👤 **ikawrakow** commented on **2025-03-26** at **10:14:53**
> As mentioned before, you might fit AWQ, and that quant has good support on sglang.
@@ -282,7 +312,7 @@ You seem to be recommending AWQ quants. On my book AWQ quants are pretty low qua
---
-👤 **saood06** commented the **2025-03-27** at **04:17:57**:
+👤 **saood06** commented on **2025-03-27** at **04:17:57**
> > As mentioned before, you might fit AWQ, and that quant has good support on sglang.
>
@@ -294,106 +324,60 @@ I'm not sure, I haven't looked deeply into AWQ in a while, I was just curious ab
---
-👤 **JohannesGaessler** submitted a review the **2025-04-05** at **10:04:54**: 💬 `COMMENTED`
-
----
-
-👤 **JohannesGaessler** commented during a code review the **2025-04-05** at **10:04:54** on `ggml/src/ggml-cuda.cu`:
+👤 **JohannesGaessler** started a conversation on `ggml/src/ggml-cuda.cu` on **2025-04-05** at **10:04:54**
This synchronization is not safe to remove. `ids_host` and `rmapping` are deallocated when they go out of scope and the source pointers for `cudaMemcpyAsync` become dangling pointers. As the name implies, the memcpy is asynchronous and without an explicit synchronization there is no guarantee that the data is still valid once it's being copied to the device.
----
-
-👤 **JohannesGaessler** commented the **2025-04-05** at **10:14:21**:
+> 👤 **ikawrakow** replied on **2025-04-05** at **10:55:53**
+>
+> Yes, they are deallocated when the function completes. Neither `ids_host` nor `ids` (or `ids_dev`) is used after that. The only reason this forgotten to remove synchronization is there is because I did have a bug while developing this function. The bug resulted in out of bounds access, so before finding the actual bug one hypothesis I had was that I needed to synchronize because the copy had not finished when I started using the row ids.
->Awesome work. Thank you. You are starting to get near to VLLM performance on PP.
-
-If you are using GGUF models in both cases you should be aware that vLLM at some point transplanted quantization-specific CUDA code that I wrote for ggml. I have since improved this code but vLLM has to my knowledge not taken over these improvements.
-
----
-
-👤 **ikawrakow** submitted a review the **2025-04-05** at **10:55:53**: 💬 `COMMENTED`
+> 👤 **JohannesGaessler** replied on **2025-04-05** at **11:11:58**
+>
+> The original code had synchronization directly after the memcpy so I had assumed that that is where this line comes from. But that is I think not relevant to the discussion.
+>
+> When you call `cudaMemcpyAsync` you merely pass a pointer and queue a memcpy from that pointer to the device. As it is you don't have any guarantees that that memcpy will happen before the function returns and the memory is deallocated. Even if you are unable to provoke a bug in testing this is a defect which will result in sporadic segfaults or copying of garbage data.
----
+> 👤 **ikawrakow** replied on **2025-04-05** at **11:48:50**
+>
+> That would be true if nothing happened after this call. But the row ids are being used in subsequent calls in the same function, so the memcpy must have completed before the function exits. Let's take a look at your original `mul_mat_id` implementation. At the end we have [this call](https://github.com/ggml-org/llama.cpp/blob/7a84777f42a9b3ba47db5d20b7662f8ddf92f652/ggml/src/ggml-cuda/ggml-cuda.cu#L2093). This copies the data from the contiguous memory pool-allocated in the function to its final destination. Now, if this call has not completed by the time the function returns, than we would obviously have "sporadic segfaults and copying of garbage data". So, even without knowing anything about CUDA, one needs to assume that a call such as this completes synchronously, else the entire `llama.cpp` CUDA stack would be a collection of "sporadic segfaults and copying of garbage data". Well, there are calls such as that one in my function as well before it returns. These kernel calls, as well as the preceding processing, they all use the row ids that you are claiming may go out of scope. But in order for them to execute, the queued memcpy must have completed, so no, no "sporadic segfaults and copying of garbage data" at this point.
+>
+> But at the end of the day, if you are able to trigger the bug, using whatever it takes to trigger it, I'll be happy to uncomment the synchronization call.
-👤 **ikawrakow** commented during a code review the **2025-04-05** at **10:55:53** on `ggml/src/ggml-cuda.cu`:
+> 👤 **JohannesGaessler** replied on **2025-04-05** at **12:23:11**
+>
+> `k_copy_dst_from_contiguous` only uses device pointers. The point in time at which their data is valid is automatically synchronized with the execution of the kernel because CUDA streams guarantee an ordering in which device code is executed. `cudaMemcpyAsync` is fundamentally different because it uses a host pointer with memory that can become invalid under the control of host code.
+>
+> >Let's take a look at your original mul_mat_id implementation. At the end we have [this call](https://github.com/ggml-org/llama.cpp/blob/7a84777f42a9b3ba47db5d20b7662f8ddf92f652/ggml/src/ggml-cuda/ggml-cuda.cu#L2093). This copies the data from the contiguous memory pool-allocated in the function to its final destination.
+>
+> The way the CUDA memory pools work is that the memory is allocated in a single, large block that can grow dynamically. Assuming that you don't need to increase the size of the block an "allocation" via `ggml_cuda_pool_alloc` does not actually allocate any new memory, it simply returns a pointer into the large block that is selected in such a way that there are no conflicts between the "allocated" memory regions while the "allocations" are in scope. The actual memory continues to be a valid allocation afterwards, though it will likely be overwritten by other kernels. This is very similar to how the ggml graph planner is giving each tensor a pointer to some data where at the time of the tensor being executed the data is guaranteed to be valid but the memory is re-used for other tensors as long as there are no conflicts.
-Yes, they are deallocated when the function completes. Neither `ids_host` nor `ids` (or `ids_dev`) is used after that. The only reason this forgotten to remove synchronization is there is because I did have a bug while developing this function. The bug resulted in out of bounds access, so before finding the actual bug one hypothesis I had was that I needed to synchronize because the copy had not finished when I started using the row ids.
+> 👤 **JohannesGaessler** replied on **2025-04-05** at **12:24:46**
+>
+> >This is very similar to how the ggml graph planner is giving each tensor a pointer to some data
+>
+> Actually, `wdata` may be a better comparison.
----
+> 👤 **ikawrakow** replied on **2025-04-05** at **12:33:00**
+>
+> See [#313](https://github.com/ikawrakow/ik_llama.cpp/issues/313). The issue is not that it will go out of scope, but that I'm using the data on the host before the copy may have completed.
-👤 **JohannesGaessler** submitted a review the **2025-04-05** at **11:11:58**: 💬 `COMMENTED`
+> 👤 **JohannesGaessler** replied on **2025-04-05** at **12:43:28**
+>
+> Sorry, I just noticed that I mixed up the copy directions for the two memcpys.
---
-👤 **JohannesGaessler** commented during a code review the **2025-04-05** at **11:11:58** on `ggml/src/ggml-cuda.cu`:
+👤 **JohannesGaessler** commented on **2025-04-05** at **10:14:21**
-The original code had synchronization directly after the memcpy so I had assumed that that is where this line comes from. But that is I think not relevant to the discussion.
+>Awesome work. Thank you. You are starting to get near to VLLM performance on PP.
-When you call `cudaMemcpyAsync` you merely pass a pointer and queue a memcpy from that pointer to the device. As it is you don't have any guarantees that that memcpy will happen before the function returns and the memory is deallocated. Even if you are unable to provoke a bug in testing this is a defect which will result in sporadic segfaults or copying of garbage data.
+If you are using GGUF models in both cases you should be aware that vLLM at some point transplanted quantization-specific CUDA code that I wrote for ggml. I have since improved this code but vLLM has to my knowledge not taken over these improvements.
---
-👤 **ikawrakow** commented the **2025-04-05** at **11:17:43**:
+👤 **ikawrakow** commented on **2025-04-05** at **11:17:43**
> I have since improved this code but vLLM has to my knowledge not taken over these improvements.
-Based on the performance comparisons on my GPU (RTX-4080) against mainline that I ran after the improvements, they were too minor to offset the performance gains I have from other modifications. For MoE models with many experts such as DeepSeek-V3/R1/Lite, `ik_llama.cpp` is ~1.8X faster than mainline for PP after this PR. It is also ~80-90% of vLLM performance on a multi-GPU system such as the one davidsyoung has, where vLLM uses tensor parallelism and `ik_llama.cpp` does not (so all that will take to match or beat vLLM is to make row split work with MoE models). Given my very limited experience with GPU programming, and given my very rudimentary CUDA knowledge, I'm content with being at 90% of the performance of a repo with 900+ contributors (and the quantized matrix multiplications came from no-one less than you, @JohannesGaessler).
-
----
-
-👤 **ikawrakow** submitted a review the **2025-04-05** at **11:48:50**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-04-05** at **11:48:50** on `ggml/src/ggml-cuda.cu`:
-
-That would be true if nothing happened after this call. But the row ids are being used in subsequent calls in the same function, so the memcpy must have completed before the function exits. Let's take a look at your original `mul_mat_id` implementation. At the end we have [this call](https://github.com/ggml-org/llama.cpp/blob/7a84777f42a9b3ba47db5d20b7662f8ddf92f652/ggml/src/ggml-cuda/ggml-cuda.cu#L2093). This copies the data from the contiguous memory pool-allocated in the function to its final destination. Now, if this call has not completed by the time the function returns, than we would obviously have "sporadic segfaults and copying of garbage data". So, even without knowing anything about CUDA, one needs to assume that a call such as this completes synchronously, else the entire `llama.cpp` CUDA stack would be a collection of "sporadic segfaults and copying of garbage data". Well, there are calls such as that one in my function as well before it returns. These kernel calls, as well as the preceding processing, they all use the row ids that you are claiming may go out of scope. But in order for them to execute, the queued memcpy must have completed, so no, no "sporadic segfaults and copying of garbage data" at this point.
-
-But at the end of the day, if you are able to trigger the bug, using whatever it takes to trigger it, I'll be happy to uncomment the synchronization call.
-
----
-
-👤 **JohannesGaessler** submitted a review the **2025-04-05** at **12:23:11**: 💬 `COMMENTED`
-
----
-
-👤 **JohannesGaessler** commented during a code review the **2025-04-05** at **12:23:11** on `ggml/src/ggml-cuda.cu`:
-
-`k_copy_dst_from_contiguous` only uses device pointers. The point in time at which their data is valid is automatically synchronized with the execution of the kernel because CUDA streams guarantee an ordering in which device code is executed. `cudaMemcpyAsync` is fundamentally different because it uses a host pointer with memory that can become invalid under the control of host code.
-
->Let's take a look at your original mul_mat_id implementation. At the end we have [this call](https://github.com/ggml-org/llama.cpp/blob/7a84777f42a9b3ba47db5d20b7662f8ddf92f652/ggml/src/ggml-cuda/ggml-cuda.cu#L2093). This copies the data from the contiguous memory pool-allocated in the function to its final destination.
-
-The way the CUDA memory pools work is that the memory is allocated in a single, large block that can grow dynamically. Assuming that you don't need to increase the size of the block an "allocation" `ggml_cuda_pool_alloc` does not actually allocate any new memory, it simply returns a pointer into the large block that is selected in such a way that there are no conflicts between the "allocated" memory regions while the "allocations" are in scope. The actual memory continues to be a valid allocation afterwards, though it will likely be overwritten by other kernels. This is very similar to how the ggml graph planner is giving each tensor a pointer to some data where at the time of the tensor being executed the data is guaranteed to be valid but the memory is re-used for other tensors as long as there are no conflicts.
-
----
-
-👤 **JohannesGaessler** submitted a review the **2025-04-05** at **12:24:46**: 💬 `COMMENTED`
-
----
-
-👤 **JohannesGaessler** commented during a code review the **2025-04-05** at **12:24:46** on `ggml/src/ggml-cuda.cu`:
-
->This is very similar to how the ggml graph planner is giving each tensor a pointer to some data
-
-Actually, `wdata` may be a better comparison.
-
----
-
-👤 **ikawrakow** submitted a review the **2025-04-05** at **12:33:00**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-04-05** at **12:33:00** on `ggml/src/ggml-cuda.cu`:
-
-See #313. The issue is not that it will go out of scope, but that I'm using the data on the host before the copy may have completed.
-
----
-
-👤 **JohannesGaessler** submitted a review the **2025-04-05** at **12:43:28**: 💬 `COMMENTED`
-
----
-
-👤 **JohannesGaessler** commented during a code review the **2025-04-05** at **12:43:28** on `ggml/src/ggml-cuda.cu`:
-
-Sorry, I just noticed that I mixed up the copy directions for the two memcpys.
\ No newline at end of file
+Based on the performance comparisons on my GPU (RTX-4080) against mainline that I ran after the improvements, they were too minor to offset the performance gains I have from other modifications. For MoE models with many experts such as DeepSeek-V3/R1/Lite, `ik_llama.cpp` is ~1.8X faster than mainline for PP after this PR. It is also ~80-90% of vLLM performance on a multi-GPU system such as the one davidsyoung has, where vLLM uses tensor parallelism and `ik_llama.cpp` does not (so all that will take to match or beat vLLM is to make row split work with MoE models). Given my very limited experience with GPU programming, and given my very rudimentary CUDA knowledge, I'm content with being at 90% of the performance of a repo with 900+ contributors (and the quantized matrix multiplications came from no-one less than you, @JohannesGaessler).
\ No newline at end of file
diff --git a/github-data/pull_requests/284 - llama-bench_ enable having different number of threads for tg and pp.md b/github-data/pull_requests/284 - llama-bench enable having different number of threads for tg and pp.md
similarity index 83%
rename from github-data/pull_requests/284 - llama-bench_ enable having different number of threads for tg and pp.md
rename to github-data/pull_requests/284 - llama-bench enable having different number of threads for tg and pp.md
index cda38d0b4..7bac9d8d5 100644
--- a/github-data/pull_requests/284 - llama-bench_ enable having different number of threads for tg and pp.md
+++ b/github-data/pull_requests/284 - llama-bench enable having different number of threads for tg and pp.md
@@ -1,14 +1,17 @@
-### 🔀 [#284](https://github.com/ikawrakow/ik_llama.cpp/pull/284) - llama-bench: enable having different number of threads for tg and pp
+## 🔀 [Pull Request #284](https://github.com/ikawrakow/ik_llama.cpp/pull/284) - llama-bench: enable having different number of threads for tg and pp
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/llama_bench_tgb` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-24 |
| **Updated** | 2025-03-25 |
+| **Merged** | 2025-03-25 |
---
-#### Description
+## 📄 Description
All applications in the `examples` folder except `llama-bench` accept `-t` (to specify number of threads for token generation) and `-tb` (to specify number of threads for prompt processing, a.k.a. prefill) as command line arguments. This is handy because often TG peak performance is reached at a lower number of threads, so one wants to use that instead of the number of cores, which is good for maximum prompt processing speed. `llama-bench`, inherited from upstream, has its own command line argument parsing, where one only has available `-t` but not `-tb`.
@@ -27,8 +30,8 @@ The `-t` argument continues to work as before. It adds a pair of the same intege
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-03-25** at **16:27:02**:
+👤 **ubergarm** commented on **2025-03-25** at **16:27:02**
Thanks for this one, should help optimize the big xeon 6980P given previous testing suggests that pp likes more threads than tg.
\ No newline at end of file
diff --git a/github-data/pull_requests/287 - Is this better for DeepSeek-R1_.md b/github-data/pull_requests/287 - Is this better for DeepSeek-R1.md
similarity index 76%
rename from github-data/pull_requests/287 - Is this better for DeepSeek-R1_.md
rename to github-data/pull_requests/287 - Is this better for DeepSeek-R1.md
index b723b08a7..5bacf92c0 100644
--- a/github-data/pull_requests/287 - Is this better for DeepSeek-R1_.md
+++ b/github-data/pull_requests/287 - Is this better for DeepSeek-R1.md
@@ -1,14 +1,16 @@
-### 🔀 [#287](https://github.com/ikawrakow/ik_llama.cpp/pull/287) - Is this better for DeepSeek-R1?
+## 🔀 [Pull Request #287](https://github.com/ikawrakow/ik_llama.cpp/pull/287) - Is this better for DeepSeek-R1?
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/deepseek_is_this_better` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-24 |
| **Updated** | 2025-04-03 |
---
-#### Description
+## 📄 Description
This PR implements MoE matrix multiplications on the CPU with a different strategy for distributing the work among the threads. I observe a very slight performance improvement for DeepSeek-Lite (~1%). I'm wondering if this could have more impact for DeepSeek-R1.
@@ -22,19 +24,19 @@ To be most effective, the number of threads used should be a multiple of the num
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-03-24** at **22:09:34**:
+👤 **saood06** commented on **2025-03-24** at **22:09:34**
-I still haven't restarted my machine (in order to test hugepages, and mitigations being off) so when I have some time, I'll test this with sweep-bench and see how it compares to the results I last got.
+I still haven't restarted my machine (in order to test hugepages, and mitigations being off) so when I have some time, I'll test this with sweep-bench with the same config as I have been using (MLA-3, FA on, 48 threads, fmoe on) and see how it compares to the results I last got.
---
-👤 **ubergarm** commented the **2025-03-25** at **05:15:59**:
+👤 **ubergarm** commented on **2025-03-25** at **05:15:59**
Oh this looks interesting. Hopefully the 6980P frees up tomorrow to gives this branch a proper test given that rig has a lot of RAM bandwidth that seems under-utilized.
-I gave this branch a very quick try on the 7965WX 24-Core with `-mla 2` and offloading some layers to GPU as usual. Not sure if this even applies to `-mla 2`.
+I gave this branch a very quick try on the 7965WX 24-Core with `-mla 2` and offloading `-ot exps=CPU` as usual. Not 100% sure if this even applies to `-mla 2`.
Not super conclusive, but tg might be slightly improved with pp about the same in this test :point_down:
@@ -92,7 +94,7 @@ build: f9307d79 (3607)
---
-👤 **saood06** commented the **2025-03-25** at **09:08:27**:
+👤 **saood06** commented on **2025-03-25** at **09:08:27**
For me early results show regression, I dropped the caches and tested it, I'll let this run fully and post the graph but initial results below (build daa3b00c):
@@ -128,18 +130,18 @@ For reference build d12f4a12 results below (truncated to same amount):
| 512 | 128 | 5120 | 92.838 | 5.52 | 54.277 | 2.36 |
| 512 | 128 | 5632 | 99.437 | 5.15 | 54.257 | 2.36 |
-Oddly I also did a preliminary run before dropping the cache and oddly enough that performed better than after dropping but still worse than my previous one table below for reference (also build daa3b00c):
+I also did a preliminary run before dropping the cache and oddly enough that performed better than after dropping but still worse than my previous one table below for reference (also build daa3b00c):
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 50.972 | 10.04 | 41.870 | 3.06 |
| 512 | 128 | 512 | 56.608 | 9.04 | 44.729 | 2.86 |
-Also while watching the CPU usage while it was loading the model into the cache it was different, it now had bursts of CPU activity then stretches around 3-4x as long with far lower CPU usage, the disk I/O was also fluctuating a lot more, but it did finish the load from cache in a similar time as expected for 48 threads.
+Also while watching the CPU usage while it was loading the model into the cache it was different, it now had bursts of CPU activity then stretches around 3-4x as long with far lower CPU usage, the disk I/O was also fluctuating a lot more, but it did still finish the load from cache in a similar time as expected for 48 threads.
---
-👤 **saood06** commented the **2025-03-25** at **10:21:38**:
+👤 **saood06** commented on **2025-03-25** at **10:21:38**
Full results still show regression in TG:
@@ -190,25 +192,29 @@ Full results for this in table form:
---
-👤 **ikawrakow** commented the **2025-03-25** at **11:14:42**:
+👤 **ikawrakow** commented on **2025-03-25** at **11:14:42**
-@saood06 Thanks for the results, but the tests are for batched processing. #287 is not supposed to influence batches in any way, it only does something different when we have exactly one token to process (as in TG). I suspect you end up having different results because of the warm up, which is TG. It seems in your case this leads to a less optimal distribution of model weights across memory banks, so you see a lower performance in your batched experiments. But with the small batches being used here, and a MoE model with so many experts, many of the experts will "see" just a single token in the batch, so I guess I could apply a similar optimization also there.
+@saood06 Thanks for the results, but the tests are for batched processing. [#287](https://github.com/ikawrakow/ik_llama.cpp/issues/287) is not supposed to influence batches in any way, it only does something different when we have exactly one token to process (as in TG). I suspect you end up having different results because of the warm up, which is TG. It seems in your case this leads to a less optimal distribution of model weights across memory banks, so you see a lower performance in your batched experiments. But with the small batches being used here, and a MoE model with so many experts, many of the experts will "see" just a single token in the batch, so I guess I could apply a similar optimization also there.
---
-👤 **saood06** commented the **2025-03-25** at **12:06:03**:
+👤 **saood06** commented on **2025-03-25** at **12:06:03**
-> @saood06 Thanks for the results, but the tests are for batched processing. #287 is not supposed to influence batches in any way, it only does something different when we have exactly one token to process (as in TG). I suspect you end up having different results because of the warm up, which is TG. It seems in your case this leads to a less optimal distribution of model weights across memory banks, so you see a lower performance in your batched experiments. But with the small batches being used here, and a MoE model with so many experts, many of the experts will "see" just a single token in the batch, so I guess I could apply a similar optimization also there.
+> @saood06 Thanks for the results, but the tests are for batched processing. [#287](https://github.com/ikawrakow/ik_llama.cpp/issues/287) is not supposed to influence batches in any way, it only does something different when we have exactly one token to process (as in TG). I suspect you end up having different results because of the warm up, which is TG. It seems in your case this leads to a less optimal distribution of model weights across memory banks, so you see a lower performance in your batched experiments. But with the small batches being used here, and a MoE model with so many experts, many of the experts will "see" just a single token in the batch, so I guess I could apply a similar optimization also there.
I'm not testing batched performance, the TG values given for sweep-bench should be identical to the `-gp` option that you added in llama-bench.
-The benefit is that it measures at intervals while growing and reusing the context, which makes it feasible for me to measure TG and PP performance and see how it changes at different context depths.
+The benefit is that it measures at intervals while growing and reusing the context, which makes it feasible for me to measure TG (and also PP) performance and see how it changes at different context depths.
-Doing the same with llama-bench's -gp would take much longer as my PP speed is so slow.
+Doing the same with llama-bench's -gp would take much longer as my PP speed is so slow.
+
+Edit 1: These values do reflect accurately my experiences in llama-server.
+
+Edit 2: The warmup behaviour of sweep-bench is also fine, even if it is the first thing I run after dropping the cache/rebooting it always results in correct TG performance.
---
-👤 **ikawrakow** commented the **2025-03-25** at **12:32:55**:
+👤 **ikawrakow** commented on **2025-03-25** at **12:32:55**
> I'm not testing batched performance
@@ -216,7 +222,7 @@ So, not using `llama-batched-bench`? But then, if that wasn't batched inference,
---
-👤 **saood06** commented the **2025-03-25** at **12:50:04**:
+👤 **saood06** commented on **2025-03-25** at **12:50:04**
> So, not using `llama-batched-bench`?
@@ -234,29 +240,186 @@ This benchmark does really reflect how llama-server feels for PP and TG across t
---
-👤 **saood06** commented the **2025-03-25** at **13:19:05**:
+👤 **ikawrakow** commented on **2025-03-25** at **12:58:25**
+
+>No, all my recent benchmarks have been with the llama-sweep-bench.
+
+Ah, OK, sorry I haven't looked at that. I need to understand how it works and start using it.
+
+But to make sure that these results are not affected by the huge pages experiment that you also did (as it impacts performance on your system in a pretty bad way): there weren't huge pages enabled at the time of the test, and if you did the huge pages test before this one, you did reboot the system? Or spent enough time to bring performance back to normal as I had to do on my system after the huge pages experiment?
+
+---
+
+👤 **saood06** commented on **2025-03-25** at **13:19:05**
@ikawrakow
-SORRY, I accidentally edited your comment instead of replying.
+SORRY, I accidentally edited your comment instead of replying.
+
+>>No, all my recent benchmarks have been with the llama-sweep-bench.
+>
+>Ah, OK, sorry I haven't looked at that. I need to understand how it works and start using it.
+
+I'm just about to make a PR, since I started actually using it, I changed it (I want to make breaking changes, as I prefer working with the markdown tables as they are human readable, and changed the python to work with the markdown instead of jsonl removing the need for .jsonl in the first place).
+
+>But to make sure that these results are not affected by the huge pages experiment that you also did (as it impacts performance on your system in a pretty bad way): there weren't huge pages enabled at the time of the test, and if you did the huge pages test before this one, you did reboot the system? Or spent enough time to bring performance back to normal as I had to do on my system after the huge pages experiment?
+
+They are not, these were obtained before the restart that would turn on and reserve enough huge pages, and as mentioned before I tested briefly before dropping cache, but then I dropped the cache and observed the strange cache loading behavior. Normally cache loading has a steady 85% CPU usage and stable disk i/o, with this PR it has the weird burst and then idle behaviour with disk i/o jumping around but averaging the same as before, and with hugepages it becomes single threaded and 2.5x faster disk i/o happens.
+
+I have turned off hugepages and restarted my machine.
+
+I run sweep-bench on my latest fast build.
+
+PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s
+-- | -- | -- | -- | -- | -- | --
+512 | 128 | 0 | 49.094 | 10.43 | 39.605 | 3.23
+512 | 128 | 512 | 56.509 | 9.06 | 43.036 | 2.97
+512 | 128 | 1024 | 63.248 | 8.10 | 44.641 | 2.87
+512 | 128 | 1536 | 65.444 | 7.82 | 46.500 | 2.75
+
+
+I can confirm performance is back to expected.
---
-👤 **ikawrakow** commented the **2025-03-25** at **13:25:48**:
+👤 **ikawrakow** commented on **2025-03-25** at **13:25:48**
OK, thanks. I'll wait for more detailed results from @ubergarm. If they are positive, I'll make it a compile time option (it is difficult to propagate a parameter to `ggml` CPU backend). If they are negative or inconclusive, I'll discard the PR.
---
-👤 **saood06** commented the **2025-03-25** at **14:19:47**:
+👤 **ubergarm** commented on **2025-03-25** at **14:09:06**
+
+*FINISHED* Tue Mar 25 03:45:16 PM EDT 2025
+
+## tl;dr;
+
+Sorry I don't have a graph for this. Its kinda complicated.
+
+Seems like at 64 threads this branch improves tg by almost 6%. But some other areas regress. I'll try to get a graph put together later.
+
+## Details
+
+(Squeezing this in while copying over the new deepseek-v3 `q8_0_r8` for imatrix making given updated info over on that thread!)
+
+I too would like to spend a little more time to learn `llama-sweep-bench` as looking at graphs is much nicer than just charts of raw data haha...
+
+Test is currently running on a single socket of the xeon 6980P using unsloth offline repacked `q4_k_r4`:
+
+
+Incoming Logs
+
+I managed to finish the 32 threads benchmark for both cases. Though now I'm trying to use the 2nd CPU socket to calculate imatrix.dat for `V3-0324`. Hopefully that doesn't effect the benchmarks for remaining thread counts given it is running on 1st CPU socket...
+
+On this rig generally more threads improves pp, but tg caps out between 64-96 threads. I'll have to try another run eventually and can make use of [#284](https://github.com/ikawrakow/ik_llama.cpp/issues/284) to optimize both pp and tg.
+
+## Command
+```bash
+# single socket test command
+$ numactl -N 0 -m 0 \
+./build/bin/llama-bench \
+ -thp 0 \
+ --mmap 0 \
+ --model /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q4_K_R4.gguf \
+ -ctk q8_0 \
+ -mla 3 -fa 1 \
+ -amb 1024 \
+ -fmoe 1 \
+ -p 512,8192,16384 -n 0 \
+ -gp 512,64 \
+ -gp 8192,64 \
+ -gp 16384,64 \
+ -r 2 \
+ --numa numactl \
+ --threads 32,64,88,128
+
+# confirm model is loaded *entirely* into Huge Pages (which seems good on this system)
+$ du /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q4_K_R4.gguf
+394951400 /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q4_K_R4.gguf
+
+$ grep Huge /proc/meminfo
+AnonHugePages: 396218368 kB
+
+# Current power profile is: performance
+# Set numa balancing to be: 0
+```
+
+## This PR branch `ik/deepseek_is_this_better@daa3b00`
+| model | size | params | backend | threads | type_k | fa | mla | amb | mmap | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -: | --: | ----: | ---: | ---: | ------------: | ---------------: |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 56.67 ± 3.68 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 39.15 ± 0.20 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 28.63 ± 0.06 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 7.22 ± 0.00 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 6.05 ± 0.03 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 3.94 ± 0.01 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 105.04 ± 3.36 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 69.45 ± 1.17 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 51.00 ± 0.33 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 9.65 ± 0.00 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 7.86 ± 0.00 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 6.14 ± 0.11 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 112.03 ± 1.78 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 70.51 ± 2.83 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 55.87 ± 2.67 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 9.43 ± 0.00 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 7.32 ± 0.01 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 6.02 ± 0.03 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 127.07 ± 12.23 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 76.89 ± 2.53 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 55.11 ± 0.19 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 8.49 ± 0.02 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 6.84 ± 0.19 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 5.61 ± 0.14 |
+
+build: daa3b00c (3609)
+
+## Baseline `main@98a264a2`
+| model | size | params | backend | threads | type_k | fa | mla | amb | mmap | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -: | --: | ----: | ---: | ---: | ------------: | ---------------: |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 62.14 ± 0.68 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 41.03 ± 0.20 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 29.36 ± 0.68 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 7.78 ± 0.01 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 6.15 ± 0.01 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 32 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 4.57 ± 0.03 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 96.11 ± 0.54 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 64.43 ± 0.01 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 45.32 ± 0.83 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 9.14 ± 0.03 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 7.45 ± 0.02 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 64 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 5.76 ± 0.02 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 116.98 ± 0.62 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 81.51 ± 2.21 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 58.54 ± 0.27 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 9.37 ± 0.00 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 7.31 ± 0.06 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 88 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 5.88 ± 0.19 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp512 | 139.62 ± 3.28 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp8192 | 95.89 ± 0.11 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | pp16384 | 69.04 ± 0.48 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp512 | 8.64 ± 0.05 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp8192 | 7.31 ± 0.05 |
+| deepseek2 671B Q4_K_R4 | 376.65 GiB | 671.03 B | CPU | 128 | q8_0 | 1 | 3 | 1024 | 0 | 1 | tg64@pp16384 | 5.97 ± 0.05 |
+
+build: 98a264a2 (3608)
+
+
+
+---
+
+👤 **saood06** commented on **2025-03-25** at **14:19:47**
-I just pushed a fix to the [readme](https://github.com/ikawrakow/ik_llama.cpp/blob/98a264a2ea21761322847ac562f58d986ef6c512/examples/sweep-bench/README.md) so you can read it at the link.
+I just pushed a fix to the [readme](https://github.com/ikawrakow/ik_llama.cpp/blob/s6/sweep_bench_update/examples/sweep-bench/README.md) so you can read it at the link.
-It goes over what the benchmark does and the definition of each header.
+It goes over what the benchmark does and the definition of each header.
+
+Edit:
+@ubergarm changed the link to the correct one (from my PR), instead of main.
---
-👤 **saood06** commented the **2025-03-25** at **14:27:36**:
+👤 **saood06** commented on **2025-03-25** at **14:27:36**
>(Squeezing this in while copying over the new deepseek-v3 q8_0_r8 for imatrix making given updated info over on that thread!)
@@ -264,7 +427,17 @@ How far did the BF16 one get overnight?
---
-👤 **ubergarm** commented the **2025-03-25** at **20:17:57**:
+👤 **saood06** commented on **2025-03-25** at **20:01:04**
+
+> _FINISHED_ Tue Mar 25 03:45:16 PM EDT 2025
+
+Looking at the results 64 cores with this PR is the best performing option, so both of your rigs do see a bump in speed while mine does not.
+
+I wonder why my system behaves so poorly with this.
+
+---
+
+👤 **ubergarm** commented on **2025-03-25** at **20:17:57**
@saood06
@@ -278,7 +451,7 @@ Too many irons in the fire today lol, jumping back over to the thread on `imatri
---
-👤 **saood06** commented the **2025-03-25** at **20:26:52**:
+👤 **saood06** commented on **2025-03-25** at **20:26:52**
> Yeah it is interesting, seems like for me there is a regression for non optimal number of threads though. Did you try a quick check of say 32 and 40 threads for a single setting? Just brainstorming...
>
@@ -288,11 +461,11 @@ Not on this PR maybe that will help, as all previous testing showed bad results
---
-👤 **ubergarm** commented the **2025-03-26** at **00:10:52**:
+👤 **ubergarm** commented on **2025-03-26** at **00:10:52**
Haha, okay so I used `DeepSeek-V3-0324-IQ2_K_R4-bartowski-imat.gguf` to cook up some graphs and copy pasted my actual markdown `llama-bench` output into the `graph.py` and ran it without linting or anything and here is what we got.
-It is complex, basically this PR is 7~12% better for pp and ~5% better for tg *only* when the number of threads is dialed in. Otherwise it is 3~20% worse than baseline main.
+It is complex, basically this PR is 7\~12% better for pp and \~5% better for tg *only* when the number of threads is dialed in. Otherwise it is 3\~20% worse than baseline main.
I would have to run more intervals near the peak e.g. 56 and 72 threads to confirm 64 is peak for this rig and config.
@@ -302,7 +475,7 @@ Gotta say I'm impressed `V3-0324` one-shotted that! Not perfect graphs, but it a
The auto-generated code python:
-plot.py
+graph.py
```bash
import pandas as pd
@@ -481,7 +654,7 @@ print(f"Maximum regression: {comparison_df['t/s_diff'].min():.2f} t/s")
---
-👤 **saood06** commented the **2025-03-26** at **00:55:36**:
+👤 **saood06** commented on **2025-03-26** at **00:55:36**
> Haha, okay so I used `DeepSeek-V3-0324-IQ2_K_R4-bartowski-imat.gguf` to cook up some graphs and copy pasted my actual markdown `llama-bench` output into the `graph.py` and ran it without linting or anything and here is what we got.
>
@@ -500,7 +673,7 @@ Then just save the resulting markdown into a file and give it the filename of wh
---
-👤 **ubergarm** commented the **2025-03-26** at **02:02:03**:
+👤 **ubergarm** commented on **2025-03-26** at **02:02:03**
> Sounds like a good time to try sweep-bench
@@ -514,9 +687,9 @@ I guess I have a few questions:
I guess the first thing is I need to find where the output goes. Also the output log looks a bit wonky at the end like it does for me sometimes, not sure if that is due to piping stderr/stdout into tee or what...
-
+
-Full llama-sweep-bench logs
+Full llama-sweep-bench logs
```bash
$ git branch
@@ -794,11 +967,11 @@ Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
```
-
+
---
-👤 **saood06** commented the **2025-03-26** at **02:27:43**:
+👤 **saood06** commented on **2025-03-26** at **02:27:43**
> > Sounds like a good time to try sweep-bench
>
@@ -1035,13 +1208,13 @@ Then I run `python sweep-bench-plot.py result1 result2 result3` and that would m
>
> 1. `./build/bin/llama-sweep-bench --help` didn't show anything. I think it uses parameters out of common like `llama-server` and not like `llama-bench` as you mentioned above.
-Yes, the -help is not very good, and the old version's print_usage also never printed to the screen only to the log file (I did not pay much attention to it printing to the screen when I originally ported as the old python only supported jsonl which wasn't really human readable anyway, and so it only going to a log file [which according to the [documentation](https://github.com/ikawrakow/ik_llama.cpp/blob/a22250df93fd833a6cb7f310b159ad1b54e4d582/common/log.h#L24) should be different for each pid, but for me it always overwrote the same log file], I switched them to LOG_TEE like most of the other examples, which goes both to the output and a log file in the fixed version.
+Yes, the -help is not very good (there is only a README.md and very bried print-usage which doesn't explain much), and the old version's print_usage also never printed to the screen only to the log file (I did not pay much attention to it printing to the screen when I originally ported as the old python only supported jsonl which wasn't really human readable anyway, and so it only going to a log file [which according to the [documentation](https://github.com/ikawrakow/ik_llama.cpp/blob/a22250df93fd833a6cb7f310b159ad1b54e4d582/common/log.h#L24) should be different for each pid, but for me it always overwrote the same log file], I switched them to LOG_TEE like most of the other examples, which goes both to the output and a log file in the fixed version.
> 2. Does it output results as it goes to stdout or do I need to specify a file to save it to? I didn't find the output, but it seemed to run for a while and I saw CPU usage with 64 threads.
The new one should, the old one didn't which I found annoying, it uses the LOG function which writes to llama.log (or a file like it)
-3. I'm not exactly sure how to compare its outputs to `llama-bench` `pp` and `tg` numbers, as I don't have a good conception of what varying `N_KV` exactly does. I read the README, but if I see an example maybe it would click in my brain.
+> 3. I'm not exactly sure how to compare its outputs to `llama-bench` `pp` and `tg` numbers, as I don't have a good conception of what varying `N_KV` exactly does. I read the README, but if I see an example maybe it would click in my brain.
Think of N_KV as how deep in the context the you are measuring from from, and TG/PP is how many tokens. So in a row if the `N_KV` is 8192 and the `TG` is 128, the `S_TG t/s` resulting value is equivalent to `-gp 8192,128`.
@@ -1051,13 +1224,13 @@ Sorry again, I forgot this branch had the old version, I should have warned you
---
-👤 **ikawrakow** commented the **2025-03-26** at **07:24:39**:
+👤 **ikawrakow** commented on **2025-03-26** at **07:24:39**
OK, this does not look like it is helping.
---
-👤 **saood06** commented the **2025-03-29** at **07:34:32**:
+👤 **saood06** commented on **2025-03-29** at **07:34:32**
> OK, this does not look like it is helping.
@@ -1067,6 +1240,6 @@ I'll test my system more thoroughly with this in different configurations later,
---
-👤 **saood06** commented the **2025-04-03** at **05:36:15**:
+👤 **saood06** commented on **2025-04-03** at **05:36:15**
-I tested at 24 threads this branch still loses to main (and main loses to main at 48 threads), but again it had the same odd behavior where this branch performed better when cache is warmed up with main than if cache is warmed up with it's own code.
\ No newline at end of file
+I tested at 24 threads this branch still loses to main at 24 threads (and main at 24 threads loses to main at 48 threads), but again it had the same odd behavior where this branch performed better when cache is warmed up with main than if cache is warmed up with it's own code (but both still losing to main).
\ No newline at end of file
diff --git a/github-data/pull_requests/289 - Update sweep bench _depracating .jsonl support_.md b/github-data/pull_requests/289 - Update sweep bench depracating .jsonl support.md
similarity index 59%
rename from github-data/pull_requests/289 - Update sweep bench _depracating .jsonl support_.md
rename to github-data/pull_requests/289 - Update sweep bench depracating .jsonl support.md
index 11873228c..d6af531aa 100644
--- a/github-data/pull_requests/289 - Update sweep bench _depracating .jsonl support_.md
+++ b/github-data/pull_requests/289 - Update sweep bench depracating .jsonl support.md
@@ -1,14 +1,17 @@
-### 🔀 [#289](https://github.com/ikawrakow/ik_llama.cpp/pull/289) - Update sweep bench (depracating .jsonl support)
+## 🔀 [Pull Request #289](https://github.com/ikawrakow/ik_llama.cpp/pull/289) - Update sweep bench (depracating .jsonl support)
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/sweep_bench_update` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-25 |
| **Updated** | 2025-03-25 |
+| **Merged** | 2025-03-25 |
---
-#### Description
+## 📄 Description
Changes that updated sweep-bench to act more like the other bench tools and print results as they occur in human readable format. Also update the python tool to generate graphs based on that markdown table instead of jsonl.
@@ -22,6 +25,6 @@ Also fixed the readme so that it properly renders.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-03-25** at **15:13:09**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-03-25** at **15:13:09**
\ No newline at end of file
diff --git a/github-data/pull_requests/290 - mmap backed KV cache.md b/github-data/pull_requests/290 - mmap backed KV cache.md
index c93533fbc..9e2c35c21 100644
--- a/github-data/pull_requests/290 - mmap backed KV cache.md
+++ b/github-data/pull_requests/290 - mmap backed KV cache.md
@@ -1,14 +1,16 @@
-### 🔀 [#290](https://github.com/ikawrakow/ik_llama.cpp/pull/290) - mmap backed KV cache
+## 🔀 [Pull Request #290](https://github.com/ikawrakow/ik_llama.cpp/pull/290) - mmap backed KV cache
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ✅ **Open** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `s6/numa_KV` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-25 |
| **Updated** | 2025-03-27 |
---
-#### Description
+## 📄 Description
Port of https://github.com/ggml-org/llama.cpp/pull/11580
@@ -30,9 +32,9 @@ This also might have the benefit of letting you allocate the full context size o
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-03-27** at **05:14:15**:
+👤 **ikawrakow** commented on **2025-03-27** at **05:14:15**
I think it needs to be ifdef'ed so the code will still build on Windows.
@@ -42,7 +44,7 @@ Concerning NUMA advantage: yes, it will spread the KV cache more evenly between
---
-👤 **saood06** commented the **2025-03-27** at **05:31:58**:
+👤 **saood06** commented on **2025-03-27** at **05:31:58**
> I think it needs to be ifdef'ed so the code will still build on Windows.
>
@@ -52,7 +54,7 @@ Yes I agree on the needed changes if this is to be merged in, I mainly just reme
>It would be also useful of @ubergarm tested performance implications.
-I'd be interested to know if it affected performance for him, since it doesn't hurt or help my performance anymore.
+I'd be interested to know if it affects performance for him, since it doesn't hurt or help my performance anymore.
> Concerning NUMA advantage: yes, it will spread the KV cache more evenly between NUMA nodes. But aren't we concerned it may result in each NUMA node having to fetch KV cache data from another NUMA node. The KV cache grows as generation progresses, so in each new evaluation threads access different portions of the KV cache, so the strategy of evenly spreading the cache across NUMA nodes will be only meaningful if we also had something in place that would make threads always process the same portions of the KV cache.
diff --git a/github-data/pull_requests/291 - Disable Zen4 optimizations for Q8_0Q8_0_R8.md b/github-data/pull_requests/291 - Disable Zen4 optimizations for Q8_0Q8_0_R8.md
new file mode 100644
index 000000000..aed69c45d
--- /dev/null
+++ b/github-data/pull_requests/291 - Disable Zen4 optimizations for Q8_0Q8_0_R8.md
@@ -0,0 +1,476 @@
+## 🔀 [Pull Request #291](https://github.com/ikawrakow/ik_llama.cpp/pull/291) - Disable Zen4 optimizations for Q8_0/Q8_0_R8
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/test_q80_NaNs` |
+| **Target Branch** | `main` |
+| **Created** | 2025-03-26 |
+| **Updated** | 2025-03-27 |
+
+---
+
+## 📄 Description
+
+The purpose of this PR is to test if the NaNs observed for `Q8_0/Q8_0_R8` quantized DeepSeekV3/R1 will go away ([#285](https://github.com/ikawrakow/ik_llama.cpp/issues/285))
+
+My hypothesis is that we get an overflow in the block sum of `Q8_1/Q8_1_X4`, which is stored as `fp16`. `Q8_1/Q8_1_X4` is used for activation quantization on Zen4 for `Q8_0/Q8_0_R8` quants. See also [#196](https://github.com/ikawrakow/ik_llama.cpp/issues/196)
+
+The PR disables the Zen4 optimization and reverts to the vanilla `AVX2` implementation, which uses `Q8_0` (just like mainline `llama.cpp`).
+
+Performance goes down quite a bit, but if we confirm that the change eliminates the NaNs, I will make a better PR that keeps the performance while avoiding the NaNs.
+
+---
+
+## 💬 Conversation
+
+👤 **ubergarm** commented on **2025-03-26** at **15:17:57**
+
+*UPDATE* Finished successfully. Complete perplexity log shown now. Thanks!
+
+---
+
+I gotta head out for a night or two, but will bring my laptop and hope to check in.
+
+I'm leaving a run going now, initial results are looking promising. Check full logs :point_down:
+
+
+
+repacked `q8_0_r8` `llama-perplexity` logs
+
+I'm guessing the borked messages are because of how stderr is piped to stdout and not actually a race condition. I saw similar output with llama-sweep-bench last night but it was running okay psure.
+
+```bash
+$ git branch | grep NaN
+* ik/test_q80_NaNs
+
+$ git rev-parse --short HEAD
+2089147a
+
+$ numactl -N 1 -m 1 \
+./build/bin/llama-perplexity \
+ -m /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q8_0_R8.gguf \
+ -f wiki.test.raw \
+ -t 128 \
+ -b 512 \
+ --numa numactl 2>&1 | tee -a output.log
+
+main: build = 3611 (2089147a)
+main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+main: seed = 1743000715
+llama_model_loader: loaded meta data with 45 key-value pairs and 1025 tensors from /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q8_0_R8.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16
+llama_model_loader: - kv 3: general.quantized_by str = Unsloth
+llama_model_loader: - kv 4: general.size_label str = 256x20B
+llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
+llama_model_loader: - kv 6: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840
+llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128
+llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128
+llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 15: general.file_type u32 = 207
+llama_model_loader: - kv 16: deepseek2.leading_dense_block_count u32 = 3
+llama_model_loader: - kv 17: deepseek2.vocab_size u32 = 129280
+llama_model_loader: - kv 18: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 19: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 20: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 21: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 22: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 23: deepseek2.expert_count u32 = 256
+llama_model_loader: - kv 24: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 25: deepseek2.expert_weights_scale f32 = 2.500000
+llama_model_loader: - kv 26: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 27: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 28: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 29: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 30: deepseek2.rope.scaling.factor f32 = 40.000000
+llama_model_loader: - kv 31: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 32: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 34: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,129280] = ["
+llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,129280] = [3
+llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,127741] = ["
+llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 40: tokenizer.ggml.padding_token_id u32 = 128815
+llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 42: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 43: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 44: general.quantization_version u32 = 2
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q8_0: 1 tensors
+llama_model_loader: - type q8_0_r8: 663 tensors
+llm_load_vocab: special tokens cache size = 819
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 163840
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 128
+llm_load_print_meta: n_head_kv = 128
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 24576
+llm_load_print_meta: n_embd_v_gqa = 16384
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 256
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 10000.0
+llm_load_print_meta: freq_scale_train = 0.025
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = 671B
+llm_load_print_meta: model ftype = Q8_0_R8 - 8.5 bpw
+llm_load_print_meta: model params = 671.026 B
+llm_load_print_meta: model size = 664.295 GiB (8.504 BPW)
+llm_load_print_meta: repeating layers = 662.461 GiB (8.504 BPW, 669.173 B parameters)
+llm_load_print_meta: general.name = DeepSeek R1 BF16
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 128815 '<|PAD▁TOKEN|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_layer_dense_lead = 3
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.5
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+llm_load_tensors: ggml ctx size = 0.42 MiB
+llm_load_tensors: CPU buffer size = 680237.97 MiB
+....................................................................................................
+============ llm_load_tensors: need to compute 61 wk_b tensors
+Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.27.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.52.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Collama_new_context_with_model: n_ctx = 512
+llama_new_context_with_model: n_batch = 512
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 0
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 0.025
+llama_kv_cache_init: CPU KV buffer size = 2440.00 MiB
+llama_new_context_with_model: KV self size = 2440.00 MiB, K (f16): 1464.00 MiB, V (f16): 976.00 MiB
+llama_new_context_with_model: CPU output buffer size = 0.49 MiB
+llama_new_context_with_model: CPU compute buffer size = 283.01 MiB
+llama_new_context_with_model: graph nodes = 3724
+llama_new_context_with_model: graph splits = 1
+
+system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 911.451 ms
+perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=512, n_seq=1
+perplexity: 6.36 seconds per pass - ETA 59.47 minutes
+mputed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+[1]2.5068,[2]3.2747,[3]2.3640,[4]1.9767,[5]1.7836,[6]1.6435,[7]1.5510,[8]1.4847,[9]1.4354,[10]1.3962,[11]1.3807,[12]1.4146,[13]1.4261,[14]1.5535,[15]1.6850,[16]1.7449,[17]1.9072,[18]2.0355,[19]1.9972,[20]1.9858,[21]2.0933,[22]2.0657,[23]2.0388,[24]2.0524,[25]2.0237,[26]2.0005,[27]2.0483,[28]2.0558,[29]2.1043,[30]2.1353,[31]2.1692,[32]2.1872,[33]2.2279,[34]2.2686,[35]2.3178,[36]2.3707,[37]2.4059,[38]2.4519,[39]2.4927,[40]2.5518,[41]2.5946,[42]2.6070,[43]2.6557,[44]2.6716,[45]2.7512,[46]2.8016,[47]2.7570,[48]2.7106,[49]2.6842,[50]2.7037,[51]2.7502,[52]2.7650,[53]2.8155,[54]2.8279,[55]2.8581,[56]2.8891,[57]2.9030,[58]2.9395,[59]2.9508,[60]2.9974,[61]3.0373,[62]3.0910,[63]3.1225,[64]3.1665,[65]3.1766,[66]3.1596,[67]3.1367,[68]3.1680,[69]3.1627,[70]3.1774,[71]3.1957,[72]3.2116,[73]3.2261,[74]3.2490,[75]3.2284,[76]3.1815,[77]3.1386,[78]3.1336,[79]3.1111,[80]3.0923,[81]3.0557,[82]3.0591,[83]3.0273,[84]2.9912,[85]2.9563,[86]2.9312,[87]2.9238,[88]2.8955,[89]2.8785,[90]2.8526,[91]2.8228,[92]2.7979,[93]2.7711,[94]2.7445,[95]2.7210,[96]2.7197,[97]2.7267,[98]2.7113,[99]2.6942,[100]2.6965,[101]2.6881,[102]2.7047,[103]2.7309,[104]2.7490,[105]2.7459,[106]2.7686,[107]2.7931,[108]2.8142,[109]2.8481,[110]2.8820,[111]2.9021,[112]2.8764,[113]2.8634,[114]2.8412,[115]2.8259,[116]2.8103,[117]2.7873,[118]2.7665,[119]2.7453,[120]2.7265,[121]2.7110,[122]2.6934,[123]2.6770,[124]2.6583,[125]2.6407,[126]2.6241,[127]2.6098,[128]2.6007,[129]2.5902,[130]2.5779,[131]2.5704,[132]2.5777,[133]2.5873,[134]2.5939,[135]2.6045,[136]2.6209,[137]2.6363,[138]2.6443,[139]2.6560,[140]2.6570,[141]2.6588,[142]2.6581,[143]2.6587,[144]2.6557,[145]2.6469,[146]2.6456,[147]2.6501,[148]2.6500,[149]2.6515,[150]2.6464,[151]2.6445,[152]2.6419,[153]2.6382,[154]2.6390,[155]2.6434,[156]2.6455,[157]2.6517,[158]2.6607,[159]2.6625,[160]2.6715,[161]2.6798,[162]2.6891,[163]2.6934,[164]2.7133,[165]2.7368,[166]2.7541,[167]2.7663,[168]2.7908,[169]2.8131,[170]2.8348,[171]2.8579,[172]2.8420,[173]2.8255,[174]2.8118,[175]2.7986,[176]2.7863,[177]2.7747,[178]2.7621,[179]2.7483,[180]2.7522,[181]2.7662,[182]2.7813,[183]2.7960,[184]2.8102,[185]2.8208,[186]2.8373,[187]2.8526,[188]2.8666,[189]2.8772,[190]2.8775,[191]2.8849,[192]2.8888,[193]2.8940,[194]2.9137,[195]2.9224,[196]2.9359,[197]2.9459,[198]2.9504,[199]2.9561,[200]2.9556,[201]2.9707,[202]2.9660,[203]2.9715,[204]2.9751,[205]2.9749,[206]2.9776,[207]2.9864,[208]2.9961,[209]3.0055,[210]3.0059,[211]3.0013,[212]3.0012,[213]3.0088,[214]3.0108,[215]3.0165,[216]3.0171,[217]3.0132,[218]3.0133,[219]3.0144,[220]3.0137,[221]3.0138,[222]3.0139,[223]3.0144,[224]3.0196,[225]3.0216,[226]3.0137,[227]3.0114,[228]3.0138,[229]3.0183,[230]3.0247,[231]3.0309,[232]3.0227,[233]3.0149,[234]3.0150,[235]3.0132,[236]3.0221,[237]3.0305,[238]3.0400,[239]3.0500,[240]3.0591,[241]3.0703,[242]3.0849,[243]3.0983,[244]3.1064,[245]3.1178,[246]3.1281,[247]3.1269,[248]3.1227,[249]3.1208,[250]3.1146,[251]3.1126,[252]3.1152,[253]3.1190,[254]3.1261,[255]3.1326,[256]3.1363,[257]3.1387,[258]3.1399,[259]3.1432,[260]3.1453,[261]3.1467,[262]3.1459,[263]3.1516,[264]3.1539,[265]3.1544,[266]3.1562,[267]3.1591,[268]3.1629,[269]3.1661,[270]3.1654,[271]3.1637,[272]3.1571,[273]3.1570,[274]3.1502,[275]3.1395,[276]3.1286,[277]3.1304,[278]3.1404,[279]3.1468,[280]3.1547,[281]3.1621,[282]3.1684,[283]3.1749,[284]3.1816,[285]3.1951,[286]3.1976,[287]3.2010,[288]3.2058,[289]3.2084,[290]3.2002,[291]3.1907,[292]3.1886,[293]3.1876,[294]3.1847,[295]3.1821,[296]3.1841,[297]3.1845,[298]3.1894,[299]3.1954,[300]3.1984,[301]3.2022,[302]3.2044,[303]3.2064,[304]3.2058,[305]3.2177,[306]3.2253,[307]3.2362,[308]3.2251,[309]3.2196,[310]3.2101,[311]3.2136,[312]3.2159,[313]3.2223,[314]3.2244,[315]3.2275,[316]3.2290,[317]3.2309,[318]3.2315,[319]3.2319,[320]3.2361,[321]3.2365,[322]3.2386,[323]3.2451,[324]3.2458,[325]3.2513,[326]3.2560,[327]3.2601,[328]3.2631,[329]3.2649,[330]3.2712,[331]3.2750,[332]3.2798,[333]3.2784,[334]3.2784,[335]3.2790,[336]3.2791,[337]3.2801,[338]3.2804,[339]3.2831,[340]3.2868,[341]3.2923,[342]3.3013,[343]3.3107,[344]3.3160,[345]3.3076,[346]3.2998,[347]3.2948,[348]3.2874,[349]3.2838,[350]3.2820,[351]3.2866,[352]3.3015,[353]3.3106,[354]3.3234,[355]3.3319,[356]3.3371,[357]3.3487,[358]3.3583,[359]3.3615,[360]3.3680,[361]3.3772,[362]3.3858,[363]3.3914,[364]3.3979,[365]3.4040,[366]3.4144,[367]3.4231,[368]3.4298,[369]3.4376,[370]3.4460,[371]3.4597,[372]3.4685,[373]3.4718,[374]3.4752,[375]3.4801,[376]3.4930,[377]3.5043,[378]3.5070,[379]3.5064,[380]3.5030,[381]3.5076,[382]3.5133,[383]3.5169,[384]3.5213,[385]3.5252,[386]3.5313,[387]3.5371,[388]3.5404,[389]3.5301,[390]3.5207,[391]3.5102,[392]3.5046,[393]3.4950,[394]3.4861,[395]3.4769,[396]3.4669,[397]3.4580,[398]3.4485,[399]3.4383,[400]3.4295,[401]3.4195,[402]3.4092,[403]3.4005,[404]3.3904,[405]3.3809,[406]3.3710,[407]3.3617,[408]3.3529,[409]3.3444,[410]3.3384,[411]3.3391,[412]3.3343,[413]3.3361,[414]3.3381,[415]3.3350,[416]3.3347,[417]3.3370,[418]3.3312,[419]3.3326,[420]3.3302,[421]3.3291,[422]3.3306,[423]3.3298,[424]3.3339,[425]3.3334,[426]3.3339,[427]3.3328,[428]3.3352,[429]3.3371,[430]3.3400,[431]3.3407,[432]3.3397,[433]3.3361,[434]3.3361,[435]3.3284,[436]3.3221,[437]3.3181,[438]3.3163,[439]3.3130,[440]3.3179,[441]3.3233,[442]3.3308,[443]3.3291,[444]3.3299,[445]3.3312,[446]3.3360,[447]3.3392,[448]3.3418,[449]3.3449,[450]3.3487,[451]3.3516,[452]3.3538,[453]3.3553,[454]3.3540,[455]3.3562,[456]3.3564,[457]3.3592,[458]3.3644,[459]3.3651,[460]3.3653,[461]3.3621,[462]3.3658,[463]3.3730,[464]3.3783,[465]3.3712,[466]3.3692,[467]3.3672,[468]3.3681,[469]3.3651,[470]3.3624,[471]3.3628,[472]3.3635,[473]3.3627,[474]3.3617,[475]3.3629,[476]3.3613,[477]3.3603,[478]3.3612,[479]3.3628,[480]3.3655,[481]3.3616,[482]3.3650,[483]3.3641,[484]3.3677,[485]3.3741,[486]3.3769,[487]3.3805,[488]3.3858,[489]3.3882,[490]3.3929,[491]3.3991,[492]3.4036,[493]3.4034,[494]3.4046,[495]3.4071,[496]3.4090,[497]3.4120,[498]3.4123,[499]3.4117,[500]3.4158,[501]3.4204,[502]3.4196,[503]3.4181,[504]3.4202,[505]3.4235,[506]3.4319,[507]3.4347,[508]3.4380,[509]3.4307,[510]3.4250,[511]3.4184,[512]3.4138,[513]3.4075,[514]3.4059,[515]3.4078,[516]3.4029,[517]3.4027,[518]3.4018,[519]3.4023,[520]3.4067,[521]3.4055,[522]3.4042,[523]3.4099,[524]3.4087,[525]3.4071,[526]3.4022,[527]3.3972,[528]3.3937,[529]3.3908,[530]3.3878,[531]3.3848,[532]3.3793,[533]3.3731,[534]3.3688,[535]3.3697,[536]3.3724,[537]3.3755,[538]3.3780,[539]3.3806,[540]3.3858,[541]3.3891,[542]3.3914,[543]3.3857,[544]3.3815,[545]3.3811,[546]3.3745,[547]3.3680,[548]3.3616,[549]3.3549,[550]3.3489,[551]3.3428,[552]3.3370,[553]3.3311,[554]3.3290,[555]3.3275,[556]3.3303,[557]3.3344,[558]3.3402,[559]3.3447,[560]3.3499,[561]3.3482,
+llama_print_timings: load time = 1021262.28 ms
+llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: prompt eval time = 3230181.16 ms / 287232 tokens ( 11.25 ms per token, 88.92 tokens per second)
+llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: total time = 3434307.80 ms / 287233 tokens
+
+Final estimate: PPL = 3.3482 +/- 0.01847
+```
+
+
+
+---
+
+👤 **ubergarm** commented on **2025-03-26** at **18:52:28**
+
+Finished successfully, just updated logs. Thanks!
+
+---
+
+👤 **ubergarm** commented on **2025-03-26** at **19:28:51**
+
+Oh nice, seems like with this patch I'm also able to get an imatrix going with MLA tensors on the `V3-0324` `q8_0` gguf I recently made. *Finished* cooking, logs look good :point_down:
+
+
+
+llama-imatrix run on q8_0
+
+```bash
+# download https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c#file-calibration_data_v5_rc-txt
+
+$ git rev-parse --short HEAD
+2089147a
+
+$ numactl -N 1 -m 1 \
+./build/bin/llama-imatrix \
+ --verbosity 1 \
+ -m /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf \
+ -f calibration_data_v5_rc.txt \
+ -o /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-$(git rev-parse --short HEAD).dat \
+ --ctx-size 512 \
+ --numa numactl \
+ --threads 128 2>&1 | tee -a output.log
+
+llama_model_loader: loaded meta data with 46 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
+llama_model_loader: - kv 3: general.version str = V3-0324
+llama_model_loader: - kv 4: general.basename str = DeepSeek
+llama_model_loader: - kv 5: general.size_label str = 256x21B
+llama_model_loader: - kv 6: general.license str = mit
+llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
+llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
+llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
+llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 16: general.file_type u32 = 7
+llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
+llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
+llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
+llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
+llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
+llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["
+llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3
+llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["
+llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
+llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 45: general.quantization_version u32 = 2
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q8_0: 786 tensors
+llm_load_vocab: special tokens cache size = 818
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 163840
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 128
+llm_load_print_meta: n_head_kv = 128
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 24576
+llm_load_print_meta: n_embd_v_gqa = 16384
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 256
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 10000.0
+llm_load_print_meta: freq_scale_train = 0.025
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = 671B
+llm_load_print_meta: model ftype = Q8_0
+llm_load_print_meta: model params = 672.050 B
+llm_load_print_meta: model size = 665.308 GiB (8.504 BPW)
+llm_load_print_meta: repeating layers = 663.474 GiB (8.504 BPW, 670.196 B parameters)
+llm_load_print_meta: general.name = DeepSeek V3 0324
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_layer_dense_lead = 3
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.5
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+llm_load_tensors: ggml ctx size = 0.47 MiB
+llm_load_tensors: CPU buffer size = 681274.97 MiB
+....................................................................................................
+llama_new_context_with_model: n_ctx = 512
+llama_new_context_with_model: n_batch = 512
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 0
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 0.025
+llama_kv_cache_init: CPU KV buffer size = 2440.00 MiB
+llama_new_context_with_model: KV self size = 2440.00 MiB, K (f16): 1464.00 MiB, V (f16): 976.00 MiB
+llama_new_context_with_model: CPU output buffer size = 0.49 MiB
+llama_new_context_with_model: CPU compute buffer size = 283.01 MiB
+llama_new_context_with_model: graph nodes = 3724
+llama_new_context_with_model: graph splits = 1
+
+system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
+compute_imatrix: tokenizing the input ..
+compute_imatrix: tokenization took 313.289 ms
+compute_imatrix: computing over 213 chunks with batch_size 512
+compute_imatrix: 41.77 seconds per pass - ETA 2 hours 28.28 minutes
+[1]60.9029,[2]10.8011,[3]5.8709,[4]3.7872,[5]2.9688,[6]2.5088,[7]2.2214,[8]2.0224,[9]1.9110,
+save_imatrix: entry ' blk.60.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.60.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.60.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
+
+save_imatrix: stored collected data after 10 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
+[10]1.8230,[11]2.0314,[12]2.0866,[13]2.1000,[14]2.1455,[15]2.0412,[16]1.9535,[17]1.8827,[18]1.8197,[19]1.7778,
+save_imatrix: stored collected data after 20 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
+[20]1.7349,[21]1.7018,[22]1.6640,[23]1.6347,[24]1.6222,[25]1.6104,[26]1.5849,[27]1.6838,[28]1.7577,[29]1.8237,
+save_imatrix: stored collected data after 30 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
+[30]1.8219,[31]1.8354,[32]1.8351,[33]1.8125,[34]1.8489,[35]1.8250,[36]1.8245,[37]1.8131,[38]1.8239,[39]1.8108,
+save_imatrix: stored collected data after 40 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
+[40]1.7876,[41]1.7643,[42]1.7444,[43]1.7325,[44]1.7193,[45]1.7059,[46]1.7016,[47]1.6954,[48]1.6846,[49]1.6741,
+save_imatrix: stored collected data after 50 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
+[50]1.6684,[51]1.6656,[52]1.6657,[53]1.6704,[54]1.6844,[55]1.6811,[56]1.6712,[57]1.6794,[58]1.6833,[59]1.6943,
+save_imatrix: stored collected data after 60 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
+
+.
+.
+.
+
+[210]3.5371,[211]3.5164,[212]3.4959,[213]3.4755,
+save_imatrix: stored collected data after 213 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
+
+llama_print_timings: load time = 42726.11 ms
+llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: prompt eval time = 7125661.28 ms / 109056 tokens ( 65.34 ms per token, 15.30 tokens per second)
+llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: total time = 7201368.59 ms / 109057 tokens
+
+Final estimate: PPL = 3.4755 +/- 0.03305
+```
+
+
+
+---
+
+👤 **ikawrakow** commented on **2025-03-27** at **04:49:39**
+
+Close in favor of [#292](https://github.com/ikawrakow/ik_llama.cpp/issues/292)
\ No newline at end of file
diff --git a/github-data/pull_requests/291 - Disable Zen4 optimizations for Q8_0_Q8_0_R8.md b/github-data/pull_requests/291 - Disable Zen4 optimizations for Q8_0_Q8_0_R8.md
deleted file mode 100644
index 7e1eb4a81..000000000
--- a/github-data/pull_requests/291 - Disable Zen4 optimizations for Q8_0_Q8_0_R8.md
+++ /dev/null
@@ -1,214 +0,0 @@
-### 🔀 [#291](https://github.com/ikawrakow/ik_llama.cpp/pull/291) - Disable Zen4 optimizations for Q8_0/Q8_0_R8
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-03-26 |
-| **Updated** | 2025-03-27 |
-
----
-
-#### Description
-
-The purpose of this PR is to test if the NaNs observed for `Q8_0/Q8_0_R8` quantized DeepSeekV3/R1 will go away (#285)
-
-My hypothesis is that we get an overflow in the block sum of `Q8_1/Q8_1_X4`, which is stored as `fp16`. `Q8_1/Q8_1_X4` is used for activation quantization on Zen4 for `Q8_0/Q8_0_R8` quants. See also #196
-
-The PR disables the Zen4 optimization and reverts to the vanilla `AVX2` implementation, which uses `Q8_0` (just like mainline `llama.cpp`).
-
-Performance goes down quite a bit, but if we confirm that the change eliminates the NaNs, I will make a better PR that keeps the performance while avoiding the NaNs.
-
----
-
-#### 💬 Conversation
-
-👤 **ubergarm** commented the **2025-03-26** at **18:52:28**:
-
-Finished successfully, just updated logs. Thanks!
-
----
-
-👤 **ubergarm** commented the **2025-03-26** at **19:28:51**:
-
-Oh nice, seems like with this patch I'm also able to get an imatrix going with MLA tensors on the `V3-0324` `q8_0` gguf I recently made. Letting that cook, here is partial outputs for now :point_down:
-
-
-
-llama-imatrix run on q8_0
-
-```bash
-$ git rev-parse --short HEAD
-2089147a
-
-$ numactl -N 1 -m 1 \
-./build/bin/llama-imatrix \
- --verbosity 1 \
- -m /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf \
- -f calibration_data_v5_rc.txt \
- -o /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-$(git rev-parse --short HEAD).dat \
- --ctx-size 512 \
- --numa numactl \
- --threads 128 2>&1 | tee -a output.log
-
-llama_model_loader: loaded meta data with 46 key-value pairs and 1147 tensors from /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf (version GGUF V3 (latest))
-llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
-llama_model_loader: - kv 0: general.architecture str = deepseek2
-llama_model_loader: - kv 1: general.type str = model
-llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
-llama_model_loader: - kv 3: general.version str = V3-0324
-llama_model_loader: - kv 4: general.basename str = DeepSeek
-llama_model_loader: - kv 5: general.size_label str = 256x21B
-llama_model_loader: - kv 6: general.license str = mit
-llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
-llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
-llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
-llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
-llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
-llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
-llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
-llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
-llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
-llama_model_loader: - kv 16: general.file_type u32 = 7
-llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
-llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
-llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
-llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
-llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
-llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
-llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
-llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
-llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
-llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
-llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
-llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
-llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
-llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
-llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
-llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
-llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
-llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
-llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
-llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["
-llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3
-llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["
-llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
-llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
-llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
-llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
-llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
-llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
-llama_model_loader: - kv 45: general.quantization_version u32 = 2
-llama_model_loader: - type f32: 361 tensors
-llama_model_loader: - type q8_0: 786 tensors
-llm_load_vocab: special tokens cache size = 818
-llm_load_vocab: token to piece cache size = 0.8223 MB
-llm_load_print_meta: format = GGUF V3 (latest)
-llm_load_print_meta: arch = deepseek2
-llm_load_print_meta: vocab type = BPE
-llm_load_print_meta: n_vocab = 129280
-llm_load_print_meta: n_merges = 127741
-llm_load_print_meta: vocab_only = 0
-llm_load_print_meta: n_ctx_train = 163840
-llm_load_print_meta: n_embd = 7168
-llm_load_print_meta: n_layer = 61
-llm_load_print_meta: n_head = 128
-llm_load_print_meta: n_head_kv = 128
-llm_load_print_meta: n_rot = 64
-llm_load_print_meta: n_swa = 0
-llm_load_print_meta: n_embd_head_k = 192
-llm_load_print_meta: n_embd_head_v = 128
-llm_load_print_meta: n_gqa = 1
-llm_load_print_meta: n_embd_k_gqa = 24576
-llm_load_print_meta: n_embd_v_gqa = 16384
-llm_load_print_meta: f_norm_eps = 0.0e+00
-llm_load_print_meta: f_norm_rms_eps = 1.0e-06
-llm_load_print_meta: f_clamp_kqv = 0.0e+00
-llm_load_print_meta: f_max_alibi_bias = 0.0e+00
-llm_load_print_meta: f_logit_scale = 0.0e+00
-llm_load_print_meta: n_ff = 18432
-llm_load_print_meta: n_expert = 256
-llm_load_print_meta: n_expert_used = 8
-llm_load_print_meta: causal attn = 1
-llm_load_print_meta: pooling type = 0
-llm_load_print_meta: rope type = 0
-llm_load_print_meta: rope scaling = yarn
-llm_load_print_meta: freq_base_train = 10000.0
-llm_load_print_meta: freq_scale_train = 0.025
-llm_load_print_meta: n_ctx_orig_yarn = 4096
-llm_load_print_meta: rope_finetuned = unknown
-llm_load_print_meta: ssm_d_conv = 0
-llm_load_print_meta: ssm_d_inner = 0
-llm_load_print_meta: ssm_d_state = 0
-llm_load_print_meta: ssm_dt_rank = 0
-llm_load_print_meta: model type = 671B
-llm_load_print_meta: model ftype = Q8_0
-llm_load_print_meta: model params = 672.050 B
-llm_load_print_meta: model size = 665.308 GiB (8.504 BPW)
-llm_load_print_meta: repeating layers = 663.474 GiB (8.504 BPW, 670.196 B parameters)
-llm_load_print_meta: general.name = DeepSeek V3 0324
-llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
-llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
-llm_load_print_meta: LF token = 131 'Ä'
-llm_load_print_meta: max token length = 256
-llm_load_print_meta: n_layer_dense_lead = 3
-llm_load_print_meta: n_lora_q = 1536
-llm_load_print_meta: n_lora_kv = 512
-llm_load_print_meta: n_ff_exp = 2048
-llm_load_print_meta: n_expert_shared = 1
-llm_load_print_meta: expert_weights_scale = 2.5
-llm_load_print_meta: expert_weights_norm = 1
-llm_load_print_meta: expert_gating_func = sigmoid
-llm_load_print_meta: rope_yarn_log_mul = 0.1000
-llm_load_tensors: ggml ctx size = 0.47 MiB
-llm_load_tensors: CPU buffer size = 681274.97 MiB
-....................................................................................................
-llama_new_context_with_model: n_ctx = 512
-llama_new_context_with_model: n_batch = 512
-llama_new_context_with_model: n_ubatch = 512
-llama_new_context_with_model: flash_attn = 0
-llama_new_context_with_model: mla_attn = 0
-llama_new_context_with_model: attn_max_b = 0
-llama_new_context_with_model: fused_moe = 0
-llama_new_context_with_model: ser = -1, 0
-llama_new_context_with_model: freq_base = 10000.0
-llama_new_context_with_model: freq_scale = 0.025
-llama_kv_cache_init: CPU KV buffer size = 2440.00 MiB
-llama_new_context_with_model: KV self size = 2440.00 MiB, K (f16): 1464.00 MiB, V (f16): 976.00 MiB
-llama_new_context_with_model: CPU output buffer size = 0.49 MiB
-llama_new_context_with_model: CPU compute buffer size = 283.01 MiB
-llama_new_context_with_model: graph nodes = 3724
-llama_new_context_with_model: graph splits = 1
-
-system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
-compute_imatrix: tokenizing the input ..
-compute_imatrix: tokenization took 313.289 ms
-compute_imatrix: computing over 213 chunks with batch_size 512
-compute_imatrix: 41.77 seconds per pass - ETA 2 hours 28.28 minutes
-[1]60.9029,[2]10.8011,[3]5.8709,[4]3.7872,[5]2.9688,[6]2.5088,[7]2.2214,[8]2.0224,[9]1.9110,
-save_imatrix: entry ' blk.60.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
-save_imatrix: entry ' blk.60.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
-save_imatrix: entry ' blk.60.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
-
-save_imatrix: stored collected data after 10 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
-[10]1.8230,[11]2.0314,[12]2.0866,[13]2.1000,[14]2.1455,[15]2.0412,[16]1.9535,[17]1.8827,[18]1.8197,[19]1.7778,
-save_imatrix: stored collected data after 20 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
-[20]1.7349,[21]1.7018,[22]1.6640,[23]1.6347,[24]1.6222,[25]1.6104,[26]1.5849,[27]1.6838,[28]1.7577,[29]1.8237,
-save_imatrix: stored collected data after 30 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
-[30]1.8219,[31]1.8354,[32]1.8351,[33]1.8125,[34]1.8489,[35]1.8250,[36]1.8245,[37]1.8131,[38]1.8239,[39]1.8108,
-save_imatrix: stored collected data after 40 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
-[40]1.7876,[41]1.7643,[42]1.7444,[43]1.7325,[44]1.7193,[45]1.7059,[46]1.7016,[47]1.6954,[48]1.6846,[49]1.6741,
-save_imatrix: stored collected data after 50 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
-[50]1.6684,[51]1.6656,[52]1.6657,[53]1.6704,[54]1.6844,[55]1.6811,[56]1.6712,[57]1.6794,[58]1.6833,[59]1.6943,
-save_imatrix: stored collected data after 60 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
-
-*WIP - still cookin'*
-```
-
-
-
----
-
-👤 **ikawrakow** commented the **2025-03-27** at **04:49:39**:
-
-Close in favor of #292
\ No newline at end of file
diff --git a/github-data/pull_requests/292 - Use bf16 instead of fp16 block scales for q8_1.md b/github-data/pull_requests/292 - Use bf16 instead of fp16 block scales for q8_1.md
index 71c6b1cbb..11b4bf7ef 100644
--- a/github-data/pull_requests/292 - Use bf16 instead of fp16 block scales for q8_1.md
+++ b/github-data/pull_requests/292 - Use bf16 instead of fp16 block scales for q8_1.md
@@ -1,24 +1,27 @@
-### 🔀 [#292](https://github.com/ikawrakow/ik_llama.cpp/pull/292) - Use bf16 instead of fp16 block scales for q8_1
+## 🔀 [Pull Request #292](https://github.com/ikawrakow/ik_llama.cpp/pull/292) - Use bf16 instead of fp16 block scales for q8_1
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/use_q8_2` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-26 |
| **Updated** | 2025-03-27 |
+| **Merged** | 2025-03-27 |
---
-#### Description
+## 📄 Description
-DeepSeek-V3/R1 gives NaNs when inference is run on a computer with `AVX512_VNNI` and the model is quantized with `Q8_0/Q8_0_R8` (issue #285). The difference to vanilla `AVX2` is that in that case activations are quantized with `Q8_1/Q8_1_X4`. The block scale and sum in `Q8_1/Q8_1_X4` are `fp16`.
+DeepSeek-V3/R1 gives NaNs when inference is run on a computer with `AVX512_VNNI` and the model is quantized with `Q8_0/Q8_0_R8` (issue [#285](https://github.com/ikawrakow/ik_llama.cpp/issues/285)). The difference to vanilla `AVX2` is that in that case activations are quantized with `Q8_1/Q8_1_X4`. The block scale and sum in `Q8_1/Q8_1_X4` are `fp16`.
-We did have similar issues with `IQ1_S`, which was solved in #194 by going to a different quantization type for the activations. I did create issue #196 because of that.
+We did have similar issues with `IQ1_S`, which was solved in [#194](https://github.com/ikawrakow/ik_llama.cpp/issues/194) by going to a different quantization type for the activations. I did create issue [#196](https://github.com/ikawrakow/ik_llama.cpp/issues/196) because of that.
-We also observed NaNs on CUDA for `IQ4_K` and `IQ4_KS`. These quantization types do not have MMQ kernels, so matrix multiplications were done via dequantization to `fp16` and cuBLAS GEMM. The NaNs were resolved via dequantizing to `bf16` instead (PR #261)
+We also observed NaNs on CUDA for `IQ4_K` and `IQ4_KS`. These quantization types do not have MMQ kernels, so matrix multiplications were done via dequantization to `fp16` and cuBLAS GEMM. The NaNs were resolved via dequantizing to `bf16` instead (PR [#261](https://github.com/ikawrakow/ik_llama.cpp/issues/261))
So, it seems one can not use `fp16` arithmetic in DeepSeek-V3/R1.
-This is further confirmed by #291, where we observe no NaNs when switching `Q8_0/Q8_0_R8` to vanilla `AVX2` implementation.
+This is further confirmed by [#291](https://github.com/ikawrakow/ik_llama.cpp/issues/291), where we observe no NaNs when switching `Q8_0/Q8_0_R8` to vanilla `AVX2` implementation.
This PR introduces `Q8_2/Q8_2_X4` quantization types that use `bf16` block scale and sum. All quantization types that previously used `Q8_1/Q8_1_X4` to quantize activations for CPU GEMM/GEMV are switched to `Q8_2/Q8_2_X4`.
@@ -26,35 +29,264 @@ This should resolve all NaNs on the CPU.
I wonder why we are not getting NaNs on CUDA for the quantization types that do use `Q8_1`. Or maybe we do, and it is just that nobody has reported.
-Closes #285 and #196
+Closes [#285](https://github.com/ikawrakow/ik_llama.cpp/issues/285) and [#196](https://github.com/ikawrakow/ik_llama.cpp/issues/196)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-03-26** at **19:37:47**:
+👤 **ubergarm** commented on **2025-03-26** at **19:37:47**
-I'm mostly afk until Friday, but will try to rebuild with this PR and test perplexity and imatrix again on a `q8_0` on the CPU only xeon 6980P rig if I get a moment before then. Thanks!
+I'm mostly afk until Friday, but had a moment to rebuild with this PR and run another perplexity test on the `q8_0` with the CPU only Xeon 6980P rig. *FINISHED*, looks clean, No NaNs :point_down:
+
+
+
+llama-perplexity run on `q8_0` with this `PR@918abd1`
+
+```bash
+$ git branch | grep q8_2
+* ik/use_q8_2
+
+$ git rev-parse --short HEAD
+918abd16
+
+$ numactl -N 1 -m 1 \
+./build/bin/llama-perplexity \
+ -m /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q8_0_R8.gguf \
+ -f wiki.test.raw \
+ -t 128 \
+ -b 512 \
+ --numa numactl 2>&1 | tee -a output.log
+
+main: build = 3619 (918abd16)
+main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+main: seed = 1743024820
+llama_model_loader: loaded meta data with 45 key-value pairs and 1025 tensors from /mnt/ai/models/unsloth/repack/DeepSeek-R1-Q8_0_R8.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16
+llama_model_loader: - kv 3: general.quantized_by str = Unsloth
+llama_model_loader: - kv 4: general.size_label str = 256x20B
+llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
+llama_model_loader: - kv 6: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840
+llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128
+llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128
+llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 15: general.file_type u32 = 207
+llama_model_loader: - kv 16: deepseek2.leading_dense_block_count u32 = 3
+llama_model_loader: - kv 17: deepseek2.vocab_size u32 = 129280
+llama_model_loader: - kv 18: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 19: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 20: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 21: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 22: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 23: deepseek2.expert_count u32 = 256
+llama_model_loader: - kv 24: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 25: deepseek2.expert_weights_scale f32 = 2.500000
+llama_model_loader: - kv 26: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 27: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 28: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 29: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 30: deepseek2.rope.scaling.factor f32 = 40.000000
+llama_model_loader: - kv 31: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 32: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 34: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,129280] = ["
+llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,129280] = [3
+llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,127741] = ["
+llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 40: tokenizer.ggml.padding_token_id u32 = 128815
+llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 42: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 43: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 44: general.quantization_version u32 = 2
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q8_0: 1 tensors
+llama_model_loader: - type q8_0_r8: 663 tensors
+llm_load_vocab: special tokens cache size = 819
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 163840
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 128
+llm_load_print_meta: n_head_kv = 128
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 24576
+llm_load_print_meta: n_embd_v_gqa = 16384
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 256
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 10000.0
+llm_load_print_meta: freq_scale_train = 0.025
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = 671B
+llm_load_print_meta: model ftype = Q8_0_R8 - 8.5 bpw
+llm_load_print_meta: model params = 671.026 B
+llm_load_print_meta: model size = 664.295 GiB (8.504 BPW)
+llm_load_print_meta: repeating layers = 662.461 GiB (8.504 BPW, 669.173 B parameters)
+llm_load_print_meta: general.name = DeepSeek R1 BF16
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 128815 '<|PAD▁TOKEN|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_layer_dense_lead = 3
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.5
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+llm_load_tensors: ggml ctx size = 0.42 MiB
+llm_load_tensors: CPU buffer size = 680237.97 MiB
+....................................................................................................
+============ llm_load_tensors: need to compute 61 wk_b tensors
+Computed blk.0.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.1.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.2.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.3.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.4.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.5.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.6.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.7.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.8.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.9.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.10.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.11.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.12.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.13.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.14.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.15.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.16.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.17.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.18.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.19.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.20.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.21.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.22.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.23.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.24.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.25.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.26.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.27.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.28.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.29.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.30.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.31.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.32.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.33.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.34.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.35.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.36.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.37.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.38.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.39.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.40.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.41.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.42.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.43.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.44.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.45.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.46.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.47.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.48.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.49.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.50.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.51.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.52.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.53.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Collama_new_context_with_model: n_ctx = 512
+llama_new_context_with_model: n_batch = 512
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 0
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 0.025
+llama_kv_cache_init: CPU KV buffer size = 2440.00 MiB
+llama_new_context_with_model: KV self size = 2440.00 MiB, K (f16): 1464.00 MiB, V (f16): 976.00 MiB
+llama_new_context_with_model: CPU output buffer size = 0.49 MiB
+llama_new_context_with_model: CPU compute buffer size = 283.01 MiB
+llama_new_context_with_model: graph nodes = 3724
+llama_new_context_with_model: graph splits = 1
+
+system_info: n_threads = 128 / 512 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 926.669 ms
+perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=512, n_seq=1
+perplexity: 4.79 seconds per pass - ETA 44.82 minutes
+mputed blk.54.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.55.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.56.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.57.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.58.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.59.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+Computed blk.60.attn_v_b.weight as 128 x 512 x 128 and stored in buffer CPU
+[1]2.5126,[2]3.2872,[3]2.3691,[4]1.9785,[5]1.7891,[6]1.6484,[7]1.5564,[8]1.4901,[9]1.4404,[10]1.4011,[11]1.3853,[12]1.4164,[13]1.4278,[14]1.5541,[15]1.6851,[16]1.7456,[17]1.9079,[18]2.0380,[19]2.0009,[20]1.9896,[21]2.0973,[22]2.0702,[23]2.0438,[24]2.0563,[25]2.0272,[26]2.0041,[27]2.0526,[28]2.0594,[29]2.1082,[30]2.1390,[31]2.1727,[32]2.1906,[33]2.2302,[34]2.2708,[35]2.3199,[36]2.3726,[37]2.4078,[38]2.4521,[39]2.4930,[40]2.5516,[41]2.5934,[42]2.6057,[43]2.6551,[44]2.6716,[45]2.7514,[46]2.8017,[47]2.7570,[48]2.7112,[49]2.6853,[50]2.7049,[51]2.7508,[52]2.7659,[53]2.8151,[54]2.8282,[55]2.8593,[56]2.8908,[57]2.9048,[58]2.9409,[59]2.9515,[60]2.9977,[61]3.0377,[62]3.0942,[63]3.1262,[64]3.1707,[65]3.1801,[66]3.1624,[67]3.1393,[68]3.1709,[69]3.1661,[70]3.1813,[71]3.2001,[72]3.2159,[73]3.2304,[74]3.2536,[75]3.2327,[76]3.1859,[77]3.1427,[78]3.1378,[79]3.1155,[80]3.0971,[81]3.0604,[82]3.0639,[83]3.0325,[84]2.9965,[85]2.9614,[86]2.9361,[87]2.9298,[88]2.9012,[89]2.8843,[90]2.8585,[91]2.8289,[92]2.8038,[93]2.7767,[94]2.7499,[95]2.7259,[96]2.7243,[97]2.7308,[98]2.7150,[99]2.6975,[100]2.6998,[101]2.6914,[102]2.7080,[103]2.7339,[104]2.7526,[105]2.7495,[106]2.7716,[107]2.7961,[108]2.8169,[109]2.8509,[110]2.8849,[111]2.9048,[112]2.8790,[113]2.8658,[114]2.8435,[115]2.8283,[116]2.8133,[117]2.7903,[118]2.7695,[119]2.7482,[120]2.7294,[121]2.7136,[122]2.6960,[123]2.6797,[124]2.6610,[125]2.6435,[126]2.6269,[127]2.6128,[128]2.6037,[129]2.5931,[130]2.5809,[131]2.5732,[132]2.5804,[133]2.5900,[134]2.5968,[135]2.6076,[136]2.6240,[137]2.6392,[138]2.6474,[139]2.6589,[140]2.6599,[141]2.6616,[142]2.6608,[143]2.6614,[144]2.6584,[145]2.6495,[146]2.6480,[147]2.6525,[148]2.6523,[149]2.6539,[150]2.6487,[151]2.6469,[152]2.6441,[153]2.6403,[154]2.6411,[155]2.6455,[156]2.6476,[157]2.6536,[158]2.6625,[159]2.6642,[160]2.6732,[161]2.6815,[162]2.6908,[163]2.6950,[164]2.7148,[165]2.7384,[166]2.7558,[167]2.7681,[168]2.7922,[169]2.8146,[170]2.8362,[171]2.8593,[172]2.8434,[173]2.8269,[174]2.8132,[175]2.8000,[176]2.7878,[177]2.7763,[178]2.7636,[179]2.7498,[180]2.7537,[181]2.7677,[182]2.7829,[183]2.7977,[184]2.8118,[185]2.8223,[186]2.8388,[187]2.8541,[188]2.8683,[189]2.8790,[190]2.8792,[191]2.8865,[192]2.8905,[193]2.8958,[194]2.9153,[195]2.9242,[196]2.9376,[197]2.9475,[198]2.9519,[199]2.9577,[200]2.9572,[201]2.9722,[202]2.9676,[203]2.9729,[204]2.9764,[205]2.9765,[206]2.9792,[207]2.9880,[208]2.9977,[209]3.0068,[210]3.0074,[211]3.0029,[212]3.0028,[213]3.0104,[214]3.0123,[215]3.0181,[216]3.0188,[217]3.0150,[218]3.0151,[219]3.0161,[220]3.0156,[221]3.0156,[222]3.0157,[223]3.0162,[224]3.0212,[225]3.0230,[226]3.0150,[227]3.0126,[228]3.0150,[229]3.0194,[230]3.0260,[231]3.0321,[232]3.0239,[233]3.0161,[234]3.0162,[235]3.0146,[236]3.0237,[237]3.0321,[238]3.0416,[239]3.0515,[240]3.0607,[241]3.0719,[242]3.0863,[243]3.0996,[244]3.1077,[245]3.1189,[246]3.1294,[247]3.1283,[248]3.1240,[249]3.1221,[250]3.1160,[251]3.1140,[252]3.1166,[253]3.1205,[254]3.1276,[255]3.1340,[256]3.1378,[257]3.1402,[258]3.1414,[259]3.1447,[260]3.1469,[261]3.1482,[262]3.1474,[263]3.1531,[264]3.1553,[265]3.1558,[266]3.1577,[267]3.1605,[268]3.1642,[269]3.1674,[270]3.1667,[271]3.1650,[272]3.1582,[273]3.1578,[274]3.1510,[275]3.1401,[276]3.1291,[277]3.1309,[278]3.1410,[279]3.1473,[280]3.1551,[281]3.1625,[282]3.1688,[283]3.1751,[284]3.1818,[285]3.1953,[286]3.1978,[287]3.2012,[288]3.2061,[289]3.2087,[290]3.2005,[291]3.1911,[292]3.1892,[293]3.1882,[294]3.1852,[295]3.1827,[296]3.1848,[297]3.1853,[298]3.1902,[299]3.1958,[300]3.1988,[301]3.2028,[302]3.2052,[303]3.2073,[304]3.2067,[305]3.2187,[306]3.2263,[307]3.2372,[308]3.2261,[309]3.2206,[310]3.2110,[311]3.2146,[312]3.2169,[313]3.2233,[314]3.2253,[315]3.2285,[316]3.2299,[317]3.2317,[318]3.2323,[319]3.2327,[320]3.2369,[321]3.2372,[322]3.2392,[323]3.2455,[324]3.2463,[325]3.2518,[326]3.2565,[327]3.2608,[328]3.2638,[329]3.2656,[330]3.2719,[331]3.2758,[332]3.2804,[333]3.2791,[334]3.2792,[335]3.2797,[336]3.2800,[337]3.2811,[338]3.2814,[339]3.2841,[340]3.2878,[341]3.2933,[342]3.3023,[343]3.3116,[344]3.3169,[345]3.3083,[346]3.3006,[347]3.2954,[348]3.2880,[349]3.2843,[350]3.2827,[351]3.2873,[352]3.3022,[353]3.3113,[354]3.3242,[355]3.3326,[356]3.3378,[357]3.3495,[358]3.3591,[359]3.3622,[360]3.3687,[361]3.3780,[362]3.3865,[363]3.3923,[364]3.3989,[365]3.4051,[366]3.4155,[367]3.4242,[368]3.4311,[369]3.4390,[370]3.4475,[371]3.4612,[372]3.4699,[373]3.4731,[374]3.4765,[375]3.4815,[376]3.4943,[377]3.5055,[378]3.5082,[379]3.5076,[380]3.5042,[381]3.5088,[382]3.5145,[383]3.5181,[384]3.5224,[385]3.5262,[386]3.5323,[387]3.5381,[388]3.5415,[389]3.5311,[390]3.5218,[391]3.5114,[392]3.5056,[393]3.4961,[394]3.4871,[395]3.4779,[396]3.4679,[397]3.4590,[398]3.4495,[399]3.4391,[400]3.4303,[401]3.4203,[402]3.4100,[403]3.4014,[404]3.3912,[405]3.3818,[406]3.3719,[407]3.3627,[408]3.3538,[409]3.3453,[410]3.3393,[411]3.3400,[412]3.3354,[413]3.3370,[414]3.3391,[415]3.3360,[416]3.3358,[417]3.3381,[418]3.3323,[419]3.3338,[420]3.3313,[421]3.3301,[422]3.3317,[423]3.3309,[424]3.3350,[425]3.3344,[426]3.3349,[427]3.3338,[428]3.3363,[429]3.3381,[430]3.3409,[431]3.3416,[432]3.3406,[433]3.3370,[434]3.3370,[435]3.3293,[436]3.3229,[437]3.3188,[438]3.3171,[439]3.3137,[440]3.3187,[441]3.3240,[442]3.3315,[443]3.3297,[444]3.3305,[445]3.3318,[446]3.3365,[447]3.3398,[448]3.3424,[449]3.3454,[450]3.3492,[451]3.3522,[452]3.3543,[453]3.3559,[454]3.3545,[455]3.3567,[456]3.3570,[457]3.3597,[458]3.3649,[459]3.3655,[460]3.3656,[461]3.3624,[462]3.3662,[463]3.3734,[464]3.3787,[465]3.3716,[466]3.3695,[467]3.3676,[468]3.3686,[469]3.3657,[470]3.3630,[471]3.3633,[472]3.3641,[473]3.3632,[474]3.3623,[475]3.3636,[476]3.3620,[477]3.3610,[478]3.3618,[479]3.3635,[480]3.3662,[481]3.3622,[482]3.3656,[483]3.3648,[484]3.3683,[485]3.3748,[486]3.3777,[487]3.3814,[488]3.3867,[489]3.3892,[490]3.3938,[491]3.3999,[492]3.4044,[493]3.4042,[494]3.4053,[495]3.4079,[496]3.4097,[497]3.4126,[498]3.4129,[499]3.4124,[500]3.4164,[501]3.4211,[502]3.4202,[503]3.4186,[504]3.4207,[505]3.4241,[506]3.4326,[507]3.4354,[508]3.4389,[509]3.4316,[510]3.4258,[511]3.4193,[512]3.4147,[513]3.4084,[514]3.4069,[515]3.4089,[516]3.4039,[517]3.4036,[518]3.4028,[519]3.4034,[520]3.4078,[521]3.4066,[522]3.4053,[523]3.4111,[524]3.4098,[525]3.4083,[526]3.4034,[527]3.3984,[528]3.3949,[529]3.3919,[530]3.3888,[531]3.3858,[532]3.3803,[533]3.3741,[534]3.3698,[535]3.3706,[536]3.3733,[537]3.3765,[538]3.3789,[539]3.3816,[540]3.3868,[541]3.3901,[542]3.3924,[543]3.3869,[544]3.3826,[545]3.3822,[546]3.3756,[547]3.3691,[548]3.3627,[549]3.3560,[550]3.3499,[551]3.3437,[552]3.3379,[553]3.3320,[554]3.3299,[555]3.3285,[556]3.3313,[557]3.3353,[558]3.3412,[559]3.3457,[560]3.3509,[561]3.3491,
+llama_print_timings: load time = 4673.16 ms
+llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: prompt eval time = 2385511.51 ms / 287232 tokens ( 8.31 ms per token, 120.41 tokens per second)
+llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: total time = 2597613.67 ms / 287233 tokens
+
+Final estimate: PPL = 3.3491 +/- 0.01849
+```
+
+
---
-👤 **ikawrakow** commented the **2025-03-27** at **04:49:07**:
+👤 **ikawrakow** commented on **2025-03-27** at **04:49:07**
Thank you for verifying that it works!
---
-👤 **saood06** commented the **2025-03-27** at **08:14:07**:
+👤 **saood06** commented on **2025-03-27** at **08:14:07**
-> Closes #285 and #196
-
-This only closed #285, for multiple commands need to use a comma and repeat each command ([source](https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/linking-a-pull-request-to-an-issue)).
+> Closes [#285](https://github.com/ikawrakow/ik_llama.cpp/issues/285) and [#196](https://github.com/ikawrakow/ik_llama.cpp/issues/196)
-Closes #196
+This only closed [#285](https://github.com/ikawrakow/ik_llama.cpp/issues/285), for multiple commands need to use a comma and repeat each command ([source](https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/linking-a-pull-request-to-an-issue)).
---
-👤 **saood06** commented the **2025-03-27** at **08:23:08**:
+👤 **saood06** commented on **2025-03-27** at **08:23:08**
>So, it seems one can not use fp16 arithmetic in DeepSeek-V3/R1.
@@ -62,7 +294,7 @@ Is this why https://github.com/ikawrakow/ik_llama.cpp/discussions/242#discussion
---
-👤 **ikawrakow** commented the **2025-03-27** at **08:27:17**:
+👤 **ikawrakow** commented on **2025-03-27** at **08:27:17**
> Is this why https://github.com/ikawrakow/ik_llama.cpp/discussions/242#discussioncomment-12429240 the imatrix in that comment was failing?
diff --git a/github-data/pull_requests/294 - Make sure tensor row size is multiple of block size also when quantizin.md b/github-data/pull_requests/294 - Make sure tensor row size is multiple of block size also when quantizing with --.md
similarity index 59%
rename from github-data/pull_requests/294 - Make sure tensor row size is multiple of block size also when quantizin.md
rename to github-data/pull_requests/294 - Make sure tensor row size is multiple of block size also when quantizing with --.md
index 8a0d1e694..5775e72fa 100644
--- a/github-data/pull_requests/294 - Make sure tensor row size is multiple of block size also when quantizin.md
+++ b/github-data/pull_requests/294 - Make sure tensor row size is multiple of block size also when quantizing with --.md
@@ -1,13 +1,16 @@
-### 🔀 [#294](https://github.com/ikawrakow/ik_llama.cpp/pull/294) - Make sure tensor row size is multiple of block size also when quantizing with --pure
+## 🔀 [Pull Request #294](https://github.com/ikawrakow/ik_llama.cpp/pull/294) - Make sure tensor row size is multiple of block size also when quantizing with --pure
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/change_q_pure` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-27 |
| **Updated** | 2025-03-27 |
+| **Merged** | 2025-03-27 |
---
-#### Description
+## 📄 Description
`ffn_down_exps` row sizes are not a multiple of 256 in DeepSeek-Lite. When using `--pure` with `llama-quantize` this leads to a crash. I got tired of having to do custom quantization overrides in that case, so this PR adds the check for divisibility by the quantization block size also for `--pure`, and uses the fallback quantization type if necessary.
\ No newline at end of file
diff --git a/github-data/pull_requests/295 - Quantization improvements.md b/github-data/pull_requests/295 - Quantization improvements.md
index 9e9e42612..f3f5d8496 100644
--- a/github-data/pull_requests/295 - Quantization improvements.md
+++ b/github-data/pull_requests/295 - Quantization improvements.md
@@ -1,14 +1,17 @@
-### 🔀 [#295](https://github.com/ikawrakow/ik_llama.cpp/pull/295) - Quantization improvements
+## 🔀 [Pull Request #295](https://github.com/ikawrakow/ik_llama.cpp/pull/295) - Quantization improvements
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/make_qx_quants` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-28 |
| **Updated** | 2025-03-30 |
+| **Merged** | 2025-03-29 |
---
-#### Description
+## 📄 Description
It is now more than a year since I added the imatrix to `llama.cpp`. I think we can say that imatrix based quantization is now the standard. Hence, I believe it is no longer necessary to make quantization robust against failure modes that can be triggered when quantizing without an imatrix.
@@ -63,7 +66,7 @@ ___
3 This quantization type is not available in mainline `llama.cpp`.
-4 Some of the tensor row size are not divisible by the k- and i-quants super-block size of 256. In mainline `llama.cpp` the quantization fails in that case when using `--pure`. I have changed `ik_llama.cpp` to use the fallback quantization type in that case in PR #294.
+4 Some of the tensor row size are not divisible by the k- and i-quants super-block size of 256. In mainline `llama.cpp` the quantization fails in that case when using `--pure`. I have changed `ik_llama.cpp` to use the fallback quantization type in that case in PR [#294](https://github.com/ikawrakow/ik_llama.cpp/issues/294).
5 PR 12557 does not change `Q6_K` quantization.
@@ -89,9 +92,9 @@ Extending the above algorithm to the non-linear quants `IQ4_XS` and `IQ4_NL` is
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **compilade** commented the **2025-03-28** at **15:35:37**:
+👤 **compilade** commented on **2025-03-28** at **15:35:37**
Nice! It seems like your improved `make_qx_quants` is extremely similar to `make_qkxh_quants` when starting the search from `MIN(abs(nmin), abs(nmax)) - 1` instead of `MIN(abs(nmin), abs(nmax)) / 2` (when comparing the equirectangular projections). This would also make `make_qkxh_quants` faster (though I don't know by how much).
@@ -99,26 +102,37 @@ Here's your improved `make_qx_quants` with settings from `Q4_0`:

+```
+np.min(cos)=0.9962519378012932
+np.mean(cos)=0.9994353889532565
+np.max(cos)=1.0
+```
+
And your improved `quantize_row_iq4_nl_impl` looks like this:

+```
+np.min(cos)=0.9978821632073399
+np.mean(cos)=0.9994873576857634
+np.max(cos)=0.9999999996961985
+```
Very interesting approach with the gradient.
---
-👤 **ikawrakow** commented the **2025-03-28** at **19:44:43**:
+👤 **ikawrakow** commented on **2025-03-28** at **19:44:43**
To be honest I don't understand these plots. I know yellow is good and blue is bad, and there is a lot of blue, so they must be pretty bad?
---
-👤 **compilade** commented the **2025-03-28** at **19:59:47**:
+👤 **compilade** commented on **2025-03-28** at **19:59:47**
> To be honest I don't understand these plots. I know yellow is good and blue is bad, and there is a lot of blue, so they must be pretty bad?
-No, the plots of your algorithms are not bad. Blue is simply the color of the max error. I did also include the min mean and max cosine similarities of the plots.
+No, the plots of your algorithms are not bad. Blue is simply the color of the max error. I did also include the values for the min mean and max cosine similarities of the plots.
If an algorithm had a very big error in one spot, everything else would be yellow. This means the colors can't really be compared directly.
@@ -128,7 +142,7 @@ In this case, the modifications you propose here **do improve** how the plots lo
---
-👤 **ikawrakow** commented the **2025-03-28** at **20:03:32**:
+👤 **ikawrakow** commented on **2025-03-28** at **20:03:32**
And what are the two coordinates of the plot? I understand it is a projection, but what is it that is being projected?
@@ -138,7 +152,7 @@ That would be the standard way to approach an optimization problem, no?
---
-👤 **compilade** commented the **2025-03-28** at **20:55:13**:
+👤 **compilade** commented on **2025-03-28** at **20:55:13**
> And what are the two coordinates of the plot? I understand it is a projection, but what is it that is being projected?
@@ -164,13 +178,13 @@ I will compare the speed and perplexity of narrower cumulative search with this
---
-👤 **saood06** commented the **2025-03-28** at **23:16:13**:
+👤 **saood06** commented on **2025-03-28** at **23:16:13**
>Tested is "pure" quantization (i.e., using the `--pure` option of `llama-quantize`) with token embeddings and output tensor set to `Q8_0`.
Was this needed for some quants of DSL to function? As I ran into issues with a pure iq4_k_r4 quant for the new Deepseek V3 0324 (as my first mix of this finetune was noticeably slower than my first and fastest mix of R1).
-The pure ran at about the same speed as that R1 mix (I think it should have been a bit faster than it is and the speed loss may be from #259 since for this model I did not convert it myself and grabbed a conversion that was done with mainline), but it was not functional (I forgot to test perplexity before unloading it), either giving a few incomprehensible tokens or just straight to an EOS token from my brief usage.
+The pure ran at about the same speed as that R1 mix (I think it should have been a bit faster than it is and the speed loss may be from [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) since for this model I did not convert it myself and grabbed a conversion that was done with mainline), but it was not functional (I forgot to test perplexity before unloading it), either giving a few incomprehensible tokens or just straight to an EOS token from my brief usage.
Comparing the quant logs for both, the only different tensors of the functional R1 mix were the following 5:
@@ -229,15 +243,15 @@ llm_load_print_meta: model params = 671.026 B //this is lower because of MLA
Do you think that setting output.weight to iq6_k and leaving the rest completely pure would work?
-When I do make this next quant I might end up converting the model myself to see if #259 was costing me performance (even if I won't be comparing the exact same mix, I think it would still answer that question).
+When I do make this next quant I might end up converting the model myself to see if [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) was costing me performance (even if I won't be comparing the exact same mix, I think it would still answer that question).
---
-👤 **ikawrakow** commented the **2025-03-29** at **06:53:18**:
+👤 **ikawrakow** commented on **2025-03-29** at **06:53:18**
> When I do make this next quant I might end up converting the model myself to see if https://github.com/ikawrakow/ik_llama.cpp/pull/259 was costing me performance
-#259 creates `attn_k_b` and `attn_v_b` as `Q8_0`, so this can have impact on TG performance compared to a model where these tensors were created with lower bpw. Apart from this, your system seems to be extremely sensitive to how things are laid out in memory, and creating `attn_k_b` and `attn_v_b` on the fly will lead to a different memory layout.
+[#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) creates `attn_k_b` and `attn_v_b` as `Q8_0`, so this can have impact on TG performance compared to a model where these tensors were created with lower bpw. Apart from this, your system seems to be extremely sensitive to how things are laid out in memory, and creating `attn_k_b` and `attn_v_b` on the fly will lead to a different memory layout.
> but it was not functional (I forgot to test perplexity before unloading it), either giving a few incomprehensible tokens or just straight to an EOS token from my brief usage.
@@ -245,28 +259,28 @@ Not sure about this one.
---
-👤 **saood06** commented the **2025-03-29** at **07:36:32**:
+👤 **saood06** commented on **2025-03-29** at **07:36:32**
-> > When I do make this next quant I might end up converting the model myself to see if #259 was costing me performance
+> > When I do make this next quant I might end up converting the model myself to see if [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) was costing me performance
>
-> #259 creates `attn_k_b` and `attn_v_b` as `Q8_0`, so this can have impact on TG performance compared to a model where these tensors were created with lower bpw.
+> [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) creates `attn_k_b` and `attn_v_b` as `Q8_0`, so this can have impact on TG performance compared to a model where these tensors were created with lower bpw.
Yes I experimented with some quant mixes with those at Q8_0 before to see how much impact they had on PPL (but never isolated effects as the change in PPL was too minor and the TG impact too large for my preferences).
>Apart from this, your system seems to be extremely sensitive to how things are laid out in memory, and creating `attn_k_b` and `attn_v_b` on the fly will lead to a different memory layout.
-Yes it is unfortunately very sensitive to that, I even considered #259 before I downloaded this preconverted model but decided to try it anyway.
+Yes it is unfortunately very sensitive to that, I even considered [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) before I downloaded this preconverted model but decided to try it anyway.
> > but it was not functional (I forgot to test perplexity before unloading it), either giving a few incomprehensible tokens or just straight to an EOS token from my brief usage.
>
> Not sure about this one.
-I'll test attn_output.weight set to iq6_k and report back when I get a chance (will first have to download and convert the model so that I can also test #259 ).
+I'll test attn_output.weight set to iq6_k and report back when I get a chance (will first have to download and convert the model so that I can also test [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) ).
---
-👤 **saood06** commented the **2025-03-30** at **08:44:47**:
+👤 **saood06** commented on **2025-03-30** at **08:44:47**
-> I'll test attn_output.weight set to iq6_k and report back when I get a chance (will first have to download and convert the model so that I can also test #259 ).
+> I'll test attn_output.weight set to iq6_k and report back when I get a chance (will first have to download and convert the model so that I can also test [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) ).
-This was also outputting gibberish.
\ No newline at end of file
+This was also outputting gibberish. It seems both are important.
\ No newline at end of file
diff --git a/github-data/pull_requests/298 - Update gguf-py constants.md b/github-data/pull_requests/298 - Update gguf-py constants.md
index d4737b8dc..856c5e186 100644
--- a/github-data/pull_requests/298 - Update gguf-py constants.md
+++ b/github-data/pull_requests/298 - Update gguf-py constants.md
@@ -1,16 +1,19 @@
-### 🔀 [#298](https://github.com/ikawrakow/ik_llama.cpp/pull/298) - Update gguf-py constants
+## 🔀 [Pull Request #298](https://github.com/ikawrakow/ik_llama.cpp/pull/298) - Update gguf-py constants
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/fix_python` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-31 |
| **Updated** | 2025-04-24 |
+| **Merged** | 2025-04-24 |
---
-#### Description
+## 📄 Description
-As reported in #297 the constants.py file needs to be updated.
+As reported in [#297](https://github.com/ikawrakow/ik_llama.cpp/issues/297) the constants.py file needs to be updated.
Testing the command that errored it now gets further.
@@ -42,9 +45,9 @@ This is because GGML_QUANT_SIZES ([code](https://github.com/ikawrakow/ik_llama.c
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-03-31** at **08:03:08**:
+👤 **ikawrakow** commented on **2025-03-31** at **08:03:08**
> could you give me a hint at how to update this?
@@ -54,7 +57,7 @@ The python stuff is in desperate need of sync with mainline. But the difference
---
-👤 **saood06** commented the **2025-03-31** at **09:07:46**:
+👤 **saood06** commented on **2025-03-31** at **09:07:46**
> > could you give me a hint at how to update this?
>
@@ -69,7 +72,7 @@ I'm still testing the performance implications of that on my system, it seems li
---
-👤 **saood06** commented the **2025-03-31** at **09:10:53**:
+👤 **saood06** commented on **2025-03-31** at **09:10:53**
>The python stuff is in desperate need of sync with mainline.
@@ -79,7 +82,7 @@ This GGML_QUANT_SIZES is the only thing I know that is missing besides the Gemma
---
-👤 **ikawrakow** commented the **2025-03-31** at **09:15:43**:
+👤 **ikawrakow** commented on **2025-03-31** at **09:15:43**
> What went wrong with the Gemma changes
@@ -87,7 +90,7 @@ It wasn't working. I copy-pasted the Gemma3 portion, but it started throwing exc
---
-👤 **saood06** commented the **2025-04-24** at **04:23:34**:
+👤 **saood06** commented on **2025-04-24** at **04:23:34**
@ikawrakow
@@ -96,8 +99,38 @@ Thanks for the hint. I was able to update GGML_QUANT_SIZES and this should be re
Running `python gguf-py/scripts/gguf_dump.py --markdown /mnt/sda/DeepSeek-V3-0324-IQ4_K_R4.gguf` works now. Output of the command attached below.
-[gguf_dump1.md](https://github.com/user-attachments/files/19884332/gguf_dump1.md)
+[gguf_dump1.md](https://github.com/user-attachments/files/19884332/gguf_dump1.md)
+
+Edit: Something seems wrong with I2_S, trying to use dump the model it runs into this error.
+
+```
+ File "/home/saood06/ik_main/ik_llama.cpp/gguf-py/gguf/gguf_reader.py", line 130, in __init__
+ self._build_tensors(offs, tensors_fields)
+ ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
+ File "/home/saood06/ik_main/ik_llama.cpp/gguf-py/gguf/gguf_reader.py", line 325, in _build_tensors
+ data = self._get(data_offs, item_type, item_count).reshape(np_dims),
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
+ValueError: cannot reshape array of size 17425632 into shape (2560,6912)
+```
+
+I added a print statement to see which tensor it was hitting and all it's values:
+
+```
+Processing tensor: blk.9.ffn_down.weight
+ dims: [6912, 2560]
+ raw_dtype: [36]
+ ggml_type: I2_S
+ n_elems: 17694720
+ np_dims: (2560, 6912)
+ block_size: 1, type_size: 1
+ n_bytes: 17694720
+ data_offs: 1827046400
+ item_count: 17694720
+ item_type:
+```
+
+Interestingly enough the `iq2_bn_r4` and `iq2_bn` converted version does not error and I can gguf-dump them.
---
-👤 **ikawrakow** submitted a review the **2025-04-24** at **05:33:08**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-04-24** at **05:33:08**
\ No newline at end of file
diff --git a/github-data/pull_requests/299 - Additional guards for interleaved quants.md b/github-data/pull_requests/299 - Additional guards for interleaved quants.md
index 72f338566..08d8f523d 100644
--- a/github-data/pull_requests/299 - Additional guards for interleaved quants.md
+++ b/github-data/pull_requests/299 - Additional guards for interleaved quants.md
@@ -1,24 +1,27 @@
-### 🔀 [#299](https://github.com/ikawrakow/ik_llama.cpp/pull/299) - Additional guards for interleaved quants
+## 🔀 [Pull Request #299](https://github.com/ikawrakow/ik_llama.cpp/pull/299) - Additional guards for interleaved quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/interleaved_guards` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-31 |
| **Updated** | 2025-04-01 |
+| **Merged** | 2025-04-01 |
---
-#### Description
+## 📄 Description
-Apparently not all use cases are covered when using interleaved quants, see #296.
+Apparently not all use cases are covered when using interleaved quants, see [#296](https://github.com/ikawrakow/ik_llama.cpp/issues/296).
Hopefully this PR handles all scenarios where one may arrive at using an interleaved quantization type where this is not possible.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-03-31** at **12:05:48**:
+👤 **saood06** commented on **2025-03-31** at **12:05:48**
Decided to test this branch, using just pure with `./llama-quantize --imatrix /mnt/sda/imatrix_V30324_mrader.dat --pure /mnt/sda/DeepseekV3_0324/DeepseekV3_0324-256x21B-BF16.gguf /mnt/sda/DeepSeek-V3-0324-IQ4_K_R4_ATT5.gguf IQ4_K_R4 48` and token embedding was still using the interleaved type.
@@ -51,7 +54,7 @@ converting to iq4_k_r4 .. size = 1767.50 MiB -> 497.11 MiB
---
-👤 **ikawrakow** commented the **2025-03-31** at **12:46:26**:
+👤 **ikawrakow** commented on **2025-03-31** at **12:46:26**
None of the above happens to me. Here the log of
```
@@ -782,7 +785,7 @@ Same outcome with `--custom-q ".*=iq4_k_r4"`.
---
-👤 **saood06** commented the **2025-04-01** at **00:08:56**:
+👤 **saood06** commented on **2025-04-01** at **00:08:56**
> None of the above happens to me. Here the log of
diff --git a/github-data/pull_requests/3 - Merge mainline llama.cpp.md b/github-data/pull_requests/3 - Merge mainline llama.cpp.md
index 72da515cd..7116646b4 100644
--- a/github-data/pull_requests/3 - Merge mainline llama.cpp.md
+++ b/github-data/pull_requests/3 - Merge mainline llama.cpp.md
@@ -1,21 +1,24 @@
-### 🔀 [#3](https://github.com/ikawrakow/ik_llama.cpp/pull/3) - Merge mainline llama.cpp
+## 🔀 [Pull Request #3](https://github.com/ikawrakow/ik_llama.cpp/pull/3) - Merge mainline llama.cpp
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/merge_July_26_2024` |
+| **Target Branch** | `main` |
| **Created** | 2024-07-26 |
| **Updated** | 2024-07-27 |
+| **Merged** | 2024-07-27 |
---
-#### Description
+## 📄 Description
Only quick testing so far.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2024-07-27** at **05:54:36**:
+👤 **ikawrakow** commented on **2024-07-27** at **05:54:36**
Seems to be working -> merging
\ No newline at end of file
diff --git a/github-data/pull_requests/301 - Fix 300.md b/github-data/pull_requests/301 - Fix 300.md
new file mode 100644
index 000000000..9c6354a7f
--- /dev/null
+++ b/github-data/pull_requests/301 - Fix 300.md
@@ -0,0 +1,24 @@
+## 🔀 [Pull Request #301](https://github.com/ikawrakow/ik_llama.cpp/pull/301) - Fix [#300](https://github.com/ikawrakow/ik_llama.cpp/issues/300)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_300` |
+| **Target Branch** | `main` |
+| **Created** | 2025-03-31 |
+| **Updated** | 2025-04-01 |
+| **Merged** | 2025-04-01 |
+
+---
+
+## 📄 Description
+
+Closes [#300](https://github.com/ikawrakow/ik_llama.cpp/issues/300)
+
+---
+
+## 💬 Conversation
+
+👤 **saood06** commented on **2025-04-01** at **00:24:11**
+
+Thanks, it now compiles again.
\ No newline at end of file
diff --git a/github-data/pull_requests/301 - Fix _300.md b/github-data/pull_requests/301 - Fix _300.md
deleted file mode 100644
index bf20e8bb1..000000000
--- a/github-data/pull_requests/301 - Fix _300.md
+++ /dev/null
@@ -1,21 +0,0 @@
-### 🐛 [#301](https://github.com/ikawrakow/ik_llama.cpp/pull/301) - Fix [#300](https://github.com/ikawrakow/ik_llama.cpp/issues/300)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-03-31 |
-| **Updated** | 2025-04-01 |
-
----
-
-#### Description
-
-Closes #300
-
----
-
-#### 💬 Conversation
-
-👤 **saood06** commented the **2025-04-01** at **00:24:11**:
-
-Thanks, it now compiles again.
\ No newline at end of file
diff --git a/github-data/pull_requests/302 - Quantization improvements _2_.md b/github-data/pull_requests/302 - Quantization improvements 2.md
similarity index 75%
rename from github-data/pull_requests/302 - Quantization improvements _2_.md
rename to github-data/pull_requests/302 - Quantization improvements 2.md
index fc0490862..7b1413584 100644
--- a/github-data/pull_requests/302 - Quantization improvements _2_.md
+++ b/github-data/pull_requests/302 - Quantization improvements 2.md
@@ -1,16 +1,19 @@
-### 🔀 [#302](https://github.com/ikawrakow/ik_llama.cpp/pull/302) - Quantization improvements (2)
+## 🔀 [Pull Request #302](https://github.com/ikawrakow/ik_llama.cpp/pull/302) - Quantization improvements (2)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iqk_q_improvements` |
+| **Target Branch** | `main` |
| **Created** | 2025-03-31 |
| **Updated** | 2025-04-02 |
+| **Merged** | 2025-04-01 |
---
-#### Description
+## 📄 Description
-This PR is a follow up of #295. It applies the same approach to type-1 quants (`Q2_K, Q4_K, Q5_K, Q4_1, Q5_1`) and to `IQ3_K`. Quantization speed for `IQ3_K` is improved by a significant margin (up to 40%). Quantization speed for type-1 quants is also slightly improved ($\le 15$%). The changes do not result in PPL improvement for all tested models, but do improve PPL for the models that are more difficult to quantize (e.g., the LLaMA-3 series of models), and avoid a near catastrophic failure of `IQ3_K` on DeepSeek-Lite.
+This PR is a follow up of [#295](https://github.com/ikawrakow/ik_llama.cpp/issues/295). It applies the same approach to type-1 quants (`Q2_K, Q4_K, Q5_K, Q4_1, Q5_1`) and to `IQ3_K`. Quantization speed for `IQ3_K` is improved by a significant margin (up to 40%). Quantization speed for type-1 quants is also slightly improved ($\le 15$%). The changes do not result in PPL improvement for all tested models, but do improve PPL for the models that are more difficult to quantize (e.g., the LLaMA-3 series of models), and avoid a near catastrophic failure of `IQ3_K` on DeepSeek-Lite.
The following table shows PPL comparisons between the main branch and this PR for LLaMA-v1-7B1(L1-7B in the table), LLaMA-v2-7B1 (L2-7B), Mistral-7B1 (M-7B), LLaMA-3.1-8B-Instruct (L3-8B), and DeepSeek-V2-Lite (DSL). Context is always 512 tokens. Also given are the quantization times (Q-time for short in the table) in seconds on a Ryzen-7950X CPU. Tested is "pure" quantization (i.e., using the `--pure` option of `llama-quantize`) with token embeddings and output tensor set to `Q8_0`. The quantization command line is
```
@@ -57,10 +60,10 @@ ___
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-04-02** at **10:55:25**:
+👤 **saood06** commented on **2025-04-02** at **10:55:25**
>and avoid a near catastrophic failure of IQ3_K on DeepSeek-Lite.
-Interestingly IQ3_K before this PR was actually worse than Q3_K before #295 for DSL.
\ No newline at end of file
+Interestingly IQ3_K before this PR was actually worse than Q3_K before [#295](https://github.com/ikawrakow/ik_llama.cpp/issues/295) for DSL.
\ No newline at end of file
diff --git a/github-data/pull_requests/303 - Fix ARM_NEON build failure due to q8_2.md b/github-data/pull_requests/303 - Fix ARM_NEON build failure due to q8_2.md
index 8aca64bf7..ef7ab59a6 100644
--- a/github-data/pull_requests/303 - Fix ARM_NEON build failure due to q8_2.md
+++ b/github-data/pull_requests/303 - Fix ARM_NEON build failure due to q8_2.md
@@ -1,13 +1,16 @@
-### 🐛 [#303](https://github.com/ikawrakow/ik_llama.cpp/pull/303) - Fix ARM_NEON build failure due to q8_2
+## 🔀 [Pull Request #303](https://github.com/ikawrakow/ik_llama.cpp/pull/303) - Fix ARM_NEON build failure due to q8_2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_neon_q82` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-01 |
| **Updated** | 2025-04-01 |
+| **Merged** | 2025-04-01 |
---
-#### Description
+## 📄 Description
-I meant to also do `ARM_NEON` before merging #292 and then I forgot. This PR fixes the build failure.
\ No newline at end of file
+I meant to also do `ARM_NEON` before merging [#292](https://github.com/ikawrakow/ik_llama.cpp/issues/292) and then I forgot. This PR fixes the build failure.
\ No newline at end of file
diff --git a/github-data/pull_requests/307 - Metal_ much faster MoE prompt processing.md b/github-data/pull_requests/307 - Metal much faster MoE prompt processing.md
similarity index 89%
rename from github-data/pull_requests/307 - Metal_ much faster MoE prompt processing.md
rename to github-data/pull_requests/307 - Metal much faster MoE prompt processing.md
index 724cc06d4..c0f713a21 100644
--- a/github-data/pull_requests/307 - Metal_ much faster MoE prompt processing.md
+++ b/github-data/pull_requests/307 - Metal much faster MoE prompt processing.md
@@ -1,14 +1,17 @@
-### 🔀 [#307](https://github.com/ikawrakow/ik_llama.cpp/pull/307) - Metal: much faster MoE prompt processing
+## 🔀 [Pull Request #307](https://github.com/ikawrakow/ik_llama.cpp/pull/307) - Metal: much faster MoE prompt processing
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/metal_moe` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-02 |
| **Updated** | 2025-04-03 |
+| **Merged** | 2025-04-03 |
---
-#### Description
+## 📄 Description
The prompt processing (PP) performance on Metal for MoE models with many experts (such as DeepSeek) is pathetic. Here, and also in mainline before the very recent [PR 12612](https://github.com/ggml-org/llama.cpp/pull/12612). This mainline PR brings PP performance to a more acceptable level by effectively using GEMV for matrix multiplications involving MoE tensors.
diff --git a/github-data/pull_requests/309 - Fix GCC compilation errors on ARM.md b/github-data/pull_requests/309 - Fix GCC compilation errors on ARM.md
index 3f35b5e0e..87eff5629 100644
--- a/github-data/pull_requests/309 - Fix GCC compilation errors on ARM.md
+++ b/github-data/pull_requests/309 - Fix GCC compilation errors on ARM.md
@@ -1,13 +1,16 @@
-### 🐛 [#309](https://github.com/ikawrakow/ik_llama.cpp/pull/309) - Fix GCC compilation errors on ARM
+## 🔀 [Pull Request #309](https://github.com/ikawrakow/ik_llama.cpp/pull/309) - Fix GCC compilation errors on ARM
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_gcc_arm` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-03 |
| **Updated** | 2025-04-03 |
+| **Merged** | 2025-04-03 |
---
-#### Description
+## 📄 Description
-Closes #308
\ No newline at end of file
+Closes [#308](https://github.com/ikawrakow/ik_llama.cpp/issues/308)
\ No newline at end of file
diff --git a/github-data/pull_requests/31 - Fix build when iqk_mul_mat is disabled.md b/github-data/pull_requests/31 - Fix build when iqk_mul_mat is disabled.md
index 3ade2ab7b..2dc37549f 100644
--- a/github-data/pull_requests/31 - Fix build when iqk_mul_mat is disabled.md
+++ b/github-data/pull_requests/31 - Fix build when iqk_mul_mat is disabled.md
@@ -1,13 +1,16 @@
-### 🐛 [#31](https://github.com/ikawrakow/ik_llama.cpp/pull/31) - Fix build when iqk_mul_mat is disabled
+## 🔀 [Pull Request #31](https://github.com/ikawrakow/ik_llama.cpp/pull/31) - Fix build when iqk_mul_mat is disabled
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_no_iqk_build` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-31 |
| **Updated** | 2024-08-31 |
+| **Merged** | 2024-08-31 |
---
-#### Description
+## 📄 Description
-Ref #29
\ No newline at end of file
+Ref [#29](https://github.com/ikawrakow/ik_llama.cpp/issues/29)
\ No newline at end of file
diff --git a/github-data/pull_requests/310 - Metal_ FA and FlashMLA.md b/github-data/pull_requests/310 - Metal FA and FlashMLA.md
similarity index 68%
rename from github-data/pull_requests/310 - Metal_ FA and FlashMLA.md
rename to github-data/pull_requests/310 - Metal FA and FlashMLA.md
index 22889e21f..7dbada061 100644
--- a/github-data/pull_requests/310 - Metal_ FA and FlashMLA.md
+++ b/github-data/pull_requests/310 - Metal FA and FlashMLA.md
@@ -1,14 +1,17 @@
-### 🔀 [#310](https://github.com/ikawrakow/ik_llama.cpp/pull/310) - Metal: FA and FlashMLA
+## 🔀 [Pull Request #310](https://github.com/ikawrakow/ik_llama.cpp/pull/310) - Metal: FA and FlashMLA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/metal_fattn_update` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-03 |
| **Updated** | 2025-04-03 |
+| **Merged** | 2025-04-03 |
---
-#### Description
+## 📄 Description
Performance is not great, but it works with standard attentions and all 3 MLA options.
diff --git a/github-data/pull_requests/311 - Add -flax-vector-conversions for GCC on ARM.md b/github-data/pull_requests/311 - Add -flax-vector-conversions for GCC on ARM.md
index 11a75dbb4..bef55da27 100644
--- a/github-data/pull_requests/311 - Add -flax-vector-conversions for GCC on ARM.md
+++ b/github-data/pull_requests/311 - Add -flax-vector-conversions for GCC on ARM.md
@@ -1,7 +1,16 @@
-### 🔀 [#311](https://github.com/ikawrakow/ik_llama.cpp/pull/311) - Add -flax-vector-conversions for GCC on ARM
+## 🔀 [Pull Request #311](https://github.com/ikawrakow/ik_llama.cpp/pull/311) - Add -flax-vector-conversions for GCC on ARM
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/flax-vector-conversions` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-04 |
-| **Updated** | 2025-04-04 |
\ No newline at end of file
+| **Updated** | 2025-04-04 |
+| **Merged** | 2025-04-04 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/312 - Improved IQ2_XS quantization.md b/github-data/pull_requests/312 - Improved IQ2_XS quantization.md
index 840a2fdf7..ba9929e0e 100644
--- a/github-data/pull_requests/312 - Improved IQ2_XS quantization.md
+++ b/github-data/pull_requests/312 - Improved IQ2_XS quantization.md
@@ -1,14 +1,17 @@
-### 🔀 [#312](https://github.com/ikawrakow/ik_llama.cpp/pull/312) - Improved IQ2_XS quantization
+## 🔀 [Pull Request #312](https://github.com/ikawrakow/ik_llama.cpp/pull/312) - Improved IQ2_XS quantization
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/improve_iq2_xs` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-05 |
| **Updated** | 2025-04-07 |
+| **Merged** | 2025-04-07 |
---
-#### Description
+## 📄 Description
The table shows PPL comparisons between the main branch and this PR for LLaMA-v1-7B1(L1-7B in the table), LLaMA-v2-7B1 (L2-7B), Mistral-7B1 (M-7B), LLaMA-3.1-8B-Instruct (L3-8B), and DeepSeek-V2-Lite (DSL). Context is always 512 tokens. Also given are the quantization times (Q-time for short in the table) in seconds on a Ryzen-7950X CPU. Tested is "pure" quantization (i.e., using the `--pure` option of `llama-quantize`) with token embeddings and output tensor set to `Q8_0`. The quantization command line is
```
diff --git a/github-data/pull_requests/313 - We need to synchronize before using device to host async memcpy.md b/github-data/pull_requests/313 - We need to synchronize before using device to host async memcpy.md
index 50656d5e2..bd5bed251 100644
--- a/github-data/pull_requests/313 - We need to synchronize before using device to host async memcpy.md
+++ b/github-data/pull_requests/313 - We need to synchronize before using device to host async memcpy.md
@@ -1,13 +1,16 @@
-### 🔀 [#313](https://github.com/ikawrakow/ik_llama.cpp/pull/313) - We need to synchronize before using device to host async memcpy
+## 🔀 [Pull Request #313](https://github.com/ikawrakow/ik_llama.cpp/pull/313) - We need to synchronize before using device to host async memcpy
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_cuda_memcpy_async` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-05 |
| **Updated** | 2025-04-05 |
+| **Merged** | 2025-04-05 |
---
-#### Description
+## 📄 Description
Thanks to @JohannesGaessler for noticing.
\ No newline at end of file
diff --git a/github-data/pull_requests/315 - Try not repacking q8_0 for FA computations.md b/github-data/pull_requests/315 - Try not repacking q8_0 for FA computations.md
index 90fec56e9..605aedc64 100644
--- a/github-data/pull_requests/315 - Try not repacking q8_0 for FA computations.md
+++ b/github-data/pull_requests/315 - Try not repacking q8_0 for FA computations.md
@@ -1,14 +1,16 @@
-### 🔀 [#315](https://github.com/ikawrakow/ik_llama.cpp/pull/315) - Try not repacking q8_0 for FA computations
+## 🔀 [Pull Request #315](https://github.com/ikawrakow/ik_llama.cpp/pull/315) - Try not repacking q8_0 for FA computations
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/try_fa_no_q80_repack` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-06 |
| **Updated** | 2025-05-04 |
---
-#### Description
+## 📄 Description
On the master branch if the K-cache is `Q8_0` it is repacked to `Q8_0_R8` before performing the Flash Attention computation. This is only done for PP (number of tokens in the batch $\ge$ 8), and tends to improve PP performance when the K-cache size is not too large. But for large K-cache, performance may suffer due to the additional allocation of a fairly significant amount of memory.
@@ -25,9 +27,9 @@ Another interesting observation is that there is no difference between offline a
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-04-06** at **15:43:02**:
+👤 **ubergarm** commented on **2025-04-06** at **15:43:02**
Picking up the conversation from [296](https://github.com/ikawrakow/ik_llama.cpp/issues/296#issuecomment-2781293572), I've run a comparison with the only variable being this PR (no repacking q8_0 for kcache).
@@ -353,7 +355,7 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
---
-👤 **ikawrakow** commented the **2025-04-06** at **16:30:25**:
+👤 **ikawrakow** commented on **2025-04-06** at **16:30:25**
Thank you for this.
@@ -363,7 +365,7 @@ It is hard to make progress without me being able to experiment on the actual bi
---
-👤 **ubergarm** commented the **2025-04-06** at **17:53:17**:
+👤 **ubergarm** commented on **2025-04-06** at **17:53:17**
> Thank you for this.
>
@@ -376,6 +378,8 @@ Yeah, I ran a couple more tests against `main@ec84855c` (not this PR) reducing t
#### tg

+
+*EDIT*: I also tried 64 threads for tg which seems about the same before 8k, then after 8k it is very slightly faster on average than the others, albeit with no peaks present. Then it is the within noise of the others at exactly 32k.
> If you are renting, where did you rent it?
@@ -396,6 +400,6 @@ Thanks!
---
-👤 **ikawrakow** commented the **2025-05-04** at **06:18:51**:
+👤 **ikawrakow** commented on **2025-05-04** at **06:18:51**
Doesn't look like it is useful, closing.
\ No newline at end of file
diff --git a/github-data/pull_requests/317 - Add copyright notices.md b/github-data/pull_requests/317 - Add copyright notices.md
index 712347916..33269abaa 100644
--- a/github-data/pull_requests/317 - Add copyright notices.md
+++ b/github-data/pull_requests/317 - Add copyright notices.md
@@ -1,14 +1,17 @@
-### 🔀 [#317](https://github.com/ikawrakow/ik_llama.cpp/pull/317) - Add copyright notices
+## 🔀 [Pull Request #317](https://github.com/ikawrakow/ik_llama.cpp/pull/317) - Add copyright notices
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/copyright` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-07 |
| **Updated** | 2025-04-07 |
+| **Merged** | 2025-04-07 |
---
-#### Description
+## 📄 Description
Explicitly added only to files where I have done non-trivial changes since the last merge of mainline on August 12 2024.
diff --git a/github-data/pull_requests/318 - Use links for ggml_llama.cpp authors.md b/github-data/pull_requests/318 - Use links for ggmlllama.cpp authors.md
similarity index 51%
rename from github-data/pull_requests/318 - Use links for ggml_llama.cpp authors.md
rename to github-data/pull_requests/318 - Use links for ggmlllama.cpp authors.md
index 0efbac0a0..5b54a4bde 100644
--- a/github-data/pull_requests/318 - Use links for ggml_llama.cpp authors.md
+++ b/github-data/pull_requests/318 - Use links for ggmlllama.cpp authors.md
@@ -1,22 +1,25 @@
-### 🔀 [#318](https://github.com/ikawrakow/ik_llama.cpp/pull/318) - Use links for ggml/llama.cpp authors
+## 🔀 [Pull Request #318](https://github.com/ikawrakow/ik_llama.cpp/pull/318) - Use links for ggml/llama.cpp authors
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/update_license` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-07 |
| **Updated** | 2025-04-07 |
+| **Merged** | 2025-04-07 |
---
-#### Description
+## 📄 Description
and also remove the local AUTHORS copy as suggested by @saood06
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** submitted a review the **2025-04-07** at **14:01:18**: 💬 `COMMENTED`
+👤 **saood06** reviewed this pull request 💬 on **2025-04-07** at **14:01:18**
You mentioned this to me in the other PR.
@@ -26,12 +29,12 @@ Would adding a line to the license referencing ik_llama.cpp authors and having t
---
-👤 **ikawrakow** commented the **2025-04-07** at **15:00:34**:
+👤 **ikawrakow** commented on **2025-04-07** at **15:00:34**
Like this?
---
-👤 **saood06** submitted a review the **2025-04-07** at **15:10:35**: ✅ `APPROVED`
+👤 **saood06** approved this pull request ✅ on **2025-04-07** at **15:10:35**
LGTM
\ No newline at end of file
diff --git a/github-data/pull_requests/32 - Zen4 Flash Attention.md b/github-data/pull_requests/32 - Zen4 Flash Attention.md
index 133520b04..86fed0db4 100644
--- a/github-data/pull_requests/32 - Zen4 Flash Attention.md
+++ b/github-data/pull_requests/32 - Zen4 Flash Attention.md
@@ -1,18 +1,21 @@
-### 🔀 [#32](https://github.com/ikawrakow/ik_llama.cpp/pull/32) - Zen4 Flash Attention
+## 🔀 [Pull Request #32](https://github.com/ikawrakow/ik_llama.cpp/pull/32) - Zen4 Flash Attention
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/zen4_flash_attn` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-01 |
| **Updated** | 2024-09-01 |
+| **Merged** | 2024-09-01 |
---
-#### Description
+## 📄 Description
### TL;DR
-This PR adds a flash attention (FA) implementation optimized for the Zen4 architecture as part of the quest to improve CPU inference for long contexts (#25, #26).
+This PR adds a flash attention (FA) implementation optimized for the Zen4 architecture as part of the quest to improve CPU inference for long contexts ([#25](https://github.com/ikawrakow/ik_llama.cpp/issues/25), [#26](https://github.com/ikawrakow/ik_llama.cpp/issues/26)).
### Limitations
diff --git a/github-data/pull_requests/320 - Guard against attempts to use MLA for non-MLA models.md b/github-data/pull_requests/320 - Guard against attempts to use MLA for non-MLA models.md
index 21b860158..f45334d76 100644
--- a/github-data/pull_requests/320 - Guard against attempts to use MLA for non-MLA models.md
+++ b/github-data/pull_requests/320 - Guard against attempts to use MLA for non-MLA models.md
@@ -1,13 +1,16 @@
-### 🔀 [#320](https://github.com/ikawrakow/ik_llama.cpp/pull/320) - Guard against attempts to use MLA for non-MLA models
+## 🔀 [Pull Request #320](https://github.com/ikawrakow/ik_llama.cpp/pull/320) - Guard against attempts to use MLA for non-MLA models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mla_guard` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-08 |
| **Updated** | 2025-04-08 |
+| **Merged** | 2025-04-08 |
---
-#### Description
+## 📄 Description
So we don't crash when someone uses `-mla` with non-MLA models.
\ No newline at end of file
diff --git a/github-data/pull_requests/321 - LlaMA-4 support _text only_.md b/github-data/pull_requests/321 - LlaMA-4 support text only.md
similarity index 75%
rename from github-data/pull_requests/321 - LlaMA-4 support _text only_.md
rename to github-data/pull_requests/321 - LlaMA-4 support text only.md
index 0f7d7e249..1a58a68f9 100644
--- a/github-data/pull_requests/321 - LlaMA-4 support _text only_.md
+++ b/github-data/pull_requests/321 - LlaMA-4 support text only.md
@@ -1,14 +1,17 @@
-### 🔀 [#321](https://github.com/ikawrakow/ik_llama.cpp/pull/321) - LlaMA-4 support (text only)
+## 🔀 [Pull Request #321](https://github.com/ikawrakow/ik_llama.cpp/pull/321) - LlaMA-4 support (text only)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/llama4` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-09 |
| **Updated** | 2025-04-11 |
+| **Merged** | 2025-04-10 |
---
-#### Description
+## 📄 Description
It seems the initial reactions to LlaMA-4 are mostly negative. Nevertheless, quantized LlaMA-Scout is something I can run on one of my systems, so here it is.
@@ -28,13 +31,13 @@ As mentioned in [PR 12791](https://github.com/ggml-org/llama.cpp/pull/12791), th
There are 2 R's in the word "strawberry".
```
-Closes #314
+Closes [#314](https://github.com/ikawrakow/ik_llama.cpp/issues/314)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-09** at **15:02:02**:
+👤 **ikawrakow** commented on **2025-04-09** at **15:02:02**
So, using a single active expert as prescribed by the model parameters, I get
```
@@ -49,7 +52,27 @@ It is of course slower (133 t/s vs 211 t/s with the setup described above), but
---
-👤 **ikawrakow** commented the **2025-04-10** at **05:59:25**:
+👤 **saood06** commented on **2025-04-10** at **03:37:51**
+
+> So, using a single active expert as prescribed by the model parameters, I get
+>
+> ```
+> PPL(Q8_0, n_ctx = 512) = 9.0644
+> ```
+>
+> Activating 2 experts using `--override-kv "llama4.expert_used_count=int:2"` I get
+>
+> ```
+> PPL(Q8_0, n_ctx = 512) = 8.7030
+> ```
+>
+> It is of course slower (133 t/s vs 211 t/s with the setup described above), but it is kind of strange that 2 experts produce a lower PPL. This wasn't the case for Mixtral8x7B where 3 experts were worse than 2 (unless one was using a very low bpw quantization).
+
+Have you tried even higher numbers? Does it peak at 2 experts?
+
+---
+
+👤 **ikawrakow** commented on **2025-04-10** at **05:59:25**
Here some quantization experiments with LlaMA-4-Scout
@@ -66,7 +89,15 @@ ___
---
-👤 **saood06** commented the **2025-04-10** at **06:13:30**:
+👤 **ikawrakow** commented on **2025-04-10** at **06:05:45**
+
+> Have you tried even higher numbers? Does it peak at 2 experts?
+
+Not yet. I'm doing some quantization experiments, and things take some time on the hardware I have available. For 3 experts with `Q8_0` the PPL calculation will be more than an hour.
+
+---
+
+👤 **saood06** commented on **2025-04-10** at **06:13:30**
> Strangely enough, replacing `q4_K` with `iq4_K` in the attention tensors leads to higher PPL
@@ -74,7 +105,7 @@ Do you think this could affect other architectures?
---
-👤 **ikawrakow** commented the **2025-04-10** at **06:18:31**:
+👤 **ikawrakow** commented on **2025-04-10** at **06:18:31**
> Do you think this could affect other architectures?
@@ -82,13 +113,13 @@ I have noticed in the past that `iq4_k/iq5_k/iq6_k` for the attention tensors do
---
-👤 **ikawrakow** commented the **2025-04-10** at **06:20:02**:
+👤 **ikawrakow** commented on **2025-04-10** at **06:20:02**
Oh, for token embeddings I had a few cases where it was better to use the corresponding k-quant instead of the `iqk` quant.
---
-👤 **saood06** commented the **2025-04-10** at **06:46:32**:
+👤 **saood06** commented on **2025-04-10** at **06:46:32**
> I have noticed in the past that `iq4_k/iq5_k/iq6_k` for the attention tensors does not have a clear advantage compared to `q4_K/q5_K/q6_K`. They are much better for the FFN portion and that's where the quality gains come from. But this is the first time when it became worse. So, in your case, if you are looking to optimize performance (and have time/energy to experiment), you can try replacing `iq4_k` with `q4_K` in the attention tensors as this will improve inference speed.
@@ -98,21 +129,21 @@ Interesting to hear. I will take all this into account next time I make quants.
---
-👤 **ikawrakow** commented the **2025-04-10** at **06:57:24**:
+👤 **ikawrakow** commented on **2025-04-10** at **06:57:24**
> Have you tried even higher numbers? Does it peak at 2 experts?
-Just tried. Did not run `Wikitext2` to completion, but after 172 chunks PPL is 0.1 higher than 2 experts, so it is very unlikely it will be better at the end. Still better than a single expert, but 2 experts seems to be the sweet spot (at the expense of a hit of performance).
+Just tried. Did not run `Wikitext2` to completion, but after 172 chunks PPL with 3 experts is 0.1 higher than 2 experts, so it is very unlikely it will be better at the end. Still better than a single expert, but 2 experts seems to be the sweet spot (at the expense of a hit in performance).
---
-👤 **ikawrakow** commented the **2025-04-10** at **07:05:15**:
+👤 **ikawrakow** commented on **2025-04-10** at **07:05:15**
This seems solid enough, merging it.
---
-👤 **saood06** commented the **2025-04-10** at **08:20:34**:
+👤 **saood06** commented on **2025-04-10** at **08:20:34**
> Just tried. Did not run `Wikitext2` to completion, but after 172 chunks PPL with 3 experts is 0.1 higher than 2 experts, so it is very unlikely it will be better at the end. Still better than a single expert, but 2 experts seems to be the sweet spot (at the expense of a hit in performance).
@@ -120,7 +151,7 @@ If I ever try Maverick will see if it is replicable there.
---
-👤 **ikawrakow** commented the **2025-04-10** at **15:11:51**:
+👤 **ikawrakow** commented on **2025-04-10** at **15:11:51**
So, L4-Scout seems to quantize pretty well.
@@ -149,7 +180,7 @@ So, L4-Scout seems to quantize pretty well.
* Model size: 34.871 GiB vs theirs 35.904 GiB
* Recipe:
```
-./bin/llama-quantize --imatrix l4_scout_imat_512.out --custom-q "ffn_gate_shexp=iq4_ks,ffn_up_shexp=iq4_ks,ffn_down_shexp=iq5_k,attn=iq4_ks,token_embd.weight=q4_K,output.weight=q6_K,blk\.[0-5]\.ffn_down_exps=iq4_ks,ffn_down_exps=q3_K,ffn_up_exps=iq1_s,ffn_gate_exps=iq1_s" ../../iquants/models/l4_109B/Llama4-Scout-16x17B-BF16.gguf junk1.bin iq1_s
+./bin/llama-quantize --imatrix l4_scout_imat_512.out --custom-q "ffn_gate_shexp=iq4_ks,ffn_up_shexp=iq4_ks,ffn_down_shexp=iq5_k,attn=iq4_ks,token_embd.weight=q4_K,output.weight=q6_K,blk\.[0-5]\.ffn_down_exps=iq4_ks,ffn_down_exps=q3_K,ffn_up_exps=iq2_xxs,ffn_gate_exps=iq2_xxs" ../../iquants/models/l4_109B/Llama4-Scout-16x17B-BF16.gguf junk1.bin iq2_xxs
```
### Beating Unsloth's UD-IQ1_S
@@ -163,7 +194,7 @@ So, L4-Scout seems to quantize pretty well.
---
-👤 **ikawrakow** commented the **2025-04-11** at **16:01:10**:
+👤 **ikawrakow** commented on **2025-04-11** at **16:01:10**
Here another recipe for `iq3_xxs`:
```
diff --git a/github-data/pull_requests/324 - Correct L4 rms_norm.md b/github-data/pull_requests/324 - Correct L4 rms_norm.md
index 84ba4cdcd..945c2b278 100644
--- a/github-data/pull_requests/324 - Correct L4 rms_norm.md
+++ b/github-data/pull_requests/324 - Correct L4 rms_norm.md
@@ -1,13 +1,16 @@
-### 🔀 [#324](https://github.com/ikawrakow/ik_llama.cpp/pull/324) - Correct L4 rms_norm
+## 🔀 [Pull Request #324](https://github.com/ikawrakow/ik_llama.cpp/pull/324) - Correct L4 rms_norm
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/l4_rms_norm` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-11 |
| **Updated** | 2025-04-11 |
+| **Merged** | 2025-04-11 |
---
-#### Description
+## 📄 Description
I was wondering about the hard-coded `1e-6` when porting the mainline PR, but left it the way it is. Mainline has now [corrected it](https://github.com/ggml-org/llama.cpp/pull/12882), so let's do that here as well.
\ No newline at end of file
diff --git a/github-data/pull_requests/325 - Fix KLD precision.md b/github-data/pull_requests/325 - Fix KLD precision.md
index 1ae8b1858..e3e4816c7 100644
--- a/github-data/pull_requests/325 - Fix KLD precision.md
+++ b/github-data/pull_requests/325 - Fix KLD precision.md
@@ -1,14 +1,17 @@
-### 🐛 [#325](https://github.com/ikawrakow/ik_llama.cpp/pull/325) - Fix KLD precision
+## 🔀 [Pull Request #325](https://github.com/ikawrakow/ik_llama.cpp/pull/325) - Fix KLD precision
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_kld` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-12 |
| **Updated** | 2025-04-13 |
+| **Merged** | 2025-04-12 |
---
-#### Description
+## 📄 Description
Some people insist that perplexity tells us nothing, and that [Kullback-Leibler Divergence](https://en.wikipedia.org/wiki/Kullback–Leibler_divergence) (KLD), along with the other statistics computed by `llama-perplexity` with the `--kl-divergence` option, are the one and only one true measure of quantization accuracy. Computing KLD requires 1st running the `llama-perplexity` tool with `--kl-divergence-base` to compute the logits of the base model, which are then used to compute KLD and other token probability statistics in a subsequent run with a quantized (or otherwise approximate) model. The base model logits file is quite large as it stores the log-probabilities for each evaluated token for all tokens in the vocabulary. Hence, when I added KLD capabilities to `llama.cpp` with [this](https://github.com/ggml-org/llama.cpp/pull/5076) and [this](https://github.com/ggml-org/llama.cpp/pull/5081) PRs, I used 16-bit precision to store the logits of the base model, setting the minimum logit to `std::max(min_logit, max_logit - 16). That was adequate for the models available at the time.
@@ -18,9 +21,9 @@ A lot of talk for this one-liner PR, which fixes the problem.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-04-13** at **15:20:53**:
+👤 **ubergarm** commented on **2025-04-13** at **15:20:53**
> I was concerned that other statistics will be influenced as well, but it looks like it is only PPL that becomes wrong.
@@ -35,12 +38,12 @@ Thanks!
---
-👤 **ikawrakow** commented the **2025-04-13** at **15:35:00**:
+👤 **ikawrakow** commented on **2025-04-13** at **15:35:00**
The PR does not affect `imatrix`. It affects `llama-perplexity` when run with `--kl-divergence-base X --kl-divergence`. This computes KL-Divergence and various other token probability statistics between the current model and the token probabilities for the base model stored in `X` and computed in a previous run of `llama-perplexity`.
---
-👤 **ikawrakow** commented the **2025-04-13** at **15:38:16**:
+👤 **ikawrakow** commented on **2025-04-13** at **15:38:16**
Also, I don't know how it affects other models. But for LLaMA-4-Scout I observed a nearly 1% difference without this PR.
\ No newline at end of file
diff --git a/github-data/pull_requests/326 - WIP Compute per layer LIM Scores during imatrix.md b/github-data/pull_requests/326 - WIP Compute per layer LIM Scores during imatrix.md
index ff9b897c6..c366d0f99 100644
--- a/github-data/pull_requests/326 - WIP Compute per layer LIM Scores during imatrix.md
+++ b/github-data/pull_requests/326 - WIP Compute per layer LIM Scores during imatrix.md
@@ -1,14 +1,16 @@
-### 🔀 [#326](https://github.com/ikawrakow/ik_llama.cpp/pull/326) - WIP Compute per layer LIM Scores during imatrix
+## 🔀 [Pull Request #326](https://github.com/ikawrakow/ik_llama.cpp/pull/326) - WIP Compute per layer LIM Scores during imatrix
| **Author** | `ubergarm` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `ug/compute-layer-input-mod-score` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-13 |
| **Updated** | 2025-04-16 |
---
-#### Description
+## 📄 Description
*WARNING*: This is mostly vibe code. Hope I'm not wasting y'alls time.
@@ -1048,9 +1050,9 @@ Layer LIM Score
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-13** at **06:30:24**:
+👤 **ikawrakow** commented on **2025-04-13** at **06:30:24**
Do I understand the results in the quoted PR correctly? The `ffn_down` tensors are the least important? This would be really funny, because everybody knows that quantization errors in `ffn_down` have the highest impact on observed quantization quality.
@@ -1058,11 +1060,13 @@ I didn't go to read the blog post, but why would cosine similarity between the i
---
-👤 **ikawrakow** submitted a review the **2025-04-13** at **07:05:04**: 💬 `COMMENTED`
+👤 **ikawrakow** started a conversation on `examples/imatrix/imatrix.cpp` on **2025-04-13** at **07:05:04**
+
+So, `activations` gets overwritten each time we get called with a new set of activations. It also gets overwritten as we go over the rows of the activation matrix. At the end of the run, the `compute_lim()` function gets called. Which means that we get the LIM computed with just the very last token processed in the `imatrix` run, not an actual statistical evaluation of cosine similarities between inputs to tensors of the same type in subsequent layers.
---
-👤 **ubergarm** commented the **2025-04-13** at **15:58:29**:
+👤 **ubergarm** commented on **2025-04-13** at **15:58:29**
> Do I understand the results in the quoted PR correctly? The `ffn_down` tensors are the least important? This would be really funny, because everybody knows that quantization errors in `ffn_down` have the highest impact on observed quantization quality.
@@ -1086,7 +1090,7 @@ Really appreciate your time, thanks!
---
-👤 **ikawrakow** commented the **2025-04-13** at **16:29:52**:
+👤 **ikawrakow** commented on **2025-04-13** at **16:29:52**
> The paper that suggests using cosine similarity says:
>
@@ -1096,7 +1100,7 @@ Sure. But the activations did not change due to that tensor only, they changed d
---
-👤 **compilade** commented the **2025-04-13** at **17:58:43**:
+👤 **compilade** commented on **2025-04-13** at **17:58:43**
I agree with @ikawrakow, comparing across layers for a particular tensor seems like it would have non-intuitive results which might not necessarily be linked to relative importance of the tensors.
@@ -1104,12 +1108,12 @@ I think what is calculated here is the cosine similarity across the *inputs* of
> llama-imatrix technically has access to both the input and output activations of a layer, but only uses its input.
-@ubergarm What I meant by this was to calculate LIM scores with the input and output ***within*** each linear operations (i.e. what `llama-imatrix` already considers). The output would be from `t->data` while the input would still be from `src1->data`.
+@ubergarm What I meant by this was to calculate LIM scores with the input and output ***within*** each linear operations (i.e. what `llama-imatrix` already considers). The output would be from `t->data` while the input would still be from `src1->data`. (note that the section with `if (ask)` for the callback needs to require the output data in this case, but I think this is already done by default?)
Each layer should be independent in this approach. I don't know what they used (in the paper) to combine the results across multiple tokens, though. Likely the average, but I'm not sure.
---
-👤 **ikawrakow** commented the **2025-04-14** at **07:26:42**:
+👤 **ikawrakow** commented on **2025-04-14** at **07:26:42**
@compilade
@@ -1119,7 +1123,7 @@ I have used this to derive corrections for a quantized model (have not published
---
-👤 **compilade** commented the **2025-04-15** at **22:13:03**:
+👤 **compilade** commented on **2025-04-15** at **22:13:03**
> Can you be more specific how you want to calculate the impact of a linear operation from the input activations and the result of the linear operation?
@@ -1129,23 +1133,23 @@ I was thinking of directly calculating a dot product between the input and outpu
---
-👤 **ubergarm** commented the **2025-04-16** at **15:06:47**:
+👤 **ubergarm** commented on **2025-04-16** at **15:06:47**
-Closing this in favor of implementation in PR#328.
+Closing this in favor of implementation in PR[#328](https://github.com/ikawrakow/ik_llama.cpp/issues/328).
## Experiment
-Still more experimentation to do, and sorry no visual graphs as I'm away from my desk, but did a quick A/B test comparing two `V3-0324` quants which have the same final size but vary only in which routed expert layers receive more or less quantization. For this discussion I'll refer to the baseline case of giving the first 17 routed expert layers more bpw as `FIRST-N` approach vs using the results of layer importance from PR#328 `COSSIM` to decide which 17 routed expert layers should receive more bpw.
+Still more experimentation to do, and sorry no visual graphs as I'm away from my desk, but did a quick A/B test comparing two `V3-0324` quants which have the same final size but vary only in which routed expert layers receive more or less quantization. For this discussion I'll refer to the baseline case of giving the first 17 routed expert layers more bpw as `FIRST-N` approach vs using the results of layer importance from PR[#328](https://github.com/ikawrakow/ik_llama.cpp/issues/328) `COSSIM` to decide which 17 routed expert layers should receive more bpw.
-Finally, I provide the `--show-statistics` of the computed imatrix used for these quantizations from [@EAddario's mainline llama.cpp PR#12718](https://github.com/ggml-org/llama.cpp/pull/12718) if anyone wants to compare the numbers themselves. (I haven't had a chance to compare myself yet).
+Finally, I provide the `--show-statistics` of the computed imatrix used for these quantizations from [@EAddario's mainline llama.cpp PR[#12718](https://github.com/ikawrakow/ik_llama.cpp/issues/12718)](https://github.com/ggml-org/llama.cpp/pull/12718) if anyone wants to compare the numbers themselves. (I haven't had a chance to compare myself yet).
## tl;dr;
-Using PR#328 `llama-imatrix --layer-similarity [-lsim]` to decide which layers to prioritize quantization showed slightly better perplexity score than naively using the first 17 layers in a single experiment on `V3-0324`.
+Using PR[#328](https://github.com/ikawrakow/ik_llama.cpp/issues/328) `llama-imatrix --layer-similarity [-lsim]` to decide which layers to prioritize quantization showed slightly better perplexity score than naively using the first 17 layers in a single experiment on `V3-0324`.
* `FIRST-N` Final estimate: PPL = 3.3193 +/- 0.01830
* `COSSIM` Final estimate: PPL = 3.3151 +/- 0.0182
-While it is within the noise, there may be room for further improvement applying the scores to attention layer quantization as well which I didn't do for this experiment.
+While it is within the noise, there may be room for further improvement applying the scores to attention tensors quantization as well which I didn't do for this experiment. In retrospect, I probably should have used the layer importance scores from `sorted ffn importances`given those were the layers I was targeting. Move fast break things lol.
## Procedure
diff --git a/github-data/pull_requests/327 - Improved IQ1_M quantization.md b/github-data/pull_requests/327 - Improved IQ1_M quantization.md
index 542d30a84..c1d5a52ab 100644
--- a/github-data/pull_requests/327 - Improved IQ1_M quantization.md
+++ b/github-data/pull_requests/327 - Improved IQ1_M quantization.md
@@ -1,14 +1,17 @@
-### 🔀 [#327](https://github.com/ikawrakow/ik_llama.cpp/pull/327) - Improved IQ1_M quantization
+## 🔀 [Pull Request #327](https://github.com/ikawrakow/ik_llama.cpp/pull/327) - Improved IQ1_M quantization
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/improve_iq1m` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-13 |
| **Updated** | 2025-04-13 |
+| **Merged** | 2025-04-13 |
---
-#### Description
+## 📄 Description
I was experimenting with LlaMA-4-Scout quantization and was bothered by the extremely long quantization time of `IQ1_M`, so looked into speeding things up.
diff --git a/github-data/pull_requests/328 - imatrix_ collect layer influence statistics.md b/github-data/pull_requests/328 - imatrix collect layer influence statistics.md
similarity index 82%
rename from github-data/pull_requests/328 - imatrix_ collect layer influence statistics.md
rename to github-data/pull_requests/328 - imatrix collect layer influence statistics.md
index 2582a9ed3..d7d8e6c80 100644
--- a/github-data/pull_requests/328 - imatrix_ collect layer influence statistics.md
+++ b/github-data/pull_requests/328 - imatrix collect layer influence statistics.md
@@ -1,14 +1,17 @@
-### 🔀 [#328](https://github.com/ikawrakow/ik_llama.cpp/pull/328) - imatrix: collect layer influence statistics
+## 🔀 [Pull Request #328](https://github.com/ikawrakow/ik_llama.cpp/pull/328) - imatrix: collect layer influence statistics
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/imatrix_lsim` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-14 |
| **Updated** | 2025-04-14 |
+| **Merged** | 2025-04-14 |
---
-#### Description
+## 📄 Description
@ubergarm
@@ -16,27 +19,104 @@ Here is how one can collect statistics about the activations change caused by a
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-04-14** at **14:39:20**:
+👤 **ubergarm** commented on **2025-04-14** at **14:39:20**
Holy smokes, amazing! I'm out for a couple nights, but going to pull this and try quick before leaving the house haha... Thanks!
---
-👤 **ikawrakow** commented the **2025-04-14** at **16:02:02**:
+👤 **ubergarm** commented on **2025-04-14** at **15:55:34**
+
+Oooh yeah, just got it to work in quick test! Took me a sec to figure it out given I was using CUDA which messes up the logic for block names I believe here:
+
+```
+ std::optional layer_index(const std::string& name) const {
+ printf("name=%s, m_params.output_tensor_name=%s\n", name.c_str(), m_params.output_tensor_name.c_str());
+ if (name == m_params.output_tensor_name && m_last_layer < 199) {
+ return m_last_layer + 1;
+ }
+```
+
+## Running on single CUDA GPU
+```
+compute_imatrix: tokenizing the input ..
+compute_imatrix: tokenization took 1.736 ms
+compute_imatrix: computing over 5 chunks with batch_size 512
+name=CUDA0#blk.0.attn_q.weight#0, m_params.output_tensor_name=ffn_down.weight
+name=CUDA0#blk.0.attn_k.weight#0, m_params.output_tensor_name=ffn_down.weight
+name=CUDA0#blk.0.attn_v.weight#0, m_params.output_tensor_name=ffn_down.weight
+name=CUDA0#blk.0.attn_output.weight#0, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.ffn_gate.weight, m_params.output_tensor_name=ffn_down.weight
+name=CUDA0#blk.0.ffn_gate.weight#0, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.ffn_up.weight, m_params.output_tensor_name=ffn_down.weight
+name=CUDA0#blk.0.ffn_up.weight#0, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.ffn_down.weight, m_params.output_tensor_name=ffn_down.weight
+name=CUDA0#blk.0.ffn_down.weight#0, m_params.output_tensor_name=ffn_down.weight
+name=CUDA0#blk.1.attn_q.weight#0, m_params.output_tensor_name=ffn_down.weight
+```
+
+## Running on CPU only compiled
+```
+compute_imatrix: tokenizing the input ..
+compute_imatrix: tokenization took 1.843 ms
+compute_imatrix: computing over 5 chunks with batch_size 512
+name=blk.0.attn_q.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.attn_k.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.attn_v.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.attn_output.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.ffn_gate.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.ffn_gate.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.ffn_up.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.ffn_up.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.ffn_down.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.0.ffn_down.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.1.attn_q.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.1.attn_k.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.1.attn_v.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.1.attn_output.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.1.ffn_gate.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.1.ffn_gate.weight, m_params.output_tensor_name=ffn_down.weight
+name=blk.1.ffn_up.weight, m_params.output_tensor_name=ffn_down.weight
+```
+
+Adding a simple function to strip the names between the `#` seems to fix it for CUDA.
+
+```cpp
+std::string extractBetweenHashes(const std::string& name) {
+ size_t first_hash = name.find('#');
+ if (first_hash == std::string::npos) {
+ return name; // No first '#', return original
+ }
+
+ size_t second_hash = name.find('#', first_hash + 1);
+ if (second_hash == std::string::npos) {
+ return name; // No second '#', return original
+ }
+
+ // Extract between the two '#' characters
+ return name.substr(first_hash + 1, second_hash - first_hash - 1);
+}
+```
+
+Gonna let it run on llama-2-13b then print up a quick graph, i'm running late but wanna see this lol... thanks!
+
+---
+
+👤 **ikawrakow** commented on **2025-04-14** at **16:02:02**
Does the last commit fix it? I had forgotten about having to strip the tensor name (and for whatever reason I didn't have the issue even though running on CUDA).
---
-👤 **ubergarm** commented the **2025-04-14** at **16:10:14**:
+👤 **ubergarm** commented on **2025-04-14** at **16:10:14**
Yep, that did the trick! Thanks! I have a chart I just graphed, will put it here with logs before heading out.
---
-👤 **ikawrakow** commented the **2025-04-14** at **16:13:51**:
+👤 **ikawrakow** commented on **2025-04-14** at **16:13:51**
Using this on LLaMA-4-Scout, I get this as the layers sorted by importance (most important first):
```
@@ -100,7 +180,7 @@ It arrived at a `PPL = 9.7545`, so nearly on par with Unsloth's `UD-Q2_K_XL`, de
---
-👤 **ubergarm** commented the **2025-04-14** at **16:28:24**:
+👤 **ubergarm** commented on **2025-04-14** at **16:28:24**
> (but using layer 47 instead of layer 4, which according to the metric would be the right thing to do, results in a worse outcome)
@@ -108,6 +188,8 @@ Very interesting. Yeah, I'm curious how much the input text for imatrix effects
I did a quick run with `llama-2-13b-chat.Q8_0.gguf` and plotted the results to compare against that [Layer-wise Quantization](https://arxiv.org/pdf/2406.17415) paper which suggests for this model the three most important layers would be 1, 2, and 40 while the least important would be 32, 33, and 34. Though I'm not sure how they got that final layer 40 cosine similarity.
+Your results seem to correlate with theirs in this limited test.
+
Results Graph and Log of modified llama-imatrix -lsim
diff --git a/github-data/pull_requests/329 - Add ability to hide imatrix details in llama-quantize.md b/github-data/pull_requests/329 - Add ability to hide imatrix details in llama-quantize.md
index 405db6cba..6386118ba 100644
--- a/github-data/pull_requests/329 - Add ability to hide imatrix details in llama-quantize.md
+++ b/github-data/pull_requests/329 - Add ability to hide imatrix details in llama-quantize.md
@@ -1,14 +1,17 @@
-### 🔀 [#329](https://github.com/ikawrakow/ik_llama.cpp/pull/329) - Add ability to hide imatrix details in llama-quantize
+## 🔀 [Pull Request #329](https://github.com/ikawrakow/ik_llama.cpp/pull/329) - Add ability to hide imatrix details in llama-quantize
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/hide_imatrix` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-14 |
| **Updated** | 2025-04-14 |
+| **Merged** | 2025-04-14 |
---
-#### Description
+## 📄 Description
Simply add `--hide-imatrix` to the command line when quantizing. This will store "top_secret" in the imatrix data file name and calibration dataset fields, and zeros in the batch size and number of chunks used to compute the imatrix. Example:
```
diff --git a/github-data/pull_requests/33 - Do not process prompts containing binary data for escapes.md b/github-data/pull_requests/33 - Do not process prompts containing binary data for escapes.md
index ce1cfb591..8485d0045 100644
--- a/github-data/pull_requests/33 - Do not process prompts containing binary data for escapes.md
+++ b/github-data/pull_requests/33 - Do not process prompts containing binary data for escapes.md
@@ -1,14 +1,17 @@
-### 🔀 [#33](https://github.com/ikawrakow/ik_llama.cpp/pull/33) - Do not process prompts containing binary data for escapes
+## 🔀 [Pull Request #33](https://github.com/ikawrakow/ik_llama.cpp/pull/33) - Do not process prompts containing binary data for escapes
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_multiple_choice` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-02 |
| **Updated** | 2024-09-02 |
+| **Merged** | 2024-09-02 |
---
-#### Description
+## 📄 Description
The multiple choice evaluation has been broken in `llama.cpp` via commit `6ff13987a`, and this PR fixes it.
diff --git a/github-data/pull_requests/330 - Allow q8_0 KV cache for head size 256.md b/github-data/pull_requests/330 - Allow q8_0 KV cache for head size 256.md
index 3f4cb96ba..8e3f82122 100644
--- a/github-data/pull_requests/330 - Allow q8_0 KV cache for head size 256.md
+++ b/github-data/pull_requests/330 - Allow q8_0 KV cache for head size 256.md
@@ -1,13 +1,16 @@
-### 🔀 [#330](https://github.com/ikawrakow/ik_llama.cpp/pull/330) - Allow q8_0 KV cache for head size 256
+## 🔀 [Pull Request #330](https://github.com/ikawrakow/ik_llama.cpp/pull/330) - Allow q8_0 KV cache for head size 256
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/gemma_q80_kvcache` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-15 |
| **Updated** | 2025-04-15 |
+| **Merged** | 2025-04-15 |
---
-#### Description
+## 📄 Description
Gemma models have a head size of 256. For whatever reason, the inherited CUDA FA code only allows `fp16` KV cache for this head size. This PR adds the ability to also use `Q8_0` KV cache with FA.
\ No newline at end of file
diff --git a/github-data/pull_requests/331 - Better gemm_gemv on AVX2 fr q4_0_r8.md b/github-data/pull_requests/331 - Better gemmgemv on AVX2 fr q4_0_r8.md
similarity index 62%
rename from github-data/pull_requests/331 - Better gemm_gemv on AVX2 fr q4_0_r8.md
rename to github-data/pull_requests/331 - Better gemmgemv on AVX2 fr q4_0_r8.md
index 4c14f5b42..1a6507c28 100644
--- a/github-data/pull_requests/331 - Better gemm_gemv on AVX2 fr q4_0_r8.md
+++ b/github-data/pull_requests/331 - Better gemmgemv on AVX2 fr q4_0_r8.md
@@ -1,13 +1,16 @@
-### 🔀 [#331](https://github.com/ikawrakow/ik_llama.cpp/pull/331) - Better gemm/gemv on AVX2 fr q4_0_r8
+## 🔀 [Pull Request #331](https://github.com/ikawrakow/ik_llama.cpp/pull/331) - Better gemm/gemv on AVX2 fr q4_0_r8
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/faster_avx2_q40` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-15 |
| **Updated** | 2025-04-15 |
+| **Merged** | 2025-04-15 |
---
-#### Description
+## 📄 Description
I constantly get confused how many `int16_t` dot products (`_mm256_maddubs_epi16()` results) I can sum up as `int16_t` before overflowing. In the case of `Q4_0` I was adding too few, and was having one unnecessary `_mm256_madd_epi16` because of that. This PR fixes this. The result is a ~10% gain in performance when tested with Geema-3-12B-Instruct.
\ No newline at end of file
diff --git a/github-data/pull_requests/332 - Better TG performance for GQA models _CPU_.md b/github-data/pull_requests/332 - Better TG performance for GQA models CPU.md
similarity index 95%
rename from github-data/pull_requests/332 - Better TG performance for GQA models _CPU_.md
rename to github-data/pull_requests/332 - Better TG performance for GQA models CPU.md
index e6307e81f..a9400bd37 100644
--- a/github-data/pull_requests/332 - Better TG performance for GQA models _CPU_.md
+++ b/github-data/pull_requests/332 - Better TG performance for GQA models CPU.md
@@ -1,14 +1,17 @@
-### 🔀 [#332](https://github.com/ikawrakow/ik_llama.cpp/pull/332) - Better TG performance for GQA models (CPU)
+## 🔀 [Pull Request #332](https://github.com/ikawrakow/ik_llama.cpp/pull/332) - Better TG performance for GQA models (CPU)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/tg_tweaks` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-16 |
| **Updated** | 2025-04-17 |
+| **Merged** | 2025-04-17 |
---
-#### Description
+## 📄 Description
This PR adds improved TG performance on the CPU for GQA models (LLaMA-2+, Gemma, etc.).
We see performance gains with and without FA. The gains without FA are fairly minor and come from a different way of distributing the work between the threads for the `K*Q` and `V*softmax(K*Q)` matrix multiplications. The performance gains with FA enabled are very significant, and FA now outperforms no-FA also for TG.
@@ -23,17 +26,17 @@ The x-axis is `N_KV`, the number of tokens in the KV cache.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-16** at **16:04:38**:
+👤 **ikawrakow** commented on **2025-04-16** at **16:04:38**
-Here another comparison to mainline, this time for Gemma3-12B-Instruct. Only runs with FA enabled, `Q8_0` KV-cache, `Q4_0` quantized model, Risen-5975WX CPU. I have rerun the mainline benchmark multiple times, dropping caches or not between runs, and the peculiar sudden drop in performance for the first 1024 tokens in the KV cache remained unchanged. Here mainline does significantly better relative to `ik_llama.cpp` compared to LLaMA-3.1-8B in the above graph. I suspect this is due to the fact that the benefit from the improvement this PR adds is less. Gemma3 has 16 attention heads in total and 8 KV heads. This results in the `K*Q` and `V*softmax(K*Q)` GEMM's for TG to be done with matrices with just 2 rows (compared to 4 rows for LLaMA-3), so the gain from using GEMM instead of GEMV is less. It is also possible that there is something in mainline that makes it perform better with the Gemma3 head size of 256 (vs 128 for LLaMA-3). The mainline CPU code has changed a lot since I left the project, so I cannot say I know very well what happens there.
+Here another comparison to mainline, this time for Gemma3-12B-Instruct. Only runs with FA enabled, `Q8_0` KV-cache, `Q4_0` quantized model, Ryzen-5975WX CPU. I have rerun the mainline benchmark multiple times, dropping caches or not between runs, and the peculiar sudden drop in performance for the first 1024 tokens in the KV cache remained unchanged. Here mainline does significantly better relative to `ik_llama.cpp` compared to LLaMA-3.1-8B in the above graph. I suspect this is due to the fact that the benefit from the improvement this PR adds is less. Gemma3 has 16 attention heads in total and 8 KV heads. This results in the `K*Q` and `V*softmax(K*Q)` GEMM's for TG to be done with matrices with just 2 rows (compared to 4 rows for LLaMA-3), so the gain from using GEMM instead of GEMV is less. It is also possible that there is something in mainline that makes it perform better with the Gemma3 head size of 256 (vs 128 for LLaMA-3). The mainline CPU code has changed a lot since I left the project, so I cannot say I know very well what happens there.

---
-👤 **saood06** commented the **2025-04-17** at **00:32:59**:
+👤 **saood06** commented on **2025-04-17** at **00:32:59**
>and FA now outperforms no-FA also for TG.
@@ -54,7 +57,7 @@ I wonder if they cross over at higher contexts the gap does seem to be closing h
---
-👤 **ikawrakow** commented the **2025-04-17** at **05:54:21**:
+👤 **ikawrakow** commented on **2025-04-17** at **05:54:21**
> Do you still have the raw markdown results? I know PP wasn't affected by this PR but I'm curious where it stands vs mainline.
@@ -220,7 +223,7 @@ Btw, my surprise at the 6X drop in PP performance for DeepSeek-V3/R1 that I expr
---
-👤 **saood06** commented the **2025-04-17** at **07:45:00**:
+👤 **saood06** commented on **2025-04-17** at **07:45:00**
> Mainline PP performance with FA is embarrassing.
@@ -238,7 +241,7 @@ Thanks for doing that.
Ya surprisingly the newer run with higher KV performed better looking at both.
-
+> ### Gemma3-12B-Instruct
> At 16k tokens mainline TG performance is indeed slightly better than `ik_llama.cpp`.
Here's the visual generated with the python script in the sweep-bench example folder, in order to see the crossover point.
diff --git a/github-data/pull_requests/333 - Support GLM-4-0414 models based on piDack_s mainline PR.md b/github-data/pull_requests/333 - Support GLM-4-0414 models based on piDacks mainline PR.md
similarity index 95%
rename from github-data/pull_requests/333 - Support GLM-4-0414 models based on piDack_s mainline PR.md
rename to github-data/pull_requests/333 - Support GLM-4-0414 models based on piDacks mainline PR.md
index 53d60cfb2..cbc7bcde4 100644
--- a/github-data/pull_requests/333 - Support GLM-4-0414 models based on piDack_s mainline PR.md
+++ b/github-data/pull_requests/333 - Support GLM-4-0414 models based on piDacks mainline PR.md
@@ -1,14 +1,16 @@
-### 🔀 [#333](https://github.com/ikawrakow/ik_llama.cpp/pull/333) - Support GLM-4-0414 models based on piDack's mainline PR
+## 🔀 [Pull Request #333](https://github.com/ikawrakow/ik_llama.cpp/pull/333) - Support GLM-4-0414 models based on piDack's mainline PR
| **Author** | `ubergarm` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ug/piDack/update-glm4z` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-17 |
-| **Updated** | 2025-04-21 |
+| **Updated** | 2025-07-26 |
---
-#### Description
+## 📄 Description
## tl;dr;
I got stuck on this PR and figured I'd push it anyway, no pressure to look at it.
@@ -17,7 +19,7 @@ I got stuck on this PR and figured I'd push it anyway, no pressure to look at it
This PR needs some more love. It is *not* working on CUDA backend, but *might* be working on CPU backend for `THUDM/GLM-Z1-Rumination-32B-0414` `bf16` GGUF converted using piDack's mainline branch.
## Purpose
-The goal of this PR is to incorporate changes made by [piDack on maline llama.cpp PR#12957](https://github.com/ggml-org/llama.cpp/pull/12957) in order to support the recently updated [THUDM/glm-4-0414](https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e) models.
+The goal of this PR is to incorporate changes made by [piDack on maline llama.cpp PR[#12957](https://github.com/ikawrakow/ik_llama.cpp/issues/12957)](https://github.com/ggml-org/llama.cpp/pull/12957) in order to support the recently updated [THUDM/glm-4-0414](https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e) models.
Specifically I was attempting to imatrix and quantize [THUDM/GLM-Z1-Rumination-32B-0414](https://huggingface.co/THUDM/GLM-Z1-Rumination-32B-0414/tree/main) hoping to use the new cosine similarity layer importance scoring to design a lower PPL quant.
@@ -941,15 +943,15 @@ I'll skip ahead and try to quantize it without imatrix for now and see if it act
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-04-17** at **22:30:45**:
+👤 **ubergarm** commented on **2025-04-17** at **22:30:45**
Okay, after some more testing it seems to be working with CPU backend, but not with CUDA.
-Quick Q4_0 quantization success
+Q4_0 quantization success
```bash
custom="
@@ -1013,9 +1015,19 @@ llama_model_quantize_internal: quant size = 17783.55 MB
-CUDA test fails
+CUDA inference test fails
```bash
+$ CUDA_VISIBLE_DEVICES="0," \
+./build/bin/llama-cli \
+ --alias ubergarm/GLM-Z1-Rumination-32B-0414-Q4_0 \
+ --model /mnt/raid/models/ubergarm/GLM-Z1-Rumination-32B-0414-GGUF/GLM-Z1-Rumination-32B-0414-Q4_0.gguf \
+ --ctx-size 8192 \
+ --parallel 1 \
+ --n-gpu-layers 62 \
+ --prompt "The meaning of life is" \
+ --threads 24
+
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
@@ -1072,8 +1084,9 @@ llama_print_timings: total time = 1630.87 ms / 55 tokens
-CPU test seems okay in quick test
+CPU inference seems okay with quick test
+*NOTE*: While it generates valid looking output, it behaves differently than running the same quant on mainline e.g. no `` token etc. Perhaps difference with default system prompt or not etc.
```bash
$ ./build/bin/llama-cli \
--alias ubergarm/GLM-Z1-Rumination-32B-0414-Q4_0 \
@@ -1142,28 +1155,33 @@ llama_print_timings: total time = 9967.31 ms / 10 tokens
Not exactly sure, but a few possible issues given I'm not too familiar with the code-base and mainline has diverged for some of this code:
-1. `batch` vs `ubatch`
-2. loading contexts
+1. Might be something in the cuda graph [build_chatglm()](https://github.com/ubergarm/ik_llama.cpp/blob/ug/piDack/update-glm4z/src/llama.cpp#L8266-L8307) e.g.
+ * [`batch` vs `ubatch`](https://github.com/ubergarm/ik_llama.cpp/blob/ug/piDack/update-glm4z/src/llama.cpp#L15206)
+ * [building qkv attention and rope stuff](https://github.com/ubergarm/ik_llama.cpp/blob/ug/piDack/update-glm4z/src/llama.cpp#L15229-L15274)
+2. I wasn't for sure on what ggml_context layer/split to use when [loading tensors](https://github.com/ubergarm/ik_llama.cpp/blob/ug/piDack/update-glm4z/src/llama.cpp#L8266-L8307)
+3. I possibly missed copying something important or made some random mistake. lol
+
+Gonna take a break for now and maybe fuss with it some more later.
---
-👤 **pwilkin** commented the **2025-04-17** at **22:46:57**:
+👤 **pwilkin** commented on **2025-04-17** at **22:46:57**
Took a quick look and I think you're missing the `convert_hf_to_gguf.py` changes from this commit: https://github.com/ggml-org/llama.cpp/pull/12957/commits/b928f8ca24b1f5f4e781b57f70e375bee07a9763, those were the ones that fixed the interleaved RoPE problems with the converted / quantified models.
---
-👤 **ubergarm** commented the **2025-04-17** at **23:13:50**:
+👤 **ubergarm** commented on **2025-04-17** at **23:13:50**
> Took a quick look and I think you're missing the `convert_hf_to_gguf.py` changes.
-Oh wow, thanks for taking a look! Right, I was being lazy and used your branch to do the `convert_hf_to_gguf.py` and only attempted to include changes to cpp code in this PR.
+Oh wow, thanks for taking a look! Right, I was being lazy and used the mainline branch to do the `convert_hf_to_gguf.py` and only attempted to include changes to cpp code in this PR.
-It made me think to try the `Q4_0` gguf I quantized with this `ik_llama.cpp` fork back over on your mainline PR and it works with CUDA and wow yeah does this thing ruminate with the default system prompt given it is not hooked up to actual tool use deep-research stuff.
+It made me think to try the `Q4_0` gguf I quantized with this `ik_llama.cpp` fork back over on the mainline PR and it works with CUDA and wow yeah this model *does indeed ruminate* with the default system prompt given it is *not* hooked up to actual tool use deep-research stuff.
-Testing this `Q4_0` on
+Testing `Q4_0` quantized from this fork back on mainline llama.cpp branch PR[#12957](https://github.com/ikawrakow/ik_llama.cpp/issues/12957)
```bash
$ git branch | grep '*'
@@ -1222,18 +1240,42 @@ To answer this question accurately, I must first define what life is, or at leas
.
.
.
+
+
+
+It's clear that the search engine isn't effectively filtering for scientific perspectives.
+
+.
+.
+.
+
+# seems to go on an on and on an on looping on
```
---
-👤 **ikawrakow** commented the **2025-04-20** at **06:15:30**:
+👤 **ikawrakow** commented on **2025-04-20** at **06:15:30**
Did you see https://github.com/ggml-org/llama.cpp/pull/13021 ?
---
-👤 **ubergarm** commented the **2025-04-21** at **15:36:34**:
+👤 **ubergarm** commented on **2025-04-21** at **15:36:34**
+
+I see, the PR that actually got merged was mainline `PR#12867`. I'll close this for now and hope to get a chance to try again using that PR to guide me instead. Low priority, just having fun trying to learn a little more. Thanks!
+
+---
-I see, the PR that actually got merged was mainline `PR#12867`. I'll close this for now and hope to get a chance to try again using that PR to guide me instead. Low priority, just having fun trying to learn a little more. Thanks!
\ No newline at end of file
+👤 **gopinath87607** commented on **2025-07-25** at **05:05:34**
+
+@ubergarm seems like glm is coming are we ready ? there is a some work going on in vllm repo i think
+
+---
+
+👤 **ubergarm** commented on **2025-07-26** at **17:45:50**
+
+@gopinath87607
+
+I believe ZzZzZzZzZzZz did a transformers PR already, but haven't seen one on mainline lcpp yet psure. Getting hard to keep up haha...
\ No newline at end of file
diff --git a/github-data/pull_requests/336 - Fix termux_android build.md b/github-data/pull_requests/336 - Fix termuxandroid build.md
similarity index 70%
rename from github-data/pull_requests/336 - Fix termux_android build.md
rename to github-data/pull_requests/336 - Fix termuxandroid build.md
index 7c1173ff7..baaf46407 100644
--- a/github-data/pull_requests/336 - Fix termux_android build.md
+++ b/github-data/pull_requests/336 - Fix termuxandroid build.md
@@ -1,14 +1,17 @@
-### 🐛 [#336](https://github.com/ikawrakow/ik_llama.cpp/pull/336) - Fix termux/android build
+## 🔀 [Pull Request #336](https://github.com/ikawrakow/ik_llama.cpp/pull/336) - Fix termux/android build
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/termux_fix` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-20 |
| **Updated** | 2025-04-30 |
+| **Merged** | 2025-04-21 |
---
-#### Description
+## 📄 Description
@ikawrakow
@@ -16,13 +19,13 @@ Sorry this is a mess, but this does get it to build now on my android device whe
I did catch the additional issue of the changed iqk_flash_attn_noalibi definition in the case where your building this repo and IQK_IMPLEMENT is not defined because my device doesn't support dotprod.
-Fixes #159
+Fixes [#159](https://github.com/ikawrakow/ik_llama.cpp/issues/159)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-20** at **08:59:26**:
+👤 **ikawrakow** commented on **2025-04-20** at **08:59:26**
Thank you for this.
@@ -32,11 +35,11 @@ I guess we need an `IQK_API` macro similar to `GGML_API`. Or one can just reuse
---
-👤 **saood06** commented the **2025-04-20** at **09:20:04**:
+👤 **saood06** commented on **2025-04-20** at **09:20:04**
> Thank you for this.
-It would be interesting to benchmark it, but I can't since my phone doesn't support IQK. My main motivation was thinking about doing a release (but I haven't done many non-native builds, and don't have access to a mac).
+It would be interesting to benchmark it, but I can't since my phone doesn't support IQK. My main motivation was thinking about doing a release (I can build on Windows, Linux, and now Android but I haven't done many non-native builds, and don't have access to a mac).
> So, the issue on Android was that no visibility was specified for the iqk functions, Android apparently uses hidden visibility by default, so the linker does not find the iqk functions.
@@ -52,13 +55,13 @@ That should work.
---
-👤 **saood06** commented the **2025-04-21** at **03:39:42**:
+👤 **saood06** commented on **2025-04-21** at **03:39:42**
Cleaned it up using an `IQK_API` macro.
---
-👤 **ikawrakow** commented during a code review the **2025-04-21** at **06:11:32** on `ggml/src/iqk/iqk_config.h`:
+👤 **ikawrakow** started a conversation on `ggml/src/iqk/iqk_config.h` on **2025-04-21** at **06:11:32**
To have this also work for a static built, it should be
```c++
@@ -77,41 +80,29 @@ To have this also work for a static built, it should be
#endif
```
----
-
-👤 **ikawrakow** commented during a code review the **2025-04-21** at **06:15:05** on `ggml/src/iqk/iqk_flash_attn.cpp`:
-
-Do we really need to repeat `extern "C" IQK_API` here?
-
----
-
-👤 **ikawrakow** submitted a review the **2025-04-21** at **06:27:52**: ✅ `APPROVED`
-
-I wonder if something else apart from the dot product is needed to have the iqk functions work on your phone. I see that I have consistently used `ggml_vdotq_s32`, whiere `ggml` provided an implementation when `__ARM_FEATURE_DOTPROD` is not available. The one known missing ingredient without `__ARM_FEATURE_DOTPROD ` is `vdotq_laneq_s32`. But is there something else missing? If `vdotq_laneq_s32` was the only missing thing, one could add an implementation, and then one would be able to use `iqk` stuff on generic `__aarch64__`. I don't have an Android phone myself, so was never compelled to try.
+> 👤 **saood06** replied on **2025-04-21** at **07:11:44**
+>
+> Changed.
---
-👤 **saood06** submitted a review the **2025-04-21** at **07:11:44**: 💬 `COMMENTED`
-
----
+👤 **ikawrakow** started a conversation on `ggml/src/iqk/iqk_flash_attn.cpp` on **2025-04-21** at **06:15:05**
-👤 **saood06** commented during a code review the **2025-04-21** at **07:11:44** on `ggml/src/iqk/iqk_config.h`:
-
-Changed.
-
----
+Do we really need to repeat `extern "C" IQK_API` here?
-👤 **saood06** submitted a review the **2025-04-21** at **07:12:00**: 💬 `COMMENTED`
+> 👤 **saood06** replied on **2025-04-21** at **07:12:00**
+>
+> Changed
---
-👤 **saood06** commented during a code review the **2025-04-21** at **07:12:00** on `ggml/src/iqk/iqk_flash_attn.cpp`:
+👤 **ikawrakow** approved this pull request ✅ on **2025-04-21** at **06:27:52**
-Changed
+I wonder if something else apart from the dot product is needed to have the iqk functions work on your phone. I see that I have consistently used `ggml_vdotq_s32`, whiere `ggml` provided an implementation when `__ARM_FEATURE_DOTPROD` is not available. The one known missing ingredient without `__ARM_FEATURE_DOTPROD ` is `vdotq_laneq_s32`. But is there something else missing? If `vdotq_laneq_s32` was the only missing thing, one could add an implementation, and then one would be able to use `iqk` stuff on generic `__aarch64__`. I don't have an Android phone myself, so was never compelled to try.
---
-👤 **saood06** commented the **2025-04-21** at **07:13:59**:
+👤 **saood06** commented on **2025-04-21** at **07:13:59**
>I don't have an Android phone myself, so was never compelled to try.
@@ -121,13 +112,13 @@ I made the two suggested changes, and it compiles.
---
-👤 **ikawrakow** commented the **2025-04-21** at **07:19:58**:
+👤 **ikawrakow** commented on **2025-04-21** at **07:19:58**
So now we need to find someone with a modern phone willing to test. I would be really curious to compare the performance to Vulkan. The GPUs on many of the phones are quite underpowered, and the `llama.cpp` Vulkan implementation is not particularly performant (although it seems to have been improving lately), so now that it builds on Android, running `ik_llama.cpp` on the CPU is possibly a viable alternative to Vulkan.
---
-👤 **saood06** commented the **2025-04-21** at **07:38:30**:
+👤 **saood06** commented on **2025-04-21** at **07:38:30**
> So now we need to find someone with a modern phone willing to test.
@@ -141,7 +132,7 @@ Do you have a model/quant in mind you would want ran across the 3 backends?
---
-👤 **ikawrakow** commented the **2025-04-21** at **08:45:24**:
+👤 **ikawrakow** commented on **2025-04-21** at **08:45:24**
> Do you have a model/quant in mind you would want ran across the 3 backends?
@@ -149,7 +140,42 @@ Including Android? Then something small like LLaMA-3B using `IQ4_XS` or `IQ4_KS`
---
-👤 **saood06** commented the **2025-04-30** at **07:37:58**:
+👤 **saood06** commented on **2025-04-24** at **13:47:00**
+
+I had a little bit of time with a Galaxy S22 (1×3.00 GHz Cortex-X2 & 3×2.40 GHz Cortex-A710 & 4×1.70 GHz Cortex-A510).
+
+`~/ik_llama.cpp/build1 $ bin/llama-sweep-bench -m ~/ggml-model-iq2_bn_r4.gguf -t 4`
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 6.081 | 84.20 | 3.537 | 36.18 |
+| 512 | 128 | 512 | 8.509 | 60.17 | 4.594 | 27.86 |
+| 512 | 128 | 1024 | 12.571 | 40.73 | 5.461 | 23.44 |
+| 512 | 128 | 1536 | 16.879 | 30.33 | 6.582 | 19.45 |
+| 512 | 128 | 2048 | 20.344 | 25.17 | 7.640 | 16.75 |
+| 512 | 128 | 2560 | 29.417 | 17.40 | 10.138 | 12.63 |
+| 512 | 128 | 3072 | 34.477 | 14.85 | 11.348 | 11.28 |
+| 512 | 128 | 3584 | 38.911 | 13.16 | 12.595 | 10.16 |
+
+Flash attention did worse:
+`~/ik_llama.cpp/build1 $ bin/llama-sweep-bench -m ~/ggml-model-iq2_bn_r4.gguf -fa -t 4`
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 9.496 | 53.92 | 3.954 | 32.38 |
+| 512 | 128 | 512 | 19.082 | 26.83 | 7.029 | 18.21 |
+| 512 | 128 | 1024 | 27.123 | 18.88 | 10.393 | 12.32 |
+| 512 | 128 | 1536 | 32.178 | 15.91 | 14.209 | 9.01 |
+| 512 | 128 | 2048 | 40.818 | 12.54 | 16.617 | 7.70 |
+| 512 | 128 | 2560 | 48.743 | 10.50 | 20.061 | 6.38 |
+| 512 | 128 | 3072 | 55.976 | 9.15 | 25.354 | 5.05 |
+| 512 | 128 | 3584 | 76.750 | 6.67 | 27.247 | 4.70 |
+
+I'll be able to test more with it again later.
+
+---
+
+👤 **saood06** commented on **2025-04-30** at **07:37:58**
I was able to test a bit more and turns out the results I got above are meaningless as the model returns gibberish. I have to build with arch flags manually set (and armv9 caused illegal instructions even though this device supports it, but `armv8.2-a+dotprod+fp16` worked). The new build was tested working with the test prompt in cli returning coherent results (and the much longer compile time showed it was actually compiling iqk_mul_mat.cpp), but performance numbers were wildly inconsistent between runs (even using taskset to try and force it to only be on the performant cores helped a bit but still was very inconsistent).
@@ -170,13 +196,13 @@ Best result I was able to get was with 4 threads and FA off but I haven't manage
---
-👤 **ikawrakow** commented the **2025-04-30** at **08:28:12**:
+👤 **ikawrakow** commented on **2025-04-30** at **08:28:12**
Do you know how `BitNet.cpp` does on this device?
---
-👤 **saood06** commented the **2025-04-30** at **08:47:23**:
+👤 **saood06** commented on **2025-04-30** at **08:47:23**
> Do you know how `BitNet.cpp` does on this device?
@@ -186,9 +212,9 @@ I wanted to provide the flash attention numbers as well, but I'm not sure if I j
---
-👤 **ikawrakow** commented the **2025-04-30** at **09:06:45**:
+👤 **ikawrakow** commented on **2025-04-30** at **09:06:45**
-So, my Arm optimizations are totally based on the M2 chip. Your results and what was reported in #345 may indicate that they may not really be optimal for lower end Arm processors. For instance, I often use more vector registers than available. On the M2-Max this register spillage is better (faster) than not using all vector registers. But the lower end chips may not handle this very well (common wisdom is that one should avoid register spillage). Or perhaps the compiler is not producing optimum code. Have you tried `clang` (which is what I use for the M2)?
+So, my Arm optimizations are totally based on the M2 chip. Your results and what was reported in [#345](https://github.com/ikawrakow/ik_llama.cpp/issues/345) may indicate that they may not really be optimal for lower end Arm processors. For instance, I often use more vector registers than available. On the M2-Max this register spillage is better (faster) than not using all vector registers. But the lower end chips may not handle this very well (common wisdom is that one should avoid register spillage). Or perhaps the compiler is not producing optimum code. Have you tried `clang` (which is what I use for the M2)?
I guess, if I want to become serious with supporting mobile devices, I should get myself a Raspberry Pi to play with. Or perhaps the Rock 5b board.
@@ -196,7 +222,7 @@ I haven't done any experiments on that sort of CPU for a long time. But I think
---
-👤 **saood06** commented the **2025-04-30** at **09:31:06**:
+👤 **saood06** commented on **2025-04-30** at **09:31:06**
>For instance, I often use more vector registers than available. On the M2-Max this register spillage is better (faster) than not using all vector registers. But the lower end chips may not handle this very well (common wisdom is that one should avoid register spillage).
diff --git a/github-data/pull_requests/337 - Add support for bitnet2b_2501 model.md b/github-data/pull_requests/337 - Add support for bitnet2b_2501 model.md
index a81dd4e60..a1100923b 100644
--- a/github-data/pull_requests/337 - Add support for bitnet2b_2501 model.md
+++ b/github-data/pull_requests/337 - Add support for bitnet2b_2501 model.md
@@ -1,14 +1,17 @@
-### 🔀 [#337](https://github.com/ikawrakow/ik_llama.cpp/pull/337) - Add support for bitnet2b_2501 model
+## 🔀 [Pull Request #337](https://github.com/ikawrakow/ik_llama.cpp/pull/337) - Add support for bitnet2b_2501 model
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/bitnet2b_2501` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-21 |
| **Updated** | 2025-04-22 |
+| **Merged** | 2025-04-22 |
---
-#### Description
+## 📄 Description
Very direct port of https://github.com/microsoft/BitNet/pull/167 more specifically this commit, https://github.com/Eddie-Wang1120/llama.cpp/commit/a8ac7072ae02ffd68b4b661db0ebd2689fb82b7f
@@ -18,9 +21,9 @@ I have not ran the model yet.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-21** at **16:08:46**:
+👤 **ikawrakow** commented on **2025-04-21** at **16:08:46**
I fetched the model from https://huggingface.co/microsoft/bitnet-b1.58-2B-4T
@@ -32,7 +35,7 @@ ERROR:hf-to-gguf:Model BitNetForCausalLM is not supported
---
-👤 **ikawrakow** commented the **2025-04-21** at **16:18:33**:
+👤 **ikawrakow** commented on **2025-04-21** at **16:18:33**
And after noticing that it is now "BitNetForCausalLM" instead of "BitnetForCausalLM" and fixing it, I get
```
@@ -62,13 +65,9 @@ ValueError: Can not map tensor 'model.layers.0.mlp.down_proj.weight_scale'
---
-👤 **saood06** commented the **2025-04-22** at **02:33:41**:
+👤 **saood06** commented on **2025-04-22** at **02:33:41**
-I can reproduce the issue with the safetensors conversion,
-
-
-
-but using the method outlined in #169 I was able to get it running.
+I can reproduce the issue with the safetensors conversion, but using the method outlined in [#169](https://github.com/ikawrakow/ik_llama.cpp/issues/169) I was able to get it running.
```
./bin/llama-quantize --allow-requantize /mnt/sda/bitnet/gguf/ggml-model-i2_s.gguf /mnt/sda/bitnet/gguf/ggml-model-iq2_bn.gguf iq2_bn
@@ -76,6 +75,8 @@ but using the method outlined in #169 I was able to get it running.
Full log inside
+
+
```
main: build = 3641 (35691804)
main: built with gcc (Clear Linux OS for Intel Architecture) 14.2.1 20241210 releases/gcc-14.2.0-551-g21a09f0507 for x86_64-generic-linux
@@ -454,7 +455,7 @@ main: quantize time = 7087.18 ms
main: total time = 7087.18 ms
```
-I even ran the same prompt ran on the other bitnet's.
+I even ran it with the same prompt that you ran on the other bitnet's.
```
./bin/llama-cli -m /mnt/sda/bitnet/gguf/ggml-model-iq2_bn.gguf -s 12345 -p "Write an essay about ecosystem" -t 8 --numa distribute -n 900
@@ -625,6 +626,10 @@ In conclusion, ecosystem services play a crucial role in sustaining human life a
##Answers:
1. Environmental impact assessments (EIAs) play a crucial role in incorporating the value of ecosystem services into policy decisions. An EIA evaluates the potential environmental effects of a proposed policy or development project, including the impact on ecosystems and their services. By considering the value of ecosystem services, policymakers can
+```
+
+
+```
llama_print_timings: load time = 295.32 ms
llama_print_timings: sample time = 82.35 ms / 900 runs ( 0.09 ms per token, 10929.49 tokens per second)
llama_print_timings: prompt eval time = 185.71 ms / 6 tokens ( 30.95 ms per token, 32.31 tokens per second)
@@ -659,11 +664,66 @@ Traceback (most recent call last):
KeyError: 'U8'
```
-For now maybe we can just have GGUF support only, relying on elsewhere to do conversion from safetensors just like Gemma3?
+For now maybe we can just have GGUF support only, relying on elsewhere to do conversion from safetensors just like Gemma3?
+
+Edit: Peak speed for me is at 24 threads, would be curious to see it on your machines since you have a lot of comparitive numbers.
+
+```
+llama_print_timings: load time = 301.94 ms
+llama_print_timings: sample time = 11.75 ms / 128 runs ( 0.09 ms per token, 10895.47 tokens per second)
+llama_print_timings: prompt eval time = 121.43 ms / 6 tokens ( 20.24 ms per token, 49.41 tokens per second)
+llama_print_timings: eval time = 3495.94 ms / 127 runs ( 27.53 ms per token, 36.33 tokens per second)
+llama_print_timings: total time = 3683.50 ms / 133 tokens
+```
+
+Edit 2: Pushed the python fix for the new name even if that file still doesn't work. I don't see a point of pushing the standalone file since I still can't get that to work either. If they are going to have a standalone file, we may as well tell people to grab a GGUF (I could even upload one for this model it's small enough).
+
+Edit 3: Even higher speeds with the R4 variant.
+
+```
+llama_print_timings: load time = 299.55 ms
+llama_print_timings: sample time = 11.89 ms / 128 runs ( 0.09 ms per token, 10760.82 tokens per second)
+llama_print_timings: prompt eval time = 98.51 ms / 6 tokens ( 16.42 ms per token, 60.91 tokens per second)
+llama_print_timings: eval time = 3330.97 ms / 127 runs ( 26.23 ms per token, 38.13 tokens per second)
+llama_print_timings: total time = 3498.51 ms / 133 tokens
+```
+
+Edit 4: Using bench to see some numbers of both, where now 48 seems better again, showing best result for both R4 and normal variants.
+
+`./bin/llama-bench -m /mnt/sda/bitnet/gguf/ggml-model-iq2_bn_r4.gguf -r 3 -t 48 --numa distribute`
+
+| model | size | params | backend | threads | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
+| bitnet-25 2B IQ2_BN_R4 - 2.00 bpw Bitnet | 934.16 MiB | 2.74 B | CPU | 48 | pp512 | 305.60 ± 17.19 |
+| bitnet-25 2B IQ2_BN_R4 - 2.00 bpw Bitnet | 934.16 MiB | 2.74 B | CPU | 48 | tg128 | 37.04 ± 0.58 |
+
+`./bin/llama-bench -m /mnt/sda/bitnet/gguf/ggml-model-iq2_bn.gguf -r 3 -t 48 --numa distribute`
+
+| model | size | params | backend | threads | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
+| bitnet-25 2B IQ2_BN - 2.00 bpw Bitnet | 934.16 MiB | 2.74 B | CPU | 48 | pp512 | 290.60 ± 11.27 |
+| bitnet-25 2B IQ2_BN - 2.00 bpw Bitnet | 934.16 MiB | 2.74 B | CPU | 48 | tg128 | 36.79 ± 0.52 |
+
+Very informal testing, no dropping of cache, or other precautions taken.
+
+Edit 5: It is available on huggingface [here](https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF).
+
+Edit 6: Another informal benchmark, this time sweep bench.
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.263 | 405.31 | 3.747 | 34.16 |
+| 512 | 128 | 512 | 1.373 | 373.02 | 3.764 | 34.01 |
+| 512 | 128 | 1024 | 1.503 | 340.58 | 3.890 | 32.91 |
+| 512 | 128 | 1536 | 1.647 | 310.83 | 4.042 | 31.67 |
+| 512 | 128 | 2048 | 1.774 | 288.67 | 4.170 | 30.69 |
+| 512 | 128 | 2560 | 2.027 | 252.53 | 4.369 | 29.30 |
+| 512 | 128 | 3072 | 2.149 | 238.29 | 4.557 | 28.09 |
+| 512 | 128 | 3584 | 2.474 | 206.93 | 4.805 | 26.64 |
---
-👤 **ikawrakow** commented the **2025-04-22** at **05:48:56**:
+👤 **ikawrakow** commented on **2025-04-22** at **05:48:56**
Yes, I got it running by converting the `i2_s` model as well. But what about the missing pre-tokenizer?
```
@@ -711,7 +771,7 @@ Is `llama3` OK, or are we crippling the model by using the `llama3` pre-tokenize
---
-👤 **ikawrakow** commented the **2025-04-22** at **06:07:30**:
+👤 **ikawrakow** commented on **2025-04-22** at **06:07:30**
Here `sweep-bench` performance on my Ryzen-7950X using `-ctk q8_0 -fa -rtr -t 16`
@@ -728,7 +788,7 @@ Here `sweep-bench` performance on my Ryzen-7950X using `-ctk q8_0 -fa -rtr -t 16
---
-👤 **saood06** commented the **2025-04-22** at **06:15:43**:
+👤 **saood06** commented on **2025-04-22** at **06:15:43**
> Yes, I got it running by converting the `i2_s` model as well. But what about the missing pre-tokenizer?
>
@@ -738,7 +798,7 @@ It does seem to have an issue using EOS tokens and stopping generation, so there
---
-👤 **ikawrakow** commented the **2025-04-22** at **06:30:00**:
+👤 **ikawrakow** commented on **2025-04-22** at **06:30:00**
Here the results of the official Microsoft BitNet implementation (build a8ac7072, just pulled)
@@ -751,13 +811,23 @@ BitNet is a `llama.cpp` fork that does nothing else but adding BitNet support, w
---
-👤 **ikawrakow** submitted a review the **2025-04-22** at **06:31:48**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-04-22** at **06:31:48**
I think we can merge like this. It is fine to just use `I2_S` GGUFs. We can sort out the pre-tokenizer issue later.
---
-👤 **saood06** commented the **2025-04-22** at **07:08:26**:
+👤 **saood06** commented on **2025-04-22** at **06:54:14**
+
+> I think we can merge like this. It is fine to just use `I2_S` GGUFs. We can sort out the pre-tokenizer issue later.
+
+Okay. I'll make an issue. I tested the model more, it is coherent, and can even do multi turn conversation, it just doesn't ever use an EOS token and so it never stops it's own generation it will just continue until I stopped it, and I still don't really understand it's chat template:
+
+`{% for message in messages %}{% if loop.first %}{{ bos_token }}{% endif %}{% if message['role'] == 'user' %}{{ 'Human: ' + message['content'] + '\n\nBITNETAssistant: ' + eos_token }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token }}{% endif %}{% endfor %}`
+
+---
+
+👤 **saood06** commented on **2025-04-22** at **07:08:26**
> Here `sweep-bench` performance on my Ryzen-7950X using `-ctk q8_0 -fa -rtr -t 16`
@@ -765,7 +835,7 @@ I couldn't get flash attention running, it would always just exit with `Floating
---
-👤 **ikawrakow** commented the **2025-04-22** at **07:16:33**:
+👤 **ikawrakow** commented on **2025-04-22** at **07:16:33**
> I couldn't get flash attention running, it would always just exit with Floating point exception (core dumped).
@@ -773,7 +843,7 @@ Something is missing in the logic for your number of threads. The model has a st
---
-👤 **saood06** commented the **2025-04-22** at **07:26:59**:
+👤 **saood06** commented on **2025-04-22** at **07:26:59**
> > I couldn't get flash attention running, it would always just exit with Floating point exception (core dumped).
>
diff --git a/github-data/pull_requests/338 - BitNet adjustments.md b/github-data/pull_requests/338 - BitNet adjustments.md
index c21995599..e4588642d 100644
--- a/github-data/pull_requests/338 - BitNet adjustments.md
+++ b/github-data/pull_requests/338 - BitNet adjustments.md
@@ -1,26 +1,29 @@
-### 🔀 [#338](https://github.com/ikawrakow/ik_llama.cpp/pull/338) - BitNet adjustments
+## 🔀 [Pull Request #338](https://github.com/ikawrakow/ik_llama.cpp/pull/338) - BitNet adjustments
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bitnet_adjustments` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-22 |
| **Updated** | 2025-04-22 |
+| **Merged** | 2025-04-22 |
---
-#### Description
+## 📄 Description
-Two small tweaks to #337:
+Two small tweaks to [#337](https://github.com/ikawrakow/ik_llama.cpp/issues/337):
* Use `create_tensor` instead of `ml.create_tensor`. This is necessary for tensor overrides to work (in case one would ever want to use tensor overrides with a BitNet model)
* Use `output.weight` instead of `token_embd.weight` for the final matrix multiplication. This improves CUDA performance quite a bit as `token_embd.weight` is on the host, so needs to be copied to the GPU each time it is needed (or the matrix multiplication is done on the CPU when running TG). I see that MicroSoft have decided to have `output.weight` stored in the model, even though it is identical to `token_embd.weight` (in the initial BitNet models one simply reused `token_embd.weight`). This makes the model quite a bit larger than it needs to be. Go figure.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-04-22** at **07:01:54**:
+👤 **saood06** commented on **2025-04-22** at **07:01:54**
-> * Use `create_tensor` instead of `ml.create_tensor`. This is necessary for tensor overrides to work (in case one would ever want to use tensor overrides with a BitNet model)
+>Use `create_tensor` instead of `ml.create_tensor`. This is necessary for tensor overrides to work (in case one would ever want to use tensor overrides with a BitNet model)
Yes I noticed that, I just didn't want to change until I tested if it worked first.
@@ -34,4 +37,16 @@ I also noticed when converting the two tensors ended up with different quants.
```
[ 1/ 333] output.weight - [ 2560, 128256, 1, 1], type = f16, converting to q6_K .. size = 626.25 MiB -> 256.86 MiB
[ 2/ 333] token_embd.weight - [ 2560, 128256, 1, 1], type = f16, converting to iq4_nl .. size = 626.25 MiB -> 176.13 MiB
-```
\ No newline at end of file
+```
+
+---
+
+👤 **ikawrakow** commented on **2025-04-22** at **07:10:08**
+
+> I also noticed when converting the two tensors ended up with different quants.
+
+These are the built-in defaults. If one wants to have something else one needs to use `--token-embedding-type` and `--output-tensor-type` (or `--custom-q`).
+
+> Interesting. There is a discussion on the huggingface that the model is larger than it has to be. Can we have change this to have smaller model size or is the performance benefit worth it (if it can't be duplicated on runtime for CUDA)?
+
+The two tensors are stored in the model. If we wanted to avoid the duplication, we need to add logic that checks if `output.weight` and `token_embd.weight` are the same. But if one is running on CUDA, one wants to have `output.weight` offloaded to the GPU to avoid the copy on each evaluation. `token_embd.weight` needs to stay on the host because in `llama.cpp` the input (token embedding, attention mask, etc.) is always prepared on the host. So, the only situation where we would gain is for CPU-only inference, where we wouldn't have the same tensor stored twice in memory.
\ No newline at end of file
diff --git a/github-data/pull_requests/341 - Add support for Cohere2.md b/github-data/pull_requests/341 - Add support for Cohere2.md
index 7a184bd05..b05cf1a26 100644
--- a/github-data/pull_requests/341 - Add support for Cohere2.md
+++ b/github-data/pull_requests/341 - Add support for Cohere2.md
@@ -1,16 +1,19 @@
-### 🔀 [#341](https://github.com/ikawrakow/ik_llama.cpp/pull/341) - Add support for Cohere2
+## 🔀 [Pull Request #341](https://github.com/ikawrakow/ik_llama.cpp/pull/341) - Add support for Cohere2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cohere2` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-23 |
| **Updated** | 2025-04-26 |
+| **Merged** | 2025-04-26 |
---
-#### Description
+## 📄 Description
-Closes #340
+Closes [#340](https://github.com/ikawrakow/ik_llama.cpp/issues/340)
Rudimentary tests with [this model](https://huggingface.co/dranger003/c4ai-command-r7b-12-2024-GGUF/blob/main/ggml-c4ai-command-r7b-12-2024-q4_k.gguf), appears to work fine.
diff --git a/github-data/pull_requests/342 - Fix LLaMA-4 attention.md b/github-data/pull_requests/342 - Fix LLaMA-4 attention.md
index a14f14be3..289e42834 100644
--- a/github-data/pull_requests/342 - Fix LLaMA-4 attention.md
+++ b/github-data/pull_requests/342 - Fix LLaMA-4 attention.md
@@ -1,16 +1,19 @@
-### 🐛 [#342](https://github.com/ikawrakow/ik_llama.cpp/pull/342) - Fix LLaMA-4 attention
+## 🔀 [Pull Request #342](https://github.com/ikawrakow/ik_llama.cpp/pull/342) - Fix LLaMA-4 attention
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_llama4_attention` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-24 |
| **Updated** | 2025-04-25 |
+| **Merged** | 2025-04-25 |
---
-#### Description
+## 📄 Description
-Closes #335
+Closes [#335](https://github.com/ikawrakow/ik_llama.cpp/issues/335)
I had missed the SWA part. As SWA only has a real impact past 8k tokens, and as the impact of not using SWA is relatively small for the next 8k tokens, the model appeared coherent up to 16k tokens.
diff --git a/github-data/pull_requests/343 - cuda use switch in constexpr funcs.md b/github-data/pull_requests/343 - cuda use switch in constexpr funcs.md
new file mode 100644
index 000000000..4ff9f3d87
--- /dev/null
+++ b/github-data/pull_requests/343 - cuda use switch in constexpr funcs.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #343](https://github.com/ikawrakow/ik_llama.cpp/pull/343) - cuda: use switch in constexpr funcs
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/pickup_13095` |
+| **Target Branch** | `main` |
+| **Created** | 2025-04-24 |
+| **Updated** | 2025-04-24 |
+| **Merged** | 2025-04-24 |
+
+---
+
+## 📄 Description
+
+Based on [PR 13095](https://github.com/ggml-org/llama.cpp/pull/13095) in mainline. Did not measure, but had the impression that CUDA compile time is reduced.
\ No newline at end of file
diff --git a/github-data/pull_requests/343 - cuda_ use switch in constexpr funcs.md b/github-data/pull_requests/343 - cuda_ use switch in constexpr funcs.md
deleted file mode 100644
index c13817328..000000000
--- a/github-data/pull_requests/343 - cuda_ use switch in constexpr funcs.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🔀 [#343](https://github.com/ikawrakow/ik_llama.cpp/pull/343) - cuda: use switch in constexpr funcs
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-04-24 |
-| **Updated** | 2025-04-24 |
-
----
-
-#### Description
-
-Based on [PR 13095](https://github.com/ggml-org/llama.cpp/pull/13095) in mainline. Did not measure, but had the impression that CUDA compile time is reduced.
\ No newline at end of file
diff --git a/github-data/pull_requests/344 - Add GLM-4-0414 Model Support.md b/github-data/pull_requests/344 - Add GLM-4-0414 Model Support.md
index 5d393aa8b..d17406967 100644
--- a/github-data/pull_requests/344 - Add GLM-4-0414 Model Support.md
+++ b/github-data/pull_requests/344 - Add GLM-4-0414 Model Support.md
@@ -1,16 +1,19 @@
-### 🔀 [#344](https://github.com/ikawrakow/ik_llama.cpp/pull/344) - Add GLM-4-0414 Model Support
+## 🔀 [Pull Request #344](https://github.com/ikawrakow/ik_llama.cpp/pull/344) - Add GLM-4-0414 Model Support
| **Author** | `ubergarm` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ug/add-GLM-4-0414` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-24 |
| **Updated** | 2025-05-08 |
+| **Merged** | 2025-04-26 |
---
-#### Description
+## 📄 Description
-This is my second attempt which still has some issues. Original attempt was #333. This one is based on https://github.com/ggml-org/llama.cpp/pull/12867 . However, this PR does not bring over any of the python stuff.
+This is my second attempt which still has some issues. Original attempt was [#333](https://github.com/ikawrakow/ik_llama.cpp/issues/333). This one is based on https://github.com/ggml-org/llama.cpp/pull/12867 . However, this PR does not bring over any of the python stuff.
In limited testing with of [bartowski/THUDM_GLM-Z1-32B-0414-GGUF](https://huggingface.co/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/blob/main/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf) on CPU only and CUDA backends it seems to work as long as:
@@ -44,9 +47,9 @@ So I'll mark this as draft for now and see how things are looking soon.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-25** at **07:29:50**:
+👤 **ikawrakow** commented on **2025-04-25** at **07:29:50**
> If I increase --n-gpu-layers 60 or higher, it outputs GGGGGGGGGGGGGGG.
@@ -54,7 +57,7 @@ Does it also happen when you use `-ctk q8_0 -ctv q8_0`? There is [this PR](https
---
-👤 **ubergarm** commented the **2025-04-25** at **14:37:32**:
+👤 **ubergarm** commented on **2025-04-25** at **14:37:32**
Hrrm, unfortunately no using `-ctk q8_0 -ctv q8_0` with `-ngl 60` (or higher) still throws `GGGGGGGG`...
@@ -91,31 +94,31 @@ Could be that I made a mistake in the `build_glm4()` the attention cgraph? Inter
--port 8080
```
-Last observations are that mainline seems to work fine with or without `-fa` and also mainline is *much slower* even fully offloaded e.g. 20 tok/sec PP and 5 tok/s TG. Compared to `ik_llama.cp` getting 163 tok/sec PP and 17 tok/sec TG with `-ot attn=CPU -nkvo` and even faster at 271 tok/sec PP and 25 tok/sec TG with `-ngl 59`...
+*EDIT*: mainline was compiled for CPU only so this is to be expected: Last observations are that mainline seems to work fine with or without `-fa` and also mainline is *much slower* even ~fully offloaded~ e.g. 20 tok/sec PP and 5 tok/s TG. Compared to `ik_llama.cpp` getting 163 tok/sec PP and 17 tok/sec TG with `-ot attn=CPU -nkvo` and even faster at 271 tok/sec PP and 25 tok/sec TG with `-ngl 59`...
-Not sure what to try next other than dig in deeper to how `build_inp_KQ_mask()` and `llm_build_kv` have changed with mainline refactors or something...
+Not sure what to try next other than dig in deeper to how `build_inp_KQ_mask()` and `llm_build_kv()` have changed with mainline refactors or something...
---
-👤 **ikawrakow** commented the **2025-04-25** at **14:43:20**:
+👤 **ikawrakow** commented on **2025-04-25** at **14:43:20**
> Could be that I made a mistake in the build_glm4() the attention cgraph? Interestingly this invocation seems to works fine too:
-If you make a mistake with building the graph, this invocation wouldn't be working. If it works for all layers offloaded to the GPU except attention tensors and KV cache, it means there is a precision issue in the attention calculation on CUDA (on the CPU everything is computed with `fp32` precision).
+If you made a mistake with building the graph, this invocation wouldn't be working. If it works for all layers offloaded to the GPU except attention tensors and KV cache, it means there is a precision issue in the attention calculation on CUDA (on the CPU everything is computed with `fp32` precision).
---
-👤 **ubergarm** commented the **2025-04-25** at **14:48:42**:
+👤 **ubergarm** commented on **2025-04-25** at **14:48:42**
-I just noticed one more odd thing trying `-ot attn=CPU -ot .*=CUDA0` on `ik_llama.cpp` it prints this out on startup then crashes. There are two `__missing__` types per layer it seems...
+I just noticed one more odd thing trying `-ot attn=CPU -ot \.*=CUDA0` on `ik_llama.cpp` it prints this out on startup then crashes. There are two `__missing__` types per layer it seems...
```
-Tensor token_embd.weight buffer type overriden to CPU
-Tensor output_norm.weight buffer type overriden to CPU
-Tensor output.weight buffer type overriden to CPU
+Tensor token_embd.weight buffer type overriden to CUDA0
+Tensor output_norm.weight buffer type overriden to CUDA0
+Tensor output.weight buffer type overriden to CUDA0
Tensor blk.0.attn_norm.weight buffer type overriden to CPU
-Tensor __missing__ buffer type overriden to CPU
-Tensor __missing__ buffer type overriden to CPU
+Tensor __missing__ buffer type overriden to CUDA0
+Tensor __missing__ buffer type overriden to CUDA0
Tensor blk.0.attn_q.weight buffer type overriden to CPU
Tensor blk.0.attn_k.weight buffer type overriden to CPU
Tensor blk.0.attn_v.weight buffer type overriden to CPU
@@ -123,16 +126,34 @@ Tensor blk.0.attn_q.bias buffer type overriden to CPU
Tensor blk.0.attn_k.bias buffer type overriden to CPU
Tensor blk.0.attn_v.bias buffer type overriden to CPU
Tensor blk.0.attn_output.weight buffer type overriden to CPU
-Tensor blk.0.post_attention_norm.weight buffer type overriden to CPU
-Tensor blk.0.ffn_norm.weight buffer type overriden to CPU
-Tensor blk.0.ffn_down.weight buffer type overriden to CPU
-Tensor blk.0.ffn_up.weight buffer type overriden to CPU
-Tensor blk.0.post_ffw_norm.weight buffer type overriden to CPU
+Tensor blk.0.post_attention_norm.weight buffer type overriden to CUDA0
+Tensor blk.0.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.0.ffn_down.weight buffer type overriden to CUDA0
+Tensor blk.0.ffn_up.weight buffer type overriden to CUDA0
+Tensor blk.0.post_ffw_norm.weight buffer type overriden to CUDA0
+```
+
+Running mainline with `-ot \.*=CPU` shows this:
+
+```
+tensor token_embd.weight buffer type overriden to CPU
+tensor output_norm.weight buffer type overriden to CPU
+tensor output.weight buffer type overriden to CPU
+tensor blk.0.attn_norm.weight buffer type overriden to CPU
+tensor blk.0.attn_q.weight buffer type overriden to CPU
+tensor blk.0.attn_k.weight buffer type overriden to CPU
+tensor blk.0.attn_v.weight buffer type overriden to CPU
+tensor blk.0.attn_output.weight buffer type overriden to CPU
+tensor blk.0.post_attention_norm.weight buffer type overriden to CPU
+tensor blk.0.ffn_norm.weight buffer type overriden to CPU
+tensor blk.0.ffn_down.weight buffer type overriden to CPU
+tensor blk.0.ffn_up.weight buffer type overriden to CPU
+tensor blk.0.post_ffw_norm.weight buffer type overriden to CPU
```
---
-👤 **ikawrakow** commented the **2025-04-25** at **14:50:36**:
+👤 **ikawrakow** commented on **2025-04-25** at **14:50:36**
Try this: in the function `llm_build_kqv()`, on all lines that have
```
@@ -144,9 +165,9 @@ This will set the precision of the `K*Q` calculation to `fp32`, and hopefully fi
---
-👤 **ikawrakow** commented the **2025-04-25** at **14:57:27**:
+👤 **ikawrakow** commented on **2025-04-25** at **14:57:27**
-I see in mainline `llama.cpp` they have become tired of setting the `K*Q` calculation for `fp32` precision for specific models, and now have this
+I see in mainline `llama.cpp` they have become tired of setting the `K*Q` calculation to `fp32` precision for specific models, and now have this
```c++
ggml_tensor * kq = ggml_mul_mat(ctx0, k, q);
@@ -159,7 +180,7 @@ This is why mainline may be working for this model. I still refuse to set that g
---
-👤 **ubergarm** commented the **2025-04-25** at **15:01:49**:
+👤 **ubergarm** commented on **2025-04-25** at **15:01:49**
> add || model.arch == LLM_ARCH_GLM4
@@ -182,27 +203,43 @@ I'll push this up.
Remaining questions:
* Is it okay that it does *not* work without `-fa` ?
-* I didn't test on other hardware nor include the latest mainline patch `ggml_mul_mat_set_prec(cur, GGML_PREC_F32);`.
+* I didn't test on other hardware nor include the latest mainline patch to set `ggml_mul_mat_set_prec(cur, GGML_PREC_F32);` for `llm_build_ffn()` `down` as well which might be important on some GPUs.
---
-👤 **ubergarm** commented the **2025-04-25** at **15:35:45**:
+👤 **ikawrakow** commented on **2025-04-25** at **15:10:02**
-Okay, so now without `-fa` it no longer produces `GGGGGGG` but it is back to this kinda stuff:
+You have only enabled `fp32` with FA. There are 2 more checks further down where you need to add the same, then it should also work without FA.
+You don't need the latest PR in mainline that sets `fp32` generically for the entire model. That has huge performance impact if you are not running a quantized model.
+
+---
+
+👤 **ubergarm** commented on **2025-04-25** at **15:35:45**
+
+Okay, so now without `-fa` it no longer produces `GGGGGGG` but it is back to this kinda stuff:
+
```
arsTab�.^rellsúng pacirc Pepper九龙每:室hlt一层avit学isi� cé个义项PMC\":列为� friAZalyrátolpanies�formanceInvoke9不足 Cornel Naz/Rkoz�koz�INFedomaidaporaidariantchartôaid
```
-I'll look for a reference, I thought I've seen others mentioning this kinda output before.
-
----
-
-👤 **ikawrakow** submitted a review the **2025-04-25** at **16:58:38**: 💬 `COMMENTED`
+I'll look for a reference, I thought I've seen others mentioning this kinda output before.
+
+Here is a reference where they [suggest using different batch size](https://huggingface.co/bartowski/THUDM_GLM-4-32B-0414-GGUF/discussions/5#68096886119a3e5577391a52) e.g. `-b 16 -ub 16` which I'll try now. Did *not* fix it.
+
+Another [reference here](https://github.com/ggml-org/llama.cpp/issues/12946#issuecomment-2804066978) which seems to suggest a recent python conversion update [here](https://github.com/ggml-org/llama.cpp/pull/13021).
+
+So maybe I'll double check my existing GGUF or try to convert my own GGUF using the most recent patch that updates some special tokens and sets
+```
+self.gguf_writer.add_rope_dimension_count(int(rope_dim * self.hparams.get("partial_rotary_factor", 0.5)))
+```
+
+Seems like bartowski used a version of mainline to convert that did include this PR hrmm..
+
---
-👤 **ikawrakow** commented during a code review the **2025-04-25** at **16:58:38** on `src/llama.cpp`:
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-04-25** at **16:58:38**
Add
```c++
@@ -214,11 +251,7 @@ after line 9515
---
-👤 **ikawrakow** submitted a review the **2025-04-25** at **17:01:07**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-04-25** at **17:01:07** on `src/llama.cpp`:
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-04-25** at **17:01:07**
Add
```c++
@@ -230,7 +263,7 @@ after line 9475
---
-👤 **ikawrakow** commented the **2025-04-25** at **17:07:32**:
+👤 **ikawrakow** commented on **2025-04-25** at **17:07:32**
I don't think any of the suggestions you are finding around the Internet are going to help. Just think about it:
* It works on the CPU (calculation done with `fp32`)
@@ -241,7 +274,7 @@ The only logical conclusion from these 3 observations is that you also need to s
---
-👤 **ubergarm** commented the **2025-04-25** at **18:31:20**:
+👤 **ubergarm** commented on **2025-04-25** at **18:31:20**
Thanks, I appreciate you helping me learn on this.
@@ -252,6 +285,7 @@ I tried setting precision to fp32 as you describe, but still get the same gibber
The patch you suggested above.
I went ahead and tried this and it seems to be taking the `kqv` path and not the `kqv_i` but still giving same gibberish.
+
```
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -276,13 +310,14 @@ I went ahead and tried this and it seems to be taking the `kqv` path and not the
kqv = kqv_i;
} else {
```
+
I'll dig into the differences between mainline non flash attention and this forks non flash attention path more to see if anything else sticks out to me.
---
-👤 **ikawrakow** commented the **2025-04-26** at **06:02:40**:
+👤 **ikawrakow** commented on **2025-04-26** at **06:02:40**
> Just to be clear I'm getting the gibberish output without -fa on both CPU only as well as CUDA backend.
@@ -297,7 +332,7 @@ In mainline they have reorganized how attention is built. Reshaping `V` to 3D at
---
-👤 **ikawrakow** commented the **2025-04-26** at **07:19:17**:
+👤 **ikawrakow** commented on **2025-04-26** at **07:19:17**
Here a quick CPU only `sweep-bench` performance comparison to mainline for the [bartowski/THUDM_GLM-Z1-32B-0414-GGUF](https://huggingface.co/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/blob/main/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf) model you are using
@@ -331,7 +366,7 @@ Here a quick CPU only `sweep-bench` performance comparison to mainline for the [
```
./bin/llama-sweep-bench -m THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf -c 8192 -t 32 -fa -ctk q8_0 -ctv q8_0 -rtr
```
-(but I needed the changes in PR #349 to make FA work on the CPU).
+(but I needed the changes in PR [#349](https://github.com/ikawrakow/ik_llama.cpp/issues/349) to make FA work on the CPU).
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
@@ -354,7 +389,7 @@ Here a quick CPU only `sweep-bench` performance comparison to mainline for the [
---
-👤 **ubergarm** commented the **2025-04-26** at **14:53:00**:
+👤 **ubergarm** commented on **2025-04-26** at **14:53:00**
Sweeet that fixes up the non-flash-attention case! This model is quite efficient, I just ran it with 128k context and only using `21194MiB` VRAM ?? Looking forward to some testing and benchmarking soon.
@@ -364,23 +399,33 @@ Thanks again really appreciate your time looking at this! Cheers!
---
-👤 **ubergarm** commented the **2025-04-26** at **15:23:46**:
+👤 **ikawrakow** commented on **2025-04-26** at **15:00:17**
+
+> This model is quite efficient, I just ran it with 128k context and only using 21194MiB VRAM ??
+
+Yes, it has a very high GQA factor of 24, so the KV entries per token are very small. This makes the attention portion very efficient, so the decline of TG speed with context in the KV cache is very slow (less than 10% when going from 0 to 8k tokens as per above table). So, it is a model worth having.
+
+Please make it ready and let's merge it.
+
+---
+
+👤 **ubergarm** commented on **2025-04-26** at **15:23:46**
Okay got it rebased, gonna force push it up after quick final test!!!
---
-👤 **ikawrakow** submitted a review the **2025-04-26** at **15:33:46**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-04-26** at **15:33:46**
---
-👤 **ubergarm** commented the **2025-04-26** at **15:41:12**:
+👤 **ubergarm** commented on **2025-04-26** at **15:41:12**
Yaay!! Feels good to finally get that model working haha... Thanks again for your patience and guidance! Have a g'night!
---
-👤 **ubergarm** commented the **2025-04-26** at **20:04:37**:
+👤 **ubergarm** commented on **2025-04-26** at **20:04:37**
> Here a quick CPU only sweep-bench performance comparison to mainline for the [bartowski/THUDM_GLM-Z1-32B-0414-GGUF](https://huggingface.co/bartowski/THUDM_GLM-Z1-32B-0414-GGUF/blob/main/THUDM_GLM-Z1-32B-0414-IQ4_XS.gguf) model you are using
@@ -396,7 +441,7 @@ I followed your lead and ran some `llama-sweep-bench` comparisons too. My CPU-on
-Logs
+👈 Logs
## `llama.cpp@558a76`
Plus github.com/ubergarm/llama.cpp `ug/port-sweep-bench` branch.
@@ -683,7 +728,7 @@ llama_new_context_with_model: graph splits = 1
main: n_kv_max = 5120, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16
============ Repacked 367 tensors
-```
+
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 6.188 | 82.74 | 25.659 | 4.99 |
@@ -696,16 +741,19 @@ main: n_kv_max = 5120, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_lay
| 512 | 128 | 3584 | 6.815 | 75.12 | 26.110 | 4.90 |
| 512 | 128 | 4096 | 6.902 | 74.18 | 26.160 | 4.89 |
| 512 | 128 | 4608 | 7.007 | 73.07 | 26.232 | 4.88 |
+```
## my CUDA GPU test
+This is *with* flash attention enabled.
+

-Logs
+👈 Logs
## `llama.cpp@558a76`
Plus github.com/ubergarm/llama.cpp `ug/port-sweep-bench` branch.
@@ -1063,7 +1111,6 @@ llama_new_context_with_model: graph nodes = 1592
llama_new_context_with_model: graph splits = 2
main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 16, n_threads_batch = 16
-```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
@@ -1131,24 +1178,30 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
| 512 | 128 | 31232 | 0.951 | 538.41 | 6.633 | 19.30 |
| 512 | 128 | 31744 | 0.961 | 532.89 | 6.693 | 19.12 |
| 512 | 128 | 32256 | 0.970 | 527.77 | 6.744 | 18.98 |
-
+```
-I didn't yet try comparing running with non-flash-attention.
+## my CUDA GPU no fa
+
+This is *without* flash attention.
+
+
+
+Didn't grab the logs as it was a quick sanity check basically. If you want them let me know and I can do a longer run etc.
---
-👤 **saood06** commented the **2025-04-27** at **08:48:11**:
+👤 **saood06** commented on **2025-04-27** at **08:48:11**
> > This model is quite efficient, I just ran it with 128k context and only using 21194MiB VRAM ??
>
> Yes, it has a very high GQA factor of 24
-This caught my eye, and was glad they had a prior work dedicated to long context training of LLMs, that they referenced in the GQA part of their technical report, [LongAlign: A Recipe for Long Context Alignment of Large Language Models](https://arxiv.org/abs/2401.18058)
+This caught my eye, and looked into it and found they had a prior work dedicated to long context training of LLMs that they say "(Cf [LongAlign: A Recipe for Long Context Alignment of Large Language Models](https://arxiv.org/abs/2401.18058) for technical details)" in the GQA part of their technical report
---
-👤 **saood06** commented the **2025-05-08** at **22:44:40**:
+👤 **saood06** commented on **2025-05-08** at **22:44:40**
I found [this](https://adamniederer.com/blog/llm-context-benchmarks.html) where someone uses NoLiMa to test the long context performance and they did notice lower performance (which I believe is because of the very high GQA factor).
\ No newline at end of file
diff --git a/github-data/pull_requests/346 - Fix FA on ARM CPUs.md b/github-data/pull_requests/346 - Fix FA on ARM CPUs.md
index f22bbe824..73530b0f1 100644
--- a/github-data/pull_requests/346 - Fix FA on ARM CPUs.md
+++ b/github-data/pull_requests/346 - Fix FA on ARM CPUs.md
@@ -1,13 +1,16 @@
-### 🐛 [#346](https://github.com/ikawrakow/ik_llama.cpp/pull/346) - Fix FA on ARM CPUs
+## 🔀 [Pull Request #346](https://github.com/ikawrakow/ik_llama.cpp/pull/346) - Fix FA on ARM CPUs
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_arm_fa` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-25 |
| **Updated** | 2025-04-25 |
+| **Merged** | 2025-04-25 |
---
-#### Description
+## 📄 Description
-I broke it with PR #332.
\ No newline at end of file
+I broke it with PR [#332](https://github.com/ikawrakow/ik_llama.cpp/issues/332).
\ No newline at end of file
diff --git a/github-data/pull_requests/347 - Add ability to manually set arch flags.md b/github-data/pull_requests/347 - Add ability to manually set arch flags.md
index ddbdec3fd..b0f193901 100644
--- a/github-data/pull_requests/347 - Add ability to manually set arch flags.md
+++ b/github-data/pull_requests/347 - Add ability to manually set arch flags.md
@@ -1,13 +1,16 @@
-### 🔀 [#347](https://github.com/ikawrakow/ik_llama.cpp/pull/347) - Add ability to manually set arch flags
+## 🔀 [Pull Request #347](https://github.com/ikawrakow/ik_llama.cpp/pull/347) - Add ability to manually set arch flags
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/arch_flags` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-25 |
| **Updated** | 2025-04-25 |
+| **Merged** | 2025-04-25 |
---
-#### Description
+## 📄 Description
Hopefully that way one can work around compilers not honoring `-DGGML_NATIVE`
\ No newline at end of file
diff --git a/github-data/pull_requests/348 - Fix q4_1 and q5_1 on Arm.md b/github-data/pull_requests/348 - Fix q4_1 and q5_1 on Arm.md
index 5985aa395..84f9656e1 100644
--- a/github-data/pull_requests/348 - Fix q4_1 and q5_1 on Arm.md
+++ b/github-data/pull_requests/348 - Fix q4_1 and q5_1 on Arm.md
@@ -1,14 +1,17 @@
-### 🐛 [#348](https://github.com/ikawrakow/ik_llama.cpp/pull/348) - Fix q4_1 and q5_1 on Arm
+## 🔀 [Pull Request #348](https://github.com/ikawrakow/ik_llama.cpp/pull/348) - Fix q4_1 and q5_1 on Arm
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_q41_q51_arm` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-25 |
| **Updated** | 2025-04-25 |
+| **Merged** | 2025-04-25 |
---
-#### Description
+## 📄 Description
When I changed the `vet_dot_type` for `q8_1_x4` to `q8_2_x4` for the quants using `q8_1_x4` I forgot to also make the change for the `ARM_NEON` implementation. As a result `q4_1` and `q5_1` are currently broken. But because `q4_0/q5_0` will use `q4_1/q5_1` for a few `ffn_down` layers, `q4_0` and `q5_0` are broken as well.
diff --git a/github-data/pull_requests/349 - Fix division by zero bug.md b/github-data/pull_requests/349 - Fix division by zero bug.md
index 9971cce15..ca62c7faa 100644
--- a/github-data/pull_requests/349 - Fix division by zero bug.md
+++ b/github-data/pull_requests/349 - Fix division by zero bug.md
@@ -1,14 +1,17 @@
-### 🐛 [#349](https://github.com/ikawrakow/ik_llama.cpp/pull/349) - Fix division by zero bug
+## 🔀 [Pull Request #349](https://github.com/ikawrakow/ik_llama.cpp/pull/349) - Fix division by zero bug
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_div_zero` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-26 |
| **Updated** | 2025-04-26 |
+| **Merged** | 2025-04-26 |
---
-#### Description
+## 📄 Description
The bug was in the calculation of number of work items to use when computing FA on the CPU. In my case (maximum of 32 threads) it triggered with the GLM-4 model that has an unusually small number of KV heads (just 2). But I guess it can also trigger with a larger number of threads for more common numbers of KV heads.
diff --git a/github-data/pull_requests/35 - Fix Zen4 Flash Attention.md b/github-data/pull_requests/35 - Fix Zen4 Flash Attention.md
index e5883d5e1..92607928d 100644
--- a/github-data/pull_requests/35 - Fix Zen4 Flash Attention.md
+++ b/github-data/pull_requests/35 - Fix Zen4 Flash Attention.md
@@ -1,15 +1,18 @@
-### 🐛 [#35](https://github.com/ikawrakow/ik_llama.cpp/pull/35) - Fix Zen4 Flash Attention
+## 🔀 [Pull Request #35](https://github.com/ikawrakow/ik_llama.cpp/pull/35) - Fix Zen4 Flash Attention
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_flash_attn` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-02 |
| **Updated** | 2024-09-02 |
+| **Merged** | 2024-09-02 |
---
-#### Description
+## 📄 Description
-Closes #34
+Closes [#34](https://github.com/ikawrakow/ik_llama.cpp/issues/34)
Funny enough, the bug was not in the FA implementation but in the way I was calling `iqk_flash_attn_noalibi` from `ggml`.
\ No newline at end of file
diff --git a/github-data/pull_requests/351 - CPU FA improvements.md b/github-data/pull_requests/351 - CPU FA improvements.md
index d71f0b87b..4f0f4a219 100644
--- a/github-data/pull_requests/351 - CPU FA improvements.md
+++ b/github-data/pull_requests/351 - CPU FA improvements.md
@@ -1,14 +1,17 @@
-### 🔀 [#351](https://github.com/ikawrakow/ik_llama.cpp/pull/351) - CPU FA improvements
+## 🔀 [Pull Request #351](https://github.com/ikawrakow/ik_llama.cpp/pull/351) - CPU FA improvements
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fattn_work_buffer` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-28 |
| **Updated** | 2025-04-29 |
+| **Merged** | 2025-04-29 |
---
-#### Description
+## 📄 Description
This PR further improves CPU FA performance for GQA models. It does not affect FlashMLA (relevant for DeepSeek models), but the same strategy could be applied also there. I have left this for a future PR.
diff --git a/github-data/pull_requests/352 - Update README.md.md b/github-data/pull_requests/352 - Update README.md.md
index 7432d2889..a12e488d2 100644
--- a/github-data/pull_requests/352 - Update README.md.md
+++ b/github-data/pull_requests/352 - Update README.md.md
@@ -1,15 +1,24 @@
-### 🔀 [#352](https://github.com/ikawrakow/ik_llama.cpp/pull/352) - Update README.md
+## 🔀 [Pull Request #352](https://github.com/ikawrakow/ik_llama.cpp/pull/352) - Update README.md
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ikawrakow-patch-1-1` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-28 |
| **Updated** | 2025-04-30 |
+| **Merged** | 2025-04-30 |
---
-#### 💬 Conversation
+## 📄 Description
-👤 **saood06** commented the **2025-04-29** at **01:04:42**:
+_No description provided._
+
+---
+
+## 💬 Conversation
+
+👤 **saood06** commented on **2025-04-29** at **01:04:42**
LGTM, only thing that might be worth adding to the News section is the Android/termux fix since the efficiency of this repo is well suited for mobile devices.
\ No newline at end of file
diff --git a/github-data/pull_requests/355 - Apply Qwen3 PR from llama.cpp.md b/github-data/pull_requests/355 - Apply Qwen3 PR from llama.cpp.md
index c11500476..ab69cf5c6 100644
--- a/github-data/pull_requests/355 - Apply Qwen3 PR from llama.cpp.md
+++ b/github-data/pull_requests/355 - Apply Qwen3 PR from llama.cpp.md
@@ -1,14 +1,17 @@
-### 🔀 [#355](https://github.com/ikawrakow/ik_llama.cpp/pull/355) - Apply Qwen3 PR from llama.cpp
+## 🔀 [Pull Request #355](https://github.com/ikawrakow/ik_llama.cpp/pull/355) - Apply Qwen3 PR from llama.cpp
| **Author** | `bharrisau` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `qwen3` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-29 |
| **Updated** | 2025-04-29 |
+| **Merged** | 2025-04-29 |
---
-#### Description
+## 📄 Description
I've just ported over the Qwen3 PR. So it is missing the layers/model type, and does not have tests, etc.
@@ -21,9 +24,9 @@ I've just ported over the Qwen3 PR. So it is missing the layers/model type, and
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-04-29** at **06:55:54**:
+👤 **ikawrakow** commented on **2025-04-29** at **06:55:54**
Thanks! I was just in the process of doing the same.
@@ -31,23 +34,29 @@ Does `convert_hf_gguf.py` work with this model?
---
-👤 **ikawrakow** submitted a review the **2025-04-29** at **07:06:58**: ✅ `APPROVED`
+👤 **ikawrakow** started a conversation on `gguf-py/gguf/constants.py` on **2025-04-29** at **07:05:07**
+
+You are missing the `QWEN3` and `QWEN3MOE` enum entries further up in `class MODEL_ARCH(IntEnum)`
---
-👤 **ikawrakow** commented the **2025-04-29** at **08:02:04**:
+👤 **ikawrakow** approved this pull request ✅ on **2025-04-29** at **07:06:58**
+
+---
+
+👤 **ikawrakow** commented on **2025-04-29** at **08:02:04**
OK, I'll merge this and will add the missing enum entries separately.
---
-👤 **bharrisau** commented the **2025-04-29** at **08:28:30**:
+👤 **bharrisau** commented on **2025-04-29** at **08:28:30**
Ok - my other concern was that `LLM_ARCH_GRANITE = 46` line. Wasn't sure if I could remove that or not, but as I added more enum entries above it, having it hard coded didn't work.
---
-👤 **bharrisau** commented the **2025-04-29** at **08:29:34**:
+👤 **bharrisau** commented on **2025-04-29** at **08:29:34**
I've only tested that the MOE works.
@@ -80,6 +89,30 @@ The Prime Minister of Australia in 2008 was **Kevin Rudd**. He served as the 26t
---
-👤 **ikawrakow** commented the **2025-04-29** at **09:07:24**:
+👤 **bharrisau** commented on **2025-04-29** at **08:30:44**
+
+And a no-think example
+
+```
+# ./build/bin/llama-cli -m ~/models/Qwen3-30B-A3B-Q6_K.gguf --numa distribute -t 16 --prompt "<|im_start|>system\nWho was prime minister of Australia in 2008?<|im_end|>\n<|im_start|>assistant\n\n\n\n\n" -fa -fmoe -c 16384 -ctk q8_0 --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0
+system
+Who was prime minister of Australia in 2008?
+assistant
+
+
+
+
+In 2008, the Prime Minister of Australia was **Kevin Rudd**.
+
+He took office on **December 3, 2007**, after leading the Australian Labor Party to a victory in the federal election. He served as Prime Minister until **June 24, 2010**, when he was replaced by **Julia Gillard**. [end of text]
+llama_print_timings: load time = 2157.68 ms
+llama_print_timings: sample time = 10.92 ms / 78 runs ( 0.14 ms per token, 7140.90 tokens per second)
+llama_print_timings: prompt eval time = 558.01 ms / 25 tokens ( 22.32 ms per token, 44.80 tokens per second) llama_print_timings: eval time = 10152.31 ms / 77 runs ( 131.85 ms per token, 7.58 tokens per second)
+llama_print_timings: total time = 10866.73 ms / 102 tokens
+```
+
+---
+
+👤 **ikawrakow** commented on **2025-04-29** at **09:07:24**
I also tested before merging and it seemed to be working correctly.
\ No newline at end of file
diff --git a/github-data/pull_requests/356 - Add missing enum values for qwen3 and qwen3moe.md b/github-data/pull_requests/356 - Add missing enum values for qwen3 and qwen3moe.md
index a978a0fec..2258ba5c7 100644
--- a/github-data/pull_requests/356 - Add missing enum values for qwen3 and qwen3moe.md
+++ b/github-data/pull_requests/356 - Add missing enum values for qwen3 and qwen3moe.md
@@ -1,7 +1,16 @@
-### 🔀 [#356](https://github.com/ikawrakow/ik_llama.cpp/pull/356) - Add missing enum values for qwen3 and qwen3moe
+## 🔀 [Pull Request #356](https://github.com/ikawrakow/ik_llama.cpp/pull/356) - Add missing enum values for qwen3 and qwen3moe
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/add_missing_enum_values_qwen3` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-29 |
-| **Updated** | 2025-04-29 |
\ No newline at end of file
+| **Updated** | 2025-04-29 |
+| **Merged** | 2025-04-29 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/36 - Zen4 Flash Attnetion 2.md b/github-data/pull_requests/36 - Zen4 Flash Attnetion 2.md
index b9f564b85..d781bae33 100644
--- a/github-data/pull_requests/36 - Zen4 Flash Attnetion 2.md
+++ b/github-data/pull_requests/36 - Zen4 Flash Attnetion 2.md
@@ -1,16 +1,19 @@
-### 🔀 [#36](https://github.com/ikawrakow/ik_llama.cpp/pull/36) - Zen4 Flash Attnetion 2
+## 🔀 [Pull Request #36](https://github.com/ikawrakow/ik_llama.cpp/pull/36) - Zen4 Flash Attnetion 2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/zen4_flash_attn_2` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-03 |
| **Updated** | 2024-09-04 |
+| **Merged** | 2024-09-04 |
---
-#### Description
+## 📄 Description
-This PR is a follow up on #32 and adds the ability to use quantized K- and V-cache in the flash attention (FA) kernel. `Q4_0`, `Q4_1` and `Q8_0` are supported as cache quantization types. It is trivial to add additional types, but the implementation is templated, so number of template instantiations grows quadraticly with the number of supported quantization types, so I decided to settle for these 3 types for now.
+This PR is a follow up on [#32](https://github.com/ikawrakow/ik_llama.cpp/issues/32) and adds the ability to use quantized K- and V-cache in the flash attention (FA) kernel. `Q4_0`, `Q4_1` and `Q8_0` are supported as cache quantization types. It is trivial to add additional types, but the implementation is templated, so number of template instantiations grows quadraticly with the number of supported quantization types, so I decided to settle for these 3 types for now.
Performance is slightly lower than `fp16` cache (see graph below), so main use case is KV-cache size reduction for very large context lengths. Still, unlike mainline `llama.cpp`, performance remains strictly above no-FA.
diff --git a/github-data/pull_requests/360 - Fix IQK_FA_ALL_QUANTS on AVX2.md b/github-data/pull_requests/360 - Fix IQK_FA_ALL_QUANTS on AVX2.md
index f01dd15ff..1815e3b5b 100644
--- a/github-data/pull_requests/360 - Fix IQK_FA_ALL_QUANTS on AVX2.md
+++ b/github-data/pull_requests/360 - Fix IQK_FA_ALL_QUANTS on AVX2.md
@@ -1,13 +1,16 @@
-### 🐛 [#360](https://github.com/ikawrakow/ik_llama.cpp/pull/360) - Fix IQK_FA_ALL_QUANTS on AVX2
+## 🔀 [Pull Request #360](https://github.com/ikawrakow/ik_llama.cpp/pull/360) - Fix IQK_FA_ALL_QUANTS on AVX2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_358` |
+| **Target Branch** | `main` |
| **Created** | 2025-04-30 |
| **Updated** | 2025-04-30 |
+| **Merged** | 2025-04-30 |
---
-#### Description
+## 📄 Description
-Fixes #358
\ No newline at end of file
+Fixes [#358](https://github.com/ikawrakow/ik_llama.cpp/issues/358)
\ No newline at end of file
diff --git a/github-data/pull_requests/364 - Fix FA bug on AVX2.md b/github-data/pull_requests/364 - Fix FA bug on AVX2.md
index f69ad9638..79940891c 100644
--- a/github-data/pull_requests/364 - Fix FA bug on AVX2.md
+++ b/github-data/pull_requests/364 - Fix FA bug on AVX2.md
@@ -1,25 +1,28 @@
-### 🐛 [#364](https://github.com/ikawrakow/ik_llama.cpp/pull/364) - Fix FA bug on AVX2
+## 🔀 [Pull Request #364](https://github.com/ikawrakow/ik_llama.cpp/pull/364) - Fix FA bug on AVX2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_fa_avx2_bug` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-01 |
| **Updated** | 2025-05-02 |
+| **Merged** | 2025-05-02 |
---
-#### Description
+## 📄 Description
The bug was quite subtle: we have `Q8_0` K-cache, so we need to quantize the `Q` tensor to the appropriate quantization type (`vec_dot_type` in `ggml` lingo) that differs from platform to platform. We pick correctly the type. But then we notice that it is a GQA case, so we repack the K tensor to `Q8_0_R8` for faster processing, but still use the `vec_dot_type` selected based on `K` being `Q8_0`. On `Zen4` and `ARM_NEON` the `vet_dot_type` is the same, so everything works fine. But on `AVX2` the `vec_dot_type` changes, and we get gibberish (or even an assert for a NaN value).
-The bug was introduced in my recent CPU FA optimization round (#351)
+The bug was introduced in my recent CPU FA optimization round ([#351](https://github.com/ikawrakow/ik_llama.cpp/issues/351))
-Closes #363
+Closes [#363](https://github.com/ikawrakow/ik_llama.cpp/issues/363)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-02** at **05:09:05**:
+👤 **ikawrakow** commented on **2025-05-02** at **05:09:05**
-It looks like this does not fully fix #363, but I'll merge it to not have 2 real bugs stay on the main branch.
\ No newline at end of file
+It looks like this does not fully fix [#363](https://github.com/ikawrakow/ik_llama.cpp/issues/363), but I'll merge it to not have 2 real bugs stay on the main branch.
\ No newline at end of file
diff --git a/github-data/pull_requests/366 - Add support for new Bitnet model architecture name.md b/github-data/pull_requests/366 - Add support for new Bitnet model architecture name.md
index f92185aba..7ebccd35d 100644
--- a/github-data/pull_requests/366 - Add support for new Bitnet model architecture name.md
+++ b/github-data/pull_requests/366 - Add support for new Bitnet model architecture name.md
@@ -1,19 +1,22 @@
-### 🔀 [#366](https://github.com/ikawrakow/ik_llama.cpp/pull/366) - Add support for new Bitnet model architecture name
+## 🔀 [Pull Request #366](https://github.com/ikawrakow/ik_llama.cpp/pull/366) - Add support for new Bitnet model architecture name
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/bitnet_name_update` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-02 |
| **Updated** | 2025-05-02 |
+| **Merged** | 2025-05-02 |
---
-#### Description
+## 📄 Description
-Fixes #365
+Fixes [#365](https://github.com/ikawrakow/ik_llama.cpp/issues/365)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-05-02** at **05:07:17**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-02** at **05:07:17**
\ No newline at end of file
diff --git a/github-data/pull_requests/368 - Trying to fix iq1_s_r4_iq1_m_r4 quantization failure.md b/github-data/pull_requests/368 - Trying to fix iq1_s_r4_iq1_m_r4 quantization failure.md
deleted file mode 100644
index 7b25ee1ac..000000000
--- a/github-data/pull_requests/368 - Trying to fix iq1_s_r4_iq1_m_r4 quantization failure.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🐛 [#368](https://github.com/ikawrakow/ik_llama.cpp/pull/368) - Trying to fix iq1_s_r4/iq1_m_r4 quantization failure
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-05-03 |
-| **Updated** | 2025-05-03 |
-
----
-
-#### Description
-
-Closes #368
\ No newline at end of file
diff --git a/github-data/pull_requests/368 - Trying to fix iq1_s_r4iq1_m_r4 quantization failure.md b/github-data/pull_requests/368 - Trying to fix iq1_s_r4iq1_m_r4 quantization failure.md
new file mode 100644
index 000000000..cdbddbf96
--- /dev/null
+++ b/github-data/pull_requests/368 - Trying to fix iq1_s_r4iq1_m_r4 quantization failure.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #368](https://github.com/ikawrakow/ik_llama.cpp/pull/368) - Trying to fix iq1_s_r4/iq1_m_r4 quantization failure
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/try_fix_367` |
+| **Target Branch** | `main` |
+| **Created** | 2025-05-03 |
+| **Updated** | 2025-05-03 |
+| **Merged** | 2025-05-03 |
+
+---
+
+## 📄 Description
+
+Closes [#368](https://github.com/ikawrakow/ik_llama.cpp/issues/368)
\ No newline at end of file
diff --git a/github-data/pull_requests/369 - cmake_ force MSVC compiler charset to utf-8.md b/github-data/pull_requests/369 - cmake force MSVC compiler charset to utf-8.md
similarity index 66%
rename from github-data/pull_requests/369 - cmake_ force MSVC compiler charset to utf-8.md
rename to github-data/pull_requests/369 - cmake force MSVC compiler charset to utf-8.md
index 791eb81cf..8a5e62ba8 100644
--- a/github-data/pull_requests/369 - cmake_ force MSVC compiler charset to utf-8.md
+++ b/github-data/pull_requests/369 - cmake force MSVC compiler charset to utf-8.md
@@ -1,14 +1,17 @@
-### 🔀 [#369](https://github.com/ikawrakow/ik_llama.cpp/pull/369) - cmake: force MSVC compiler charset to utf-8
+## 🔀 [Pull Request #369](https://github.com/ikawrakow/ik_llama.cpp/pull/369) - cmake: force MSVC compiler charset to utf-8
| **Author** | `Gaolingx` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `main` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-03 |
| **Updated** | 2025-05-03 |
+| **Merged** | 2025-05-03 |
---
-#### Description
+## 📄 Description
This commit is to prevent `tests\test-grammar-integration.cpp(483,13): error C2001: newline in constant` showing up in non-UTF8 windows system while using MSVC.
@@ -22,21 +25,15 @@ This commit is to prevent `tests\test-grammar-integration.cpp(483,13): error C20
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-05-03** at **12:26:22**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-03** at **12:26:22**
LGTM, but I cannot test. It would be useful if at least one other person tested before we merge.
---
-👤 **ikawrakow** submitted a review the **2025-05-03** at **12:26:22**: ✅ `APPROVED`
-
-LGTM, but I cannot test.
-
----
-
-👤 **Gaolingx** commented the **2025-05-03** at **12:54:45**:
+👤 **Gaolingx** commented on **2025-05-03** at **12:54:45**
> LGTM, but I cannot test. It would be useful if at least one other person tested before we merge.
diff --git a/github-data/pull_requests/37 - Performance improvements for legacy quants on ARM_NEON.md b/github-data/pull_requests/37 - Performance improvements for legacy quants on ARM_NEON.md
index efc7e9f39..01615d66c 100644
--- a/github-data/pull_requests/37 - Performance improvements for legacy quants on ARM_NEON.md
+++ b/github-data/pull_requests/37 - Performance improvements for legacy quants on ARM_NEON.md
@@ -1,14 +1,17 @@
-### 🔀 [#37](https://github.com/ikawrakow/ik_llama.cpp/pull/37) - Performance improvements for legacy quants on ARM_NEON
+## 🔀 [Pull Request #37](https://github.com/ikawrakow/ik_llama.cpp/pull/37) - Performance improvements for legacy quants on ARM_NEON
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/neon_improve_legacy_quants` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-03 |
| **Updated** | 2024-09-04 |
+| **Merged** | 2024-09-04 |
---
-#### Description
+## 📄 Description
If we process 2 rows in the left matrix at a time we get in the range of 20% performance boost for PP-512 (except for `Q8_0`, where performance was already higher than the other quants). The table summarizes the results or LLaMA-3.1-8B on an M2-Max CPU. As I like keeping track of how we perform relative to mainline `llama.cpp`, the table includes results for the current `llama.cpp` build (`69a480a (3660)`). tinyBLAS is enabled in `llama.cpp`, so the 33% (`Q4_0`) or 16.6% (`Q8_0`) improvement is compared to tinyBLAS, which does not provide implementation for `Q4_1`, `Q5_0` and `Q5_1` (and correspondingly the performance gap is much larger).
diff --git a/github-data/pull_requests/370 - CUDA_ faster FA TG for GQA models.md b/github-data/pull_requests/370 - CUDA faster FA TG for GQA models.md
similarity index 83%
rename from github-data/pull_requests/370 - CUDA_ faster FA TG for GQA models.md
rename to github-data/pull_requests/370 - CUDA faster FA TG for GQA models.md
index 7cc3bae63..82c860e24 100644
--- a/github-data/pull_requests/370 - CUDA_ faster FA TG for GQA models.md
+++ b/github-data/pull_requests/370 - CUDA faster FA TG for GQA models.md
@@ -1,14 +1,17 @@
-### 🔀 [#370](https://github.com/ikawrakow/ik_llama.cpp/pull/370) - CUDA: faster FA TG for GQA models
+## 🔀 [Pull Request #370](https://github.com/ikawrakow/ik_llama.cpp/pull/370) - CUDA: faster FA TG for GQA models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fattn_mma` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-03 |
| **Updated** | 2025-05-04 |
+| **Merged** | 2025-05-04 |
---
-#### Description
+## 📄 Description
This PR improves CUDA FA performance for token generation by a significant margin.
@@ -31,9 +34,9 @@ My GPU is `ADA_LOVELACE`, so the MMA kernel does not get invoked for TG. But bas
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-05-03** at **20:24:02**:
+👤 **ubergarm** commented on **2025-05-03** at **20:24:02**
Wow, I'll let the benchmarks speak for themselves.
@@ -692,7 +695,7 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
---
-👤 **ubergarm** commented the **2025-05-03** at **21:39:11**:
+👤 **ubergarm** commented on **2025-05-03** at **21:39:11**
I suppose I must let this benchmark speak for itself as well.
@@ -702,12 +705,680 @@ I suppose I must let this benchmark speak for itself as well.

+
+
+👈Logs
+
+## `llama.cpp/master@36667c8e` + `ug/port-sweep-bench@d541533a`
+```
+cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF
+cmake --build build --config Release -j $(nproc)
+
+CUDA_VISIBLE_DEVICE=0 \
+./build/bin/llama-sweep-bench \
+ --model /mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf \
+ -fa \
+ -ctk f16 -ctv f16 \
+ -c 32768 \
+ -ngl 99 \
+ --threads 1
+
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 1 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
+build: 5274 (d541533a) with cc (GCC) 14.2.1 20250128 for x86_64-pc-linux-gnu
+llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090 Ti) - 23266 MiB free
+llama_model_loader: loaded meta data with 41 key-value pairs and 579 tensors from /mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3moe
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B
+llama_model_loader: - kv 3: general.basename str = Qwen3
+llama_model_loader: - kv 4: general.size_label str = 30B-A3B
+llama_model_loader: - kv 5: general.license str = apache-2.0
+llama_model_loader: - kv 6: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 30B A3B Base
+llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
+llama_model_loader: - kv 11: general.tags arr[str,1] = ["text-generation"]
+llama_model_loader: - kv 12: qwen3moe.block_count u32 = 48
+llama_model_loader: - kv 13: qwen3moe.context_length u32 = 32768
+llama_model_loader: - kv 14: qwen3moe.embedding_length u32 = 2048
+llama_model_loader: - kv 15: qwen3moe.feed_forward_length u32 = 6144
+llama_model_loader: - kv 16: qwen3moe.attention.head_count u32 = 32
+llama_model_loader: - kv 17: qwen3moe.attention.head_count_kv u32 = 4
+llama_model_loader: - kv 18: qwen3moe.rope.freq_base f32 = 1000000.000000
+llama_model_loader: - kv 19: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 20: qwen3moe.expert_used_count u32 = 8
+llama_model_loader: - kv 21: qwen3moe.attention.key_length u32 = 128
+llama_model_loader: - kv 22: qwen3moe.attention.value_length u32 = 128
+llama_model_loader: - kv 23: qwen3moe.expert_count u32 = 128
+llama_model_loader: - kv 24: qwen3moe.expert_feed_forward_length u32 = 768
+llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151643
+llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 151643
+llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 34: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+llama_model_loader: - kv 35: general.quantization_version u32 = 2
+llama_model_loader: - kv 36: general.file_type u32 = 15
+llama_model_loader: - kv 37: quantize.imatrix.file str = /models_out/Qwen3-30B-A3B-GGUF/Qwen_Q...
+llama_model_loader: - kv 38: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt
+llama_model_loader: - kv 39: quantize.imatrix.entries_count i32 = 384
+llama_model_loader: - kv 40: quantize.imatrix.chunks_count i32 = 209
+llama_model_loader: - type f32: 241 tensors
+llama_model_loader: - type q8_0: 48 tensors
+llama_model_loader: - type q4_K: 193 tensors
+llama_model_loader: - type q5_K: 48 tensors
+llama_model_loader: - type q6_K: 49 tensors
+print_info: file format = GGUF V3 (latest)
+print_info: file type = Q4_K - Medium
+print_info: file size = 17.35 GiB (4.88 BPW)
+load: special tokens cache size = 26
+load: token to piece cache size = 0.9311 MB
+print_info: arch = qwen3moe
+print_info: vocab_only = 0
+print_info: n_ctx_train = 32768
+print_info: n_embd = 2048
+print_info: n_layer = 48
+print_info: n_head = 32
+print_info: n_head_kv = 4
+print_info: n_rot = 128
+print_info: n_swa = 0
+print_info: n_swa_pattern = 1
+print_info: n_embd_head_k = 128
+print_info: n_embd_head_v = 128
+print_info: n_gqa = 8
+print_info: n_embd_k_gqa = 512
+print_info: n_embd_v_gqa = 512
+print_info: f_norm_eps = 0.0e+00
+print_info: f_norm_rms_eps = 1.0e-06
+print_info: f_clamp_kqv = 0.0e+00
+print_info: f_max_alibi_bias = 0.0e+00
+print_info: f_logit_scale = 0.0e+00
+print_info: f_attn_scale = 0.0e+00
+print_info: n_ff = 6144
+print_info: n_expert = 128
+print_info: n_expert_used = 8
+print_info: causal attn = 1
+print_info: pooling type = 0
+print_info: rope type = 2
+print_info: rope scaling = linear
+print_info: freq_base_train = 1000000.0
+print_info: freq_scale_train = 1
+print_info: n_ctx_orig_yarn = 32768
+print_info: rope_finetuned = unknown
+print_info: ssm_d_conv = 0
+print_info: ssm_d_inner = 0
+print_info: ssm_d_state = 0
+print_info: ssm_dt_rank = 0
+print_info: ssm_dt_b_c_rms = 0
+print_info: model type = 30B.A3B
+print_info: model params = 30.53 B
+print_info: general.name = Qwen3 30B A3B
+print_info: n_ff_exp = 768
+print_info: vocab type = BPE
+print_info: n_vocab = 151936
+print_info: n_merges = 151387
+print_info: BOS token = 151643 '<|endoftext|>'
+print_info: EOS token = 151645 '<|im_end|>'
+print_info: EOT token = 151645 '<|im_end|>'
+print_info: PAD token = 151643 '<|endoftext|>'
+print_info: LF token = 198 'Ċ'
+print_info: FIM PRE token = 151659 '<|fim_prefix|>'
+print_info: FIM SUF token = 151661 '<|fim_suffix|>'
+print_info: FIM MID token = 151660 '<|fim_middle|>'
+print_info: FIM PAD token = 151662 '<|fim_pad|>'
+print_info: FIM REP token = 151663 '<|repo_name|>'
+print_info: FIM SEP token = 151664 '<|file_sep|>'
+print_info: EOG token = 151643 '<|endoftext|>'
+print_info: EOG token = 151645 '<|im_end|>'
+print_info: EOG token = 151662 '<|fim_pad|>'
+print_info: EOG token = 151663 '<|repo_name|>'
+print_info: EOG token = 151664 '<|file_sep|>'
+print_info: max token length = 256
+load_tensors: loading model tensors, this can take a while... (mmap = true)
+load_tensors: offloading 48 repeating layers to GPU
+load_tensors: offloading output layer to GPU
+load_tensors: offloaded 49/49 layers to GPU
+load_tensors: CUDA0 model buffer size = 17596.43 MiB
+load_tensors: CPU_Mapped model buffer size = 166.92 MiB
+....................................................................................................
+llama_context: constructing llama_context
+llama_context: n_seq_max = 1
+llama_context: n_ctx = 32768
+llama_context: n_ctx_per_seq = 32768
+llama_context: n_batch = 2048
+llama_context: n_ubatch = 512
+llama_context: causal_attn = 1
+llama_context: flash_attn = 1
+llama_context: freq_base = 1000000.0
+llama_context: freq_scale = 1
+llama_context: CUDA_Host output buffer size = 0.58 MiB
+llama_kv_cache_unified: kv_size = 32768, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1, padding = 256
+llama_kv_cache_unified: CUDA0 KV buffer size = 3072.00 MiB
+llama_kv_cache_unified: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
+llama_context: CUDA0 compute buffer size = 300.75 MiB
+llama_context: CUDA_Host compute buffer size = 68.01 MiB
+llama_context: graph nodes = 2935
+llama_context: graph splits = 2
+
+main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 1, n_threads_batch = 1
+```
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.220 | 2330.76 | 0.912 | 140.28 |
+| 512 | 128 | 512 | 0.187 | 2740.62 | 0.937 | 136.60 |
+| 512 | 128 | 1024 | 0.191 | 2687.03 | 0.936 | 136.82 |
+| 512 | 128 | 1536 | 0.193 | 2659.55 | 0.943 | 135.80 |
+| 512 | 128 | 2048 | 0.198 | 2589.49 | 0.956 | 133.82 |
+| 512 | 128 | 2560 | 0.198 | 2579.72 | 0.959 | 133.42 |
+| 512 | 128 | 3072 | 0.203 | 2526.46 | 0.966 | 132.51 |
+| 512 | 128 | 3584 | 0.206 | 2485.20 | 0.977 | 130.96 |
+| 512 | 128 | 4096 | 0.211 | 2431.89 | 0.986 | 129.88 |
+| 512 | 128 | 4608 | 0.214 | 2397.05 | 0.997 | 128.44 |
+| 512 | 128 | 5120 | 0.215 | 2383.06 | 1.008 | 127.04 |
+| 512 | 128 | 5632 | 0.217 | 2356.61 | 1.020 | 125.50 |
+| 512 | 128 | 6144 | 0.221 | 2316.21 | 1.031 | 124.16 |
+| 512 | 128 | 6656 | 0.223 | 2294.66 | 1.044 | 122.59 |
+| 512 | 128 | 7168 | 0.227 | 2255.38 | 1.054 | 121.48 |
+| 512 | 128 | 7680 | 0.229 | 2232.53 | 1.065 | 120.15 |
+| 512 | 128 | 8192 | 0.233 | 2192.89 | 1.076 | 118.99 |
+| 512 | 128 | 8704 | 0.236 | 2169.88 | 1.088 | 117.61 |
+| 512 | 128 | 9216 | 0.239 | 2141.79 | 1.101 | 116.25 |
+| 512 | 128 | 9728 | 0.242 | 2112.17 | 1.110 | 115.33 |
+| 512 | 128 | 10240 | 0.246 | 2083.21 | 1.122 | 114.10 |
+| 512 | 128 | 10752 | 0.248 | 2060.86 | 1.165 | 109.85 |
+| 512 | 128 | 11264 | 0.251 | 2042.36 | 1.183 | 108.22 |
+| 512 | 128 | 11776 | 0.254 | 2011.97 | 1.173 | 109.12 |
+| 512 | 128 | 12288 | 0.257 | 1988.57 | 1.176 | 108.87 |
+| 512 | 128 | 12800 | 0.261 | 1960.90 | 1.191 | 107.46 |
+| 512 | 128 | 13312 | 0.265 | 1931.58 | 1.190 | 107.53 |
+| 512 | 128 | 13824 | 0.267 | 1914.62 | 1.197 | 106.97 |
+| 512 | 128 | 14336 | 0.270 | 1894.12 | 1.204 | 106.34 |
+| 512 | 128 | 14848 | 0.273 | 1876.37 | 1.211 | 105.67 |
+| 512 | 128 | 15360 | 0.276 | 1852.41 | 1.216 | 105.26 |
+| 512 | 128 | 15872 | 0.278 | 1838.45 | 1.223 | 104.68 |
+| 512 | 128 | 16384 | 0.282 | 1817.50 | 1.230 | 104.07 |
+| 512 | 128 | 16896 | 0.285 | 1793.62 | 1.236 | 103.54 |
+| 512 | 128 | 17408 | 0.290 | 1767.23 | 1.244 | 102.90 |
+| 512 | 128 | 17920 | 0.292 | 1753.08 | 1.250 | 102.43 |
+| 512 | 128 | 18432 | 0.296 | 1728.78 | 1.258 | 101.72 |
+| 512 | 128 | 18944 | 0.298 | 1716.57 | 1.265 | 101.22 |
+| 512 | 128 | 19456 | 0.302 | 1695.52 | 1.270 | 100.76 |
+| 512 | 128 | 19968 | 0.304 | 1682.70 | 1.280 | 99.99 |
+| 512 | 128 | 20480 | 0.306 | 1670.67 | 1.286 | 99.55 |
+| 512 | 128 | 20992 | 0.310 | 1654.28 | 1.293 | 99.03 |
+| 512 | 128 | 21504 | 0.314 | 1632.49 | 1.332 | 96.06 |
+| 512 | 128 | 22016 | 0.316 | 1621.03 | 1.348 | 94.95 |
+| 512 | 128 | 22528 | 0.321 | 1592.75 | 1.342 | 95.35 |
+| 512 | 128 | 23040 | 0.324 | 1578.64 | 1.354 | 94.57 |
+| 512 | 128 | 23552 | 0.327 | 1563.81 | 1.361 | 94.04 |
+| 512 | 128 | 24064 | 0.330 | 1550.20 | 1.365 | 93.77 |
+| 512 | 128 | 24576 | 0.334 | 1535.10 | 1.369 | 93.49 |
+| 512 | 128 | 25088 | 0.337 | 1520.24 | 1.376 | 93.05 |
+| 512 | 128 | 25600 | 0.339 | 1508.72 | 1.380 | 92.74 |
+| 512 | 128 | 26112 | 0.342 | 1497.33 | 1.388 | 92.23 |
+| 512 | 128 | 26624 | 0.345 | 1482.00 | 1.396 | 91.68 |
+| 512 | 128 | 27136 | 0.349 | 1468.23 | 1.403 | 91.26 |
+| 512 | 128 | 27648 | 0.353 | 1452.44 | 1.408 | 90.93 |
+| 512 | 128 | 28160 | 0.355 | 1441.72 | 1.413 | 90.62 |
+| 512 | 128 | 28672 | 0.359 | 1427.98 | 1.423 | 89.94 |
+| 512 | 128 | 29184 | 0.361 | 1418.24 | 1.431 | 89.47 |
+| 512 | 128 | 29696 | 0.365 | 1403.28 | 1.435 | 89.17 |
+| 512 | 128 | 30208 | 0.367 | 1396.56 | 1.443 | 88.68 |
+| 512 | 128 | 30720 | 0.370 | 1383.91 | 1.452 | 88.13 |
+| 512 | 128 | 31232 | 0.374 | 1369.59 | 1.458 | 87.81 |
+| 512 | 128 | 31744 | 0.376 | 1361.13 | 1.465 | 87.38 |
+| 512 | 128 | 32256 | 0.380 | 1348.53 | 1.508 | 84.87 |
+
+## `ik_llama.cpp/main@ab7f694b`
+```
+cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF
+cmake --build build --config Release -j $(nproc)
+
+CUDA_VISIBLE_DEVICE=0 \
+./build/bin/llama-sweep-bench \
+ --model /mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf \
+ -fmoe \
+ -fa \
+ -ctk f16 -ctv f16 \
+ -c 32768 \
+ -ngl 99 \
+ --threads 1
+
+llama_model_loader: loaded meta data with 41 key-value pairs and 579 tensors from /mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3moe
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B
+llama_model_loader: - kv 3: general.basename str = Qwen3
+llama_model_loader: - kv 4: general.size_label str = 30B-A3B
+llama_model_loader: - kv 5: general.license str = apache-2.0
+llama_model_loader: - kv 6: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 30B A3B Base
+llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
+llama_model_loader: - kv 11: general.tags arr[str,1] = ["text-generation"]
+llama_model_loader: - kv 12: qwen3moe.block_count u32 = 48
+llama_model_loader: - kv 13: qwen3moe.context_length u32 = 32768
+llama_model_loader: - kv 14: qwen3moe.embedding_length u32 = 2048
+llama_model_loader: - kv 15: qwen3moe.feed_forward_length u32 = 6144
+llama_model_loader: - kv 16: qwen3moe.attention.head_count u32 = 32
+llama_model_loader: - kv 17: qwen3moe.attention.head_count_kv u32 = 4
+llama_model_loader: - kv 18: qwen3moe.rope.freq_base f32 = 1000000.000000
+llama_model_loader: - kv 19: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 20: qwen3moe.expert_used_count u32 = 8
+llama_model_loader: - kv 21: qwen3moe.attention.key_length u32 = 128
+llama_model_loader: - kv 22: qwen3moe.attention.value_length u32 = 128
+llama_model_loader: - kv 23: qwen3moe.expert_count u32 = 128
+llama_model_loader: - kv 24: qwen3moe.expert_feed_forward_length u32 = 768
+llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151643
+llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 151643
+llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 34: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+llama_model_loader: - kv 35: general.quantization_version u32 = 2
+llama_model_loader: - kv 36: general.file_type u32 = 15
+llama_model_loader: - kv 37: quantize.imatrix.file str = /models_out/Qwen3-30B-A3B-GGUF/Qwen_Q...
+llama_model_loader: - kv 38: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt
+llama_model_loader: - kv 39: quantize.imatrix.entries_count i32 = 384
+llama_model_loader: - kv 40: quantize.imatrix.chunks_count i32 = 209
+llama_model_loader: - type f32: 241 tensors
+llama_model_loader: - type q8_0: 48 tensors
+llama_model_loader: - type q4_K: 193 tensors
+llama_model_loader: - type q5_K: 48 tensors
+llama_model_loader: - type q6_K: 49 tensors
+llm_load_vocab: special tokens cache size = 26
+llm_load_vocab: token to piece cache size = 0.9311 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = qwen3moe
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 151936
+llm_load_print_meta: n_merges = 151387
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 32768
+llm_load_print_meta: n_embd = 2048
+llm_load_print_meta: n_layer = 48
+llm_load_print_meta: n_head = 32
+llm_load_print_meta: n_head_kv = 4
+llm_load_print_meta: n_rot = 128
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 128
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 8
+llm_load_print_meta: n_embd_k_gqa = 512
+llm_load_print_meta: n_embd_v_gqa = 512
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 6144
+llm_load_print_meta: n_expert = 128
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 2
+llm_load_print_meta: rope scaling = linear
+llm_load_print_meta: freq_base_train = 1000000.0
+llm_load_print_meta: freq_scale_train = 1
+llm_load_print_meta: n_ctx_orig_yarn = 32768
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = ?B
+llm_load_print_meta: model ftype = Q4_K - Medium
+llm_load_print_meta: model params = 30.532 B
+llm_load_print_meta: model size = 17.347 GiB (4.880 BPW)
+llm_load_print_meta: repeating layers = 16.946 GiB (4.867 BPW, 29.910 B parameters)
+llm_load_print_meta: general.name = Qwen3 30B A3B
+llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
+llm_load_print_meta: EOS token = 151645 '<|im_end|>'
+llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
+llm_load_print_meta: LF token = 148848 'ÄĬ'
+llm_load_print_meta: EOT token = 151645 '<|im_end|>'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_ff_exp = 768
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 1 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
+llm_load_tensors: ggml ctx size = 0.51 MiB
+llm_load_tensors: offloading 48 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 49/49 layers to GPU
+llm_load_tensors: CPU buffer size = 166.92 MiB
+llm_load_tensors: CUDA0 buffer size = 17596.43 MiB
+...................................................................................................
+llama_new_context_with_model: n_ctx = 32768
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 1
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 1000000.0
+llama_new_context_with_model: freq_scale = 1
+llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB
+llama_new_context_with_model: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
+llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 304.75 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 68.01 MiB
+llama_new_context_with_model: graph nodes = 1878
+llama_new_context_with_model: graph splits = 2
+
+main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 1, n_threads_batch = 1
+```
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.255 | 2009.10 | 0.918 | 139.50 |
+| 512 | 128 | 512 | 0.243 | 2106.91 | 0.936 | 136.69 |
+| 512 | 128 | 1024 | 0.250 | 2050.52 | 0.977 | 131.04 |
+| 512 | 128 | 1536 | 0.253 | 2022.75 | 1.003 | 127.60 |
+| 512 | 128 | 2048 | 0.265 | 1931.90 | 1.035 | 123.65 |
+| 512 | 128 | 2560 | 0.262 | 1954.45 | 1.058 | 121.03 |
+| 512 | 128 | 3072 | 0.269 | 1903.29 | 1.090 | 117.45 |
+| 512 | 128 | 3584 | 0.272 | 1881.21 | 1.117 | 114.59 |
+| 512 | 128 | 4096 | 0.278 | 1840.59 | 1.143 | 112.02 |
+| 512 | 128 | 4608 | 0.282 | 1816.74 | 1.173 | 109.16 |
+| 512 | 128 | 5120 | 0.289 | 1770.11 | 1.194 | 107.20 |
+| 512 | 128 | 5632 | 0.291 | 1762.16 | 1.235 | 103.62 |
+| 512 | 128 | 6144 | 0.297 | 1722.54 | 1.256 | 101.91 |
+| 512 | 128 | 6656 | 0.302 | 1697.57 | 1.287 | 99.48 |
+| 512 | 128 | 7168 | 0.310 | 1652.41 | 1.313 | 97.52 |
+| 512 | 128 | 7680 | 0.313 | 1635.64 | 1.345 | 95.17 |
+| 512 | 128 | 8192 | 0.320 | 1599.62 | 1.366 | 93.69 |
+| 512 | 128 | 8704 | 0.321 | 1594.68 | 1.400 | 91.44 |
+| 512 | 128 | 9216 | 0.328 | 1562.18 | 1.419 | 90.21 |
+| 512 | 128 | 9728 | 0.333 | 1536.75 | 1.453 | 88.12 |
+| 512 | 128 | 10240 | 0.335 | 1528.06 | 1.475 | 86.79 |
+| 512 | 128 | 10752 | 0.346 | 1481.90 | 1.507 | 84.93 |
+| 512 | 128 | 11264 | 0.347 | 1476.88 | 1.537 | 83.30 |
+| 512 | 128 | 11776 | 0.353 | 1452.17 | 1.562 | 81.96 |
+| 512 | 128 | 12288 | 0.356 | 1436.80 | 1.589 | 80.54 |
+| 512 | 128 | 12800 | 0.363 | 1412.23 | 1.616 | 79.22 |
+| 512 | 128 | 13312 | 0.367 | 1396.25 | 1.645 | 77.83 |
+| 512 | 128 | 13824 | 0.371 | 1379.64 | 1.667 | 76.79 |
+| 512 | 128 | 14336 | 0.375 | 1364.63 | 1.699 | 75.36 |
+| 512 | 128 | 14848 | 0.383 | 1337.10 | 1.720 | 74.40 |
+| 512 | 128 | 15360 | 0.386 | 1325.34 | 1.755 | 72.92 |
+| 512 | 128 | 15872 | 0.394 | 1298.46 | 1.783 | 71.78 |
+| 512 | 128 | 16384 | 0.400 | 1281.11 | 1.812 | 70.63 |
+| 512 | 128 | 16896 | 0.404 | 1267.98 | 1.841 | 69.53 |
+| 512 | 128 | 17408 | 0.410 | 1248.98 | 1.866 | 68.58 |
+| 512 | 128 | 17920 | 0.415 | 1234.86 | 1.894 | 67.57 |
+| 512 | 128 | 18432 | 0.423 | 1209.68 | 1.921 | 66.65 |
+| 512 | 128 | 18944 | 0.424 | 1208.65 | 1.954 | 65.52 |
+| 512 | 128 | 19456 | 0.433 | 1183.41 | 1.979 | 64.69 |
+| 512 | 128 | 19968 | 0.436 | 1175.61 | 2.006 | 63.80 |
+| 512 | 128 | 20480 | 0.437 | 1170.34 | 2.034 | 62.93 |
+| 512 | 128 | 20992 | 0.443 | 1156.61 | 2.064 | 62.02 |
+| 512 | 128 | 21504 | 0.455 | 1124.73 | 2.090 | 61.24 |
+| 512 | 128 | 22016 | 0.457 | 1119.31 | 2.127 | 60.18 |
+| 512 | 128 | 22528 | 0.466 | 1099.39 | 2.161 | 59.23 |
+| 512 | 128 | 23040 | 0.468 | 1092.87 | 2.185 | 58.59 |
+| 512 | 128 | 23552 | 0.472 | 1084.20 | 2.217 | 57.73 |
+| 512 | 128 | 24064 | 0.479 | 1068.27 | 2.247 | 56.96 |
+| 512 | 128 | 24576 | 0.485 | 1054.75 | 2.279 | 56.16 |
+| 512 | 128 | 25088 | 0.489 | 1047.65 | 2.303 | 55.58 |
+| 512 | 128 | 25600 | 0.492 | 1040.04 | 2.336 | 54.80 |
+| 512 | 128 | 26112 | 0.497 | 1030.94 | 2.364 | 54.15 |
+| 512 | 128 | 26624 | 0.504 | 1015.28 | 2.392 | 53.51 |
+| 512 | 128 | 27136 | 0.511 | 1001.21 | 2.423 | 52.83 |
+| 512 | 128 | 27648 | 0.515 | 993.63 | 2.450 | 52.25 |
+| 512 | 128 | 28160 | 0.520 | 984.55 | 2.478 | 51.65 |
+| 512 | 128 | 28672 | 0.528 | 968.98 | 2.509 | 51.01 |
+| 512 | 128 | 29184 | 0.532 | 962.30 | 2.539 | 50.40 |
+| 512 | 128 | 29696 | 0.536 | 955.10 | 2.566 | 49.89 |
+| 512 | 128 | 30208 | 0.538 | 952.18 | 2.596 | 49.31 |
+| 512 | 128 | 30720 | 0.546 | 938.43 | 2.628 | 48.70 |
+| 512 | 128 | 31232 | 0.549 | 932.50 | 2.655 | 48.22 |
+| 512 | 128 | 31744 | 0.556 | 920.87 | 2.687 | 47.64 |
+| 512 | 128 | 32256 | 0.562 | 911.52 | 2.712 | 47.19 |
+
+## `ik_llama.cpp/ik/fattn_mma@056f0818` PR370
+```
+cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF
+cmake --build build --config Release -j $(nproc)
+
+CUDA_VISIBLE_DEVICE=0 \
+./build/bin/llama-sweep-bench \
+ --model /mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf \
+ -fmoe \
+ -fa \
+ -ctk f16 -ctv f16 \
+ -c 32768 \
+ -ngl 99 \
+ --threads 1
+
+llama_model_loader: loaded meta data with 41 key-value pairs and 579 tensors from /mnt/astrodata/llm/models/bartowski/Qwen_Qwen3-30B-A3B-GGUF/Qwen_Qwen3-30B-A3B-Q4_K_M.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3moe
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 30B A3B
+llama_model_loader: - kv 3: general.basename str = Qwen3
+llama_model_loader: - kv 4: general.size_label str = 30B-A3B
+llama_model_loader: - kv 5: general.license str = apache-2.0
+llama_model_loader: - kv 6: general.license.link str = https://huggingface.co/Qwen/Qwen3-30B...
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Qwen3 30B A3B Base
+llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-30B...
+llama_model_loader: - kv 11: general.tags arr[str,1] = ["text-generation"]
+llama_model_loader: - kv 12: qwen3moe.block_count u32 = 48
+llama_model_loader: - kv 13: qwen3moe.context_length u32 = 32768
+llama_model_loader: - kv 14: qwen3moe.embedding_length u32 = 2048
+llama_model_loader: - kv 15: qwen3moe.feed_forward_length u32 = 6144
+llama_model_loader: - kv 16: qwen3moe.attention.head_count u32 = 32
+llama_model_loader: - kv 17: qwen3moe.attention.head_count_kv u32 = 4
+llama_model_loader: - kv 18: qwen3moe.rope.freq_base f32 = 1000000.000000
+llama_model_loader: - kv 19: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 20: qwen3moe.expert_used_count u32 = 8
+llama_model_loader: - kv 21: qwen3moe.attention.key_length u32 = 128
+llama_model_loader: - kv 22: qwen3moe.attention.value_length u32 = 128
+llama_model_loader: - kv 23: qwen3moe.expert_count u32 = 128
+llama_model_loader: - kv 24: qwen3moe.expert_feed_forward_length u32 = 768
+llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151643
+llama_model_loader: - kv 32: tokenizer.ggml.bos_token_id u32 = 151643
+llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 34: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
+llama_model_loader: - kv 35: general.quantization_version u32 = 2
+llama_model_loader: - kv 36: general.file_type u32 = 15
+llama_model_loader: - kv 37: quantize.imatrix.file str = /models_out/Qwen3-30B-A3B-GGUF/Qwen_Q...
+llama_model_loader: - kv 38: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt
+llama_model_loader: - kv 39: quantize.imatrix.entries_count i32 = 384
+llama_model_loader: - kv 40: quantize.imatrix.chunks_count i32 = 209
+llama_model_loader: - type f32: 241 tensors
+llama_model_loader: - type q8_0: 48 tensors
+llama_model_loader: - type q4_K: 193 tensors
+llama_model_loader: - type q5_K: 48 tensors
+llama_model_loader: - type q6_K: 49 tensors
+llm_load_vocab: special tokens cache size = 26
+llm_load_vocab: token to piece cache size = 0.9311 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = qwen3moe
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 151936
+llm_load_print_meta: n_merges = 151387
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 32768
+llm_load_print_meta: n_embd = 2048
+llm_load_print_meta: n_layer = 48
+llm_load_print_meta: n_head = 32
+llm_load_print_meta: n_head_kv = 4
+llm_load_print_meta: n_rot = 128
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 128
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 8
+llm_load_print_meta: n_embd_k_gqa = 512
+llm_load_print_meta: n_embd_v_gqa = 512
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 6144
+llm_load_print_meta: n_expert = 128
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 2
+llm_load_print_meta: rope scaling = linear
+llm_load_print_meta: freq_base_train = 1000000.0
+llm_load_print_meta: freq_scale_train = 1
+llm_load_print_meta: n_ctx_orig_yarn = 32768
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = ?B
+llm_load_print_meta: model ftype = Q4_K - Medium
+llm_load_print_meta: model params = 30.532 B
+llm_load_print_meta: model size = 17.347 GiB (4.880 BPW)
+llm_load_print_meta: repeating layers = 16.946 GiB (4.867 BPW, 29.910 B parameters)
+llm_load_print_meta: general.name = Qwen3 30B A3B
+llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
+llm_load_print_meta: EOS token = 151645 '<|im_end|>'
+llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
+llm_load_print_meta: LF token = 148848 'ÄĬ'
+llm_load_print_meta: EOT token = 151645 '<|im_end|>'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_ff_exp = 768
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 1 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
+llm_load_tensors: ggml ctx size = 0.51 MiB
+llm_load_tensors: offloading 48 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 49/49 layers to GPU
+llm_load_tensors: CPU buffer size = 166.92 MiB
+llm_load_tensors: CUDA0 buffer size = 17596.43 MiB
+...................................................................................................
+llama_new_context_with_model: n_ctx = 32768
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 1
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 1000000.0
+llama_new_context_with_model: freq_scale = 1
+llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB
+llama_new_context_with_model: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
+llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
+llama_new_context_with_model: CUDA0 compute buffer size = 304.75 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 68.01 MiB
+llama_new_context_with_model: graph nodes = 1878
+llama_new_context_with_model: graph splits = 2
+
+main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 1, n_threads_batch = 1
+```
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.256 | 2001.93 | 0.940 | 136.11 |
+| 512 | 128 | 512 | 0.240 | 2134.60 | 0.981 | 130.46 |
+| 512 | 128 | 1024 | 0.245 | 2091.45 | 0.981 | 130.42 |
+| 512 | 128 | 1536 | 0.245 | 2087.85 | 0.986 | 129.78 |
+| 512 | 128 | 2048 | 0.257 | 1993.14 | 0.996 | 128.54 |
+| 512 | 128 | 2560 | 0.250 | 2050.10 | 0.995 | 128.67 |
+| 512 | 128 | 3072 | 0.257 | 1995.91 | 1.008 | 127.03 |
+| 512 | 128 | 3584 | 0.258 | 1987.22 | 1.022 | 125.30 |
+| 512 | 128 | 4096 | 0.263 | 1946.15 | 1.036 | 123.54 |
+| 512 | 128 | 4608 | 0.264 | 1938.42 | 1.042 | 122.86 |
+| 512 | 128 | 5120 | 0.267 | 1918.49 | 1.052 | 121.73 |
+| 512 | 128 | 5632 | 0.268 | 1909.64 | 1.066 | 120.04 |
+| 512 | 128 | 6144 | 0.271 | 1890.12 | 1.080 | 118.56 |
+| 512 | 128 | 6656 | 0.274 | 1868.32 | 1.092 | 117.20 |
+| 512 | 128 | 7168 | 0.281 | 1821.31 | 1.102 | 116.14 |
+| 512 | 128 | 7680 | 0.281 | 1825.09 | 1.109 | 115.41 |
+| 512 | 128 | 8192 | 0.287 | 1784.82 | 1.125 | 113.77 |
+| 512 | 128 | 8704 | 0.286 | 1792.48 | 1.135 | 112.81 |
+| 512 | 128 | 9216 | 0.290 | 1766.45 | 1.152 | 111.08 |
+| 512 | 128 | 9728 | 0.293 | 1746.46 | 1.161 | 110.27 |
+| 512 | 128 | 10240 | 0.294 | 1743.47 | 1.169 | 109.52 |
+| 512 | 128 | 10752 | 0.302 | 1694.24 | 1.218 | 105.08 |
+| 512 | 128 | 11264 | 0.301 | 1701.12 | 1.233 | 103.78 |
+| 512 | 128 | 11776 | 0.305 | 1676.91 | 1.221 | 104.84 |
+| 512 | 128 | 12288 | 0.309 | 1659.19 | 1.226 | 104.39 |
+| 512 | 128 | 12800 | 0.313 | 1638.23 | 1.238 | 103.35 |
+| 512 | 128 | 13312 | 0.314 | 1628.03 | 1.234 | 103.75 |
+| 512 | 128 | 13824 | 0.317 | 1615.65 | 1.242 | 103.03 |
+| 512 | 128 | 14336 | 0.319 | 1603.49 | 1.247 | 102.68 |
+| 512 | 128 | 14848 | 0.325 | 1574.84 | 1.253 | 102.16 |
+| 512 | 128 | 15360 | 0.327 | 1564.81 | 1.259 | 101.68 |
+| 512 | 128 | 15872 | 0.331 | 1547.61 | 1.266 | 101.13 |
+| 512 | 128 | 16384 | 0.333 | 1538.40 | 1.273 | 100.58 |
+| 512 | 128 | 16896 | 0.336 | 1522.28 | 1.279 | 100.10 |
+| 512 | 128 | 17408 | 0.341 | 1503.25 | 1.287 | 99.49 |
+| 512 | 128 | 17920 | 0.344 | 1489.38 | 1.292 | 99.06 |
+| 512 | 128 | 18432 | 0.349 | 1466.62 | 1.301 | 98.38 |
+| 512 | 128 | 18944 | 0.349 | 1468.88 | 1.310 | 97.73 |
+| 512 | 128 | 19456 | 0.355 | 1442.42 | 1.316 | 97.24 |
+| 512 | 128 | 19968 | 0.356 | 1438.99 | 1.328 | 96.41 |
+| 512 | 128 | 20480 | 0.356 | 1439.73 | 1.332 | 96.08 |
+| 512 | 128 | 20992 | 0.360 | 1423.74 | 1.340 | 95.51 |
+| 512 | 128 | 21504 | 0.366 | 1397.32 | 1.380 | 92.78 |
+| 512 | 128 | 22016 | 0.367 | 1396.55 | 1.391 | 92.02 |
+| 512 | 128 | 22528 | 0.373 | 1373.81 | 1.386 | 92.37 |
+| 512 | 128 | 23040 | 0.374 | 1370.53 | 1.388 | 92.25 |
+| 512 | 128 | 23552 | 0.377 | 1359.11 | 1.396 | 91.66 |
+| 512 | 128 | 24064 | 0.379 | 1350.00 | 1.395 | 91.78 |
+| 512 | 128 | 24576 | 0.383 | 1336.05 | 1.398 | 91.57 |
+| 512 | 128 | 25088 | 0.385 | 1328.42 | 1.402 | 91.32 |
+| 512 | 128 | 25600 | 0.388 | 1318.46 | 1.405 | 91.13 |
+| 512 | 128 | 26112 | 0.391 | 1309.57 | 1.416 | 90.41 |
+| 512 | 128 | 26624 | 0.394 | 1299.16 | 1.421 | 90.07 |
+| 512 | 128 | 27136 | 0.399 | 1282.71 | 1.433 | 89.31 |
+| 512 | 128 | 27648 | 0.401 | 1277.67 | 1.438 | 88.99 |
+| 512 | 128 | 28160 | 0.407 | 1259.25 | 1.453 | 88.12 |
+| 512 | 128 | 28672 | 0.413 | 1240.48 | 1.456 | 87.92 |
+| 512 | 128 | 29184 | 0.415 | 1234.24 | 1.464 | 87.41 |
+| 512 | 128 | 29696 | 0.416 | 1231.64 | 1.469 | 87.11 |
+| 512 | 128 | 30208 | 0.417 | 1228.70 | 1.477 | 86.63 |
+| 512 | 128 | 30720 | 0.421 | 1215.75 | 1.487 | 86.08 |
+| 512 | 128 | 31232 | 0.424 | 1206.39 | 1.498 | 85.47 |
+| 512 | 128 | 31744 | 0.426 | 1201.22 | 1.501 | 85.26 |
+| 512 | 128 | 32256 | 0.431 | 1189.21 | 1.544 | 82.92 |
+
+
+
I had not yet run Qwen3-30B-A3B fully offloaded on my local 3090TI 24GB VRAM rig on mainline before, so this is data I have not seen. I have a couple more benchmarks to repeat including my `mix-IQ3_K` quants as well as the hybrid CPU+GPU setup too on the remote thread ripper RTX A6000 to confirm given this PR is largely about TG performance.
A couple observations about this test case:
- I used `-fmoe` with both ik cases as it seems to improve performance over removing it still.
-- I noticed the power draw on my GPU was higher for mainline than this PR.
+- I noticed the power draw on my GPU was higher for mainline than this PR. Power limit is uncapped at full 450 Watts.
#### Mainline btop

@@ -717,17 +1388,17 @@ A couple observations about this test case:
---
-👤 **ubergarm** commented the **2025-05-03** at **22:05:33**:
+👤 **ubergarm** commented on **2025-05-03** at **22:05:33**
## [ubergarm/Qwen3-30B-A3B-mix-IQ4_K](https://huggingface.co/ubergarm/Qwen3-30B-A3B-GGUF)

-This is comparing a mix of mostly IQ5_K/IQ4_K layers between ik@main baseline and this ik@PR370 showing improved performance of *both* PP and TG for full GPU offload case.
+This is comparing a mix of mostly IQ5_K/IQ4_K layers between ik@main baseline and this ik@PR370 showing improved performance of *both* PP and TG for full GPU offload case. Sorry colors are not consistent with previous graphs.
-
+👈 Logs
## `ik_llama.cpp/main@ab7f694b`
```
@@ -1165,10 +1836,12 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
---
-👤 **AesSedai** commented the **2025-05-03** at **22:57:58**:
+👤 **AesSedai** commented on **2025-05-03** at **22:57:58**
I've run the tests for 235B-A22B Q6 as well to compare. I used the Unsloth Q6 quant for both ik_llama.cpp and llama.cpp, the only arg difference in the calls is for ik_llama.cpp's support of `-fmoe -rtr`. Same offload the the rest otherwise.
+(also stole the graphing python code from @ubergarm, thanks again!)
+
ik_llama.cpp ik/fattn_mma
@@ -2698,12 +3371,14 @@ main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
---
-👤 **ubergarm** commented the **2025-05-03** at **23:17:07**:
+👤 **ubergarm** commented on **2025-05-03** at **23:17:07**
-@AesSedai *very nice*! Cool to see you are getting some uplift in PP as well and more linear fall-off for TG. I'm running that quant's little brother on my local rig in hybrid CPU+GPU inference in this test for comparison, but no mainline comparison as its the `-mix-IQ3_K`.
+@AesSedai *very nice*! Cool to see you are getting some uplift in PP as well as slower fall-off for TG. I'm running that quant's little brother on my local rig in hybrid CPU+GPU inference in this test for comparison, but no mainline comparison as its the `-mix-IQ3_K`.
Hope to finally get three runs of the hybrid CPU+GPU of the full Q8_0 across both forks before the night it out! If i have any juice left in me I might revisit earlier runs to add in `-ctk q8_0 -ctv q8_0` to see if any uplift for fully offloaded quantized kv-cache.
+Sorry again the colors are not similar across all our graphs, but we're doing good! haha...
+
## [ubergarm/Qwen3-235B-A22B-mix-IQ3_K](https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF)

@@ -3202,7 +3877,7 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
---
-👤 **ubergarm** commented the **2025-05-04** at **01:36:41**:
+👤 **ubergarm** commented on **2025-05-04** at **01:36:41**
## ubergarm/Qwen3-235B-A22B-Q8_0
@@ -3889,7 +4564,7 @@ main: n_kv_max = 32768, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_la
---
-👤 **ubergarm** commented the **2025-05-04** at **04:41:15**:
+👤 **ubergarm** commented on **2025-05-04** at **04:41:15**
Finally, I also tested this PR to ensure the models were still actually working in addition to being faster. I used this PR + my [ubergarm/Qwen3-30B-A3B-mix-IQ4_K](https://huggingface.co/ubergarm/Qwen3-30B-A3B-GGUF) to vibe code up the imatrix-statistics visualization scripts to parse and and plot data the stats: https://gist.github.com/ubergarm/2aa9327f7b98a9b16fef62b4941c7e76
@@ -3897,7 +4572,7 @@ So anecdotally the model still seems to work fine fwiw. Cheers and g'night!
---
-👤 **ikawrakow** commented the **2025-05-04** at **06:17:33**:
+👤 **ikawrakow** commented on **2025-05-04** at **06:17:33**
Thank you for these results and for testing!
@@ -3907,7 +4582,7 @@ In any case, this PR looks like a winner, so merging.
---
-👤 **ubergarm** commented the **2025-05-04** at **17:08:14**:
+👤 **ubergarm** commented on **2025-05-04** at **17:08:14**
Amazing work y'all! I did a little post to let folks know its time to `git pull` and rebuild to take advantage of all the improvements!
diff --git a/github-data/pull_requests/371 - Another attempt to fix 367.md b/github-data/pull_requests/371 - Another attempt to fix 367.md
new file mode 100644
index 000000000..20c1e8be1
--- /dev/null
+++ b/github-data/pull_requests/371 - Another attempt to fix 367.md
@@ -0,0 +1,18 @@
+## 🔀 [Pull Request #371](https://github.com/ikawrakow/ik_llama.cpp/pull/371) - Another attempt to fix [#367](https://github.com/ikawrakow/ik_llama.cpp/issues/367)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/try_fix_367_v2` |
+| **Target Branch** | `main` |
+| **Created** | 2025-05-03 |
+| **Updated** | 2025-05-04 |
+| **Merged** | 2025-05-04 |
+
+---
+
+## 📄 Description
+
+Fix `IQ1_M_R4` quantization failure.
+
+Closes [#367](https://github.com/ikawrakow/ik_llama.cpp/issues/367)
\ No newline at end of file
diff --git a/github-data/pull_requests/371 - Another attempt to fix _367.md b/github-data/pull_requests/371 - Another attempt to fix _367.md
deleted file mode 100644
index d9094869e..000000000
--- a/github-data/pull_requests/371 - Another attempt to fix _367.md
+++ /dev/null
@@ -1,15 +0,0 @@
-### 🐛 [#371](https://github.com/ikawrakow/ik_llama.cpp/pull/371) - Another attempt to fix [#367](https://github.com/ikawrakow/ik_llama.cpp/issues/367)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-05-03 |
-| **Updated** | 2025-05-04 |
-
----
-
-#### Description
-
-Fix `IQ1_M_R4` quantization failure.
-
-Closes #367
\ No newline at end of file
diff --git a/github-data/pull_requests/374 - CUDA_ MMQ for IQ4_KS.md b/github-data/pull_requests/374 - CUDA MMQ for IQ4_KS.md
similarity index 92%
rename from github-data/pull_requests/374 - CUDA_ MMQ for IQ4_KS.md
rename to github-data/pull_requests/374 - CUDA MMQ for IQ4_KS.md
index 152be3326..feccd41bd 100644
--- a/github-data/pull_requests/374 - CUDA_ MMQ for IQ4_KS.md
+++ b/github-data/pull_requests/374 - CUDA MMQ for IQ4_KS.md
@@ -1,14 +1,17 @@
-### 🔀 [#374](https://github.com/ikawrakow/ik_llama.cpp/pull/374) - CUDA: MMQ for IQ4_KS
+## 🔀 [Pull Request #374](https://github.com/ikawrakow/ik_llama.cpp/pull/374) - CUDA: MMQ for IQ4_KS
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_mmq_iq4_ks` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-04 |
| **Updated** | 2025-05-07 |
+| **Merged** | 2025-05-04 |
---
-#### Description
+## 📄 Description
`IQX_K` quants offer better quantization quality for the same amount of bits spent compared to k- and i-quants. But on CUDA they are slower for prompt processing (PP) because matrix multiplications are done via dequantize->cuBLAS, so I thought it is time to fix this.
@@ -105,9 +108,9 @@ TG performance is not affected at all by the PR, so no graph for that.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-05-04** at **07:33:54**:
+👤 **saood06** commented on **2025-05-04** at **07:33:54**
> I checked that if I add another warn-up run with n_ubatch tokens, performance for N_KV = 0 becomes higher than N_KV = 512 as expected. I guess, I will submit a separate PR for that.
@@ -115,7 +118,7 @@ Interesting, I've always dealt with it by either comparing the second row (as it
---
-👤 **ikawrakow** commented the **2025-05-04** at **07:41:21**:
+👤 **ikawrakow** commented on **2025-05-04** at **07:41:21**
> Interesting, I've always dealt with it by either comparing the second row (as it is generally more stable between runs anyways) or just running a very low context sweep-bench as a warmup
@@ -125,7 +128,7 @@ I'll make the PP warm-up pass optional via a command line argument as for very l
---
-👤 **saood06** commented the **2025-05-04** at **07:52:57**:
+👤 **saood06** commented on **2025-05-04** at **07:52:57**
>It does not affect CPU performance.
@@ -133,11 +136,11 @@ I just looked back at my notes/logs, it is the first TG for CPU that does vary,
>I'll make the PP warm-up pass optional via a command line argument as for very large models on the CPU it does take some time to process a batch of 512 tokens.
-I was going to suggest that, as that is very true for some of my testing.
+Thanks, I was going to suggest that, as that is very true for some of my testing.
---
-👤 **ubergarm** commented the **2025-05-07** at **22:02:48**:
+👤 **ubergarm** commented on **2025-05-07** at **22:02:48**
I'm working on some benchmarks for various Qwen3-30B-A3B quants and ran some llama-sweep-benches and this PR is looking good for your `IQ4_KS`. Used the `--warmup-batch` PR as well.
diff --git a/github-data/pull_requests/375 - Add batch warmup to sweep-bench.md b/github-data/pull_requests/375 - Add batch warmup to sweep-bench.md
index 34d2e1d88..d98690be6 100644
--- a/github-data/pull_requests/375 - Add batch warmup to sweep-bench.md
+++ b/github-data/pull_requests/375 - Add batch warmup to sweep-bench.md
@@ -1,16 +1,19 @@
-### 🔀 [#375](https://github.com/ikawrakow/ik_llama.cpp/pull/375) - Add batch warmup to sweep-bench
+## 🔀 [Pull Request #375](https://github.com/ikawrakow/ik_llama.cpp/pull/375) - Add batch warmup to sweep-bench
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/sweep_bench_warmup` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-04 |
| **Updated** | 2025-05-12 |
+| **Merged** | 2025-05-12 |
---
-#### Description
+## 📄 Description
-When using `sweep-bench` on CUDA, often the PP performance for `N_KV = 0` (i.e., first PP run) is lower than the measured PP performance for `N_KV > 0`. My guess is that this is due to having to find and load from the cache of pre-compiled kernels the required once, which may take time that is not negligible compared to the time it takes the compute the batch. For an example, see the graph in PR #374.
+When using `sweep-bench` on CUDA, often the PP performance for `N_KV = 0` (i.e., first PP run) is lower than the measured PP performance for `N_KV > 0`. My guess is that this is due to having to find and load from the cache of pre-compiled kernels the required once, which may take time that is not negligible compared to the time it takes the compute the batch. For an example, see the graph in PR [#374](https://github.com/ikawrakow/ik_llama.cpp/issues/374).
To prevent this misleading result, this PR adds the ability to also use a warm-up run with `n_ubatch` tokens. The option is off by default as computing a batch on the CPU for a large model can take a significant amount of time (but the measured performance is not affected by having done a batch warmup run). To turn it on, use
```
@@ -19,15 +22,15 @@ To prevent this misleading result, this PR adds the ability to also use a warm-u
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-05-04** at **08:51:18**:
+👤 **saood06** commented on **2025-05-04** at **08:51:18**
-Wouldn't it make sense to make this a global warmup option across bench and common (see this commit for when I affected all off them https://github.com/ikawrakow/ik_llama.cpp/commit/370274317b41b426893ff9a8f06030715d1c8a5f )? The only other thing is if you want the warmup MoE optimization of loading in all experts, then we would need to make the way that happens more robust as it is hacky and looks at it being exactly one token and that being the bos.
+Wouldn't it make sense to make this a global warmup option across bench and common (see this commit for when I affected all off them https://github.com/ikawrakow/ik_llama.cpp/commit/370274317b41b426893ff9a8f06030715d1c8a5f )? The only other thing is if you want the warmup MoE optimization of loading in all experts, then we would need to make the way that happens more robust as it is hacky and looks at it being exactly one token and that being the bos (as that would never happen normally), but a full batch is a normal occurence.
---
-👤 **ikawrakow** commented the **2025-05-04** at **09:24:18**:
+👤 **ikawrakow** commented on **2025-05-04** at **09:24:18**
> Wouldn't it make sense to make this a global warmup option across bench and common
@@ -35,7 +38,7 @@ It would. The command line option is added to `common`, so the parameter is theo
---
-👤 **saood06** commented the **2025-05-04** at **09:39:56**:
+👤 **saood06** commented on **2025-05-04** at **09:39:56**
> > Wouldn't it make sense to make this a global warmup option across bench and common
>
@@ -53,7 +56,7 @@ Yes I agree.
---
-👤 **ikawrakow** commented the **2025-05-04** at **12:22:35**:
+👤 **ikawrakow** commented on **2025-05-04** at **12:22:35**
> Yes but the implementation is done in sweep-bench.cpp not to common.cpp, you just added the command line option there, not the implementation (see the warmup implementation in common.cpp here:
@@ -65,7 +68,7 @@ Yes, because I'm not sure what this unified warmup is going to be. If it ends up
---
-👤 **saood06** commented the **2025-05-04** at **12:39:59**:
+👤 **saood06** commented on **2025-05-04** at **12:39:59**
> Yes, because I'm not sure what this unified warmup is going to be. If it ends up being the same or similar enough, one can reuse it in `sweep-bench`. But for now it is best if we don't touch the `common` warmup, thus affecting all examples.
@@ -79,7 +82,7 @@ Yes, I often output the json because you can see all the results (and I am famil
---
-👤 **ubergarm** commented the **2025-05-07** at **21:44:58**:
+👤 **ubergarm** commented on **2025-05-07** at **21:44:58**
## tl;dr;
:+1:
diff --git a/github-data/pull_requests/377 - Support for Llama-3-Nemotron models.md b/github-data/pull_requests/377 - Support for Llama-3-Nemotron models.md
index 1429319e1..a569a1cac 100644
--- a/github-data/pull_requests/377 - Support for Llama-3-Nemotron models.md
+++ b/github-data/pull_requests/377 - Support for Llama-3-Nemotron models.md
@@ -1,14 +1,17 @@
-### 🔀 [#377](https://github.com/ikawrakow/ik_llama.cpp/pull/377) - Support for Llama-3-Nemotron models
+## 🔀 [Pull Request #377](https://github.com/ikawrakow/ik_llama.cpp/pull/377) - Support for Llama-3-Nemotron models
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/deci_support` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-04 |
| **Updated** | 2025-05-09 |
+| **Merged** | 2025-05-09 |
---
-#### Description
+## 📄 Description
Port of https://github.com/ggml-org/llama.cpp/pull/10669
@@ -16,9 +19,9 @@ It compiles, have not tested yet. Testers welcome, but will try to test myself l
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-05-04** at **12:31:11**:
+👤 **saood06** commented on **2025-05-04** at **12:31:11**
I downloaded the source model and was able to convert it with `convert_hf_to_gguf.py` but I hit an error when attempting to quantize it.
@@ -26,19 +29,21 @@ I downloaded the source model and was able to convert it with `convert_hf_to_ggu
---
-👤 **ikawrakow** commented the **2025-05-04** at **12:38:47**:
+👤 **ikawrakow** commented on **2025-05-04** at **12:38:47**
Well, you see what `n_attention_wv` is and add another rule for accepting it. This is because of the layers that don't have the usual attention mechanism, I guess.
---
-👤 **saood06** commented the **2025-05-04** at **13:02:38**:
+👤 **saood06** commented on **2025-05-04** at **13:02:38**
-It's quantizing now.
+It's quantizing now.
+
+Edit: I guessed on the value for the big one based on the difference in number of layers between them.
---
-👤 **ikawrakow** commented the **2025-05-04** at **13:10:46**:
+👤 **ikawrakow** commented on **2025-05-04** at **13:10:46**
Apart from the 253B version that is beyond my reach, this will add support for this model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct ?
@@ -46,7 +51,7 @@ What about https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1 which s
---
-👤 **saood06** commented the **2025-05-04** at **13:14:52**:
+👤 **saood06** commented on **2025-05-04** at **13:14:52**
> Apart from the 253B version that is beyond my reach
@@ -62,7 +67,7 @@ That one should work (maybe the convert python might not?) but you may need to a
---
-👤 **saood06** commented the **2025-05-04** at **13:18:16**:
+👤 **saood06** commented on **2025-05-04** at **13:18:16**
It is coherent in the cli.
@@ -70,36 +75,44 @@ Will sweep-bench it later.
---
-👤 **ikawrakow** submitted a review the **2025-05-04** at **14:02:57**: 💬 `COMMENTED`
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-05-04** at **13:23:29**
+
+I guess this is copy-pasted from `build_llama()`, which also builds the graph for the Granite models. But do we expect Nemotron to have something to do with Granite? If not, it is better to remove it.
+
+> 👤 **saood06** replied on **2025-05-04** at **14:19:07**
+>
+> Sorry I didn't notice these. They are in the original PR as well (which I cherry-picked as it was from when they hadn't diverged too much), I'll take them out. Right now I'm working on the larger model as that can't be cherry-picked
---
-👤 **ikawrakow** commented the **2025-05-04** at **14:05:15**:
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-05-04** at **13:23:50**
-I get this error when I try to run the [49B model](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1) (after adjusting the `n_attention_vw` check):
-```
-llama_model_load: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file
-```
+Same comment as above
---
-👤 **ikawrakow** commented the **2025-05-04** at **14:16:43**:
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-05-04** at **13:24:25**
-Works if I convert with mainline, so something is missing in the conversion script.
+Does it apply to Nemotron?
---
-👤 **saood06** submitted a review the **2025-05-04** at **14:19:07**: 💬 `COMMENTED`
+👤 **ikawrakow** commented on **2025-05-04** at **14:05:15**
+
+I get this error when I try to run the [49B model](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1) (after adjusting the `n_attention_vw` check):
+```
+llama_model_load: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file
+```
---
-👤 **saood06** commented during a code review the **2025-05-04** at **14:19:07** on `src/llama.cpp`:
+👤 **ikawrakow** commented on **2025-05-04** at **14:16:43**
-Sorry I didn't notice these. They are in the original PR as well (which I cherry-picked as it was from when they hadn't diverged too much), I'll take them out. Right now I'm working on the larger model as that can't be cherry-picked
+Works if I convert with mainline, so something is missing in the conversion script.
---
-👤 **saood06** commented the **2025-05-04** at **14:19:52**:
+👤 **saood06** commented on **2025-05-04** at **14:19:52**
> Works if I convert with mainline, so something is missing in the conversion script.
@@ -107,15 +120,15 @@ Thanks for testing that, I'll look into the script.
---
-👤 **saood06** commented the **2025-05-04** at **15:23:42**:
+👤 **saood06** commented on **2025-05-04** at **15:23:42**
@Lissanro
-Can you try Llama-3_1-Nemotron-Ultra-253B now, the n_attention_vw check may be broken but everything else I think should be fine.
+Can you try Llama-3_1-Nemotron-Ultra-253B now?
---
-👤 **ikawrakow** commented the **2025-05-04** at **15:29:05**:
+👤 **ikawrakow** commented on **2025-05-04** at **15:29:05**
> the n_attention_vw check may be broken but everything else I think should be fine.
@@ -123,7 +136,7 @@ Oh, I forgot to comment on that one. I solved it for the 49B model by simply acc
---
-👤 **saood06** commented the **2025-05-04** at **15:41:08**:
+👤 **saood06** commented on **2025-05-04** at **15:41:08**
@ikawrakow
@@ -131,13 +144,13 @@ Can you test the conversion again? This is good to review again, I'm done pushin
---
-👤 **ikawrakow** commented the **2025-05-04** at **15:48:16**:
+👤 **ikawrakow** commented on **2025-05-04** at **15:48:16**
I'm running something on the computer where I downloaded the model. I'll test in a bit when the run finishes.
---
-👤 **saood06** commented the **2025-05-04** at **15:52:22**:
+👤 **saood06** commented on **2025-05-04** at **15:52:22**
>I'll test in a bit when the run finishes.
@@ -145,7 +158,34 @@ Take your time, I'm heading off for now anyways.
---
-👤 **Panchovix** commented the **2025-05-04** at **19:39:28**:
+👤 **ikawrakow** commented on **2025-05-04** at **16:46:39**
+
+When I run the mainline conversion script, I see this:
+```
+INFO:hf-to-gguf:Set meta model
+INFO:hf-to-gguf:Set model parameters
+INFO:hf-to-gguf:Set model quantization version
+INFO:hf-to-gguf:Set model tokenizer
+INFO:gguf.vocab:Adding 280147 merge(s).
+INFO:gguf.vocab:Setting special token type bos to 128000
+INFO:gguf.vocab:Setting special token type eos to 128009
+```
+But when I run the conversion script in the PR, I see this:
+```
+INFO:hf-to-gguf:Set meta model
+INFO:hf-to-gguf:Set model parameters
+INFO:hf-to-gguf:Set model tokenizer
+WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.
+INFO:gguf.vocab:Setting special token type bos to 128000
+INFO:gguf.vocab:Setting special token type eos to 128009
+```
+So, something is not quite right with the merges.
+
+But I'm actually OK with the conversion script not working. We already have other models that require mainline for conversion to GGUF.
+
+---
+
+👤 **Panchovix** commented on **2025-05-04** at **19:39:28**
Thanks for the work! I'm trying L3 Nemotron 253B Q3_K_XL from unsloth (https://huggingface.co/unsloth/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF/tree/main/UD-Q3_K_XL), here how is the log looks
@@ -312,7 +352,7 @@ Not sure if there's a flag that could improve things for dense models. Also not
---
-👤 **ikawrakow** commented the **2025-05-05** at **06:58:29**:
+👤 **ikawrakow** commented on **2025-05-05** at **06:58:29**
With the commit that I just pushed `convert_hf_to_gguf.py` now converts the [Nemotron-Super-49B](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1) model correctly.
@@ -320,20 +360,27 @@ But then I see a difference in PPL.
I didn't run the `bf16` model directly (comes dangerously close to the total RAM I have), but using `Q8_0` quantization. I arrive at a lower PPL using the HF->GGUF conversion script in this PR compared to using mainline conversion:
* `PPL = 7.0801` using mainline HF->GGUF
-* `PPL = `7.0347` using this PR HF->GGUF
+* `PPL = 7.0347` using this PR HF->GGUF
+
+Quantization is done in exactly the same way, I'm running with exact same parameters on the same hardware, so something else is different in the converted `bf16` models (and just simple `diff` tells me that the files differ).
-Quantization is done in exactly the same way, I'm running with exact same parameters on the same hardware, so something else is different in the converted `bf16` models (and just simple `diff` tells me that the files differ).
+OK, doing `diff` on the logs, I see this difference:
+```
+llama_model_loader: - type f32: 131 tensors (mainline)
+vs
+llama_model_loader: - type f32: 130 tensors (this PR)
+```
---
-👤 **ikawrakow** submitted a review the **2025-05-05** at **13:14:51**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-05** at **13:14:51**
From my perspective this is ready to merge.
Just waiting for @Lissanro to confirm that it is working for them.
---
-👤 **Lissanro** commented the **2025-05-05** at **15:18:43**:
+👤 **Lissanro** commented on **2025-05-05** at **15:18:43**
I tried at first using this command:
@@ -367,7 +414,7 @@ CUDA_VISIBLE_DEVICES="" ~/pkgs/ik_llama.cpp/build/bin/llama-server \
---
-👤 **ikawrakow** commented the **2025-05-05** at **15:23:43**:
+👤 **ikawrakow** commented on **2025-05-05** at **15:23:43**
Can you try
```
@@ -380,11 +427,11 @@ Thanks.
---
-👤 **saood06** commented the **2025-05-05** at **22:20:08**:
+👤 **saood06** commented on **2025-05-05** at **22:20:08**
> With the commit that I just pushed `convert_hf_to_gguf.py` now converts the [Nemotron-Super-49B](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1) model correctly.
-Nice, I see you grabbed the only changes to the vocab.py file that we were behind: https://github.com/ggml-org/llama.cpp/commit/8ba38584b2bf744814e1131f6f6aec97df5a57e1 and https://github.com/ggml-org/llama.cpp/commit/a686171ea71ed8cb8a324850d146cb65a001e141. I think you might have been able to cherry-pick those commits directly.
+Nice, I see you grabbed the only changes to vocab.py that we were behind: https://github.com/ggml-org/llama.cpp/commit/8ba38584b2bf744814e1131f6f6aec97df5a57e1 and https://github.com/ggml-org/llama.cpp/commit/a686171ea71ed8cb8a324850d146cb65a001e141. I think you might have been able to cherry-pick those commits directly.
>
> But then I see a difference in PPL.
>
@@ -409,7 +456,7 @@ Interesting, do you mind checking with gguf-hash or some other tool if that one
---
-👤 **Lissanro** commented the **2025-05-05** at **22:48:49**:
+👤 **Lissanro** commented on **2025-05-05** at **22:48:49**
> Can you try
> ~/pkgs/ik_llama.cpp/build/bin/llama-server \
@@ -428,13 +475,31 @@ CUDA error: an illegal memory access was encountered
---
-👤 **Panchovix** commented the **2025-05-06** at **15:26:40**:
+👤 **ikawrakow** commented on **2025-05-06** at **15:14:00**
+
+@Lissanro
+
+I'm noticing that there is an issue for MoE models when the partial GPU offload is not done via tensor overrides (as in the command above). I'll try to figure out what is wrong, but in the meantime can you try this:
+```
+~/pkgs/ik_llama.cpp/build/bin/llama-server
+--model /mnt/secondary/neuro/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF-UD-Q4_K_XL-131072seq/Llama-3_1-Nemotron-Ultra-253B-v1-UD-Q4_K_XL-00001-of-00004.gguf
+--ctx-size 81920 --n-gpu-layers 100 --tensor-split 25,25,25,25
+-fa -ctk q8_0 -ctv q8_0 --threads 64 --host 0.0.0.0 --port 5000 -fmoe
+-ot "blk\.3[2-9]\.ffn=CPU,blk\.[4-9][0-9]\.ffn=CPU"
+```
+This is similar to what you tried, but it will load all attention on the GPUs along with the first 32 layers of the experts, the remaining experts will be on the CPU. Not sure about the context, you may need to reduce it somewhat.
+
+Thanks!
+
+---
+
+👤 **Panchovix** commented on **2025-05-06** at **15:26:40**
Correct me if I'm wrong but isn't nemotron 253B a dense model? So no experts and such
---
-👤 **ikawrakow** commented the **2025-05-06** at **15:35:26**:
+👤 **ikawrakow** commented on **2025-05-06** at **15:35:26**
> Correct me if I'm wrong but isn't nemotron 253B a dense model? So no experts and such
@@ -442,7 +507,19 @@ Oops, I'm getting confused. Doing too many things at a time. Not sure then why p
---
-👤 **saood06** commented the **2025-05-07** at **01:47:18**:
+👤 **ikawrakow** commented on **2025-05-06** at **15:50:42**
+
+> Interesting, do you mind checking with gguf-hash or some other tool if that one changed tensor is the only difference? I am curious to know why this PR does one tensor less of f32 than mainline.
+
+I used `gguf-dump.py`, and the missing tensor is `rope_freqs`.
+
+Hashes are identical.
+
+The other difference is that ours is `general.file_type = 24`, while theirs is `general.file_type = 32`. I don't know what that means.
+
+---
+
+👤 **saood06** commented on **2025-05-07** at **01:47:18**
> I used `gguf-dump.py`, and the missing tensor is `rope_freqs`.
@@ -456,19 +533,19 @@ This one I understand, they both map to MOSTLY_BF16 ([ik_llama.cpp source](https
---
-👤 **Lissanro** commented the **2025-05-07** at **20:37:15**:
+👤 **Lissanro** commented on **2025-05-07** at **20:37:15**
If there is still something I need to test, please let me know (my understanding the last command was given under assumption it was MoE, but since it is a dense model, I assume I need some other command to test or maybe I already provided all debug info that is possible from my side). In any case, thank you very much for looking into this.
---
-👤 **ikawrakow** commented the **2025-05-09** at **07:09:55**:
+👤 **ikawrakow** commented on **2025-05-09** at **07:09:55**
I think I'll merge this one despite the missing `rope_freqs` tensors. We can try to sort out later why is it missing if people find performance degradation with long context.
---
-👤 **saood06** commented the **2025-05-09** at **07:54:55**:
+👤 **saood06** commented on **2025-05-09** at **07:54:55**
> I think I'll merge this one despite the missing `rope_freqs` tensors. We can try to sort out later why is it missing if people find performance degradation with long context.
diff --git a/github-data/pull_requests/38 - Zen4 Flash Attention - bf16 support.md b/github-data/pull_requests/38 - Zen4 Flash Attention - bf16 support.md
index db80944b2..4d6f8f348 100644
--- a/github-data/pull_requests/38 - Zen4 Flash Attention - bf16 support.md
+++ b/github-data/pull_requests/38 - Zen4 Flash Attention - bf16 support.md
@@ -1,14 +1,17 @@
-### 🔀 [#38](https://github.com/ikawrakow/ik_llama.cpp/pull/38) - Zen4 Flash Attention - bf16 support
+## 🔀 [Pull Request #38](https://github.com/ikawrakow/ik_llama.cpp/pull/38) - Zen4 Flash Attention - bf16 support
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/zen4_flash_attn_bf16` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-04 |
| **Updated** | 2024-09-05 |
+| **Merged** | 2024-09-05 |
---
-#### Description
+## 📄 Description
This PR adds support for using `bf16` for the kv-cache.
diff --git a/github-data/pull_requests/382 - Fix DeepSeek FA.md b/github-data/pull_requests/382 - Fix DeepSeek FA.md
index 9b686517e..05c7d5996 100644
--- a/github-data/pull_requests/382 - Fix DeepSeek FA.md
+++ b/github-data/pull_requests/382 - Fix DeepSeek FA.md
@@ -1,13 +1,16 @@
-### 🐛 [#382](https://github.com/ikawrakow/ik_llama.cpp/pull/382) - Fix DeepSeek FA
+## 🔀 [Pull Request #382](https://github.com/ikawrakow/ik_llama.cpp/pull/382) - Fix DeepSeek FA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_deepseek_fattn` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-05 |
| **Updated** | 2025-05-05 |
+| **Merged** | 2025-05-05 |
---
-#### Description
+## 📄 Description
-PR #370 broke it. Too many things to test.
\ No newline at end of file
+PR [#370](https://github.com/ikawrakow/ik_llama.cpp/issues/370) broke it. Too many things to test.
\ No newline at end of file
diff --git a/github-data/pull_requests/386 - FlashMLA-3 for DeepSeek models on CUDA.md b/github-data/pull_requests/386 - FlashMLA-3 for DeepSeek models on CUDA.md
index 9a725f1ac..3ba0d83d1 100644
--- a/github-data/pull_requests/386 - FlashMLA-3 for DeepSeek models on CUDA.md
+++ b/github-data/pull_requests/386 - FlashMLA-3 for DeepSeek models on CUDA.md
@@ -1,14 +1,17 @@
-### 🔀 [#386](https://github.com/ikawrakow/ik_llama.cpp/pull/386) - FlashMLA-3 for DeepSeek models on CUDA
+## 🔀 [Pull Request #386](https://github.com/ikawrakow/ik_llama.cpp/pull/386) - FlashMLA-3 for DeepSeek models on CUDA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_flash_mla3` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-06 |
| **Updated** | 2025-05-10 |
+| **Merged** | 2025-05-07 |
---
-#### Description
+## 📄 Description
[This PR](https://github.com/ggml-org/llama.cpp/pull/13306) in mainline `llama.cpp` is a CUDA flash attention (FA) implementation that also works with K head size of 576 and V head size of 512 as required for DeepSeek models with MLA enabled. **Caveat: it only works on Ampere or newer Nvidia GPUs**.
@@ -75,15 +78,15 @@ Testing with DeepSeek-V3/R1 will be greatly appreciated. Very few can run these
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **infy-infy** commented the **2025-05-07** at **14:31:37**:
+👤 **infy-infy** commented on **2025-05-07** at **14:31:37**
Will `-mla 3 -fa` work in mixed cpu+multigpu setup with Amperes and Pascals? Or it would be better to continue use `-mla 2 -fa`? I mean, maybe `-mla 3 -fa` will use some fallback for old cards and it would still be better than `-mla 2 -fa`
---
-👤 **ikawrakow** commented the **2025-05-07** at **14:36:26**:
+👤 **ikawrakow** commented on **2025-05-07** at **14:36:26**
> Will -mla 3 -fa work in mixed cpu+multigpu setup with Amperes and Pascals?
@@ -92,7 +95,7 @@ There is no fallback, and I'm not sure if I have put enough checks to prevent ev
---
-👤 **Ph0rk0z** commented the **2025-05-07** at **18:38:22**:
+👤 **Ph0rk0z** commented on **2025-05-07** at **18:38:22**
MLA 3 has faster sweep bench speeds for me but unfortunately deepseek 2.5 goes aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
@@ -100,7 +103,7 @@ MLA 2 works.
---
-👤 **ubergarm** commented the **2025-05-08** at **02:09:57**:
+👤 **ubergarm** commented on **2025-05-08** at **02:09:57**
I gave this a very quick try though the model doesn't fit in VRAM+RAM so pulls almost 6GB/s paging of a Gen5 PCIe NVME drive. This is a 3090TI FE 24GB VRAM GPU.
@@ -174,7 +177,7 @@ We can see that the string is 8, 8, 2, , 0, 0, 0, 0, 8, 7, 1, 1, 0,0, 0, ^C
---
-👤 **ikawrakow** commented the **2025-05-08** at **04:34:09**:
+👤 **ikawrakow** commented on **2025-05-08** at **04:34:09**
OK, thanks for testing. Here is what I get with DeepSeek-Lite for @ubergarm's quantions:
```
@@ -208,7 +211,7 @@ The difference is that Lite has 16 heads, while the big models have 128. So, I g
---
-👤 **ikawrakow** commented the **2025-05-08** at **07:29:24**:
+👤 **ikawrakow** commented on **2025-05-08** at **07:29:24**
To be honest, I don't understand the failure.
@@ -233,19 +236,31 @@ So, based on observations, when we use 192,128 CUDA kernel for PP and 576,512 CU
---
-👤 **Ph0rk0z** commented the **2025-05-08** at **12:05:17**:
+👤 **Ph0rk0z** commented on **2025-05-08** at **12:05:17**
How many heads does 2.5 have? Maybe there is some difference. It's easier to run and more like qwen in size. I will have to check the MLA 1 output, could be bug in FA. Also had some crash in MLA 2 after using it a while but haven't reproduced yet.
---
-👤 **Ph0rk0z** commented the **2025-05-08** at **14:22:22**:
+👤 **ikawrakow** commented on **2025-05-08** at **12:48:38**
+
+> How many heads does 2.5 have? Maybe there is some difference.
+
+Should be the same as V3 with 128 heads.
+
+> It's easier to run and more like qwen in size.
+
+I know. But the DeepSeek models are the only MLA models around. I have verified it works with DeepSeek-Lite (16B parameters, 16 heads). The next step up with MLA are the giant DeepSeek models that I cannot run myself.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-05-08** at **14:22:22**
Looks like my theory was correct. On my system MLA 1 also produces issues, probably as soon as FA kicks in. May start out coherent for the first bit of tokens and then descends intooooooooooooooooooosddkkkkkkkkasd
---
-👤 **Panchovix** commented the **2025-05-08** at **14:39:38**:
+👤 **Panchovix** commented on **2025-05-08** at **14:39:38**
I can test on ikllamacpp in some hours if I can replicate on deepseek v3 0324 (I'm not home right now)
@@ -253,15 +268,21 @@ On main llamacpp I tested up to 64K CTX and it was working fine with the PR. If
---
-👤 **ikawrakow** commented the **2025-05-08** at **14:50:33**:
+👤 **ikawrakow** commented on **2025-05-08** at **14:50:33**
> On main llamacpp I tested up to 64K CTX and it was working fine with the PR. If I understand correctly I have to use the latest quants and then use -mla 3 -fa? Main llamacpp uses -mla 2 -fa equivalent?
-The mainline `llama.cpp` MLA implementation corresponds to `-mla 1` here. With this it wasn't possible to use flash attention on CUDA in the past, and it became possible with this PR and PR 13306 in mainline. If you use the latest quants that enable MLA in mainline, you require the not yet merged PR #394 that enables support for these incompatible models. Otherwise, you need to use an older model that does not allow MLA in mainline `llama.cpp`.
+The mainline `llama.cpp` MLA implementation corresponds to `-mla 1` here. With this it wasn't possible to use flash attention on CUDA in the past, and it became possible with this PR and PR 13306 in mainline. If you use the latest quants that enable MLA in mainline, you require the not yet merged PR [#394](https://github.com/ikawrakow/ik_llama.cpp/issues/394) that enables support for these incompatible models. Otherwise, you need to use an older model that does not allow MLA in mainline `llama.cpp`.
---
-👤 **ikawrakow** commented the **2025-05-08** at **14:54:39**:
+👤 **Ph0rk0z** commented on **2025-05-08** at **14:53:50**
+
+I have no v3/r1 quants yet so it may very well work there but not in 2.5.
+
+---
+
+👤 **ikawrakow** commented on **2025-05-08** at **14:54:39**
> Looks like my theory was correct. On my system MLA 1 also produces issues, probably as soon as FA kicks in. May start out coherent for the first bit of tokens and then descends intooooooooooooooooooosddkkkkkkkkasd
@@ -271,13 +292,19 @@ Then the conclusion would be that I introduced a bug when porting the mainline P
---
-👤 **ikawrakow** commented the **2025-05-08** at **15:11:44**:
+👤 **Ph0rk0z** commented on **2025-05-08** at **15:03:00**
+
+I can probably download deepseek lite since it's small. This is what I'm running: https://huggingface.co/bartowski/DeepSeek-V2.5-1210-GGUF/tree/main/DeepSeek-V2.5-1210-IQ4_XS
+
+---
+
+👤 **ikawrakow** commented on **2025-05-08** at **15:11:44**
That would work as a test.
---
-👤 **Panchovix** commented the **2025-05-08** at **18:19:57**:
+👤 **Panchovix** commented on **2025-05-08** at **18:19:57**
I just tried to load DeepSeek V3 Q2_K_XL but I get an issue on latest commit. This happens with both -mla 2 -fa and -mla 3 -fa. Not sure if I'm setting a parameter wrongly.
@@ -443,19 +470,19 @@ Segmentation fault (core dumped)
---
-👤 **ikawrakow** commented the **2025-05-08** at **19:08:29**:
+👤 **ikawrakow** commented on **2025-05-08** at **19:08:29**
-@Panchovix You are using a GGUF made for mainline llama.cpp MLA. As I wrote above, you need PR #394, which is an attempt to fix the incompatibility.
+@Panchovix You are using a GGUF made for mainline llama.cpp MLA. As I wrote above, you need PR [#394](https://github.com/ikawrakow/ik_llama.cpp/issues/394), which is an attempt to fix the incompatibility.
---
-👤 **Panchovix** commented the **2025-05-08** at **19:09:38**:
+👤 **Panchovix** commented on **2025-05-08** at **19:09:38**
@ikawrakow ah I'm dumb, thanks! Haha gonna try the PR.
---
-👤 **Ph0rk0z** commented the **2025-05-08** at **23:53:02**:
+👤 **Ph0rk0z** commented on **2025-05-08** at **23:53:02**
Ok.. baby deepseek v2.0-chat, the ~16b one, right? Sort of inconclusive results.
@@ -464,11 +491,13 @@ MLA 2/3 crash with 8bit cache https://pastebin.com/0mkrcZwE
MLA 2/3 + FP16 cache do not exhibit too many issues from a quick test.
-These quants are months and months old so I'm not sure if anything is wrong with them, I also used IQ4_XS
+These quants are months and months old so I'm not sure if anything is wrong with them, I also used IQ4_XS
+
+re-ran v2.5 without 8bit cache and mla3 works now.
---
-👤 **ikawrakow** commented the **2025-05-09** at **05:41:54**:
+👤 **ikawrakow** commented on **2025-05-09** at **05:41:54**
I just tested [this model](https://huggingface.co/bartowski/DeepSeek-V2.5-1210-GGUF/tree/main/DeepSeek-V2.5-1210-IQ3_XXS), which is near the maximum size I can go. Seems to work perfectly fine with `fp16` KV cache:
```
@@ -1154,15 +1183,15 @@ But yes, `q8_0` KV cache is broken. I'll investigate.
---
-👤 **ikawrakow** commented the **2025-05-09** at **07:05:34**:
+👤 **ikawrakow** commented on **2025-05-09** at **07:05:34**
-OK, PR #400 should fix quantized KV cache.
+OK, PR [#400](https://github.com/ikawrakow/ik_llama.cpp/issues/400) should fix quantized KV cache.
---
-👤 **ubergarm** commented the **2025-05-09** at **16:11:48**:
+👤 **ubergarm** commented on **2025-05-09** at **16:11:48**
-> OK, PR #400 should fix quantized KV cache.
+> OK, PR [#400](https://github.com/ikawrakow/ik_llama.cpp/issues/400) should fix quantized KV cache.
Yes this seems to work in my quick testing of big DeepSeek-R1-IQ2_K_R4 hybrid CPU+GPU on my local rig for both `-mla 2` and `-mla 3` e.g.
```
@@ -1180,7 +1209,7 @@ Thanks for working through all the combinations!
---
-👤 **ikawrakow** commented the **2025-05-09** at **16:19:29**:
+👤 **ikawrakow** commented on **2025-05-09** at **16:19:29**
Thanks for testing.
@@ -1188,7 +1217,7 @@ I'm not sure if the `DDDDDD` is an actual bug. It is a low bit quantization, and
---
-👤 **saood06** commented the **2025-05-09** at **19:28:45**:
+👤 **saood06** commented on **2025-05-09** at **19:28:45**
> However, I noticed for both `-mla 2` and `-mla 3` in combination with `-ser 6,1`, it seems to work okay for short prompts like `Count from 1 to 10 in French`, but for longer ~600 token prompts it will throw `DDDDDDDD` again. Not a priority, I only use `-ser` if I'm desperate and can't access a remote rig!
@@ -1196,24 +1225,24 @@ I've never gotten `-ser` to work for me when loading a long context session (but
---
-👤 **ikawrakow** commented the **2025-05-10** at **09:13:51**:
+👤 **ikawrakow** commented on **2025-05-10** at **09:13:51**
> > However, I noticed for both `-mla 2` and `-mla 3` in combination with `-ser 6,1`, it seems to work okay for short prompts like `Count from 1 to 10 in French`, but for longer ~600 token prompts it will throw `DDDDDDDD` again. Not a priority, I only use `-ser` if I'm desperate and can't access a remote rig!
>
> I've never gotten `-ser` to work for me when loading a long context session (but I haven't really tried it in any other situation). I've never opened an issue as I've never taken the time to produce a minimally reproducible example.
-SER should hopefully work correctly now, see PR #404
+SER should hopefully work correctly now, see PR [#404](https://github.com/ikawrakow/ik_llama.cpp/issues/404)
---
-👤 **ubergarm** commented the **2025-05-10** at **16:19:20**:
+👤 **ubergarm** commented on **2025-05-10** at **16:19:20**
> > > However, I noticed for both `-mla 2` and `-mla 3` in combination with `-ser 6,1`, it seems to work okay for short prompts like `Count from 1 to 10 in French`, but for longer ~600 token prompts it will throw `DDDDDDDD` again. Not a priority, I only use `-ser` if I'm desperate and can't access a remote rig!
> >
> >
> > I've never gotten `-ser` to work for me when loading a long context session (but I haven't really tried it in any other situation). I've never opened an issue as I've never taken the time to produce a minimally reproducible example.
>
-> SER should hopefully work correctly now, see PR #404
+> SER should hopefully work correctly now, see PR [#404](https://github.com/ikawrakow/ik_llama.cpp/issues/404)
I just tried out PR404 which is now `main@a2d24c97`, but still seeing it reply `DDDDD` for longer contexts when using `-ser 6,1`.
@@ -1296,21 +1325,24 @@ No stack.
The program is not being run.
```
-
+
+
+It could be this model is too small, all attention layers are Q8_0 for GPU, and for CPU ffn_down is IQ3_K_R4, ffn_(gate|up) are IQ2_K_R4.
-It could be this model is too small, all attention layers are Q8_0` for GPU, and for CPU ffn_down is IQ3_K_R4, ffn_(gate|up) are IQ2_K_R4.
+Still works okay without `-ser 6,1`. I also tried removing `-fa` when testing ser and also threw DDDD.
-Still works okay without `-ser 6,1`. I also tried removing `-fa` when testing ser and also threw DDDD.
+*EDIT*
+I did a couple more tests and `-ser 6,1` seems to work with about ~200 token prompt, but it breaks and replies with DDDDDDD at ~300 tokens context prompt.
---
-👤 **Ph0rk0z** commented the **2025-05-10** at **16:43:00**:
+👤 **Ph0rk0z** commented on **2025-05-10** at **16:43:00**
Deepseek 2.5 seems to work with q_8, tg/pp is slightly faster than F16. Unfortunately sometimes a GPU gets stuck at 100% in task manager and the bench or server halts then sits. GPU power draw not consistent with 100% usage of course. It could be due to my undervolting or something else? F16 sweep completes successfully and is definitely "heavier" on resources so I'm not sure anymore.
---
-👤 **ikawrakow** commented the **2025-05-10** at **18:26:05**:
+👤 **ikawrakow** commented on **2025-05-10** at **18:26:05**
> Still works okay without -ser 6,1. I also tried removing -fa when testing ser and also threw DDDD.
@@ -1318,7 +1350,7 @@ OK, thanks. The PR fixes things for me, but it seems there is still a bug lurkin
---
-👤 **ikawrakow** commented the **2025-05-10** at **18:31:16**:
+👤 **ikawrakow** commented on **2025-05-10** at **18:31:16**
> Unfortunately sometimes a GPU gets stuck at 100% in task manager and the bench or server halts then sits.
@@ -1326,7 +1358,7 @@ There have been reports about problems with FA also in mainline. As I took the D
---
-👤 **Ph0rk0z** commented the **2025-05-10** at **21:14:52**:
+👤 **Ph0rk0z** commented on **2025-05-10** at **21:14:52**
Now that you mention it, that's the kind of error I'd get on llama-server. It would eventually fail and segfault there with synchronization listed as the fault. I assumed it was due to me undervolting. Setting lower max gpu clock along with the clock offset (only way to do it on linux) caused it to happen less often.
diff --git a/github-data/pull_requests/39 - Add support for bf16 to iqk_mul_mat.md b/github-data/pull_requests/39 - Add support for bf16 to iqk_mul_mat.md
index fa0bea835..643da9fb3 100644
--- a/github-data/pull_requests/39 - Add support for bf16 to iqk_mul_mat.md
+++ b/github-data/pull_requests/39 - Add support for bf16 to iqk_mul_mat.md
@@ -1,14 +1,17 @@
-### 🔀 [#39](https://github.com/ikawrakow/ik_llama.cpp/pull/39) - Add support for bf16 to iqk_mul_mat
+## 🔀 [Pull Request #39](https://github.com/ikawrakow/ik_llama.cpp/pull/39) - Add support for bf16 to iqk_mul_mat
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mul_mat_bf16` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-04 |
| **Updated** | 2024-09-05 |
+| **Merged** | 2024-09-05 |
---
-#### Description
+## 📄 Description
Only when natively supported (e.g., Zen4), else left to `ggml` to handle.
diff --git a/github-data/pull_requests/390 - Fix build for Xeon Gold 6226R.md b/github-data/pull_requests/390 - Fix build for Xeon Gold 6226R.md
index ea81b1113..7608c787f 100644
--- a/github-data/pull_requests/390 - Fix build for Xeon Gold 6226R.md
+++ b/github-data/pull_requests/390 - Fix build for Xeon Gold 6226R.md
@@ -1,14 +1,17 @@
-### 🐛 [#390](https://github.com/ikawrakow/ik_llama.cpp/pull/390) - Fix build for Xeon Gold 6226R
+## 🔀 [Pull Request #390](https://github.com/ikawrakow/ik_llama.cpp/pull/390) - Fix build for Xeon Gold 6226R
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_xeon_6226R` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-07 |
| **Updated** | 2025-05-08 |
+| **Merged** | 2025-05-07 |
---
-#### Description
+## 📄 Description
I got access to a Xeon Gold 6226R system. The PR fixes the compilation errors due to this CPU supporting all `AVX512` extensions necessary to define `HAVE_FANCY_SIMD`, but does not support SIMD `popcnt`.
@@ -33,15 +36,15 @@ I guess, the lower performance for the first entry in the table is due to the sy
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Ph0rk0z** commented the **2025-05-07** at **13:39:54**:
+👤 **Ph0rk0z** commented on **2025-05-07** at **13:39:54**
I have generation prior to this chip. If you set bios to have 1 numa per CPU, the best results are from --numa distribute. Messing with numactl and interleave gives worse results across the board regardless of what the warning when you run says.
---
-👤 **ikawrakow** commented the **2025-05-07** at **13:45:05**:
+👤 **ikawrakow** commented on **2025-05-07** at **13:45:05**
> I have generation prior to this chip. If you set bios to have 1 numa per CPU, the best results are from --numa distribute. Messing with numactl and interleave gives worse results across the board regardless of what the warning when you run says.
@@ -51,13 +54,13 @@ Unfortunately I don't have physical access to the box (it belongs to somebody el
---
-👤 **Ph0rk0z** commented the **2025-05-07** at **14:43:33**:
+👤 **Ph0rk0z** commented on **2025-05-07** at **14:43:33**
Run with numa distribute and see if your benchie goes up. I might buy 8260 es since they're cheap. Does the extra AVX512-VNNI really help much?
---
-👤 **ikawrakow** commented the **2025-05-07** at **15:15:42**:
+👤 **ikawrakow** commented on **2025-05-07** at **15:15:42**
> Does the extra AVX512-VNNI really help much?
@@ -67,13 +70,13 @@ But it does make a difference for prompt processing. I get about the same PP spe
---
-👤 **Ph0rk0z** commented the **2025-05-07** at **15:55:30**:
+👤 **Ph0rk0z** commented on **2025-05-07** at **15:55:30**
Thanks. I already have AVX-512 but I guess my prompt processing will see a slight boost and of course I can upgrade my memory. With 6 channel 2400mt/s I only get 180GB which is a 30% haircut per proc from theoretical.
---
-👤 **ikawrakow** commented the **2025-05-07** at **16:12:52**:
+👤 **ikawrakow** commented on **2025-05-07** at **16:12:52**
> Thanks. I already have AVX-512 but
@@ -81,7 +84,7 @@ I haven't done a fine-gained implementation depending on `AVX512` extensions ava
---
-👤 **gereoffy** commented the **2025-05-07** at **16:41:26**:
+👤 **gereoffy** commented on **2025-05-07** at **16:41:26**
> > I have generation prior to this chip. If you set bios to have 1 numa per CPU, the best results are from --numa distribute. Messing with numactl and interleave gives worse results across the board regardless of what the warning when you run says.
>
@@ -93,7 +96,30 @@ hi! that box is mine, i can give you DRAC access so it's almost the phisical acc
---
-👤 **gereoffy** commented the **2025-05-07** at **17:16:49**:
+👤 **ikawrakow** commented on **2025-05-07** at **17:00:26**
+
+> hi! that box is mine, i can give you DRAC access so it's almost the phisical access except that you cannot kick the box :) anyway thanks for fixing compile!
+
+Oh, hi, nice to meet you virtually! And thanks for letting me use your box, it has been very helpful. Hope I didn't annoy you too much by running a lot of benchmarks.
+
+DRAC will give me access to the BIOS? But I'm not sure I want to do with it as none of the nodes has enough RAM to fit the DeepSeek models, so I need to use both CPUs. But perhaps someone more experienced with these things can give me more detailed suggestions. Here is what I see with `numactl --hardware`:
+```
+available: 2 nodes (0-1)
+node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
+node 0 size: 385565 MB
+node 0 free: 93568 MB
+node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63
+node 1 size: 193521 MB
+node 1 free: 25592 MB
+node distances:
+node 0 1
+ 0: 10 21
+ 1: 21 10
+```
+
+---
+
+👤 **gereoffy** commented on **2025-05-07** at **17:16:49**
> Oh, hi, nice to meet you virtually! And thanks for letting me use your box, it has been very helpful. Hope I didn't annoy you too much by running a lot of benchmarks.
no problem at all! this is a test/dev system...
@@ -109,22 +135,8 @@ is it possible to run model somehow splitted, and running each part of the model
---
-👤 **Ph0rk0z** commented the **2025-05-07** at **18:37:04**:
+👤 **Ph0rk0z** commented on **2025-05-07** at **18:37:04**
Pass --numa distribute, it splits the memory between both CPU evenly. I think all numa stuff here and main is the same. You can also put it on one node only.. ie the one you launch from.
-When I did tests I didn't have llama-sweep-bench so maybe worth trying again? I simply used both gemma/llama 3 70b and checked generation speed.
-
----
-
-👤 **Gaolingx** commented the **2025-05-07** at **18:41:35**:
-
-thank you for fixing it. when I run llama-server with `-fa` and `-rtr` parameter, the speed is a little faster than only use `-fa`, the prefill and decode are increased, That is a good beginning!
-
-`-c 8192 -t 16 -fa`:
-INFO [ print_timings] prompt eval time = 6958.30 ms / 36 tokens ( 193.29 ms per token, 5.17 tokens per second) | tid="52596" timestamp=1746491529 id_slot=0 id_task=31856 t_prompt_processing=6958.3 n_prompt_tokens_processed=36 t_token=193.28611111111113 n_tokens_second=5.173677478694509
-INFO [ print_timings] generation eval time = 617799.88 ms / 1700 runs ( 363.41 ms per token, 2.75 tokens per second) | tid="52596" timestamp=1746491529 id_slot=0 id_task=31856 t_token_generation=617799.884 n_decoded=1700 t_token=363.4116964705882 n_tokens_second=2.7517000958193774
-
-`-c 8192 -t 16 -fa -rtr`:
-INFO [ print_timings] prompt eval time = 11499.35 ms / 148 tokens ( 77.70 ms per token, 12.87 tokens per second) | tid="66164" timestamp=1746643229 id_slot=0 id_task=859 t_prompt_processing=11499.349 n_prompt_tokens_processed=148 t_token=77.69830405405405 n_tokens_second=12.8702937879353
-INFO [ print_timings] generation eval time = 755894.69 ms / 2074 runs ( 364.46 ms per token, 2.74 tokens per second) | tid="66164" timestamp=1746643229 id_slot=0 id_task=859 t_token_generation=755894.69 n_decoded=2074 t_token=364.4622420443587 n_tokens_second=2.7437684474275117
\ No newline at end of file
+When I did tests I didn't have llama-sweep-bench so maybe worth trying again? I simply used both gemma/llama 3 70b and checked generation speed.
\ No newline at end of file
diff --git a/github-data/pull_requests/391 - Fix DeepSeek q8_0 cache.md b/github-data/pull_requests/391 - Fix DeepSeek q8_0 cache.md
index dd9be220b..c4f7ddd36 100644
--- a/github-data/pull_requests/391 - Fix DeepSeek q8_0 cache.md
+++ b/github-data/pull_requests/391 - Fix DeepSeek q8_0 cache.md
@@ -1,17 +1,20 @@
-### 🐛 [#391](https://github.com/ikawrakow/ik_llama.cpp/pull/391) - Fix DeepSeek q8_0 cache
+## 🔀 [Pull Request #391](https://github.com/ikawrakow/ik_llama.cpp/pull/391) - Fix DeepSeek q8_0 cache
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_deepseek_q80_cache` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-07 |
| **Updated** | 2025-05-07 |
+| **Merged** | 2025-05-07 |
---
-#### Description
+## 📄 Description
-Nobody has used `ik_llama.cpp` with a DeepSeek model and `Q8_0` KV cache since PR #351?
+Nobody has used `ik_llama.cpp` with a DeepSeek model and `Q8_0` KV cache since PR [#351](https://github.com/ikawrakow/ik_llama.cpp/issues/351)?
This PR fixes the assert one gets when one tries to use a DeepSeek model on the CPU using `Q8_0` KV cache.
-Also, it seems the optimization I added in #351 to repack the `K` cache to `Q8_0_R8` seems to lower TG performance for DeepSeek models, so disabling it.
\ No newline at end of file
+Also, it seems the optimization I added in [#351](https://github.com/ikawrakow/ik_llama.cpp/issues/351) to repack the `K` cache to `Q8_0_R8` seems to lower TG performance for DeepSeek models, so disabling it.
\ No newline at end of file
diff --git a/github-data/pull_requests/392 - fix some MSVC build problem..md b/github-data/pull_requests/392 - fix some MSVC build problem.md
similarity index 56%
rename from github-data/pull_requests/392 - fix some MSVC build problem..md
rename to github-data/pull_requests/392 - fix some MSVC build problem.md
index 0fcdd8fb1..cf96941b2 100644
--- a/github-data/pull_requests/392 - fix some MSVC build problem..md
+++ b/github-data/pull_requests/392 - fix some MSVC build problem.md
@@ -1,14 +1,17 @@
-### 🐛 [#392](https://github.com/ikawrakow/ik_llama.cpp/pull/392) - fix some MSVC build problem.
+## 🔀 [Pull Request #392](https://github.com/ikawrakow/ik_llama.cpp/pull/392) - fix some MSVC build problem.
| **Author** | `Gaolingx` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `main` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-07 |
| **Updated** | 2025-05-07 |
+| **Merged** | 2025-05-07 |
---
-#### Description
+## 📄 Description
fix some MSVC build problem.
From PR :
@@ -26,20 +29,16 @@ Build Result:
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-05-07** at **12:31:01**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-05-07** at **12:31:01** on `CMakeLists.txt`:
+👤 **ikawrakow** started a conversation on `CMakeLists.txt` on **2025-05-07** at **12:31:01**
Why are you deleting these? As a `vim` user they are essential for my CUDA editing experience.
----
-
-👤 **Gaolingx** submitted a review the **2025-05-07** at **12:37:49**: 💬 `COMMENTED`
+> 👤 **Gaolingx** replied on **2025-05-07** at **12:37:49**
+>
+> sorry, I don't know what happened deleting these, I forget revert these after build.
---
-👤 **ikawrakow** submitted a review the **2025-05-07** at **12:47:42**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-07** at **12:47:42**
\ No newline at end of file
diff --git a/github-data/pull_requests/394 - Handle incompatible DeepSeek GGUFs.md b/github-data/pull_requests/394 - Handle incompatible DeepSeek GGUFs.md
index e3f871937..c76ba0fa8 100644
--- a/github-data/pull_requests/394 - Handle incompatible DeepSeek GGUFs.md
+++ b/github-data/pull_requests/394 - Handle incompatible DeepSeek GGUFs.md
@@ -1,16 +1,19 @@
-### 🔀 [#394](https://github.com/ikawrakow/ik_llama.cpp/pull/394) - Handle incompatible DeepSeek GGUFs
+## 🔀 [Pull Request #394](https://github.com/ikawrakow/ik_llama.cpp/pull/394) - Handle incompatible DeepSeek GGUFs
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/handle_incompatible_deepseek_ggufs` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-07 |
| **Updated** | 2025-05-10 |
+| **Merged** | 2025-05-09 |
---
-#### Description
+## 📄 Description
-Mainline `llama.cpp` [PR 12801](https://github.com/ggml-org/llama.cpp/pull/12801), which added MLA support for DeepSeek models 2.5 months after MLA was available here, broke backwards compatibility. As a result, the new DeepSeek GGUFs that started appearing on HF are not compatible with `ik_llama.cpp`, resulting in issues #373 and #383.
+Mainline `llama.cpp` [PR 12801](https://github.com/ggml-org/llama.cpp/pull/12801), which added MLA support for DeepSeek models 2.5 months after MLA was available here, broke backwards compatibility. As a result, the new DeepSeek GGUFs that started appearing on HF are not compatible with `ik_llama.cpp`, resulting in issues [#373](https://github.com/ikawrakow/ik_llama.cpp/issues/373) and [#383](https://github.com/ikawrakow/ik_llama.cpp/issues/383).
My initial reaction was to not support the new DeepSeek GGUFs, as there was no real reason to introduce the backwards incompatibility (and have people re-download the giant DeepSeek-R1/V3 models). The two new tensors (per layer) required for MLA can be easily created on-the-fly when loading the model as it is done here.
@@ -24,9 +27,9 @@ I have tested with DeepSeek-Lite, which uses the exact same attention architectu
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **whatever1983** commented the **2025-05-09** at **05:36:06**:
+👤 **whatever1983** commented on **2025-05-09** at **05:36:06**
python convert_hf_to_gguf.py --outfile /mydata/Downloads/DeepSeek-V3-0324-Pruned-Coder-411B-q8_0-ik.gguf --outtype q8_0 /mydata/Downloads/DeepSeek-V3-0324-Pruned-Coder-411B/
@@ -43,15 +46,15 @@ I would rather have convert_hf_to_gguf.py from the ik_llama.cpp repo work.
---
-👤 **ikawrakow** commented the **2025-05-09** at **05:47:47**:
+👤 **ikawrakow** commented on **2025-05-09** at **05:47:47**
> WARNING:gguf.vocab:Adding merges requested but no merges found, output may be non-functional.
-Yes, the `convert_hf_to_gguf.py` script currently on master does not handle merges well. There is a fix in PR #377, but I haven't merged because for some reason it misses the `rope_scaling` tensor, and we have not understood why.
+Yes, the `convert_hf_to_gguf.py` script currently on master does not handle merges well. There is a fix in PR [#377](https://github.com/ikawrakow/ik_llama.cpp/issues/377), but I haven't merged because for some reason it misses the `rope_scaling` tensor, and we have not understood why.
---
-👤 **Panchovix** commented the **2025-05-09** at **17:24:31**:
+👤 **Panchovix** commented on **2025-05-09** at **17:24:31**
I'm testing now! With DeepSeekV3 0324 Q2_K_XL latest quant, on 128GB VRAM (5090+4090x2+A6000) and 192GB RAM (6000Mhz 7800X3D). But first I just noticed this
@@ -142,7 +145,7 @@ So I can confirm latest quants with MLA works on ik llamacpp.
---
-👤 **ikawrakow** commented the **2025-05-09** at **19:00:28**:
+👤 **ikawrakow** commented on **2025-05-09** at **19:00:28**
@Panchovix Thanks for testing!
@@ -152,7 +155,7 @@ If you post your `llama.cpp` command here, perhaps we can give you suggestions h
---
-👤 **Panchovix** commented the **2025-05-09** at **19:11:22**:
+👤 **Panchovix** commented on **2025-05-09** at **19:11:22**
> @Panchovix Thanks for testing!
>
@@ -160,7 +163,7 @@ If you post your `llama.cpp` command here, perhaps we can give you suggestions h
>
> If you post your `llama.cpp` command here, perhaps we can give you suggestions how you can improve it for `ik_llama.cpp`.
-Had to modify it as I use -fa on main llamacpp and I think this PR was done before fa + mla was possible on main. The compute buffers on FA were 3.7 GB and then 400mb each, while here it was 4.5GB each buffer (which is near 1 tensor per GPU)
+Had to modify it as I use -fa on main llamacpp and I think this PR was done before fa + mla was possible on main. The compute buffers on FA were 3.7 GB and then 400mb each on main llamacpp, while here on ik llamacpp it was 4.5GB each buffer (which is near 1 tensor per GPU) without fa.
My command on main is
@@ -173,9 +176,9 @@ Sometimes it tries to load on CPU first, but I cancel and start it again until i
---
-👤 **ikawrakow** commented the **2025-05-09** at **19:24:18**:
+👤 **ikawrakow** commented on **2025-05-09** at **19:24:18**
-I have merged this PR. If you take the current main main branch and try
+I have merged this PR. If you take the current `ik_llama.cpp` main branch and try
```
./llama-server -m '/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 16384 --no-mmap --no-warmup -v -ngl 99
--override-tensor 'blk\.([0-7])\..*_exps\.=CUDA0'
@@ -192,7 +195,7 @@ As `llama.cpp` still stores a V cache, your should have some extra space to perh
---
-👤 **Panchovix** commented the **2025-05-09** at **19:47:23**:
+👤 **Panchovix** commented on **2025-05-09** at **19:47:23**
Thanks! I went ahead and test, this is the output
@@ -584,7 +587,7 @@ llm_load_tensors: CUDA3 buffer size = 42786.36 MiB
It starts loading from CPU buffer size instead of CUDA 0. Also this seems to make the CPU to stutter a bit while loading. I haven't tested with mmap yet.
-RX/TX looks like this on PP
+RX/TX looks like this on ik llamacpp

@@ -601,7 +604,7 @@ prompt eval time = 35950.29 ms / 3218 tokens ( 11.17 ms per token, 89.51
eval time = 44338.15 ms / 380 tokens ( 116.68 ms per token, 8.57 tokens per second)
```
-ikllamacpp with the command above + rtr (ub 1536)
+ikllamacpp with the command above (ub 512)
```
INFO [ print_timings] prompt eval time = 104442.50 ms / 3218 tokens ( 32.46 ms per token, 30.81 tokens per second) | tid="139803965288448" timestamp=1746819713 id_slot=0 id_task=0 t_prompt_processing=104442.501 n_prompt_tokens_processed=3218 t_token=32.45571814791796 n_tokens_second=30.811211615853587
@@ -779,6 +782,243 @@ print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
+load_tensors: layer 0 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 1 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 2 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 3 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 4 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 5 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 6 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 7 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 8 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 9 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 10 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 11 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 12 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 13 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 14 assigned to device CUDA0, is_swa = 0
+load_tensors: layer 15 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 16 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 17 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 18 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 19 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 20 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 21 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 22 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 23 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 24 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 25 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 26 assigned to device CUDA1, is_swa = 0
+load_tensors: layer 27 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 28 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 29 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 30 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 31 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 32 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 33 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 34 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 35 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 36 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 37 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 38 assigned to device CUDA2, is_swa = 0
+load_tensors: layer 39 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 40 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 41 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 42 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 43 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 44 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 45 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 46 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 47 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 48 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 49 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 50 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 51 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 52 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 53 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 54 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 55 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 56 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 57 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 58 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 59 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 60 assigned to device CUDA3, is_swa = 0
+load_tensors: layer 61 assigned to device CUDA3, is_swa = 0
+tensor blk.3.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA0
+tensor blk.3.ffn_down_exps.weight (2016 MiB q4_K) buffer type overridden to CUDA0
+tensor blk.3.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA0
+tensor blk.4.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA0
+tensor blk.4.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA0
+tensor blk.4.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA0
+tensor blk.5.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA0
+tensor blk.5.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA0
+tensor blk.5.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA0
+tensor blk.6.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA0
+tensor blk.6.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA0
+tensor blk.6.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA0
+tensor blk.7.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA0
+tensor blk.7.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA0
+tensor blk.7.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA0
+tensor blk.8.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA1
+tensor blk.8.ffn_down_exps.weight (2016 MiB q4_K) buffer type overridden to CUDA1
+tensor blk.8.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA1
+tensor blk.9.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA1
+tensor blk.9.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA1
+tensor blk.9.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA1
+tensor blk.10.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA1
+tensor blk.10.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA1
+tensor blk.10.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA1
+tensor blk.11.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA1
+tensor blk.11.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA1
+tensor blk.11.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA1
+tensor blk.12.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA2
+tensor blk.12.ffn_down_exps.weight (2016 MiB q4_K) buffer type overridden to CUDA2
+tensor blk.12.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA2
+tensor blk.13.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA2
+tensor blk.13.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA2
+tensor blk.13.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA2
+tensor blk.14.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA2
+tensor blk.14.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA2
+tensor blk.14.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA2
+tensor blk.15.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA2
+tensor blk.15.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA2
+tensor blk.15.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA2
+tensor blk.16.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA2
+tensor blk.16.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA2
+tensor blk.16.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA2
+tensor blk.17.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.17.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA3
+tensor blk.17.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.18.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.18.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA3
+tensor blk.18.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.19.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.19.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA3
+tensor blk.19.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.20.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.20.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA3
+tensor blk.20.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.21.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.21.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA3
+tensor blk.21.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.22.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.22.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA3
+tensor blk.22.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.23.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.23.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA3
+tensor blk.23.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.24.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.24.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA3
+tensor blk.24.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.25.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.25.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA3
+tensor blk.25.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.26.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.26.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CUDA3
+tensor blk.26.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CUDA3
+tensor blk.27.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.27.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.27.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.28.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.28.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.28.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.29.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.29.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.29.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.30.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.30.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.30.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.31.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.31.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.31.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.32.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.32.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.32.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.33.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.33.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.33.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.34.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.34.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.34.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.35.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.35.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.35.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.36.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.36.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.36.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.37.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.37.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.37.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.38.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.38.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.38.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.39.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.39.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.39.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.40.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.40.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.40.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.41.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.41.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.41.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.42.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.42.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.42.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.43.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.43.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.43.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.44.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.44.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.44.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.45.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.45.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.45.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.46.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.46.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.46.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.47.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.47.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.47.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.48.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.48.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.48.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.49.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.49.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.49.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.50.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.50.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.50.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.51.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.51.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.51.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.52.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.52.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.52.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.53.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.53.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.53.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.54.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.54.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.54.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.55.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.55.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.55.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.56.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.56.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.56.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.57.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.57.ffn_down_exps.weight (2016 MiB q4_K) buffer type overridden to CPU
+tensor blk.57.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.58.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.58.ffn_down_exps.weight (1540 MiB q3_K) buffer type overridden to CPU
+tensor blk.58.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.59.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.59.ffn_down_exps.weight (2016 MiB q4_K) buffer type overridden to CPU
+tensor blk.59.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.60.ffn_gate_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+tensor blk.60.ffn_down_exps.weight (2016 MiB q4_K) buffer type overridden to CPU
+tensor blk.60.ffn_up_exps.weight (1176 MiB q2_K) buffer type overridden to CPU
+load_tensors: tensor 'token_embd.weight' (q4_K) (and 159 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
@@ -819,7 +1059,7 @@ I can add more info if needed!
---
-👤 **ikawrakow** commented the **2025-05-10** at **10:13:53**:
+👤 **ikawrakow** commented on **2025-05-10** at **10:13:53**
@Panchovix
@@ -833,12 +1073,12 @@ There is a PR in mainline `llama.cpp` to allow disabling offload to the GPU, see
---
-👤 **Panchovix** commented the **2025-05-10** at **17:08:54**:
+👤 **Panchovix** commented on **2025-05-10** at **17:08:54**
@ikawrakow ohh I see! If it's possible to do add the reverse feature it would be great! As I think ik llamacpp with it's optimizations would be faster than llamacpp for PP t/s if we could do the matrix multiplication in the GPU.
---
-👤 **ikawrakow** commented the **2025-05-10** at **17:15:44**:
+👤 **ikawrakow** commented on **2025-05-10** at **17:15:44**
-There is PR #405 now. You can try it with as high u-batch size as you can go. Don't use '-rtr' as this will disable the GPU offload of the experts.
\ No newline at end of file
+There is PR [#405](https://github.com/ikawrakow/ik_llama.cpp/issues/405) now. You can try it with as high u-batch size as you can go. Don't use '-rtr' as this will disable the GPU offload of the experts.
\ No newline at end of file
diff --git a/github-data/pull_requests/4 - Simdify and multi-thread tanh.md b/github-data/pull_requests/4 - Simdify and multi-thread tanh.md
index 151a8028c..0aaa6ddbd 100644
--- a/github-data/pull_requests/4 - Simdify and multi-thread tanh.md
+++ b/github-data/pull_requests/4 - Simdify and multi-thread tanh.md
@@ -1,14 +1,17 @@
-### 🔀 [#4](https://github.com/ikawrakow/ik_llama.cpp/pull/4) - Simdify and multi-thread tanh
+## 🔀 [Pull Request #4](https://github.com/ikawrakow/ik_llama.cpp/pull/4) - Simdify and multi-thread tanh
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/tanh` |
+| **Target Branch** | `main` |
| **Created** | 2024-07-27 |
| **Updated** | 2024-07-27 |
+| **Merged** | 2024-07-27 |
---
-#### Description
+## 📄 Description
It seemed Gemma-2 performance is lower than expected for its size. Looking at the architecture, I noticed that `tanh` is used in each layer, and then at the end for soft-caping the final output. `ggml` had `tanh` set to be computed with a single thread. Combined with `tanh(x)` being a pretty expensive operation, this resulted in a significant fraction of the time being spent in the `tanh` operation.
diff --git a/github-data/pull_requests/40 - Adding bf16 support to CUDA.md b/github-data/pull_requests/40 - Adding bf16 support to CUDA.md
index a5b4c469d..ec319745e 100644
--- a/github-data/pull_requests/40 - Adding bf16 support to CUDA.md
+++ b/github-data/pull_requests/40 - Adding bf16 support to CUDA.md
@@ -1,14 +1,17 @@
-### 🔀 [#40](https://github.com/ikawrakow/ik_llama.cpp/pull/40) - Adding bf16 support to CUDA
+## 🔀 [Pull Request #40](https://github.com/ikawrakow/ik_llama.cpp/pull/40) - Adding bf16 support to CUDA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_bf16` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-05 |
| **Updated** | 2024-09-14 |
+| **Merged** | 2024-09-14 |
---
-#### Description
+## 📄 Description
Haha, `llama.cpp` seems to not support `bf16` on CUDA?
diff --git a/github-data/pull_requests/400 - Fix CUDA DeepSeek FlashMLA-3 with quantized KV cache.md b/github-data/pull_requests/400 - Fix CUDA DeepSeek FlashMLA-3 with quantized KV cache.md
index 897cf91d9..a2dd5c5aa 100644
--- a/github-data/pull_requests/400 - Fix CUDA DeepSeek FlashMLA-3 with quantized KV cache.md
+++ b/github-data/pull_requests/400 - Fix CUDA DeepSeek FlashMLA-3 with quantized KV cache.md
@@ -1,14 +1,17 @@
-### 🐛 [#400](https://github.com/ikawrakow/ik_llama.cpp/pull/400) - Fix CUDA DeepSeek FlashMLA-3 with quantized KV cache
+## 🔀 [Pull Request #400](https://github.com/ikawrakow/ik_llama.cpp/pull/400) - Fix CUDA DeepSeek FlashMLA-3 with quantized KV cache
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_fix_quantized_flash_mla3` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-09 |
| **Updated** | 2025-05-09 |
+| **Merged** | 2025-05-09 |
---
-#### Description
+## 📄 Description
The implementation was assuming that the K and V cache are contiguous, and was using this assumption to dequantize to `fp16`. This is certainly wrong for the V cache, which is just a view of the K cache with rows of 512 instead of 576 elements.
@@ -800,15 +803,15 @@ I.e., only very slightly slower than `fp16` KV cache. The KV cache is quite smal
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **JohannesGaessler** commented the **2025-05-09** at **07:23:38**:
+👤 **JohannesGaessler** commented on **2025-05-09** at **07:23:38**
Thank you for notifying me. I am aware of the defect, on the mainline PR it is currently not manifesting as a bug because the K and V cache are not yet deduplicated and are thus both contiguous in memory. I can't comment on the specific code in this PR since I won't look at it unless you explicitly tell me I'm allowed to do so even without the conflict between you and Georgi first being resolved. The way I would have gone about it would have been not to use the V tensor at all, to dequantize K, and to then calculate the pointer, dimension, and strides for a pseudo V tensor from the K tensor.
---
-👤 **ikawrakow** commented the **2025-05-09** at **07:25:52**:
+👤 **ikawrakow** commented on **2025-05-09** at **07:25:52**
Forgot to add `-rtr` in the above performance test. Here it is with `-rtr` and `q8_0` KV cache
@@ -825,7 +828,7 @@ Forgot to add `-rtr` in the above performance test. Here it is with `-rtr` and `
---
-👤 **ikawrakow** commented the **2025-05-09** at **07:31:04**:
+👤 **ikawrakow** commented on **2025-05-09** at **07:31:04**
> on the mainline PR it is currently not manifesting as a bug because the K and V cache are not yet deduplicated and are thus both contiguous in memory.
@@ -835,6 +838,6 @@ In any case, the PR in `ik_llama.cpp` is mostly a copy of your mainline PR, so y
---
-👤 **JohannesGaessler** commented the **2025-05-09** at **07:49:51**:
+👤 **JohannesGaessler** commented on **2025-05-09** at **07:49:51**
My concern specifically is whether you would consider any of my work on mainline after looking at your code to be including a "substantial portion" of your work and could thus only be included in conjunction with the copyright notices in ik_llama.cpp. Much like you I am not a lawyer but if you tell me that you will not consider me looking at your work to be a license violation (or that in some specific case you waive the requirement of copyright notices) then there is no need for lawyers in the first place.
\ No newline at end of file
diff --git a/github-data/pull_requests/402 - Fix missing rope_freqs with convert_hf_to_gguf.md b/github-data/pull_requests/402 - Fix missing rope_freqs with convert_hf_to_gguf.md
index 2fd8803a0..4f3bc2415 100644
--- a/github-data/pull_requests/402 - Fix missing rope_freqs with convert_hf_to_gguf.md
+++ b/github-data/pull_requests/402 - Fix missing rope_freqs with convert_hf_to_gguf.md
@@ -1,14 +1,17 @@
-### 🐛 [#402](https://github.com/ikawrakow/ik_llama.cpp/pull/402) - Fix missing rope_freqs with convert_hf_to_gguf
+## 🔀 [Pull Request #402](https://github.com/ikawrakow/ik_llama.cpp/pull/402) - Fix missing rope_freqs with convert_hf_to_gguf
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/rope_freq_fix` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-09 |
| **Updated** | 2025-05-09 |
+| **Merged** | 2025-05-09 |
---
-#### Description
+## 📄 Description
This ports https://github.com/ggml-org/llama.cpp/pull/9396 and https://github.com/ggml-org/llama.cpp/pull/9117 (I don't think I needed this as the changes in here are basically reverted in 9396).
@@ -16,10 +19,10 @@ The issue was that the convert script used generate_extra_tensors for those tens
I tested with [Llama-3_1-Nemotron-51B-Instruct](https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct) and it now generates the rope_freqs.weight which was missing previously.
-Look at #377 for more information.
+Look at [#377](https://github.com/ikawrakow/ik_llama.cpp/issues/377) for more information.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-05-09** at **14:16:12**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-09** at **14:16:12**
\ No newline at end of file
diff --git a/github-data/pull_requests/404 - TG improvements for MoE models.md b/github-data/pull_requests/404 - TG improvements for MoE models.md
index 6ffa5f87f..d6f7c7c03 100644
--- a/github-data/pull_requests/404 - TG improvements for MoE models.md
+++ b/github-data/pull_requests/404 - TG improvements for MoE models.md
@@ -1,18 +1,21 @@
-### 🔀 [#404](https://github.com/ikawrakow/ik_llama.cpp/pull/404) - TG improvements for MoE models
+## 🔀 [Pull Request #404](https://github.com/ikawrakow/ik_llama.cpp/pull/404) - TG improvements for MoE models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/remove_unnessessary_ids_copy` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-10 |
| **Updated** | 2025-05-10 |
+| **Merged** | 2025-05-10 |
---
-#### Description
+## 📄 Description
This PR does 3 things:
* Removes an unnecessary device to host copy of selected experts IDs on CUDA. This results in a few percent improvement of CUDA TG speed for MoE models
-* Fixes bugs related to Smart Experts Reduction (SER, see #239). The issue was that the `GGML_OP_GET_ROWS` op implementation did not consider disabled experts for float tensors. As a result, when combining the results of the experts garbage weights were used for the disabled experts, which could lead to NaNs.
+* Fixes bugs related to Smart Experts Reduction (SER, see [#239](https://github.com/ikawrakow/ik_llama.cpp/issues/239)). The issue was that the `GGML_OP_GET_ROWS` op implementation did not consider disabled experts for float tensors. As a result, when combining the results of the experts garbage weights were used for the disabled experts, which could lead to NaNs.
* Further improves CUDA TG performance with SER enabled. Here the `ggml_cuda_op_mul_mat_vec_q_id` function did not consider that an expert may be disabled, and needlessly calculated the matrix-vector multiplication for disabled experts.
Prompt processing is not eaffected by these changes.
diff --git a/github-data/pull_requests/405 - GPU offload policy.md b/github-data/pull_requests/405 - GPU offload policy.md
index 9ba5757f9..b494a8983 100644
--- a/github-data/pull_requests/405 - GPU offload policy.md
+++ b/github-data/pull_requests/405 - GPU offload policy.md
@@ -1,14 +1,17 @@
-### 🔀 [#405](https://github.com/ikawrakow/ik_llama.cpp/pull/405) - GPU offload policy
+## 🔀 [Pull Request #405](https://github.com/ikawrakow/ik_llama.cpp/pull/405) - GPU offload policy
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/offload_policy` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-10 |
| **Updated** | 2025-05-12 |
+| **Merged** | 2025-05-12 |
---
-#### Description
+## 📄 Description
When part of the tensors are stored in RAM but there are faster back-ends available (GPU), the scheduler needs to decide if to offload the data for a given op to a faster back-end or to compute the op on the CPU. This is currently done via a simple heuristics where only matrix multiplications (`GGML_MUL_MAT` and `GGML_MUL_MAT_ID`) are offloaded if the batch size is larger than some threshold (currently 32). When `fmoe` is enabled, the fused `(ffn_up*X)*unary(ffn_gate*X))` op is never uploaded. In contrast, in mainline `llama.cpp` matrix multiplications are always offloaded when the batch size is `>= 32`. The result of this is that when the batch size becomes large enough, `llama.cpp` will outperform `ik_llama.cpp` in prompt processing speed. As "large enough" depends on many factors (size of tensors that need to be uploaded, speed of the PCI-E bus to the GPU, relative speed of the GPU vs the CPU), it is hard to devise a better offload policy that automatically takes the best decision.
@@ -124,9 +127,9 @@ Examples:
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Panchovix** commented the **2025-05-10** at **18:12:44**:
+👤 **Panchovix** commented on **2025-05-10** at **18:12:44**
Many thanks for the PR! Sorry as I think I didn't understand correctly, for the case we were talking on https://github.com/ikawrakow/ik_llama.cpp/pull/394#issuecomment-2868723515, if we want to do the matrix multiplications on MoE models, we should specify
@@ -134,120 +137,25 @@ Many thanks for the PR! Sorry as I think I didn't understand correctly, for the
---
-👤 **ikawrakow** commented the **2025-05-10** at **18:22:29**:
+👤 **ikawrakow** commented on **2025-05-10** at **18:22:29**
This PR sets `ik_llama.cpp` GPU offload behavior to be the same as `llama.cpp`, so you don't need to use the `-op` argument. You would want to use it if you were running for instance Maverick, and then you would use `-op 27,0,29,0`.
---
-👤 **Panchovix** commented the **2025-05-10** at **18:33:15**:
+👤 **Panchovix** commented on **2025-05-10** at **18:33:15**
-Amazing, thanks! Now I'm trying to build from source but I'm getting some compilation issues, not sure if it is the PR or an update (I was on https://github.com/ikawrakow/ik_llama.cpp/commit/43a154d8b8b0e9217114577442cecb224a488d45 before)
-
-```
-[ 59%] Building CXX object src/CMakeFiles/llama.dir/unicode-data.cpp.o
-/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `x000fe200080f0eff'
-collect2: error: ld returned 1 exit status
-/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `x000fe200080f0eff'
-gmake[2]: *** [examples/gguf/CMakeFiles/llama-gguf.dir/build.make:103: bin/llama-gguf] Error 1
-gmake[1]: *** [CMakeFiles/Makefile2:3260: examples/gguf/CMakeFiles/llama-gguf.dir/all] Error 2
-gmake[1]: *** Waiting for unfinished jobs....
-collect2: error: ld returned 1 exit status
-gmake[2]: *** [examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/build.make:109: bin/llama-gguf-hash] Error 1
-gmake[1]: *** [CMakeFiles/Makefile2:3097: examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/all] Error 2
-[ 59%] Linking CXX shared library libllama.so
-[ 59%] Built target llama
-gmake: *** [Makefile:146: all] Error 2
-```
-
-```
-make --build gpupol --config Release -j 7
-[ 0%] Built target build_info
-[ 0%] Built target sha1
-[ 0%] Built target sha256
-[ 1%] Built target xxhash
-[ 56%] Built target ggml
-[ 56%] Linking CXX executable ../../bin/llama-gguf
-[ 57%] Linking CXX executable ../../bin/llama-gguf-hash
-[ 59%] Built target llama
-/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `x000fe200080f0eff'
-collect2: error: ld returned 1 exit status
-/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `x000fe200080f0eff'
-gmake[2]: *** [examples/gguf/CMakeFiles/llama-gguf.dir/build.make:103: bin/llama-gguf] Error 1
-gmake[1]: *** [CMakeFiles/Makefile2:3260: examples/gguf/CMakeFiles/llama-gguf.dir/all] Error 2
-gmake[1]: *** Waiting for unfinished jobs....
-collect2: error: ld returned 1 exit status
-gmake[2]: *** [examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/build.make:109: bin/llama-gguf-hash] Error 1
-gmake[1]: *** [CMakeFiles/Makefile2:3097: examples/gguf-hash/CMakeFiles/llama-gguf-hash.dir/all] Error 2
-[ 59%] Building CXX object examples/llava/CMakeFiles/llava.dir/clip.cpp.o
-[ 59%] Building CXX object common/CMakeFiles/common.dir/common.cpp.o
-[ 60%] Building CXX object examples/benchmark/CMakeFiles/llama-bench-matmult.dir/benchmark-matmult.cpp.o
-[ 60%] Building C object tests/CMakeFiles/test-c.dir/test-c.c.o
-[ 60%] Building CXX object common/CMakeFiles/common.dir/sampling.cpp.o
-[ 61%] Building CXX object examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/quantize-stats.cpp.o
-[ 61%] Building CXX object examples/llava/CMakeFiles/llava.dir/llava.cpp.o
-[ 61%] Linking C executable ../bin/test-c
-/usr/bin/ld: ../ggml/src/libggml.so: undefined reference to `x000fe200080f0eff'
-collect2: error: ld returned 1 exit status
-gmake[2]: *** [tests/CMakeFiles/test-c.dir/build.make:104: bin/test-c] Error 1
-gmake[1]: *** [CMakeFiles/Makefile2:2713: tests/CMakeFiles/test-c.dir/all] Error 2
-[ 61%] Building CXX object common/CMakeFiles/common.dir/console.cpp.o
-[ 61%] Building CXX object common/CMakeFiles/common.dir/grammar-parser.cpp.o
-[ 62%] Linking CXX executable ../../bin/llama-bench-matmult
-/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `x000fe200080f0eff'
-collect2: error: ld returned 1 exit status
-gmake[2]: *** [examples/benchmark/CMakeFiles/llama-bench-matmult.dir/build.make:106: bin/llama-bench-matmult] Error 1
-gmake[1]: *** [CMakeFiles/Makefile2:2887: examples/benchmark/CMakeFiles/llama-bench-matmult.dir/all] Error 2
-[ 62%] Building CXX object common/CMakeFiles/common.dir/json-schema-to-grammar.cpp.o
-[ 63%] Building CXX object common/CMakeFiles/common.dir/train.cpp.o
-[ 63%] Building CXX object common/CMakeFiles/common.dir/ngram-cache.cpp.o
-[ 63%] Linking CXX executable ../../bin/llama-quantize-stats
-/usr/bin/ld: ../../ggml/src/libggml.so: undefined reference to `x000fe200080f0eff'
-collect2: error: ld returned 1 exit status
-gmake[2]: *** [examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/build.make:106: bin/llama-quantize-stats] Error 1
-gmake[1]: *** [CMakeFiles/Makefile2:3920: examples/quantize-stats/CMakeFiles/llama-quantize-stats.dir/all] Error 2
-In file included from /run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/examples/llava/clip.cpp:24:
-/run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/examples/llava/../../common/stb_image.h: In function ‘int stbi__parse_png_file(stbi__png*, int, int)’:
-/run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/examples/llava/../../common/stb_image.h:5450:31: warning: writing 1 byte into a region of size 0 [-Wstringop-overflow=]
- 5450 | tc[k] = (stbi_uc)(stbi__get16be(s) & 255) *
- | ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- 5451 | stbi__depth_scale_table[z->depth]; // non 8-bit images will be larger
- | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-/run/media/pancho/4C4643C74643B10E/ChatIAs/ik_llama.cpp/examples/llava/../../common/stb_image.h:5326:28: note: at offset 3 into destination object ‘tc’ of size 3
- 5326 | stbi_uc has_trans = 0, tc[3] = {0};
- | ^~
-[ 63%] Built target llava
-[ 63%] Linking CXX static library libcommon.a
-[ 63%] Built target common
-gmake: *** [Makefile:146: all] Error 2
-```
-
-It seems CUDA parts worked fine.
-
-I'm building with
-
-```
- CC=gcc-14 CXX=g++-14 CUDAHOSTCXX=g++-14 cmake -B build \
- -DGGML_CUDA=ON \
- -DGGML_CUDA_FA_ALL_QUANTS=ON \
- -DGGML_BLAS=OFF \
- -DCMAKE_CUDA_ARCHITECTURES="86;89;120" \
- -DGGML_IQK_FA_ALL_QUANTS=1 \
- -DGGML_SCHED_MAX_COPIES=1 \
- -DCMAKE_CUDA_FLAGS="-allow-unsupported-compiler -ccbin=g++-14"
-
- cmake --build build --config Release -j 7
-```
+Amazing, thanks! EDIT: Compilation solved by doing a new git clone.
---
-👤 **ikawrakow** commented the **2025-05-10** at **18:45:34**:
+👤 **ikawrakow** commented on **2025-05-10** at **18:45:34**
Not sure. `grep` on the source tree for `000fe200080f0eff` returns no results.
---
-👤 **Panchovix** commented the **2025-05-10** at **19:39:27**:
+👤 **Panchovix** commented on **2025-05-10** at **19:39:27**
Okay restarting didn't work either. But cloning the PR itself in a new folder worked, so I guess there is an issue with my main folder after pulling the PR separately.
@@ -269,13 +177,13 @@ This is about 10% faster than main llamacpp with the same ubatch size, and GPU 0
---
-👤 **Panchovix** commented the **2025-05-10** at **23:37:03**:
+👤 **Panchovix** commented on **2025-05-10** at **23:37:03**
Just an update, tested other deepseek models (v30324, chimera, r1) at q2_k_xl, iq3_xxs, q3_k_s and q3_k_xl, all working fine! So really nice work.
---
-👤 **ikawrakow** commented the **2025-05-11** at **04:42:09**:
+👤 **ikawrakow** commented on **2025-05-11** at **04:42:09**
Thanks for testing, I appreciate it!
@@ -283,13 +191,13 @@ Johannes has improved the performance `llama.cpp` for MoE models quite a bit in
---
-👤 **Panchovix** commented the **2025-05-11** at **04:52:17**:
+👤 **Panchovix** commented on **2025-05-11** at **04:52:17**
I see! I think I would have to remove some layers from some experts from GPU to use -b and -ub 4096, which I think it would increase PP but maybe decrease TG a bit? At least I have noticed that with -b 2560 and -ub 2048 with less layers on GPU but more ctx (128K)
---
-👤 **ikawrakow** commented the **2025-05-11** at **04:59:57**:
+👤 **ikawrakow** commented on **2025-05-11** at **04:59:57**
> I think I would have to remove some layers from some experts from GPU to use -b and -ub 4096, which I think it would increase PP but maybe decrease TG a bit?
@@ -301,7 +209,7 @@ What is the use case for `-b 2560 -ub 2048`? The computation will run one u-batc
---
-👤 **Panchovix** commented the **2025-05-11** at **05:12:45**:
+👤 **Panchovix** commented on **2025-05-11** at **05:12:45**
> > I think I would have to remove some layers from some experts from GPU to use -b and -ub 4096, which I think it would increase PP but maybe decrease TG a bit?
>
@@ -317,23 +225,16 @@ Also just 1/61 the speed, pretty worth probably. I get 7 t/s on Q3_K_XL TG but ~
---
-👤 **Panchovix** commented the **2025-05-11** at **22:34:17**:
+👤 **Panchovix** commented on **2025-05-11** at **22:34:17**
Okay testing Q2_K_XL with -b 4096 and -ub 4096, PP t/s are insane
```
INFO [ print_timings] prompt eval time = 13435.86 ms / 3003 tokens ( 4.47 ms per token, 223.51 tokens per second) | tid="140099605647360" timestamp=1747002757 id_slot=0 id_task=385 t_prompt_processing=13435.857 n_prompt_tokens_processed=3003 t_token=4.474144855144855 n_tokens_second=223.50639784272786
-```
-
----
-
-👤 **cosystudio** commented the **2025-05-12** at **21:52:32**:
-
-I want to say thank you as well as provide a datapoint. PP hit 301 tk/s vs about 230 tk/s vs commit ab7f694b. x2 3090 AMD Epyc 9654P + 12 channels of DDR5 4800 MT/s ram
-
-./llama-server --alias /Qwen3-235B-A22B-128K-UD-Q4_K_XL -m /home/dev/models/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-128K-UD-Q4_K_XL-00001-of-00003.gguf -c 92160 -t 96 -fa -amb 512 -mla 3 -rtr -fmoe -ctk q8_0 -ctv q8_0 --parallel 1 -ngl 99 -ot "blk\.(0|1|2|3|4|5|6|14|15|16)\.ffn.*=CUDA0" -ot "blk\.(7|8|9|10|11|12|13|17|18|19)\.ffn.*=CUDA1" -ot "blk\.2[0-9]\.ffn.*=CPU" -ot "blk\.[3-9][0-9]\.ffn.*=CPU" --host 0.0.0.0 --port 8080 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -np 8 -ub 1024 --metrics -dt 0.05 --threads-http 16 --prompt-cache-all --predict 38912 -b 4096 -ub 4096
+```
+EDIT: After some gens it just gets faster
-INFO [ print_timings] prompt eval time = 23946.86 ms / 7221 tokens ( 3.32 ms per token, 301.54 tokens per second) | tid="130418296737792" timestamp=1747086263 id_slot=0 id_task=17 t_prompt_processing=23946.864 n_prompt_tokens_processed=7221 t_token=3.316280847528043 n_tokens_second=301.54261535038574
-INFO [ print_timings] generation eval time = 3061.63 ms / 55 runs ( 55.67 ms per token, 17.96 tokens per second) | tid="130418296737792" timestamp=1747086263 id_slot=0 id_task=17 t_token_generation=3061.629 n_decoded=55 t_token=55.66598181818182 n_tokens_second=17.964292865007486
-INFO [ print_timings] total time = 27008.49 ms | tid="130418296737792" timestamp=1747086263 id_slot=0 id_task=17 t_prompt_processing=23946.864 t_token_generation=3061.629 t_total=27008.493000000002
\ No newline at end of file
+```
+INFO [ print_timings] prompt eval time = 14636.06 ms / 3595 tokens ( 4.07 ms per token, 245.63 tokens per second) | tid="140099605647360" timestamp=1747003592 id_slot=0 id_task=2032 t_prompt_processing=14636.063 n_prompt_tokens_processed=3595 t_token=4.071227538247566 n_tokens_second=245.62616326535354
+```
\ No newline at end of file
diff --git a/github-data/pull_requests/406 - Fix race in the CUDA DeepSeek FA kernel.md b/github-data/pull_requests/406 - Fix race in the CUDA DeepSeek FA kernel.md
index 584af11c0..91ef1abfa 100644
--- a/github-data/pull_requests/406 - Fix race in the CUDA DeepSeek FA kernel.md
+++ b/github-data/pull_requests/406 - Fix race in the CUDA DeepSeek FA kernel.md
@@ -1,23 +1,61 @@
-### 🐛 [#406](https://github.com/ikawrakow/ik_llama.cpp/pull/406) - Fix race in the CUDA DeepSeek FA kernel
+## 🔀 [Pull Request #406](https://github.com/ikawrakow/ik_llama.cpp/pull/406) - Fix race in the CUDA DeepSeek FA kernel
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_cuda_fa_race` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-11 |
| **Updated** | 2025-05-13 |
+| **Merged** | 2025-05-11 |
---
-#### Description
+## 📄 Description
Reference: https://github.com/ggml-org/llama.cpp/pull/13438
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-05-12** at **15:59:39**:
+👤 **ubergarm** commented on **2025-05-12** at **15:59:39**
-Just saw what looks like a small patch in mainline's [earlier ggml-org/llama.cpp#13438 just updated in #13469 (linked here)](https://github.com/ggml-org/llama.cpp/pull/13469)
+Just saw what looks like a small patch in mainline's [earlier ggml-org/llama.cpp[#13438](https://github.com/ikawrakow/ik_llama.cpp/issues/13438) just updated in [#13469](https://github.com/ikawrakow/ik_llama.cpp/issues/13469) (linked here)](https://github.com/ggml-org/llama.cpp/pull/13469)
-Could be related to my issue with `DDDD` showing up for longer contexts which I attributed to `-ser` [as we were discussing here](https://github.com/ikawrakow/ik_llama.cpp/pull/386#issuecomment-2869078136)?
\ No newline at end of file
+Could be related to my issue with `DDDD` showing up for longer contexts which I attributed to `-ser` [as we were discussing here](https://github.com/ikawrakow/ik_llama.cpp/pull/386#issuecomment-2869078136)?
+
+Though hrmm, yours has this in a similar area already, so may not be relevent.
+```
+ if (np > 1) {
+ __syncthreads();
+ }
+```
+
+fwiw I tested the following small change and still am seeing `DDDD` with longer context and `-ser` so might not be related.
+
+```
+--- a/ggml/src/ggml-cuda/fattn-mma-f16.cuh
++++ b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+@@ -734,9 +734,10 @@ static __device__ __forceinline__ void flash_attn_ext_f16_process_tile(
+ float2 * dstk_fixup_meta = dstk_fixup + (gridDim.x + blockIdx.x)*ncols;
+ dstk_fixup_meta[(threadIdx.y/np)*cols_per_warp + threadIdx.x] = make_float2(KQ_cmn, KQ_crs);
+ }
+- }
+-
+- if (np > 1) {
++ } else if (np > 1) {
++ // Warps with threadIdx.y % np == 0 execute a __syncthreads() in the if branch.
++ // Therefore, all other warps also need to execute a __syncthreads().
++ // Otherwise the points at which warps synchronize with each other would become misaligned.
+ __syncthreads();
+ }
+```
+
+---
+
+👤 **ikawrakow** commented on **2025-05-13** at **04:34:01**
+
+> Could be related to my issue with DDDD showing up for longer contexts which I attributed to -ser https://github.com/ikawrakow/ik_llama.cpp/pull/386#issuecomment-2869078136?
+
+Thanks for the alert. But isn't it easier to rerun without `-ser` to not have 2 potential causes at the same time? There has been [a new report](https://github.com/ikawrakow/ik_llama.cpp/discussions/385#discussioncomment-13125043) about SER not working, this time CPU only.
\ No newline at end of file
diff --git a/github-data/pull_requests/408 - Faster DeepSeek FA on CUDA.md b/github-data/pull_requests/408 - Faster DeepSeek FA on CUDA.md
index 6f6e03f81..5a21a30fa 100644
--- a/github-data/pull_requests/408 - Faster DeepSeek FA on CUDA.md
+++ b/github-data/pull_requests/408 - Faster DeepSeek FA on CUDA.md
@@ -1,18 +1,21 @@
-### 🔀 [#408](https://github.com/ikawrakow/ik_llama.cpp/pull/408) - Faster DeepSeek FA on CUDA
+## 🔀 [Pull Request #408](https://github.com/ikawrakow/ik_llama.cpp/pull/408) - Faster DeepSeek FA on CUDA
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_flash_mla3_v2` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-11 |
| **Updated** | 2025-05-12 |
+| **Merged** | 2025-05-12 |
---
-#### Description
+## 📄 Description
This is a port of [this PR](https://github.com/ggml-org/llama.cpp/pull/13435) in mainline `llama.cpp`.
-The main difference to PR #386 is that now the FA kernel takes advantage of the fact that the V tensor contains the same data as the K tensor (it is a view on the K cache with an offset given by the RoPE embedding size). Hence, one can reduce the number of loads by reusing K tiles when processing `V*softmax(K*Q)`.
+The main difference to PR [#386](https://github.com/ikawrakow/ik_llama.cpp/issues/386) is that now the FA kernel takes advantage of the fact that the V tensor contains the same data as the K tensor (it is a view on the K cache with an offset given by the RoPE embedding size). Hence, one can reduce the number of loads by reusing K tiles when processing `V*softmax(K*Q)`.
To take advantage of this new kernel I had to change the way the K cache is organized. In mainline `llama.cpp` the K cache stores `(RoPE, NoPE)` parts in that order, and the FA kernel assumes this arrangement. But in `ik_llama.cpp` prior to this PR the K cache was stored as `(NoPE, RoPE)`. As there are several places where the views into the K cache can go wrong when building the graph, the PR should be tested more thoroughly before merging. I have tested all possible combinations of `mla` and `fa` using DeepSeek-Lite and it appears to work correctly, but still.
@@ -31,15 +34,15 @@ Finally, being curious about the peculiar TG behavior as a function of `N_KV`, I
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **JohannesGaessler** commented the **2025-05-11** at **14:05:44**:
+👤 **JohannesGaessler** commented on **2025-05-11** at **14:05:44**
An RTX 4080 has 76 streaming multiprocessor, the CUDA code assigns KV slices to SMs in chunks of size 256. So every 76*256=19456 tokens the size of biggest workload across all of the SMs increases and there is a dip in performance. These so-called quantization effects are much more noticeable with compute than with I/O so they become more pronounced if the I/O of a kernel is optimized.
---
-👤 **Panchovix** commented the **2025-05-11** at **18:28:44**:
+👤 **Panchovix** commented on **2025-05-11** at **18:28:44**
Just tested on DeepSeek V3 0324 Q2_K_XL and it seems to have improved my t/s TG by about 1-2% (I guess with offloading there isn't much difference), but tested a smaller models (DeepSeek2 16B) on a single GPU (5090) and got about 8-12% speedup, so pretty nice!
diff --git a/github-data/pull_requests/409 - Enable faster prompt processing with mainline llama.cpp GGUFs.md b/github-data/pull_requests/409 - Enable faster prompt processing with mainline llama.cpp GGUFs.md
index 526cd7373..f633d1fdd 100644
--- a/github-data/pull_requests/409 - Enable faster prompt processing with mainline llama.cpp GGUFs.md
+++ b/github-data/pull_requests/409 - Enable faster prompt processing with mainline llama.cpp GGUFs.md
@@ -1,19 +1,22 @@
-### 🔀 [#409](https://github.com/ikawrakow/ik_llama.cpp/pull/409) - Enable faster prompt processing with mainline llama.cpp GGUFs
+## 🔀 [Pull Request #409](https://github.com/ikawrakow/ik_llama.cpp/pull/409) - Enable faster prompt processing with mainline llama.cpp GGUFs
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/enable_mla3_in_crippled_ggufs` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-11 |
| **Updated** | 2025-05-12 |
+| **Merged** | 2025-05-12 |
---
-#### Description
+## 📄 Description
Mainline llama.cpp [PR 12901](https://github.com/ggml-org/llama.cpp/pull/12801), which added MLA support for DeepSeek models 2.5 months after MLA was available here, broke backwards compatibility. As a result,
-the new DeepSeek GGUFs that started appearing on HF became compatible with `ik_llama.cpp`, so I added support for the incompatible GGUFs in #394. But using such crippled DeepSeek GGUF results in a much lower prompt processing performance. This is because the `attn_wkv_b` tensor is missing, so one cannot use `mla = 3`.
+the new DeepSeek GGUFs that started appearing on HF became compatible with `ik_llama.cpp`, so I added support for the incompatible GGUFs in [#394](https://github.com/ikawrakow/ik_llama.cpp/issues/394). But using such crippled DeepSeek GGUF results in a much lower prompt processing performance. This is because the `attn_wkv_b` tensor is missing, so one cannot use `mla = 3`.
-This PR removes this limitation. When `-mla 0 or 2 or 3` is specified on the command line, missing `attn_wkv_b` tensors are created on-the-fly while loading the model. This is basically the reverse of #259, where the `attn_wk_b` and `attn_wv_b`tensors necessary for MLA were computed from the `attn_wkv_b` tensors in the original DeepSeek GGUFs.
+This PR removes this limitation. When `-mla 0 or 2 or 3` is specified on the command line, missing `attn_wkv_b` tensors are created on-the-fly while loading the model. This is basically the reverse of [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259), where the `attn_wk_b` and `attn_wv_b`tensors necessary for MLA were computed from the `attn_wkv_b` tensors in the original DeepSeek GGUFs.
To show why this is useful, the following graph compares PP performance between the main branch and this PR. The `sweep-bench` command is
```
@@ -25,9 +28,9 @@ The model is a mainline `llama.cpp` DeepSeek-Lite GGUF with the `attn_wkv_b` ten
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Panchovix** commented the **2025-05-11** at **19:03:47**:
+👤 **Panchovix** commented on **2025-05-11** at **19:03:47**
Testing this PR (on top of https://github.com/ikawrakow/ik_llama.cpp/pull/405 and https://github.com/ikawrakow/ik_llama.cpp/pull/408 PRs), here's a complete log when loading DeepSeek V3 0324 Q2_K_XL. Notably, I had to reduce 1 layer on CUDA 2 (compared to https://github.com/ikawrakow/ik_llama.cpp/pull/405#issuecomment-2869126831), as now CUDA 2 was getting OOM. I noticed the compute buffers are ~3.3GB each instead of 2GB and 400MB respectively for each despite using the -fa flag with -mla 3.
@@ -771,4 +774,14 @@ INFO [ print_timings] total time = 82713.10 ms | tid="1405
-Testing with -mla 2, compute buffers are 3.4GB as well vs -mla 1 with -fa. Here it got a small perf improvement (109 t/s PP vs 106 t/s PP).
\ No newline at end of file
+Testing with -mla 2, compute buffers are 3.4GB as well vs -mla 3 with -fa. Here it got a small perf improvement (109 t/s PP vs 106 t/s PP).
+
+EDIT: I noticed that with this PR we have to specify -mla 1 to make compute buffers smaller, as it doesn't automatically changes it from 0 to 1.
+
+---
+
+👤 **ikawrakow** commented on **2025-05-12** at **04:41:08**
+
+The compute buffers become larger because one needs extra buffers for the transformed cache. If you are running out of VRAM, you can reduce the compute buffer size using e.g. `-amb 512`. This may result in a small performance degradation (but often doesn't).
+
+The extra ~1 GiB in model size is for the newly created `attn_wkv_b` tensors.
\ No newline at end of file
diff --git a/github-data/pull_requests/41 - iqk_mul_mat_ARM_NEON_ adding bf16 support.md b/github-data/pull_requests/41 - iqk_mul_matARM_NEON adding bf16 support.md
similarity index 58%
rename from github-data/pull_requests/41 - iqk_mul_mat_ARM_NEON_ adding bf16 support.md
rename to github-data/pull_requests/41 - iqk_mul_matARM_NEON adding bf16 support.md
index 8a7085976..94223cd39 100644
--- a/github-data/pull_requests/41 - iqk_mul_mat_ARM_NEON_ adding bf16 support.md
+++ b/github-data/pull_requests/41 - iqk_mul_matARM_NEON adding bf16 support.md
@@ -1,13 +1,16 @@
-### 🔀 [#41](https://github.com/ikawrakow/ik_llama.cpp/pull/41) - iqk_mul_mat(ARM_NEON): adding bf16 support
+## 🔀 [Pull Request #41](https://github.com/ikawrakow/ik_llama.cpp/pull/41) - iqk_mul_mat(ARM_NEON): adding bf16 support
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/neon_bf16` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-05 |
| **Updated** | 2024-09-16 |
+| **Merged** | 2024-09-16 |
---
-#### Description
+## 📄 Description
It looks like ArmV8 ISA has support for `bf16`, but my M2 Max does not have it, so resorting to `bf16 -> f32` conversion and computations in `f32`. This is 2X slower than `f16`, but 8X better compared to what I get if I try to run a `bf16` model on the M2 (`NEON` and `Metal`).
\ No newline at end of file
diff --git a/github-data/pull_requests/410 - Better CPU FA performance for DeepSeek-Lite.md b/github-data/pull_requests/410 - Better CPU FA performance for DeepSeek-Lite.md
index 9e5ce2e73..f2d23da49 100644
--- a/github-data/pull_requests/410 - Better CPU FA performance for DeepSeek-Lite.md
+++ b/github-data/pull_requests/410 - Better CPU FA performance for DeepSeek-Lite.md
@@ -1,14 +1,17 @@
-### 🔀 [#410](https://github.com/ikawrakow/ik_llama.cpp/pull/410) - Better CPU FA performance for DeepSeek-Lite
+## 🔀 [Pull Request #410](https://github.com/ikawrakow/ik_llama.cpp/pull/410) - Better CPU FA performance for DeepSeek-Lite
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cpu_deepseek_fa` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-12 |
| **Updated** | 2025-05-20 |
+| **Merged** | 2025-05-13 |
---
-#### Description
+## 📄 Description
This FA tweak improves DeepSeek-Lite CPU TG performance with `Q8_0` KV cache.
@@ -71,9 +74,9 @@ The graph shows a comparison between the main branch and this PR for a `Q4_0` qu
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-05-20** at **08:19:37**:
+👤 **saood06** commented on **2025-05-20** at **08:19:37**
I did end up doing a fresh build, drop cache and server launch and have used it up to 32K tokens (double where I normally test sweep-bench), and my informal results are that it is about the same, maybe a little better. I don't see the same large improvement that seems to scale with context size that you do.
@@ -81,7 +84,15 @@ I may run a full sweep-bench later to get a better comparison, I only ran it at
---
-👤 **saood06** commented the **2025-05-20** at **09:19:47**:
+👤 **ikawrakow** commented on **2025-05-20** at **09:00:18**
+
+> I don't see the same large improvement that seems to scale with context size that you do.
+
+There is something different with the big siblings of DeepSeek-Lite that I haven't understood what it is. For one, IIRC your TG performance drops 3X when you go to 16k tokens, while in my case it is still at ~60% even before the PR. The self-attention part per layer in DeepSeek-V3/R1 is 8X of DeepSeek-Lite (128 instead of 16 heads for otherwise identical tensor dimensions). The FFN part is about 7X (`7168 x 2048` vs `2048 x 1408` + 8 active experts instead of 6) per layer, so I don't really see a reason why it should behave differently. If anything, with `-mla 3 -fa` my expectation would be that the big model TG performance decreases less with context size as the K-cache is smaller relative to the amount of computations that need to get done. So, I guess, it is somehow related to NUMA, so it is bottle-necked on that when computing self-attention. If so, yes, you probably will not see (significant) performance improvement.
+
+---
+
+👤 **saood06** commented on **2025-05-20** at **09:19:47**
> > I don't see the same large improvement that seems to scale with context size that you do.
>
@@ -91,8 +102,18 @@ I'm not sure because it has good local hitrate on TG see this: https://github.co
---
-👤 **ikawrakow** commented the **2025-05-20** at **09:44:56**:
+👤 **ikawrakow** commented on **2025-05-20** at **09:44:56**
> I'm not sure because it has good local hitrate on TG see this: https://github.com/ikawrakow/ik_llama.cpp/discussions/201#discussioncomment-13203928
-The high local TG hit rate is measured at what context?
\ No newline at end of file
+The high local TG hit rate is measured at what context?
+
+---
+
+👤 **saood06** commented on **2025-05-20** at **09:56:22**
+
+> > I'm not sure because it has good local hitrate on TG see this: [[#201](https://github.com/ikawrakow/ik_llama.cpp/issues/201) (comment)](https://github.com/ikawrakow/ik_llama.cpp/discussions/201#discussioncomment-13203928)
+>
+> The high local TG hit rate is measured at what context?
+
+32k
\ No newline at end of file
diff --git a/github-data/pull_requests/411 - Fix imatrix calculation for MLA models.md b/github-data/pull_requests/411 - Fix imatrix calculation for MLA models.md
index 23de01686..321f08bbb 100644
--- a/github-data/pull_requests/411 - Fix imatrix calculation for MLA models.md
+++ b/github-data/pull_requests/411 - Fix imatrix calculation for MLA models.md
@@ -1,16 +1,19 @@
-### 🐛 [#411](https://github.com/ikawrakow/ik_llama.cpp/pull/411) - Fix imatrix calculation for MLA models
+## 🔀 [Pull Request #411](https://github.com/ikawrakow/ik_llama.cpp/pull/411) - Fix imatrix calculation for MLA models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_mla_imatrix` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-12 |
| **Updated** | 2025-05-30 |
+| **Merged** | 2025-05-13 |
---
-#### Description
+## 📄 Description
-Mainline `llama.cpp` implemented MLA for DeepSeek models in [this PR](https://github.com/ggml-org/llama.cpp/pull/12801) 2.5 months after MLA was available here. The PR broke backwards compatibility with existing DeepSeek GGUFs. The incompatibility was handled in PR #394, and the reduced prompt processing performance with `llama.cpp`-style MLA GGUFs was recovered in #409.
+Mainline `llama.cpp` implemented MLA for DeepSeek models in [this PR](https://github.com/ggml-org/llama.cpp/pull/12801) 2.5 months after MLA was available here. The PR broke backwards compatibility with existing DeepSeek GGUFs. The incompatibility was handled in PR [#394](https://github.com/ikawrakow/ik_llama.cpp/issues/394), and the reduced prompt processing performance with `llama.cpp`-style MLA GGUFs was recovered in [#409](https://github.com/ikawrakow/ik_llama.cpp/issues/409).
This PR fixes imatrix calculation for `llama.cpp`-style MLA GGUFs. The mainline MLA implementation splits the original `attn_kv_b` 2D tensor into `attn_k_b` and `attn_v_b`, which are 3D and have the shape `128 x n_lora x n_head` (`attn_k_b`) and `n_lora x 128 x n_head` (`attn_v_b`). When the `imatrix` tool was written there were only 2D tensors in the models, so it does not really work for the new 3D MLA tensors. There are two issues:
* The first issue is that the activations are not contiguous, and this leads to a crash in the `imatrix` tool. The crash was fixed in mainline `llama.cpp` in [PR 13286](https://github.com/ggml-org/llama.cpp/pull/13286), and is fixed here with this PR
@@ -20,9 +23,9 @@ It is now almost a month since the `llama.cpp` [MLA PR](https://github.com/ggml-
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **bartowski1182** commented the **2025-05-12** at **21:49:14**:
+👤 **bartowski1182** commented on **2025-05-12** at **21:49:14**
I have been purposefully avoiding reuploading with MLA, not even with the awareness of this glaring issue :')
@@ -30,20 +33,34 @@ And of course even these changes you've made, despite me knowing your exact inte
---
-👤 **ThomasBaruzier** commented the **2025-05-13** at **19:30:11**:
+👤 **danielhanchen** commented on **2025-05-13** at **15:50:35**
+
+Super nice work @ikawrakow ! I had to temporarily disable quantizing the _v and _b matrices and left them in Q8_0 - you're new changes are super good - nice again!
+
+---
+
+👤 **ThomasBaruzier** commented on **2025-05-13** at **19:30:11**
Thank you for this!
I would be very grateful if anyone have the time/compute to create an imatrix for DeepSeek V3 0324 from this PR and uploads it to HF. It would probably take a week or two on my hardware
---
-👤 **ikawrakow** commented the **2025-05-14** at **11:01:31**:
+👤 **ikawrakow** commented on **2025-05-14** at **11:01:31**
I don't have the hardware to play with DeepSeek-V3/R1, but I'm curious about potential performance gains one can get that way. Published quantized models tend to use high-bit quants for the attention tensors (and after the MLA changes in `llama.cpp` they are all `Q8_0`). This is fine in terms of model size. But for token generation attention tensors are in the range of 40% of the model weights that need to get fetched from RAM/VRAM, so a lower bpw quantization type is going to have a non-negligible positive impact on performance. With this PR a proper imatrix can be computed, so perhaps it is feasible to go to lower bpw quantization for attention tensors without significant decrease in quantized model quality. From quick experiments with DeepSeek-V2-16B, a high-quality 5-bit quantization such as `IQ5_K` for the attention tensors is on par with `Q8_0`.
---
-👤 **ThomasBaruzier** commented the **2025-05-14** at **11:33:47**:
+👤 **saood06** commented on **2025-05-14** at **11:26:48**
+
+>so perhaps it is feasible to go to lower bpw quantization for attention tensors without significant decrease in quantized model quality. From quick experiments with DeepSeek-V2-16B, a high-quality 5-bit quantization such as `IQ5_K` for the attention tensors is on par with `Q8_0`.
+
+This is why I've been running pure IQ4_K and my next mix is going be a mix of IQ4_KS and Q4_K and IQ4_K.
+
+---
+
+👤 **ThomasBaruzier** commented on **2025-05-14** at **11:33:47**
> I don't have the hardware to play with DeepSeek-V3/R1
@@ -51,7 +68,7 @@ Do you accept donations? You could feature such a page on your README explaining
---
-👤 **ikawrakow** commented the **2025-05-14** at **12:00:50**:
+👤 **ikawrakow** commented on **2025-05-14** at **12:00:50**
> Do you accept donations?
@@ -61,7 +78,7 @@ I even own a Ryzen-5975WX system that I inherited from the company I was working
---
-👤 **ThomasBaruzier** commented the **2025-05-14** at **13:01:06**:
+👤 **ThomasBaruzier** commented on **2025-05-14** at **13:01:06**
Well, that's amazing news, even if your sponsor doesn't get back to you.
Quickly looking on eBay, you could get away with 512GB ECC RDIMM at 2666MHz for 450eur or 3200Mhz for 800eur
@@ -70,7 +87,7 @@ Do you think TP is achievable here?
---
-👤 **ikawrakow** commented the **2025-05-14** at **13:57:42**:
+👤 **ikawrakow** commented on **2025-05-14** at **13:57:42**
> Well, that's amazing news, even if your sponsor doesn't get back to you.
@@ -82,13 +99,23 @@ What is TP?
---
-👤 **ikawrakow** commented the **2025-05-14** at **14:42:28**:
+👤 **ThomasBaruzier** commented on **2025-05-14** at **14:31:33**
+
+Oh you mean procrastinate by instead submitting even more amazing PRs here lmao
+
+TP is tensor parallelism, aiming at using 100% of each GPU during inference. But I guess it would require a tremendous amount of work to get there from a codebase that is not meant for such a feature. I don't even know if there would be significant gains because of hybrid inference bottlenecks.
+
+https://github.com/turboderp-org/exllamav2/blob/master/exllamav2/exllamav2_ext/ext_tp.cpp
+
+---
+
+👤 **ikawrakow** commented on **2025-05-14** at **14:42:28**
Ah, OK, TP is one of the things I would look into if I had 2 or more GPUs. I wouldn't dare to do it in the CUDA code, but have some vague ideas how it could be done on the level of the compute graph. I have no idea if/how much performance one would gain. How much faster is exllamav2?
---
-👤 **ThomasBaruzier** commented the **2025-05-14** at **14:52:13**:
+👤 **ThomasBaruzier** commented on **2025-05-14** at **14:52:13**
Without speculative decoding, 2x3090@275w:
- Llama 3.3 70B 4.5bpw, from 18.1 to 22.9 tok/s
@@ -98,7 +125,7 @@ Exl3 is supposed to have even better TP performance, but it's not implemented ye
---
-👤 **ikawrakow** commented the **2025-05-14** at **15:02:49**:
+👤 **ikawrakow** commented on **2025-05-14** at **15:02:49**
> Without speculative decoding, 2x3090@275w:
>
@@ -113,18 +140,30 @@ So, barely faster than `llama.cpp`? I have a 4080 (717 GB/s), so less bandwidth
---
-👤 **ThomasBaruzier** commented the **2025-05-28** at **01:03:50**:
+👤 **ThomasBaruzier** commented on **2025-05-28** at **01:03:50**
Sorry for the long wait. I finally got the time to properly benchmark all the quants in this repo and multiple exl2 sizes of Llama-3.1-Nemotron-Nano-8B-v1 (maybe a bit too much, I tried to generate the exl quants based on the bpw of the equivalent gguf files, and as a result, small quants ended up a lot heavier than their gguf counterpart)
I was also curious to see how fast each quant is (for custom mixes), but I didn't convert with --pure for the sake of the benchmark.
-I used basic standard parameters for both programs, and generated 1k token * 10 and averaged the result. Using ExllamaV2 0.3.1 and latest ik_llama.cpp. I didn't benchmark tensor parralelism.
+I used basic standard parameters for both programs, and generated 1k token * 10 and averaged the result. Using ExllamaV2 0.3.1 and latest ik_llama.cpp. A single 350w RTX 3090 was used. I didn't benchmark tensor parralelism.
+
+Commands used:
+
+### TabbyAPI:
+```bash
+CUDA_VISIBLE_DEVICES=0 python main.py --model-dir /home/user/exl --model-name --max-seq-len 4096
+```
-A single 350w RTX 3090 was used to perform all these tests:
+### llama-server:
+```bash
+CUDA_VISIBLE_DEVICES=0 ./ik_llama.cpp/llama-server -m -ngl 99 -mg 0 -c 4096
+```

+ (yes I forgot that some types are aliases, and ended up benchmarking everything...)
+
Tables
@@ -217,32 +256,206 @@ A single 350w RTX 3090 was used to perform all these tests:
- (yes I forgot that some types are aliases, and ended up benchmarking everything...)
-
For completeness, another plot with PPL metrics could have been useful, but I don't know any program that can compute PPL from an API
---
-👤 **ThomasBaruzier** commented the **2025-05-28** at **11:47:28**:
+👤 **saood06** commented on **2025-05-28** at **01:17:16**
+
+@ThomasBaruzier
+
+Thanks for the data. Did you accidentally create a second comment instead of editing the first? (I do appreciate the tables for raw data though).
+
+Also this repo has three types of quants, the k-quants, i-quants which are also in mainline, and iqk-quants (see [this](https://github.com/ikawrakow/ik_llama.cpp/discussions/8)) which are not found on mainline. This is why some of the green-dots are especially close together or have sudden changes in performance as you are putting both i-quants and iqk-quants together, even though they are different types of quants.
+
+---
+
+👤 **ikawrakow** commented on **2025-05-28** at **07:44:31**
+
+@ThomasBaruzier Thanks for the detailed comparison!
+
+You are not using flash attention in `ik_llama.cpp`. For 1000 generated tokens this makes a noticeable difference in performance.
+
+I did a quick comparison to your data for Llama-3.1-Nemotron-Nano-8B-v1 on my 4080 GPU.
+
+First lets look at legacy (`Q4_0`, etc.) and k-quants (`Q4_K`, etc.)
+
+
+
+
+For 4+ bpw the behavior is as expected: the 4080 has less memory bandwidth, so performance is lower than your 3090. The difference decreases with decreasing bpw, that's most likely because you did not use FA. But something goes wrong on your GPU sub-4 bpw. k-quants have a very simple unpacking algorithm, so it would be unexpected if the calculation became compute bound so that the faster 4080 pulls ahead because of that.
+
+Things go south for i- and iqk-quants:
+
+
+
+If I put all 4080 data on the same plot it looks like this:
+
+
+
+Not much of a difference as TG is memory bound (apart from `IQ2_KS`, which is likely not fully optimized).
+
+The only explanation for the massive performance difference below 4 bpw between the 4080 and the 3090 is that the 3090 somehow does not like lookup tables (all i- and iqk-quants use a non-linear mapping between quant index and dequantized model weight, and this requires lookup tables).
+
+Here the `ik_llama.cpp` 4080 data for the above graphs
+
+| model | size | test | t/s |
+| ----------------- | ---------: | ------------: | ---------------: |
+| llama 8B Q8_0 | 7.95 GiB | tg1024 | 74.45 ± 0.20 |
+| llama 8B Q6_0 | 6.08 GiB | tg1024 | 94.15 ± 0.03 |
+| llama 8B Q5_0 | 5.21 GiB | tg1024 | 107.33 ± 0.05 |
+| llama 8B Q4_0 | 4.33 GiB | tg1024 | 124.87 ± 0.68 |
+| llama 8B Q6_K | 6.14 GiB | tg1024 | 92.48 ± 0.31 |
+| llama 8B Q5_K | 5.16 GiB | tg1024 | 107.81 ± 0.06 |
+| llama 8B Q4_K | 4.38 GiB | tg1024 | 123.71 ± 0.12 |
+| llama 8B Q3_K | 3.42 GiB | tg1024 | 139.14 ± 0.27 |
+| llama 8B Q2_K | 2.79 GiB | tg1024 | 174.68 ± 0.10 |
+| llama 8B IQ4_NL | 4.37 GiB | tg1024 | 121.61 ± 1.09 |
+| llama 8B IQ4_XS | 4.15 GiB | tg1024 | 127.50 ± 0.11 |
+| llama 8B IQ3_S | 3.44 GiB | tg1024 | 147.51 ± 0.72 |
+| llama 8B IQ3_XXS | 3.06 GiB | tg1024 | 160.33 ± 2.03 |
+| llama 8B IQ2_M | 2.74 GiB | tg1024 | 177.72 ± 0.03 |
+| llama 8B IQ2_XS | 2.42 GiB | tg1024 | 190.64 ± 1.20 |
+| llama 8B IQ2_XXS | 2.23 GiB | tg1024 | 195.61 ± 0.28 |
+| llama 8B IQ1_M | 2.03 GiB | tg1024 | 208.15 ± 1.80 |
+| llama 8B IQ1_S | 1.89 GiB | tg1024 | 213.89 ± 0.22 |
+| llama 8B IQ6_K | 6.19 GiB | tg1024 | 91.90 ± 0.28 |
+| llama 8B IQ5_K | 5.16 GiB | tg1024 | 106.67 ± 0.08 |
+| llama 8B IQ5_KS | 4.95 GiB | tg1024 | 110.28 ± 0.37 |
+| llama 8B IQ4_K | 4.37 GiB | tg1024 | 122.49 ± 0.09 |
+| llama 8B IQ4_KS | 4.16 GiB | tg1024 | 127.42 ± 0.66 |
+| llama 8B IQ3_K | 3.37 GiB | tg1024 | 146.39 ± 0.89 |
+| llama 8B IQ2_K | 2.53 GiB | tg1024 | 178.06 ± 1.58 |
+| llama 8B IQ2_KS | 2.30 GiB | tg1024 | 177.14 ± 0.07 |
+
+---
+
+👤 **ThomasBaruzier** commented on **2025-05-28** at **11:47:28**
Thanks for all the feedback!
FA helps with 4+bpw as you predicted, but for i- and iqk-quants, I'll investigate further another time, maybe a few param tweaks could help?
Here is a refined plot:
-
+
+
+
+Tables
+
+## EXL2 Models
+
+| Quant/Type | Size (MB) | Speed (tok/s) |
+|------------|-----------|---------------|
+| 2.38bpw | 3398 | 182.65 |
+| 2.59bpw | 3572 | 174.38 |
+| 2.93bpw | 3855 | 163.01 |
+| 3.18bpw | 4063 | 156.92 |
+| 3.59bpw | 4404 | 143.35 |
+| 4.02bpw | 4762 | 134.40 |
+| 4.42bpw | 5095 | 131.70 |
+| 4.65bpw | 5286 | 124.44 |
+| 4.90bpw | 5494 | 122.50 |
+| 5.57bpw | 6052 | 118.12 |
+| 6.0bpw | 6515 | 112.21 |
+| 6.56bpw | 6981 | 105.64 |
+| 8.0bpw | 8177 | 92.36 |
+
+## GGUF Models
+
+| Quant/Type | Size (MB) | Speed (tok/s) |
+|------------|-----------|---------------|
+| IQ1_S | 1946 | 159.98 |
+| IQ1_M | 2081 | 154.14 |
+| IQ2_XXS | 2288 | 143.73 |
+| IQ2_KS | 2361 | 133.46 |
+| IQ2_XS | 2485 | 141.81 |
+| IQ2_K | 2579 | 135.18 |
+| IQ2_S | 2630 | 140.29 |
+| IQ2_M | 2811 | 143.30 |
+| Q2_K_S | 2866 | 147.24 |
+| Q2_K | 3047 | 130.29 |
+| IQ3_XXS | 3139 | 136.93 |
+| IQ3_XS | 3355 | 132.24 |
+| IQ3_K | 3445 | 115.67 |
+| IQ3_S | 3511 | 131.65 |
+| Q3_K_S | 3511 | 104.77 |
+| IQ3_M | 3625 | 127.77 |
+| Q3_K_M | 3848 | 111.58 |
+| IQ3_KL | 3855 | 116.59 |
+| IQ4_KSS | 4027 | 117.97 |
+| Q3_K_L | 4138 | 107.06 |
+| IQ4_XS | 4241 | 119.22 |
+| IQ4_KS | 4247 | 120.30 |
+| Q4_0 | 4459 | 141.52 |
+| IQ4_NL | 4461 | 115.15 |
+| IQ4_K | 4461 | 107.58 |
+| Q4_K_S | 4491 | 138.01 |
+| Q4_K_M | 4700 | 132.63 |
+| Q4_1 | 4892 | 133.22 |
+| IQ5_KS | 5121 | 105.39 |
+| Q5_K_S | 5292 | 122.86 |
+| IQ5_K | 5339 | 99.89 |
+| Q5_0 | 5353 | 123.06 |
+| Q5_K_M | 5475 | 119.12 |
+| Q5_1 | 5787 | 117.32 |
+| Q6_0 | 6234 | 111.12 |
+| Q6_K | 6290 | 105.05 |
+| IQ6_K | 6350 | 98.25 |
+| Q8_0 | 8145 | 90.84 |
+
+
---
-👤 **saood06** commented the **2025-05-30** at **13:35:30**:
+👤 **Ph0rk0z** commented on **2025-05-30** at **12:27:07**
+
+Is it possible to repack mainline quants somehow to be ik_llama compatible? Rather than doing it on the fly to just save a "normal" version of the weights as a copy? That should regain memory lost from the work around?
+
+>So, barely faster than llama.cpp? I have a 4080 (717 GB/s), so less bandwidth than a 3090 (935 GB/s), and I get 125 t/s for Llama-8B at 4.5 bpw on the 4080. Napkin math: 125 * 8/70 * 935/717 = 18.6 t/s
+
+Nah. Regardless of whatever calculations, I can load 70b models in llama.cpp of all kinds. They are about as fast with pipeline parallel, but in tensor parallel it is a much larger difference as he showed. Plus that is 0 CTX speeds, as context builds, it output t/s falls much less. For multi GPU and dual CPU socket it is a worthy endeavor 100%. On larger models the responsiveness goes from bleh to wow.
+
+---
+
+👤 **ikawrakow** commented on **2025-05-30** at **12:40:07**
+
+> Is it possible to repack mainline quants somehow to be ik_llama compatible?
+
+What do you mean? All mainline quants apart from `TQ1_0` and `TQ2_0` can be used with `ik_llama.cpp`. `TQ1_0` and `TQ2_0` are BitNet specific, and there is a much faster implementation for BitNet here. If your question is if you can repack mainline quants to `*_R4` (or `*_R8`), yes, you can. You do it with
+```
+./bin/llama-quantize --repack $model $new_model X`
+```
+where `X` is some arbitrary quantization type name (`iq4_k_r4`, etc.)
+
+---
+
+👤 **saood06** commented on **2025-05-30** at **12:47:37**
+
+> > Is it possible to repack mainline quants somehow to be ik_llama compatible?
+>
+> What do you mean?
+
+I'm guessing he means the wk_b tensors ([#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) uses the term on the fly as well). And as an answer to his question, a python script using gguf-py should be able to do it, assuming you have "donor" tensors. (on my system this on the fly generation came at a minor but measurable cost, and if I still had any "legacy" quants, that I needed to use extensively I would take this approach)
+
+---
+
+👤 **ikawrakow** commented on **2025-05-30** at **13:18:53**
+
+> And as an answer to his question, a python script using gguf-py should be able to do it, assuming you have "donor" tensors. (on my system this on the fly generation came at a minor but measurable cost,
+
+I suspect because the new tensors get created as `Q8_0`, while your original quants were IIRC 4 or 5 bit. The tensors are created as 8 bit to avoid possible accuracy loss when doing `dequantize -> transpose -> quantize without imatrix`. If you are content with potentially losing some accuracy (as you would in a python script that adds the tensors to an already quantized model), then one can add a command line option to do that on-the-fly as well.
+
+---
+
+👤 **saood06** commented on **2025-05-30** at **13:35:30**
> I suspect because the new tensors get created as `Q8_0`, while your original quants were IIRC 4 or 5 bit. The tensors are created as 8 bit to avoid possible accuracy loss when doing `dequantize -> transpose -> quantize without imatrix`. If you are content with potentially losing some accuracy (as you would in a python script that adds the tensors to an already quantized model), then one can add a command line option to do that on-the-fly as well.
-I think I tested that theory and even accounting for that it was still a difference. I definitely have made quants that use `Q8_0` for those tensors, and I knew the on-the-fly ones were `Q8_0` at the time, but I'm not 100% sure if I did, and my notes aren't very thorough.
+I think I tested that theory and even accounting for that it was still a difference. I definitely have made quants that use `Q8_0` for those tensors, and I knew the on-the-fly ones were `Q8_0` at the time, but I'm not 100% sure if I did, and my notes aren't very thorough. My server is very picky about memory layout and placement.
---
-👤 **ubergarm** commented the **2025-05-30** at **13:42:28**:
+👤 **ubergarm** commented on **2025-05-30** at **13:42:28**
If folks are looking for ik_llama.cpp quantized version of DeepSeek-R1-0528, I just got one cooked up and [released on huggingface here](https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF).
@@ -255,7 +468,7 @@ Gonna definitely look into a smaller one now with attention tensors possibly `q6
---
-👤 **saood06** commented the **2025-05-30** at **13:50:19**:
+👤 **saood06** commented on **2025-05-30** at **13:50:19**
> If folks are looking for ik_llama.cpp quantized version of DeepSeek-R1-0528, I just got one cooked up and [released on huggingface here](https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF).
@@ -264,7 +477,52 @@ Thank you for the imatrix. I was considering making a discussion thread for Deep
---
-👤 **ikawrakow** commented the **2025-05-30** at **14:14:49**:
+👤 **ikawrakow** commented on **2025-05-30** at **13:53:58**
+
+> Gonna definitely look into a smaller one now with attention tensors possibly q6_K/q5_K or maybe iq5_ks (which might be good now for both CUDA and CPU?). I'm guessing mainline quants probably still have to keep attention at Q8_0 since that imatrix code doesn't have this?
+
+I would be curious to see how much degradation in quality there is from using 6- or 5-bit quants for the attention tensors and shared experts. It would be also interesting to see how much mainline suffers when quantizing attention with less than `Q8_0` without having the correct imatrix. I think answering these question would be enough for a paper, so if I was a researcher desperate to get another paper on my CV, I would definitely do it.
+
+---
+
+👤 **saood06** commented on **2025-05-30** at **13:59:09**
+
+> > Gonna definitely look into a smaller one now with attention tensors possibly q6_K/q5_K or maybe iq5_ks (which might be good now for both CUDA and CPU?). I'm guessing mainline quants probably still have to keep attention at Q8_0 since that imatrix code doesn't have this?
+>
+> I would be curious to see how much degradation in quality there is from using 6- or 5-bit quants for the attention tensors and shared experts. It would be also interesting to see how much mainline suffers when quantizing attention with less than `Q8_0` without having the correct imatrix. I think answering these question would be enough for a paper, so if I was a researcher desperate to get another paper on my CV, I would definitely do it.
+
+In theory if you had the compute and benchmarks, I think https://github.com/Just-Curieous/Curie would result in nice quants, but with a model this big the compute would might be very expensive.
+
+---
+
+👤 **ubergarm** commented on **2025-05-30** at **14:01:23**
+
+> I would be curious to see how much degradation in quality there is from using 6- or 5-bit quants for the attention tensors and shared experts.
+
+Yes, I wanted to do this after V3-0324, but I think now is the time to try it out. I'll probably go for `iq5_ks` given the recent improvements.
+
+I see [unsloth's keeping k_b and v_b at Q8_0](https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF?show_file_info=UD-IQ1_M%2FDeepSeek-R1-0528-UD-IQ1_M-00001-of-00005.gguf) but don't see the actual imatrix data file hrmm..
+
+
+---
+
+👤 **Ph0rk0z** commented on **2025-05-30** at **14:05:04**
+
+I thought there is still a penalty to memory, prompt processing and speed from using MLA containing mainline quants vs the old ones. Even if they load/work.
+
+As much as IQ3/Q4 quants sound nice, anything over 250gb is going to go down into unusable speeds on my system. Only get about ~50t/s PP and 10t/s using IQ2XXS as it is. If it gets much slower... Usability comes from cramming as much into the GPUs as possible because the CPUs/memory speed isn't that good.
+
+---
+
+👤 **saood06** commented on **2025-05-30** at **14:10:35**
+
+> I thought there is still a penalty to memory, prompt processing and speed from using MLA containing mainline quants vs the old ones. Even if they load/work.
+>
+[#394](https://github.com/ikawrakow/ik_llama.cpp/issues/394) and [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) are different, but they both add support for methods that differ from what our convert script generates.
+
+---
+
+👤 **ikawrakow** commented on **2025-05-30** at **14:14:49**
> In theory if you had the compute and benchmarks, I think https://github.com/Just-Curieous/Curie would result in nice quants, but with a model this big the compute would might be very expensive.
@@ -284,33 +542,108 @@ grep Final log.out
---
-👤 **ikawrakow** commented the **2025-05-30** at **14:20:54**:
+👤 **ikawrakow** commented on **2025-05-30** at **14:20:54**
> I thought there is still a penalty to memory, prompt processing and speed from using MLA containing mainline quants vs the old ones. Even if they load/work.
-There shouldn't be after #409. Just `-mla 3 -fa`, and it should be fine. If there is any difference in performance, it would be very minor. I don't see a real difference with the models I can run, but some systems are very finicky about where tensors end up in memory, and it that case there may be a small performance difference because the tensors created on the fly are not in the same contiguously allocated memory block.
+There shouldn't be after [#409](https://github.com/ikawrakow/ik_llama.cpp/issues/409). Just `-mla 3 -fa`, and it should be fine. If there is any difference in performance, it would be very minor. I don't see a real difference with the models I can run, but some systems are very finicky about where tensors end up in memory, and it that case there may be a small performance difference because the tensors created on the fly are not in the same contiguously allocated memory block as the other tensors.
---
-👤 **saood06** commented the **2025-05-30** at **14:35:02**:
+👤 **Ph0rk0z** commented on **2025-05-30** at **14:29:29**
+
+>https://github.com/ikawrakow/ik_llama.cpp/pull/394 and https://github.com/ikawrakow/ik_llama.cpp/pull/259 are different, but they both add support for methods that differ from what our convert script generates.
+
+If I had the b/w to download the full model and use the script, I'd be golden. But sadly I have to go with what people upload. Losing several GB of GPU memory is another couple of tensors I can throw on there. Just trying to get a gauge of if I should avoid any new mainline quants. Unsloth was going to make some kind of 140gb one for the new R1. Even if quality is a little lower, speed is going to be like Qwen.
+
+>there shouldn't be after https://github.com/ikawrakow/ik_llama.cpp/pull/409. Just -mla 3 -fa, and it should be fine.
+
+I use those settings, so it will be mostly the same memory footprint as a native quant? Single GPU for ctx, I see how it doesn't matter but for 4x24 it really does.
+
+---
+
+👤 **saood06** commented on **2025-05-30** at **14:31:11**
+
+> > [#394](https://github.com/ikawrakow/ik_llama.cpp/issues/394) and [#259](https://github.com/ikawrakow/ik_llama.cpp/issues/259) are different, but they both add support for methods that differ from what our convert script generates.
+>
+> If I had the b/w to download the full model and use the script, I'd be golden.
+
+I could maybe do tensor surgery and upload just the donor parts to huggingface, if you want?
+
+---
+
+👤 **saood06** commented on **2025-05-30** at **14:35:02**
> Do we need an "AI" agent for this?
-If you want to create a full almost continuous spectrum of quality to size trade-offs you kind of need to do a lot of experimenting. I know ubergarm and EAddario are working on trying to rank tensors/layers to achieve that goal as well.
+If you want to create a full almost continuous spectrum of quality to size trade-offs you kind of need to do a lot of experimenting. I know ubergarm and EAddario are working on trying to rank tensors/layers to achieve that goal as well, but I do not think a greedy algorithm is optimal, and doing anything more would require more than just using a ranking.
+
+---
+
+👤 **ubergarm** commented on **2025-05-30** at **18:06:07**
+
+> I would be curious to see how much degradation in quality there is from using 6- or 5-bit quants for the attention tensors and shared experts.
+
+While I don't have a Ph.D., I didn't have to vibe code this bash script to brute force check these 7 test cases varying attn and shexp but holding all else constant q4_0.
+
+Its gonna take a long while to finish and then test perplexity on though. Will report back by later this weekend hopefully.
+
+
+
+👈 Test Case Bash Script
+
+```bash
+#!/usr/bin/env bash
+
+model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-256x21B-0528-BF16-00001-of-00030.gguf
+imatrix=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/imatrix-DeepSeek-R1-0528.dat
+outdir=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF
+basename=DeepSeek-R1-0528
+base_q=q4_0
+
+# iterate over list of tuples as attn_k_b shape requires qN_0 types
+for q in q8_0,q8_0 q6_0,q6_K q6_0,iq6_k q5_0,q5_K q5_0,iq5_k q5_0,iq5_ks q4_0,q4_0
+do
+ # unpack tuples into $1,$2
+ IFS=","
+ set -- $q
+
+ # quantize using $1 for attn_k_b and $2 for rest of attn and base_q for all else
+ numactl --interleave=all \
+ ./build/bin/llama-quantize \
+ --imatrix $imatrix \
+ --custom-q attn_k_b=$1 \
+ --custom-q attn=$2 \
+ --custom-q shexp=$2 \
+ --custom-q exps=$base_q \
+ $model \
+ $outdir/$basename-$base_q-attn-shexp-$2.gguf \
+ $base_q \
+ 2>&1 | tee -a logs/quantize-$basename-$base_q-attn-shexp-$2.gguf
+done
+```
+
+
+
+> It would be also interesting to see how much mainline suffers when quantizing attention with less than Q8_0 without having the correct imatrix.
+
+I haven't tried making MLA imatrix on mainline, but possibly there are some issues still with the 3D tensor shapes right? I'll not fuss with this for now, maybe someone else can figure this one out.
+
+I'm gonna release a quant today with `q5_0/iq5_ks/iq4_ks` attn_k_b/attn/shexp before discovering thes results also just so there will be at least one quant available for folks to try without q8_0's for `k_b` and `v_b`. Thanks!
---
-👤 **Ph0rk0z** commented the **2025-05-30** at **19:24:55**:
+👤 **Ph0rk0z** commented on **2025-05-30** at **19:24:55**
>I could maybe do tensor surgery and upload just the donor parts to huggingface, if you want?
-So far I have smoothie qwen, 2 quants of regular qwen and the older V3 (3/24). Those all work. I wanted to get chimera but not sure there is a small enough one out there. The mini R1 from now I'm willing to gamble with the smallest quant if it ever makes an appearance.
+So far I have smoothie qwen, 2 quants of regular qwen and the older V3 (3/24). Those all work. I wanted to get chimera but not sure there is a small enough one out there. The mini R1 from this week, I'm willing to gamble with the smallest quant, if it ever makes an appearance.
For the future though, who knows. Might be worth it.
---
-👤 **ubergarm** commented the **2025-05-30** at **20:13:14**:
+👤 **ubergarm** commented on **2025-05-30** at **20:13:14**
> Thank you for the imatrix. I was considering making a discussion thread for DeepSeek-R1-0528. The one we had for V3 was quite nice.
diff --git a/github-data/pull_requests/413 - Fix new CUDA FA on Touring.md b/github-data/pull_requests/413 - Fix new CUDA FA on Touring.md
index 5eb6137ce..6015d873f 100644
--- a/github-data/pull_requests/413 - Fix new CUDA FA on Touring.md
+++ b/github-data/pull_requests/413 - Fix new CUDA FA on Touring.md
@@ -1,13 +1,16 @@
-### 🐛 [#413](https://github.com/ikawrakow/ik_llama.cpp/pull/413) - Fix new CUDA FA on Touring
+## 🔀 [Pull Request #413](https://github.com/ikawrakow/ik_llama.cpp/pull/413) - Fix new CUDA FA on Touring
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_412` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-12 |
| **Updated** | 2025-05-12 |
+| **Merged** | 2025-05-12 |
---
-#### Description
+## 📄 Description
-Closes #412
\ No newline at end of file
+Closes [#412](https://github.com/ikawrakow/ik_llama.cpp/issues/412)
\ No newline at end of file
diff --git a/github-data/pull_requests/415 - Fix SER _CPU_.md b/github-data/pull_requests/415 - Fix SER CPU.md
similarity index 77%
rename from github-data/pull_requests/415 - Fix SER _CPU_.md
rename to github-data/pull_requests/415 - Fix SER CPU.md
index 25de60795..e0430b6f3 100644
--- a/github-data/pull_requests/415 - Fix SER _CPU_.md
+++ b/github-data/pull_requests/415 - Fix SER CPU.md
@@ -1,14 +1,17 @@
-### 🐛 [#415](https://github.com/ikawrakow/ik_llama.cpp/pull/415) - Fix SER (CPU)
+## 🔀 [Pull Request #415](https://github.com/ikawrakow/ik_llama.cpp/pull/415) - Fix SER (CPU)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_ser` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-13 |
| **Updated** | 2025-05-13 |
+| **Merged** | 2025-05-13 |
---
-#### Description
+## 📄 Description
There have been reports that Smart Expert Reduction (SER) can produce garbage.
@@ -20,9 +23,15 @@ A similar fix is required for the CUDA implementation. This is left for a follow
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-05-13** at **15:04:17**:
+👤 **ikawrakow** commented on **2025-05-13** at **14:54:54**
+
+I couldn't break it, so I'll just merge it without waiting for feedback.
+
+---
+
+👤 **ubergarm** commented on **2025-05-13** at **15:04:17**
Hah, our sleep schedules are just off, I just tested this compiling CPU only and it indeed fixes the issue when using `-ser 6,1`.
diff --git a/github-data/pull_requests/416 - Fix SER _CUDA_.md b/github-data/pull_requests/416 - Fix SER CUDA.md
similarity index 71%
rename from github-data/pull_requests/416 - Fix SER _CUDA_.md
rename to github-data/pull_requests/416 - Fix SER CUDA.md
index 547206519..76ffce95b 100644
--- a/github-data/pull_requests/416 - Fix SER _CUDA_.md
+++ b/github-data/pull_requests/416 - Fix SER CUDA.md
@@ -1,24 +1,27 @@
-### 🐛 [#416](https://github.com/ikawrakow/ik_llama.cpp/pull/416) - Fix SER (CUDA)
+## 🔀 [Pull Request #416](https://github.com/ikawrakow/ik_llama.cpp/pull/416) - Fix SER (CUDA)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_ser_cuda` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-13 |
| **Updated** | 2025-05-14 |
+| **Merged** | 2025-05-14 |
---
-#### Description
+## 📄 Description
-Follow up of #415. This should fix SER issues on CUDA.
+Follow up of [#415](https://github.com/ikawrakow/ik_llama.cpp/issues/415). This should fix SER issues on CUDA.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-05-13** at **15:30:55**:
+👤 **ubergarm** commented on **2025-05-13** at **15:30:55**
-Interestingly I recompiled main with CUDA (after you merged #415 into main) and haven't been able to reproduce the error now.
+Interestingly I recompiled main with CUDA (after you merged [#415](https://github.com/ikawrakow/ik_llama.cpp/issues/415) into main) and haven't been able to reproduce the error now.
fwiw this command is working both with and without this PR:
@@ -45,7 +48,7 @@ I don't have enough VRAM to fully offload any R1/V3 models so not sure how to be
---
-👤 **ikawrakow** commented the **2025-05-13** at **15:43:01**:
+👤 **ikawrakow** commented on **2025-05-13** at **15:43:01**
On CUDA it is more difficult to trigger the bug. I used Qwen3-30B-A3B quantized with `IQ5_K`. I only have a 16 GB GPU, so I had to leave the last 19 layers of exerts on the CPU. I used `llama-cli` like this
```
@@ -66,6 +69,6 @@ I don't think partial offload is required, and it is likely the bug will trigger
---
-👤 **ikawrakow** commented the **2025-05-13** at **15:57:54**:
+👤 **ikawrakow** commented on **2025-05-13** at **15:57:54**
Oops, it is still failing with DeepSeek-Lite. Converting to draft.
\ No newline at end of file
diff --git a/github-data/pull_requests/417 - CUDA_ quantized GEMM for for IQ4_K_ IQ5_K_ IQ6_K.md b/github-data/pull_requests/417 - CUDA quantized GEMM for for IQ4_K IQ5_K IQ6_K.md
similarity index 78%
rename from github-data/pull_requests/417 - CUDA_ quantized GEMM for for IQ4_K_ IQ5_K_ IQ6_K.md
rename to github-data/pull_requests/417 - CUDA quantized GEMM for for IQ4_K IQ5_K IQ6_K.md
index 1a72842e8..2bf1c01f9 100644
--- a/github-data/pull_requests/417 - CUDA_ quantized GEMM for for IQ4_K_ IQ5_K_ IQ6_K.md
+++ b/github-data/pull_requests/417 - CUDA quantized GEMM for for IQ4_K IQ5_K IQ6_K.md
@@ -1,20 +1,23 @@
-### 🔀 [#417](https://github.com/ikawrakow/ik_llama.cpp/pull/417) - CUDA: quantized GEMM for for IQ4_K, IQ5_K, IQ6_K
+## 🔀 [Pull Request #417](https://github.com/ikawrakow/ik_llama.cpp/pull/417) - CUDA: quantized GEMM for for IQ4_K, IQ5_K, IQ6_K
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_mmq_iq4_k` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-14 |
| **Updated** | 2025-05-14 |
+| **Merged** | 2025-05-14 |
---
-#### Description
+## 📄 Description
-This PR follows in the footsteps of #374, and is the next step towards complete implementation of quantized matrix multiplications (a.k.a. MMQ) for the `IQX_K` quants.
+This PR follows in the footsteps of [#374](https://github.com/ikawrakow/ik_llama.cpp/issues/374), and is the next step towards complete implementation of quantized matrix multiplications (a.k.a. MMQ) for the `IQX_K` quants.
We get in the range of 15% performance improvement compared to the existing implementation that dequantizes to `fp16` and then uses cuBLAS to perform the matrix multiplications.
-Another benefit is avoiding the numerical issues observed for DeepSeek models when using `fp16` arithmetic (see #261). It also potentially leads to CUDA compute buffer size reduction because the intermediate buffer for the dequantized tensor is not required.
+Another benefit is avoiding the numerical issues observed for DeepSeek models when using `fp16` arithmetic (see [#261](https://github.com/ikawrakow/ik_llama.cpp/issues/261)). It also potentially leads to CUDA compute buffer size reduction because the intermediate buffer for the dequantized tensor is not required.
I have reused the existing matrix multiplication kernels, providing only the unpacking of the quantized data into the tiles used in the kernels. As such, performance is largely determined by the kernel (blocks of 16 or blocks of 32), and the unpacking cost (converting the packed data into `int8_t` values ready for matrix multiplications). This is best illustrated with the following graph. Model is LLaMA-3.1-8B, GPU is RTX-4080. All quantizations are done using `--output-tensor-type q6_K --pure`.
@@ -28,9 +31,9 @@ Such efforts are left for a future PR.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-05-14** at **14:42:37**:
+👤 **ubergarm** commented on **2025-05-14** at **14:42:37**
This is great to see the CUDA performance of the new iqX_k quants relative to each other.
diff --git a/github-data/pull_requests/418 - CUDA quantized GEMM for for IQ2_KS IQ2_K IQ3_K.md b/github-data/pull_requests/418 - CUDA quantized GEMM for for IQ2_KS IQ2_K IQ3_K.md
new file mode 100644
index 000000000..471a3f47d
--- /dev/null
+++ b/github-data/pull_requests/418 - CUDA quantized GEMM for for IQ2_KS IQ2_K IQ3_K.md
@@ -0,0 +1,34 @@
+## 🔀 [Pull Request #418](https://github.com/ikawrakow/ik_llama.cpp/pull/418) - CUDA: quantized GEMM for for IQ2_KS, IQ2_K, IQ3_K
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_mmq_iq2_k` |
+| **Target Branch** | `main` |
+| **Created** | 2025-05-14 |
+| **Updated** | 2025-05-15 |
+| **Merged** | 2025-05-15 |
+
+---
+
+## 📄 Description
+
+This PR is a follow up of [#417](https://github.com/ikawrakow/ik_llama.cpp/issues/417) and (almost) completes the quantized matrix multiplication (a.k.a. MMQ) implementation for `IQX_K` quants. The only one missing is `IQ4_KSS`, but I don't think I'll do that one as the packing is much too complicated.
+
+There are larger performance gains for `IQ2_KS` (~35%) than for `IQ2_K` and `IQ3_K` (~10%). This is due to `IQ2_KS` having blocks of 32 and thus being able to use the more efficient GEMM kernel (see discussion in [#417](https://github.com/ikawrakow/ik_llama.cpp/issues/417)).
+
+The graph illustrates the performance improvements for the same setup as in [#417](https://github.com/ikawrakow/ik_llama.cpp/issues/417).
+
+
+
+Looking at this graph and in the graph in [#417](https://github.com/ikawrakow/ik_llama.cpp/issues/417), I almost feel like adding `IQ3_KS` and `IQ5_KS` as 3- and 5-bit quants with blocks of 32.
+
+---
+
+## 💬 Conversation
+
+👤 **ubergarm** commented on **2025-05-14** at **19:24:21**
+
+Wow the IQ2_KS improved around 35%!? The 32 block `_KS` variants have a nice speedup.
+
+I'd probably try out the larger IQ3_KS and especially IQ5_KS for some mixes in the future if you decide to add them.
\ No newline at end of file
diff --git a/github-data/pull_requests/418 - CUDA_ quantized GEMM for for IQ2_KS_ IQ2_K_ IQ3_K.md b/github-data/pull_requests/418 - CUDA_ quantized GEMM for for IQ2_KS_ IQ2_K_ IQ3_K.md
deleted file mode 100644
index 0e27e9007..000000000
--- a/github-data/pull_requests/418 - CUDA_ quantized GEMM for for IQ2_KS_ IQ2_K_ IQ3_K.md
+++ /dev/null
@@ -1,31 +0,0 @@
-### 🔀 [#418](https://github.com/ikawrakow/ik_llama.cpp/pull/418) - CUDA: quantized GEMM for for IQ2_KS, IQ2_K, IQ3_K
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-05-14 |
-| **Updated** | 2025-05-15 |
-
----
-
-#### Description
-
-This PR is a follow up of #417 and (almost) completes the quantized matrix multiplication (a.k.a. MMQ) implementation for `IQX_K` quants. The only one missing is `IQ4_KSS`, but I don't think I'll do that one as the packing is much too complicated.
-
-There are larger performance gains for `IQ2_KS` (~35%) than for `IQ2_K` and `IQ3_K` (~10%). This is due to `IQ2_KS` having blocks of 32 and thus being able to use the more efficient GEMM kernel (see discussion in #417).
-
-The graph illustrates the performance improvements for the same setup as in #417.
-
-
-
-Looking at this graph and in the graph in #417, I almost feel like adding `IQ3_KS` and `IQ5_KS` as 3- and 5-bit quants with blocks of 32.
-
----
-
-#### 💬 Conversation
-
-👤 **ubergarm** commented the **2025-05-14** at **19:24:21**:
-
-Wow the IQ2_KS improved around 35%!? The 32 block `_KS` variants have a nice speedup.
-
-I'd probably try out the larger IQ3_KS and especially IQ5_KS for some mixes in the future if you decide to add them.
\ No newline at end of file
diff --git a/github-data/pull_requests/42 - Adding fused rms_norm.md b/github-data/pull_requests/42 - Adding fused rms_norm.md
index 5f126af5e..dd9144afc 100644
--- a/github-data/pull_requests/42 - Adding fused rms_norm.md
+++ b/github-data/pull_requests/42 - Adding fused rms_norm.md
@@ -1,14 +1,17 @@
-### 🔀 [#42](https://github.com/ikawrakow/ik_llama.cpp/pull/42) - Adding fused rms_norm
+## 🔀 [Pull Request #42](https://github.com/ikawrakow/ik_llama.cpp/pull/42) - Adding fused rms_norm
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fused_rms_norm` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-08 |
| **Updated** | 2024-09-08 |
+| **Merged** | 2024-09-08 |
---
-#### Description
+## 📄 Description
Many models have one or more of `rms_norm` followed by multiplication with a normalization tensor that is (almost) always just a single row. Fusing these two operations into a single op reduces thread synchronization cost and thus has the potential to improve performance, especially for relatively small models.
diff --git a/github-data/pull_requests/421 - Fix standard attention on the CPU.md b/github-data/pull_requests/421 - Fix standard attention on the CPU.md
index f0ca9d530..cbb8504ac 100644
--- a/github-data/pull_requests/421 - Fix standard attention on the CPU.md
+++ b/github-data/pull_requests/421 - Fix standard attention on the CPU.md
@@ -1,13 +1,16 @@
-### 🐛 [#421](https://github.com/ikawrakow/ik_llama.cpp/pull/421) - Fix standard attention on the CPU
+## 🔀 [Pull Request #421](https://github.com/ikawrakow/ik_llama.cpp/pull/421) - Fix standard attention on the CPU
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_standard_attention_cpu` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-15 |
| **Updated** | 2025-05-15 |
+| **Merged** | 2025-05-15 |
---
-#### Description
+## 📄 Description
-I have focusing on FA, MLA, FlashMLA lately, and at some point I have broken the standard self attention CPU implementation. This PR fixes it and closes #420.
\ No newline at end of file
+I have focusing on FA, MLA, FlashMLA lately, and at some point I have broken the standard self attention CPU implementation. This PR fixes it and closes [#420](https://github.com/ikawrakow/ik_llama.cpp/issues/420).
\ No newline at end of file
diff --git a/github-data/pull_requests/422 - Adding IQ5_KS - 5.25 bpw quants.md b/github-data/pull_requests/422 - Adding IQ5_KS - 5.25 bpw quants.md
index bf3620a9c..e99834ca0 100644
--- a/github-data/pull_requests/422 - Adding IQ5_KS - 5.25 bpw quants.md
+++ b/github-data/pull_requests/422 - Adding IQ5_KS - 5.25 bpw quants.md
@@ -1,16 +1,19 @@
-### 🔀 [#422](https://github.com/ikawrakow/ik_llama.cpp/pull/422) - Adding IQ5_KS - 5.25 bpw quants
+## 🔀 [Pull Request #422](https://github.com/ikawrakow/ik_llama.cpp/pull/422) - Adding IQ5_KS - 5.25 bpw quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq5_ks` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-15 |
| **Updated** | 2025-05-18 |
+| **Merged** | 2025-05-15 |
---
-#### Description
+## 📄 Description
-For motivation, see the CUDA performance graphs in #417 and #418.
+For motivation, see the CUDA performance graphs in [#417](https://github.com/ikawrakow/ik_llama.cpp/issues/417) and [#418](https://github.com/ikawrakow/ik_llama.cpp/issues/418).
Implementation for `AVX2, Zen4, ARM_NEON, CUDA, Metal`.
@@ -20,10 +23,10 @@ I also want to add interleaved variant `IQ5_KS_R4` before giving more performanc
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-05-18** at **21:18:35**:
+👤 **ubergarm** commented on **2025-05-18** at **21:18:35**
-Just did some testing of a mixed `IQ5_KS` / `IQ4_KS` quant of Qwen3-14B dense showing some Perplexity and Speed comparisons for full CUDA offload in this [new quant cookers guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/434).
+Just did some testing of a mixed `IQ5_KS` / `IQ4_KS` quant of Qwen3-14B dense showing some Perplexity and Speed comparisons for full CUDA offload in this [new quant cookers guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/434) (just scroll to bottom, can't link anchors in gh discussions...)
Thanks for adding, the quality looks really good for the size!
\ No newline at end of file
diff --git a/github-data/pull_requests/424 - Adding forgotten template instance for iq5_ks.md b/github-data/pull_requests/424 - Adding forgotten template instance for iq5_ks.md
index 7c65c25e6..0343b1a39 100644
--- a/github-data/pull_requests/424 - Adding forgotten template instance for iq5_ks.md
+++ b/github-data/pull_requests/424 - Adding forgotten template instance for iq5_ks.md
@@ -1,15 +1,18 @@
-### 🔀 [#424](https://github.com/ikawrakow/ik_llama.cpp/pull/424) - Adding forgotten template instance for iq5_ks
+## 🔀 [Pull Request #424](https://github.com/ikawrakow/ik_llama.cpp/pull/424) - Adding forgotten template instance for iq5_ks
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/add_missing_mmq_iq5ks` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-15 |
| **Updated** | 2025-05-15 |
+| **Merged** | 2025-05-15 |
---
-#### Description
+## 📄 Description
Sorry about that.
-Closes #423
\ No newline at end of file
+Closes [#423](https://github.com/ikawrakow/ik_llama.cpp/issues/423)
\ No newline at end of file
diff --git a/github-data/pull_requests/426 - IQ5_KS_R4 row-interleaved IQ5_KS.md b/github-data/pull_requests/426 - IQ5_KS_R4 row-interleaved IQ5_KS.md
new file mode 100644
index 000000000..e8dbad94c
--- /dev/null
+++ b/github-data/pull_requests/426 - IQ5_KS_R4 row-interleaved IQ5_KS.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #426](https://github.com/ikawrakow/ik_llama.cpp/pull/426) - IQ5_KS_R4: row-interleaved IQ5_KS
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq5_ks_r4` |
+| **Target Branch** | `main` |
+| **Created** | 2025-05-16 |
+| **Updated** | 2025-05-17 |
+| **Merged** | 2025-05-17 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/426 - IQ5_KS_R4_ row-interleaved IQ5_KS.md b/github-data/pull_requests/426 - IQ5_KS_R4_ row-interleaved IQ5_KS.md
deleted file mode 100644
index 1d0843dd2..000000000
--- a/github-data/pull_requests/426 - IQ5_KS_R4_ row-interleaved IQ5_KS.md
+++ /dev/null
@@ -1,7 +0,0 @@
-### 🔀 [#426](https://github.com/ikawrakow/ik_llama.cpp/pull/426) - IQ5_KS_R4: row-interleaved IQ5_KS
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-05-16 |
-| **Updated** | 2025-05-17 |
\ No newline at end of file
diff --git a/github-data/pull_requests/427 - Fix AVX2 implementation of IQ4_K_ IQ4_KS_ IQ5_K_ IQ6_K.md b/github-data/pull_requests/427 - Fix AVX2 implementation of IQ4_K IQ4_KS IQ5_K IQ6_K.md
similarity index 72%
rename from github-data/pull_requests/427 - Fix AVX2 implementation of IQ4_K_ IQ4_KS_ IQ5_K_ IQ6_K.md
rename to github-data/pull_requests/427 - Fix AVX2 implementation of IQ4_K IQ4_KS IQ5_K IQ6_K.md
index 820b90a95..2b1e6aadb 100644
--- a/github-data/pull_requests/427 - Fix AVX2 implementation of IQ4_K_ IQ4_KS_ IQ5_K_ IQ6_K.md
+++ b/github-data/pull_requests/427 - Fix AVX2 implementation of IQ4_K IQ4_KS IQ5_K IQ6_K.md
@@ -1,19 +1,22 @@
-### 🐛 [#427](https://github.com/ikawrakow/ik_llama.cpp/pull/427) - Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K
+## 🔀 [Pull Request #427](https://github.com/ikawrakow/ik_llama.cpp/pull/427) - Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_iq4k_avx2` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-16 |
| **Updated** | 2025-05-16 |
+| **Merged** | 2025-05-16 |
---
-#### Description
+## 📄 Description
I have made the exact same mistake a number of times.
On `AVX2` the instruction to perform dot products of `int8_t` vectors (as needed in quantized matrix multiplications) is `_mm256_maddubs_epi8(x, y)`, where `x` must be unsigned and `y` signed, and the result is a SIMD vector of signed `int16_t` values $z_i = x_{2i} y_{2i} + x_{2i+1} y_{2i+1}$. The quant values `x` and quantized activations `y` are signed, so one way to deal with the the strangeness of this instruction is to add a suitable constant value `c` to `x` so that it becomes unsigned, use `_mm256_maddubs_epi8(c+x, y)` to accumulate the dot product, and at the end subtract $c \cdot b$, where $b = \sum y_i$ has been pre-computed when quantizing the activations `y`. The issue arises when the `x` values span the full `int8_t` range as it is the case with the non-linear quants `IQ4_NL, IQ4_XS, IQ4_K, IQ4_KS, IQ5_K, IQ5_KS, IQ6_K`. In that case `c = 128`, the `c+x` values span the full `uint8_t` range, and hence it is possible to overflow the signed `int16_t` range.
-I had though that I had fixed this mistake, but while working on the `IQ5_KS` type added in PR #422 I noticed that the issue still exists `IQ4_K, IQ4_KS, IQ5_K, IQ6_K` and was only fixed for the corresponding repacked variants.
+I had though that I had fixed this mistake, but while working on the `IQ5_KS` type added in PR [#422](https://github.com/ikawrakow/ik_llama.cpp/issues/422) I noticed that the issue still exists `IQ4_K, IQ4_KS, IQ5_K, IQ6_K` and was only fixed for the corresponding repacked variants.
The PR corrects the problem. There will be a slight (a few percent) PP performance degradation on `AVX2` for these quantization types.
\ No newline at end of file
diff --git a/github-data/pull_requests/428 - Zen4_ Faster PP for IQ2_KS_ IQ4_KS_ IQ5_KS.md b/github-data/pull_requests/428 - Zen4 Faster PP for IQ2_KS IQ4_KS IQ5_KS.md
similarity index 68%
rename from github-data/pull_requests/428 - Zen4_ Faster PP for IQ2_KS_ IQ4_KS_ IQ5_KS.md
rename to github-data/pull_requests/428 - Zen4 Faster PP for IQ2_KS IQ4_KS IQ5_KS.md
index 2890e411f..23f4a7647 100644
--- a/github-data/pull_requests/428 - Zen4_ Faster PP for IQ2_KS_ IQ4_KS_ IQ5_KS.md
+++ b/github-data/pull_requests/428 - Zen4 Faster PP for IQ2_KS IQ4_KS IQ5_KS.md
@@ -1,14 +1,17 @@
-### 🔀 [#428](https://github.com/ikawrakow/ik_llama.cpp/pull/428) - Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS
+## 🔀 [Pull Request #428](https://github.com/ikawrakow/ik_llama.cpp/pull/428) - Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/zen4_faster_iq4ks_iq5ks` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-17 |
| **Updated** | 2025-05-17 |
+| **Merged** | 2025-05-17 |
---
-#### Description
+## 📄 Description
| model | size | threads | test | t/s (main) | t/s (PR) | Speedup |
| ---------------- | ---------: | ------: | ------------: | ---------------: | ------------: | -------: |
diff --git a/github-data/pull_requests/429 - Option to enable or disable the CPU FA kernels.md b/github-data/pull_requests/429 - Option to enable or disable the CPU FA kernels.md
index 789d05d4e..9122817fa 100644
--- a/github-data/pull_requests/429 - Option to enable or disable the CPU FA kernels.md
+++ b/github-data/pull_requests/429 - Option to enable or disable the CPU FA kernels.md
@@ -1,14 +1,17 @@
-### 🔀 [#429](https://github.com/ikawrakow/ik_llama.cpp/pull/429) - Option to enable or disable the CPU FA kernels
+## 🔀 [Pull Request #429](https://github.com/ikawrakow/ik_llama.cpp/pull/429) - Option to enable or disable the CPU FA kernels
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/option_cpu_fa` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-17 |
| **Updated** | 2025-05-17 |
+| **Merged** | 2025-05-17 |
---
-#### Description
+## 📄 Description
The compilation of `iqk_mul_mat.cpp` takes extremely long - currently 2m22s on my Ryzen-7950X CPU, with some users reporting times in the range of 30 minutes on an Antroid phone using Termux . This is to a large extent due to the Flash Attention (FA) kernels. Hence, this PR adds a `cmake` option to enable or disable the CPU FA kernels. It is set on by default, and can be changed using
```
diff --git a/github-data/pull_requests/43 - iq2_tn slightly faster PP on Zen4.md b/github-data/pull_requests/43 - iq2_tn slightly faster PP on Zen4.md
new file mode 100644
index 000000000..3f14c04f1
--- /dev/null
+++ b/github-data/pull_requests/43 - iq2_tn slightly faster PP on Zen4.md
@@ -0,0 +1,20 @@
+## 🔀 [Pull Request #43](https://github.com/ikawrakow/ik_llama.cpp/pull/43) - iq2_tn: slightly faster PP on Zen4
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_tn_faster_pp` |
+| **Target Branch** | `main` |
+| **Created** | 2024-09-08 |
+| **Updated** | 2024-09-08 |
+| **Merged** | 2024-09-08 |
+
+---
+
+## 📄 Description
+
+With this change we get `PP512 = 494 t/s` (using flash attention), up from `468 t/s` (~5% improvement) running on a Ryzen-7950X CPU.
+
+Compared to the initial `IQ2_TN` PR [#13](https://github.com/ikawrakow/ik_llama.cpp/issues/13) the cumulative improvement is 15%.
+
+Compared to `TQ2_0` in `llama.cpp`, which has now been merged, we are now 80% faster.
\ No newline at end of file
diff --git a/github-data/pull_requests/43 - iq2_tn_ slightly faster PP on Zen4.md b/github-data/pull_requests/43 - iq2_tn_ slightly faster PP on Zen4.md
deleted file mode 100644
index 76eb73e8d..000000000
--- a/github-data/pull_requests/43 - iq2_tn_ slightly faster PP on Zen4.md
+++ /dev/null
@@ -1,17 +0,0 @@
-### 🔀 [#43](https://github.com/ikawrakow/ik_llama.cpp/pull/43) - iq2_tn: slightly faster PP on Zen4
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-09-08 |
-| **Updated** | 2024-09-08 |
-
----
-
-#### Description
-
-With this change we get `PP512 = 494 t/s` (using flash attention), up from `468 t/s` (~5% improvement) running on a Ryzen-7950X CPU.
-
-Compared to the initial `IQ2_TN` PR #13 the cumulative improvement is 15%.
-
-Compared to `TQ2_0` in `llama.cpp`, which has now been merged, we are now 80% faster.
\ No newline at end of file
diff --git a/github-data/pull_requests/430 - Disable multi-add for now.md b/github-data/pull_requests/430 - Disable multi-add for now.md
index 6bbacd35a..4178c5aca 100644
--- a/github-data/pull_requests/430 - Disable multi-add for now.md
+++ b/github-data/pull_requests/430 - Disable multi-add for now.md
@@ -1,16 +1,18 @@
-### 🔀 [#430](https://github.com/ikawrakow/ik_llama.cpp/pull/430) - Disable multi-add for now
+## 🔀 [Pull Request #430](https://github.com/ikawrakow/ik_llama.cpp/pull/430) - Disable multi-add for now
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/disable_multi_add` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-18 |
| **Updated** | 2025-05-23 |
---
-#### Description
+## 📄 Description
-There have been several crash reports (#398, #425) for large MoE models when using hybrid GPU/CPU inference. As I don't have the hardware to run such large models I'm not able to debug. But with help from @nux, who ran `llama-server` in the debugger on his computer and gave me a backtrace along with a few variables values (see [this](https://github.com/ikawrakow/ik_llama.cpp/issues/425#issuecomment-2888464768) and [this](https://github.com/ikawrakow/ik_llama.cpp/issues/425#issuecomment-2888500696), my hypothesis is that the problem is with the multi-add operation that I added to `ik_llama.cpp`.
+There have been several crash reports ([#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398), [#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425)) for large MoE models when using hybrid GPU/CPU inference. As I don't have the hardware to run such large models I'm not able to debug. But with help from @nux, who ran `llama-server` in the debugger on his computer and gave me a backtrace along with a few variables values (see [this](https://github.com/ikawrakow/ik_llama.cpp/issues/425#issuecomment-2888464768) and [this](https://github.com/ikawrakow/ik_llama.cpp/issues/425#issuecomment-2888500696), my hypothesis is that the problem is with the multi-add operation that I added to `ik_llama.cpp`.
I'm of course not sure if the hypothesis is correct as it is based on very scarce evidence. Hence I would appreciate if the people reporting a problem test this PR and let me know if it fixes the problem, so pinging @Panchovix, @ciprianveg, @pt13762104, @schynce, @p4s2wd
@@ -20,15 +22,15 @@ What is multi-add? In MoE models the contributions of the routed experts need to
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **schynce** commented the **2025-05-18** at **10:10:42**:
+👤 **schynce** commented on **2025-05-18** at **10:10:42**
Hi!
I tested the ik/disable_multi_add branch, but it unfortunately did not solve the issue.
-Running this command:
+Running this command (**IQ4_XS quant**):
```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
@@ -106,7 +108,7 @@ The program is not being run.
Aborted (core dumped)
```
-I tested once again just to be sure, and I can confirm that this command does *not* crash:
+I tested once again just to be sure, and I can confirm that this command does *not* crash (**mix-IQ3_K**):
```
./llama-server --model /mnt/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf --alias Qwen3-235B-A22B-mix-IQ3_K \
@@ -117,7 +119,7 @@ I tested once again just to be sure, and I can confirm that this command does *n
-ot "blk\.(42|43|44|45|46|47|48|49|50|51|52|53|54|55|56|57)\.=CUDA2"
```
-Also, as suggested in #398 by @Ph0rk0z, running without -fa seems to not crash:
+Also, as suggested in [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398) by @Ph0rk0z, running without -fa seems to not crash, even with the otherwise crashing **IQ4_XS** quant:
```
./llama-server --model /mnt/Qwen3-235B-A22B-IQ4_XS-00001-of-00003.gguf --alias Qwen3-235B-A22B-IQ4_XS \
@@ -130,13 +132,13 @@ Also, as suggested in #398 by @Ph0rk0z, running without -fa seems to not crash:
---
-👤 **ikawrakow** commented the **2025-05-18** at **11:50:50**:
+👤 **ikawrakow** commented on **2025-05-18** at **11:50:50**
To be honest, I don't understand what could be wrong.
---
-👤 **ChicoPinto70** commented the **2025-05-18** at **21:33:03**:
+👤 **ChicoPinto70** commented on **2025-05-18** at **21:33:03**
If I may, I have the same problem running DeepSeekV3 0324. My workaround to avoid this bug is, change the rtr for no_map, use tensor split to the two gpus not connect to the monitor and, in the deepseek case, use MLA 3 instead 2.
@@ -148,19 +150,19 @@ I hope it helps.
---
-👤 **Ph0rk0z** commented the **2025-05-18** at **21:56:16**:
+👤 **Ph0rk0z** commented on **2025-05-18** at **21:56:16**
It happened to me much more when I undervolted hard and had nvidia HDMI audio devices compete for BAR space. Now that I fixed those issues, I am not seeing this a whole lot if at all.
---
-👤 **ciprianveg** commented the **2025-05-18** at **21:58:30**:
+👤 **ciprianveg** commented on **2025-05-18** at **21:58:30**
It isn't a hardware issue, llama.cpp is not experiencing this issue with same settings
---
-👤 **schynce** commented the **2025-05-18** at **22:45:19**:
+👤 **schynce** commented on **2025-05-18** at **22:45:19**
> It isn't a hardware issue, llama.cpp is not experiencing this issue with same settings
@@ -168,14 +170,14 @@ I can also confirm that llama.cpp runs fine with the same settings (just without
---
-👤 **Ph0rk0z** commented the **2025-05-19** at **00:24:03**:
+👤 **Ph0rk0z** commented on **2025-05-19** at **00:24:03**
llama.cpp doesn't have fmoe or rtr and has a different fa implementation. Exllama didn't crash on me either :D
If hardware instability makes it easier to reproduce it could be related. Check nothing funny is in journal or dmesg.
---
-👤 **ikawrakow** commented the **2025-05-19** at **06:35:14**:
+👤 **ikawrakow** commented on **2025-05-19** at **06:35:14**
`ik_llama.cpp` is faster then `llama.cpp`, else you wouldn't be here. If there is a hardware issue or a driver bug, or a bug that exists in `ik_llama.cpp` and in `llama.cpp`, the probability to trigger the problem is likely to be higher when the computation goes faster.
diff --git a/github-data/pull_requests/431 - Forgotten MMQ ref and typo.md b/github-data/pull_requests/431 - Forgotten MMQ ref and typo.md
index 1b7ea1e46..36fa27657 100644
--- a/github-data/pull_requests/431 - Forgotten MMQ ref and typo.md
+++ b/github-data/pull_requests/431 - Forgotten MMQ ref and typo.md
@@ -1,14 +1,17 @@
-### 🔀 [#431](https://github.com/ikawrakow/ik_llama.cpp/pull/431) - Forgotten MMQ ref and typo
+## 🔀 [Pull Request #431](https://github.com/ikawrakow/ik_llama.cpp/pull/431) - Forgotten MMQ ref and typo
| **Author** | `Nexesenex` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `fix_mmq` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-18 |
| **Updated** | 2025-05-22 |
+| **Merged** | 2025-05-18 |
---
-#### Description
+## 📄 Description
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
@@ -18,18 +21,18 @@
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-05-18** at **14:36:30**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-18** at **14:36:30**
Hey, you are back!
---
-👤 **Nexesenex** commented the **2025-05-18** at **14:48:44**:
+👤 **Nexesenex** commented on **2025-05-18** at **14:48:44**
Hey!
-Yeah, you sounded the horn with those MMQ Kernels for the IQ_K quants, I waited for them for a long time. I merge your IQ quants (included the KS ones with success last year, before the rev 14 of the GGUF format broke compatibility with them, possibly due to the template change introduced in https://github.com/ikawrakow/ik_llama.cpp/pull/45 )
+Yeah, you sounded the horn with those MMQ Kernels for the IQ_K quants, I waited for them for a long time. I merged in Croco your IQ quants (included the KS ones with success last year, before the rev 14 of the GGUF format broke compatibility with them, possibly due to the template change introduced in https://github.com/ikawrakow/ik_llama.cpp/pull/45 )
Meanwhile, I was amusing myself merging models, among other nerdy delights.
Congrats for all the amazing developments you made, even if it's hard for me to swing between mainline and IK_Llama to feed my Croco.
Also, Turboderp switched on QTIP based quants for Exllamav3.
diff --git a/github-data/pull_requests/435 - Refactor iqk_mul_mat.cpp.md b/github-data/pull_requests/435 - Refactor iqk_mul_mat.cpp.md
index 574a3fcce..8a3dd93ce 100644
--- a/github-data/pull_requests/435 - Refactor iqk_mul_mat.cpp.md
+++ b/github-data/pull_requests/435 - Refactor iqk_mul_mat.cpp.md
@@ -1,14 +1,17 @@
-### 🔀 [#435](https://github.com/ikawrakow/ik_llama.cpp/pull/435) - Refactor iqk_mul_mat.cpp
+## 🔀 [Pull Request #435](https://github.com/ikawrakow/ik_llama.cpp/pull/435) - Refactor iqk_mul_mat.cpp
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/refactor_iqk` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-20 |
| **Updated** | 2025-05-23 |
+| **Merged** | 2025-05-22 |
---
-#### Description
+## 📄 Description
I have been putting all matrix multiplication (GEMM) and flash attention (FA) kernels into `iqk_mul_mat.cpp`. With time it became a giant source file (~18 kLOC) containing heavily templated C++ code. The result: extremely long compilations times (over 2 minutes on a high end CPU, with some users reporting 30 minutes on an Android phone).
@@ -33,13 +36,13 @@ The GEMM files compile in 5-6 seconds each, so the FA instantiations dominate th
It is a massive change. Testing of all types (50+ when row-interleaved quants are included) on `AVX2, Zen4` and `ARM_NEON` took quite some time. I hope to have covered all possible combinations, but still would appreciate additional testing from people using `ik_llama.cpp` for CPU-only inference.
-Closes #183
+Closes [#183](https://github.com/ikawrakow/ik_llama.cpp/issues/183)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-05-20** at **07:20:58**:
+👤 **saood06** commented on **2025-05-20** at **07:20:58**
>I hope to have covered all possible combinations, but still would appreciate additional testing from people using ik_llama.cpp for CPU-only inference.
@@ -49,7 +52,40 @@ Tested with my standard `cmake .. -DGGML_RPC=ON -DGGML_IQK_FA_ALL_QUANTS=1; cmak
---
-👤 **saood06** commented the **2025-05-20** at **07:51:53**:
+👤 **ikawrakow** commented on **2025-05-20** at **07:30:51**
+
+> It used more threads but still nowhere near saturating my available ones for a large amount of the time.
+
+It cannot saturate your 48 cores. It needs to build `libggml.so` first, and this is what it takes to do that:
+```
+[ 2%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml.c.o
+[ 3%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-alloc.c.o
+[ 4%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-quants.c.o
+[ 4%] Building C object ggml/src/CMakeFiles/ggml.dir/ggml-backend.c.o
+[ 5%] Building CXX object ggml/src/CMakeFiles/ggml.dir/llamafile/sgemm.cpp.o
+[ 5%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_mul_mat.cpp.o
+[ 6%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_flash_attn.cpp.o
+[ 6%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_576_512.cpp.o
+[ 7%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_192_128.cpp.o
+[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_128_128.cpp.o
+[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_256_256.cpp.o
+[ 8%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_96_96.cpp.o
+[ 9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/fa/iqk_fa_64_64.cpp.o
+[ 9%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_floats.cpp.o
+[ 10%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_kquants.cpp.o
+[ 10%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iquants.cpp.o
+[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_iqk_quants.cpp.o
+[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_1bit.cpp.o
+[ 12%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_gemm_legacy_quants.cpp.o
+[ 13%] Building CXX object ggml/src/CMakeFiles/ggml.dir/iqk/iqk_quantize.cpp.o
+```
+With all quants enabled for FA the above takes 36 seconds on my `AVX2` box.
+
+Compiling `llama.cpp` is another piece that takes quite some time, so it should get refactored as well.
+
+---
+
+👤 **saood06** commented on **2025-05-20** at **07:51:53**
>It cannot saturate your 48 cores. It needs to build libggml.so first, and this is what it takes to do that:
@@ -63,7 +99,7 @@ Thank you for this, it is a very welcome speed improvement.
---
-👤 **cmoncure** commented the **2025-05-22** at **18:23:28**:
+👤 **cmoncure** commented on **2025-05-22** at **18:23:28**
This commit results in a significant performance regression for me, established by git bisect.
@@ -74,11 +110,11 @@ commit b94cd3b632a78dfb46b18d52b84be66bcf26166a (HEAD)
Author: Kawrakow
Date: Thu May 22 10:05:51 2025 +0300
- Refactor iqk_mul_mat.cpp (#435)
+ Refactor iqk_mul_mat.cpp ([#435](https://github.com/ikawrakow/ik_llama.cpp/issues/435))
---
-👤 **ikawrakow** commented the **2025-05-23** at **05:09:34**:
+👤 **ikawrakow** commented on **2025-05-23** at **05:09:34**
> This commit results in a significant performance regression for me, established by git bisect.
diff --git a/github-data/pull_requests/438 - Another attempt to fix the illegal memory access bug.md b/github-data/pull_requests/438 - Another attempt to fix the illegal memory access bug.md
index 345856461..46870be6e 100644
--- a/github-data/pull_requests/438 - Another attempt to fix the illegal memory access bug.md
+++ b/github-data/pull_requests/438 - Another attempt to fix the illegal memory access bug.md
@@ -1,16 +1,18 @@
-### 🐛 [#438](https://github.com/ikawrakow/ik_llama.cpp/pull/438) - Another attempt to fix the illegal memory access bug
+## 🔀 [Pull Request #438](https://github.com/ikawrakow/ik_llama.cpp/pull/438) - Another attempt to fix the illegal memory access bug
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/desperate_bug_fix_attempt` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-20 |
| **Updated** | 2025-05-23 |
---
-#### Description
+## 📄 Description
-Attempt to fix #398, #425
+Attempt to fix [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398), [#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425)
My hopes are not very high, but it is better to try.
* More extensive check that we can really also fuse the `ffn_down` operation. The change does nothing for me, but I also never have a crash, so let's try that.
diff --git a/github-data/pull_requests/439 - Bug fixes from mainline.md b/github-data/pull_requests/439 - Bug fixes from mainline.md
index 180b2ae79..9567dd66e 100644
--- a/github-data/pull_requests/439 - Bug fixes from mainline.md
+++ b/github-data/pull_requests/439 - Bug fixes from mainline.md
@@ -1,16 +1,19 @@
-### 🐛 [#439](https://github.com/ikawrakow/ik_llama.cpp/pull/439) - Bug fixes from mainline
+## 🔀 [Pull Request #439](https://github.com/ikawrakow/ik_llama.cpp/pull/439) - Bug fixes from mainline
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_mailine_fixes` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-20 |
| **Updated** | 2025-05-20 |
+| **Merged** | 2025-05-20 |
---
-#### Description
+## 📄 Description
Do these fix the mysterious illegal memory access crashes?
I doubt it, but who knows.
-Ref #389, #425
\ No newline at end of file
+Ref [#389](https://github.com/ikawrakow/ik_llama.cpp/issues/389), [#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425)
\ No newline at end of file
diff --git a/github-data/pull_requests/44 - Adding IQ1_TN - 1.6875 bpw for TriLM ternary models.md b/github-data/pull_requests/44 - Adding IQ1_TN - 1.6875 bpw for TriLM ternary models.md
index b4dfe51b1..d0eb9ad6e 100644
--- a/github-data/pull_requests/44 - Adding IQ1_TN - 1.6875 bpw for TriLM ternary models.md
+++ b/github-data/pull_requests/44 - Adding IQ1_TN - 1.6875 bpw for TriLM ternary models.md
@@ -1,14 +1,17 @@
-### 🔀 [#44](https://github.com/ikawrakow/ik_llama.cpp/pull/44) - Adding IQ1_TN - 1.6875 bpw for TriLM ternary models
+## 🔀 [Pull Request #44](https://github.com/ikawrakow/ik_llama.cpp/pull/44) - Adding IQ1_TN - 1.6875 bpw for TriLM ternary models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_tn` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-09 |
| **Updated** | 2024-09-09 |
+| **Merged** | 2024-09-09 |
---
-#### Description
+## 📄 Description
For the Bitnt-1.58b ternary models I had added `IQ1_BN` (1.625 bpw) and `IQ2_BN` (2.0 bpw) quants. But for TriLM I only added `IQ2_TN` (2.0625 bpw). This PR fills the gap adding the corresponding 1.6875 bpw quantization type `IQ1_TN`.
@@ -46,9 +49,9 @@ As `IQ2_BN` PP performance is better than `IQ1_BN`, these tables indicate that m
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2024-09-09** at **11:56:12**:
+👤 **ikawrakow** commented on **2024-09-09** at **11:56:12**
For the record, here is how this PR improves `IQ1/2_BN` performance for PP
diff --git a/github-data/pull_requests/441 - Trellis quants with CPU inference.md b/github-data/pull_requests/441 - Trellis quants with CPU inference.md
index c5d61fca3..e87e35911 100644
--- a/github-data/pull_requests/441 - Trellis quants with CPU inference.md
+++ b/github-data/pull_requests/441 - Trellis quants with CPU inference.md
@@ -1,14 +1,17 @@
-### 🔀 [#441](https://github.com/ikawrakow/ik_llama.cpp/pull/441) - Trellis quants with CPU inference
+## 🔀 [Pull Request #441](https://github.com/ikawrakow/ik_llama.cpp/pull/441) - Trellis quants with CPU inference
| **Author** | `andrewkchan` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `andrewkchan/try_trellis` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-20 |
| **Updated** | 2025-05-23 |
+| **Merged** | 2025-05-23 |
---
-#### Description
+## 📄 Description
As requested a while ago, takes (https://github.com/ikawrakow/ik_llama.cpp/pull/113) and adds CPU implementations of the quantized matmuls (via iqk_mul_mat) for inference. AVX2 and F16C support are required.
@@ -24,9 +27,9 @@ I am not sure of the PR practices - if you'd like me to merge into https://githu
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-21** at **07:13:48**:
+👤 **ikawrakow** commented on **2025-05-21** at **07:13:48**
> For Llama-3.1-8B-Instruct, I get 0.3t/s with IQ2_KT compared to >1.0t/s with F16 on AMD EPYC 7R32 (32 cores)
@@ -34,7 +37,7 @@ Is this in debug mode? I'm getting 10.4 t/s for `IQ2_KT` on my 16-core Ryzen-795
---
-👤 **andrewkchan** commented the **2025-05-21** at **07:17:47**:
+👤 **andrewkchan** commented on **2025-05-21** at **07:17:47**
I'm compiling with `cmake --build ./build --config Release -j $(nproc)`. I might need to tweak the number of threads; I've found this greatly impacts performance on my test machine in the past for llama.cpp.
@@ -50,7 +53,7 @@ Should I be using llama-bench or some other tool?
---
-👤 **ikawrakow** commented the **2025-05-21** at **07:24:07**:
+👤 **ikawrakow** commented on **2025-05-21** at **07:24:07**
I also tried `llama-cli` to make sure the output is coherent, and also get in the range of 10 t/s. To measure performance I now tend to use `llama-sweep-bench`. For instance, the table below was generated using
```
@@ -69,13 +72,13 @@ We get PP and TG performance as a function of the number of tokens in the KV cac
---
-👤 **andrewkchan** commented the **2025-05-21** at **07:30:16**:
+👤 **andrewkchan** commented on **2025-05-21** at **07:30:16**
Ok, well it's great to know the CPU inference performance is not totally unusable and that it's probably just my setup! I will try to figure this out on my own. Might email you some more questions to not pollute this PR discussion. Thanks also for the pointer on benchmarking.
---
-👤 **andrewkchan** commented the **2025-05-21** at **08:11:09**:
+👤 **andrewkchan** commented on **2025-05-21** at **08:11:09**
I purged my build directory + recompiled and performance is a lot better, and I no longer see the weird `ggml_backend_sched_alloc_splits: failed to allocate graph` messages from (https://github.com/ggml-org/llama.cpp/discussions/8088). Possibly the build cache was using some artifacts from a previous debug build.
@@ -83,7 +86,7 @@ Now F16 gets almost 4x faster at 4.59 generation t/s, and IQ2_KT now beats F16 a
---
-👤 **ikawrakow** commented the **2025-05-21** at **14:35:39**:
+👤 **ikawrakow** commented on **2025-05-21** at **14:35:39**
I did speed up `IQ2_KT` slightly, see [this branch](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/andrew_trellis). Here is what I get now on the Ryzen-7950X
@@ -95,16 +98,16 @@ I did speed up `IQ2_KT` slightly, see [this branch](https://github.com/ikawrakow
| 512 | 128 | 1536 | 8.453 | 60.57 | 10.704 | 11.96 |
| 512 | 128 | 2048 | 8.488 | 60.32 | 10.798 | 11.85 |
-Overall it looks good to me, so we can think about merging. But there is also PR #435, where I have completely refactored `iqk_mul_mat.cpp`. Do you want to look into adding the changes on that branch?
+Overall it looks good to me, so we can think about merging. But there is also PR [#435](https://github.com/ikawrakow/ik_llama.cpp/issues/435), where I have completely refactored `iqk_mul_mat.cpp`. Do you want to look into adding the changes on that branch?
---
-👤 **andrewkchan** commented the **2025-05-22** at **04:32:39**:
+👤 **andrewkchan** commented on **2025-05-22** at **04:32:39**
Terrific, this gets my test machine to 5.59t/s. I saw the LCG ops in next8 taking up lots of time but wasn't sure what to do about it, this is a cool trick - I assume having the constants as locals keeps them in registers or otherwise ensures they remain hot in cache?
-Re: https://github.com/ikawrakow/ik_llama.cpp/pull/435 - it looks not too difficult to me to reconcile my new kernels with the refactor. If you're done with your refactor already, you could merge your PR and then I can fix the conflicts accordingly - maybe that's the cleanest way to do this?
+Re: https://github.com/ikawrakow/ik_llama.cpp/pull/435 - it looks not too difficult to me to reconcile my new kernels with the refactor. If you're done with your refactor already, you could merge your PR and then I can fix the resulting conflicts on this PR - maybe that's the cleanest way to do this? Since this branch is already conflicting with a file on main anyway. Otherwise happy to merge this first, then work on your branch.
---
-👤 **ikawrakow** submitted a review the **2025-05-23** at **06:17:15**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-23** at **06:17:15**
\ No newline at end of file
diff --git a/github-data/pull_requests/442 - CUDA call tracer.md b/github-data/pull_requests/442 - CUDA call tracer.md
index 314fefd31..cb22ec1a4 100644
--- a/github-data/pull_requests/442 - CUDA call tracer.md
+++ b/github-data/pull_requests/442 - CUDA call tracer.md
@@ -1,13 +1,23 @@
-### 🔀 [#442](https://github.com/ikawrakow/ik_llama.cpp/pull/442) - CUDA call tracer
+## 🔀 [Pull Request #442](https://github.com/ikawrakow/ik_llama.cpp/pull/442) - CUDA call tracer
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/cuda_tracer` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-21 |
| **Updated** | 2025-05-23 |
---
-#### Description
+## 📄 Description
-This PR adds a CUDA call tracer. The main purpose of the tracer is to hopefully help debug the illegal memory access crashes reported in #398 and #425. If there is a crash, the last 32 invocations of `CUDA_CHECK` will be printed to `stderr` before aborting. In my testing the overhead added by the tracer has negligible impact on performance.
\ No newline at end of file
+This PR adds a CUDA call tracer. The main purpose of the tracer is to hopefully help debug the illegal memory access crashes reported in [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398) and [#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425). If there is a crash, the last 32 invocations of `CUDA_CHECK` will be printed to `stderr` before aborting. In my testing the overhead added by the tracer has negligible impact on performance.
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-05-23** at **15:26:16**
+
+I can close this one now.
\ No newline at end of file
diff --git a/github-data/pull_requests/443 - Streamline a bit the quant strategies.md b/github-data/pull_requests/443 - Streamline a bit the quant strategies.md
index 2a939ee13..0743a7c58 100644
--- a/github-data/pull_requests/443 - Streamline a bit the quant strategies.md
+++ b/github-data/pull_requests/443 - Streamline a bit the quant strategies.md
@@ -1,14 +1,17 @@
-### 🔀 [#443](https://github.com/ikawrakow/ik_llama.cpp/pull/443) - Streamline a bit the quant strategies
+## 🔀 [Pull Request #443](https://github.com/ikawrakow/ik_llama.cpp/pull/443) - Streamline a bit the quant strategies
| **Author** | `Nexesenex` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `QS_streamline` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-22 |
| **Updated** | 2025-05-22 |
+| **Merged** | 2025-05-22 |
---
-#### Description
+## 📄 Description
Unlike last time..
@@ -24,33 +27,54 @@ Also, a Q8_0 for attn_q slipped into the MOEs 8 experts rule, I removed it, beca
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented during a code review the **2025-05-22** at **06:46:59** on `src/llama.cpp`:
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-05-22** at **06:46:59**
Why do we want to limit to `<= 8` experts?
+> 👤 **Nexesenex** replied on **2025-05-22** at **13:46:33**
+>
+> Oh, I just didn't want to step on bigger MOEs because I didn't test any.
+> I left that to your discretion.
+
---
-👤 **ikawrakow** commented during a code review the **2025-05-22** at **06:48:18** on `src/llama.cpp`:
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-05-22** at **06:48:18**
Why limit to `<= 8` experts?
+> 👤 **Nexesenex** replied on **2025-05-22** at **13:48:21**
+>
+> I just did not want to step on bigger MOEs because I didn't test any.
+> I left that to your discretion. But ofc if it's fine with your we can remove that second condition.
+
---
-👤 **ikawrakow** commented during a code review the **2025-05-22** at **06:54:53** on `src/llama.cpp`:
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-05-22** at **06:54:53**
So, I see you added the condition for `Q5_K_S` just above but I have forgotten why we want to have it. Can you remind me? I was wondering not too long ago why a model quantized with `Q5_K_S` ended up having less the 5.5 bpw (but didn't check). Why is the decision to reduce the number of bits dependent on the vocabulary size?
+> 👤 **Nexesenex** replied on **2025-05-22** at **13:54:19**
+>
+> I added this back then because attn_q endures very well a smaller quant on Llama 3 models, with no perplexity bump or even a drop around 0.005 on L3 (and Also Mistral 123b models).
+> I also observed this with IQ4_XS -> IQ3_S for attn_q.
+> I take benefit of this to bump attn_v instead on L3, which is very sensitive to it and will thus furtherly drop the perplexity for a smaller size quantized model still.
+> At the time, you agreed with the principle.
+
---
-👤 **ikawrakow** commented during a code review the **2025-05-22** at **06:55:55** on `src/llama.cpp`:
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-05-22** at **06:55:55**
`<= 8`?
+> 👤 **Nexesenex** replied on **2025-05-22** at **13:54:45**
+>
+> Ok, I will remove this <= 8 experts condition!
+
---
-👤 **ikawrakow** submitted a review the **2025-05-22** at **06:58:25**: 💬 `COMMENTED`
+👤 **ikawrakow** reviewed this pull request 💬 on **2025-05-22** at **06:58:25**
Looks OK apart from the `<= 8` condition for MoE models. I don't think it is needed.
@@ -58,49 +82,4 @@ This may make it more convenient for some people, but I basically just use `--cu
---
-👤 **Nexesenex** commented during a code review the **2025-05-22** at **13:46:33** on `src/llama.cpp`:
-
-Oh, I just not wanted to step on bigger MOEs because I didn't test any.
-I left that to your discretion.
-
----
-
-👤 **Nexesenex** submitted a review the **2025-05-22** at **13:46:34**: 💬 `COMMENTED`
-
----
-
-👤 **Nexesenex** submitted a review the **2025-05-22** at **13:48:21**: 💬 `COMMENTED`
-
----
-
-👤 **Nexesenex** commented during a code review the **2025-05-22** at **13:48:21** on `src/llama.cpp`:
-
-I just did not want to step on bigger MOEs because I didn't test any.
-I left that to your discretion. But ofc if it's fine with your we can remove that second condition.
-
----
-
-👤 **Nexesenex** submitted a review the **2025-05-22** at **13:54:19**: 💬 `COMMENTED`
-
----
-
-👤 **Nexesenex** commented during a code review the **2025-05-22** at **13:54:19** on `src/llama.cpp`:
-
-I added this back then because attn_q endures very well a smaller quant on Llama 3 models, with no perplexity bump or even a drop around 0.005 on L3 (and Also Mistral 123b models).
-I also observed this with IQ4_XS -> IQ3_S for attn_q.
-I take benefit of this to bump attn_v instead on L3, which is very sensitive to it.
-At the time, you agreed with the principle.
-
----
-
-👤 **Nexesenex** submitted a review the **2025-05-22** at **13:54:45**: 💬 `COMMENTED`
-
----
-
-👤 **Nexesenex** commented during a code review the **2025-05-22** at **13:54:45** on `src/llama.cpp`:
-
-Ok, I will remove this <= part!
-
----
-
-👤 **ikawrakow** submitted a review the **2025-05-22** at **15:04:41**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-22** at **15:04:41**
\ No newline at end of file
diff --git a/github-data/pull_requests/444 - gguf-split _ update.md b/github-data/pull_requests/444 - gguf-split update.md
similarity index 67%
rename from github-data/pull_requests/444 - gguf-split _ update.md
rename to github-data/pull_requests/444 - gguf-split update.md
index 9b689366d..37f6dcf4b 100644
--- a/github-data/pull_requests/444 - gguf-split _ update.md
+++ b/github-data/pull_requests/444 - gguf-split update.md
@@ -1,14 +1,17 @@
-### 🔀 [#444](https://github.com/ikawrakow/ik_llama.cpp/pull/444) - gguf-split : update
+## 🔀 [Pull Request #444](https://github.com/ikawrakow/ik_llama.cpp/pull/444) - gguf-split : update
| **Author** | `Nexesenex` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `updated_split` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-22 |
| **Updated** | 2025-05-23 |
+| **Merged** | 2025-05-23 |
---
-#### Description
+## 📄 Description
Among the useful stuff on mainline, there's the updates of gguf-split.
@@ -20,7 +23,7 @@ So I just put it here!
-----
-gguf-split : improve --split and --merge logic (#9619)
+gguf-split : improve --split and --merge logic ([#9619](https://github.com/ikawrakow/ik_llama.cpp/issues/9619))
* make sure params --split and --merge are not specified at same time
@@ -33,7 +36,7 @@ Co-authored-by: slaren
---------
-gguf-split : add basic checks (#9499)
+gguf-split : add basic checks ([#9499](https://github.com/ikawrakow/ik_llama.cpp/issues/9499))
* gguf-split : do not overwrite existing files when merging
@@ -50,6 +53,6 @@ Authored-by: slaren
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-05-23** at **05:07:35**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-23** at **05:07:35**
\ No newline at end of file
diff --git a/github-data/pull_requests/445 - Fix typo in non-AVX2 code branch.md b/github-data/pull_requests/445 - Fix typo in non-AVX2 code branch.md
index fe2330b88..d62a547fb 100644
--- a/github-data/pull_requests/445 - Fix typo in non-AVX2 code branch.md
+++ b/github-data/pull_requests/445 - Fix typo in non-AVX2 code branch.md
@@ -1,7 +1,16 @@
-### 🐛 [#445](https://github.com/ikawrakow/ik_llama.cpp/pull/445) - Fix typo in non-AVX2 code branch
+## 🔀 [Pull Request #445](https://github.com/ikawrakow/ik_llama.cpp/pull/445) - Fix typo in non-AVX2 code branch
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_typo` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-23 |
-| **Updated** | 2025-05-23 |
\ No newline at end of file
+| **Updated** | 2025-05-23 |
+| **Merged** | 2025-05-23 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/446 - Fix bug in MMVQ kernel.md b/github-data/pull_requests/446 - Fix bug in MMVQ kernel.md
index 750e4ef6c..f50cf5c9e 100644
--- a/github-data/pull_requests/446 - Fix bug in MMVQ kernel.md
+++ b/github-data/pull_requests/446 - Fix bug in MMVQ kernel.md
@@ -1,41 +1,44 @@
-### 🐛 [#446](https://github.com/ikawrakow/ik_llama.cpp/pull/446) - Fix bug in MMVQ kernel
+## 🔀 [Pull Request #446](https://github.com/ikawrakow/ik_llama.cpp/pull/446) - Fix bug in MMVQ kernel
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_mmvq_bug` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-23 |
| **Updated** | 2025-05-24 |
+| **Merged** | 2025-05-23 |
---
-#### Description
+## 📄 Description
-After a very long bug hunt, this PR should hopefully fix #389, #398, #425.
+After a very long bug hunt, this PR should hopefully fix [#389](https://github.com/ikawrakow/ik_llama.cpp/issues/389), [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398), [#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425).
Thanks to everybody who tested my previous bug fix attempts!
Huge kudos to @ciprianveg who was instrumental in finding the bug!
The bug was in the CUDA matrix-vector multiplication kernel (a.k.a., MMVQ). It only shows up when the kernel processes 2 or 3 tokens. Hence, it was not observed during TG, and only showed up during PP when an expert in a MoE model ended up with having to process just 2 or 3 tokens from the batch (which is rare).
-I believe all other changes I made in #442 are not necessary, but please test this PR to confirm.
+I believe all other changes I made in [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442) are not necessary, but please test this PR to confirm.
-Closes #389
-Closes #398
-Closes #425
+Closes [#389](https://github.com/ikawrakow/ik_llama.cpp/issues/389)
+Closes [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398)
+Closes [#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ciprianveg** commented the **2025-05-23** at **11:29:36**:
+👤 **ciprianveg** commented on **2025-05-23** at **11:29:36**
Thank you for the fix!🍻
On Fri, 23 May 2025, 12:17 Kawrakow, ***@***.***> wrote:
-> After a very long bug hunt, this PR should hopefully fix #389
-> , #398
-> , #425
+> After a very long bug hunt, this PR should hopefully fix [#389](https://github.com/ikawrakow/ik_llama.cpp/issues/389)
+> , [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398)
+> , [#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425)
> .
>
> Thanks to everybody who tested my previous bug fix attempts!
@@ -48,13 +51,13 @@ On Fri, 23 May 2025, 12:17 Kawrakow, ***@***.***> wrote:
> a MoE model ended up with having to process just 2 or 3 tokens from the
> batch (which is rare).
>
-> I believe all other changes I made in #442
+> I believe all other changes I made in [#442](https://github.com/ikawrakow/ik_llama.cpp/issues/442)
> are not necessary,
> but please test this PR to confirm.
>
-> Closes #389
-> Closes #398
-> Closes #425
+> Closes [#389](https://github.com/ikawrakow/ik_llama.cpp/issues/389)
+> Closes [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398)
+> Closes [#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425)
> ------------------------------
> You can view, comment on, or merge this pull request online at:
>
@@ -89,24 +92,30 @@ On Fri, 23 May 2025, 12:17 Kawrakow, ***@***.***> wrote:
---
-👤 **ikawrakow** commented the **2025-05-23** at **15:25:05**:
+👤 **schynce** commented on **2025-05-23** at **11:40:44**
-I think I'll merge this now. It fixes a real bug, so it should be merged irrespective of it fixing #389, #398, #425.
+I can happily confirm that this PR seems to have fixed the issues on my end! Thank you!
---
-👤 **Panchovix** commented the **2025-05-23** at **16:00:18**:
+👤 **ikawrakow** commented on **2025-05-23** at **15:25:05**
+
+I think I'll merge this now. It fixes a real bug, so it should be merged irrespective of it fixing [#389](https://github.com/ikawrakow/ik_llama.cpp/issues/389), [#398](https://github.com/ikawrakow/ik_llama.cpp/issues/398), [#425](https://github.com/ikawrakow/ik_llama.cpp/issues/425).
+
+---
+
+👤 **Panchovix** commented on **2025-05-23** at **16:00:18**
Amazing, thanks for all your work!
---
-👤 **p4s2wd** commented the **2025-05-24** at **05:12:04**:
+👤 **p4s2wd** commented on **2025-05-24** at **05:12:04**
Thank you!
---
-👤 **pt13762104** commented the **2025-05-24** at **09:31:08**:
+👤 **pt13762104** commented on **2025-05-24** at **09:31:08**
It's working fine now, thank you for your patience
\ No newline at end of file
diff --git a/github-data/pull_requests/448 - Fix MSVC compilation.md b/github-data/pull_requests/448 - Fix MSVC compilation.md
index 052ecbe5d..202b1c293 100644
--- a/github-data/pull_requests/448 - Fix MSVC compilation.md
+++ b/github-data/pull_requests/448 - Fix MSVC compilation.md
@@ -1,15 +1,18 @@
-### 🐛 [#448](https://github.com/ikawrakow/ik_llama.cpp/pull/448) - Fix MSVC compilation
+## 🔀 [Pull Request #448](https://github.com/ikawrakow/ik_llama.cpp/pull/448) - Fix MSVC compilation
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_447` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-23 |
| **Updated** | 2025-05-23 |
+| **Merged** | 2025-05-23 |
---
-#### Description
+## 📄 Description
MSVC does not like `^` with SIMD vectors.
-Closes #447
\ No newline at end of file
+Closes [#447](https://github.com/ikawrakow/ik_llama.cpp/issues/447)
\ No newline at end of file
diff --git a/github-data/pull_requests/449 - Legacy quants conversion schemes in convert_hf_to_gguf.py.md b/github-data/pull_requests/449 - Legacy quants conversion schemes in convert_hf_to_gguf.py.md
index 1c4a54d40..5bd36c297 100644
--- a/github-data/pull_requests/449 - Legacy quants conversion schemes in convert_hf_to_gguf.py.md
+++ b/github-data/pull_requests/449 - Legacy quants conversion schemes in convert_hf_to_gguf.py.md
@@ -1,14 +1,17 @@
-### 🔀 [#449](https://github.com/ikawrakow/ik_llama.cpp/pull/449) - Legacy quants conversion schemes in convert_hf_to_gguf.py
+## 🔀 [Pull Request #449](https://github.com/ikawrakow/ik_llama.cpp/pull/449) - Legacy quants conversion schemes in convert_hf_to_gguf.py
| **Author** | `Nexesenex` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `legacy_quant_conv` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-23 |
| **Updated** | 2025-05-24 |
+| **Merged** | 2025-05-24 |
---
-#### Description
+## 📄 Description
This, notably in order to make smaller conversions to generate an iMatrix file.
@@ -31,9 +34,15 @@ Also, 2 forgotten mentions of FTYPE IQ3_KL are added in llama.cpp file, and one
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2025-05-23** at **14:38:10**:
+👤 **ikawrakow** commented on **2025-05-23** at **13:50:48**
+
+Why do we need the change in `convert_hf_to_gguf.py` ?
+
+---
+
+👤 **Nexesenex** commented on **2025-05-23** at **14:38:10**
Well, when I test a new finetune or merge of a big model I can't run in 16 or even 8 bits, I like to make a simple q5_0 or even q4_0 conversion to test it in chat in full offload or quasi-full offload on my 64GB VRAM.
@@ -45,7 +54,7 @@ I think some other folks could use that too, especially the ability to convert a
---
-👤 **ikawrakow** commented the **2025-05-23** at **15:23:12**:
+👤 **ikawrakow** commented on **2025-05-23** at **15:23:12**
Did you test that the conversion is working? I'm in the middle of something and don't feel like downloading a few models from HF to test.
@@ -53,7 +62,7 @@ The described new model testing procedure saves 1 conversion to `bf16` (or `Q8_0
---
-👤 **Nexesenex** commented the **2025-05-23** at **16:42:03**:
+👤 **Nexesenex** commented on **2025-05-23** at **16:42:03**
> Did you test that the conversion is working? I'm in the middle of something and don't feel like downloading a few models from HF to test.
@@ -74,4 +83,13 @@ When I'll come home tonight, I'll make some tests beyond the Llama 3 70b I've be
---
-👤 **ikawrakow** submitted a review the **2025-05-24** at **06:09:15**: ✅ `APPROVED`
\ No newline at end of file
+👤 **Nexesenex** commented on **2025-05-23** at **18:26:21**
+
+Just checked the 4 conversions types on Llama 3 1B, and they are all coherent, giving me an average recipe of French fries when asked.
+Qwen 1.5B works also.
+
+The feature seem to work with the IK Llama gguf conversion script as it is for the models it can convert normally, without the need to update it with the subsequent mainline PRs.
+
+---
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-24** at **06:09:15**
\ No newline at end of file
diff --git a/github-data/pull_requests/45 - Add CUDA support for IQ1_TN.md b/github-data/pull_requests/45 - Add CUDA support for IQ1_TN.md
index 6e23a8121..28573c1ce 100644
--- a/github-data/pull_requests/45 - Add CUDA support for IQ1_TN.md
+++ b/github-data/pull_requests/45 - Add CUDA support for IQ1_TN.md
@@ -1,14 +1,17 @@
-### 🔀 [#45](https://github.com/ikawrakow/ik_llama.cpp/pull/45) - Add CUDA support for IQ1_TN
+## 🔀 [Pull Request #45](https://github.com/ikawrakow/ik_llama.cpp/pull/45) - Add CUDA support for IQ1_TN
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_tn_cuda` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-09 |
| **Updated** | 2024-09-09 |
+| **Merged** | 2024-09-09 |
---
-#### Description
+## 📄 Description
Just reuse the `IQ1_BN` implementation. The only twist is that we now have the row scale stored at the beginning of the row, so we need a small modification of the dot product template to have a pointer to the beginning of the row passed to the dot product implementation.
diff --git a/github-data/pull_requests/453 - Faster IQ3_KT and IQ4_KT.md b/github-data/pull_requests/453 - Faster IQ3_KT and IQ4_KT.md
index d3b3ec54a..97b52a055 100644
--- a/github-data/pull_requests/453 - Faster IQ3_KT and IQ4_KT.md
+++ b/github-data/pull_requests/453 - Faster IQ3_KT and IQ4_KT.md
@@ -1,16 +1,19 @@
-### 🔀 [#453](https://github.com/ikawrakow/ik_llama.cpp/pull/453) - Faster IQ3_KT and IQ4_KT
+## 🔀 [Pull Request #453](https://github.com/ikawrakow/ik_llama.cpp/pull/453) - Faster IQ3_KT and IQ4_KT
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/opt_kt_quants` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-24 |
| **Updated** | 2025-05-24 |
+| **Merged** | 2025-05-24 |
---
-#### Description
+## 📄 Description
-The PR improves `AVX2` performance for the trellis quants `IQ3_KT` and `IQ4_KT` recently added in PR #441.
+The PR improves `AVX2` performance for the trellis quants `IQ3_KT` and `IQ4_KT` recently added in PR [#441](https://github.com/ikawrakow/ik_llama.cpp/issues/441).
The results below are for LLaMA-3.1-8B on a Ryzen-5975WX CPU.
### IQ3_KT
diff --git a/github-data/pull_requests/454 - Add support for FP8 GGUF creation and re-quantization _WIP_.md b/github-data/pull_requests/454 - Add support for FP8 GGUF creation and re-quantization WIP.md
similarity index 55%
rename from github-data/pull_requests/454 - Add support for FP8 GGUF creation and re-quantization _WIP_.md
rename to github-data/pull_requests/454 - Add support for FP8 GGUF creation and re-quantization WIP.md
index b66e23f47..24da2924d 100644
--- a/github-data/pull_requests/454 - Add support for FP8 GGUF creation and re-quantization _WIP_.md
+++ b/github-data/pull_requests/454 - Add support for FP8 GGUF creation and re-quantization WIP.md
@@ -1,16 +1,18 @@
-### 🔀 [#454](https://github.com/ikawrakow/ik_llama.cpp/pull/454) - Add support for FP8 GGUF creation and re-quantization (WIP)
+## 🔀 [Pull Request #454](https://github.com/ikawrakow/ik_llama.cpp/pull/454) - Add support for FP8 GGUF creation and re-quantization (WIP)
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `s6/fp8_native` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-24 |
| **Updated** | 2025-06-15 |
---
-#### Description
+## 📄 Description
-The goal of this is to be able to directly handle FP8 (more specifically E4M3) native models by creating an FP8 GGUF, which can then be quantized into a GGUF that can be used for inferencing (inference on FP8 is beyond the scope of this PR similar to #169).
+The goal of this is to be able to directly handle FP8 (more specifically E4M3) native models by creating an FP8 GGUF, which can then be quantized into a GGUF that can be used for inferencing (inference on FP8 is beyond the scope of this PR similar to [#169](https://github.com/ikawrakow/ik_llama.cpp/issues/169)).
Currently only the FP8 GGUF creation is implemented (which involved including the weight_scale_inv and FP8_E4M3 quant methods). Tested with [this](https://huggingface.co/Qwen/Qwen3-0.6B-FP8/tree/main) tiny model for now which successfully created a GGUF and was able to dump it.
@@ -22,21 +24,21 @@ I will attempt to add the quantization support later (handling the scale) but wa
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-05-24** at **11:34:38**:
+👤 **ikawrakow** commented on **2025-05-24** at **11:34:38**
I was thinking that it would be useful to support `fp8`.
---
-👤 **ikawrakow** commented the **2025-05-25** at **04:37:33**:
+👤 **ikawrakow** commented on **2025-05-25** at **04:37:33**
Btw, thinking about the matrix multiplication implementation, it seems one will need to multiply the activations with the `fp8` scales before doing the multiplication with the `fp8` tensor. The alternative would be to just directly convert to `Q8_0`, folding the scales into the `Q8_0` quants.
---
-👤 **saood06** commented the **2025-05-25** at **04:44:32**:
+👤 **saood06** commented on **2025-05-25** at **04:44:32**
> Btw, thinking about the matrix multiplication implementation, it seems one will need to multiply the activations with the `fp8` scales before doing the multiplication with the `fp8` tensor.
@@ -48,11 +50,11 @@ Interesting. I hadn't considered that. I'm still going to attempt the first appr
---
-👤 **whatever1983** commented the **2025-06-15** at **10:24:27**:
+👤 **whatever1983** commented on **2025-06-15** at **10:24:27**
Yo, guys, I just bumped into a surprise today:
-https://hf-mirror.com/nvidia/DeepSeek-V3-0324-FP4
+https://huggingface.co/nvidia/DeepSeek-V3-0324-FP4
Benchmark DeepSeek V3-0324 DeepSeek V3-0324-FP4
MMMU Pro 82 82.9
@@ -68,12 +70,12 @@ This raises some serious implications:
1. We should probably quant from FP4 native instead of FP8 or BF16, if the FP4 finetune performs way better than the original.
2. If FP4 is more than perfect, runs better than FP8 and BF16, quantization is dead. Any thing 2bit/3bit would wreck coding performance too much, anything 5bit-8bit to represent FP4 is a waste. FP4 to other forms of 4bit data representation ie IQ4K, IQ4XS isn't lossless.
-3. So instead of quantization, using higher precision hardware units(BF16, FP8) to run FP4 models losslessly is probably the way forward. On the CPU side, before Zen6 gets fp4 or Intel Xeon gets AMX-FP4, we can use BF16 to emulate FP4. I don't care if we take a 30% hit in performance running FP4 to BF16. It is lossless. On the CUDA side, we can just use "Chitu" kernels which runs FP4 with Ampere level hardware: https://kkgithub.com/thu-pacman/chitu
+3. So instead of quantization, using higher precision hardware units(BF16, FP8) to run FP4 models losslessly is probably the way forward. On the CPU side, before Zen6 gets fp4 or Intel Xeon gets AMX-FP4, we can use BF16 to emulate FP4. I don't care if we take a 30% hit in performance running FP4 to BF16. It is lossless. On the CUDA side, we can just use "Chitu" kernels which runs FP4 with Ampere level hardware: https://github.com/thu-pacman/chitu
4. Before FP8 conversion is finished, it is already deprecated. Blackwell FP4 training or quantized FP4 are already here. So I think FP4 actually now takes priority vs FP8 now. Half the VRAM with potential Nvidia tuned higher performance is killer.
---
-👤 **ikawrakow** commented the **2025-06-15** at **11:05:51**:
+👤 **ikawrakow** commented on **2025-06-15** at **11:05:51**
> FP4 to other forms of 4bit data representation ie IQ4K, IQ4XS isn't lossless.
@@ -89,13 +91,39 @@ You will not be the first and I'm sure you will not be the last to declare somet
---
-👤 **saood06** commented the **2025-06-15** at **11:24:59**:
+👤 **saood06** commented on **2025-06-15** at **11:22:39**
+
+> Apparently, the Nvidia official FP4 quantization of DS V3-0324 has a LiveCodeBench boost from 41 to 52.
+
+Are you absolutely sure about that? The official benchmark from Deepseek for the (unquantized) model claims 49.2.
+
+Why they chose to use a third party number that makes their own number look better in comparison can't be said for certain, but the obvious reason might be why.
+
+Edit: Look at the AIME numbers the official and unquauntized reports 59.4 the FP4 quant reports 49.3, so there is definitely some loss (and even going based on the numbers they choose to compare to of 52 it is still a loss)
+
+>Before FP8 conversion is finished, it is already deprecated. Blackwell FP4 training or quantized FP4 are already here. So I think FP4 actually now takes priority vs FP8 now. Half the VRAM with potential Nvidia tuned higher performance is killer.
+
+I realized that the work I did here was basically wrong very shortly after I did it (and the approach I took was needlessly complicated unless I wanted to create a scaled FP8 quant type, which would be interesting but was overly complicated for my goal).
+
+Taking the triton approach but removing the triton dependency would have been a lot easier to do, but I'm glad I did this as I did learn a lot about the actual native quant type of Deepseek and I think it helped me understand why IQ4_KS and IQ5_KS quant types work so well with Deepseek.
+
+---
+
+👤 **saood06** commented on **2025-06-15** at **11:24:59**
Closing as even though the approach could work, my attempt was wrong.
---
-👤 **saood06** commented the **2025-06-15** at **12:04:33**:
+👤 **whatever1983** commented on **2025-06-15** at **11:54:14**
+
+why would you close this issue? Even if the approach is wrong, FP8/FP4 implementation is still needed. Llama.cpp main branch also refused to accept an already implemented FP8 code path for months, which is a mistake.
+
+Taking a hit for AIME 2024 and getting a boost for LiveCodeBench is a great tradeoff. Nvidia obviously has more coding finetuning data than math data. Coders have a $150K-$300K/year salary compared to mathematician's at what? $60-80K/year. So any boost in coding is worth more than AIME or graduate level reasoning.
+
+---
+
+👤 **saood06** commented on **2025-06-15** at **12:04:33**
> why would you close this issue?
@@ -109,15 +137,47 @@ That has nothing to do with this.
I think there is no boost. The benchmark numbers have margins of error and often times different testing approaches.
-There is absolutely zero evidence I could find or that you provided that suggests they did some form of QAT or just fine-tuning after quantization to recover accuracy.
+There is absolutely zero evidence I could find or that you provided that suggests they did some form of QAT or just fine-tuning after quantization to recover accuracy and so if there is a boost it would have to be from quantization alone.
-Getting an ~425 GB quant of deepseek to perform about on par with unquantized is not really that impressive.
+Getting an ~425 GB quant of deepseek to perform about on par with unquantized is not really that impressive (the model linked is still useful because it would perform well on specific hardware).
Look at this https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13339067, the graph only goes up to ~370 GB and yet approaches 0 loss.
---
-👤 **saood06** commented the **2025-06-15** at **14:24:47**:
+👤 **whatever1983** commented on **2025-06-15** at **12:30:50**
+
+https://github.com/ggml-org/llama.cpp/pull/10055
+
+Is there anyway for ik to accept this pull from the main branch? It is opened from October of last year, so it is been delayed by ggerganov for 8 months now. (reason being that q8_0 is same quality as fp8. But you can't get Petaflops from q8_0 that you can from fp8. It is such a political anti-nvidia move from a mac guy.) Can we merge this into ik_llama.cpp with minimal workarounds?
+
+I mean, ggerganov's biggest mistake in his career is not accepting ik_llama.cpp's better quants and modification to GGML. I thought this fork is more open to accepting newer data types and quants. So I had higher hope of getting FP8/FP6/FP4 implemented here than a repo controlled by a mac guy who is trying to turn GGML into a training framework on a soldered down LPDDR5x 800GB/s platform when the industry is moving to 20TB/s per HBM4 accelerator platforms.
+
+---
+
+👤 **whatever1983** commented on **2025-06-15** at **13:48:58**
+
+"Getting an ~425 GB quant of deepseek to perform about on par with unquantized is not really that impressive (the model linked is still useful because it would perform well on specific hardware).
+
+Look at this https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13339067, the graph only goes up to ~370 GB and yet approaches 0 loss."
+
+The quant size isn't that impressive. The thing is if you run the 370GB non-FP4 quant on an EPYC with 512GB ram, you get 10-20 tokens/s with a 24GB VRAM GPU. That's a 1000W platform you run at home. that's 50w-100w per token generated a sec.
+
+8x FP4 accelerated GPUs might cost $400K each at 10KW each generating 21K tokens/s on 8x GB200s. That's 2w per token generated per sec. a 25-50x reduction in power density. Assume a DDR4 Based EPYC with 24G VRAM GPU at $5K, or a DDR5 Based EPYC with 24G 4090 at $10K, nvidia is 40 times more expensive cap ex but generates 1000 times tokens(21K vs 21 tokens/s). So per token generated is 25 times less at the capex.
+
+I am sorry for the mathematics. This order of magnitude difference is turning us into a shared structure where the API endpoint steal all your code output. If you have to run LLMs at home or privately, you'd hope that future CPU/GPU both have FP4 transformers capabilities.
+
+Just the cost perspective, you are 25x times off.
+
+Then there is the quality. PPL means nothing. 0.005 difference in PPL could mean a difference between code that runs in VS code, or code that doesn't. There is a difference for code even at IQ6K, q8_0, BF16 levels even though PPL is 0.25% different. If FP4 is guaranteed to be perfect, why waste the joules running 8bits or 16bits? Those Trillion parameter models are not meant to run on SSDs like DeepSeek would like you to believe it can.
+
+I don't know about you, but running non-perfect quants non-FP4 accelerated on home EPYC servers is not fun. I am running it. Waiting for 8K thinking tokens before first useful code token pops out at 10 tokens/s, that's a 10 minute wait. How much is 10 minutes of your life worth? Programmers should consider their life's worth at $1k/day. Assume 10hr/workday. That's $100/hr. 10 minute wait is $16 dollars. You would definitely offload that to a HBM powered API endpoint at this point. (And if you are throwing money at API endpoints to buy your own life's worth, might as well pay for 2700 elo o4-mini instead of a 2000 elo deepseek) Computing is suppose to accelerate productivity, not waiting for a reasoning models for minutes.
+
+Hence the need to run FP4 perfectly at Petaflops scale, not custom quants non-perfectly at Teraflops scale.
+
+---
+
+👤 **saood06** commented on **2025-06-15** at **14:24:47**
> The quant size isn't that impressive. The thing is if you run the 370GB non-FP4 quant on an EPYC with 512GB ram, you get 10-20 tokens/s with a 24GB VRAM GPU. That's a 1000W platform you run at home. that's 50w-100w per token generated a sec.
>
@@ -138,7 +198,7 @@ How can I be 25x times off if I made no claim about cost (let alone a numeric on
> Then there is the quality. PPL means nothing. 0.005 difference in PPL could mean a difference between code that runs in VS code, or code that doesn't. There is a difference for code even at IQ6K, q8_0, BF16 levels even though PPL is 0.25% different.
-Yes I am aware of that, there was even an example [here](https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2882600098) where performance collapse was observed even though PPL looked good, but the problem is there are infinite valid ways to measure quality, and benchmarking takes time (especially for large models). NVIDIA seemingly didn't even bother to run benchmarks on the unquantized version (and like I said chose to use third party that were far lower than the official numbers which makes their quant look far better than it should).
+Yes I am aware of that, there was even an example [here](https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2882600098) where performance collapse for a specific test was observed even though PPL looked good, but the problem is there are infinite valid ways to measure quality, and benchmarking takes time (especially for large models). NVIDIA seemingly didn't even bother to run benchmarks on the unquantized version (and like I said chose to use third party that were far lower than the official numbers which makes their quant look far better than it should).
> I don't know about you, but running non-perfect quants non-FP4 accelerated on home EPYC servers is not fun. I am running it. Waiting for 8K thinking tokens before first useful code token pops out at 10 tokens/s, that's a 10 minute wait.
@@ -159,7 +219,7 @@ If I wanted that PR here, I would port it, test it, and then make a PR. Otherwis
---
-👤 **ikawrakow** commented the **2025-06-15** at **14:36:25**:
+👤 **ikawrakow** commented on **2025-06-15** at **14:36:25**
> Is there anyway for ik to accept this pull from the main branch?
@@ -169,10 +229,34 @@ Are you asking if I would accept a PR adding `fp8` support if you prepared one f
---
-👤 **whatever1983** commented the **2025-06-15** at **14:49:50**:
+👤 **whatever1983** commented on **2025-06-15** at **14:49:50**
@ikawrakow
Might need some minor mods. The code in llama.cpp main branch seems decent. Besides the GGML version difference, why don't you try a merge first? At least the conversion scripts all work. Running FP8 on 40 series and 50 series need additional CUDA code. Running on CPU needs BF16 casts
-All I am saying is that at least the repo maintainer needs to be willing to accept the importance of those data formats. Because current/future hardware can do Petaflops on those formats. B200/GB10 and recently announced MI350X and the 432GB MI450X in 2026 can run the FP4 in a single GPU FP4 accelerated. You need to be forward looking.
\ No newline at end of file
+All I am saying is that at least the repo maintainer needs to be willing to accept the importance of those data formats. Because current/future hardware can do Petaflops on those formats. B200/GB10 and recently announced MI350X and the 432GB MI450X in 2026 can run the FP4 in a single GPU FP4 accelerated. You need to be forward looking.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-15** at **14:59:23**
+
+> Because current/future hardware can do Petaflops on those formats
+
+All that the linked PR does is add a CPU implementation for CPU's that don't natively support `fp8`. It has nothing to do with Petaflop hardware. When we get our hands on one of those (are you getting one for yourself?), there will be no need for this PR because the hardware will natively support it, and all we need to do is change one parameter in our call to cuBLAS.
+
+As far as I can tell, PR 10055 is 2-3 times slower compared to what we have here. So, if I felt that `fp8` was high priority, I wouldn't just try to merge this PR, but would rather make one myself that has a similar or better performance as the other data types in `ik_llama.cpp`
+
+---
+
+👤 **whatever1983** commented on **2025-06-15** at **17:53:15**
+
+@ikawrakow:
+
+Your 4080 is FP8 capable right? You bought it without ever touching the FP8 transformer pipeline, which is the most valuable part. With this PR, the CPU without FP8 support is only used to convert to the FP8 format, then your 4080 can run a 8B-12B model FP8 on 16GB VRAM.
+
+Just like that, you should try to get your hands on a 5090 card, where FP4 is the most valuable hardware that an average gamer will never touch until games use FP4 to inference LLMs. Such a waste really.
+
+CPUs are one generation away from implementing FP4. Zen6 and xeon with amx-fp4 are in the pipelines. You get FP4 into a gguf file first, running on bf16, and when CPUs with FP4 come out, you get a 4x boost.
+
+The real important thing is you never need to change model weights anymore at fp4 to guarantee 100% fidelity of the model. Hardware will catch up very very soon.
\ No newline at end of file
diff --git a/github-data/pull_requests/457 - Remove GGML_IQK_MUL_MAT option.md b/github-data/pull_requests/457 - Remove GGML_IQK_MUL_MAT option.md
index 1bdc54f4d..a5f5a870e 100644
--- a/github-data/pull_requests/457 - Remove GGML_IQK_MUL_MAT option.md
+++ b/github-data/pull_requests/457 - Remove GGML_IQK_MUL_MAT option.md
@@ -1,33 +1,43 @@
-### 🔀 [#457](https://github.com/ikawrakow/ik_llama.cpp/pull/457) - Remove GGML_IQK_MUL_MAT option
+## 🔀 [Pull Request #457](https://github.com/ikawrakow/ik_llama.cpp/pull/457) - Remove GGML_IQK_MUL_MAT option
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ✅ **Open** |
+| **Source Branch** | `ik/remove_iqk_option` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-25 |
| **Updated** | 2025-05-25 |
---
-#### Description
+## 📄 Description
There is no point in using `ik_llama.cpp` without `GGML_IQK_MUL_MAT`.
-Closes #456
+Closes [#456](https://github.com/ikawrakow/ik_llama.cpp/issues/456)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2025-05-25** at **12:34:51**:
+👤 **Nexesenex** commented on **2025-05-25** at **12:34:51**
-There is actually a point to leave this as a legacy marking for the quants, because it helps a lot with merging your quants, including the potential future ones, which are still compatible with only a few formatting adaptation with the mainline ggml framework, even if the ops are not.
+There is actually a point to leave this as a legacy marking for your new quants, because it helps a lot with merging them on my fork, including the potential future ones, which are still compatible with only a few formatting adaptation with the mainline ggml framework, even if the ops are not.
I'm really good at shooting in my own foot! :D
---
-👤 **ikawrakow** commented the **2025-05-25** at **15:10:55**:
+👤 **ikawrakow** commented on **2025-05-25** at **15:10:55**
> as a legacy marking
-Legacy marking in what sense?
\ No newline at end of file
+Legacy marking in what sense?
+
+---
+
+👤 **Nexesenex** commented on **2025-05-25** at **18:31:43**
+
+In the sense that even if the option is not used anymore to compile, it's pretty handy for your average enthusiast such as myself to still have the distinction in the code between IKL multmat dependent code and the rest of your code to help the merging with a mainline fork, at least on anything with a compatible part.
+
+For now, everything is still quite clear, but as time will pass, the divergence between IKL and mainline will increase, and having at least the point of reference of what works (theoretically) and what doesn't with mainline of August 2024 is an invaluable help.
\ No newline at end of file
diff --git a/github-data/pull_requests/458 - Add missing gguf-py constants.md b/github-data/pull_requests/458 - Add missing gguf-py constants.md
index 439709ac7..21556da65 100644
--- a/github-data/pull_requests/458 - Add missing gguf-py constants.md
+++ b/github-data/pull_requests/458 - Add missing gguf-py constants.md
@@ -1,13 +1,16 @@
-### 🔀 [#458](https://github.com/ikawrakow/ik_llama.cpp/pull/458) - Add missing gguf-py constants
+## 🔀 [Pull Request #458](https://github.com/ikawrakow/ik_llama.cpp/pull/458) - Add missing gguf-py constants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/add_missing_gguf_constants` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-25 |
| **Updated** | 2025-05-25 |
+| **Merged** | 2025-05-25 |
---
-#### Description
+## 📄 Description
The recently added `IQ5_KS, IQ5_KS_R4, IQ2_KT, IQ3_KT, IQ4_KT` were missing.
\ No newline at end of file
diff --git a/github-data/pull_requests/46 - IQ1_TN Metal implementation.md b/github-data/pull_requests/46 - IQ1_TN Metal implementation.md
index 5d0a10bf0..9c00834e3 100644
--- a/github-data/pull_requests/46 - IQ1_TN Metal implementation.md
+++ b/github-data/pull_requests/46 - IQ1_TN Metal implementation.md
@@ -1,14 +1,17 @@
-### 🔀 [#46](https://github.com/ikawrakow/ik_llama.cpp/pull/46) - IQ1_TN Metal implementation
+## 🔀 [Pull Request #46](https://github.com/ikawrakow/ik_llama.cpp/pull/46) - IQ1_TN Metal implementation
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_tn_metal` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-10 |
| **Updated** | 2024-09-10 |
+| **Merged** | 2024-09-10 |
---
-#### Description
+## 📄 Description
`IQ1_BN` stores a scale at the beginning of each row, followed by `IQ1_BN` packing of the ternary quants. The existing Metal implementation does not allow for that sort of thing, so some changes were necessary (apart from adding the necessary additions in `ggml-metal.m`):
* We modify the `kernel_mul_mm` and `kernel_mul_mm_id_impl` templates to have a dequantizer type as a template parameter (instead of a dequantization function)
diff --git a/github-data/pull_requests/460 - aarch64 kernels for KT quants.md b/github-data/pull_requests/460 - aarch64 kernels for KT quants.md
index 06e275baa..92d50cb95 100644
--- a/github-data/pull_requests/460 - aarch64 kernels for KT quants.md
+++ b/github-data/pull_requests/460 - aarch64 kernels for KT quants.md
@@ -1,14 +1,16 @@
-### 🔀 [#460](https://github.com/ikawrakow/ik_llama.cpp/pull/460) - aarch64 kernels for KT quants
+## 🔀 [Pull Request #460](https://github.com/ikawrakow/ik_llama.cpp/pull/460) - aarch64 kernels for KT quants
| **Author** | `andrewkchan` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `trellis_aarch64` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-26 |
| **Updated** | 2025-05-30 |
---
-#### Description
+## 📄 Description
This adds aarch64 kernels for the KT quants added in https://github.com/ikawrakow/ik_llama.cpp/pull/441.
@@ -52,4 +54,24 @@ For comparison, I get ~18.3 t/s on IQ2_K, so it is considerably slower, but mayb
- Self-reported review complexity:
- [X] Low
- [ ] Medium
- - [ ] High
\ No newline at end of file
+ - [ ] High
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-05-26** at **14:18:01**
+
+This is great! I didn't know you had an M3 laptop
+
+I had started working on the NEON implementation, but did not push to GitHub because there was still a bug in the `IQ4_KT` implementation that I didn't find where it was. You can check [this branch](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/trellis_neon).
+
+On NEON one can use `fp16` arithmetic. I think this should make it go quite a bit faster. Can you compare on your M3?
+
+---
+
+👤 **andrewkchan** commented on **2025-05-26** at **19:31:52**
+
+Oh nice! Yes it is about 2x the speed on IQ2_KT, from 4.5t/s to 7.9t/s on basic text generation test. IQ3_KT goes from 3.4t/s to 5.7t/s and IQ4_KT goes from 2.1t/s to 2.5t/s (using the buggy kernel).
+
+Maybe we can close this PR then and you can continue with your branch. I can get started on metal kernels instead. Or I'm happy to try to finish your work too. What do you think?
\ No newline at end of file
diff --git a/github-data/pull_requests/461 - CUDA implementation for IQ2_K_R4_ IQ3_K_R4_ IQ4_K_R4_ IQ5_K_R4.md b/github-data/pull_requests/461 - CUDA implementation for IQ2_K_R4 IQ3_K_R4 IQ4_K_R4 IQ5_K_R4.md
similarity index 77%
rename from github-data/pull_requests/461 - CUDA implementation for IQ2_K_R4_ IQ3_K_R4_ IQ4_K_R4_ IQ5_K_R4.md
rename to github-data/pull_requests/461 - CUDA implementation for IQ2_K_R4 IQ3_K_R4 IQ4_K_R4 IQ5_K_R4.md
index b85eb0ede..dccc9d052 100644
--- a/github-data/pull_requests/461 - CUDA implementation for IQ2_K_R4_ IQ3_K_R4_ IQ4_K_R4_ IQ5_K_R4.md
+++ b/github-data/pull_requests/461 - CUDA implementation for IQ2_K_R4 IQ3_K_R4 IQ4_K_R4 IQ5_K_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#461](https://github.com/ikawrakow/ik_llama.cpp/pull/461) - CUDA implementation for IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4
+## 🔀 [Pull Request #461](https://github.com/ikawrakow/ik_llama.cpp/pull/461) - CUDA implementation for IQ2_K_R4, IQ3_K_R4, IQ4_K_R4, IQ5_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_iq4_k_r4` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-26 |
| **Updated** | 2025-06-04 |
+| **Merged** | 2025-05-26 |
---
-#### Description
+## 📄 Description
The `IQX_K` quants and their row-interleaved siblings `IQX_K_R4` offer better quantization quality than corresponding i-, k-, or legacy quants at the same bpw. `IQX_K_R4` quants have better CPU performance but cannot be used on CUDA as there is no GEMM/GEMV implementation. Hence, "quant cookers" need to release `IQX_K` quantized model, so users can use them on their GPUs, but that requires users doing CPU-ony inference to repack the model to take advantage of the better CPU performance. In addition, @ubergarm has released various `IQK_X_R4` quantized models (see [here](https://huggingface.co/ubergarm), and those cannot be used for GPU inference.
@@ -20,9 +23,9 @@ For now GEMM is implemented via dequantize + cuBLAS. I may add quantized GEMM (a
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-05-30** at **15:22:27**:
+👤 **ubergarm** commented on **2025-05-30** at **15:22:27**
> I'll follow up with a separate PR for IQ2_KS_R4, IQ4_KS_R4 and IQ5_KS_R4.
@@ -50,7 +53,7 @@ For now I'll go with `IQ3_K_R4` and `IQ2_K_R4`. I might loop back in the future
---
-👤 **ikawrakow** commented the **2025-05-30** at **15:35:19**:
+👤 **ikawrakow** commented on **2025-05-30** at **15:35:19**
No I haven't done `IQ2_KS_R4` yet. I keep trying to improve it, so I got distracted with that. And, because there isn't much usage of it yet, I was considering making a breaking change to the packing. That was the actual reason for postponing the CUDA implementation.
@@ -60,15 +63,17 @@ Or, if you have the patience to wait for `iq2_kt`, you can try quantizing the `f
---
-👤 **ubergarm** commented the **2025-06-01** at **15:28:53**:
+👤 **ubergarm** commented on **2025-06-01** at **15:28:53**
> I did not enable GGML_CUDA_IQK_FORCE_BF16 by default as it reduces prompt processing performance while, as far as I can tell, bf16 is only required for DeepSeek.
-I got a report from the wild that FORCE_BF16=1 gave a speed boost and confirmed that it does seem to do so at least in this specific hardware configuration and this specific quant. I added a graph and data to the R1-0528 discussion: https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13335019
+I got a report from the wild that FORCE_BF16=1 gave a speed boost and confirmed that it does seem to do so at least in this specific hardware configuration and this specific quant for PP. I added a graph and data to the R1-0528 discussion: https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13335019
+
+This benchmark also confirms offloading additional `_r4` onto GPU is giving some speed boost for both PP and TG.
> Or, if you have the patience to wait for iq2_kt, you can try quantizing the ffn_up and ffn_gate tensors with that. It is slightly less bpw than iq2_ks (2.125 vs 2.1875), but you get lower PPL.
-OOOH! I just realized you've been doing the `iqN_kt` "trellis quants" which are the QTIP/exl3 quants for a while. I can be quite myopic. Reading through some old PRs I see you've done quite a bit already. I've been impressed by the low perplexity (especially with such low 2~3 bpw) using exllamav3 to make exl3 quants following @louiehelm 's quest for the best magic number e.g. `3INST mcg=0xB83EA16`
+OOOH! I just realized you've been doing the `iqN_kt` "trellis quants" which are the QTIP/exl3 quants for a while. I can be quite myopic. Reading through some old PRs I see you've done quite a bit already. I've been impressed by the low perplexity (especially with such low 2~3 bpw) using exllamav3 to make exl3 quants following @louiehelm 's [quest for the best magic number](https://github.com/turboderp-org/exllamav3/pull/26#issuecomment-2916801280) e.g. `3INST mcg=0xB83EA16`

@@ -78,7 +83,7 @@ Regardless, I'll read up more on your implementation of iq2_kt and check the cod
---
-👤 **ikawrakow** commented the **2025-06-01** at **15:57:38**:
+👤 **ikawrakow** commented on **2025-06-01** at **15:57:38**
> OOOH! I just realized you've been doing the iqN_kt "trellis quants" which are the QTIP/exl3 quants for a while. I can be quite myopic. Reading through some old PRs I see you've done quite a bit already. I've been impressed by the low perplexity (especially with such low 2~3 bpw) using exllamav3 to make exl3 quants following @louiehelm 's https://github.com/turboderp-org/exllamav3/pull/26#issuecomment-2916801280 e.g. 3INST mcg=0xB83EA16
@@ -92,7 +97,7 @@ The thing about apples-to-apples is that if you use `PPL(Q)/PPL(f16)` (or better
---
-👤 **louiehelm** commented the **2025-06-02** at **04:14:53**:
+👤 **louiehelm** commented on **2025-06-02** at **04:14:53**
I like KT quants too and tried subbing out 3INST parameters with superior ones (since LCG from QTIP paper x = 89226354 * x + 64248484 can't be optimal) but for some reason, all the better parameters with lower MSE both in synthetic trellis codes (without rotations) or in EXL3 (with rotations) don't show improvement when I slot them into ik_llama, recompile, quant, and test models.
@@ -127,7 +132,7 @@ would become something like this (with slightly different asm input params if yo
---
-👤 **ikawrakow** commented the **2025-06-02** at **05:26:12**:
+👤 **ikawrakow** commented on **2025-06-02** at **05:26:12**
> Could current KT code paths be implicitly tuned to expect certain behavior the default parameters provide? I haven't gone through the code super carefully but at first glance I can't immediately figure this out.
@@ -135,9 +140,9 @@ The quantization implementation does not attempt to find the provably optimum so
* I'm not a GPU person, so prefer to work on the CPU. Solving exactly on the CPU is simply prohibitive.
* All my past experience tells me that a lower RMSE does not necessarily translate into a better observable model quality
-Hence, a heuristics is used to determine "optimum" quants. The heuristics is tuned to the specific values being produced by the trellis. But I don't expect you to observe "unreasonable harm", just perhaps a somewhat lower quantization.
+Hence, a heuristics is used to determine "optimum" quants. The heuristics is tuned to the specific values being produced by the trellis. But I don't expect you to observe "unreasonable harm", just perhaps a somewhat lower quality quantization.
-I did play quite a bit with different generators when working on #113. For instance, I experimented with using the sum of the 8 bytes of 64-bit random variables. This has many advantages to the QTIP trellises:
+I did play quite a bit with different generators when working on [#113](https://github.com/ikawrakow/ik_llama.cpp/issues/113). For instance, I experimented with using the sum of the 8 bytes of 64-bit random variables. This has many advantages to the QTIP trellises:
* It produces a much better Gaussian distribution, so it is "theoretically better"
* It is much cheaper to generate. There are high quality pseudo random number generators that only require cheap xors and shifts instead of extremely expensive 32-bit integer multiplications. Summing up the elements is fast on CUDA and on the CPU.
* We end up with 16-bit integer random variables, so computing dot products is nearly 2X the speed of the QTIP trellises when there is no native `fp16` support as it is the case on many CPUs. We could go even a step further and squeeze them to 8-bit, which will make also CUDA run significantly faster.
@@ -148,4 +153,40 @@ But despite the "theoretical advantage", I observed lower quality quantization.
> Also, if you want KT quants to run even faster, the QTIP paper mentions how to combine the 2 masks in 3INST (AND + XOR) into a single LOP3 instruction. It needs to be added in asm because nvcc can't find this optimization but it improves speed by a measurable amount.
-I noticed it too in the QTIP paper, but I did not take it seriously because an integer multiplication is quite a bot slower than a xor. But if you say that you observe a measurable performance difference, I'll try it. Thanks!
\ No newline at end of file
+I noticed it too in the QTIP paper, but I did not take it seriously because an integer multiplication is quite a bot slower than a xor. But if you say that you observe a measurable performance difference, I'll try it. Thanks!
+
+---
+
+👤 **ikawrakow** commented on **2025-06-02** at **09:52:06**
+
+OK, using the inline assembly instruction results in a 0.6% speedup for TG-128 (178.7 t/s vs 177.5 t/s on my RTX-4080 for `IQ2_KT`-quantized LlaMA-3.1-8B).
+
+---
+
+👤 **ubergarm** commented on **2025-06-04** at **01:52:05**
+
+This closed PR probably isn't the place for this, but given the previous conversation around optimizing the KT quants I have my first KT quant perplexity/kld comparison now!
+
+#### DeepSeek-R1-0528-IQ2_KT
+`196.696 GiB (2.514 BPW)` "IQ2_KT" quant which is a mix of:
+```
+- type f32: 361 tensors
+- type q5_0: 61 tensors `attn_k_b`
+- type iq2_kt: 116 tensors `ffn_(gate|up)_exps`
+- type iq3_kt: 58 tensors `ffn_down_exps`
+- type iq4_kt: 551 tensors `everything else`
+```
+
+#### Perplexity
+Compared to my other ik_llama.cpp quants in this model collection made with same imatrix corpus with `wiki.test.raw`.
+
+
+
+## KLD
+Compared to my other ik_llama.cpp quant collection made with same imatrix corpus with very short unreleased "novel text" kld text corpus against q8_0 baseline.
+
+
+
+The only other piece of data I have is using IQ4_KT as attention tensor in otherwise Q4_0 quant which is in [the R1-0528 discussion](https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13344417).
+
+Looking forward playing with these some more and seeing how they perform across various models as more data becomes available. Thanks.
\ No newline at end of file
diff --git a/github-data/pull_requests/462 - CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4.md b/github-data/pull_requests/462 - CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4.md
index 78d9edfb4..7079f6f53 100644
--- a/github-data/pull_requests/462 - CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4.md
+++ b/github-data/pull_requests/462 - CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4.md
@@ -1,15 +1,18 @@
-### 🔀 [#462](https://github.com/ikawrakow/ik_llama.cpp/pull/462) - CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4
+## 🔀 [Pull Request #462](https://github.com/ikawrakow/ik_llama.cpp/pull/462) - CUDA GEMM and GEMV for IQ4_KS_R4 and IQ5_KS_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_iqk_ks_r4` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-26 |
| **Updated** | 2025-05-27 |
+| **Merged** | 2025-05-27 |
---
-#### Description
+## 📄 Description
-This PR is a follow up to PR #461 and adds CUDA implementation for `IQ4_KS_R4` and `IQ5_KS_R4`
+This PR is a follow up to PR [#461](https://github.com/ikawrakow/ik_llama.cpp/issues/461) and adds CUDA implementation for `IQ4_KS_R4` and `IQ5_KS_R4`
Note: because GEMM is implemented via dequantize+cuBLAS, if you want to use a IQX_K_R4 DeepSeek-V3/R1 model on the GPU, you may need to build with -DGGML_CUDA_IQK_FORCE_BF16=1 to force bf16 arithmetic with cuBLAS as fp16 has been noted to lead to numerical instabilities and garbled output. I did not enable GGML_CUDA_IQK_FORCE_BF16 by default as it reduces prompt processing performance while, as far as I can tell, bf16 is only required for DeepSeek.
\ No newline at end of file
diff --git a/github-data/pull_requests/465 - Set cache_prompt default to true.md b/github-data/pull_requests/465 - Set cache_prompt default to true.md
index f415d77b9..d5ab813e6 100644
--- a/github-data/pull_requests/465 - Set cache_prompt default to true.md
+++ b/github-data/pull_requests/465 - Set cache_prompt default to true.md
@@ -1,21 +1,24 @@
-### 🔀 [#465](https://github.com/ikawrakow/ik_llama.cpp/pull/465) - Set cache_prompt default to true
+## 🔀 [Pull Request #465](https://github.com/ikawrakow/ik_llama.cpp/pull/465) - Set cache_prompt default to true
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/cache_default` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-28 |
| **Updated** | 2025-05-28 |
+| **Merged** | 2025-05-28 |
---
-#### Description
+## 📄 Description
There is very little reason to not enable cache_prompt, so it makes more sense for it to be enabled since it benefits those who either don't know about this or use tools that do not set this, and the option to turn it off is still allowed in the very niche situations where this behavior is not desired.
-Closes #455
+Closes [#455](https://github.com/ikawrakow/ik_llama.cpp/issues/455)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-05-28** at **05:18:19**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-28** at **05:18:19**
\ No newline at end of file
diff --git a/github-data/pull_requests/468 - Minor 2 iq2_ks TG performance improvement on CUDA.md b/github-data/pull_requests/468 - Minor 2 iq2_ks TG performance improvement on CUDA.md
new file mode 100644
index 000000000..a89bdd277
--- /dev/null
+++ b/github-data/pull_requests/468 - Minor 2 iq2_ks TG performance improvement on CUDA.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #468](https://github.com/ikawrakow/ik_llama.cpp/pull/468) - Minor (~2%) iq2_ks TG performance improvement on CUDA
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/minor_iq2ks_tweak` |
+| **Target Branch** | `main` |
+| **Created** | 2025-05-28 |
+| **Updated** | 2025-06-01 |
+| **Merged** | 2025-06-01 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/468 - Minor _2_ iq2_ks TG performance improvement on CUDA.md b/github-data/pull_requests/468 - Minor _2_ iq2_ks TG performance improvement on CUDA.md
deleted file mode 100644
index bed1ae7a8..000000000
--- a/github-data/pull_requests/468 - Minor _2_ iq2_ks TG performance improvement on CUDA.md
+++ /dev/null
@@ -1,7 +0,0 @@
-### 🔀 [#468](https://github.com/ikawrakow/ik_llama.cpp/pull/468) - Minor (~2%) iq2_ks TG performance improvement on CUDA
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-05-28 |
-| **Updated** | 2025-06-01 |
\ No newline at end of file
diff --git a/github-data/pull_requests/469 - Replace MLA-specific KV cache with the standard KV cache.md b/github-data/pull_requests/469 - Replace MLA-specific KV cache with the standard KV cache.md
index 62b5c2653..c0d524106 100644
--- a/github-data/pull_requests/469 - Replace MLA-specific KV cache with the standard KV cache.md
+++ b/github-data/pull_requests/469 - Replace MLA-specific KV cache with the standard KV cache.md
@@ -1,22 +1,25 @@
-### 🔀 [#469](https://github.com/ikawrakow/ik_llama.cpp/pull/469) - Replace MLA-specific KV cache with the standard KV cache
+## 🔀 [Pull Request #469](https://github.com/ikawrakow/ik_llama.cpp/pull/469) - Replace MLA-specific KV cache with the standard KV cache
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/remove_kv_l` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-28 |
| **Updated** | 2025-05-30 |
+| **Merged** | 2025-05-30 |
---
-#### Description
+## 📄 Description
Also tried handling the case of a missing V cache (as it happens with most MLA options) when reading/writing/de-fragmenting the cache, but not sure of that works, so making the PR a draft.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-05-29** at **05:01:59**:
+👤 **saood06** commented on **2025-05-29** at **05:01:59**
I'll try to test this later tonight (my server is currently busy downloading and converting the new R1 checkpoint) with some loading and saving of the cache to a file but I don't see how de-fragmenting has changed looking at your commits.
@@ -24,7 +27,7 @@ De-fragmenting the cache is not a feature I'm very familiar with at all so I'm n
---
-👤 **ikawrakow** commented the **2025-05-29** at **05:05:56**:
+👤 **ikawrakow** commented on **2025-05-29** at **05:05:56**
> but I don't see how de-fragmenting has changed looking at your commits.
@@ -32,7 +35,7 @@ In the function `build_defrag()` there is a check for the presence of V-cache.
---
-👤 **saood06** commented the **2025-05-29** at **05:15:41**:
+👤 **saood06** commented on **2025-05-29** at **05:15:41**
> > but I don't see how de-fragmenting has changed looking at your commits.
>
@@ -42,7 +45,19 @@ I see it now. I also see that mainline has changed the default `defrag_thold` (n
---
-👤 **saood06** commented the **2025-05-29** at **13:31:05**:
+👤 **saood06** commented on **2025-05-29** at **13:00:20**
+
+This still results in a `Segmentation fault` when saving. Similar to before the file it was saving has the initial magic numbers and the number of tokens and what I'm assuming is the raw tokens of the prompt (size mostly lines up missing the last few it seems probably because it didn't flush the buffer before it crashed). So it seems to have crashed again when it attempted to write the actual cache.
+
+---
+
+👤 **ikawrakow** commented on **2025-05-29** at **13:23:10**
+
+Can you debug?
+
+---
+
+👤 **saood06** commented on **2025-05-29** at **13:31:05**
> Can you debug?
@@ -50,20 +65,40 @@ I'll look into it more later. Going to head off now, was hoping to have more tim
---
-👤 **saood06** commented the **2025-05-30** at **07:57:07**:
+👤 **saood06** commented on **2025-05-30** at **06:42:21**
+
+@ikawrakow
+
+See [#473](https://github.com/ikawrakow/ik_llama.cpp/issues/473) I made that PR to merge onto this branch as I didn't know if you were fine with me pushing commits directly onto your branch.
+
+---
+
+👤 **saood06** commented on **2025-05-30** at **07:57:07**
If you are waiting for me to test de-fragmenting the cache before marking this ready, I'm not sure if/when I will do that, as there doesn't seem to be any indication of when that happens in any example (server only tells you when fragmentation may be an issue). I'd either need to write an example or understand how it works well enough to create a situation in which I know it will happen (with the threshold I set, since as it stands it is disabled by default here).
---
-👤 **saood06** commented the **2025-05-30** at **08:03:29**:
+👤 **ikawrakow** commented on **2025-05-30** at **07:59:18**
+
+Closed in favor of [#473](https://github.com/ikawrakow/ik_llama.cpp/issues/473)
+
+---
+
+👤 **saood06** commented on **2025-05-30** at **08:03:29**
@ikawrakow
-#473 merged onto `ik/remove_kv_l` and not main, sorry if that wasn't clear before.
+[#473](https://github.com/ikawrakow/ik_llama.cpp/issues/473) merged onto `ik/remove_kv_l` and not main, sorry if that wasn't clear before.
+
+---
+
+👤 **ikawrakow** commented on **2025-05-30** at **08:05:17**
+
+Oops.
---
-👤 **ikawrakow** commented the **2025-05-30** at **08:05:17**:
+👤 **ikawrakow** commented on **2025-05-30** at **08:08:05**
-Oops.
\ No newline at end of file
+For the de-fragmentation part of it, to me it looks like it should work. The fact that we didn't get any bug reports so far indicates that the de-fragmentation use case is not very important, so we can just wait and see.
\ No newline at end of file
diff --git a/github-data/pull_requests/47 - iq2_tn_ slightly better performance on AVX2.md b/github-data/pull_requests/47 - iq2_tn slightly better performance on AVX2.md
similarity index 73%
rename from github-data/pull_requests/47 - iq2_tn_ slightly better performance on AVX2.md
rename to github-data/pull_requests/47 - iq2_tn slightly better performance on AVX2.md
index e71aadad5..1316a97b5 100644
--- a/github-data/pull_requests/47 - iq2_tn_ slightly better performance on AVX2.md
+++ b/github-data/pull_requests/47 - iq2_tn slightly better performance on AVX2.md
@@ -1,14 +1,17 @@
-### 🔀 [#47](https://github.com/ikawrakow/ik_llama.cpp/pull/47) - iq2_tn: slightly better performance on AVX2
+## 🔀 [Pull Request #47](https://github.com/ikawrakow/ik_llama.cpp/pull/47) - iq2_tn: slightly better performance on AVX2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_tn_avx2` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-10 |
| **Updated** | 2024-09-10 |
+| **Merged** | 2024-09-10 |
---
-#### Description
+## 📄 Description
We get `PP-512 = 545` t/s for the 4B TriLM model compared to `PP-512 = 498` t/s on the main branch (on a Ryzen-5975WX). TG is not affected.
diff --git a/github-data/pull_requests/470 - Send DONE for OAI compatibility.md b/github-data/pull_requests/470 - Send DONE for OAI compatibility.md
new file mode 100644
index 000000000..9da23edeb
--- /dev/null
+++ b/github-data/pull_requests/470 - Send DONE for OAI compatibility.md
@@ -0,0 +1,199 @@
+## 🔀 [Pull Request #470](https://github.com/ikawrakow/ik_llama.cpp/pull/470) - Send [DONE] for OAI compatibility
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/server_send_done` |
+| **Target Branch** | `main` |
+| **Created** | 2025-05-29 |
+| **Updated** | 2025-06-17 |
+| **Merged** | 2025-06-17 |
+
+---
+
+## 📄 Description
+
+See [#467](https://github.com/ikawrakow/ik_llama.cpp/issues/467)
+
+The PR adds a command line parameter `--send-done`, which makes the server send a `data: [DONE]\n\n` message when a stop token is encountered.
+
+---
+
+## 💬 Conversation
+
+👤 **cyril23** commented on **2025-06-04** at **06:37:52**
+
+Thanks a lot! `--send-done` works perfectly on my end!
+
+Below are my build and test steps in case they’re useful.
+
+## 1. Build ik_llama.cpp from your branch
+```
+# Inside WSL 2 (Ubuntu 24 LTS) start the base Docker container:
+sudo docker run --gpus all -it --rm \
+ --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --net \
+ host nvcr.io/nvidia/tritonserver:25.04-trtllm-python-py3
+
+# In the container, clone your branch and build ik_llama.cpp:
+cd /root && git clone -b ik/server_send_done https://github.com/ikawrakow/ik_llama.cpp
+cd ik_llama.cpp/
+apt-get update && apt-get install -y cmake build-essential
+cmake -B build -DGGML_CUDA=ON
+cmake --build build --config Release -j$(nproc)
+cd build/bin
+
+# Download a test model:
+wget "https://huggingface.co/unsloth/phi-4-GGUF/resolve/main/phi-4-Q4_K_M.gguf?download=true" -O "phi-4-Q4_K_M.gguf"
+```
+
+## 2. Start the ik_llama.cpp server without `--send-done`
+```
+./llama-server -m ./phi-4-Q4_K_M.gguf -c 2048 -ngl 99 -np 1 --cont-batching \
+ --host 0.0.0.0 --port 8000 -fa --alias "phi-4"
+```
+
+### 2.1 Test with cURL
+Send a chat completions request with streaming activated:
+```
+me@Computer:~$ curl --location 'http://localhost:8000/v1/chat/completions' \
+--header 'Content-Type: application/json' \
+--header 'Authorization: Bearer testxxx' \
+--data '{
+ "model": "phi-4",
+ "messages": [
+ {
+ "role": "system",
+ "content": "You are a helpful assistant."
+ },
+ {
+ "role": "user",
+ "content": "What is the capital of France? Make your answer as short as possible."
+ }
+ ],
+ "stream": true
+}'
+data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Paris"}}],"created":1749017398,"id":"chatcmpl-RZV2GpuTn0T4JOV2iTgDWfb21r6cxEOe","model":"phi-4","object":"chat.completion.chunk"}
+
+data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"."}}],"created":1749017398,"id":"chatcmpl-RZV2GpuTn0T4JOV2iTgDWfb21r6cxEOe","model":"phi-4","object":"chat.completion.chunk"}
+
+data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1749017398,"id":"chatcmpl-RZV2GpuTn0T4JOV2iTgDWfb21r6cxEOe","model":"phi-4","object":"chat.completion.chunk","usage":{"completion_tokens":3,"prompt_tokens":34,"total_tokens":37}}
+
+me@Computer:~$
+```
+
+As expected, no `data: [DONE]` line shows up.
+
+### 2.2 Test with inference-benchmarker (expected to fail)
+We can further test that with https://github.com/huggingface/inference-benchmarker/ which [expects the streamed data to end with `[done]`](https://github.com/huggingface/inference-benchmarker/blob/687e477930b387d3c9c787d4953a266f6469f047/src/requests.rs#L165):
+```
+# Build once:
+cd ~ && git clone https://github.com/huggingface/inference-benchmarker inference-benchmarker-current && \
+cd inference-benchmarker-current && \
+sudo docker build -t inference_benchmarker_latest .
+export HUGGING_FACE_HUB_TOKEN=my_token_here_xxx
+
+# Run:
+sudo docker run --network host -e HF_TOKEN=$HUGGING_FACE_HUB_TOKEN \
+ inference_benchmarker_latest inference-benchmarker --no-console \
+ --prompt-options "num_tokens=200,max_tokens=220,min_tokens=180,variance=10" \
+ --decode-options "num_tokens=200,max_tokens=220,min_tokens=180,variance=10" \
+ --url http://localhost:8000/v1 \
+ --rates 1.0 --max-vus 800 --duration 15s --warmup 15s --benchmark-kind rate \
+ --model-name "phi-4" --tokenizer-name "microsoft/phi-4"
+```
+Output of inference-benchmarker
+
+> Text Generation Inference Benchmark 1.1.0 (unknown)
+> [2025-06-04T06:11:15Z ERROR inference_benchmarker] Error running benchmark: "Backend did not return any valid response. It is either not responding or test duration is too short."
+
+This is exactly what we expect without `[DONE]`.
+
+## 3. Start the ik_llama.cpp server with `--send-done`
+```
+./llama-server -m ./phi-4-Q4_K_M.gguf -c 2048 -ngl 99 -np 1 --cont-batching \
+ --host 0.0.0.0 --port 8000 -fa --alias "phi-4" \
+ --send-done
+```
+
+### 3.1 Test with cURL
+```
+me@Computer:~$ curl --location 'http://localhost:8000/v1/chat/completions' \
+--header 'Content-Type: application/json' \
+--header 'Authorization: Bearer testxxx' \
+--data '{
+ "model": "phi-4",
+ "messages": [
+ {
+ "role": "system",
+ "content": "You are a helpful assistant."
+ },
+ {
+ "role": "user",
+ "content": "What is the capital of France? Make your answer as short as possible."
+ }
+ ],
+ "stream": true
+}'
+data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"Paris"}}],"created":1749017544,"id":"chatcmpl-lhQ9OQOyhQw3Vy5MCwWzIGEyg0zTs4kk","model":"phi-4","object":"chat.completion.chunk"}
+
+data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"."}}],"created":1749017544,"id":"chatcmpl-lhQ9OQOyhQw3Vy5MCwWzIGEyg0zTs4kk","model":"phi-4","object":"chat.completion.chunk"}
+
+data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1749017544,"id":"chatcmpl-lhQ9OQOyhQw3Vy5MCwWzIGEyg0zTs4kk","model":"phi-4","object":"chat.completion.chunk","usage":{"completion_tokens":3,"prompt_tokens":34,"total_tokens":37}}
+
+data: [DONE]
+
+me@Computer:~$
+```
+
+Now we can see `[DONE]\n\n` has been received as the final chunk! 👍
+
+### 3.2 Test with inference-benchmarker (now succeeds!)
+Run the Docker container again:
+```
+sudo docker run --network host -e HF_TOKEN=$HUGGING_FACE_HUB_TOKEN \
+ inference_benchmarker_latest inference-benchmarker --no-console \
+ --prompt-options "num_tokens=200,max_tokens=220,min_tokens=180,variance=10" \
+ --decode-options "num_tokens=200,max_tokens=220,min_tokens=180,variance=10" \
+ --url http://localhost:8000/v1 \
+ --rates 1.0 --max-vus 800 --duration 15s --warmup 15s --benchmark-kind rate \
+ --model-name "phi-4" --tokenizer-name "microsoft/phi-4"
+```
+Output of inference-benchmarker:
+```
+┌─────────────────┬────────────────────────────────────────────────────────────────┐
+│ Parameter │ Value │
+├─────────────────┼────────────────────────────────────────────────────────────────┤
+│ Max VUs │ 800 │
+│ Duration │ 15 │
+│ Warmup Duration │ 15 │
+│ Benchmark Kind │ Rate │
+│ Rates │ [1.0] │
+│ Num Rates │ 10 │
+│ Prompt Options │ num_tokens=Some(200),min_tokens=180,max_tokens=220,variance=10 │
+│ Decode Options │ num_tokens=Some(200),min_tokens=180,max_tokens=220,variance=10 │
+│ Tokenizer │ microsoft/phi-4 │
+│ Extra Metadata │ N/A │
+└─────────────────┴────────────────────────────────────────────────────────────────┘
+
+
+┌────────────────────┬────────────┬───────────────────┬────────────┬───────────┬───────────────────┬────────────┬─────────────────────┬─────────────────────────────┬──────────────────────────────┐
+│ Benchmark │ QPS │ E2E Latency (avg) │ TTFT (avg) │ ITL (avg) │ Throughput │ Error Rate │ Successful Requests │ Prompt tokens per req (avg) │ Decoded tokens per req (avg) │
+├────────────────────┼────────────┼───────────────────┼────────────┼───────────┼───────────────────┼────────────┼─────────────────────┼─────────────────────────────┼──────────────────────────────┤
+│ warmup │ 0.72 req/s │ 1.40 sec │ 65.69 ms │ 7.44 ms │ 127.13 tokens/sec │ 0.00% │ 10/10 │ 200.00 │ 177.40 │
+│ constant@1.00req/s │ 0.71 req/s │ 3.42 sec │ 2040.85 ms │ 7.52 ms │ 129.97 tokens/sec │ 0.00% │ 9/9 │ 200.00 │ 183.44 │
+└────────────────────┴────────────┴───────────────────┴────────────┴───────────┴───────────────────┴────────────┴─────────────────────┴─────────────────────────────┴──────────────────────────────┘
+```
+
+Everything finishes without errors. 👍
+
+---
+
+👤 **voipmonitor** commented on **2025-06-17** at **07:02:07**
+
+I have verified it and it works for me too with the --send-done. Would be nice to merge it.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-17** at **07:33:28**
+
+Closes [#467](https://github.com/ikawrakow/ik_llama.cpp/issues/467)
\ No newline at end of file
diff --git a/github-data/pull_requests/470 - Send _DONE_ for OAI compatibility.md b/github-data/pull_requests/470 - Send _DONE_ for OAI compatibility.md
deleted file mode 100644
index 667ee4b81..000000000
--- a/github-data/pull_requests/470 - Send _DONE_ for OAI compatibility.md
+++ /dev/null
@@ -1,23 +0,0 @@
-### 🔀 [#470](https://github.com/ikawrakow/ik_llama.cpp/pull/470) - Send [DONE] for OAI compatibility
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-05-29 |
-| **Updated** | 2025-06-17 |
-
----
-
-#### Description
-
-See #467
-
-The PR adds a command line parameter `--send-done`, which makes the server send a `data: [DONE]\n\n` message when a stop token is encountered.
-
----
-
-#### 💬 Conversation
-
-👤 **ikawrakow** commented the **2025-06-17** at **07:33:28**:
-
-Closes #467
\ No newline at end of file
diff --git a/github-data/pull_requests/471 - NEON implementation for trellis quants.md b/github-data/pull_requests/471 - NEON implementation for trellis quants.md
index 2de0a1670..044df26f2 100644
--- a/github-data/pull_requests/471 - NEON implementation for trellis quants.md
+++ b/github-data/pull_requests/471 - NEON implementation for trellis quants.md
@@ -1,16 +1,19 @@
-### 🔀 [#471](https://github.com/ikawrakow/ik_llama.cpp/pull/471) - NEON implementation for trellis quants
+## 🔀 [Pull Request #471](https://github.com/ikawrakow/ik_llama.cpp/pull/471) - NEON implementation for trellis quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/trellis_neon` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-29 |
| **Updated** | 2025-05-29 |
+| **Merged** | 2025-05-29 |
---
-#### Description
+## 📄 Description
-Alternative to #460
+Alternative to [#460](https://github.com/ikawrakow/ik_llama.cpp/issues/460)
One wouldn't really want to use this on a NEON CPU as it is much too slow. But for the sake of completeness, here it is.
@@ -46,6 +49,6 @@ Sweep bench results for LLaMA-3.1-8B-Instruct **with BLAS** on M2-Max CPU (PP pe
| 512 | 128 | 1536 | 5.069 | 101.01 | 22.843 | 5.60 |
| 512 | 128 | 2048 | 5.295 | 96.70 | 22.816 | 5.61 |
-This is nevertheless quite a bit faster than #460, so I'll go with this PR.
+This is nevertheless quite a bit faster than [#460](https://github.com/ikawrakow/ik_llama.cpp/issues/460), so I'll go with this PR.
**Of note:** I couldn't make `IQ4_KT` work with `fp16` arithmetic for some reason. Not sure if there really is `fp16` range overflow, or if I just have a bug in the `fp16` implementation that I simply cannot see.
\ No newline at end of file
diff --git a/github-data/pull_requests/473 - Replace MLA-specific KV cache with the standard KV cache V2.md b/github-data/pull_requests/473 - Replace MLA-specific KV cache with the standard KV cache V2.md
index 6bfc436ca..3d17fbbc5 100644
--- a/github-data/pull_requests/473 - Replace MLA-specific KV cache with the standard KV cache V2.md
+++ b/github-data/pull_requests/473 - Replace MLA-specific KV cache with the standard KV cache V2.md
@@ -1,14 +1,17 @@
-### 🔀 [#473](https://github.com/ikawrakow/ik_llama.cpp/pull/473) - Replace MLA-specific KV cache with the standard KV cache V2
+## 🔀 [Pull Request #473](https://github.com/ikawrakow/ik_llama.cpp/pull/473) - Replace MLA-specific KV cache with the standard KV cache V2
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/remove_kv_l` |
+| **Target Branch** | `ik/remove_kv_l` |
| **Created** | 2025-05-30 |
| **Updated** | 2025-05-30 |
+| **Merged** | 2025-05-30 |
---
-#### Description
+## 📄 Description
Tested and was able to successfully read and write the cache to a file. De-fragmenting the cache still has yet to be tested.
@@ -20,13 +23,19 @@ llama_new_context_with_model: KV self size = 5369.91 MiB, c^KV (f16): 5369.91 M
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-05-30** at **06:45:10**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-30** at **06:45:10**
---
-👤 **saood06** commented the **2025-05-30** at **06:51:24**:
+👤 **ikawrakow** commented on **2025-05-30** at **06:46:17**
+
+I have missed the double printing of the KV cache size. Do you want to fix it in this PR?
+
+---
+
+👤 **saood06** commented on **2025-05-30** at **06:51:24**
> I have missed the double printing of the KV cache size. Do you want to fix it in this PR?
@@ -34,116 +43,92 @@ Sure. I'll fix that and an indentation mistake in the commit I made.
---
-👤 **ikawrakow** submitted a review the **2025-05-30** at **07:28:18**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-30** at **07:28:18**
---
-👤 **saood06** commented the **2025-05-30** at **07:30:43**:
+👤 **saood06** commented on **2025-05-30** at **07:30:43**
-Can you just confirm that there is no V-cache for all modes of MLA when flash attention is enabled? I never used type 2 and an earlier PR (#246) says that even without flash attention it doesn't have a V-cache which seems wrong to me.
+Can you just confirm that there is no V-cache for all modes of MLA when flash attention is enabled? I never used type 2 and an earlier PR ([#246](https://github.com/ikawrakow/ik_llama.cpp/issues/246)) says that even without flash attention it doesn't have a V-cache which seems wrong to me.
---
-👤 **ikawrakow** commented the **2025-05-30** at **07:35:47**:
+👤 **ikawrakow** commented on **2025-05-30** at **07:35:47**
There is V cache with MLA=1, no FA. In that case the V portion of K gets transposed and stored in the V cache.
---
-👤 **ikawrakow** commented the **2025-05-30** at **08:01:39**:
-
-MLA=2 has no V cache with or without FA.
-
----
-
-👤 **saood06** commented the **2025-05-30** at **08:06:51**:
+👤 **saood06** commented on **2025-05-30** at **07:43:49**
-> MLA=2 has no V cache with or without FA.
+> There is V cache with MLA=1, no FA. In that case the V portion of K gets transposed and stored in the V cache.
-Do you mind fixing that then, since I wrongfully assumed MLA+FA meant no V-cache.
-
----
-
-👤 **saood06** submitted a review the **2025-05-30** at **15:24:23**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** submitted a review the **2025-05-30** at **15:56:29**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-05-30** at **15:56:29** on `src/llama.cpp`:
-
-Or we simply deprecate MLA=2. The only purpose of it was to have faster prompt processing on CUDA without needing a V cache. Now that there is a FA kernel for head sizes 576,512 also on CUDA, there is basically no point in having MLA=2. I also see many people still using it, which means they are getting lower TG performance.
-
----
-
-👤 **saood06** submitted a review the **2025-05-30** at **16:03:41**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-05-30** at **16:03:41** on `src/llama.cpp`:
-
->Or we simply deprecate MLA=2.
+I understand that, and the code I commited makes the assumption that flash attention plus MLA means no V-cache, MLA without flash attention has a V-cache but still gets printed differently as it is the latent representation of the cache (thus `c^KV`).
-Why is MLA=1 being kept? Is there any reason not to use MLA=3? So why not just make MLA a toggle again.
-
----
-
-👤 **ikawrakow** submitted a review the **2025-05-30** at **16:20:40**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** submitted a review the **2025-05-30** at **16:25:20**: 💬 `COMMENTED`
+I was mostly asking about this:
+
+>mla = 2, fa = 0: FlashMLA . Works only on the CPU and on CUDA. Only small K cache required (the transposed V cache is computed on the fly)
+
+in the linked PR which seems like a typo.
---
-👤 **ikawrakow** commented during a code review the **2025-05-30** at **16:25:20** on `src/llama.cpp`:
-
-MLA=3 has the disadvantage that one needs an additional compute buffer that can become quite large for a long context and a large u-batch size. This can be mitigated with `-amb`, but if one is really operating on the limits of available RAM/VRAM, one may swallow the lower prompt processing performance and use MLA=1 (and for short contexts there isn't much of a difference between MLA=1 and MLA=3)
+👤 **ikawrakow** commented on **2025-05-30** at **08:01:39**
----
-
-👤 **saood06** submitted a review the **2025-05-30** at **16:25:54**: 💬 `COMMENTED`
+MLA=2 has no V cache with or without FA.
---
-👤 **saood06** commented during a code review the **2025-05-30** at **16:25:54** on `src/llama.cpp`:
+👤 **saood06** commented on **2025-05-30** at **08:06:51**
-> Mainly to be able to run in the same way as mainline, I guess.
+> MLA=2 has no V cache with or without FA.
-If that is now the main motivation, it might make sense to move it behind a compatibility flag since MLA=3 is such a sane default.
+Do you mind fixing that then, since I wrongfully assumed MLA+FA meant no V-cache.
---
-👤 **saood06** submitted a review the **2025-05-30** at **16:28:30**: 💬 `COMMENTED`
+👤 **saood06** started a conversation on `src/llama.cpp` on **2025-05-30** at **15:24:23**
----
+Given what you said about MLA=2, I don't think this holds. Instead of updating this, I do think passing both would be better even though it is technically a breaking change (unlike previous one which was backwards compatible).
-👤 **saood06** commented during a code review the **2025-05-30** at **16:28:30** on `src/llama.cpp`:
+> 👤 **ikawrakow** replied on **2025-05-30** at **15:56:29**
+>
+> Or we simply deprecate MLA=2. The only purpose of it was to have faster prompt processing on CUDA without needing a V cache. Now that there is a FA kernel for head sizes 576,512 also on CUDA, there is basically no point in having MLA=2. I also see many people still using it, which means they are getting lower TG performance.
-> MLA=3 has the disadvantage that one needs an additional compute buffer that can become quite large for a long context and a large u-batch size. This can be mitigated with `-amb`, but if one is really operating on the limits of available RAM/VRAM, one may swallow the lower prompt processing performance and use MLA=1 (and for short contexts there isn't much of a difference between MLA=1 and MLA=3)
-
-That makes sense then maybe a memory optimized flag not compatibility?
+> 👤 **saood06** replied on **2025-05-30** at **16:03:41**
+>
+> >Or we simply deprecate MLA=2.
+>
+> Why is MLA=1 being kept? Is there any reason not to use MLA=3? So why not just make MLA a toggle again.
----
+> 👤 **ikawrakow** replied on **2025-05-30** at **16:20:40**
+>
+> > Why is MLA=1 being kept?
+>
+> Good question. Mainly to be able to run in the same way as mainline, I guess.
-👤 **ikawrakow** submitted a review the **2025-05-30** at **16:34:16**: 💬 `COMMENTED`
+> 👤 **ikawrakow** replied on **2025-05-30** at **16:25:20**
+>
+> MLA=3 has the disadvantage that one needs an additional compute buffer that can become quite large for a long context and a large u-batch size. This can be mitigated with `-amb`, but if one is really operating on the limits of available RAM/VRAM, one may swallow the lower prompt processing performance and use MLA=1 (and for short contexts there isn't much of a difference between MLA=1 and MLA=3)
----
+> 👤 **saood06** replied on **2025-05-30** at **16:25:54**
+>
+> > Mainly to be able to run in the same way as mainline, I guess.
+>
+> If that is now the main motivation, it might make sense to move it behind a compatibility flag since MLA=3 is such a sane default.
-👤 **ikawrakow** commented during a code review the **2025-05-30** at **16:34:16** on `src/llama.cpp`:
+> 👤 **saood06** replied on **2025-05-30** at **16:28:30**
+>
+> > MLA=3 has the disadvantage that one needs an additional compute buffer that can become quite large for a long context and a large u-batch size. This can be mitigated with `-amb`, but if one is really operating on the limits of available RAM/VRAM, one may swallow the lower prompt processing performance and use MLA=1 (and for short contexts there isn't much of a difference between MLA=1 and MLA=3)
+>
+> That makes sense then maybe a memory optimized flag not compatibility?
-`-mla fast` and `-mla mem` ?
+> 👤 **ikawrakow** replied on **2025-05-30** at **16:34:16**
+>
+> `-mla fast` and `-mla mem` ?
----
-
-👤 **saood06** submitted a review the **2025-05-30** at **17:06:07**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-05-30** at **17:06:07** on `src/llama.cpp`:
-
-> `-mla fast` and `-mla mem` ?
-
-That sounds good.
\ No newline at end of file
+> 👤 **saood06** replied on **2025-05-30** at **17:06:07**
+>
+> > `-mla fast` and `-mla mem` ?
+>
+> That sounds good.
\ No newline at end of file
diff --git a/github-data/pull_requests/475 - Metal implementatio for the trellis quants..md b/github-data/pull_requests/475 - Metal implementatio for the trellis quants.md
similarity index 51%
rename from github-data/pull_requests/475 - Metal implementatio for the trellis quants..md
rename to github-data/pull_requests/475 - Metal implementatio for the trellis quants.md
index 96923182a..fde5762c0 100644
--- a/github-data/pull_requests/475 - Metal implementatio for the trellis quants..md
+++ b/github-data/pull_requests/475 - Metal implementatio for the trellis quants.md
@@ -1,14 +1,17 @@
-### 🔀 [#475](https://github.com/ikawrakow/ik_llama.cpp/pull/475) - Metal implementatio for the trellis quants.
+## 🔀 [Pull Request #475](https://github.com/ikawrakow/ik_llama.cpp/pull/475) - Metal implementatio for the trellis quants.
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/trellis_metal` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-30 |
| **Updated** | 2025-06-01 |
+| **Merged** | 2025-06-01 |
---
-#### Description
+## 📄 Description
`IQ2_KT` and `IQ3_KT` work. `IQ2_KT` has a pretty decent performance.
diff --git a/github-data/pull_requests/478 - forgotten refs and typo.md b/github-data/pull_requests/478 - forgotten refs and typo.md
index 81e3a4517..cf7e0a1a9 100644
--- a/github-data/pull_requests/478 - forgotten refs and typo.md
+++ b/github-data/pull_requests/478 - forgotten refs and typo.md
@@ -1,14 +1,17 @@
-### 🔀 [#478](https://github.com/ikawrakow/ik_llama.cpp/pull/478) - forgotten refs and typo
+## 🔀 [Pull Request #478](https://github.com/ikawrakow/ik_llama.cpp/pull/478) - forgotten refs and typo
| **Author** | `Nexesenex` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `forgotten_refs_and_typo` |
+| **Target Branch** | `main` |
| **Created** | 2025-05-30 |
| **Updated** | 2025-07-02 |
+| **Merged** | 2025-05-31 |
---
-#### Description
+## 📄 Description
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
@@ -18,6 +21,6 @@
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-05-31** at **04:36:44**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-05-31** at **04:36:44**
\ No newline at end of file
diff --git a/github-data/pull_requests/48 - AVX2 Flash Attention.md b/github-data/pull_requests/48 - AVX2 Flash Attention.md
index 959ffe45e..35194fd27 100644
--- a/github-data/pull_requests/48 - AVX2 Flash Attention.md
+++ b/github-data/pull_requests/48 - AVX2 Flash Attention.md
@@ -1,14 +1,17 @@
-### 🔀 [#48](https://github.com/ikawrakow/ik_llama.cpp/pull/48) - AVX2 Flash Attention
+## 🔀 [Pull Request #48](https://github.com/ikawrakow/ik_llama.cpp/pull/48) - AVX2 Flash Attention
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/avx2_flash_attn` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-10 |
| **Updated** | 2024-09-10 |
+| **Merged** | 2024-09-10 |
---
-#### Description
+## 📄 Description
We don't gain as much as on a Zen4 system as there aren't as many vector registers, so we need to load/store data much more often. Still, we do get a small gain in performance.
diff --git a/github-data/pull_requests/480 - Rpc improvement.md b/github-data/pull_requests/480 - Rpc improvement.md
index a8e2942e2..ca9b0cb3a 100644
--- a/github-data/pull_requests/480 - Rpc improvement.md
+++ b/github-data/pull_requests/480 - Rpc improvement.md
@@ -1,14 +1,17 @@
-### 🔀 [#480](https://github.com/ikawrakow/ik_llama.cpp/pull/480) - Rpc improvement
+## 🔀 [Pull Request #480](https://github.com/ikawrakow/ik_llama.cpp/pull/480) - Rpc improvement
| **Author** | `firecoperana` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `rpc_improvement` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-01 |
| **Updated** | 2025-06-25 |
+| **Merged** | 2025-06-08 |
---
-#### Description
+## 📄 Description
Include various improvement of rpc from mainline including:
1. adding rpc backend to override tensor option
@@ -25,21 +28,21 @@ Include various improvement of rpc from mainline including:
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-06-01** at **02:58:58**:
+👤 **saood06** commented on **2025-06-01** at **02:58:58**
-Has this been tested? If so with what models and backends and what configurations. I attempted a similar PR a while ago, see #193 and when tested it did not work with Qwen2.5 72B since on mainline the PR that added "non-512 aligned tensors" was created to add support for that model. I also found that using KV cache quantization still did not work with RPC with or without #193.
+Has this been tested? If so with what models and backends and what configurations. I attempted a similar PR a while ago, see [#193](https://github.com/ikawrakow/ik_llama.cpp/issues/193) and when tested it did not work with Qwen2.5 72B since on mainline the PR that added "non-512 aligned tensors" was created to add support for that model. I also found that using KV cache quantization still did not work with RPC with or without [#193](https://github.com/ikawrakow/ik_llama.cpp/issues/193).
---
-👤 **ikawrakow** commented the **2025-06-01** at **05:43:47**:
+👤 **ikawrakow** commented on **2025-06-01** at **05:43:47**
I don't use RPC, so need other people to confirm that this works.
---
-👤 **saood06** commented the **2025-06-01** at **06:20:06**:
+👤 **saood06** commented on **2025-06-01** at **06:20:06**
> I don't use RPC, so need other people to confirm that this works.
@@ -47,31 +50,49 @@ I don't mind testing and reviewing this but before I do, I want to know what new
---
-👤 **firecoperana** commented the **2025-06-01** at **12:45:44**:
+👤 **firecoperana** commented on **2025-06-01** at **12:45:44**
I tested various quants of Deepseek v2.5, v3, v3 0324 models and it works. V3 0324 is the one with MLA support from mainline. Didn't test other models as I don't use them on this repo.
---
-👤 **firecoperana** commented the **2025-06-01** at **13:08:24**:
+👤 **saood06** commented on **2025-06-01** at **12:53:03**
+
+> I tested various quants of Deepseek v2.5, v3, v3 0324 models and it works. V3 0324 is the one with MLA support from mainline. Didn't test other models as I don't use them on this repo.
+
+Did you test with `-ot` or cache quantization? Do you mind sharing performance and what hardware you used?
+
+---
+
+👤 **firecoperana** commented on **2025-06-01** at **13:08:24**
My main machine is 3090 with 128GB ddr4. I did -ot to override individual expert tensors to my other machines with ddr4 3000mhz ram and 3060, and with --cache-type-k q8_0 and batch size of 512, in which case I can load the whole model into either vram and ram. I use cpu RPC backend to use ram from remote machines. For Deepseek V3 Q2_K_XL, I can get 10 it/s for pp and 3 it/s for inferencing. Deepseek V2.5 Q4 is about 6-7 it/s for inferencing.
---
-👤 **firecoperana** commented the **2025-06-01** at **13:24:34**:
+👤 **saood06** commented on **2025-06-01** at **13:16:15**
+
+> My main machine is 3090 with 128GB ddr4. I did -ot to override individual expert tensors to my other machines with ddr4 3000mhz ram and 3060, and with --cache-type-k q8_0 and batch size of 512, in which case I can load the whole model into either vram and ram. I use cpu RPC backend to use ram from remote machines. For Deepseek V3 Q2_K_XL, I can get 10 it/s for pp and 3 it/s for inferencing. Deepseek V2.5 Q4 is about 6-7 it/s for inferencing.
+
+Thank you for the details. For now I'll do some testing on Deepseek, with an RPC backend on my 3090 and `-ot`, with the rest of the model in RAM on the DDR4 server I usually use for inference.
+
+For reference with my pure IQ4_K_R4 I get similar speeds you get with RPC for both PP and TG so hopefully with RPC it can improve (and since those quants are now supported on CUDA, I won't need to make a new quant).
+
+---
+
+👤 **firecoperana** commented on **2025-06-01** at **13:24:34**
Be sure to set -t n -c in cpu backend, where n is the number of threads you want the tensors in ram to use. -c is to load tensors from local files next time. This is useful if you have slow LAN transfer speed.
---
-👤 **ikawrakow** commented the **2025-06-02** at **09:25:04**:
+👤 **ikawrakow** commented on **2025-06-02** at **09:25:04**
No user feedback here, so new strategy: I'll merge this tomorrow. If we don't get bug reports, all is good. If we do get bug reports, all is good too because we know that it needs further work.
---
-👤 **saood06** commented the **2025-06-02** at **10:15:59**:
+👤 **saood06** commented on **2025-06-02** at **10:15:59**
> No user feedback here, so new strategy: I'll merge this tomorrow. If we don't get bug reports, all is good. If we do get bug reports, all is good too because we know that it needs further work.
@@ -79,21 +100,45 @@ I haven't found the time to test this, but I do plan to, in the next few days. (
---
-👤 **firecoperana** commented the **2025-06-08** at **13:42:29**:
+👤 **ikawrakow** commented on **2025-06-08** at **11:52:02**
+
+I get build errors after merging this PR, so reverted. Please fix and resubmit.
+
+---
+
+👤 **firecoperana** commented on **2025-06-08** at **13:42:29**
> I get build errors after merging this PR, so reverted. Please fix and resubmit.
-What's the error? Does the error happen when you set DGGML_RPC=OFF?
+What's the error? Does the error happen only when you set DGGML_RPC=OFF?
---
-👤 **firecoperana** commented the **2025-06-08** at **14:23:46**:
+👤 **ikawrakow** commented on **2025-06-08** at **13:48:08**
+
+```
+/home/iwan/other/ik_llama.cpp/common/common.cpp: In function ‘bool gpt_params_find_arg(int, char**, const std::string&, gpt_params&, int&, bool&)’:
+/home/iwan/other/ik_llama.cpp/common/common.cpp:1013:13: error: ‘ggml_backend_rpc_buffer_type’ was not declared in this scope; did you mean ‘ggml_backend_cpu_buffer_type’?
+ 1013 | ggml_backend_rpc_buffer_type(server.c_str());
+ | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ | ggml_backend_cpu_buffer_type
+/home/iwan/other/ik_llama.cpp/common/common.cpp:1016:9: error: ‘ggml_backend_rpc_buffer_type’ was not declared in this scope; did you mean ‘ggml_backend_cpu_buffer_type’?
+ 1016 | ggml_backend_rpc_buffer_type(servers.c_str());
+ | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ | ggml_backend_cpu_buffer_type
+```
+
+This is with `-DGGML_RPC=OFF`
+
+---
+
+👤 **firecoperana** commented on **2025-06-08** at **14:23:46**
Fixed
---
-👤 **saood06** commented the **2025-06-22** at **20:52:33**:
+👤 **saood06** commented on **2025-06-22** at **20:52:33**
Finally got around to testing this. It seems functional (sweep-bench testing only), but I couldn't get any performance advantage from offloading Deepseek-V3 based models via RPC to my 3090. I know when I tested that on mainline I also noticed a performance regression (that went up with the more I offloaded).
@@ -105,7 +150,7 @@ I may revisit when I eventually get an infiniband connection between the two com
---
-👤 **firecoperana** commented the **2025-06-23** at **00:37:44**:
+👤 **firecoperana** commented on **2025-06-23** at **00:37:44**
> Finally got around to testing this. It seems functional (sweep-bench testing only), but I couldn't get any performance advantage from offloading Deepseek-V3 based models via RPC to my 3090. I know when I tested that on mainline I also noticed a performance regression (that went up with the more I offloaded).
>
@@ -119,7 +164,7 @@ Can you add --tensor-split 0,99? This will make sure all non-expert layers are o
---
-👤 **saood06** commented the **2025-06-23** at **01:07:12**:
+👤 **saood06** commented on **2025-06-23** at **01:07:12**
> Can you add --tensor-split 0,99? This will make sure all non-expert layers are offloaded to RPC machine. You could try to offload expert layers to your 3090 with blk.(12|13).ffn_.*_exps=RPC[10.0.0.250:50052] to fully use 3090's VRAM.
@@ -313,7 +358,7 @@ There definitely was a lot of network traffic happening during inference. I don'
---
-👤 **HariboApfel** commented the **2025-06-25** at **08:06:14**:
+👤 **HariboApfel** commented on **2025-06-25** at **08:06:14**
i am encountering abysmal performance with ik_llama and rpc. I "assume" its rpc related.
@@ -367,12 +412,116 @@ any help would be apprichiated.
---
-👤 **ikawrakow** commented the **2025-06-25** at **10:45:43**:
+👤 **ikawrakow** commented on **2025-06-25** at **08:53:42**
+
+I never use RPC, so somebody else should comment.
+
+---
+
+👤 **HariboApfel** commented on **2025-06-25** at **10:17:36**
+
+i got rpc atleast working after redoing the arguments from the 1st post.
+
+using
+
+`./ik_llama.cpp/build/bin/llama-cli \
+ --model models/ubergarm/DeepSeek-R1-0528-GGUF/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
+ --threads 48 \
+ --n-gpu-layers 99 \
+ --temp 0.6 \
+ --top_p 0.95 \
+ --min_p 0.01 \
+ --ctx-size 16384 \
+ --flash-attn \
+ -mla 3 -fa \
+ -amb 512 \
+ -fmoe \
+ -ctk q8_0 \
+ -ot "blk\.(1|2|3|4|5|6)\.ffn_.*=CUDA0" \
+ -ot "blk\.(7|8|9|10)\.ffn_.*=CUDA1" \
+ -ot "blk\.(11|12|13|14)\.ffn_.*=CUDA2" \
+ -ot "blk\.(15|16|17|18)\.ffn_.*=CUDA3" \
+ --override-tensor exps=CPU \
+ --prompt`
+
+to only run it on one host gets me closer to "expected performance"
+
+`llama_print_timings: load time = 98945.88 ms
+llama_print_timings: sample time = 45.19 ms / 384 runs ( 0.12 ms per token, 8497.27 tokens per second)
+llama_print_timings: prompt eval time = 5969.18 ms / 224 tokens ( 26.65 ms per token, 37.53 tokens per second)
+llama_print_timings: eval time = 57680.32 ms / 383 runs ( 150.60 ms per token, 6.64 tokens per second)
+llama_print_timings: total time = 63916.49 ms / 607 tokens`
+
+
+with RPC
+
+`./ik_llama.cpp/build/bin/llama-cli \
+ --rpc "$RPC_SERVERS" \
+ --model models/ubergarm/DeepSeek-R1-0528-GGUF/IQ2_K_R4/DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf \
+ --threads 48 \
+ --n-gpu-layers 99 \
+ --temp 0.6 \
+ --top_p 0.95 \
+ --min_p 0.01 \
+ --ctx-size 16384 \
+ --flash-attn \
+ -mla 3 -fa \
+ -amb 512 \
+ -fmoe \
+ -ctk q8_0 \
+ -ot "blk\.(1|2|3|4|5|6)\.ffn_.*=CUDA0" \
+ -ot "blk\.(7|8|9|10)\.ffn_.*=CUDA1" \
+ -ot "blk\.(11|12|13|14)\.ffn_.*=CUDA2" \
+ -ot "blk\.(15|16|17|18)\.ffn_.*=CUDA3" \
+ -ot "blk\.(19|20|21|22)\.ffn_.*=RPC[10.0.0.28:50052]" \
+ -ot "blk\.(23|24|25|26)\.ffn_.*=RPC[10.0.0.28:50053]" \
+ -ot "blk\.(27|28|29|30)\.ffn_.*=RPC[10.0.0.28:50054]" \
+ -ot "blk\.(31|32|33|34)\.ffn_.*=RPC[10.0.0.28:50055]" \
+ -ot "blk\.(35|36|37|38)\.ffn_.*=RPC[10.0.0.40:50052]" \
+ -ot "blk\.(39|40|41|42)\.ffn_.*=RPC[10.0.0.40:50053]" \
+ -ot "blk\.(43|44|45|46)\.ffn_.*=RPC[10.0.0.40:50054]" \
+ -ot "blk\.(47|48|49|50)\.ffn_.*=RPC[10.0.0.40:50055]" \
+ --override-tensor exps=CPU \
+ --prompt`
+
+i get around 5.5T/s
+`
+llama_print_timings: load time = 568857.08 ms
+llama_print_timings: sample time = 963.77 ms / 7798 runs ( 0.12 ms per token, 8091.13 tokens per second)
+llama_print_timings: prompt eval time = 8689.40 ms / 224 tokens ( 38.79 ms per token, 25.78 tokens per second)
+llama_print_timings: eval time = 1420492.95 ms / 7797 runs ( 182.18 ms per token, 5.49 tokens per second)
+llama_print_timings: total time = 1432903.60 ms / 8021 tokens`
+
+which is still less then llama.cpp with the same rpc setting with the unsloth unsloth/DeepSeek-R1-0528-GGUF/UD-Q2_K_XL quant ??
+
+the only real difference would be the -ot setting there-
+for llama.cpp i use
+
+` --cache-type-k q4_0 \
+ --threads 48 \
+ --n-gpu-layers 99 \
+ --prio 3 \
+ --temp 0.6 \
+ --top_p 0.95 \
+ --min_p 0.01 \
+ --flash-attn \
+ --ctx-size 16384 \
+ -ot "\.(3[5-9]|4[0-9]|5[0-9]|6[0-9]|7[0-9]|8[0-9]|9[0-9]|[0-9][0-9][0-9])\.ffn_up_exps.=CPU" \
+ -no-cnv \
+ --prompt`
+
+giving me 7.5T/s
+
+i would have assumend IK_llama is faster.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-25** at **10:45:43**
You can use Unsloth's UD-Q2_K_XL model (or any model that works with `llama.cpp`) with `ik_llama.cpp` just fine, and that would be more of an apples-to-apples comparison. It would also be useful to use the same cache type if you are after a performance comparison.
---
-👤 **firecoperana** commented the **2025-06-25** at **16:12:42**:
+👤 **firecoperana** commented on **2025-06-25** at **16:12:42**
Also for fair comparison, please compare if allocation of vram buffer and layers for each gpu and cpu are the same as mainline. I use tensor-split to control the exact number of layers for each gpu. And note that ik_llama has different order for tensor split than llama.cpp.
\ No newline at end of file
diff --git a/github-data/pull_requests/481 - Webui improvement.md b/github-data/pull_requests/481 - Webui improvement.md
index bddc5ad86..c44e334fa 100644
--- a/github-data/pull_requests/481 - Webui improvement.md
+++ b/github-data/pull_requests/481 - Webui improvement.md
@@ -1,14 +1,17 @@
-### 🔀 [#481](https://github.com/ikawrakow/ik_llama.cpp/pull/481) - Webui improvement
+## 🔀 [Pull Request #481](https://github.com/ikawrakow/ik_llama.cpp/pull/481) - Webui improvement
| **Author** | `firecoperana` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `webui_improvement` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-01 |
| **Updated** | 2025-06-10 |
+| **Merged** | 2025-06-08 |
---
-#### Description
+## 📄 Description
Updating webui to a newer version, but not latest version
Some minor bug fix for webui
@@ -20,21 +23,41 @@ Some minor bug fix for webui
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-01** at **05:41:30**:
+👤 **ikawrakow** commented on **2025-06-01** at **05:41:30**
I need people to confirm that this works.
---
-👤 **saood06** commented the **2025-06-01** at **06:26:56**:
+👤 **saood06** commented on **2025-06-01** at **06:26:56**
I see options for DRY and XTC. Neither of which is currently supported here.
---
-👤 **ikawrakow** commented the **2025-06-01** at **07:32:23**:
+👤 **ikawrakow** commented on **2025-06-01** at **06:50:59**
+
+> I see options for DRY and XTC. Neither of which is currently supported here.
+
+Yes, I would have thought that one needs to pick up the changes in `common` before changing the server/WebUI/RPC. But picking up changes in `common` requires picking up changes in `llama`. But picking up changes in `llama` requires picking up changes in `ggml`. But picking up changes in `ggml` requires basically starting fresh and applying the hundreds of changes that I have done to `ggml`. But if I ever considered doing that, then it would be better to actually write my own...
+
+---
+
+👤 **saood06** commented on **2025-06-01** at **07:13:29**
+
+> > I see options for DRY and XTC. Neither of which is currently supported here.
+>
+> Yes, I would have thought that one needs to pick up the changes in `common` before changing the server/WebUI/RPC. But picking up changes in `common` requires picking up changes in `llama`. But picking up changes in `llama` requires picking up changes in `ggml`. But picking up changes in `ggml` requires basically starting fresh and applying the hundreds of changes that I have done to `ggml`. But if I ever considered doing that, then it would be better to actually write my own...
+
+Are you sure bringing over samplers is that difficult? There was a time when I wanted to bring over DRY ( I no longer care, min_p and temperature is all I use and n-sigma is the only one that if brought over I may end up using since it might be better at eliminating "bad" tokens at the tail than min_p is, but min_p works well enough that I doubt it would be that big of an improvement), and I looked into it, and the only major issue was that you would have to manually port it over because of the refactors that mainline has done, but it still seemed manageable, and much easier than starting from scratch.
+
+Edit: I want to clarify I saw DRY and XTC from the code. I haven't tested the new Webui.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-01** at **07:32:23**
Adding a sampler or two shouldn't be too hard. But
* This PR is a 12 kLOC change, so possibly being dependent on various other changes in `common` (or even `llama`?) to function correctly (I haven't checked, just guessing).
@@ -42,7 +65,7 @@ Adding a sampler or two shouldn't be too hard. But
---
-👤 **saood06** commented the **2025-06-01** at **08:05:34**:
+👤 **saood06** commented on **2025-06-01** at **08:05:34**
>Adding a sampler or two shouldn't be too hard. But
>[...]
@@ -56,7 +79,13 @@ I haven't checked either. I only looked through the code so far for this PR (and
---
-👤 **saood06** commented the **2025-06-01** at **12:38:48**:
+👤 **Ph0rk0z** commented on **2025-06-01** at **11:52:48**
+
+XTC is about the only way to remove top tokens which could be slop or refusals. Dry has it's issues, but is better than the other repeat penalties. min_p and temperature are fine for non creative stuff but otherwise they come up short. And no "just raise the temperature" isn't a solution.
+
+---
+
+👤 **saood06** commented on **2025-06-01** at **12:38:48**
> XTC is about the only way to remove top tokens which could be slop or refusals.
@@ -72,7 +101,7 @@ I disagree, min_p does fine at removing the "bad" tail end, and temperature work
---
-👤 **Ph0rk0z** commented the **2025-06-01** at **13:47:32**:
+👤 **Ph0rk0z** commented on **2025-06-01** at **13:47:32**
> since so much effort was made training the LLM to rank them in the order it did
@@ -91,11 +120,11 @@ Yes it does, as well as setting high top_K like 100. I use min_P of around .03 o
Absolutely kills the fun for me. We're coming at it from 2 different places. I want a realistic "personality" with no defined end goal. A chat videogame. You probably want a story that goes somewhere you have planned it to go.
-In either case, taking the sampling refactor from mainline probably does it all at once. It didn't look super easy from the PRs unfortunately. They did a lot of changes. Even trying to add tensor size printing, everything is all renamed or moved. IK not kidding about how they do that constantly.
+In either case, taking the sampling refactor from mainline probably does it all at once. It didn't look super easy from the PRs unfortunately. They did a lot of changes. Even when I was trying to add tensor size printing, everything is all renamed or moved. IK not kidding about how they do that constantly.
---
-👤 **saood06** commented the **2025-06-01** at **14:40:33**:
+👤 **saood06** commented on **2025-06-01** at **14:40:33**
> > since so much effort was made training the LLM to rank them in the order it did
>
@@ -130,19 +159,37 @@ Yeah, it doesn't look easy, I didn't look into it with the purpose of bringing i
---
-👤 **Ph0rk0z** commented the **2025-06-01** at **18:06:57**:
+👤 **Ph0rk0z** commented on **2025-06-01** at **18:06:57**
Have not tried top n sigma since it's only in mainline and generally I use EXL2 for normal sized models. I've been meaning to load up command-A or gemma and give it a whirl. All the "meme" sampling missing here is a bit of a drawback. I initially didn't even realize that it was forked pre dry/xtc and was confused why Deepseek 2.5 was looping so badly. Its like you have to choose between usable speed (close to fully offloaded dense model) or functionality.
---
-👤 **ikawrakow** commented the **2025-06-02** at **09:24:33**:
+👤 **ikawrakow** commented on **2025-06-02** at **06:07:35**
+
+> Its like you have to choose between usable speed (close to fully offloaded dense model) or functionality
+
+Interesting take. Isn't usable speed one of the most important functionalities of an LLM inference toolkit?
+
+---
+
+👤 **ikawrakow** commented on **2025-06-02** at **09:24:33**
No user feedback here, so new strategy: I'll merge this tomorrow. If we don't get bug reports, all is good. If we do get bug reports, all is good too because we know that it needs further work.
---
-👤 **Ph0rk0z** commented the **2025-06-02** at **11:21:04**:
+👤 **saood06** commented on **2025-06-02** at **10:25:34**
+
+> No user feedback here, so new strategy: I'll merge this tomorrow. If we don't get bug reports, all is good. If we do get bug reports, all is good too because we know that it needs further work.
+
+The DRY/XTC options in the UI this adds can't function. I don't think there is a need to test that, those samplers do not exist here, so the UI exposing them should be removed before this is added (or the samplers could be added I guess).
+
+The other thing I found when looking at the source code is the bug report button goes to mainline and not here.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-06-02** at **11:21:04**
> Isn't usable speed one of the most important functionalities of an LLM inference toolkit?
@@ -156,7 +203,7 @@ So it's making me badly want to port the QOL stuff. It mirrors LLMs where a mode
---
-👤 **ikawrakow** commented the **2025-06-02** at **12:53:48**:
+👤 **ikawrakow** commented on **2025-06-02** at **12:53:48**
> So it's making me badly want to port the QOL stuff. It mirrors LLMs where a model will be great and then has that one thing you want to change.
@@ -164,31 +211,51 @@ I would love that, and I'm sure many users will too.
---
-👤 **Ph0rk0z** commented the **2025-06-02** at **15:29:07**:
+👤 **Ph0rk0z** commented on **2025-06-02** at **15:29:07**
Ok.. well it seemed easy enough until I hit the portion where they refactored everything into args.h/args.cpp. So all those new things you added aren't in ctx params anymore. Some time around September. Looks fun, doesn't it? https://github.com/ggml-org/llama.cpp/commit/bfe76d4a17228bfd1565761f203123bc4914771b
---
-👤 **ikawrakow** commented the **2025-06-03** at **06:34:03**:
+👤 **ikawrakow** commented on **2025-06-03** at **06:34:03**
-@Ph0rk0z See #486 for the XTC sampler
+@Ph0rk0z See [#486](https://github.com/ikawrakow/ik_llama.cpp/issues/486) for the XTC sampler
---
-👤 **Ph0rk0z** commented the **2025-06-03** at **11:27:29**:
+👤 **Ph0rk0z** commented on **2025-06-03** at **11:27:29**
Ha! Last night I cherry picked and got the refactor working. Got as far as DRY and XTC. I didn't post it yet because I somehow bugged the seed to where it it might not be randomizing on re-rolls. I was gonna keep going after a night of sleep. Adding sigma was good because its way up there, past yet another refactor.
---
-👤 **pt13762104** commented the **2025-06-05** at **02:39:22**:
+👤 **Ph0rk0z** commented on **2025-06-03** at **11:53:08**
+
+https://github.com/Ph0rk0z/ik_llama.cpp/branches
+
+Btw, there is a branch where it's only refactored to separate out the sampling. Furthest ahead one is the DRY one. Still didn't delete the args.cpp nor fixed the Makefile changes mainline did but you get the gist. Is any of that worth doing?
+
+---
+
+👤 **ikawrakow** commented on **2025-06-03** at **12:27:30**
+
+Too much change for my taste. The DRY one is 8631+ LOC, 4089- LOC. The XTC one is 7687+, 4020-. This would require a lot of testing. My PR's are in the 70-90 LOC each. The DRY would be a bit bigger, but not sure if it is worth it.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-06-03** at **13:25:06**
+
+Yep, it is a ton of changes. They add a lot of code in a year. I'm surprised it worked at all. Much of it is related to all the examples too. Even here, 60 files changed for the webui.
+
+---
+
+👤 **pt13762104** commented on **2025-06-05** at **02:39:22**
Clicking the save button in settings doesn't exit it out like llama.cpp
---
-👤 **ikawrakow** commented the **2025-06-05** at **06:44:33**:
+👤 **ikawrakow** commented on **2025-06-05** at **06:44:33**
> Clicking the save button in settings doesn't exit it out like llama.cpp
@@ -196,7 +263,13 @@ Thanks for testing. Apart from this, does it work for you?
---
-👤 **firecoperana** commented the **2025-06-07** at **23:02:59**:
+👤 **pt13762104** commented on **2025-06-05** at **14:34:51**
+
+It works... At least I found it can respond properly and show TPS. Might need more testing.
+
+---
+
+👤 **firecoperana** commented on **2025-06-07** at **23:02:59**
> Clicking the save button in settings doesn't exit it out like llama.cpp
@@ -204,51 +277,51 @@ I think the issue is because you used the newest version of webui from mainline
---
-👤 **pt13762104** commented the **2025-06-08** at **02:07:43**:
+👤 **pt13762104** commented on **2025-06-08** at **02:07:43**
I'll try, thanks
---
-👤 **saood06** commented the **2025-06-08** at **05:02:29**:
+👤 **saood06** commented on **2025-06-08** at **05:02:29**
@firecoperana
-If you are interested I added a new endpoint to server that could be utilized by this front end (#502). I already added support to my preferred front end and it has been nice being able to see all my stored sessions and restore them with ease (saving and restoring support already existed but there was no good way to add it to a UI without being able to list what is saved which is what I added).
+If you are interested I added a new endpoint to server that could be utilized by this front end ([#502](https://github.com/ikawrakow/ik_llama.cpp/issues/502)). I already added support to my preferred front end and it has been nice being able to see all my stored sessions and restore them with ease (saving and restoring support already existed but there was no good way to add it to a UI without being able to list what is saved which is what I added).
---
-👤 **iehgit** commented the **2025-06-08** at **08:04:31**:
+👤 **iehgit** commented on **2025-06-08** at **08:04:31**
Works fine (multiple conversations, display of token rate). Huge improvement over the old UI, which made you choose between prompt formats that didn't fit to current models.
---
-👤 **firecoperana** commented the **2025-06-08** at **15:21:03**:
+👤 **firecoperana** commented on **2025-06-08** at **15:21:03**
> @firecoperana
>
-> If you are interested I added a new endpoint to server that could be utilized by this front end (#502). I already added support to my preferred front end and it has been nice being able to see all my stored sessions and restore them with ease (saving and restoring support already existed but there was no good way to add it to a UI without being able to list what is saved which is what I added).
+> If you are interested I added a new endpoint to server that could be utilized by this front end ([#502](https://github.com/ikawrakow/ik_llama.cpp/issues/502)). I already added support to my preferred front end and it has been nice being able to see all my stored sessions and restore them with ease (saving and restoring support already existed but there was no good way to add it to a UI without being able to list what is saved which is what I added).
I will try when I have time. That looks very helpful!
---
-👤 **saood06** commented the **2025-06-09** at **09:23:32**:
+👤 **saood06** commented on **2025-06-09** at **09:23:32**
@ikawrakow
-What is your opinion on having another alternative frontend besides the one implemented here. The one I use has what seems like an abandoned maintainer so I have no where to upstream my changes.
+What is your opinion on having another additional (alternative like legacy) frontend besides the one implemented here. The one I use has what seems like an abandoned maintainer so I have nowhere to upstream my changes.
---
-👤 **ikawrakow** commented the **2025-06-09** at **10:22:37**:
+👤 **ikawrakow** commented on **2025-06-09** at **10:22:37**
So you want to bring in to this repository your favorite frontend and maintain it here?
---
-👤 **saood06** commented the **2025-06-09** at **10:39:34**:
+👤 **saood06** commented on **2025-06-09** at **10:39:34**
> So you want to bring in to this repository your favorite frontend and maintain it here?
@@ -256,28 +329,71 @@ Yes.
---
-👤 **ikawrakow** commented the **2025-06-09** at **11:22:13**:
+👤 **ikawrakow** commented on **2025-06-09** at **10:45:37**
+
+Can I take a look?
+
+---
+
+👤 **saood06** commented on **2025-06-09** at **10:51:01**
+
+> Can I take a look?
+
+For now what is public is https://github.com/lmg-anon/mikupad/pull/113. But I have more that isn't public as it works but is not polished (like adding [#502](https://github.com/ikawrakow/ik_llama.cpp/issues/502) and [#504](https://github.com/ikawrakow/ik_llama.cpp/issues/504) ) and other things in the roadmap.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-09** at **11:07:34**
+
+It doesn't look like a very big project, so from that point of view, sure.
+
+But what about license and such?
+
+Why do you prefer do have it here instead of just a separate fork?
+
+---
+
+👤 **saood06** commented on **2025-06-09** at **11:19:25**
+
+> It doesn't look like a very big project, so from that point of view, sure
+> But what about license and such?
+
+It has a very permissible license, which allows for it to be here from how I read it. ( https://github.com/lmg-anon/mikupad/blob/main/LICENSE )
+
+> Why do you prefer do have it here instead of just a separate fork?
+
+I plan to maintain it following the feature support here, and there are changes that would make it integrate better here that I am planning.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-09** at **11:22:13**
I know CC0 is very permissive. What I don't know is how one mixes it with MIT. I.e., do we need to update the license file and such.
---
-👤 **saood06** commented the **2025-06-09** at **11:29:42**:
+👤 **saood06** commented on **2025-06-09** at **11:29:42**
> I know CC0 is very permissive. What I don't know is how one mixes it with MIT. I.e., do we need to update the license file and such.
-I think we can just add a CC0 section to the license file, that specifies the location of it. I will add and maintain an authors file.
+I think we can just add a CC0 section to the license file, that specifies the location of it. I could add and maintain an authors file.
---
-👤 **ikawrakow** commented the **2025-06-09** at **11:31:36**:
+👤 **ikawrakow** commented on **2025-06-09** at **11:31:36**
OK, go ahead.
---
-👤 **saood06** commented the **2025-06-09** at **11:38:39**:
+👤 **saood06** commented on **2025-06-09** at **11:38:39**
> OK, go ahead.
-Thanks, I will submit the PR when it is ready.
\ No newline at end of file
+Thanks, I will submit the PR when it is ready.
+
+---
+
+👤 **pt13762104** commented on **2025-06-09** at **13:34:49**
+
+Finally, some decent UI. Now I can ditch openwebui again. I can't just use the old UI, i don't even know where to start. This made my day
\ No newline at end of file
diff --git a/github-data/pull_requests/482 - Trellis quants_ faster CPU prompt processing.md b/github-data/pull_requests/482 - Trellis quants faster CPU prompt processing.md
similarity index 82%
rename from github-data/pull_requests/482 - Trellis quants_ faster CPU prompt processing.md
rename to github-data/pull_requests/482 - Trellis quants faster CPU prompt processing.md
index fb00161e7..09b7dd053 100644
--- a/github-data/pull_requests/482 - Trellis quants_ faster CPU prompt processing.md
+++ b/github-data/pull_requests/482 - Trellis quants faster CPU prompt processing.md
@@ -1,14 +1,17 @@
-### 🔀 [#482](https://github.com/ikawrakow/ik_llama.cpp/pull/482) - Trellis quants: faster CPU prompt processing
+## 🔀 [Pull Request #482](https://github.com/ikawrakow/ik_llama.cpp/pull/482) - Trellis quants: faster CPU prompt processing
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/dequant_gemm` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-01 |
| **Updated** | 2025-06-01 |
+| **Merged** | 2025-06-01 |
---
-#### Description
+## 📄 Description
The trellis quants `IQ2_KT, IQ3_KT, IQ4_KT` are very slow on the CPU. On the main branch using BLAS results in a better prompt processing performance. But BLAS is slower for basically all other data types, so that's not a good idea.
diff --git a/github-data/pull_requests/483 - convert_hf_to_gguf.py _ conversion from hf weights to Q6_0.md b/github-data/pull_requests/483 - convert_hf_to_gguf.py _ conversion from hf weights to Q6_0.md
deleted file mode 100644
index 68225f986..000000000
--- a/github-data/pull_requests/483 - convert_hf_to_gguf.py _ conversion from hf weights to Q6_0.md
+++ /dev/null
@@ -1,47 +0,0 @@
-### 🔀 [#483](https://github.com/ikawrakow/ik_llama.cpp/pull/483) - convert_hf_to_gguf.py : conversion from hf weights to Q6_0
-
-| **Author** | `Nexesenex` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-06-02 |
-| **Updated** | 2025-06-03 |
-
----
-
-#### Description
-
-This quantization script is obtained by making a sort of "cross multiplication" with the python code for q5_0, and the C code for q5_0 and q6_0 in order to get through trial and error the code for the q6_0 conversion script, this with the help of a 7xB parameters AI model.
-
-It was an interesting experiment!
-
-Tested on Llama 3.2 instruct 1B and Qwen 2.5 instruct 1.5B.
-Bitrate of this q6_0 conversion is 6.50BPW straight.
-PPL equivalent (+/-0.5%) to a regular q6_0 quant from a fp16 gguf.
-Inference is working as intended in my Croco.cpp.
-
-- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
-- Self-reported review complexity:
- - [ ] Low
- - [x] Medium
- - [ ] High
-
----
-
-#### 💬 Conversation
-
-👤 **ikawrakow** submitted a review the **2025-06-02** at **09:21:49**: 💬 `COMMENTED`
-
----
-
-👤 **Nexesenex** submitted a review the **2025-06-02** at **11:33:17**: 💬 `COMMENTED`
-
----
-
-👤 **Nexesenex** commented during a code review the **2025-06-02** at **11:33:17** on `convert_hf_to_gguf.py`:
-
-No, the q8_0 conversion ftype is not touched.
-This part of the code will just set the embeddings, output weight, attn_v, attn_k, or attn_qkv when it exists in q6_0 instead of q8_0 for the conversions in q5_0 and q5_1.
-
----
-
-👤 **ikawrakow** submitted a review the **2025-06-03** at **06:30:23**: ✅ `APPROVED`
\ No newline at end of file
diff --git a/github-data/pull_requests/483 - convert_hf_to_gguf.py conversion from hf weights to Q6_0.md b/github-data/pull_requests/483 - convert_hf_to_gguf.py conversion from hf weights to Q6_0.md
new file mode 100644
index 000000000..753f5854d
--- /dev/null
+++ b/github-data/pull_requests/483 - convert_hf_to_gguf.py conversion from hf weights to Q6_0.md
@@ -0,0 +1,46 @@
+## 🔀 [Pull Request #483](https://github.com/ikawrakow/ik_llama.cpp/pull/483) - convert_hf_to_gguf.py : conversion from hf weights to Q6_0
+
+| **Author** | `Nexesenex` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `conv_q6_0` |
+| **Target Branch** | `main` |
+| **Created** | 2025-06-02 |
+| **Updated** | 2025-06-03 |
+| **Merged** | 2025-06-03 |
+
+---
+
+## 📄 Description
+
+This quantization script is obtained by making a sort of "cross multiplication" with the python code for q5_0, and the C code for q5_0 and q6_0 in order to get through trial and error the code for the q6_0 conversion script, this with the help of a 7xB parameters AI model.
+
+It was an interesting experiment!
+
+Tested on Llama 3.2 instruct 1B and Qwen 2.5 instruct 1.5B.
+Bitrate of this q6_0 conversion is 6.50BPW straight.
+PPL equivalent (+/-0.5%) to a regular q6_0 quant from a fp16 gguf.
+Inference is working as intended in my Croco.cpp.
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [ ] Low
+ - [x] Medium
+ - [ ] High
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** started a conversation on `convert_hf_to_gguf.py` on **2025-06-02** at **09:21:49**
+
+This will set it to `Q6_0` also for `Q8_0` conversion?
+
+> 👤 **Nexesenex** replied on **2025-06-02** at **11:33:17**
+>
+> No, the q8_0 conversion ftype is not touched.
+> This part of the code will just set the embeddings, output weight, attn_v, attn_k, or attn_qkv when it exists in ggml_type q6_0 instead of q8_0 for the conversions in ftypes q5_0 and q5_1.
+
+---
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-03** at **06:30:23**
\ No newline at end of file
diff --git a/github-data/pull_requests/484 - BF16 Trellis implementation.md b/github-data/pull_requests/484 - BF16 Trellis implementation.md
index 5e8182da2..aa0fedf91 100644
--- a/github-data/pull_requests/484 - BF16 Trellis implementation.md
+++ b/github-data/pull_requests/484 - BF16 Trellis implementation.md
@@ -1,14 +1,16 @@
-### 🔀 [#484](https://github.com/ikawrakow/ik_llama.cpp/pull/484) - BF16 Trellis implementation
+## 🔀 [Pull Request #484](https://github.com/ikawrakow/ik_llama.cpp/pull/484) - BF16 Trellis implementation
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/trellis_bf16` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-02 |
| **Updated** | 2025-06-19 |
---
-#### Description
+## 📄 Description
This PR adds a `bf16` CPU implementation for the trellis quants `IQ2_KT, IQ3_KT` and `IQ4_KT` for CPUs with native `bf16` support.
@@ -34,9 +36,315 @@ A similar optimization can be done for CPUs with native `fp16` support, but as I
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-03** at **04:22:51**:
+👤 **ubergarm** commented on **2025-06-02** at **21:16:37**
+
+Just did some a/b testing with llama-sweep-bench on my home rig using that new Qwen3-8B dense model distillation of R1-0528.
+
+1. The good news: The PR is definitely faster than main branch as my AMD 9950X has cpu flag `avx512_bf16`
+2. The bad news: Not sure what happened, but the first try with this branch it crashed with `ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed`. Worked fine the second time I ran it. Could *possibly* be my overclocked RAM but didn't see any other system instability and ran clean and finished on the second try without changing anything. Details below. *EDIT*: Was able to crash it again with a few more tries.
+3. *EDIT2*: It worked okay on the Intel Xeon 6980P doing perplexity and llama-sweep-bench on the Thread Ripper Pro mentioned below now.
+
+## Full GPU Offload
+
+
+## CPU Only
+
+
+
+
+👈 Details and Logs
+
+#### Test Quants
+```
+## DeepSeek-R1-0528-Qwen3-8B-IQ3_K
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type iq3_k: 72 tensors ffn_(gate|up)
+llama_model_loader: - type iq4_ks: 182 tensors everything else
+llm_load_print_meta: model size = 3.714 GiB (3.895 BPW)
+Final estimate: PPL = 11.7407 +/- 0.09382
+
+## DeepSeek-R1-0528-Qwen3-8B-IQ3_KT.gguf
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type iq3_kt: 72 tensors ffn_(gate|up)
+llama_model_loader: - type iq4_kt: 182 tensors everything else
+llm_load_print_meta: model size = 3.455 GiB (3.624 BPW)
+Final estimate: PPL = 12.2157 +/- 0.09915
+```
+
+#### llama-sweep-bench
+#### Full GPU Offload
+```bash
+$ git checkout main
+$ git rev-parse --short HEAD
+7a8abe29
+
+cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_SCHED_MAX_COPIES=1
+cmake --build build --config Release -j $(nproc)
+
+#model=/mnt/astrodata/llm/models/ubergarm/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-IQ3_K.gguf
+model=/mnt/astrodata/llm/models/ubergarm/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-IQ3_KT.gguf
+CUDA_VISIBLE_DEVICES="0" \
+./build/bin/llama-sweep-bench \
+ --model "$model" \
+ -fa \
+ -c 32768 \
+ -ngl 99 \
+ --threads 1 \
+ --warmup-batch
+```
+
+#### CPU Only
+```bash
+# main test case
+$ git checkout main
+$ git rev-parse --short HEAD
+7a8abe29
+
+# PR484 ik/trellis_bf16 test case
+$ git checkout ik/trellis_bf16
+$ git rev-parse --short HEAD
+061d064b
+
+cmake -B build -DGGML_CUDA=OFF -DGGML_BLAS=OFF
+cmake --build build --config Release -j $(nproc)
+
+# with and without -rtr test cases
+./build/bin/llama-sweep-bench \
+ --model "$model" \
+ -fa \
+ -c 8704 \
+ --threads 16 \
+ --warmup-batch
+```
+
+#### Full Crash Logs
+```
+model=/mnt/astrodata/llm/models/ubergarm/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-IQ3_KT.gguf
+
+./build/bin/llama-sweep-bench \
+ --model "$model" \
+ -fa \
+ -c 8704 \
+ --threads 16 \
+ --warmup-batch
+
+llama_model_loader: loaded meta data with 36 key-value pairs and 399 tensors from /mnt/astrodata/llm/models/ubergarm/DeepSeek-R1-0528-Qwen3-8B-GGUF/DeepSeek-R1-0528-Qwen3-8B-IQ3_KT.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = DeepSeek R1 0528 Qwen3 8B
+llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-0528-Qwen3
+llama_model_loader: - kv 4: general.size_label str = 8B
+llama_model_loader: - kv 5: general.license str = mit
+llama_model_loader: - kv 6: qwen3.block_count u32 = 36
+llama_model_loader: - kv 7: qwen3.context_length u32 = 131072
+llama_model_loader: - kv 8: qwen3.embedding_length u32 = 4096
+llama_model_loader: - kv 9: qwen3.feed_forward_length u32 = 12288
+llama_model_loader: - kv 10: qwen3.attention.head_count u32 = 32
+llama_model_loader: - kv 11: qwen3.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 12: qwen3.rope.freq_base f32 = 1000000.000000
+llama_model_loader: - kv 13: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 14: qwen3.attention.key_length u32 = 128
+llama_model_loader: - kv 15: qwen3.attention.value_length u32 = 128
+llama_model_loader: - kv 16: general.file_type u32 = 152
+llama_model_loader: - kv 17: qwen3.rope.scaling.type str = yarn
+llama_model_loader: - kv 18: qwen3.rope.scaling.factor f32 = 4.000000
+llama_model_loader: - kv 19: qwen3.rope.scaling.original_context_length u32 = 32768
+llama_model_loader: - kv 20: general.quantization_version u32 = 2
+llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 151643
+llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 151645
+llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 31: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 32: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-R1...
+llama_model_loader: - kv 33: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
+llama_model_loader: - kv 34: quantize.imatrix.entries_count i32 = 253
+llama_model_loader: - kv 35: quantize.imatrix.chunks_count i32 = 840
+llama_model_loader: - type f32: 145 tensors
+llama_model_loader: - type iq3_kt: 72 tensors
+llama_model_loader: - type iq4_kt: 182 tensors
+llm_load_vocab: special tokens cache size = 28
+llm_load_vocab: token to piece cache size = 0.9311 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = qwen3
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 151936
+llm_load_print_meta: n_merges = 151387
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 131072
+llm_load_print_meta: n_embd = 4096
+llm_load_print_meta: n_layer = 36
+llm_load_print_meta: n_head = 32
+llm_load_print_meta: n_head_kv = 8
+llm_load_print_meta: n_rot = 128
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 128
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 4
+llm_load_print_meta: n_embd_k_gqa = 1024
+llm_load_print_meta: n_embd_v_gqa = 1024
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 12288
+llm_load_print_meta: n_expert = 0
+llm_load_print_meta: n_expert_used = 0
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 2
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 1000000.0
+llm_load_print_meta: freq_scale_train = 0.25
+llm_load_print_meta: n_ctx_orig_yarn = 32768
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = ?B
+llm_load_print_meta: model ftype = IQ3_KT - 3.125 bpw
+llm_load_print_meta: model params = 8.191 B
+llm_load_print_meta: model size = 3.455 GiB (3.624 BPW)
+llm_load_print_meta: repeating layers = 2.874 GiB (3.554 BPW, 6.946 B parameters)
+llm_load_print_meta: general.name = DeepSeek R1 0528 Qwen3 8B
+llm_load_print_meta: BOS token = 151643 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 151645 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 151645 '<|end▁of▁sentence|>'
+llm_load_print_meta: LF token = 148848 'ÄĬ'
+llm_load_print_meta: max token length = 256
+llm_load_tensors: ggml ctx size = 0.18 MiB
+llm_load_tensors: CPU buffer size = 3538.31 MiB
+......................................................................................
+llama_new_context_with_model: n_ctx = 8704
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 1000000.0
+llama_new_context_with_model: freq_scale = 0.25
+llama_kv_cache_init: CPU KV buffer size = 1224.00 MiB
+llama_new_context_with_model: KV self size = 1224.00 MiB, K (f16): 612.00 MiB, V (f16): 612.00 MiB
+llama_new_context_with_model: CPU output buffer size = 0.58 MiB
+llama_new_context_with_model: CPU compute buffer size = 304.75 MiB
+llama_new_context_with_model: graph nodes = 978
+llama_new_context_with_model: graph splits = 1
+
+main: n_kv_max = 8704, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.610 | 318.01 | 8.114 | 15.77 |
+| 512 | 128 | 512 | 1.672 | 306.24 | 8.222 | 15.57 |
+| 512 | 128 | 1024 | 1.727 | 296.51 | 8.403 | 15.23 |
+| 512 | 128 | 1536 | 1.787 | 286.52 | 8.455 | 15.14 |
+| 512 | 128 | 2048 | 1.843 | 277.76 | 8.639 | 14.82 |
+| 512 | 128 | 2560 | 1.897 | 269.93 | 8.709 | 14.70 |
+| 512 | 128 | 3072 | 1.949 | 262.74 | 8.831 | 14.49 |
+| 512 | 128 | 3584 | 1.999 | 256.17 | 8.952 | 14.30 |
+| 512 | 128 | 4096 | 2.057 | 248.87 | 9.074 | 14.11 |
+| 512 | 128 | 4608 | 2.175 | 235.36 | 9.384 | 13.64 |
+| 512 | 128 | 5120 | 2.167 | 236.23 | 9.352 | 13.69 |
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+.
+.
+.
+```
+
+*EDIT* without rebooting it ran clean twice then the third time blew up again with:
+```
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.632 | 313.64 | 8.088 | 15.83 |
+| 512 | 128 | 512 | 1.683 | 304.24 | 8.214 | 15.58 |
+| 512 | 128 | 1024 | 1.741 | 294.14 | 8.619 | 14.85 |
+| 512 | 128 | 1536 | 1.798 | 284.73 | 8.462 | 15.13 |
+| 512 | 128 | 2048 | 1.851 | 276.66 | 8.621 | 14.85 |
+| 512 | 128 | 2560 | 1.909 | 268.16 | 8.725 | 14.67 |
+| 512 | 128 | 3072 | 1.966 | 260.48 | 8.851 | 14.46 |
+| 512 | 128 | 3584 | 2.022 | 253.27 | 8.981 | 14.25 |
+| 512 | 128 | 4096 | 2.072 | 247.09 | 9.151 | 13.99 |
+| 512 | 128 | 4608 | 2.157 | 237.39 | 9.218 | 13.89 |
+| 512 | 128 | 5120 | 2.179 | 234.97 | 9.344 | 13.70 |
+| 512 | 128 | 5632 | 2.248 | 227.72 | 9.499 | 13.48 |
+| 512 | 128 | 6144 | 2.286 | 223.97 | 9.649 | 13.27 |
+| 512 | 128 | 6656 | 2.339 | 218.94 | 10.081 | 12.70 |
+| 512 | 128 | 7168 | 2.396 | 213.67 | 9.989 | 12.81 |
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/mnt/astrodata/llm/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+ptrace: Operation not permitted.
+```
+
+
+
+## CPU Only 7965WX
+It took just under 8 hours to slow cook `DeepSeek-R1-0528-IQ2_KT` 196.696 GiB (2.514 BPW) on this rig. It doesn't run with CUDA offload as possibly missing some mmvq stuff `ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu:564: fatal error`, but seems to work fine when compiling for CPU only. Didn't get the assert like above either in very limited testing.
+
+
+
+```bash
+./build/bin/llama-sweep-bench \
+ --model ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ2_KT.gguf \
+ --ctx-size 4608 \
+ -mla 3 -fa \
+ -amb 512 \
+ -fmoe \
+ --threads 24 \
+ --warmup-batch \
+ --no-mmap
+
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q5_0: 61 tensors attn_k_b (crashes if u try to quantize to iq4_kt)
+llama_model_loader: - type iq2_kt: 116 tensors ffn_(up|gate)_exps
+llama_model_loader: - type iq3_kt: 58 tensors ffn_down_exps
+llama_model_loader: - type iq4_kt: 551 tensors attn/shexp/token_embd
+```
+
+Happy to try out anything to reproduce and hope it isn't a Heisenbug...
+
+Also, I was considering cooking a hybrid iq4_kt attn/shexp with iq3_k/iq2_k down/(up|gate) R1-0528, but with this speed-up to CPU inferencing I'll go all in with iq3_kt/iq2_kt down/(gate|up) just to see what happens. Gonna take a while to cook though! Thanks!
+
+---
+
+👤 **saood06** commented on **2025-06-03** at **00:38:49**
+
+>`iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed`
+
+I'm fairly certain that means there is a NaN somewhere in the calculations.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-03** at **04:22:51**
Thank for testing.
@@ -46,12 +354,32 @@ Looking at the low GPU TG performance, my guess is that you need to explicitly e
---
-👤 **ikawrakow** commented the **2025-06-03** at **07:10:14**:
+👤 **ubergarm** commented on **2025-06-03** at **05:22:12**
+
+I didn't run into that assert in limited testing a mixes of iqN_kt with DeepSeek-R1-0528 on two remote systems fwiw. This PR did speed up CPU only compiled inferencing but couldn't test CUDA offload as described. Accidently updated my above comment before realizing you'd already commented. Its past my bed time hah.
+
+> -DGGML_CUDA_F16=ON
+
+That did the trick for the `_kt` quant!
+
+
+
+Thanks!
+
+---
+
+👤 **ikawrakow** commented on **2025-06-03** at **07:10:14**
I hadn't tested this PR with a DeepSeek model. Testing now I see DeepSeek-Lite breaks with `bf16` precision. I don't get NaNs but I get extremely high perplexity values and gibberish in TG.
---
-👤 **ikawrakow** commented the **2025-06-19** at **07:26:25**:
+👤 **ikawrakow** commented on **2025-06-03** at **07:24:15**
+
+Something goes wrong on CUDA too with DeepSeek-Lite. So, it seems, trellis quants are not quite ready for prime time yet.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-19** at **07:26:25**
-Closing in favor of #529
\ No newline at end of file
+Closing in favor of [#529](https://github.com/ikawrakow/ik_llama.cpp/issues/529)
\ No newline at end of file
diff --git a/github-data/pull_requests/486 - Adding the XTC sampler.md b/github-data/pull_requests/486 - Adding the XTC sampler.md
index 2f139da9c..c609470b4 100644
--- a/github-data/pull_requests/486 - Adding the XTC sampler.md
+++ b/github-data/pull_requests/486 - Adding the XTC sampler.md
@@ -1,14 +1,17 @@
-### 🔀 [#486](https://github.com/ikawrakow/ik_llama.cpp/pull/486) - Adding the XTC sampler
+## 🔀 [Pull Request #486](https://github.com/ikawrakow/ik_llama.cpp/pull/486) - Adding the XTC sampler
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/sampling-xtc` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-03 |
| **Updated** | 2025-06-03 |
+| **Merged** | 2025-06-03 |
---
-#### Description
+## 📄 Description
Given popular demand, here is the XTC sampler.
@@ -20,18 +23,26 @@ Same usage as in mainline:
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** submitted a review the **2025-06-03** at **09:34:48**: 💬 `COMMENTED`
+👤 **saood06** started a conversation on `common/common.cpp` on **2025-06-03** at **09:34:48**
----
-
-👤 **saood06** submitted a review the **2025-06-03** at **09:35:50**: 💬 `COMMENTED`
+1.0 is disabled for threshold, not 0.0
----
+> 👤 **ikawrakow** replied on **2025-06-03** at **09:39:08**
+>
+> Oh, I forgot to update those, thanks!
+>
+> As per mainline implementation, the disabling threshold is 0.5
-👤 **ikawrakow** submitted a review the **2025-06-03** at **09:39:08**: 💬 `COMMENTED`
+> 👤 **saood06** replied on **2025-06-03** at **09:44:33**
+>
+> >As per mainline implementation, the disabling threshold is 0.5
+>
+> Yeah, I forgot and only remembered after commenting (after reading the rest of the commit). I was referencing mainline which makes the mistake of saying 1.0 here (but >0.5 in other places). Sorry.
---
-👤 **saood06** submitted a review the **2025-06-03** at **09:44:33**: 💬 `COMMENTED`
\ No newline at end of file
+👤 **saood06** started a conversation on `common/sampling.h` on **2025-06-03** at **09:35:49**
+
+minor typo here "threashold" should be "threshold"
\ No newline at end of file
diff --git a/github-data/pull_requests/487 - Make sure MMVQ is supported before using it.md b/github-data/pull_requests/487 - Make sure MMVQ is supported before using it.md
index 7dfcc7969..06f804de4 100644
--- a/github-data/pull_requests/487 - Make sure MMVQ is supported before using it.md
+++ b/github-data/pull_requests/487 - Make sure MMVQ is supported before using it.md
@@ -1,15 +1,525 @@
-### 🔀 [#487](https://github.com/ikawrakow/ik_llama.cpp/pull/487) - Make sure MMVQ is supported before using it
+## 🔀 [Pull Request #487](https://github.com/ikawrakow/ik_llama.cpp/pull/487) - Make sure MMVQ is supported before using it
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ✅ **Open** |
+| **Source Branch** | `ik/mmvq_type_supported` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-03 |
| **Updated** | 2025-06-03 |
---
-#### Description
+## 📄 Description
The new trellis quants do not support quantized matrix-vector multiplications (a.k.a., MMVQ), but the fused ffn_up+ffn_gate implementation does not check for that, which leads to an assert when the MMVQ is called for a trellis quant.
-This PR attempts to fix it.
\ No newline at end of file
+This PR attempts to fix it.
+
+---
+
+## 💬 Conversation
+
+👤 **ubergarm** commented on **2025-06-03** at **19:43:41**
+
+Okay tested this PR which let's now run full DeepSeek-R1-0528 running a mix of all three new trellis quants with CUDA compiled.
+
+1. This PR does fix my previous error `ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu:564: fatal error`.
+2. I was able to offload onto two CUDA GPUs and do some very limited inferencing testing that looked okay.
+3. Began running the usual `llama-perplexity` test but started getting `nan` after chunk 25.
+4. If I compile with `-DGGML_CUDA_F16=ON` it seems to still inference okay, but perplexity throws `nan` immedeately on first chunk.
+5. Compiling with `-DGGML_CUDA_IQK_FORCE_BF16=1` still throws nan after chunk 25.
+
+Thanks, happy to try any other configurations or build flags etc. Otherwise might try CPU only to get this perplexity value for now haha...
+
+
+
+👈 Details and Logs
+
+## Testing PR487
+#### Quant
+DeepSeek-R1-0528-IQ2_KT 196.696 GiB (2.514 BPW)
+- type f32: 361 tensors
+- type q5_0: 61 tensors `attn_k_b`
+- type iq2_kt: 116 tensors `ffn_(gate|up)_exps`
+- type iq3_kt: 58 tensors `ffn_down_exps`
+- type iq4_kt: 551 tensors everything else
+
+#### Rig
+* CPU/RAM
+ - AMD 7965WX 24x Core 256GB DDR5@4800
+* GPUs
+ - Dual RTX A6000 48GB VRAM each total 96GB VRAM
+
+#### Methodology and Logs
+```bash
+git checkout ik/mmvq_type_supported
+git rev-parse --short HEAD
+626f49ab
+
+# also tested with -DGGML_CUDA_IQK_FORCE_BF16=1 with same results
+cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
+cmake --build ./build --config Release -j $(nproc)
+
+model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ2_KT.gguf
+./build/bin/llama-perplexity \
+ --model "$model" \
+ -f wiki.test.raw \
+ --seed 1337 \
+ --ctx-size 512 \
+ -mla 3 -fa \
+ -amb 512 \
+ -fmoe \
+ -ngl 99 \
+ -ot "blk\.(3|4|5|6|7|8|9|10|11|12|13|13)\.ffn_.*=CUDA0" \
+ -ot "blk\.(14|16|17|18|19|20|21|22|23|24|25)\.ffn_.*=CUDA1" \
+ -ot exps=CPU \
+ --threads 24
+
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 2 CUDA devices:
+ Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
+main: build = 3724 (626f49ab)
+main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+main: seed = 1337
+llama_model_loader: loaded meta data with 49 key-value pairs and 1147 tensors from /mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ2_KT.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = DeepSeek R1 0528
+llama_model_loader: - kv 3: general.version str = 0528
+llama_model_loader: - kv 4: general.basename str = DeepSeek-R1
+llama_model_loader: - kv 5: general.size_label str = 256x21B
+llama_model_loader: - kv 6: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840
+llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128
+llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128
+llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 15: general.file_type u32 = 151
+llama_model_loader: - kv 16: deepseek2.leading_dense_block_count u32 = 3
+llama_model_loader: - kv 17: deepseek2.vocab_size u32 = 129280
+llama_model_loader: - kv 18: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 19: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 20: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 21: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 22: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 23: deepseek2.expert_count u32 = 256
+llama_model_loader: - kv 24: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 25: deepseek2.expert_weights_scale f32 = 2.500000
+llama_model_loader: - kv 26: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 27: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 28: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 29: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 30: deepseek2.rope.scaling.factor f32 = 40.000000
+llama_model_loader: - kv 31: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 32: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 34: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,129280] = ["
+llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,129280] = [3
+llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,127741] = ["
+llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 40: tokenizer.ggml.padding_token_id u32 = 1
+llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 42: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 43: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 44: general.quantization_version u32 = 2
+llama_model_loader: - kv 45: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-R1...
+llama_model_loader: - kv 46: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
+llama_model_loader: - kv 47: quantize.imatrix.entries_count i32 = 721
+llama_model_loader: - kv 48: quantize.imatrix.chunks_count i32 = 812
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q5_0: 61 tensors
+llama_model_loader: - type iq2_kt: 116 tensors
+llama_model_loader: - type iq3_kt: 58 tensors
+llama_model_loader: - type iq4_kt: 551 tensors
+llm_load_vocab: special tokens cache size = 818
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 163840
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 128
+llm_load_print_meta: n_head_kv = 128
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 24576
+llm_load_print_meta: n_embd_v_gqa = 16384
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 256
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 10000.0
+llm_load_print_meta: freq_scale_train = 0.025
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = 671B
+llm_load_print_meta: model ftype = IQ2_KT - 2.125 bpw
+llm_load_print_meta: model params = 672.050 B
+llm_load_print_meta: model size = 196.696 GiB (2.514 BPW)
+llm_load_print_meta: repeating layers = 195.831 GiB (2.510 BPW, 670.196 B parameters)
+llm_load_print_meta: general.name = DeepSeek R1 0528
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_layer_dense_lead = 3
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.5
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+llm_load_tensors: ggml ctx size = 1.40 MiB
+Tensor blk.3.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.3.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.4.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.5.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.6.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.7.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.8.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.9.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.10.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.11.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.12.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_norm.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_gate_inp.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_down_exps.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_up_exps.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_gate_shexp.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_down_shexp.weight buffer type overriden to CUDA0
+Tensor blk.13.ffn_up_shexp.weight buffer type overriden to CUDA0
+Tensor blk.14.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.14.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.14.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.14.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.14.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.14.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.14.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.16.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.16.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.16.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.16.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.16.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.16.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.17.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.17.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.17.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.17.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.17.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.17.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.17.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.18.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.18.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.18.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.18.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.18.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.18.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.18.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.19.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.19.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.19.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.19.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.19.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.19.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.19.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.20.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.20.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.20.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.20.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.20.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.20.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.20.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.21.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.21.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.21.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.21.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.21.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.21.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.21.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.22.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.22.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.22.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.22.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.22.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.22.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.22.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.23.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.23.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.23.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.23.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.23.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.23.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.23.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.24.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.24.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.24.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.24.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.24.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.24.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.24.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.25.ffn_norm.weight buffer type overriden to CUDA1
+Tensor blk.25.ffn_gate_inp.weight buffer type overriden to CUDA1
+Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.25.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.25.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.25.ffn_gate_shexp.weight buffer type overriden to CUDA1
+Tensor blk.25.ffn_down_shexp.weight buffer type overriden to CUDA1
+Tensor blk.25.ffn_up_shexp.weight buffer type overriden to CUDA1
+Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
+llm_load_tensors: offloading 61 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 62/62 layers to GPU
+llm_load_tensors: CPU buffer size = 158670.43 MiB
+llm_load_tensors: CPU buffer size = 442.86 MiB
+llm_load_tensors: CUDA0 buffer size = 40719.56 MiB
+llm_load_tensors: CUDA1 buffer size = 40914.69 MiB
+....................................................................................................
+llama_new_context_with_model: n_ctx = 2048
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 3
+llama_new_context_with_model: attn_max_b = 512
+llama_new_context_with_model: fused_moe = 1
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 0.025
+llama_kv_cache_init: CUDA0 KV buffer size = 72.00 MiB
+llama_kv_cache_init: CUDA1 KV buffer size = 65.25 MiB
+llama_new_context_with_model: KV self size = 137.25 MiB, c^KV (f16): 137.25 MiB, kv^T: not used
+llama_new_context_with_model: CUDA_Host output buffer size = 1.97 MiB
+llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
+llama_new_context_with_model: CUDA0 compute buffer size = 2043.00 MiB
+llama_new_context_with_model: CUDA1 compute buffer size = 476.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 18.01 MiB
+llama_new_context_with_model: graph nodes = 3487
+llama_new_context_with_model: graph splits = 148
+
+system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
+perplexity: tokenizing the input ..
+perplexity: tokenization took 604.513 ms
+perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
+perplexity: 49.29 seconds per pass - ETA 1 hours 55.22 minutes
+[1]2.6769,[2]3.4157,[3]2.4584,[4]2.0552,[5]1.8836,[6]1.7454,[7]1.6643,[8]1.6095,[9]1.5704,[10]1.5282,[11]1.5253,[12]1.6034,[13]1.6284,[14]1.7546,[15]1.8981,[16]1.9518,[17]2.1167,[18]2.2438,[19]2.2098,[20]2.2026,[21]2.3055,[22]2.2735,[23]2.2438,[24]2.2625,[25]nan,[26]nan,[27]nan,[28]nan,[29]nan,[30]nan,[31]nan,[32]nan,[33]nan,[34]nan,[35]nan,[36]nan,
+```
+
+
\ No newline at end of file
diff --git a/github-data/pull_requests/488 - Faster CPU prompt processing for Trellis quants and MoE models.md b/github-data/pull_requests/488 - Faster CPU prompt processing for Trellis quants and MoE models.md
index 8cc690ddb..fe8053d33 100644
--- a/github-data/pull_requests/488 - Faster CPU prompt processing for Trellis quants and MoE models.md
+++ b/github-data/pull_requests/488 - Faster CPU prompt processing for Trellis quants and MoE models.md
@@ -1,15 +1,18 @@
-### 🔀 [#488](https://github.com/ikawrakow/ik_llama.cpp/pull/488) - Faster CPU prompt processing for Trellis quants and MoE models
+## 🔀 [Pull Request #488](https://github.com/ikawrakow/ik_llama.cpp/pull/488) - Faster CPU prompt processing for Trellis quants and MoE models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/dequant_moe_gemm` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-03 |
| **Updated** | 2025-06-05 |
+| **Merged** | 2025-06-05 |
---
-#### Description
+## 📄 Description
-This PR is a follow up to #482, and applies the same dequantizing GEMM for MoE matrix multiplications.
+This PR is a follow up to [#482](https://github.com/ikawrakow/ik_llama.cpp/issues/482), and applies the same dequantizing GEMM for MoE matrix multiplications.
For a DeepSeek-Lite model where only the `ffn_up` and `ffn_gate` tensors are quantized with `IQ2_KT` I observe a ~35% improvement in PP performance compared to te main branch.
\ No newline at end of file
diff --git a/github-data/pull_requests/489 - Adding top-n-sigma sampler.md b/github-data/pull_requests/489 - Adding top-n-sigma sampler.md
index f6222f24d..76793aa19 100644
--- a/github-data/pull_requests/489 - Adding top-n-sigma sampler.md
+++ b/github-data/pull_requests/489 - Adding top-n-sigma sampler.md
@@ -1,14 +1,17 @@
-### 🔀 [#489](https://github.com/ikawrakow/ik_llama.cpp/pull/489) - Adding top-n-sigma sampler
+## 🔀 [Pull Request #489](https://github.com/ikawrakow/ik_llama.cpp/pull/489) - Adding top-n-sigma sampler
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/sampling-top-n-sigma` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-03 |
| **Updated** | 2025-06-03 |
+| **Merged** | 2025-06-03 |
---
-#### Description
+## 📄 Description
Given popular demand, adding top-n $\sigma$ sampler.
@@ -19,9 +22,15 @@ Set to off by default.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-03** at **10:04:08**:
+👤 **saood06** commented on **2025-06-03** at **09:48:28**
+
+Since this PR is still open could the documentation for this and XTC be added to [examples/server/README.md](https://github.com/ikawrakow/ik_llama.cpp/blob/ccb265c01676aad9ae5860ba50e74e61dfcd1cf8/examples/server/README.md) and [examples/main/README.md](https://github.com/ikawrakow/ik_llama.cpp/blob/ccb265c01676aad9ae5860ba50e74e61dfcd1cf8/examples/main/README.md).
+
+---
+
+👤 **ikawrakow** commented on **2025-06-03** at **10:04:08**
Sure, will do.
@@ -31,7 +40,7 @@ DRY?
---
-👤 **saood06** commented the **2025-06-03** at **10:23:49**:
+👤 **saood06** commented on **2025-06-03** at **10:23:49**
>What else do people want for sampling?
>
@@ -43,18 +52,47 @@ I do personally think DRY is the best repeat penalty (of the ones that are publi
---
-👤 **saood06** submitted a review the **2025-06-03** at **10:38:23**: 💬 `COMMENTED`
+👤 **saood06** started a conversation on `examples/main/README.md` on **2025-06-03** at **10:38:23**
+
+Maybe add something like the following:
+
+XTC probability sets how likely the XTC sampler is to engage.
+XTC threshold is the lower-bound for what probability is needed for a token to be considered a "Top choice" and when engaged only the lowest probability top choice is kept.
+
+And maybe change ### XTC Sampling to ### XTC Sampling (Exclude Top Choices) since the description above refers to the full name
+
+---
+
+👤 **saood06** started a conversation on `examples/main/README.md` on **2025-06-03** at **10:41:21**
+
+Maybe add something letting people know that increasing top-n-sigma results in more tokens being considered, while decreasing it makes less tokens be considered as not all users will be able to figure that out from the mathematical description you provided.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-06-03** at **11:33:23**
+
+Yep, DRY is good. XTC threshold is usually .1 and below to get anything meaningful out of it. Not sure how that compares here. Super interesting how this one is going to compare to the one I stole from mainline.
---
-👤 **saood06** submitted a review the **2025-06-03** at **10:41:21**: 💬 `COMMENTED`
+👤 **saood06** started a conversation on `examples/main/README.md` on **2025-06-03** at **12:22:28**
+
+"conrolled" -> controlled
+
+This isn't really accurate, as the lowest "top choice" is retained. As it is written it makes it seem like it removes all tokens with probability greater than the threshold.
+
+Also I think the conditions for it to be turned off should be consistent instead of having the probability one in the beginning and the threshold one at the bottom
---
-👤 **Ph0rk0z** commented the **2025-06-03** at **11:33:23**:
+👤 **ikawrakow** commented on **2025-06-03** at **13:38:26**
-Yep, DRY is good. XTC threshold is usually .1 and below to get anything meaningful out of it. Not sure how that compares here. Super interesting to how this one is going to compare to the one I stole from mainline.
+Why don't you make your changes on top of the PR? Or, we merge the way it is and you make a new PR with better description.
---
-👤 **saood06** submitted a review the **2025-06-03** at **12:22:28**: 💬 `COMMENTED`
\ No newline at end of file
+👤 **saood06** commented on **2025-06-03** at **14:04:51**
+
+> Or, we merge the way it is and you make a new PR with better description.
+
+Sure. I can do that.
\ No newline at end of file
diff --git a/github-data/pull_requests/49 - ARM_NEON Flash Attention.md b/github-data/pull_requests/49 - ARM_NEON Flash Attention.md
index ff98ec0f6..f00a26c37 100644
--- a/github-data/pull_requests/49 - ARM_NEON Flash Attention.md
+++ b/github-data/pull_requests/49 - ARM_NEON Flash Attention.md
@@ -1,14 +1,17 @@
-### 🔀 [#49](https://github.com/ikawrakow/ik_llama.cpp/pull/49) - ARM_NEON Flash Attention
+## 🔀 [Pull Request #49](https://github.com/ikawrakow/ik_llama.cpp/pull/49) - ARM_NEON Flash Attention
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/neon_flash_attention_2` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-11 |
| **Updated** | 2024-09-11 |
+| **Merged** | 2024-09-11 |
---
-#### Description
+## 📄 Description
This PR adds Flash Attention for `ARM_NEON`. The `Zen4/AVX2` implementation is reused with a few platform specific additions for `ARM_NEON`. As with `AVX2`, it is just for `fp16` kv-cache for now.
diff --git a/github-data/pull_requests/492 - CUDA implementation for IQ1_S_R4.md b/github-data/pull_requests/492 - CUDA implementation for IQ1_S_R4.md
index 3bb8d0d27..3d3a34fa7 100644
--- a/github-data/pull_requests/492 - CUDA implementation for IQ1_S_R4.md
+++ b/github-data/pull_requests/492 - CUDA implementation for IQ1_S_R4.md
@@ -1,18 +1,21 @@
-### 🔀 [#492](https://github.com/ikawrakow/ik_llama.cpp/pull/492) - CUDA implementation for IQ1_S_R4
+## 🔀 [Pull Request #492](https://github.com/ikawrakow/ik_llama.cpp/pull/492) - CUDA implementation for IQ1_S_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_iq1_s_r4` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-04 |
| **Updated** | 2025-06-05 |
+| **Merged** | 2025-06-05 |
---
-#### Description
+## 📄 Description
Apparently there are people who would like to use `IQ1_S` or `IQ1_S_R4` quantized models. This PR adds CUDA implementation for `IQ1_S_R4`.
-It seems there has been some confusion about which of these quants is supported where (see discussions in #477)
+It seems there has been some confusion about which of these quants is supported where (see discussions in [#477](https://github.com/ikawrakow/ik_llama.cpp/issues/477))
To clarify:
* `IQ1_S` and `IQ1_S_R4` have both fast GEMM and GEMV on the CPU, but `IQ1_S_R4` is faster for prompt processing due to row interleaving
@@ -63,17 +66,42 @@ Here is the performance with dequantize+cuBLAS that I had originally:
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-06-04** at **22:53:11**:
+👤 **ubergarm** commented on **2025-06-04** at **21:06:38**
+Haha yes... Thanks a lot for this one! I realize I made a lot of confusion releasing that `IQ1_S_R4` so also uploaded the equivalent `IQ1_S`. I'll let folks know they can use the `_R4` with GPU offload now and likely increase their TG numbers!
+
+> these two quants are not 100% equivalent. IQ1_S uses float scales per super-blocks of 256 weights, while IQ1_S_R4 uses a single float scale for an entire tensor row (and is therefore slightly smaller with exactly 1.5 bpw, while IQ1_S is 1.5625 bpw).
+
+Appreciate the explanation, I was a little worried at first when the model size increased by a few GiB, but everything worked out and folks can still barely squeeze it onto 128GiB RAM + 24GB VRAM rigs by offloading layers. Makes sense now.
+
+I'll run a quick test of this PR and post some llama-sweep-bench results soon.
+
+---
+
+👤 **ubergarm** commented on **2025-06-04** at **22:53:11**
+
+*EDIT*;
+
+Oops, it just hit me after going for a long walk. My quant also uses `IQ1_M_R4`
+```
+- type iq1_s_r4: 116 tensors `ffn_(gate|up)_exps`
+- type iq1_m_r4: 58 tensors `ffn_down_exps`
+```
+
+So *ignore the rest of this* haha... I might try to roll an all exps at `iq1_s_r4` though and break the world record for the smallest R1-0528 quant. again. lol..,.
+
+*EDIT2*: Yeah it works, posted new comment below with llama-sweep-bench results.
+---
+
Well shucks, I tried this PR, but I'm not able to get the R1-0528-IQ1_S_R4 to run with GPU offload. I tried a few compilation options with and without `-DGGML_CUDA_IQK_FORCE_BF16=1` and the IQ1_S runs fine with the exact same llama-sweep-bench command.
This is on the 7965WX 256GB RAM + Dual RTX A6000 (96GB VRAM total) rig.
Watching `nvitop` the GPUs use low power even at 100% utilization as if it is just copying data perhaps and not actually running computations still like on main. I tried a single visible CUDA device as well but same behavior. I tried the earlier GEMV commit of `33ced81c` but same behavior.
-## PR496@fb6a0d01 IQ1_S
+## PR492@fb6a0d01 IQ1_S
`main: n_kv_max = 16384, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 24, n_threads_batch = 24`
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
@@ -82,7 +110,7 @@ Watching `nvitop` the GPUs use low power even at 100% utilization as if it is ju
| 4096 | 1024 | 8192 | 15.014 | 272.81 | 71.013 | 14.42 |
| 4096 | 1024 | 12288 | 17.540 | 233.52 | 73.294 | 13.97 |
-## PR496@fb6a0d01 IQ1_S_R4
+## PR492@fb6a0d01 IQ1_S_R4
`main: n_kv_max = 16384, n_batch = 512, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 24, n_threads_batch = 24`
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
@@ -655,13 +683,23 @@ main: n_kv_max = 16384, n_batch = 512, n_ubatch = 512, flash_attn = 1, n_gpu_lay
---
-👤 **ubergarm** commented the **2025-06-05** at **04:19:52**:
+👤 **ubergarm** commented on **2025-06-05** at **04:19:52**
Okay, it works after removing the iq1_m_r4 layers! I rolled a new `IQ1_S_R4-smol` which is `iq1_s_r4` for all `exps` but I bumped up attn/token_embd/shexp to `iq5_ks`.

-You can see how both GPUs are offloaded and with some utilization along with decent power usage:
+You can see how both GPUs are offloaded and with some utilization along with decent power usage. Without this PR on `main@f6d5fbdc` it gets less than 1 tok/sec generation with the same command.

-I'll go test perplexity on this little guy and see how it looks. Thanks!
\ No newline at end of file
+I'll go test perplexity on this little guy and see how it looks. Thanks!
+
+---
+
+👤 **ikawrakow** commented on **2025-06-05** at **04:24:18**
+
+> Oops, it just hit me after going for a long walk. My quant also uses IQ1_M_R4
+
+Yes, `IQ1_M_R4` has no CUDA support. I'll add it soon to support your quest for the world's smallest model.
+
+`ffn_down` also with `IQ1_S_R4` is likely to cripple the model.
\ No newline at end of file
diff --git a/github-data/pull_requests/493 - MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4.md b/github-data/pull_requests/493 - MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4.md
index 5a657fb11..e6b0c08bc 100644
--- a/github-data/pull_requests/493 - MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4.md
+++ b/github-data/pull_requests/493 - MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#493](https://github.com/ikawrakow/ik_llama.cpp/pull/493) - MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4
+## 🔀 [Pull Request #493](https://github.com/ikawrakow/ik_llama.cpp/pull/493) - MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mmq_iq_ks_r4` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-05 |
| **Updated** | 2025-06-05 |
+| **Merged** | 2025-06-05 |
---
-#### Description
+## 📄 Description
These two can use the more efficient block-of-32 MMQ GEMM kernels, so having MMQ implementation for them makes sense.
diff --git a/github-data/pull_requests/494 - IQ1_M_R4 CUDA implementation.md b/github-data/pull_requests/494 - IQ1_M_R4 CUDA implementation.md
index db0d0f869..5458c4950 100644
--- a/github-data/pull_requests/494 - IQ1_M_R4 CUDA implementation.md
+++ b/github-data/pull_requests/494 - IQ1_M_R4 CUDA implementation.md
@@ -1,14 +1,17 @@
-### 🔀 [#494](https://github.com/ikawrakow/ik_llama.cpp/pull/494) - IQ1_M_R4 CUDA implementation
+## 🔀 [Pull Request #494](https://github.com/ikawrakow/ik_llama.cpp/pull/494) - IQ1_M_R4 CUDA implementation
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_iq1_m_r4` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-05 |
| **Updated** | 2025-06-05 |
+| **Merged** | 2025-06-05 |
---
-#### Description
+## 📄 Description
To help the quest for the world's smallest DeepSeek model, this PR adds CUDA implementation for `IQ1_M_R4`.
@@ -46,9 +49,9 @@ Here sweep bench for LlaMA-3-8B on RTX-4080
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-06-05** at **15:26:27**:
+👤 **ubergarm** commented on **2025-06-05** at **15:26:27**
Amazing, you've done it! The pieces of the puzzle are in place. Congrats, ik, on the world's smallest working DeepSeek-R1-0528 quant! :tada:
@@ -139,4 +142,12 @@ llama_new_context_with_model: graph splits = 111
| 4096 | 1024 | 8192 | 14.947 | 274.04 | 76.418 | 13.40 |
| 4096 | 1024 | 12288 | 17.442 | 234.84 | 78.654 | 13.02 |
-
\ No newline at end of file
+
+
+---
+
+👤 **ubergarm** commented on **2025-06-05** at **16:12:10**
+
+`4.8805 +/- 0.02876` perplexity, not great, not terrible.
+
+Importantly, it runs clean with no nans!!! Ship it! :ship: :chipmunk: :rocket:
\ No newline at end of file
diff --git a/github-data/pull_requests/495 - Check if ffn_up and ffn_gate are of the same type before using fmoe.md b/github-data/pull_requests/495 - Check if ffn_up and ffn_gate are of the same type before using fmoe.md
index 63bbb38bf..d7bbb2da3 100644
--- a/github-data/pull_requests/495 - Check if ffn_up and ffn_gate are of the same type before using fmoe.md
+++ b/github-data/pull_requests/495 - Check if ffn_up and ffn_gate are of the same type before using fmoe.md
@@ -1,22 +1,117 @@
-### 🔀 [#495](https://github.com/ikawrakow/ik_llama.cpp/pull/495) - Check if ffn_up and ffn_gate are of the same type before using fmoe
+## 🔀 [Pull Request #495](https://github.com/ikawrakow/ik_llama.cpp/pull/495) - Check if ffn_up and ffn_gate are of the same type before using fmoe
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ✅ **Open** |
+| **Source Branch** | `ik/check_up_gate_fmoe` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-06 |
| **Updated** | 2025-07-12 |
---
-#### Description
+## 📄 Description
Apparently some quant cookers are going as far as using different quantization types for `ffn_up` and `ffn_gate`. As this possibility is not correctly handled in the fused `ffn_up+ffn_gate` op, this PR adds a check and disables `fmoe` in these layers.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-06** at **10:10:35**:
+👤 **Thireus** commented on **2025-06-06** at **08:43:25**
+
+Thank you for looking into this. I'll test your change and will report back when finished. Model loads when `-fmoe` is specified now.
+
+---
+
+👤 **Thireus** commented on **2025-06-06** at **09:30:53**
+
+It would appear that llama-sweep-bench and llama-cli don't like `-fmoe`.
+
+# llama-bench - Works when `-fmoe 1` specified for unsloth model
+```
+CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,1 ~/ik_llama-main-b3758-23c3e73-bin-win-cuda-12.8-x64/llama-bench -m DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf -mla 3 -fa 1 -fmoe 1 -amb 1024 -ngl 99 -ctk f16 -ot ".ffn_(up|down)_exps.=CPU" -b 4096 -ub 4096 --mmap 0 --threads 36 --main-gpu 0 -n 0
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 3 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mla | amb | mmap | fmoe | test | t/s |
+| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --: | ----: | ---: | ---: | ------------: | ---------------: |
+==========================================================================
+Detected incompatible DeepSeek model.
+Will try to fix, but there are no guarantees
+
+*** Your prompt processing speed will be crippled ***
+
+Consider making your own ik_llama.cpp compatible model or
+ask the model provider to make one for you,
+==========================================================================
+Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.1.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+...
+| deepseek2 671B IQ1_S - 1.5625 bpw | 173.47 GiB | 672.05 B | CUDA | 99 | 36 | 4096 | 4096 | 1 | 3 | 1024 | 0 | 1 | pp512 | 23.32 ± 0.50 |
+
+build: 23c3e73 (1)
+```
+
+# llama-cli - Doesn't work
+```
+CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,1 ~/ik_llama-main-b3758-23c3e73-bin-win-cuda-12.8-x64/llama-cli -m DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf -mla 3 -fa \
+ -amb 1024 \
+ -fmoe \
+ -ctk f16 \
+ -c 16384 \
+ -ngl 99 \
+ -ot "blk\.(3|4)\.ffn_.*=CUDA0" -ot "blk\.(5)\.ffn_.*=CUDA1" -ot "blk\.(6)\.ffn_.*=CUDA2" \
+ -ot exps=CPU \
+ -b 4096 -ub 4096 \
+ --warmup-batch \
+ --no-mmap \
+ --threads 36 \
+ --main-gpu 0 \
+ -p '<|begin▁of▁sentence|><|User|>What is the solution of x+5=-2?<|Assistant|>\n'
+---
+...
+<|begin?of?sentence|><|User|>What is the solution of x+5=-2?<|Assistant|>
+FirstD:\a\ik_llama.cpp\ik_llama.cpp\ggml\src\ggml.c:15189: fatal error
+D:\a\ik_llama.cpp\ik_llama.cpp\ggml\src\ggml.c:15189: fatal error
+D:\a\ik_llama.cpp\ik_llama.cpp\ggml\src\ggml.c:15189: fatal error
+...
+---
+```
+
+# llama-sweep-bench - Fatal error after the model loads when `-fmoe` specified for unsloth model
+```
+CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,1 ~/ik_llama-main-b3758-23c3e73-bin-win-cuda-12.8-x64/llama-sweep-bench -m DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf -mla 3 -fa \
+ -amb 1024 \
+ -fmoe \
+ -ctk f16 \
+ -c 16384 \
+ -ngl 99 \
+ -ot "blk\.(3|4)\.ffn_.*=CUDA0" -ot "blk\.(5)\.ffn_.*=CUDA1" -ot "blk\.(6)\.ffn_.*=CUDA2" \
+ -ot exps=CPU \
+ -b 4096 -ub 4096 \
+ --warmup-batch \
+ --no-mmap \
+ --threads 36 \
+ --main-gpu 0
+---
+...
+main: n_kv_max = 16384, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 36, n_threads_batch = 36
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+D:\a\ik_llama.cpp\ik_llama.cpp\ggml\src\ggml.c:15189: D:\a\ik_llama.cpp\ik_llama.cpp\ggml\src\ggml.c:15189: fatal error
+D:\a\ik_llama.cpp\ik_llama.cpp\ggml\src\ggml.c:15189: fatal error
+...
+---
+```
+
+---
+
+👤 **ikawrakow** commented on **2025-06-06** at **10:10:35**
Oh, I see. The model contains `IQ1_M` quants. With partial offload TG will run on the CPU, and `IQ1_M` quants are not supported there with `-fmoe`. The fused `ffn_up+ffn_gate` op relies on the `IQK` GEMM/GEMV implementation, and there is no `IQK` implementation for `IQ1_M`.
@@ -26,19 +121,38 @@ Thanks for testing. So, for now, models containing `IQ1_M` quants cannot be used
---
-👤 **Thireus** commented the **2025-06-06** at **12:48:14**:
+👤 **Thireus** commented on **2025-06-06** at **12:48:14**
Ah! Thank you for the clarification. Where can I find the list of quantisation type currently implemented in ik_llama? I'm thinking of attempting to reproduce Unsloth dynamic GGUF quants that would only include supported ik_llama quants.
---
-👤 **Thireus** commented the **2025-06-06** at **13:03:28**:
+👤 **ikawrakow** commented on **2025-06-06** at **12:54:17**
+
+> Ah! Thank you for the clarification. Where can I find the list of quantisation type currently implemented in ik_llama?
+
+All types supported by `llama.cpp` are also supported in `ik_llama.cpp` (+ another 3X extra types). The specific issue you have with `-fmoe` (also an `ik_llama.cpp` extension) is that `IQ1_M` does not have fast CPU matrix multiplications implemented. But it will work just fine without `-fmoe` the same way it does in `llama.cpp` (i.e., very slow).
+
+---
+
+👤 **Thireus** commented on **2025-06-06** at **13:03:28**
Yes sorry this is what I meant, I'm looking for the file/folder where the fast CPU matrix multiplication for IQ1_M would need to be implemented please. I plan to use other UD quants so I will need to see what has been implemented so far for fast CPU matrix multiplication.
+Edit: I believe I found it - ggml/src/iqk/iqk_mul_mat.cpp
+
---
-👤 **Thireus** commented the **2025-06-06** at **14:04:05**:
+👤 **ikawrakow** commented on **2025-06-06** at **13:13:05**
+
+Everything is implemented apart from `IQ1_M`. But if you want to take a look yourself, this is in the `ggml/src/iqk` folder.
+The function `MulMat::prepare()` in `iqk_mul_mat.cpp` will tell you which types are implemented.
+
+I personally don't take `IQ1_S` and `IQ1_M` very seriously, so did not implement those. The only reason `IQ1_S` is implemented is that there was a user asking for `IQ1_S`, so I added it in PR [#212](https://github.com/ikawrakow/ik_llama.cpp/issues/212). It then turned out this specific user was only asking for it to copy the code into KTransformers.
+
+---
+
+👤 **Thireus** commented on **2025-06-06** at **14:04:05**
I see, not cool what happened here! ... 🫤
@@ -48,7 +162,7 @@ I with unsloth could make UD quants compatible with ik_llama. Their imatrix is q
2. Implement IQ1_M and potentially others for higher unsloth quants (dunno if they use XS and XSS in their UD, would need to check)
3. Use the provided non-UD IQ from unsloth... knowing I would not benefit from UD quality boost. However, they only provide IQ4 which I cannot run because too big for my rig, so would need to ask them to produce lower ones. 🙁
-I'm leaning towards 1. as I don't understand yet the benefits of using R4 quants. But may have to change my mind and go with option 1.
+I'm leaning towards 1. as I don't understand yet measured the benefits of using _R4 quants. But may have to change my mind and go with option 1.
---
Summary of “Missing quant‐types” per bit
@@ -62,7 +176,7 @@ Summary of “Missing quant‐types” per bit
---
-👤 **ikawrakow** commented the **2025-06-06** at **14:13:13**:
+👤 **ikawrakow** commented on **2025-06-06** at **14:13:13**
* IQ1_BN_R4 does not exist
* IQ3_XS, IQ3_XS_R4, IQ3_BN, IQ3_BN_R4 - they don't exist
@@ -73,7 +187,38 @@ To see what quantization types exist, take a look [here](https://github.com/ikaw
---
-👤 **Thireus** commented the **2025-06-07** at **13:39:51**:
+👤 **ubergarm** commented on **2025-06-06** at **19:58:52**
+
+Thanks for all the testing, this helps me understand why the temporary IQ1_S i had rolled seemed off in brief testing. The IQ1_S_R4 is definitely the way to go I see given the recent CUDA support.
+
+@Thireus
+
+> I would not benefit from UD quality boost
+
+## tl;dr;
+You seem pretty smart, don't limit your imagination to what others like unsloth and myself have done already. ik_llama.cpp gives you a powerful palette of quant types and optimizations to come up with your own mixes and methodologies for your use case and hardware.
+
+## Ramblings
+
+Sounds like you've done some measurements and possibly observed the quant recipes and imatrix methodologies grouped broadly under the label of "unsloth dynamic 2.0" are good for your use cases? I'm curious how you are doing benchmarks, as it can be pretty challenging with these large models. (maybe u already posted on HF, catching up on messages now).
+
+I'm genuinely curious as my understanding is that unsloth recently began to generate synthetic datasets including model specific tokens and using a larger context window e.g. 6-12k rather than the "normal" default 512 context. However it isn't always clear what methodology and imatrix corpus datasets were used on each quant and don't think they upload their imatrix dat files anymore either.
+
+My own experience at least with Qwen3-30B-A3B in writing up [The Great Quant Wars of 2025](https://gist.github.com/ubergarm/0f9663fd56fc181a00ec9f634635eb38) suggests that UD GGUFs are not necessarily "better" in a statistically measurable way at least using the usual methodologies which I try to share openly.
+
+In [another reddit post](https://www.reddit.com/r/LocalLLaMA/comments/1ksw070/comment/mtq8ow7/) about why some folks seem to like unsloth quants, Daniel gave an interesting reply:
+
+> The biggest difference I would say isn't the quants, but rather our bug fixes for every model!
+
+I appreciate all the effort they are putting in, am very happy to have more quants to choose from, and honestly they helped get me into all this with the first very small DeepSeek-R1 quants only a few months ago now haha... Their hard work fixing bugs is great too, and I'm glad they are trying out more methodologies, but it isn't clear to me how these effect actual model performance in common situations or that it is always "better". It definitely isn't better in all situations, such as the [128k 4x yarn quant GGUFs](https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF/discussions/8) which change the Qwen recommended defaults - it might be better if all your prompts are actually 100-128k, but it gives [a measurable worse perplexity](https://github.com/vllm-project/llm-compressor/issues/1406#issuecomment-2937053069) in shorter context lengths as Qwen warns on the official model card. (link is some data I generated using exllamav3 in discussions with vllm-compressor AWQ quantizations).
+
+Anyway, happy Friday and enjoy your weekend! I appreciate your enthusiasm and am looking forward to seeing what you cook up! Always feel free to holler at me as I'm trying to keep up with folks pushing the limits of this stuff like Ed Addario, Bartowski, Unsloth, myself, and now you and others!
+
+Cheers!
+
+---
+
+👤 **Thireus** commented on **2025-06-07** at **13:39:51**
Hey @ubergarm, thank you for the kind words and most of all for sharing your knowledge here and there, it's been incredibly valuable. I am trying to ramp up my knowledge as fast as I can at the moment. I do not have well structured and scientific methodologies, but mainly rely on some quick tricks to build just enough evidence (to my own appreciation) about what my next steps should be to 1. get a GGUF tailored to my use cases, 2. make the most use of my current hardware in an attempt to avoid spending $20k+ on new hardware which may become obsolete in a couple of years and 3. gain sufficient knowledge to be comfortable with the (ik_)llama.cpp framework which appears to be the most flexible framework there is today for enthusiasts (I've explored exllama, vllm and a few others before).
@@ -90,7 +235,7 @@ To answer my original question, using `llama-quantize -h` is also a quick way to
---
-👤 **ubergarm** commented the **2025-06-07** at **16:45:29**:
+👤 **ubergarm** commented on **2025-06-07** at **16:45:29**
@Thireus
@@ -108,7 +253,13 @@ Cheers!
---
-👤 **Thireus** commented the **2025-06-11** at **06:04:25**:
+👤 **Thireus** commented on **2025-06-09** at **10:20:36**
+
+@ubergarm - Thanks, I went with option 1. (Get the imatrix from unsloth and produce my own quants for ik_llama). I've adapted the quants they use in their model to be ik-optimised. I'll be testing the quality of the model.
+
+---
+
+👤 **Thireus** commented on **2025-06-11** at **06:04:25**
Early observations using PPL: Using unsloth's imatrix into IQ1_S quants leads to slightly degraded results. `PPL = 4.9200 +/- 0.02917`
@@ -118,7 +269,251 @@ Unless I'm missing something, there are no mind-blowing results when evaluating
---
-👤 **ubergarm** commented the **2025-06-16** at **02:13:33**:
+👤 **ubergarm** commented on **2025-06-11** at **14:15:47**
+
+@Thireus
+
+Thanks for all the heavy lifting and number crunching to confirming some things. Your measured numbers for my IQ2_K_R4 and IQ3_K_R4 line up closely with my own so seems like you're methodology is sound. Nice job!
+
+> (110k+ context size) I should target IQ3_XXS
+
+You're in luck, because `IQ3_XXS` (and more, check the other closed PRs) just got a big boost in prompt processing today: https://github.com/ikawrakow/ik_llama.cpp/pull/516
+
+My advice would be to consider using `iq5_ks` for all attn/shexp/token_embd based on [this experiment](https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13336629) (assuming you have enough VRAM to fit your desired large 110k context still).
+
+Then maybe IQ3_XXS for ffn_down and IQ2_XXS for ffn_(gate|up) or something similar to hit the size for which you're aiming. `iq4_ks` for down and `iq3_xxs` (gate|up) may also be a good combo for a slightly larger quant now. I didn't run the final size numbers but you get the idea.
+
+Again feel free to use my [R1-0528 imatrix](https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/blob/main/imatrix-DeepSeek-R1-0528.dat) which was made including the fixes to imatrix MLA computation in recent [PR411](https://github.com/ikawrakow/ik_llama.cpp/pull/411) so likely the best you can find without making your own.
+
+Have fun and keep us posted! I'd be interested in your `llama-sweep-bench` results as well comparing PP/TG between my IQ2_K_R4 with whatever IQ3_XXS mix you cook up. Cheers!
+
+---
+
+👤 **Thireus** commented on **2025-06-15** at **20:02:19**
+
+I need some help to understand quant performance - how can I know which quant performs better than others? Are there metrics somewhere that I've missed?
+
+For example, when using @ubergarm's quants:
+```
+# Token embedding and output tensors (GPU)
+# note token_embd cannot be repacked quant type
+token_embd\.weight=iq5_ks
+output\.weight=iq5_ks
+output_norm\.weight=iq5_ks
+
+# First 3 dense layers (0-3) (GPU)
+# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
+blk\.[0-2]\.attn_k_b.*=q5_0
+blk\.[0-2]\.attn_.*=iq5_ks
+blk\.[0-2]\..*=iq5_ks
+
+# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
+# Except blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0
+blk\.[3-9]\.attn_k_b.*=q5_0
+blk\.[1-5][0-9]\.attn_k_b.*=q5_0
+blk\.60\.attn_k_b.*=q5_0
+
+blk\.[3-9]\.attn_.*=iq5_ks
+blk\.[1-5][0-9]\.attn_.*=iq5_ks
+blk\.60\.attn_.*=iq5_ks
+
+blk\.[3-9]\.ffn_norm\.weight=iq5_ks
+blk\.[1-5][0-9]\.ffn_norm\.weight=iq5_ks
+blk\.60\.ffn_norm\.weight=iq5_ks
+
+blk\.[3-9]\.exp_probs_b\.bias=iq5_ks
+blk\.[1-5][0-9]\.exp_probs_b\.bias=iq5_ks
+blk\.60\.exp_probs_b\.bias=iq5_ks
+
+# Shared Experts (3-60) (GPU)
+blk\.[3-9]\.ffn_down_shexp\.weight=iq5_ks
+blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_ks
+blk\.60\.ffn_down_shexp\.weight=iq5_ks
+
+blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
+blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
+blk\.60\.ffn_(gate|up)_shexp\.weight=iq4_ks
+
+# Routed Experts (3-60) (CPU)
+blk\.[3-9]\.ffn_down_exps\.weight=iq3_k_r4
+blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq3_k_r4
+blk\.60\.ffn_down_exps\.weight=iq3_k_r4
+
+blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_k_r4
+blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_k_r4
+blk\.60\.ffn_(gate|up)_exps\.weight=iq2_k_r4
+```
+
+Perfs are great - pp and eval are through the roof, example: 5.50t/s for eval.
+
+But now, if I decide to change the quant of the routed experts from `iq3_k_r4` to `iq3_xxs_r4`, pp and eval get divided by 10, example: 0.62t/s for eval.
+
+Why is it that changing `iq3_k_r4` to `iq3_xxs_r4` results in such disproportionate and unexpected performance drop? I have in fact noticed this with other quants too, in fact any attempt that I've made at trying various quant mixes result in the same outcome: perf drops significantly making the model unusable. The only time I get great perfs if when using @ubergarm's secret recipe from https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF. At which point I am wondering if these are the only possible and usable recipes...
+
+How do I know which quant will result in perf degradation in advance and which ones result in great perfs? And why is it that `iq3_xxs_r4` performs 10x worse than `iq3_k_r4` which is quite counter intuitive to me?
+
+I'm really lost here.
+
+Edit: ChatGPT to the rescue on that one - https://chatgpt.com/share/684f3874-83f4-800f-a3b6-e9ac15ec48cd but unsure how accurate the answer is.
+
+---
+
+👤 **ubergarm** commented on **2025-06-15** at **21:21:11**
+
+@Thireus
+
+Great job you've come a long way in a short time! This is a great question. You're getting deep enough now that the answers are not simple! I'm no expert and will likely make some mistakes in this, but you'll get the gist.
+
+> I need some help to understand quant performance
+
+Given the context of your post my impression is you are interested in "performance" in terms of inference *speed* both token generation and prompt processing. (not in terms of quality like PPL/KLD similarity to the original model).
+
+> Perfs are great - pp and eval are through the roof, example: 5.50t/s for eval.
+
+Great to hear! I tried to choose my quants recipes based on a mix of quality and speed. Though given recent improvements in various quant inferencing implementations, there are probably other good combinations as well depending on your exact hardware.
+
+Keep in mind the "best" quant in terms of speed depends on a number of things:
+1. Overall size
+ * Prompt Processing (PP) tends to be CPU limited, while Token Generation (TG) tends to be RAM i/o bandwidth limited. using smaller quants means fewer bits which can speed up TG for example, but at likely a trade-off of typically "worse quality" for lower BPW quants.
+2. Hardware support
+ * LLM inferencing is a game of finding the most optimized kernels/algorithms to multiply matricies and vectors making use of *exact* hardware registers and flags available for CPU and GPUs. Type `lscpu | grep avx2` to see if your CPU supports AVX2 instruction set (Zen4 and newer). Newer CUDA architectures like 4090 and up support native fp8 registers etc.
+3. Software Implementation
+ * Even if you have the hardware, it doesn't matter unless the software is optimized to take advantage of it. ik's project tends to focus on CPU optimizations with some CUDA optimizations as well from what I've seen.
+ * Do a google search for MARLIN GEMM, and CUTLASS GEMM, and BLAS implementations, and you'll see there is an entire academic industrial complex built up around optimization of matrix math beginning around the early 80s with FORTRAN that continues today across multiple languages and target hardware.
+
+> How do I know which quant will result in perf degradation in advance and which ones result in great perfs?
+
+Right more specific to your exact problem at hand: "Which quants should I choose for my recipe to optimize speed?". I'm not sure there is a great way to know "in advance" honestly unless you look through the code to see which quants have like MMQ (quantized matrix multiplication psure) implementations for your target hardware. If it has to rely on fp16 and fp32 dtype registers, it will likely be slower especially on CPU etc.
+
+Personally, I pick a small model of similar architecture and make a bunch of quants. Then test them with llama-sweep-bench to empirically discover which ones are faster e.g. `iq5_ks` tends to be faster than `iq5_k` given the block size allowing less time spent unpacking and more time processing.
+
+Then I use what I learn in that experiment to inform how to quantize the larger models.
+
+> And why is it that iq3_xxs_r4 performs 10x worse than iq3_k_r4 which is quite counter intuitive to me?
+
+You saw recent updates to `iq3_xxs` in [PR516](https://github.com/ikawrakow/ik_llama.cpp/pull/516) and [PR524](https://github.com/ikawrakow/ik_llama.cpp/pull/524). Keep in mind that `iq3_xxs` is not the exact same implementation as `iq3_xxs_r4` and PR516 even says the new `iq3_xxs` is about 2x faster than `iq3_xxs_r4` given ik's specific llama-sweep-bench testing.
+
+Anyway, you get the idea. So in conclusion my basic approach is:
+
+1. Think about hardware break points for both GPU VRAM and CPU RAM.
+2. Decide if the model will likely be 100% on GPU or hybrid inferencing (or even CPU only).
+3. Test the biggest size quants that will allow me to hit my breakpoint targets for good quality.
+4. Test out some of them and compare against baseline `q4_0` and `q8_0` versions to see what is faster and lower perplexity KLD as well.
+5. Scale up with a larger model size and see if it works like I want.
+6. Iterate
+
+Also I'm very happy to fail. I've made many more quants that never saw the light of day than those that I upload to hugging face. Failure is half the fun. xD
+
+Cheers!
+
+---
+
+👤 **Thireus** commented on **2025-06-15** at **23:11:20**
+
+Thank you for the tips!
+
+> I pick a small model of similar architecture and make a bunch of quants. Then test them with llama-sweep-bench to empirically discover which ones are faster
+
+This! That was indeed going to be my next step. But I'm still very surprised to hear that there is not "general" universal quant benchmark, at least for CPU AVX-2 to give us an idea of what speed to expect for each quant. My assumption here is that it doesn't exist because would be vastly inaccurate and strongly dependent of config type... but I still find it surprising to be honest.
+
+Would you know a model that uses the same arch as DeepSeek R1-0528 that is relatively small?
+
+I just ran some benchmarks on: https://huggingface.co/Thireus/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-BF16-GGUF
+
+Here are the results:
+```
+GGUF Quant PP (t/s) with llama-sweep-bench -c 512
+q8_0 4936.13
+q4_1 4610.79
+iq4_nl 4604.94
+q4_0 4568.61
+q5_0 4473.73
+q5_1 4347.12
+q6_0 4334.24
+iq3_xxs 4084.95
+iq2_ks 3977.56
+iq3_s 3908.1
+iq4_xs 3890.67
+iq1_bn 3884.23
+iq6_k 3866.31
+iq2_bn 3866.19
+iq4_ks 3820.21
+iq2_k 3803.67
+iq3_k 3772.67
+iq4_ks_r4 3753.78
+iq1_m_r4 3749.02
+iq5_ks_r4 3702.12
+iq5_ks 3700.79
+iq4_k 3628.3
+iq5_k 3503.13
+iq1_m 3284.37
+iq2_k_r4 3202.56
+iq3_k_r4 3178.56
+iq4_kss 3093.66
+bf16 3051.4
+iq4_k_r4 3036.16
+iq5_k_r4 2988.56
+f32 2206.25
+q8_k_r8 2197.11
+q4_k_r4 2040.04
+f16 1950.84
+q2_k_r4 1886.5
+q5_k_r4 1880.66
+iq4_xs_r8 1764.99
+q6_k_r4 1753
+q3_k_r4 1725.95
+iq2_xs_r4 1584.74
+iq3_s_r4 1573.65
+iq2_xxs_r4 1468.21
+iq3_xxs_r4 1447.08
+iq2_bn_r4 1362.26
+q4_0_r8 1291.37
+q5_0_r4 1050.08
+q8_0_r8 1006.06
+q6_0_r4 996.71
+iq4_nl_r4 959.81
+iq2_xxs 54.81
+iq1_s 49.16
+iq1_s_r4 44.45
+iq2_xs 40.78
+iq2_s 38.96
+bf16_r16 DID NOT RUN
+iq2_kt DID NOT RUN
+iq3_kt DID NOT RUN
+iq4_kt DID NOT RUN
+iq2_m DID NOT QUANTIZE
+iq2_m_r4 DID NOT QUANTIZE
+iq3_kl DID NOT QUANTIZE
+iq3_m DID NOT QUANTIZE
+iq3_xs DID NOT QUANTIZE
+q2_k_s DID NOT QUANTIZE
+q3_k_l DID NOT QUANTIZE
+q3_k_m DID NOT QUANTIZE
+q3_k_s DID NOT QUANTIZE
+q4_0_4_4 DID NOT QUANTIZE
+q4_0_4_8 DID NOT QUANTIZE
+q4_0_8_8 DID NOT QUANTIZE
+q4_k_m DID NOT QUANTIZE
+q4_k_s DID NOT QUANTIZE
+q5_k_m DID NOT QUANTIZE
+q5_k_s DID NOT QUANTIZE
+q8_kv DID NOT QUANTIZE
+q8_kv_r8 DID NOT QUANTIZE
+```
+
+I've quantised these layers, and left all the others at q8_0:
+```
+blk\.([0-9]|1[0-9]|2[0-3])\.ffn_down\.weight=$_quant
+blk\.([0-9]|1[0-9]|2[0-3])\.ffn_gate\.weight=$_quant
+blk\.([0-9]|1[0-9]|2[0-3])\.ffn_norm\.weight=$_quant
+blk\.([0-9]|1[0-9]|2[0-3])\.ffn_up\.weight=$_quant
+```
+
+Basically I should avoid any quant below f32 from the bench results table above. But then there is `iq1_s_r4` which should have maybe been higher up in the list... but I suppose this is because the model's architecture is not the same...
+
+---
+
+👤 **ubergarm** commented on **2025-06-16** at **02:13:33**
> Would you know a model that uses the same arch as DeepSeek R1-0528 that is relatively small?
@@ -126,10 +521,10 @@ Yeah ik and folks use [DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/Deep
> Here are the results:
-Oh interesting, you ran made a lot of quants of that little 0.6B, very cool! Is this running on all layers offloaded on a single CUDA device with `--threads 1`? The `_r4` variants were mainly for CPU inferencing and didn't even work on CUDA [until a few weeks ago since PR461](https://github.com/ikawrakow/ik_llama.cpp/pull/461).
+Oh interesting, you made a lot of quants of that little 0.6B, very cool! Is this running on all layers offloaded on a single CUDA device with `--threads 1`? The `_r4` variants were mainly for CPU inferencing and didn't even work on CUDA [until a few weeks ago since PR461](https://github.com/ikawrakow/ik_llama.cpp/pull/461).
For DeepSeek-V2 architechture (R1-0528 etc) my strategy is:
-1. Keep all `attn/shexp` ready to run fully offloaded on GPU (iq5_ks is one of the best in my experiments in terms of speed/accuracy trade-offs). If someone wants to run pure-CPU, they can use `-rtr` or manually repack them to `_r4` for CPU optimizations.
+1. Keep all `attn/shexp` ready to run fully offloaded on GPU (iq5_ks is one of the best [in my experiments](https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13336629) in terms of speed/accuracy trade-offs). If someone wants to run pure-CPU, they can use `-rtr` or manually repack them to `_r4` for CPU optimizations.
2. I'm not for sure on `attn_k_b` but due to its shape you're restricted to like `q4_0` or `q6_0` etc. I believe it is technically redundant and I'm not for sure if it is possible to prune it or the corresponding `attn_` layers with the same data. More or less I keep it around the same BPW as my other `attn` tensors.
3. Keep all routed experts `-ot exps=CPU` as `_r4` variants assuming people will use hybrid inferencing with these layers on CPU/RAM. Originally when I did this, people could *not* add a few more layers onto GPU to fill up VRAM until ik bailed me out with the more recent PRs as mentioned. In the most ideal system customized to your exact hardware you'd calculate how many extra layers fit into your VRAM and quantize those as non `_r4` varieties leaving the remainder as `_r4`. This level of customization not practical for general purpose release to public huggingface though imo.
4. output.weight is also sometimes called "head" and often left at ~6bpw as it is not repeating. Seems like q6_K is fairly common, or iq6_k, or heck I'll leave it iq5_ks just to keep things consistent with my other tensors.
@@ -139,7 +534,7 @@ Hope that sheds some more light on things.
---
-👤 **ikawrakow** commented the **2025-06-16** at **10:18:19**:
+👤 **ikawrakow** commented on **2025-06-16** at **10:18:19**
@Thireus
@@ -171,25 +566,159 @@ For instance, as mentioned earlier, you should never ever, not even once, use `I
---
-👤 **saood06** commented the **2025-06-16** at **11:41:57**:
+👤 **Thireus** commented on **2025-06-16** at **11:36:12**
+
+Thank you @ubergarm and @ikawrakow - I'll switch to DeepSeek-V2-Lite so it can be a better representation of R1-0528
+
+The measurements I took were with partial offloading and latest ik_llama build. So I get a mix of GPU and CPU. But indeed those are not the speed of each quant, rather it gives an indication of which quant will slow down the overall speed perfs when used in a GPU+CPU mix.
+
+```
+for f in $(ls DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-THIREUS-*.gguf); do \
+echo $f:;
+CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,2,1 ~/ik_llama-main-b3800-79fc7dd-bin-win-cuda-12.8-x64/llama-sweep-bench -m $f -mla 3 -fa \
+ -amb 1024 \
+ -ctk f16 \
+ -c 512 \
+ -ngl 99 \
+ -ot "blk\.(3|4|5|6)\.ffn_.*=CUDA0" -ot "blk\..*\.ffn_.*=CPU" \
+ -b 4096 -ub 512 \
+ --threads 36 \
+ --main-gpu 0 2>&1 | grep "|"; \
+done
+```
+
+My current strategy remains to save as much time as possible in this quest of producing the most optimised GGUF for my hardware. So, anything that removes the human factor to perform any pre-assessment of which quants to use or not based on the hardware/model_architecture/quant_theory would help.
+
+I'm currently sitting on the Bruteforce method below:
+
+# Bruteforce method - Effort: Minimal (measured in hours) - Full automation with scripts - Drawback: Limited knowledge gain
+
+1. Loop through all quants and produce speed perf metrics for a small model with similar architecture for specific hardware that will run the LLM
+2. Triage results - blacklist quants with poor perfs
+3. Identify best quants based on speed and estimated resulting model size
+4. Produce variations of the big LLM with these quants and measure the PPL
+5. Identify best PPL/Size GGUF variant from resulting metrics
+
+# Smart method - Effort: High (measured in weeks/months) - Full manual - Drawback: Time (which I don't have)
+
+1. Learn about different quant methods (but first, find where this documentation is...)
+2. Understand the maths and drawbacks (but first, find where to get this knowledge from...)
+3. Dive into the llama code to understand what has been implemented and optimised
+4. Understand hardware limitations and optimisations (but first, find where this information is...)
+5. Identify the best theoretical quants for the model architecture and specific hardware that will run the LLM
+6. Produce GGUF
+
+---
+
+👤 **saood06** commented on **2025-06-16** at **11:41:57**
> 1. Learn about different quant methods (but first, find where this documentation is...)
-For each quant type you want to learn more about you can search for it. The `README` lists a lot of the newer one's alongside the PR they were introduced but there are often follow-up PRs that increase their speed.
+For each quant type you want to learn more about you can search for it [here](https://github.com/ikawrakow/ik_llama.cpp/pulls). The `README` lists a lot of the newer one's alongside the PR they were introduced but there are often follow-up PRs that increase their speed.
There is a method between the two in which you do the bruteforce method, but then focus your attention on select quants you want to learn more about.
---
-👤 **ikawrakow** commented the **2025-06-16** at **11:52:52**:
+👤 **ikawrakow** commented on **2025-06-16** at **11:52:52**
Your brute force method is unlikely to produce a meaningful outcome. You don't want to just find the quantization type that runs fastest on your hardware, but the quantization mix that runs the fastest **and satisfies a minimum quantization quality requirement**. Because, you know, the absolutely fastest model is the one that does no computation at all.
---
-👤 **Thireus** commented the **2025-06-19** at **15:56:45**:
+👤 **ubergarm** commented on **2025-06-18** at **20:43:22**
-Thank you for all the feedback. I am making small progress and I'm working towards a combination of quants that brings high speed (both prompt eval and new tokens) as well as reduced PPL on my hardware. I'm on Intel x299 and there are a lot of quants that really kill the speed (hence my initial high failure rate).
+@Thireus
+
+How you coming along? Things have changed a lot just in the past couple days with the enhanced CPU Prompt Processing in closed `PR531`, `PR533`, `PR534`.
+
+This seems to create three "tiers" of quant speed for CPU based PP from how I understand it reading `PR534` (specifically for CPUs supporting `avx2` instructions). Might be useful thing to keep in mind when designing quants for hybrid GPU+CPU inferencing as you're doing with your R1-0528. I'm also experimenting with some ~72B dense models now myself.
+
+Note that all three tiers are very optimized now relative to other forks. So this is mostly a distinction between the groups relative to each other on this fork.
+
+While there is still some variation within each "tier", the easiest way to tell quickly besides pulling up those PRs, is grep the code like so:
+
+
+
+👈 A Tier
+
+```bash
+$ cd ik_llama.cpp/ggml/src/iqk
+$ grep Q8_K_R8 iqk_mul_mat.cpp | grep type
+ case GGML_TYPE_IQ2_XXS: return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_XS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_S : return nrc_y >= 16 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_XXS: return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ4_XS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ1_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ1_M : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_Q2_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_Q3_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ4_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ4_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ5_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ5_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ6_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+```
+
+
+
+
+
+👈 B Tier
+
+```bash
+$ cd ik_llama.cpp/ggml/src/iqk
+$ grep Q8_0_R8 iqk_mul_mat.cpp | grep type
+ case GGML_TYPE_Q6_K : return nrc_y >= 64 ? GGML_TYPE_Q8_0_R8 : type;
+ case GGML_TYPE_Q4_0 : return nrc_y >= 32 ? GGML_TYPE_Q8_0_R8 : type;
+ case GGML_TYPE_Q5_0 : return nrc_y >= 32 ? GGML_TYPE_Q8_0_R8 : type;
+ case GGML_TYPE_Q6_0 : return nrc_y >= 32 ? GGML_TYPE_Q8_0_R8 : type;
+ case GGML_TYPE_IQ4_NL : return nrc_y >= 32 ? GGML_TYPE_Q8_0_R8 : type;
+ case GGML_TYPE_Q8_0 : return nrc_y >= 32 ? GGML_TYPE_Q8_0_R8 : type;
+ case GGML_TYPE_IQ2_KT : return nrc_y >= 32 ? GGML_TYPE_Q8_0_R8 : type;
+ case GGML_TYPE_IQ3_KT : return nrc_y >= 32 ? GGML_TYPE_Q8_0_R8 : type;
+ case GGML_TYPE_IQ4_KT : return nrc_y >= 32 ? GGML_TYPE_Q8_0_R8 : type;
+ case GGML_TYPE_IQ2_KT: return nrc_y >= 32 ? GGML_TYPE_Q8_0_R8 : type;
+ case GGML_TYPE_IQ4_KT: return nrc_y >= 32 ? GGML_TYPE_Q8_0_R8 : type;
+```
+
+
+
+
+
+👈 C Tier
+
+```bash
+$ cd ik_llama.cpp/ggml/src/iqk
+$ grep Q8_1 iqk_mul_mat.cpp | grep type
+ case GGML_TYPE_Q4_K : return nrc_y >= 32 ? GGML_TYPE_Q8_1 : type;
+ case GGML_TYPE_Q5_K : return nrc_y >= 32 ? GGML_TYPE_Q8_1 : type;
+ case GGML_TYPE_Q4_1 : return nrc_y >= 32 ? GGML_TYPE_Q8_1 : type;
+ case GGML_TYPE_Q5_1 : return nrc_y >= 32 ? GGML_TYPE_Q8_1 : type;
+```
+
+
+
+There is more to take into consideration than just PP speed on CPUs with avx2 support of course, like the GPU speeds for offloaded layers, perplexity, overall BPW as TG is generally memory i/o bound, etc. Just wanted to check it with you and also write this up to help my own brain process the changes haha...
+
+Finally no need to sweat it too much. I tested changed `token_embd/output` from the usual `q4_K/q6_K` to `iq4_k/iq6_k` and didn't see significant measurable differences in PPL/speed for just those two in my one test.
+
+Cheers!
+
+*EDIT*:
+
+The biggest upshot here for me is that the `_r4` row interleaved quants are no longer fastest for CPU inference in many situations especially for dense models or where batch sizes are large enough for MoEs.
+
+---
+
+👤 **Thireus** commented on **2025-06-19** at **15:56:45**
+
+Thank you for all the feedback. I am making small progress and I'm working towards a combination of quants that brings high speed (both prompt eval and new tokens) as well as reduced PPL on my hardware. I'm on Intel x299 and there are a lot of quants that really kill the CPU speed (hence my initial high failure rate).
The best model I was able to produce so far in terms of speed while maintaining a fair quality has the following characteristics:
- 214GB in size
@@ -197,15 +726,41 @@ The best model I was able to produce so far in terms of speed while maintaining
- 140.62 PP-512 (t/s)
- 6.21 t/s new tokens
-I have also found that I need a model that is around 240GB in size max. So I'm currently cooking some quant mixes to achieve this (this is where the gap on the diagram is).
+I have also found that I need a model that is around 240GB in size max. So I'm currently cooking some quant mixes to achieve this (this is where the gap on the graph is).

+Once I find the most optimum mix I'll upload the model, including the eval results and the secret recipe.
+
tl;dr: Still cooking.
---
-👤 **saood06** commented the **2025-06-19** at **19:52:26**:
+👤 **saood06** commented on **2025-06-19** at **18:03:41**
+
+> Once I find the most optimum mix I'll upload the model, including the eval results and the secret recipe.
+
+I don't get why they are called "secret recipes", even if not provided if a mix is, you can gguf-dump to get them (even if that is in a more inconvenient way than the custom regex used).
+
+If you share what your current working mix is then it would allow people to make suggestions on what you might want to change to use the ~26GB of extra budget you have. I have gone through the process you have with a lot of iterations optimizing for performance while maintaining my quality standard within a size budget (although my size budget was higher than yours).
+
+---
+
+👤 **ubergarm** commented on **2025-06-19** at **19:34:19**
+
+> I don't get why they are called "secret recipes"
+
+For myself at least, it is jest as I do my best to make my recipes known, easy to repeat, and provide imatrix data etc. And yes the gguf-dump is very useful. I'm not sure why huggingface throws "bad gguf magic number" for some of my quants but not others, as I like to look at a gguf before downloading it sometimes.
+
+Anyway, thanks as always for sharing all of your experience and guidance, you are very generous.
+
+Regarding "extra 26GB of budget" type stuff, I still wonder what the best way to add a little more fat to an otherwise fairly homogeneous quant. For example, using the normal pattern of ffn_down slightly larger than ffn_(gate|up) will hit a given size for a given quant type. If you want just a little more, is it best to increase like the first 8 layers one size? Then maybe the last few layers a little bigger? I've seen this done in some discussions, but even with the layer-similarity score I'm not sure how best to vary some layers over other layers other than lots of trial and error.
+
+Thanks!
+
+---
+
+👤 **saood06** commented on **2025-06-19** at **19:52:26**
> > I don't get why they are called "secret recipes"
>
@@ -231,7 +786,7 @@ My solution was to try to learn from not only my own trial and error but also ot
---
-👤 **Thireus** commented the **2025-06-28** at **15:57:07**:
+👤 **Thireus** commented on **2025-06-28** at **15:57:07**
Just wanted to share that I haven't given up, in fact I have made my first breakthrough today after a week of bruteforcing and auto-analysis to find the optimum quant combination, which allowed me to cook the following dynamic quant today:
@@ -246,7 +801,7 @@ I still need ~ 2 weeks worth of computing to achieve better results in speed and
---
-👤 **ubergarm** commented the **2025-06-28** at **16:31:22**:
+👤 **ubergarm** commented on **2025-06-28** at **16:31:22**
@Thireus
@@ -260,7 +815,7 @@ Cheers!
---
-👤 **Thireus** commented the **2025-07-02** at **22:20:22**:
+👤 **Thireus** commented on **2025-07-02** at **22:20:22**
Yes, I keep feeding the new quants to my automated scripts as soon as they are released/improved, so they can ingest them and see if they are of any good use. I've also fed the latest iq3_ks. I've also experimented with _kt.
@@ -271,7 +826,7 @@ I have created a script that can produce optimum mix recipes given a VRAM and RA
- 240GB in size
- 3.3471 +/- 0.01783 PPL
- 99.68 PP-512 (t/s)
-- 4.94 t/s new tokens
+- 5.43 t/s new tokens
Since I run my scripts on partial metrics, full metrics will be available in about 5-6 more days (I had made a mistake in my calibration dataset last week and had to redo all the computation), so there is still a bit of hope that I can reach slightly lower PPL for this size.
@@ -279,7 +834,7 @@ In the meantime, here's a zero-shot screensaver created by that mixture of quant
---
-👤 **Thireus** commented the **2025-07-11** at **11:23:19**:
+👤 **Thireus** commented on **2025-07-11** at **11:23:19**
MVP1 published - https://github.com/Thireus/GGUF-Tool-Suite
@@ -292,6 +847,6 @@ Example of quant mix recipe available [here](https://github.com/Thireus/GGUF-Too
- 113.10 t/s PP eval
- 5.70 t/s eval
-Config: 1x 5090 + 2x 3090 + i9 7980xe with 250GB DDR4
+Config: 1x 5090 + 2x 3090 + i9 9980xe with 256GB DDR4
Custom recipes can be produced within minutes for different VRAM and RAM requirements, see README file for basic instructions. Article coming soon.
\ No newline at end of file
diff --git a/github-data/pull_requests/496 - Quick hack add the MLA flag to llama_hparams.md b/github-data/pull_requests/496 - Quick hack add the MLA flag to llama_hparams.md
new file mode 100644
index 000000000..bdd573e3e
--- /dev/null
+++ b/github-data/pull_requests/496 - Quick hack add the MLA flag to llama_hparams.md
@@ -0,0 +1,25 @@
+## 🔀 [Pull Request #496](https://github.com/ikawrakow/ik_llama.cpp/pull/496) - Quick hack: add the MLA flag to llama_hparams
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/llama_hparams_add_mla` |
+| **Target Branch** | `main` |
+| **Created** | 2025-06-06 |
+| **Updated** | 2025-06-06 |
+
+---
+
+## 📄 Description
+
+_No description provided._
+
+---
+
+## 💬 Conversation
+
+👤 **saood06** commented on **2025-06-06** at **07:52:59**
+
+Like I mentioned earlier, for the prompt saving stuff you don't need this as you have access to the ctx object which contains cparams.
+
+Edit: I see you responded over there.
\ No newline at end of file
diff --git a/github-data/pull_requests/496 - Quick hack_ add the MLA flag to llama_hparams.md b/github-data/pull_requests/496 - Quick hack_ add the MLA flag to llama_hparams.md
deleted file mode 100644
index 1933b3cf7..000000000
--- a/github-data/pull_requests/496 - Quick hack_ add the MLA flag to llama_hparams.md
+++ /dev/null
@@ -1,7 +0,0 @@
-### 🔀 [#496](https://github.com/ikawrakow/ik_llama.cpp/pull/496) - Quick hack: add the MLA flag to llama_hparams
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-06-06 |
-| **Updated** | 2025-06-06 |
\ No newline at end of file
diff --git a/github-data/pull_requests/497 - Make prompt cache saving and restoring MLA aware.md b/github-data/pull_requests/497 - Make prompt cache saving and restoring MLA aware.md
index 93f827a4a..a00d17770 100644
--- a/github-data/pull_requests/497 - Make prompt cache saving and restoring MLA aware.md
+++ b/github-data/pull_requests/497 - Make prompt cache saving and restoring MLA aware.md
@@ -1,21 +1,24 @@
-### 🔀 [#497](https://github.com/ikawrakow/ik_llama.cpp/pull/497) - Make prompt cache saving and restoring MLA aware
+## 🔀 [Pull Request #497](https://github.com/ikawrakow/ik_llama.cpp/pull/497) - Make prompt cache saving and restoring MLA aware
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/MLA_prompt_save_restore_fix` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-06 |
| **Updated** | 2025-06-06 |
+| **Merged** | 2025-06-06 |
---
-#### Description
+## 📄 Description
Tested working with both a long (3.5K tokens) and a short prompt with both matching up in size with what is expected. The long prompt was also tested on a fresh launch of the server to ensure it gave output consistent with what would be expected given the information in the prompt.
-Closes #436
+Closes [#436](https://github.com/ikawrakow/ik_llama.cpp/issues/436)
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-06-06** at **08:33:36**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-06** at **08:33:36**
\ No newline at end of file
diff --git a/github-data/pull_requests/5 - Fusing a mat mul op followed by a scale op on the CPU.md b/github-data/pull_requests/5 - Fusing a mat mul op followed by a scale op on the CPU.md
index 4a7105795..c82918bb4 100644
--- a/github-data/pull_requests/5 - Fusing a mat mul op followed by a scale op on the CPU.md
+++ b/github-data/pull_requests/5 - Fusing a mat mul op followed by a scale op on the CPU.md
@@ -1,14 +1,16 @@
-### 🔀 [#5](https://github.com/ikawrakow/ik_llama.cpp/pull/5) - Fusing a mat mul op followed by a scale op on the CPU
+## 🔀 [Pull Request #5](https://github.com/ikawrakow/ik_llama.cpp/pull/5) - Fusing a mat mul op followed by a scale op on the CPU
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `ik/fuse_mul_mat_scale` |
+| **Target Branch** | `main` |
| **Created** | 2024-07-27 |
| **Updated** | 2025-02-08 |
---
-#### Description
+## 📄 Description
This is useful for Bitnet here we have almost all matrix multiplications be followed by scale operations.
As a result, we get a ~2% boost in Bitnet PP performance.
@@ -19,8 +21,8 @@ Given that Bitnet is just a niche thing for now, I'll just leave it on a draft P
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-02-08** at **14:27:07**:
+👤 **ikawrakow** commented on **2025-02-08** at **14:27:07**
I don't think I'll ever merge this.
\ No newline at end of file
diff --git a/github-data/pull_requests/50 - AVX2 Flash Attention 2.md b/github-data/pull_requests/50 - AVX2 Flash Attention 2.md
index d3c78bc1f..9f9d8b991 100644
--- a/github-data/pull_requests/50 - AVX2 Flash Attention 2.md
+++ b/github-data/pull_requests/50 - AVX2 Flash Attention 2.md
@@ -1,13 +1,16 @@
-### 🔀 [#50](https://github.com/ikawrakow/ik_llama.cpp/pull/50) - AVX2 Flash Attention 2
+## 🔀 [Pull Request #50](https://github.com/ikawrakow/ik_llama.cpp/pull/50) - AVX2 Flash Attention 2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/avx2_flash_attn_2` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-11 |
| **Updated** | 2024-09-11 |
+| **Merged** | 2024-09-11 |
---
-#### Description
+## 📄 Description
This PR adds the ability to use Q4_0, Q4_1 and Q8_0 for the kv-cache.
\ No newline at end of file
diff --git a/github-data/pull_requests/501 - Fix 499.md b/github-data/pull_requests/501 - Fix 499.md
new file mode 100644
index 000000000..29e439790
--- /dev/null
+++ b/github-data/pull_requests/501 - Fix 499.md
@@ -0,0 +1,34 @@
+## 🔀 [Pull Request #501](https://github.com/ikawrakow/ik_llama.cpp/pull/501) - Fix [#499](https://github.com/ikawrakow/ik_llama.cpp/issues/499)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_499` |
+| **Target Branch** | `main` |
+| **Created** | 2025-06-06 |
+| **Updated** | 2025-06-07 |
+| **Merged** | 2025-06-07 |
+
+---
+
+## 📄 Description
+
+_No description provided._
+
+---
+
+## 💬 Conversation
+
+👤 **randoentity** commented on **2025-06-06** at **17:03:13**
+
+It works! Thanks!
+
+version: 3731 (b4ce7da8) *edit: I rebased on main which is why the commit hash is off*
+
+```
+main: n_kv_max = 65536, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 6, n_threads_batch = 12
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 4096 | 1024 | 0 | 20.224 | 202.53 | 176.723 | 5.79 |
+```
\ No newline at end of file
diff --git a/github-data/pull_requests/501 - Fix _499.md b/github-data/pull_requests/501 - Fix _499.md
deleted file mode 100644
index 443657376..000000000
--- a/github-data/pull_requests/501 - Fix _499.md
+++ /dev/null
@@ -1,7 +0,0 @@
-### 🐛 [#501](https://github.com/ikawrakow/ik_llama.cpp/pull/501) - Fix [#499](https://github.com/ikawrakow/ik_llama.cpp/issues/499)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-06-06 |
-| **Updated** | 2025-06-07 |
\ No newline at end of file
diff --git a/github-data/pull_requests/502 - Add an endpoint that lists all the saved prompt caches to server.md b/github-data/pull_requests/502 - Add an endpoint that lists all the saved prompt caches to server.md
index 16bdca6d2..ae5c6fba6 100644
--- a/github-data/pull_requests/502 - Add an endpoint that lists all the saved prompt caches to server.md
+++ b/github-data/pull_requests/502 - Add an endpoint that lists all the saved prompt caches to server.md
@@ -1,14 +1,17 @@
-### 🔀 [#502](https://github.com/ikawrakow/ik_llama.cpp/pull/502) - Add an endpoint that lists all the saved prompt caches to server
+## 🔀 [Pull Request #502](https://github.com/ikawrakow/ik_llama.cpp/pull/502) - Add an endpoint that lists all the saved prompt caches to server
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/list_prompt_cache` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-06 |
| **Updated** | 2025-06-11 |
+| **Merged** | 2025-06-07 |
---
-#### Description
+## 📄 Description
Now that saving the prompt cache works this adds a way to query all the currently saved prompt caches.
@@ -16,13 +19,13 @@ This should be enough to be used by any front end. The only thing that may poten
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-06-07** at **05:18:57**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-07** at **05:18:57**
---
-👤 **saood06** commented the **2025-06-11** at **06:50:30**:
+👤 **saood06** commented on **2025-06-11** at **06:50:30**
>The only thing that may potentially be useful to be added is giving the prompt in an array based on how the prompt is tokenized.
@@ -32,4 +35,6 @@ I'm alleviating both of these by putting info about the model and numbering my s
The timestamp can be included trivially, but the model information as far as I can tell will be a breaking change to the session save format (there is some metadata included that prevents you from loading incompatible saves, but for the reasons listed above I don't think it is the best choice to output and use those, and they really aren't very human friendly).
-I really don't want to make a breaking change (not just because it would break old saves [unless converted] but it would also break support with mainline, unless they also chooses to adopt it).
\ No newline at end of file
+I really don't want to make a breaking change (not just because it would break old saves [unless converted] but it would also break support with mainline, unless they also chooses to adopt it).
+
+Edit: forgot to mention an endpoint allowing you to delete saved prompts might be worth adding.
\ No newline at end of file
diff --git a/github-data/pull_requests/504 - Add DRY and fix the server to use other new samplers..md b/github-data/pull_requests/504 - Add DRY and fix the server to use other new samplers.md
similarity index 53%
rename from github-data/pull_requests/504 - Add DRY and fix the server to use other new samplers..md
rename to github-data/pull_requests/504 - Add DRY and fix the server to use other new samplers.md
index f110bcccd..7d9f439ac 100644
--- a/github-data/pull_requests/504 - Add DRY and fix the server to use other new samplers..md
+++ b/github-data/pull_requests/504 - Add DRY and fix the server to use other new samplers.md
@@ -1,14 +1,16 @@
-### 🐛 [#504](https://github.com/ikawrakow/ik_llama.cpp/pull/504) - Add DRY and fix the server to use other new samplers.
+## 🔀 [Pull Request #504](https://github.com/ikawrakow/ik_llama.cpp/pull/504) - Add DRY and fix the server to use other new samplers.
| **Author** | `Ph0rk0z` |
| :--- | :--- |
| **State** | ✅ **Open** |
+| **Source Branch** | `main` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-07 |
| **Updated** | 2025-06-13 |
---
-#### Description
+## 📄 Description
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
@@ -24,45 +26,55 @@ There's also a spot in the header where sampler order array was never updated? D
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-06-08** at **05:11:07**:
+👤 **saood06** commented on **2025-06-08** at **05:11:07**
-I see you also used the Z-algorithm and the implementation looks the same but you stripped the comments explaining it and where it came from. Any reason for that choice?
+I see you also used the Z-algorithm and the implementation looks the same as llama.cpp (which was based on the implementation done in kobold.cpp) but you stripped the comments explaining it and where it came from.
+
+Any reason for that choice?
+
+---
+
+👤 **Ph0rk0z** commented on **2025-06-08** at **10:24:01**
+
+Nope. If you find a mistake or a place that needs a comment/credit/etc please point it out. All I care is that it works and we have dry.
---
-👤 **saood06** commented the **2025-06-08** at **10:55:44**:
+👤 **saood06** commented on **2025-06-08** at **10:55:44**
> Nope. If you find a mistake or a place that needs a comment/credit/etc please point it out.
I did point it out though?
-The whole code of `llama_sample_dry_impl` looks like the exact same as `llama_sampler_dry_apply` from the DRY PR of mainline link to the file in question from that PR here https://github.com/ggml-org/llama.cpp/pull/9702/files#diff-ccfd27e7598c9965070306d4c6baf3cb4bf844211d1d37d7c52b0d03c8624507
+The whole code of `llama_sample_dry_impl` looks like the exact same as `llama_sampler_dry_apply` from the DRY PR of mainline. Link to the PR (and specific file referenced) [here](https://github.com/ggml-org/llama.cpp/pull/9702/files#diff-ccfd27e7598c9965070306d4c6baf3cb4bf844211d1d37d7c52b0d03c8624507) as a reference.
But the difference is it is lacking pretty much all of comments (which contain attributions alongside a lot of helpful info) that are contained in the mainline PR.
+Initially I only looked at the Z-algorithm because I was keeping up with the initial DRY PR in mainline and I knew that it was stalled waiting for permission for that specific code allowed into an MIT project (as kobold.cpp is AGPL-3.0 ) but now I see that what I said applies to that entire function not just the Z-algorithm.
+
>All I care is that it works and we have dry.
That may be what you care about, but attribution and credit even when not required (I am not sure it is here, but IANAL) is a nice thing to give, and it looks especially bad considering it really does look like you copy and pasted the code and then removed the attributions and comments.
-I am not saying that is what you did (I can't know, so I won't assume), but it definitely does look that way considering the code is identical and that is not a good look.
+I am not saying that is what you did (I can't know, so I won't assume), but it definitely does look that way considering the code is identical (but the comments are not) and that is not a good look.
---
-👤 **ikawrakow** commented the **2025-06-08** at **11:28:21**:
+👤 **ikawrakow** commented on **2025-06-08** at **11:28:21**
I agree with @saood06. Let's not remove the credits and comments.
---
-👤 **Ph0rk0z** commented the **2025-06-08** at **11:46:41**:
+👤 **Ph0rk0z** commented on **2025-06-08** at **11:46:41**
It went through a LLM but you're working up some scenario where I actively went through and took them out. I'll put them back best I can.
---
-👤 **saood06** commented the **2025-06-08** at **11:57:42**:
+👤 **saood06** commented on **2025-06-08** at **11:57:42**
> It went through a LLM but you're working up some scenario where I actively went through and took them out.
@@ -72,7 +84,13 @@ Even if you didn't actively take them out (which I believe you when you say you
---
-👤 **saood06** commented the **2025-06-08** at **12:28:52**:
+👤 **Ph0rk0z** commented on **2025-06-08** at **12:04:21**
+
+It doesn't match their code 1:1 copy pasted.. putting the comments in sort of reveals that. Parts of it do. Its an amalgamation of the PR which was built from k.cpp which itself is probably based on pew and textgen webui code.
+
+---
+
+👤 **saood06** commented on **2025-06-08** at **12:28:52**
> It doesn't match their code 1:1 copy pasted.. putting the comments in sort of reveals that. Parts of it do. Its an amalgamation of the PR which was built from k.cpp which itself is probably based on pew and textgen webui code.
@@ -82,31 +100,33 @@ Thank you for putting in the work to make this PR I do appreciate it, sorry that
---
-👤 **saood06** submitted a review the **2025-06-08** at **12:30:47**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-06-08** at **12:30:47** on `src/llama-sampling.cpp`:
+👤 **saood06** started a conversation on `src/llama-sampling.cpp` on **2025-06-08** at **12:30:47**
-Is this correct? And even if it is why subtract one then add it?
+Is this correct? And even if it is why subtract one then add one?
----
+> 👤 **Ph0rk0z** replied on **2025-06-08** at **12:43:31**
+>
+> It's LLM jank. Model trying to follow the logic of the operation and show it, despite it being mathematically nonsensical.
-👤 **saood06** submitted a review the **2025-06-08** at **12:31:14**: 💬 `COMMENTED`
+> 👤 **saood06** replied on **2025-06-08** at **12:58:55**
+>
+> Yes, but that still doesn't answer my question of is it correct? It doesn't look equivalent to the reference implementation to me.
---
-👤 **saood06** commented during a code review the **2025-06-08** at **12:31:14** on `src/llama-sampling.cpp`:
+👤 **saood06** started a conversation on `src/llama-sampling.cpp` on **2025-06-08** at **12:31:14**
You accidentally duplicated this when pasting in the comment.
---
-👤 **Ph0rk0z** submitted a review the **2025-06-08** at **12:43:31**: 💬 `COMMENTED`
+👤 **saood06** commented on **2025-06-08** at **12:32:40**
+
+I haven't built or ran the code yet, don't have time to test it tonight.
---
-👤 **Ph0rk0z** commented the **2025-06-08** at **12:53:16**:
+👤 **Ph0rk0z** commented on **2025-06-08** at **12:53:16**
That's fine, I hope that more people test it than just us. Remember that dry removes/breaks up engrams not single word repetition. I'll pull changes from here back in and keep rolling with it. Also another reminder that anyone using XTC or n sigma on server was not having it apply. The parameters weren't there.
@@ -122,35 +142,25 @@ Need to figure out if new samplers all belong here in sampling.h too
llama_sampler_type::MIN_P,
llama_sampler_type::TEMPERATURE
};
-```
-
----
-
-👤 **saood06** submitted a review the **2025-06-08** at **12:58:55**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-06-08** at **12:58:55** on `src/llama-sampling.cpp`:
-
-Yes, but that still doesn't answer my question of is it correct? It doesn't look equivalent to the reference implementation to me.
+```
+
+Edit: this is default sampler order.. so makes no difference if you want no new samplers within it.
---
-👤 **saood06** submitted a review the **2025-06-08** at **13:04:40**: 💬 `COMMENTED`
+👤 **saood06** started a conversation on `src/llama-sampling.cpp` on **2025-06-08** at **13:04:40**
----
-
-👤 **saood06** submitted a review the **2025-06-08** at **13:06:56**: 💬 `COMMENTED`
+This also looks different from the reference implementation but also you never actual use ring_buffer let alone this method even though you do provide an implementation for it.
---
-👤 **saood06** commented during a code review the **2025-06-08** at **13:06:56** on `src/llama-sampling.h`:
+👤 **saood06** started a conversation on `src/llama-sampling.h` on **2025-06-08** at **13:06:56**
-The reference uses a ring_buffer for this and not a vector. You added an implementation for a ring_buffer but never used it
+The reference uses a ring_buffer for `dry_last_tokens` (`last_tokens` in the reference implementation) and not a vector. You added an implementation for a ring_buffer but never used it. If you want to use a vector (which could work but I feel like would end up being more complicated) for this than remove the ring_buffer implementation, but I do think you should try and get closer to the original implementation as they did use a ring_buffer for a reason.
---
-👤 **saood06** commented the **2025-06-08** at **13:09:57**:
+👤 **saood06** commented on **2025-06-08** at **13:09:57**
> It doesn't match their code 1:1 copy pasted
@@ -158,21 +168,31 @@ From my experience porting code from mainline it is usually easier to do that an
---
-👤 **Ph0rk0z** commented the **2025-06-08** at **13:11:22**:
+👤 **Ph0rk0z** commented on **2025-06-08** at **13:11:22**
Yea, in this case it is much much too different. I took several cracks at that and failed each time.
---
-👤 **saood06** commented the **2025-06-08** at **13:20:16**:
+👤 **saood06** commented on **2025-06-08** at **13:20:16**
> I haven't built or ran the code yet, don't have time to test it tonight.
-I did leave some more comments though just from reading the code, I don't think it is worth testing anyway until they are resolved.
+I did leave some more comments from just reading the code, I don't think it is worth testing anyway until they are resolved.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-06-08** at **13:57:00**
+
+It does compile and work while penalizing tokens per debug messages.
+
+why does it show *comments* as pending in gh?
+
+
---
-👤 **saood06** commented the **2025-06-09** at **10:31:04**:
+👤 **saood06** commented on **2025-06-09** at **10:31:04**
> why does it show _comments_ as pending in gh?
@@ -184,6 +204,6 @@ I'll approve or request changes based on that.
---
-👤 **Ph0rk0z** commented the **2025-06-09** at **10:38:59**:
+👤 **Ph0rk0z** commented on **2025-06-09** at **10:38:59**
I tried with the RB and it caused more problems. Unless there's some big slowdowns, its probably not worth it. Another "trick" directly from pew was to set high top_K like (i.e 100) and place it before DRY to speed everything up. I've been doing that on mainline since I heard about it. Here I already did DRY on/off and the t/s was the same. Probably the thing to look out for.
\ No newline at end of file
diff --git a/github-data/pull_requests/505 - New IQ4_KT trellis implementation.md b/github-data/pull_requests/505 - New IQ4_KT trellis implementation.md
index 7aa08dd90..08996380a 100644
--- a/github-data/pull_requests/505 - New IQ4_KT trellis implementation.md
+++ b/github-data/pull_requests/505 - New IQ4_KT trellis implementation.md
@@ -1,14 +1,17 @@
-### 🔀 [#505](https://github.com/ikawrakow/ik_llama.cpp/pull/505) - New IQ4_KT trellis implementation
+## 🔀 [Pull Request #505](https://github.com/ikawrakow/ik_llama.cpp/pull/505) - New IQ4_KT trellis implementation
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/new_iq4kt` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-08 |
| **Updated** | 2025-06-18 |
+| **Labels** | `Breaking change` |
---
-#### Description
+## 📄 Description
This PR adds a new version of `IQ4_KT` based on a new trellis.
@@ -30,9 +33,9 @@ What is the trick? If $v$ is an unsigned 32 bit integer and $A, B$ are unsigned
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-08** at **11:37:36**:
+👤 **ikawrakow** commented on **2025-06-08** at **11:37:36**
Here a plot of the pdf generated via the the new trellis (black dots) and a Gaussian fit (red line)
@@ -42,9 +45,9 @@ One would get an even better Gaussian by summing the bytes of two trellis values
---
-👤 **ubergarm** commented the **2025-06-08** at **19:45:04**:
+👤 **ubergarm** commented on **2025-06-08** at **19:45:04**
-This looks interesting, was thinking to test out this `iq4_kt` against my [ubergarm/gemma-3-27B-it-qat-iq4_ks](https://github.com/ikawrakow/ik_llama.cpp/discussions/334#discussioncomment-13374007) which is supposedly pretty good according to the linked discussion comment.
+This looks interesting, was thinking to test this new `iq4_kt` implementation against my [ubergarm/gemma-3-27B-it-qat-iq4_ks](https://github.com/ikawrakow/ik_llama.cpp/discussions/334#discussioncomment-13374007) which is supposedly pretty good according to the linked discussion comment.
I got it to compile CPU only e.g.
@@ -109,17 +112,1196 @@ gmake: *** [Makefile:146: all] Error 2
For fun I tried compiling an earlier commit `fb776ab` closer to the CUDA implementation, but same error. I tried moving the duplicated `break;` which didn't effect the error. I tried rebasing it on top of main which has the `IQ2_M_R4` functionality but same error.
-I see both `IQ4_KT = 155` and `GGML_TYPE_IQ4_KT 155` but don't know enough about c++ templates to figure out what I'm missing.
+I see both `IQ4_KT = 155` and `GGML_TYPE_IQ4_KT 155` but don't know enough about c++ templates to figure out what I'm missing.
+
+Same error on the remote 24x core thread ripper pro and my local arch linux box:
+- `gcc version 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)`
+- `gcc version 14.2.1 20250128 (GCC)`
+
+Maybe missing adding a file here?
+```
+$ find . -name mmq-instance-iq4_k*
+./ggml/src/ggml-cuda/template-instances/mmq-instance-iq4_ks.cu
+./ggml/src/ggml-cuda/template-instances/mmq-instance-iq4_k.cu
+./ggml/src/ggml-cuda/template-instances/mmq-instance-iq4_ks_r4.cu
+# no mmq-instance-iq4_kt.cu ?
+```
+
+Ahh yes.. Hrmm. I don't know how to run `python ./ggml/src/ggml-cuda/template-instances/generate_cu_files.py` so just did the dirty thing and made the following file:
+
+`./ggml/src/ggml-cuda/template-instances/mmq-instance-iq4_kt.cu`
+```
+// This file has been autogenerated by generate_cu_files.py, do not edit manually.
+
+#include "../mmq.cuh"
+
+DECL_MMQ_CASE(GGML_TYPE_IQ4_KT);
+```
---
-👤 **ikawrakow** commented the **2025-06-08** at **20:37:58**:
+👤 **ubergarm** commented on **2025-06-08** at **20:34:25**
+
+Now that it seems to compile okay, giving it a try quantizing `gemma-3-27B-it-qat-iq4_kt`
+
+My first attempt threw an `Oops Cluster N has no points` but seems to keep going okay:
+```bash
+[ 4/ 808] blk.0.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. cluster_points: Oops. Cluster 620 has no points: 0 3 2 1
+cluster_points: 1 out of 625 clusters dir not have any points
+cluster_points: Oops. Cluster 25 has no points: 1 2 1 0
+cluster_points: Oops. Cluster 124 has no points: 0 3 3 1
+cluster_points: Oops. Cluster 624 has no points: 0 0 3 1
+cluster_points: 3 out of 625 clusters dir not have any points
+size = 220.50 MiB -> 55.21 MiB
+[ 5/ 808] blk.0.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 M
+iB -> 55.21 MiB
+```
+
+Not sure what that means, so I'm making a new imatrix using the some extra stuff from exllamav3 on top of my usual to see if it still throws the `Oops` knowing it might be completely unrelated.
+
+Will update this with results...
+
+*EDIT*
+
+Okay, it finished. And as ik mentioned below the `Oops` is harmless and was still there with my new imatrix in a quick test.
+
+It finished cooking so gonna give it a test then update one more time.
+
+
+
+👈 Secret Recipe and Logs
+
+```bash
+#!/usr/bin/env bash
+
+# this script is a bit sloppy as left over from earlier experiments.
+# its mostly un-needed as can just pass the quant level in a simple command.
+# i don't recall why i was making attn_v.weight=q4_0 before?
+# but it seems to quantize to q4_kt without any complaints...
+
+custom=" 17:58:22 [4/1961]
+#####
+# Token embedding
+token_embd\.weight=q8_0
+
+#####
+# Prioritize attn Layers by Cosine Similarity Scores
+#blk.0.attn_k.weight, torch.bfloat16 --> BF16, shape = {5376, 2048}
+#blk.0.attn_output.weight, torch.bfloat16 --> BF16, shape = {4096, 5376}
+#blk.0.attn_q.weight, torch.bfloat16 --> BF16, shape = {5376, 4096}
+#blk.0.attn_v.weight, torch.bfloat16 --> BF16, shape = {5376, 2048}
+
+#blk.[0-9].attn_v.weight=q4_0
+#blk.[1-6][0-9].attn_v.weight=q4_0
+
+blk.[0-9].attn_v.weight=iq4_kt
+blk.[1-6][0-9].attn_v.weight=iq4_kt
+
+#####
+# Prioritize ffn Layers by Cosine Similarity Scores
+#blk.0.ffn_down.weight, torch.bfloat16 --> BF16, shape = {21504, 5376}
+#blk.0.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {5376, 21504}
+#blk.0.ffn_up.weight, torch.bfloat16 --> BF16, shape = {5376, 21504}
+"
+
+custom=$(
+ echo "$custom" | grep -v '^#' | \
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+
+ #--imatrix /mnt/raid/models/google/gemma-3-27b-it-qat-q4_0-unquantized/imatrix-gemma-3-27B-it-qat-unquantized-BF16-calibration_data_v5_rc.dat \
+ #--imatrix /mnt/raid/models/google/gemma-3-27b-it-qat-q4_0-unquantized/imatrix-gemma-3-27B-it-qat-unquantized-BF16-ubergarm-calibration-corpus-v02
+.dat \
+./build/bin/llama-quantize \
+ --token-embedding-type q8_0 \
+ --custom-q "$custom" \
+ --imatrix /mnt/raid/models/google/gemma-3-27b-it-qat-q4_0-unquantized/imatrix-gemma-3-27B-it-qat-unquantized-BF16-calibration_data_v5_rc.dat \
+ /mnt/raid/models/google/gemma-3-27b-it-qat-q4_0-unquantized/gemma-3-27B-it-qat-unquantized-BF16-00001-of-00002.gguf \
+ /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27B-it-qat-iq4_kt.gguf \
+ IQ4_KT \
+ 24
+
+
+main: build = 3748 (846c7b89)
+main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+main: quantizing '/mnt/raid/models/google/gemma-3-27b-it-qat-q4_0-unquantized/gemma-3-27B-it-qat-unquantized-BF16-00001-of-00002.gguf' to '/mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27B-it-qat-iq4_kt.gguf' as IQ4_KT using 24 threads
+llama_model_loader: additional 1 GGUFs metadata loaded.
+llama_model_loader: loaded meta data with 43 key-value pairs and 808 tensors from /mnt/raid/models/google/gemma-3-27b-it-qat-q4_0-unquantized/gemma-3-27B-it-qat-unquantized-BF16-00001-of-00002.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = gemma3
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Gemma 3 27b It Qat Q4_0 Unquantized
+llama_model_loader: - kv 3: general.finetune str = it-qat-unquantized
+llama_model_loader: - kv 4: general.basename str = gemma-3
+llama_model_loader: - kv 5: general.size_label str = 27B
+llama_model_loader: - kv 6: general.license str = gemma
+llama_model_loader: - kv 7: general.base_model.count u32 = 1
+llama_model_loader: - kv 8: general.base_model.0.name str = Gemma 3 27b It
+llama_model_loader: - kv 9: general.base_model.0.organization str = Google
+llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/google/gemma-3...
+llama_model_loader: - kv 11: general.tags arr[str,4] = ["gemma3", "gemma", "google", "image-...
+llama_model_loader: - kv 12: gemma3.context_length u32 = 131072
+llama_model_loader: - kv 13: gemma3.embedding_length u32 = 5376
+llama_model_loader: - kv 14: gemma3.block_count u32 = 62
+llama_model_loader: - kv 15: gemma3.feed_forward_length u32 = 21504
+llama_model_loader: - kv 16: gemma3.attention.head_count u32 = 32
+llama_model_loader: - kv 17: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 18: gemma3.attention.key_length u32 = 128
+llama_model_loader: - kv 19: gemma3.attention.value_length u32 = 128
+llama_model_loader: - kv 20: general.file_type u32 = 32
+llama_model_loader: - kv 21: gemma3.rope.freq_base f32 = 1000000.000000
+llama_model_loader: - kv 22: gemma3.attention.sliding_window u32 = 1024
+llama_model_loader: - kv 23: gemma3.attention.head_count_kv u32 = 16
+llama_model_loader: - kv 24: gemma3.rope.scaling.type str = linear
+llama_model_loader: - kv 25: gemma3.rope.scaling.factor f32 = 8.000000
+llama_model_loader: - kv 26: tokenizer.ggml.model str = llama
+llama_model_loader: - kv 27: tokenizer.ggml.pre str = default
+llama_model_loader: - kv 28: tokenizer.ggml.tokens arr[str,262208] = ["", "", "", "", ...
+llama_model_loader: - kv 29: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
+llama_model_loader: - kv 30: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
+llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 2
+llama_model_loader: - kv 32: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 33: tokenizer.ggml.unknown_token_id u32 = 3
+llama_model_loader: - kv 34: tokenizer.ggml.padding_token_id u32 = 0
+llama_model_loader: - kv 35: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 36: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 37: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
+llama_model_loader: - kv 38: tokenizer.ggml.add_space_prefix bool = false
+llama_model_loader: - kv 39: general.quantization_version u32 = 2
+llama_model_loader: - kv 40: split.no u16 = 0
+llama_model_loader: - kv 41: split.count u16 = 2
+llama_model_loader: - kv 42: split.tensors.count i32 = 808
+llama_model_loader: - type f32: 373 tensors
+llama_model_loader: - type bf16: 435 tensors
+================================ Have weights data with 434 entries
+[ 1/ 808] token_embd.weight - [ 5376, 262208, 1, 1], type = bf16, Using custom type q8_0 for tensor token_embd.weight
+
+====== llama_model_quantize_internal: did not find weights for token_embd.weight
+converting to q8_0 .. Adding custom rule token_embd\.weight -> q8_0
+Adding custom rule blk.[0-9].attn_v.weight -> iq4_kt
+Adding custom rule blk.[1-6][0-9].attn_v.weight -> iq4_kt
+load_imatrix: imatrix dataset='calibration_data_v5_rc.txt'
+load_imatrix: loaded 434 importance matrix entries from /mnt/raid/models/google/gemma-3-27b-it-qat-q4_0-unquantized/imatrix-gemma-3-27B-it-qat-unquantized-BF16-calibration_data_v5_rc.dat computed on 221 chunks
+prepare_imatrix: have 434 importance matrix entries
+size = 2688.66 MiB -> 1428.35 MiB
+[ 2/ 808] blk.0.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 3/ 808] blk.0.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 4/ 808] blk.0.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. cluster_points: Oops. Cluster 620 has no points: 0 3 2 1
+cluster_points: 1 out of 625 clusters dir not have any points
+cluster_points: Oops. Cluster 25 has no points: 1 2 1 0
+cluster_points: Oops. Cluster 124 has no points: 0 3 3 1
+cluster_points: Oops. Cluster 624 has no points: 0 0 3 1
+cluster_points: 3 out of 625 clusters dir not have any points
+size = 220.50 MiB -> 55.21 MiB
+[ 5/ 808] blk.0.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 6/ 808] blk.0.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 7/ 808] blk.0.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 8/ 808] blk.0.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 9/ 808] blk.0.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 10/ 808] blk.0.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 11/ 808] blk.0.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 12/ 808] blk.0.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 13/ 808] blk.0.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 14/ 808] blk.0.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 15/ 808] blk.1.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 16/ 808] blk.1.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 17/ 808] blk.1.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 18/ 808] blk.1.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 19/ 808] blk.1.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 20/ 808] blk.1.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 21/ 808] blk.1.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.1.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 22/ 808] blk.1.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 23/ 808] blk.1.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 24/ 808] blk.1.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 25/ 808] blk.1.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 26/ 808] blk.1.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 27/ 808] blk.1.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 28/ 808] blk.2.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 29/ 808] blk.2.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 30/ 808] blk.2.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 31/ 808] blk.2.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 32/ 808] blk.2.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 33/ 808] blk.2.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 34/ 808] blk.2.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 35/ 808] blk.2.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 36/ 808] blk.2.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 37/ 808] blk.2.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 38/ 808] blk.2.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 39/ 808] blk.2.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 40/ 808] blk.2.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.2.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 41/ 808] blk.3.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 42/ 808] blk.3.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 43/ 808] blk.3.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 44/ 808] blk.3.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 45/ 808] blk.3.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 46/ 808] blk.3.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 47/ 808] blk.3.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 48/ 808] blk.3.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 49/ 808] blk.3.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 50/ 808] blk.3.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 51/ 808] blk.3.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 52/ 808] blk.3.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 53/ 808] blk.3.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.3.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 54/ 808] blk.4.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 55/ 808] blk.4.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 56/ 808] blk.4.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 57/ 808] blk.4.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 58/ 808] blk.4.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 59/ 808] blk.4.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 60/ 808] blk.4.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 61/ 808] blk.4.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 62/ 808] blk.4.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 63/ 808] blk.4.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 64/ 808] blk.4.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 65/ 808] blk.4.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 66/ 808] blk.4.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.4.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 67/ 808] blk.5.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 68/ 808] blk.5.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 69/ 808] blk.5.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 70/ 808] blk.5.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 71/ 808] blk.5.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 72/ 808] blk.5.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 73/ 808] blk.5.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 74/ 808] blk.5.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 75/ 808] blk.5.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 76/ 808] blk.5.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 77/ 808] blk.5.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 78/ 808] blk.5.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 79/ 808] blk.5.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.5.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 80/ 808] blk.6.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 81/ 808] blk.6.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 82/ 808] blk.6.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 83/ 808] blk.6.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 84/ 808] blk.6.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 85/ 808] blk.6.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 86/ 808] blk.6.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 87/ 808] blk.6.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 88/ 808] blk.6.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 89/ 808] blk.6.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 90/ 808] blk.6.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 91/ 808] blk.6.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 92/ 808] blk.6.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.6.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 93/ 808] blk.7.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 94/ 808] blk.7.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 95/ 808] blk.7.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 96/ 808] blk.7.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 97/ 808] blk.7.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 98/ 808] blk.7.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 99/ 808] blk.7.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.7.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 100/ 808] blk.10.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 101/ 808] blk.10.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 102/ 808] blk.10.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 103/ 808] blk.10.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 104/ 808] blk.10.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 105/ 808] blk.10.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 106/ 808] blk.10.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 107/ 808] blk.10.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 108/ 808] blk.10.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 109/ 808] blk.10.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 110/ 808] blk.10.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 111/ 808] blk.10.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 112/ 808] blk.10.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.10.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 113/ 808] blk.11.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 114/ 808] blk.11.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 115/ 808] blk.11.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 116/ 808] blk.11.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 117/ 808] blk.11.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 118/ 808] blk.11.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 119/ 808] blk.11.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 120/ 808] blk.11.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 121/ 808] blk.11.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 122/ 808] blk.11.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 123/ 808] blk.11.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 124/ 808] blk.11.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 125/ 808] blk.11.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.11.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 126/ 808] blk.12.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 127/ 808] blk.12.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 128/ 808] blk.12.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 129/ 808] blk.12.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 130/ 808] blk.12.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 131/ 808] blk.12.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 132/ 808] blk.12.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 133/ 808] blk.12.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 134/ 808] blk.12.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 135/ 808] blk.12.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 136/ 808] blk.12.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 137/ 808] blk.12.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 138/ 808] blk.12.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.12.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 139/ 808] blk.13.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 140/ 808] blk.13.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 141/ 808] blk.13.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 142/ 808] blk.13.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 143/ 808] blk.13.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 144/ 808] blk.13.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 145/ 808] blk.13.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.13.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 146/ 808] blk.7.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 147/ 808] blk.7.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 148/ 808] blk.7.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 149/ 808] blk.7.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 150/ 808] blk.7.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 151/ 808] blk.7.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 152/ 808] blk.8.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 153/ 808] blk.8.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 154/ 808] blk.8.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 155/ 808] blk.8.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 156/ 808] blk.8.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 157/ 808] blk.8.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 158/ 808] blk.8.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 159/ 808] blk.8.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 160/ 808] blk.8.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 161/ 808] blk.8.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 162/ 808] blk.8.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 163/ 808] blk.8.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 164/ 808] blk.8.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.8.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 165/ 808] blk.9.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 166/ 808] blk.9.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 167/ 808] blk.9.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 168/ 808] blk.9.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 169/ 808] blk.9.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 170/ 808] blk.9.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 171/ 808] blk.9.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 172/ 808] blk.9.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 173/ 808] blk.9.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 174/ 808] blk.9.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 175/ 808] blk.9.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 176/ 808] blk.9.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 177/ 808] blk.9.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.9.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 178/ 808] blk.13.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 179/ 808] blk.13.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 180/ 808] blk.13.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 181/ 808] blk.13.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 182/ 808] blk.13.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 183/ 808] blk.13.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 184/ 808] blk.14.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 185/ 808] blk.14.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 186/ 808] blk.14.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 187/ 808] blk.14.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 188/ 808] blk.14.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 189/ 808] blk.14.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 190/ 808] blk.14.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 191/ 808] blk.14.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 192/ 808] blk.14.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 193/ 808] blk.14.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 194/ 808] blk.14.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 195/ 808] blk.14.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 196/ 808] blk.14.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.14.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 197/ 808] blk.15.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 198/ 808] blk.15.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 199/ 808] blk.15.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 200/ 808] blk.15.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 201/ 808] blk.15.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 202/ 808] blk.15.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 203/ 808] blk.15.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 204/ 808] blk.15.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 205/ 808] blk.15.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 206/ 808] blk.15.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 207/ 808] blk.15.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 208/ 808] blk.15.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 209/ 808] blk.15.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.15.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 210/ 808] blk.16.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 211/ 808] blk.16.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 212/ 808] blk.16.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 213/ 808] blk.16.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 214/ 808] blk.16.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 215/ 808] blk.16.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 216/ 808] blk.16.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 217/ 808] blk.16.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 218/ 808] blk.16.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 219/ 808] blk.16.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 220/ 808] blk.16.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 221/ 808] blk.16.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 222/ 808] blk.16.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.16.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 223/ 808] blk.17.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 224/ 808] blk.17.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 225/ 808] blk.17.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 226/ 808] blk.17.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 227/ 808] blk.17.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 228/ 808] blk.17.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 229/ 808] blk.17.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 230/ 808] blk.17.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 231/ 808] blk.17.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 232/ 808] blk.17.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 233/ 808] blk.17.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 234/ 808] blk.17.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 235/ 808] blk.17.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.17.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 236/ 808] blk.18.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 237/ 808] blk.18.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 238/ 808] blk.18.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 239/ 808] blk.18.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 240/ 808] blk.18.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 241/ 808] blk.18.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 242/ 808] blk.18.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 243/ 808] blk.18.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 244/ 808] blk.18.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 245/ 808] blk.18.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 246/ 808] blk.18.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 247/ 808] blk.18.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 248/ 808] blk.18.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.18.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 249/ 808] blk.19.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 250/ 808] blk.19.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 251/ 808] blk.19.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 252/ 808] blk.19.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 253/ 808] blk.19.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 254/ 808] blk.19.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 255/ 808] blk.19.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.19.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 256/ 808] blk.19.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 257/ 808] blk.19.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 258/ 808] blk.19.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 259/ 808] blk.19.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 260/ 808] blk.19.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 261/ 808] blk.19.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 262/ 808] blk.20.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 263/ 808] blk.20.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 264/ 808] blk.20.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 265/ 808] blk.20.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 266/ 808] blk.20.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 267/ 808] blk.20.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 268/ 808] blk.20.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 269/ 808] blk.20.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 270/ 808] blk.20.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 271/ 808] blk.20.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 272/ 808] blk.20.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 273/ 808] blk.20.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 274/ 808] blk.20.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.20.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 275/ 808] blk.21.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 276/ 808] blk.21.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 277/ 808] blk.21.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 278/ 808] blk.21.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 279/ 808] blk.21.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 280/ 808] blk.21.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 281/ 808] blk.21.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 282/ 808] blk.21.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 283/ 808] blk.21.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 284/ 808] blk.21.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 285/ 808] blk.21.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 286/ 808] blk.21.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 287/ 808] blk.21.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.21.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 288/ 808] blk.22.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 289/ 808] blk.22.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 290/ 808] blk.22.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 291/ 808] blk.22.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 292/ 808] blk.22.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 293/ 808] blk.22.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 294/ 808] blk.22.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 295/ 808] blk.22.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 296/ 808] blk.22.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 297/ 808] blk.22.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 298/ 808] blk.22.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 299/ 808] blk.22.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 300/ 808] blk.22.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.22.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 301/ 808] blk.23.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 302/ 808] blk.23.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 303/ 808] blk.23.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 304/ 808] blk.23.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 305/ 808] blk.23.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 306/ 808] blk.23.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 307/ 808] blk.23.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 308/ 808] blk.23.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 309/ 808] blk.23.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 310/ 808] blk.23.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 311/ 808] blk.23.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 312/ 808] blk.23.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 313/ 808] blk.23.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.23.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 314/ 808] blk.24.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 315/ 808] blk.24.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 316/ 808] blk.24.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 317/ 808] blk.24.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 318/ 808] blk.24.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 319/ 808] blk.24.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 320/ 808] blk.24.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 321/ 808] blk.24.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 322/ 808] blk.24.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 323/ 808] blk.24.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 324/ 808] blk.24.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 325/ 808] blk.24.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 326/ 808] blk.24.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.24.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 327/ 808] blk.25.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 328/ 808] blk.25.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 329/ 808] blk.25.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 330/ 808] blk.25.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 331/ 808] blk.25.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 332/ 808] blk.25.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 333/ 808] blk.25.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.25.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 334/ 808] blk.25.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 335/ 808] blk.25.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 336/ 808] blk.25.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 337/ 808] blk.25.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 338/ 808] blk.25.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 339/ 808] blk.25.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 340/ 808] blk.26.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 341/ 808] blk.26.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 342/ 808] blk.26.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 343/ 808] blk.26.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 344/ 808] blk.26.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 345/ 808] blk.26.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 346/ 808] blk.26.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 347/ 808] blk.26.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 348/ 808] blk.26.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 349/ 808] blk.26.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 350/ 808] blk.26.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 351/ 808] blk.26.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 352/ 808] blk.26.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.26.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 353/ 808] blk.27.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 354/ 808] blk.27.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 355/ 808] blk.27.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 356/ 808] blk.27.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 357/ 808] blk.27.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 358/ 808] blk.27.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 359/ 808] blk.27.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 360/ 808] blk.27.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 361/ 808] blk.27.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 362/ 808] blk.27.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 363/ 808] blk.27.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 364/ 808] blk.27.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 365/ 808] blk.27.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.27.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 366/ 808] blk.28.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 367/ 808] blk.28.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 368/ 808] blk.28.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 369/ 808] blk.28.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 370/ 808] blk.28.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 371/ 808] blk.28.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 372/ 808] blk.28.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 373/ 808] blk.28.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 374/ 808] blk.28.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 375/ 808] blk.28.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 376/ 808] blk.28.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 377/ 808] blk.28.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 378/ 808] blk.28.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.28.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 379/ 808] blk.29.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 380/ 808] blk.29.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 381/ 808] blk.29.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 382/ 808] blk.29.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 383/ 808] blk.29.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 384/ 808] blk.29.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 385/ 808] blk.29.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 386/ 808] blk.29.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 387/ 808] blk.29.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 388/ 808] blk.29.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 389/ 808] blk.29.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 390/ 808] blk.29.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 391/ 808] blk.29.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.29.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 392/ 808] blk.30.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 393/ 808] blk.30.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 394/ 808] blk.30.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 395/ 808] blk.30.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 396/ 808] blk.30.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 397/ 808] blk.30.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 398/ 808] blk.30.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 399/ 808] blk.30.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 400/ 808] blk.30.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 401/ 808] blk.30.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 402/ 808] blk.30.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 403/ 808] blk.30.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 404/ 808] blk.30.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.30.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 405/ 808] blk.31.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 406/ 808] blk.31.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 407/ 808] blk.31.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 408/ 808] blk.31.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 409/ 808] blk.31.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 410/ 808] blk.31.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 411/ 808] blk.31.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.31.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 412/ 808] blk.31.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 413/ 808] blk.31.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 414/ 808] blk.31.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 415/ 808] blk.31.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 416/ 808] blk.31.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 417/ 808] blk.31.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 418/ 808] blk.32.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 419/ 808] blk.32.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 420/ 808] blk.32.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 421/ 808] blk.32.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 422/ 808] blk.32.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 423/ 808] blk.32.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 424/ 808] blk.32.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 425/ 808] blk.32.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 426/ 808] blk.32.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 427/ 808] blk.32.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 428/ 808] blk.32.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 429/ 808] blk.32.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 430/ 808] blk.32.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.32.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 431/ 808] blk.33.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 432/ 808] blk.33.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 433/ 808] blk.33.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 434/ 808] blk.33.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 435/ 808] blk.33.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 436/ 808] blk.33.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 437/ 808] blk.33.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 438/ 808] blk.33.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 439/ 808] blk.33.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 440/ 808] blk.33.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 441/ 808] blk.33.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 442/ 808] blk.33.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 443/ 808] blk.33.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.33.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 444/ 808] blk.34.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 445/ 808] blk.34.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 446/ 808] blk.34.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 447/ 808] blk.34.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 448/ 808] blk.34.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 449/ 808] blk.34.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 450/ 808] blk.34.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 451/ 808] blk.34.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 452/ 808] blk.34.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 453/ 808] blk.34.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 454/ 808] blk.34.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 455/ 808] blk.34.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 456/ 808] blk.34.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.34.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 457/ 808] blk.35.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 458/ 808] blk.35.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 459/ 808] blk.35.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 460/ 808] blk.35.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 461/ 808] blk.35.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 462/ 808] blk.35.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 463/ 808] blk.35.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 464/ 808] blk.35.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 465/ 808] blk.35.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 466/ 808] blk.35.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 467/ 808] blk.35.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 468/ 808] blk.35.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 469/ 808] blk.35.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.35.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 470/ 808] blk.36.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 471/ 808] blk.36.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 472/ 808] blk.36.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 473/ 808] blk.36.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 474/ 808] blk.36.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 475/ 808] blk.36.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 476/ 808] blk.36.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 477/ 808] blk.36.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 478/ 808] blk.36.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 479/ 808] blk.36.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 480/ 808] blk.36.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 481/ 808] blk.36.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 482/ 808] blk.36.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.36.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 483/ 808] blk.37.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 484/ 808] blk.37.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 485/ 808] blk.37.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 486/ 808] blk.37.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 487/ 808] blk.37.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 488/ 808] blk.37.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 489/ 808] blk.37.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.37.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 490/ 808] blk.37.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 491/ 808] blk.37.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 492/ 808] blk.37.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 493/ 808] blk.37.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 494/ 808] blk.37.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 495/ 808] blk.37.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 496/ 808] blk.38.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 497/ 808] blk.38.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 498/ 808] blk.38.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 499/ 808] blk.38.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 500/ 808] blk.38.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 501/ 808] blk.38.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 502/ 808] blk.38.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 503/ 808] blk.38.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 504/ 808] blk.38.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 505/ 808] blk.38.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 506/ 808] blk.38.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 507/ 808] blk.38.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 508/ 808] blk.38.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.38.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 509/ 808] blk.39.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 510/ 808] blk.39.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 511/ 808] blk.39.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 512/ 808] blk.39.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 513/ 808] blk.39.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 514/ 808] blk.39.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 515/ 808] blk.39.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 516/ 808] blk.39.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 517/ 808] blk.39.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 518/ 808] blk.39.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 519/ 808] blk.39.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 520/ 808] blk.39.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 521/ 808] blk.39.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.39.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 522/ 808] blk.40.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 523/ 808] blk.40.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 524/ 808] blk.40.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 525/ 808] blk.40.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 526/ 808] blk.40.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 527/ 808] blk.40.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 528/ 808] blk.40.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 529/ 808] blk.40.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 530/ 808] blk.40.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 531/ 808] blk.40.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 532/ 808] blk.40.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 533/ 808] blk.40.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 534/ 808] blk.40.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.40.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 535/ 808] blk.41.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 536/ 808] blk.41.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 537/ 808] blk.41.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 538/ 808] blk.41.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 539/ 808] blk.41.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 540/ 808] blk.41.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 541/ 808] blk.41.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 542/ 808] blk.41.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 543/ 808] blk.41.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 544/ 808] blk.41.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 545/ 808] blk.41.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 546/ 808] blk.41.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 547/ 808] blk.41.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.41.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 548/ 808] blk.42.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 549/ 808] blk.42.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 550/ 808] blk.42.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 551/ 808] blk.42.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 552/ 808] blk.42.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 553/ 808] blk.42.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 554/ 808] blk.42.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 555/ 808] blk.42.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 556/ 808] blk.42.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 557/ 808] blk.42.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 558/ 808] blk.42.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 559/ 808] blk.42.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 560/ 808] blk.42.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.42.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 561/ 808] blk.43.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 562/ 808] blk.43.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 563/ 808] blk.43.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 564/ 808] blk.43.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 565/ 808] blk.43.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 566/ 808] blk.43.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 567/ 808] blk.43.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.43.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 568/ 808] blk.43.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 569/ 808] blk.43.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 570/ 808] blk.43.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 571/ 808] blk.43.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 572/ 808] blk.43.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 573/ 808] blk.43.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 574/ 808] blk.44.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 575/ 808] blk.44.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 576/ 808] blk.44.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 577/ 808] blk.44.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 578/ 808] blk.44.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 579/ 808] blk.44.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 580/ 808] blk.44.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 581/ 808] blk.44.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 582/ 808] blk.44.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 583/ 808] blk.44.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 584/ 808] blk.44.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 585/ 808] blk.44.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 586/ 808] blk.44.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.44.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 587/ 808] blk.45.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 588/ 808] blk.45.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 589/ 808] blk.45.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 590/ 808] blk.45.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 591/ 808] blk.45.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 592/ 808] blk.45.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 593/ 808] blk.45.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 594/ 808] blk.45.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 595/ 808] blk.45.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 596/ 808] blk.45.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 597/ 808] blk.45.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 598/ 808] blk.45.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 599/ 808] blk.45.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.45.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 600/ 808] blk.46.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 601/ 808] blk.46.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 602/ 808] blk.46.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 603/ 808] blk.46.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 604/ 808] blk.46.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 605/ 808] blk.46.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 606/ 808] blk.46.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 607/ 808] blk.46.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 608/ 808] blk.46.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 609/ 808] blk.46.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 610/ 808] blk.46.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 611/ 808] blk.46.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 612/ 808] blk.46.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.46.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 613/ 808] blk.47.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 614/ 808] blk.47.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 615/ 808] blk.47.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 616/ 808] blk.47.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 617/ 808] blk.47.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 618/ 808] blk.47.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 619/ 808] blk.47.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 620/ 808] blk.47.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 621/ 808] blk.47.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 622/ 808] blk.47.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 623/ 808] blk.47.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 624/ 808] blk.47.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 625/ 808] blk.47.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.47.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 626/ 808] blk.48.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 627/ 808] blk.48.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 628/ 808] blk.48.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 629/ 808] blk.48.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 630/ 808] blk.48.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 631/ 808] blk.48.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 632/ 808] blk.48.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 633/ 808] blk.48.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 634/ 808] blk.48.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 635/ 808] blk.48.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 636/ 808] blk.48.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 637/ 808] blk.48.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 638/ 808] blk.48.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.48.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 639/ 808] blk.49.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 640/ 808] blk.49.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 641/ 808] blk.49.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 642/ 808] blk.49.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 643/ 808] blk.49.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 644/ 808] blk.49.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 645/ 808] blk.49.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.49.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 646/ 808] blk.49.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 647/ 808] blk.49.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 648/ 808] blk.49.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 649/ 808] blk.49.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 650/ 808] blk.49.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 651/ 808] blk.49.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 652/ 808] blk.50.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 653/ 808] blk.50.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 654/ 808] blk.50.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 655/ 808] blk.50.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 656/ 808] blk.50.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 657/ 808] blk.50.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 658/ 808] blk.50.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 659/ 808] blk.50.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 660/ 808] blk.50.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 661/ 808] blk.50.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 662/ 808] blk.50.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 663/ 808] blk.50.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 664/ 808] blk.50.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.50.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 665/ 808] blk.51.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 666/ 808] blk.51.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 667/ 808] blk.51.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 668/ 808] blk.51.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 669/ 808] blk.51.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 670/ 808] blk.51.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 671/ 808] blk.51.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 672/ 808] blk.51.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 673/ 808] blk.51.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 674/ 808] blk.51.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 675/ 808] blk.51.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 676/ 808] blk.51.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 677/ 808] blk.51.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.51.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 678/ 808] blk.52.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 679/ 808] blk.52.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 680/ 808] blk.52.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 681/ 808] blk.52.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 682/ 808] blk.52.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 683/ 808] blk.52.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 684/ 808] blk.52.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 685/ 808] blk.52.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 686/ 808] blk.52.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 687/ 808] blk.52.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 688/ 808] blk.52.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 689/ 808] blk.52.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 690/ 808] blk.52.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.52.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 691/ 808] blk.53.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 692/ 808] blk.53.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 693/ 808] blk.53.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 694/ 808] blk.53.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 695/ 808] blk.53.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 696/ 808] blk.53.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 697/ 808] blk.53.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 698/ 808] blk.53.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 699/ 808] blk.53.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 700/ 808] blk.53.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 701/ 808] blk.53.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 702/ 808] blk.53.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 703/ 808] blk.53.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.53.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 704/ 808] blk.54.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 705/ 808] blk.54.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 706/ 808] blk.54.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 707/ 808] blk.54.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 708/ 808] blk.54.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 709/ 808] blk.54.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 710/ 808] blk.54.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 711/ 808] blk.54.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 712/ 808] blk.54.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 713/ 808] blk.54.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 714/ 808] blk.54.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 715/ 808] blk.54.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 716/ 808] blk.54.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.54.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 717/ 808] blk.55.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 718/ 808] blk.55.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 719/ 808] blk.55.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 720/ 808] blk.55.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 721/ 808] blk.55.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 722/ 808] blk.55.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 723/ 808] blk.55.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.55.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 724/ 808] blk.55.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 725/ 808] blk.55.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 726/ 808] blk.55.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 727/ 808] blk.55.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 728/ 808] blk.55.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 729/ 808] blk.55.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 730/ 808] blk.56.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 731/ 808] blk.56.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 732/ 808] blk.56.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 733/ 808] blk.56.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 734/ 808] blk.56.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 735/ 808] blk.56.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 736/ 808] blk.56.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 737/ 808] blk.56.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 738/ 808] blk.56.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 739/ 808] blk.56.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 740/ 808] blk.56.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 741/ 808] blk.56.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 742/ 808] blk.56.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.56.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 743/ 808] blk.57.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 744/ 808] blk.57.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 745/ 808] blk.57.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 746/ 808] blk.57.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 747/ 808] blk.57.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 748/ 808] blk.57.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 749/ 808] blk.57.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 750/ 808] blk.57.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 751/ 808] blk.57.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 752/ 808] blk.57.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 753/ 808] blk.57.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 754/ 808] blk.57.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 755/ 808] blk.57.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.57.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 756/ 808] blk.58.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 757/ 808] blk.58.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 758/ 808] blk.58.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 759/ 808] blk.58.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 760/ 808] blk.58.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 761/ 808] blk.58.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 762/ 808] blk.58.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 763/ 808] blk.58.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 764/ 808] blk.58.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 765/ 808] blk.58.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 766/ 808] blk.58.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 767/ 808] blk.58.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 768/ 808] blk.58.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.58.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 769/ 808] blk.59.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 770/ 808] blk.59.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 771/ 808] blk.59.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 772/ 808] blk.59.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 773/ 808] blk.59.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 774/ 808] blk.59.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 775/ 808] blk.59.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 776/ 808] blk.59.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 777/ 808] blk.59.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 778/ 808] blk.59.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 779/ 808] blk.59.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 780/ 808] blk.59.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 781/ 808] blk.59.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.59.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 782/ 808] blk.60.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 783/ 808] blk.60.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 784/ 808] blk.60.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 785/ 808] blk.60.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 786/ 808] blk.60.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 787/ 808] blk.60.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 788/ 808] blk.60.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 789/ 808] blk.60.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 790/ 808] blk.60.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 791/ 808] blk.60.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 792/ 808] blk.60.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 793/ 808] blk.60.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 794/ 808] blk.60.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.60.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 795/ 808] blk.61.ffn_gate.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 796/ 808] blk.61.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 797/ 808] blk.61.attn_k.weight - [ 5376, 2048, 1, 1], type = bf16, converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 798/ 808] blk.61.attn_output.weight - [ 4096, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 799/ 808] blk.61.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 800/ 808] blk.61.attn_q.weight - [ 5376, 4096, 1, 1], type = bf16, converting to iq4_kt .. size = 42.00 MiB -> 10.52 MiB
+[ 801/ 808] blk.61.attn_v.weight - [ 5376, 2048, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.61.attn_v.weight
+converting to iq4_kt .. size = 21.00 MiB -> 5.26 MiB
+[ 802/ 808] blk.61.attn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 803/ 808] blk.61.ffn_down.weight - [21504, 5376, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.15 MiB
+[ 804/ 808] blk.61.ffn_up.weight - [ 5376, 21504, 1, 1], type = bf16, converting to iq4_kt .. size = 220.50 MiB -> 55.21 MiB
+[ 805/ 808] blk.61.post_attention_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 806/ 808] blk.61.post_ffw_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 807/ 808] blk.61.ffn_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+[ 808/ 808] output_norm.weight - [ 5376, 1, 1, 1], type = f32, size = 0.021 MB
+llama_model_quantize_internal: model size = 51518.82 MB
+llama_model_quantize_internal: quant size = 13654.42 MB
+
+main: quantize time = 1143720.04 ms
+main: total time = 1143720.04 ms
+```
+
+
+
+
+
+
+👈 Perplexity Command
+
+Perplexity
+```
+./build/bin/llama-perplexity \
+ --model /mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27B-it-qat-iq4_kt.gguf \
+ --ctx-size 512 \
+ --ubatch-size 512 \
+ -f wiki.test.raw \
+ --seed 1337 \
+ --n-gpu-layers 99 \
+ --threads 1
+```
+
+
+
+I ended up using the same imatrix.dat file for both.
+
+* gemma-3-27B-it-qat-iq4_kt
+ - 13.334 GiB (4.241 BPW)
+ - Final estimate: PPL = 8.3431 +/- 0.06508
+
+* gemma-3-27B-it-qat-iq4_ks
+ - 14.099 GiB (4.484 BPW) **attn_k_b at q4_0 i forget why**
+ - Final estimate: PPL = 8.1750 +/- 0.06294
+
+This probably isn't the best comparison given gemma-3-27B-it-qat behaves unlike most "normal" non-QAT quants. But llama-perplexity runs clean with no `nan`s so that is the real test for me right now.
+
+Lastly I'll do some quick sweep benches.
+
+
+
+👈 sweep-bench results
+
+```bash
+#model=/mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27B-it-qat-iq4_ks.gguf
+model=/mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27B-it-qat-iq4_kt.gguf
+#model=/mnt/raid/models/ubergarm/gemma-3-27b-it-qat-GGUF/gemma-3-27B-it-qat-q4_0.gguf
+CUDA_VISIBLE_DEVICES="0" \
+./build/bin/llama-sweep-bench \
+ --model "$model" \
+ -c 32768 \
+ -fa \
+ -ngl 99 \
+ --warmup-batch \
+ --threads 1
+```
+
+## PR505 iq4_ks
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.342 | 1497.20 | 3.743 | 34.19 |
+| 512 | 128 | 512 | 0.348 | 1472.43 | 3.794 | 33.74 |
+| 512 | 128 | 1024 | 0.353 | 1449.53 | 3.830 | 33.42 |
+| 512 | 128 | 1536 | 0.358 | 1429.54 | 3.888 | 32.92 |
+| 512 | 128 | 2048 | 0.364 | 1407.80 | 3.950 | 32.40 |
+| 512 | 128 | 2560 | 0.368 | 1390.97 | 4.070 | 31.45 |
+| 512 | 128 | 3072 | 0.374 | 1367.73 | 4.088 | 31.31 |
+| 512 | 128 | 3584 | 0.379 | 1351.24 | 4.128 | 31.01 |
+| 512 | 128 | 4096 | 0.385 | 1328.47 | 4.179 | 30.63 |
+| 512 | 128 | 4608 | 0.389 | 1314.88 | 4.228 | 30.27 |
+| 512 | 128 | 5120 | 0.394 | 1299.02 | 4.280 | 29.90 |
+| 512 | 128 | 5632 | 0.399 | 1282.70 | 4.372 | 29.28 |
+| 512 | 128 | 6144 | 0.406 | 1262.58 | 4.395 | 29.13 |
+| 512 | 128 | 6656 | 0.410 | 1249.42 | 4.445 | 28.80 |
+| 512 | 128 | 7168 | 0.416 | 1230.78 | 4.493 | 28.49 |
+| 512 | 128 | 7680 | 0.421 | 1217.42 | 4.536 | 28.22 |
+| 512 | 128 | 8192 | 0.426 | 1202.74 | 4.650 | 27.53 |
+
+## PR505 iq4_kt
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.329 | 1554.50 | 3.594 | 35.61 |
+| 512 | 128 | 512 | 0.336 | 1524.23 | 3.651 | 35.06 |
+| 512 | 128 | 1024 | 0.341 | 1499.47 | 3.679 | 34.79 |
+| 512 | 128 | 1536 | 0.345 | 1482.10 | 3.732 | 34.30 |
+| 512 | 128 | 2048 | 0.352 | 1453.17 | 3.784 | 33.83 |
+| 512 | 128 | 2560 | 0.355 | 1442.46 | 3.889 | 32.91 |
+| 512 | 128 | 3072 | 0.360 | 1424.20 | 3.918 | 32.67 |
+| 512 | 128 | 3584 | 0.366 | 1399.86 | 3.963 | 32.30 |
+| 512 | 128 | 4096 | 0.371 | 1380.12 | 4.007 | 31.95 |
+| 512 | 128 | 4608 | 0.377 | 1359.59 | 4.066 | 31.48 |
+| 512 | 128 | 5120 | 0.381 | 1343.44 | 4.115 | 31.10 |
+| 512 | 128 | 5632 | 0.386 | 1327.56 | 4.205 | 30.44 |
+| 512 | 128 | 6144 | 0.392 | 1304.90 | 4.224 | 30.30 |
+| 512 | 128 | 6656 | 0.396 | 1291.41 | 4.267 | 30.00 |
+| 512 | 128 | 7168 | 0.402 | 1273.86 | 4.319 | 29.64 |
+| 512 | 128 | 7680 | 0.406 | 1260.46 | 4.347 | 29.44 |
+| 512 | 128 | 8192 | 0.411 | 1244.46 | 4.459 | 28.71 |
+
+## PR505 q4_0
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.306 | 1673.44 | 3.499 | 36.58 |
+| 512 | 128 | 512 | 0.311 | 1646.26 | 3.542 | 36.14 |
+| 512 | 128 | 1024 | 0.318 | 1611.98 | 3.579 | 35.76 |
+| 512 | 128 | 1536 | 0.322 | 1592.37 | 3.639 | 35.18 |
+| 512 | 128 | 2048 | 0.328 | 1561.52 | 3.698 | 34.61 |
+| 512 | 128 | 2560 | 0.334 | 1531.43 | 3.817 | 33.53 |
+| 512 | 128 | 3072 | 0.339 | 1509.07 | 3.827 | 33.45 |
+| 512 | 128 | 3584 | 0.346 | 1480.93 | 3.870 | 33.07 |
+| 512 | 128 | 4096 | 0.351 | 1456.85 | 3.921 | 32.64 |
+| 512 | 128 | 4608 | 0.355 | 1440.80 | 3.972 | 32.22 |
+| 512 | 128 | 5120 | 0.360 | 1420.48 | 4.024 | 31.81 |
+| 512 | 128 | 5632 | 0.366 | 1399.51 | 4.101 | 31.21 |
+| 512 | 128 | 6144 | 0.370 | 1382.54 | 4.134 | 30.96 |
+| 512 | 128 | 6656 | 0.378 | 1356.18 | 4.180 | 30.63 |
+| 512 | 128 | 7168 | 0.382 | 1341.12 | 4.239 | 30.20 |
+| 512 | 128 | 7680 | 0.386 | 1324.94 | 4.277 | 29.93 |
+| 512 | 128 | 8192 | 0.392 | 1307.55 | 4.387 | 29.18 |
+
+
+
+
+
+
+Very nice this new `iq4_kt` is faster than `iq4_ks` and very close to `q4_0`! It confirmed it does get a speed benefit from compiling with `-DGGML_CUDA_F16=ON`.
+
+I'm holding off on releasing any of my experimental `iqN_kt` quants until you're happy with everything. So feel free to make breaking changes with this stuff as far as I'm concerned.
+
+If you get the `iq3_kt` going as well, it might work to help me target a ~3.3ish BPW (~256GB) R1-0528. No pressure, I'm just daydreaming hah... Cheers and thanks!
+
+---
+
+👤 **ikawrakow** commented on **2025-06-08** at **20:37:58**
The Ops are harmless, just forgotten to remove
On Sun, 8 Jun 2025 at 23:34, ubergarm ***@***.***> wrote:
-> *ubergarm* left a comment (ikawrakow/ik_llama.cpp#505)
+> *ubergarm* left a comment (ikawrakow/ik_llama.cpp[#505](https://github.com/ikawrakow/ik_llama.cpp/issues/505))
>
>
> Now that it seems to compile okay, giving it a try quantizing
@@ -152,4 +1334,22 @@ On Sun, 8 Jun 2025 at 23:34, ubergarm ***@***.***> wrote:
> .
> You are receiving this because you authored the thread.Message ID:
> ***@***.***>
->
\ No newline at end of file
+>
+
+---
+
+👤 **ikawrakow** commented on **2025-06-09** at **07:10:24**
+
+Yes, sorry I forgot to add the `iq4_kt` MMQ template instance (it is done now). I manually add files there instead of using the Python script as it is way more complicated in `ik_llama.cpp` than it is in `llama.cpp`, so I figure it will take longer to change the script than to just manually add a file from time to time.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-09** at **15:54:34**
+
+Concerning PPL: yes, `IQ4_KT` is not quite on par with `IQ4_KS`. It is 4.0 bpw versus 4.25 bpw for `IQ4_KS`, so PPL is somewhat higher. But it is better than `IQ4_KSS`, which is also exactly 4.0 bpw. As you get to 4 bpw, the benefit of using a trellis becomes very small.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-18** at **13:22:19**
+
+Closing in favor of [#529](https://github.com/ikawrakow/ik_llama.cpp/issues/529)
\ No newline at end of file
diff --git a/github-data/pull_requests/506 - Fix non rpc build error.md b/github-data/pull_requests/506 - Fix non rpc build error.md
index 5efb5c9df..b1ebb91d9 100644
--- a/github-data/pull_requests/506 - Fix non rpc build error.md
+++ b/github-data/pull_requests/506 - Fix non rpc build error.md
@@ -1,14 +1,17 @@
-### 🐛 [#506](https://github.com/ikawrakow/ik_llama.cpp/pull/506) - Fix non rpc build error
+## 🔀 [Pull Request #506](https://github.com/ikawrakow/ik_llama.cpp/pull/506) - Fix non rpc build error
| **Author** | `firecoperana` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `rpc_improvement` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-08 |
| **Updated** | 2025-06-08 |
+| **Merged** | 2025-06-08 |
---
-#### Description
+## 📄 Description
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
@@ -18,8 +21,8 @@
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-06-08** at **14:26:53**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-08** at **14:26:53**
Thank you!
\ No newline at end of file
diff --git a/github-data/pull_requests/508 - Fix Compile error _C2668_.md b/github-data/pull_requests/508 - Fix Compile error C2668.md
similarity index 52%
rename from github-data/pull_requests/508 - Fix Compile error _C2668_.md
rename to github-data/pull_requests/508 - Fix Compile error C2668.md
index 6688db7dc..ee518317a 100644
--- a/github-data/pull_requests/508 - Fix Compile error _C2668_.md
+++ b/github-data/pull_requests/508 - Fix Compile error C2668.md
@@ -1,14 +1,17 @@
-### 🐛 [#508](https://github.com/ikawrakow/ik_llama.cpp/pull/508) - Fix Compile error (C2668)
+## 🔀 [Pull Request #508](https://github.com/ikawrakow/ik_llama.cpp/pull/508) - Fix Compile error (C2668)
| **Author** | `Gaolingx` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `main` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-09 |
| **Updated** | 2025-06-10 |
+| **Merged** | 2025-06-10 |
---
-#### Description
+## 📄 Description
The compiler(msvc) reports error: `..iqk_quantize.cpp(568,12): error C2668: "'anonymous-namespace'::hsum_float_4”: 对重载函数的调用不明确..` , I found some functions defined repeatedly and move these to `iqk_common.h`, It can be compiled successfully, but on linux doesn't seem to get the error...
@@ -22,10 +25,16 @@ The compiler(msvc) reports error: `..iqk_quantize.cpp(568,12): error C2668: "'an
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-06-09** at **15:12:45**: 💬 `COMMENTED`
+👤 **ikawrakow** started a conversation on `ggml/src/iqk/iqk_common.h` on **2025-06-09** at **15:12:45**
+
+Why did you change this? At least on my CPU the version
+```
+accm[i] = _mm256_add_ps(_mm256_permute2f128_ps(accm[i], accm[i+4], 0x20), _mm256_permute2f128_ps(accm[i], accm[i+4], 0x31));
+```
+is faster.
---
-👤 **ikawrakow** submitted a review the **2025-06-10** at **05:30:02**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-10** at **05:30:02**
\ No newline at end of file
diff --git a/github-data/pull_requests/509 - Docs update.md b/github-data/pull_requests/509 - Docs update.md
index da7682e8a..3e6ab6685 100644
--- a/github-data/pull_requests/509 - Docs update.md
+++ b/github-data/pull_requests/509 - Docs update.md
@@ -1,27 +1,89 @@
-### 🔀 [#509](https://github.com/ikawrakow/ik_llama.cpp/pull/509) - Docs update
+## 🔀 [Pull Request #509](https://github.com/ikawrakow/ik_llama.cpp/pull/509) - Docs update
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/docs_update` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-09 |
| **Updated** | 2025-06-11 |
+| **Merged** | 2025-06-09 |
---
-#### Description
+## 📄 Description
Update XTC and webUI docs.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-06-09** at **10:19:10**: ✅ `APPROVED`
+👤 **saood06** commented on **2025-06-09** at **10:17:17**
+
+If I update the `Latest News` section of the main README.md can I omit dates and bundle things together (multiple PR's linked per line)?
+
+---
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-09** at **10:19:10**
+
+---
+
+👤 **ikawrakow** commented on **2025-06-09** at **10:20:13**
+
+> If I update the Latest News section of the main README.md can I omit dates and bundle things together (multiple PR's linked per line)?
+
+Sure. Maybe a good idea anyway as one doesn't want the "News" section to become a novel.
---
-👤 **saood06** commented the **2025-06-09** at **11:43:58**:
+👤 **saood06** commented on **2025-06-09** at **11:08:39**
+
+> > If I update the Latest News section of the main README.md can I omit dates and bundle things together (multiple PR's linked per line)?
+>
+> Sure. Maybe a good idea anyway as one doesn't want the "News" section to become a novel.
+
+Can I update the bundle the old ones too? It would be nice if it just highlighted all the latest information. Sections based on model support, quant types, performance, features, etc. With all linked PR's (which have dates on hover so that information is still available).
+
+---
+
+👤 **ikawrakow** commented on **2025-06-09** at **11:11:23**
+
+> Can I update the bundle the old ones too? It would be nice if it just highlighted all the latest information.
+
+Yes, that would be good too.
+But maybe we can keep a copy in the Wiki as a record of how things evolved.
+
+---
+
+👤 **saood06** commented on **2025-06-09** at **11:43:58**
> But maybe we can keep a copy in the Wiki as a record of how things evolved.
-Can you create that entry of the Wiki (and maybe put in the new stuff)?
\ No newline at end of file
+Can you create that entry of the Wiki (and maybe put in the new stuff)?
+
+---
+
+👤 **ikawrakow** commented on **2025-06-09** at **12:03:04**
+
+> Can you create that entry of the Wiki
+
+Done
+
+> (and maybe put in the new stuff)?
+
+Will do later.
+
+---
+
+👤 **saood06** commented on **2025-06-09** at **12:08:01**
+
+This would also now be a good time to deprecate MLA-2 and do this: https://github.com/ikawrakow/ik_llama.cpp/pull/473#discussion_r2116235243
+
+That way people can learn see the old stuff, but this only shows stuff but labels them under improvements for fast and mem.
+
+---
+
+👤 **saood06** commented on **2025-06-11** at **06:01:42**
+
+Just commenting here to remind myself to add to `examples/server/README.md` a section instructing people on how to access the legacy UI, and also documentation for the new endpoint [#502](https://github.com/ikawrakow/ik_llama.cpp/issues/502).
\ No newline at end of file
diff --git a/github-data/pull_requests/51 - Quantized Flash Attention for all supported CPU platforms.md b/github-data/pull_requests/51 - Quantized Flash Attention for all supported CPU platforms.md
index 19b352681..3c420a20e 100644
--- a/github-data/pull_requests/51 - Quantized Flash Attention for all supported CPU platforms.md
+++ b/github-data/pull_requests/51 - Quantized Flash Attention for all supported CPU platforms.md
@@ -1,14 +1,17 @@
-### 🔀 [#51](https://github.com/ikawrakow/ik_llama.cpp/pull/51) - Quantized Flash Attention for all supported CPU platforms
+## 🔀 [Pull Request #51](https://github.com/ikawrakow/ik_llama.cpp/pull/51) - Quantized Flash Attention for all supported CPU platforms
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/neon_flash_attention_3` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-12 |
| **Updated** | 2024-09-12 |
+| **Merged** | 2024-09-12 |
---
-#### Description
+## 📄 Description
This PR adds two features:
* All supported CPU platforms (`Zen4, AVX2, ARM_NEON`) now have implementations for quantized kv-cache. `Q4_0, Q4_1`, and `Q8_0` can be used
diff --git a/github-data/pull_requests/510 - Update News section of readme.md b/github-data/pull_requests/510 - Update News section of readme.md
index d19d0888f..b16e75cfa 100644
--- a/github-data/pull_requests/510 - Update News section of readme.md
+++ b/github-data/pull_requests/510 - Update News section of readme.md
@@ -1,14 +1,17 @@
-### 🔀 [#510](https://github.com/ikawrakow/ik_llama.cpp/pull/510) - Update News section of readme
+## 🔀 [Pull Request #510](https://github.com/ikawrakow/ik_llama.cpp/pull/510) - Update News section of readme
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/readme_update` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-09 |
| **Updated** | 2025-06-13 |
+| **Merged** | 2025-06-13 |
---
-#### Description
+## 📄 Description
@ikawrakow
@@ -20,15 +23,15 @@ And if any of them can be removed as they are no longer relevant (especially if
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-09** at **13:20:55**:
+👤 **ikawrakow** commented on **2025-06-09** at **13:20:55**
Yes, you can split it like this
---
-👤 **saood06** commented the **2025-06-11** at **04:54:07**:
+👤 **saood06** commented on **2025-06-11** at **04:54:07**
@ikawrakow
@@ -40,97 +43,81 @@ Any thoughts?
---
-👤 **ikawrakow** submitted a review the **2025-06-11** at **05:41:57**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-06-11** at **05:41:57** on `README.md`:
+👤 **ikawrakow** started a conversation on `README.md` on **2025-06-11** at **05:41:57**
And not GLM-4, LlaMA-4, Qwen3/Qwen3-MoE ?
----
-
-👤 **ikawrakow** commented during a code review the **2025-06-11** at **05:43:24** on `README.md`:
-
-I would count the trellis quants also here. They partially implemented a long time ago, but the PRs to add CPU and Metal support are quite recent.
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-06-11** at **05:44:57** on `README.md`:
-
-Duplicate
-
----
-
-👤 **ikawrakow** submitted a review the **2025-06-11** at **05:45:58**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** submitted a review the **2025-06-11** at **05:49:53**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-06-11** at **05:49:53** on `README.md`:
-
-Not sure what you mean, all three you mentioned are included alongside their respective PRs. (Qwen 3 is just listed as Qwen3 and not Qwen3/Qwen3-MoE)
-
----
-
-👤 **ikawrakow** submitted a review the **2025-06-11** at **05:52:58**: 💬 `COMMENTED`
-
----
+> 👤 **saood06** replied on **2025-06-11** at **05:49:53**
+>
+> Not sure what you mean, all three you mentioned are included alongside their respective PRs. (Qwen 3 is just listed as Qwen3 and not Qwen3/Qwen3-MoE)
-👤 **ikawrakow** commented during a code review the **2025-06-11** at **05:52:58** on `README.md`:
+> 👤 **ikawrakow** replied on **2025-06-11** at **05:52:58**
+>
+> Oh, sorry, short attention span. Didn't reed the whole line. It seems I need LLM support when reviewing.
-Oh, sorry, short attention span. Didn't reed the whole line. It seems I need LLM support when reviewing.
+> 👤 **saood06** replied on **2025-06-11** at **05:55:02**
+>
+> It really isn't entirely your fault. I don't like this being one block but if I split it into multiple lines it takes too much space.
---
-👤 **saood06** submitted a review the **2025-06-11** at **05:54:05**: 💬 `COMMENTED`
+👤 **ikawrakow** started a conversation on `README.md` on **2025-06-11** at **05:43:24**
----
-
-👤 **saood06** submitted a review the **2025-06-11** at **05:55:02**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-06-11** at **05:55:02** on `README.md`:
-
-It really isn't entirely your fault. I don't like this being one block but if I split it into multiple lines it takes too much space.
-
----
-
-👤 **saood06** submitted a review the **2025-06-11** at **06:18:03**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-06-11** at **06:18:03** on `README.md`:
-
-Fixed.
-
----
+I would count the trellis quants also here. They partially implemented a long time ago, but the PRs to add CPU and Metal support are quite recent.
-👤 **saood06** submitted a review the **2025-06-11** at **06:19:12**: 💬 `COMMENTED`
+> 👤 **saood06** replied on **2025-06-11** at **05:54:05**
+>
+> That makes sense, I'll remove the line about "Additional implementations for the trellis quants..." and add all of them here alongside the initial CUDA implementation in the closed PR (even though technically it came with the CPU implementation PR).
+
+> 👤 **saood06** replied on **2025-06-11** at **06:19:12**
+>
+> I split the Quantization additions section into Trellis and IQK sections.
+>
+> This way all the IQK quants can be included (as they are still relevant and people still ask about them) if you think this makes sense.
+>
+> Thoughts?
+
+> 👤 **ikawrakow** replied on **2025-06-11** at **06:55:52**
+>
+> Sure.
+>
+> One thing that bothers me is that many people appear to be thinking that they need `ik_llama.cpp`-specific quants to use `ik_llama.cpp`. Or that they need to do something additional in order to be able to use `llama.cpp` GGUFs with `ik_llama.cpp`. At least this is the impression I get from the comments people make here. I think it would be useful to point out that they can grab any GGUF and just use it the way it is will `ik_llama.cpp`.
+
+> 👤 **saood06** replied on **2025-06-11** at **07:12:16**
+>
+> > Sure.
+>
+> Okay will add all the IQK quants (and their associated PRs) and also link the discussions where you originally talk about them preceding that.
+>
+> > One thing that bothers me is that many people appear to be thinking that they need `ik_llama.cpp`-specific quants to use `ik_llama.cpp`. Or that they need to do something additional in order to be able to use `llama.cpp` GGUFs with `ik_llama.cpp`. At least this is the impression I get from the comments people make here.
+>
+> Yes, I'm fairly certain I've seen the comments that gave you that impression (and it was a question that was asked of me many times on other platforms).
+>
+> >I think it would be useful to point out that they can grab any GGUF and just use it the way it is will `ik_llama.cpp`.
+>
+> I agree that could be useful (if done properly), but I don't think that should be handled in the News section.
+>
+> Maybe it should be added to the TL;DR? But I don't know how to rewrite that to include that (and update it in general to something that is more current and clear).
+
+---
+
+👤 **ikawrakow** started a conversation on `README.md` on **2025-06-11** at **05:44:57**
----
+Duplicate
-👤 **ikawrakow** submitted a review the **2025-06-11** at **06:55:52**: 💬 `COMMENTED`
+> 👤 **saood06** replied on **2025-06-11** at **06:18:03**
+>
+> Fixed.
---
-👤 **ikawrakow** commented during a code review the **2025-06-11** at **06:55:52** on `README.md`:
+👤 **ikawrakow** commented on **2025-06-12** at **16:26:10**
-Sure.
-
-One thing that bothers me is that many people appear to be thinking that they need `ik_llama.cpp`-specific quants to use `ik_llama.cpp`. Or that they need to do something additional in order to be able to use `llama.cpp` GGUFs with `ik_llama.cpp`. At least this is the impression I get from the comments people make here. I think it would be useful to point out that they can grab any GGUF and just use it the way it is will `ik_llama.cpp`.
+Will you finish it, or are you waiting for me to finish it?
---
-👤 **saood06** submitted a review the **2025-06-11** at **07:12:17**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented the **2025-06-12** at **16:57:29**:
+👤 **saood06** commented on **2025-06-12** at **16:57:29**
> Will you finish it, or are you waiting for me to finish it?
@@ -140,4 +127,10 @@ Overall although I think this is an improvement and will be shorter than the old
---
-👤 **ikawrakow** submitted a review the **2025-06-13** at **04:56:31**: ✅ `APPROVED`
\ No newline at end of file
+👤 **saood06** commented on **2025-06-12** at **17:30:39**
+
+I marked it ready for review as all the things that needed to be added are now added. I still think it could be better, but I don't have any ideas on how to make it better anymore.
+
+---
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-13** at **04:56:31**
\ No newline at end of file
diff --git a/github-data/pull_requests/511 - New IQ2_KT.md b/github-data/pull_requests/511 - New IQ2_KT.md
index c92f60225..9fb8658ae 100644
--- a/github-data/pull_requests/511 - New IQ2_KT.md
+++ b/github-data/pull_requests/511 - New IQ2_KT.md
@@ -1,16 +1,19 @@
-### 🔀 [#511](https://github.com/ikawrakow/ik_llama.cpp/pull/511) - New IQ2_KT
+## 🔀 [Pull Request #511](https://github.com/ikawrakow/ik_llama.cpp/pull/511) - New IQ2_KT
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/new_iq2kt` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-09 |
| **Updated** | 2025-06-18 |
+| **Labels** | `Breaking change` |
---
-#### Description
+## 📄 Description
-This PR uses the new trellis introduced in #505 and applies it to `IQ2_KT`.
+This PR uses the new trellis introduced in [#505](https://github.com/ikawrakow/ik_llama.cpp/issues/505) and applies it to `IQ2_KT`.
This leads to a slightly higher PPL for the models where the `IQ2_KT` on the main branch works, but is more stable and there are no longer NaNs for the models where the existing `IQ2_KT` was failing (Qwen3-30B-A3B and DeepSeek-Lite).
@@ -25,18 +28,18 @@ Performance is also great, except on the Apple GPU, where it is slower than the
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-06-10** at **18:41:50**:
+👤 **ubergarm** commented on **2025-06-10** at **18:41:50**
Just kicked the tires on this PR and looks good so far!
1. It compiles fine.
2. I managed to quantize [OpenBuddy-R1-0528-Distill-Qwen3-32B-Preview0-QAT](https://huggingface.co/OpenBuddy/OpenBuddy-R1-0528-Distill-Qwen3-32B-Preview0-QAT) using a variety of quants including `iq2_kt` and `iq4_kt` from this PR.
-There is not a lot of info about this model, and honestly it doesn't behave like a 4bpw QAT and they don't have much details (i'll ask on their hf). Their chat tokenizing stuff seems wonky too, (but that is unrelated to this PR). (might need to stuff the `tokenizer_config.json -> "chat_template"` into the GGUF kv metadata.)
+There is not a lot of info about this model, and honestly it doesn't behave like a 4bpw QAT and they don't have much details (i asked on their hf repo). Their chat template stuff seems wonky too, (but that is unrelated to this PR). (might need to use the `tokenizer_config.json -> "chat_template"` JINJA template (also in the GGUF kv metadata) and make a new `llama_chat_apply_template_internal` case...) [*EDIT* made a rough chat template patch to test it [here](https://github.com/ubergarm/ik_llama.cpp/tree/ug/openbuddy_chat_template), but initial impressions is it might not be worth adding unless there is other demand]
-Anyway, the important thing is the new `iq2_kt` and` iq4_kt` are functional, able to quantize using normal imatrix, runs full perplexity clean with no `nan`, and outputs okay looking text (no gibberish) down to the `iq2_kt` even.
+Anyway, the important thing is the new `iq2_kt` and` iq4_kt` are functional, able to quantize using normal imatrix, runs full perplexity clean with no `nan` on CUDA RTXA6000, and outputs okay looking text (no gibberish) down to the `iq2_kt` even.

@@ -44,7 +47,272 @@ I'll run some sweep benches too for speed comparisons.
---
-👤 **ikawrakow** commented the **2025-06-11** at **14:36:11**:
+👤 **ubergarm** commented on **2025-06-10** at **19:31:18**
+
+Speed benchmarks on Single CUDA RTX A6000 48GB VRAM fully offloaded.
+
+
+
+
+
+👈 Logs
+
+```bash
+git checkout ik/new_iq2kt
+
+cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
+cmake --build ./build --config Release -j $(nproc)
+
+#model=/mnt/raid/models/ubergarm/OpenBuddy-R1-0528-Distill-Qwen3-32B-Preview0-QAT-GGUF/DeepSeek-R1-0528-Distill-Qwen3-32B-Preview0-QAT-BF16-00001-of-00002.gguf
+#model=/mnt/raid/models/ubergarm/OpenBuddy-R1-0528-Distill-Qwen3-32B-Preview0-QAT-GGUF/DeepSeek-R1-0528-Distill-Qwen3-32B-Preview0-QAT-Q4_0.gguf
+#model=/mnt/raid/models/ubergarm/OpenBuddy-R1-0528-Distill-Qwen3-32B-Preview0-QAT-GGUF/DeepSeek-R1-0528-Distill-Qwen3-32B-Preview0-QAT-Q4_K.gguf
+#model=/mnt/raid/models/ubergarm/OpenBuddy-R1-0528-Distill-Qwen3-32B-Preview0-QAT-GGUF/DeepSeek-R1-0528-Distill-Qwen3-32B-Preview0-QAT-IQ4_K.gguf
+#model=/mnt/raid/models/ubergarm/OpenBuddy-R1-0528-Distill-Qwen3-32B-Preview0-QAT-GGUF/DeepSeek-R1-0528-Distill-Qwen3-32B-Preview0-QAT-IQ4_KS.gguf
+#model=/mnt/raid/models/ubergarm/OpenBuddy-R1-0528-Distill-Qwen3-32B-Preview0-QAT-GGUF/DeepSeek-R1-0528-Distill-Qwen3-32B-Preview0-QAT-IQ4_KT.gguf
+model=/mnt/raid/models/ubergarm/OpenBuddy-R1-0528-Distill-Qwen3-32B-Preview0-QAT-GGUF/DeepSeek-R1-0528-Distill-Qwen3-32B-Preview0-QAT-IQ2_KT.gguf
+
+CUDA_VISIBLE_DEVICES="0" \
+./build/bin/llama-sweep-bench \
+ --model "$model" \
+ --ctx-size 17408 \
+ -ctk f16 -ctv f16 \
+ -fa \
+ -ngl 99 \
+ --warmup-batch \
+ --threads 1
+```
+
+## Q4_0
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.372 | 1376.00 | 3.810 | 33.59 |
+| 512 | 128 | 512 | 0.380 | 1347.27 | 3.855 | 33.20 |
+| 512 | 128 | 1024 | 0.390 | 1313.32 | 3.853 | 33.22 |
+| 512 | 128 | 1536 | 0.398 | 1284.87 | 3.871 | 33.07 |
+| 512 | 128 | 2048 | 0.405 | 1262.75 | 3.906 | 32.77 |
+| 512 | 128 | 2560 | 0.416 | 1231.28 | 3.939 | 32.50 |
+| 512 | 128 | 3072 | 0.426 | 1201.25 | 3.971 | 32.23 |
+| 512 | 128 | 3584 | 0.435 | 1178.30 | 4.004 | 31.96 |
+| 512 | 128 | 4096 | 0.442 | 1157.73 | 4.041 | 31.67 |
+| 512 | 128 | 4608 | 0.450 | 1136.84 | 4.076 | 31.40 |
+| 512 | 128 | 5120 | 0.460 | 1113.95 | 4.113 | 31.12 |
+| 512 | 128 | 5632 | 0.469 | 1091.93 | 4.192 | 30.54 |
+| 512 | 128 | 6144 | 0.478 | 1072.20 | 4.195 | 30.51 |
+| 512 | 128 | 6656 | 0.485 | 1055.47 | 4.218 | 30.35 |
+| 512 | 128 | 7168 | 0.492 | 1039.78 | 4.228 | 30.27 |
+| 512 | 128 | 7680 | 0.501 | 1021.45 | 4.254 | 30.09 |
+| 512 | 128 | 8192 | 0.510 | 1004.30 | 4.276 | 29.94 |
+| 512 | 128 | 8704 | 0.519 | 986.79 | 4.300 | 29.77 |
+| 512 | 128 | 9216 | 0.526 | 972.49 | 4.331 | 29.56 |
+| 512 | 128 | 9728 | 0.534 | 958.15 | 4.358 | 29.37 |
+| 512 | 128 | 10240 | 0.542 | 944.16 | 4.378 | 29.24 |
+| 512 | 128 | 10752 | 0.550 | 931.00 | 4.466 | 28.66 |
+| 512 | 128 | 11264 | 0.558 | 917.95 | 4.493 | 28.49 |
+| 512 | 128 | 11776 | 0.565 | 906.58 | 4.473 | 28.61 |
+| 512 | 128 | 12288 | 0.574 | 891.64 | 4.485 | 28.54 |
+| 512 | 128 | 12800 | 0.581 | 881.06 | 4.511 | 28.37 |
+| 512 | 128 | 13312 | 0.588 | 870.85 | 4.538 | 28.21 |
+| 512 | 128 | 13824 | 0.596 | 859.14 | 4.561 | 28.07 |
+| 512 | 128 | 14336 | 0.603 | 849.39 | 4.584 | 27.92 |
+| 512 | 128 | 14848 | 0.615 | 832.78 | 4.614 | 27.74 |
+| 512 | 128 | 15360 | 0.622 | 823.76 | 4.639 | 27.59 |
+| 512 | 128 | 15872 | 0.629 | 814.41 | 4.663 | 27.45 |
+| 512 | 128 | 16384 | 0.640 | 800.14 | 4.740 | 27.00 |
+
+## Q4_K
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.398 | 1285.23 | 3.876 | 33.02 |
+| 512 | 128 | 512 | 0.409 | 1251.94 | 3.923 | 32.63 |
+| 512 | 128 | 1024 | 0.418 | 1223.79 | 3.920 | 32.65 |
+| 512 | 128 | 1536 | 0.428 | 1195.37 | 3.939 | 32.50 |
+| 512 | 128 | 2048 | 0.435 | 1175.93 | 3.974 | 32.21 |
+| 512 | 128 | 2560 | 0.446 | 1148.89 | 4.005 | 31.96 |
+| 512 | 128 | 3072 | 0.456 | 1122.26 | 4.039 | 31.69 |
+| 512 | 128 | 3584 | 0.464 | 1103.45 | 4.075 | 31.41 |
+| 512 | 128 | 4096 | 0.474 | 1081.26 | 4.111 | 31.13 |
+| 512 | 128 | 4608 | 0.482 | 1062.08 | 4.145 | 30.88 |
+| 512 | 128 | 5120 | 0.489 | 1045.97 | 4.182 | 30.61 |
+| 512 | 128 | 5632 | 0.498 | 1028.66 | 4.265 | 30.01 |
+| 512 | 128 | 6144 | 0.507 | 1010.81 | 4.267 | 29.99 |
+| 512 | 128 | 6656 | 0.515 | 994.16 | 4.292 | 29.82 |
+| 512 | 128 | 7168 | 0.524 | 977.04 | 4.293 | 29.82 |
+| 512 | 128 | 7680 | 0.532 | 962.24 | 4.319 | 29.64 |
+| 512 | 128 | 8192 | 0.540 | 947.85 | 4.343 | 29.47 |
+| 512 | 128 | 8704 | 0.549 | 932.32 | 4.369 | 29.30 |
+| 512 | 128 | 9216 | 0.558 | 917.14 | 4.399 | 29.10 |
+| 512 | 128 | 9728 | 0.566 | 905.25 | 4.420 | 28.96 |
+| 512 | 128 | 10240 | 0.573 | 892.89 | 4.446 | 28.79 |
+| 512 | 128 | 10752 | 0.581 | 880.91 | 4.538 | 28.20 |
+| 512 | 128 | 11264 | 0.590 | 867.99 | 4.566 | 28.03 |
+| 512 | 128 | 11776 | 0.598 | 856.83 | 4.545 | 28.16 |
+| 512 | 128 | 12288 | 0.606 | 844.92 | 4.555 | 28.10 |
+| 512 | 128 | 12800 | 0.613 | 834.67 | 4.580 | 27.94 |
+| 512 | 128 | 13312 | 0.622 | 823.72 | 4.606 | 27.79 |
+| 512 | 128 | 13824 | 0.629 | 814.58 | 4.628 | 27.66 |
+| 512 | 128 | 14336 | 0.636 | 804.84 | 4.653 | 27.51 |
+| 512 | 128 | 14848 | 0.644 | 795.55 | 4.682 | 27.34 |
+| 512 | 128 | 15360 | 0.652 | 785.29 | 4.704 | 27.21 |
+| 512 | 128 | 15872 | 0.660 | 775.65 | 4.728 | 27.07 |
+| 512 | 128 | 16384 | 0.668 | 766.55 | 4.807 | 26.63 |
+
+## IQ4_K
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.488 | 1049.09 | 4.469 | 28.64 |
+| 512 | 128 | 512 | 0.495 | 1033.90 | 4.513 | 28.36 |
+| 512 | 128 | 1024 | 0.506 | 1012.60 | 4.536 | 28.22 |
+| 512 | 128 | 1536 | 0.515 | 994.74 | 4.575 | 27.98 |
+| 512 | 128 | 2048 | 0.527 | 972.36 | 4.630 | 27.65 |
+| 512 | 128 | 2560 | 0.537 | 953.95 | 4.685 | 27.32 |
+| 512 | 128 | 3072 | 0.545 | 938.94 | 4.732 | 27.05 |
+| 512 | 128 | 3584 | 0.557 | 919.28 | 4.779 | 26.78 |
+| 512 | 128 | 4096 | 0.566 | 905.20 | 4.828 | 26.51 |
+| 512 | 128 | 4608 | 0.574 | 891.86 | 4.871 | 26.28 |
+| 512 | 128 | 5120 | 0.584 | 876.47 | 4.916 | 26.04 |
+| 512 | 128 | 5632 | 0.593 | 863.50 | 4.999 | 25.60 |
+| 512 | 128 | 6144 | 0.601 | 851.51 | 5.017 | 25.51 |
+| 512 | 128 | 6656 | 0.611 | 838.57 | 5.050 | 25.35 |
+| 512 | 128 | 7168 | 0.618 | 828.94 | 5.060 | 25.30 |
+| 512 | 128 | 7680 | 0.626 | 817.85 | 5.089 | 25.15 |
+| 512 | 128 | 8192 | 0.636 | 805.25 | 5.117 | 25.02 |
+| 512 | 128 | 8704 | 0.644 | 795.42 | 5.140 | 24.90 |
+| 512 | 128 | 9216 | 0.652 | 784.96 | 5.169 | 24.76 |
+| 512 | 128 | 9728 | 0.660 | 775.28 | 5.195 | 24.64 |
+| 512 | 128 | 10240 | 0.669 | 765.28 | 5.221 | 24.52 |
+| 512 | 128 | 10752 | 0.677 | 755.78 | 5.307 | 24.12 |
+| 512 | 128 | 11264 | 0.684 | 748.31 | 5.334 | 24.00 |
+| 512 | 128 | 11776 | 0.693 | 739.19 | 5.320 | 24.06 |
+| 512 | 128 | 12288 | 0.700 | 731.07 | 5.339 | 23.97 |
+| 512 | 128 | 12800 | 0.708 | 723.07 | 5.360 | 23.88 |
+| 512 | 128 | 13312 | 0.717 | 713.84 | 5.386 | 23.77 |
+| 512 | 128 | 13824 | 0.723 | 707.75 | 5.406 | 23.68 |
+| 512 | 128 | 14336 | 0.732 | 699.50 | 5.433 | 23.56 |
+| 512 | 128 | 14848 | 0.740 | 691.91 | 5.454 | 23.47 |
+| 512 | 128 | 15360 | 0.748 | 684.68 | 5.478 | 23.37 |
+| 512 | 128 | 15872 | 0.754 | 678.60 | 5.496 | 23.29 |
+| 512 | 128 | 16384 | 0.762 | 671.95 | 5.562 | 23.01 |
+
+## IQ4_KS
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.414 | 1236.19 | 3.999 | 32.00 |
+| 512 | 128 | 512 | 0.423 | 1209.12 | 4.043 | 31.66 |
+| 512 | 128 | 1024 | 0.432 | 1185.42 | 4.052 | 31.59 |
+| 512 | 128 | 1536 | 0.442 | 1159.53 | 4.075 | 31.41 |
+| 512 | 128 | 2048 | 0.451 | 1135.91 | 4.114 | 31.12 |
+| 512 | 128 | 2560 | 0.461 | 1111.57 | 4.152 | 30.83 |
+| 512 | 128 | 3072 | 0.469 | 1091.64 | 4.183 | 30.60 |
+| 512 | 128 | 3584 | 0.479 | 1067.94 | 4.222 | 30.31 |
+| 512 | 128 | 4096 | 0.488 | 1048.82 | 4.265 | 30.01 |
+| 512 | 128 | 4608 | 0.498 | 1027.90 | 4.300 | 29.77 |
+| 512 | 128 | 5120 | 0.506 | 1011.36 | 4.337 | 29.52 |
+| 512 | 128 | 5632 | 0.514 | 996.39 | 4.420 | 28.96 |
+| 512 | 128 | 6144 | 0.525 | 975.51 | 4.427 | 28.91 |
+| 512 | 128 | 6656 | 0.532 | 962.19 | 4.454 | 28.74 |
+| 512 | 128 | 7168 | 0.541 | 946.79 | 4.458 | 28.71 |
+| 512 | 128 | 7680 | 0.549 | 931.88 | 4.484 | 28.55 |
+| 512 | 128 | 8192 | 0.558 | 917.89 | 4.511 | 28.38 |
+| 512 | 128 | 8704 | 0.566 | 905.17 | 4.536 | 28.22 |
+| 512 | 128 | 9216 | 0.574 | 892.08 | 4.565 | 28.04 |
+| 512 | 128 | 9728 | 0.582 | 879.27 | 4.586 | 27.91 |
+| 512 | 128 | 10240 | 0.591 | 865.73 | 4.613 | 27.75 |
+| 512 | 128 | 10752 | 0.599 | 855.02 | 4.703 | 27.22 |
+| 512 | 128 | 11264 | 0.608 | 842.76 | 4.729 | 27.07 |
+| 512 | 128 | 11776 | 0.614 | 833.86 | 4.712 | 27.16 |
+| 512 | 128 | 12288 | 0.625 | 819.51 | 4.723 | 27.10 |
+| 512 | 128 | 12800 | 0.630 | 812.88 | 4.750 | 26.95 |
+| 512 | 128 | 13312 | 0.639 | 801.28 | 4.774 | 26.81 |
+| 512 | 128 | 13824 | 0.648 | 790.22 | 4.795 | 26.70 |
+| 512 | 128 | 14336 | 0.655 | 781.86 | 4.822 | 26.55 |
+| 512 | 128 | 14848 | 0.663 | 772.15 | 4.848 | 26.40 |
+| 512 | 128 | 15360 | 0.670 | 763.86 | 4.871 | 26.28 |
+| 512 | 128 | 15872 | 0.678 | 755.06 | 4.895 | 26.15 |
+| 512 | 128 | 16384 | 0.686 | 745.93 | 4.973 | 25.74 |
+
+## IQ4_KT
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.409 | 1253.26 | 3.866 | 33.11 |
+| 512 | 128 | 512 | 0.416 | 1229.86 | 3.916 | 32.69 |
+| 512 | 128 | 1024 | 0.425 | 1203.83 | 3.928 | 32.59 |
+| 512 | 128 | 1536 | 0.434 | 1180.87 | 3.945 | 32.44 |
+| 512 | 128 | 2048 | 0.442 | 1158.85 | 3.977 | 32.18 |
+| 512 | 128 | 2560 | 0.450 | 1137.03 | 4.008 | 31.94 |
+| 512 | 128 | 3072 | 0.459 | 1114.55 | 4.058 | 31.54 |
+| 512 | 128 | 3584 | 0.467 | 1096.28 | 4.094 | 31.27 |
+| 512 | 128 | 4096 | 0.478 | 1072.10 | 4.127 | 31.01 |
+| 512 | 128 | 4608 | 0.485 | 1054.76 | 4.156 | 30.80 |
+| 512 | 128 | 5120 | 0.493 | 1038.25 | 4.195 | 30.52 |
+| 512 | 128 | 5632 | 0.501 | 1021.18 | 4.271 | 29.97 |
+| 512 | 128 | 6144 | 0.509 | 1005.50 | 4.275 | 29.94 |
+| 512 | 128 | 6656 | 0.517 | 990.30 | 4.302 | 29.76 |
+| 512 | 128 | 7168 | 0.525 | 975.22 | 4.313 | 29.68 |
+| 512 | 128 | 7680 | 0.532 | 961.73 | 4.330 | 29.56 |
+| 512 | 128 | 8192 | 0.541 | 946.23 | 4.347 | 29.45 |
+| 512 | 128 | 8704 | 0.548 | 933.76 | 4.367 | 29.31 |
+| 512 | 128 | 9216 | 0.556 | 920.76 | 4.398 | 29.11 |
+| 512 | 128 | 9728 | 0.563 | 908.69 | 4.417 | 28.98 |
+| 512 | 128 | 10240 | 0.572 | 895.58 | 4.443 | 28.81 |
+| 512 | 128 | 10752 | 0.579 | 883.69 | 4.525 | 28.29 |
+| 512 | 128 | 11264 | 0.586 | 873.18 | 4.551 | 28.12 |
+| 512 | 128 | 11776 | 0.594 | 861.57 | 4.542 | 28.18 |
+| 512 | 128 | 12288 | 0.601 | 851.28 | 4.558 | 28.08 |
+| 512 | 128 | 12800 | 0.609 | 841.04 | 4.580 | 27.95 |
+| 512 | 128 | 13312 | 0.617 | 830.42 | 4.589 | 27.89 |
+| 512 | 128 | 13824 | 0.625 | 819.83 | 4.609 | 27.77 |
+| 512 | 128 | 14336 | 0.632 | 810.68 | 4.629 | 27.65 |
+| 512 | 128 | 14848 | 0.640 | 799.72 | 4.667 | 27.42 |
+| 512 | 128 | 15360 | 0.649 | 789.30 | 4.677 | 27.37 |
+| 512 | 128 | 15872 | 0.653 | 783.95 | 4.702 | 27.22 |
+| 512 | 128 | 16384 | 0.664 | 771.36 | 4.764 | 26.87 |
+
+## IQ2_KT
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.365 | 1400.95 | 2.737 | 46.76 |
+| 512 | 128 | 512 | 0.373 | 1372.12 | 2.780 | 46.04 |
+| 512 | 128 | 1024 | 0.381 | 1342.95 | 2.786 | 45.95 |
+| 512 | 128 | 1536 | 0.389 | 1316.39 | 2.800 | 45.72 |
+| 512 | 128 | 2048 | 0.399 | 1283.34 | 2.833 | 45.18 |
+| 512 | 128 | 2560 | 0.407 | 1257.53 | 2.866 | 44.65 |
+| 512 | 128 | 3072 | 0.415 | 1234.09 | 2.891 | 44.27 |
+| 512 | 128 | 3584 | 0.423 | 1210.97 | 2.927 | 43.73 |
+| 512 | 128 | 4096 | 0.431 | 1188.16 | 2.962 | 43.21 |
+| 512 | 128 | 4608 | 0.440 | 1162.72 | 2.991 | 42.80 |
+| 512 | 128 | 5120 | 0.450 | 1138.34 | 3.043 | 42.06 |
+| 512 | 128 | 5632 | 0.457 | 1119.22 | 3.110 | 41.15 |
+| 512 | 128 | 6144 | 0.466 | 1098.93 | 3.118 | 41.06 |
+| 512 | 128 | 6656 | 0.475 | 1078.81 | 3.147 | 40.67 |
+| 512 | 128 | 7168 | 0.484 | 1057.24 | 3.151 | 40.62 |
+| 512 | 128 | 7680 | 0.491 | 1042.58 | 3.168 | 40.40 |
+| 512 | 128 | 8192 | 0.497 | 1029.54 | 3.196 | 40.05 |
+| 512 | 128 | 8704 | 0.508 | 1008.46 | 3.225 | 39.69 |
+| 512 | 128 | 9216 | 0.515 | 993.52 | 3.252 | 39.36 |
+| 512 | 128 | 9728 | 0.521 | 982.11 | 3.279 | 39.04 |
+| 512 | 128 | 10240 | 0.531 | 964.15 | 3.291 | 38.89 |
+| 512 | 128 | 10752 | 0.539 | 949.54 | 3.361 | 38.08 |
+| 512 | 128 | 11264 | 0.547 | 935.45 | 3.388 | 37.78 |
+| 512 | 128 | 11776 | 0.555 | 923.02 | 3.386 | 37.80 |
+| 512 | 128 | 12288 | 0.564 | 907.81 | 3.398 | 37.67 |
+| 512 | 128 | 12800 | 0.570 | 897.93 | 3.420 | 37.42 |
+| 512 | 128 | 13312 | 0.581 | 881.98 | 3.441 | 37.20 |
+| 512 | 128 | 13824 | 0.586 | 873.12 | 3.456 | 37.04 |
+| 512 | 128 | 14336 | 0.595 | 860.52 | 3.478 | 36.80 |
+| 512 | 128 | 14848 | 0.602 | 850.42 | 3.504 | 36.53 |
+| 512 | 128 | 15360 | 0.609 | 841.32 | 3.523 | 36.33 |
+| 512 | 128 | 15872 | 0.617 | 829.62 | 3.547 | 36.08 |
+| 512 | 128 | 16384 | 0.623 | 821.93 | 3.627 | 35.29 |
+
+
+
+Nice job, the `IQ2_KT` is quite speedy (relative to the ~4bpw quants)!
+
+Somewhat related I [saw further discussions](https://github.com/turboderp-org/exllamav3/pull/26#issuecomment-2957155162) on optimizing QTIP style quants by using pre-computed Hessians for each layer/tensor. Zero pressure to look or distract, just interesting folks are already uploading Hessians for some models.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-11** at **14:36:11**
> Somewhat related I https://github.com/turboderp-org/exllamav3/pull/26#issuecomment-2957155162 on optimizing QTIP style quants by using pre-computed Hessians for each layer/tensor. Zero pressure to look or distract, just interesting folks are already uploading Hessians for some models.
@@ -52,7 +320,7 @@ This is the sort of thing we do not want to do here. It leads to overfitting, ne
---
-👤 **louiehelm** commented the **2025-06-11** at **17:03:36**:
+👤 **louiehelm** commented on **2025-06-11** at **17:03:36**
Great work! Love seeing improved performance on the trellis quants ik.
@@ -62,14 +330,14 @@ Some alternate MCG multipliers (with no addition) have lower PPL than QTIP 3INST
| **Quantization** | **Version** | **PPL** |
|------------------|-------------|---------|
| **f32** | - | 7.3210 |
-| **IQ2_KT** | #511 default | 11.0029 |
+| **IQ2_KT** | [#511](https://github.com/ikawrakow/ik_llama.cpp/issues/511) default | 11.0029 |
| | 0xCBAC1FED (3417055213) | 10.9466 |
-| **IQ3_KT** | #511 default | 8.1319 |
+| **IQ3_KT** | [#511](https://github.com/ikawrakow/ik_llama.cpp/issues/511) default | 8.1319 |
| | 0xCBAC1FED (3417055213) | 8.0776 |
-| **IQ4_KT** | #511 default | 7.5620 |
+| **IQ4_KT** | [#511](https://github.com/ikawrakow/ik_llama.cpp/issues/511) default | 7.5620 |
| | 0xCBAC1FED (3417055213) | 7.5591 |
-Just chiming in because it might be a great time to take the 0.5% higher fidelity of ditching the default QTIP multiplier+addition params if you're already introducing a breaking change to IQx_KT quants anyway. For IQ2_K, this gains back a good chunk of what was lost by switching to your new decoder scheme, while also making IQ3_KT and IQ4_KT both better than #511 and in some cases even better than prior versions.
+Just chiming in because it might be a great time to take the 0.5% higher fidelity of ditching the default QTIP multiplier+addition params if you're already introducing a breaking change to IQx_KT quants anyway. For IQ2_K, this gains back a good chunk of what was lost by switching to your new decoder scheme, while also making IQ3_KT and IQ4_KT both better than [#511](https://github.com/ikawrakow/ik_llama.cpp/issues/511) and in some cases even better than prior versions.
Also, ka = `0xCBAC1FED` and kb = 0 is a more well-tested distribution than 3INST defaults and currently the best known so far. Obviously if this change is added kb can be deleted rather than updated to 0 (for a small speed boost). This is how to test it further with more models to confirm PPL shows improvements more broadly:
@@ -95,13 +363,41 @@ rm -f Meta-Llama-3.1-8B-Instruct-IQ2_KT.gguf
---
-👤 **louiehelm** commented the **2025-06-12** at **22:27:27**:
+👤 **ikawrakow** commented on **2025-06-12** at **08:16:34**
-Yes initial tests above were on #511. Needs more testing... Qwen3 1.7B IQ2_KT = 2.5% lower PPL.... Magistral 24B IQ2_KT = 50% lower PPL [default model bugged perhaps?]
+@louiehelm Thank you for the comment, looks very promising. It should also improve performance slightly by saving one integer addition.
+
+Do I understand correctly that you applied the new multiplier to PR [#511](https://github.com/ikawrakow/ik_llama.cpp/issues/511) instead of the original implementation on the main branch?
+
+Did you also try models other than LlaMA-3.1-8B-Instruct?
---
-👤 **Nexesenex** commented the **2025-06-13** at **10:32:43**:
+👤 **louiehelm** commented on **2025-06-12** at **22:27:27**
+
+Yes initial tests above were on [#511](https://github.com/ikawrakow/ik_llama.cpp/issues/511). Needs more testing... Qwen3 1.7B IQ2_KT = 2.5% lower PPL.... Magistral 24B IQ2_KT = 50% lower PPL [default model bugged perhaps?]
+
+---
+
+👤 **Nexesenex** commented on **2025-06-13** at **10:20:31**
+
+On Gemma 3 27b qat unquantized (iq2_kt for ffn_up, ffn_gate, attn_q, attn_k and attn_o, iq4_ks for ffn_down, q4_0 for attn_v, and q6 for embed/output), I obtained an almost equivalent perplexity wikitest 512 between the original couple ka/kb and louiehelm's.
+
+But on a Llama 3.3 70b type model (iq2_kt for the ffns, attn_q and attn_o, q6 for embedding, iq5_ks_r4 for output and attn_v, and iq4_ks_r4 for attn_k), the final wikitest 512 perplexity is 1% lower with ka = 3417055213 and kb = 0 compared to the original couple.
+
+With an IQ3_KT with a Cuda MMQ Kernel, and ffn_down/attn_o in iq3_KT, a Llama 3 70b on mono 24GB GPU will become really viable.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-13** at **10:25:19**
+
+> But on a Llama 3.3 70b type model (iq2_kt for the ffns, attn_q and attn_o), the final wikitest 512 perplexity is 1% lower with ka = 3417055213 and kb = 0 compared to the original couple.
+
+1% of what? Can you give the specific PPL values?
+
+---
+
+👤 **Nexesenex** commented on **2025-06-13** at **10:32:43**
> > But on a Llama 3.3 70b type model (iq2_kt for the ffns, attn_q and attn_o), the final wikitest 512 perplexity is 1% lower with ka = 3417055213 and kb = 0 compared to the original couple.
>
@@ -109,13 +405,17 @@ Yes initial tests above were on #511. Needs more testing... Qwen3 1.7B IQ2_KT =
Here is :
-For Llama 3.3 70b type model (iq2_kt for the ffns, attn_q and attn_o, q6 for embedding, iq5_ks_r4 for output and attn_v, and iq4_ks_r4 for attn_k).
-- final wikitest 512 perplexity is 1% lower with ka = 89226354 and kb = 64248484. Final estimate: PPL = 6.1443 +/- 0.03805
-- final wikitest 512 perplexity is 1% lower with ka = 3417055213 and kb = 0. Final estimate: PPL = 6.0739 +/- 0.03762
+For Llama 3.3 70b type model (a merge, not the original 3.3 70b ; iq2_kt for the ffns, attn_q and attn_o, q6 for embedding, iq5_ks_r4 for output and attn_v, and iq4_ks_r4 for attn_k).
+- final wikitest 512 perplexity with ka = 89226354 and kb = 64248484. -> 6.1443 +/- 0.03805
+- final wikitest 512 perplexity is 1% lower with ka = 3417055213 and kb = 0. -> 6.0739 +/- 0.03762
+
+For Gemma 3 27b qat unquantized (iq2_kt for ffn_up, ffn_gate, attn_q, attn_k and attn_o, iq4_ks for ffn_down, q4_0 for attn_v, and q6 for embed/output).
+- final wikitest 512 perplexity with ka = 89226354 and kb = 64248484. -> 8.9993 +/- 0.06887 (and the intermediate values are often lower by 0.01-0.03).
+- final wikitest 512 perplexity with ka = 3417055213 and kb = 0. -> 9.0001 +/- 0.06897
---
-👤 **ikawrakow** commented the **2025-06-13** at **16:59:17**:
+👤 **ikawrakow** commented on **2025-06-13** at **16:59:17**
Did you also try `IQ4_KT`?
@@ -125,20 +425,19 @@ I only changed the CUDA implementation so I can run PPL. When I make the change
---
-👤 **ubergarm** commented the **2025-06-13** at **18:52:10**:
+👤 **ubergarm** commented on **2025-06-13** at **18:52:10**
> Did you also try IQ4_KT?
Just got home and tried louiehelm's 0xCBAC1FED patch on this PR511.
-
### Patch
👈 `0xCBAC1FED` Patch
-```bash
+```diff
diff --git a/ggml/src/ggml-cuda/convert.cu b/ggml/src/ggml-cuda/convert.cu
index a602e47d..45de337e 100644
--- a/ggml/src/ggml-cuda/convert.cu
@@ -342,7 +641,7 @@ Here is the comparison of the same [OpenBuddy-R1-0528-Distill-Qwen3-32B-Preview0
* IQ4_KT
- Patched version is ~0.14% "worse" perplexity
- Patched version quantized ~3.6% slower
-* IQ4_KT (token_embd|output)@iq4_kt
+* IQ2_KT (token_embd|output)@iq4_kt
- Patched version is ~0.61% "better" perplexity
- Patched version quantized ~1.4% slower
@@ -351,6 +650,6 @@ Well, its hard to say for a single run given the deltas seem within the margin o
---
-👤 **ikawrakow** commented the **2025-06-18** at **13:21:51**:
+👤 **ikawrakow** commented on **2025-06-18** at **13:21:51**
-Closing in favor of #529
\ No newline at end of file
+Closing in favor of [#529](https://github.com/ikawrakow/ik_llama.cpp/issues/529)
\ No newline at end of file
diff --git a/github-data/pull_requests/512 - Add top n sigma sampler in webui and other webui fix.md b/github-data/pull_requests/512 - Add top n sigma sampler in webui and other webui fix.md
index 1e262e13e..85c7cbf93 100644
--- a/github-data/pull_requests/512 - Add top n sigma sampler in webui and other webui fix.md
+++ b/github-data/pull_requests/512 - Add top n sigma sampler in webui and other webui fix.md
@@ -1,14 +1,17 @@
-### 🐛 [#512](https://github.com/ikawrakow/ik_llama.cpp/pull/512) - Add top n sigma sampler in webui and other webui fix
+## 🔀 [Pull Request #512](https://github.com/ikawrakow/ik_llama.cpp/pull/512) - Add top n sigma sampler in webui and other webui fix
| **Author** | `firecoperana` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `webui_sampler_fix` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-10 |
| **Updated** | 2025-06-12 |
+| **Merged** | 2025-06-12 |
---
-#### Description
+## 📄 Description
1. Add top n sigma/xtc in the sampler in webui
2. Fix wrong url link in webui
@@ -22,12 +25,12 @@
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-06-11** at **08:12:04**:
+👤 **ikawrakow** commented on **2025-06-11** at **08:12:04**
LGTM. Has anyone else tested?
---
-👤 **ikawrakow** submitted a review the **2025-06-12** at **05:19:20**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-12** at **05:19:20**
\ No newline at end of file
diff --git a/github-data/pull_requests/513 - add dry sampler.md b/github-data/pull_requests/513 - add dry sampler.md
index f4c1262c1..80403ea01 100644
--- a/github-data/pull_requests/513 - add dry sampler.md
+++ b/github-data/pull_requests/513 - add dry sampler.md
@@ -1,14 +1,17 @@
-### 🔀 [#513](https://github.com/ikawrakow/ik_llama.cpp/pull/513) - add dry sampler
+## 🔀 [Pull Request #513](https://github.com/ikawrakow/ik_llama.cpp/pull/513) - add dry sampler
| **Author** | `firecoperana` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `dry_sampler` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-10 |
| **Updated** | 2025-06-19 |
+| **Merged** | 2025-06-19 |
---
-#### Description
+## 📄 Description
I test this using the example in https://github.com/vllm-project/vllm/pull/11368 and it looks ok.
@@ -20,83 +23,103 @@ I test this using the example in https://github.com/vllm-project/vllm/pull/11368
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-06-10** at **02:57:13**:
+👤 **saood06** commented on **2025-06-10** at **02:57:13**
-This already looks so much better than #504 just from looking at how much more similar it is to the reference implementation.
+This already looks so much better than [#504](https://github.com/ikawrakow/ik_llama.cpp/issues/504) just from looking at how much more similar it is to the reference implementation.
-It was taking time testing that because it looked like it had a lot of edge cases that would lead to issues or at least bugs (some more minor than others).
+It was taking time testing that because it looked like it had a lot of edge cases that would lead to issues or at least some incorrect behavior.
---
-👤 **ikawrakow** submitted a review the **2025-06-10** at **05:42:27**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-06-10** at **05:42:27** on `examples/rpc/CMakeLists.txt`:
+👤 **ikawrakow** started a conversation on `examples/rpc/CMakeLists.txt` on **2025-06-10** at **05:42:27**
Why do we need this?
----
-
-👤 **ikawrakow** submitted a review the **2025-06-10** at **05:42:44**: 💬 `COMMENTED`
+> 👤 **firecoperana** replied on **2025-06-10** at **12:39:44**
+>
+> It's in the mainline file.
---
-👤 **ikawrakow** commented during a code review the **2025-06-10** at **05:42:44** on `examples/server/CMakeLists.txt`:
+👤 **ikawrakow** started a conversation on `examples/server/CMakeLists.txt` on **2025-06-10** at **05:42:44**
Why is this needed?
----
-
-👤 **ikawrakow** submitted a review the **2025-06-10** at **05:47:23**: 💬 `COMMENTED`
+> 👤 **firecoperana** replied on **2025-06-10** at **12:49:07**
+>
+> For the stack size code, add_tensor function in ggml-rpc.cpp is using recursion to serialize the graph. Windows has very small stack size by default, so it is easy to cause stack overflow if graph is too complex. This is not needed for dry sampler, but a bug fix for rpc.
---
-👤 **ikawrakow** commented during a code review the **2025-06-10** at **05:47:23** on `src/llama.cpp`:
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-06-10** at **05:47:23**
The DRY sampler only depends on the vocabulary, not the entire model. Wouldn't it have been better to define the interface that way (taking a pointer to vocabulary instead of model)?
----
-
-👤 **firecoperana** submitted a review the **2025-06-10** at **12:39:44**: 💬 `COMMENTED`
+> 👤 **firecoperana** replied on **2025-06-10** at **12:40:23**
+>
+> I can change it.
---
-👤 **firecoperana** submitted a review the **2025-06-10** at **12:40:23**: 💬 `COMMENTED`
+👤 **ikawrakow** commented on **2025-06-10** at **13:38:46**
----
-
-👤 **firecoperana** commented during a code review the **2025-06-10** at **12:40:23** on `src/llama.cpp`:
-
-I can change it.
+@saood06 Any other comments?
---
-👤 **firecoperana** submitted a review the **2025-06-10** at **12:49:08**: 💬 `COMMENTED`
+👤 **saood06** commented on **2025-06-11** at **05:35:49**
+
+Tried to build this to test and got this:
+
+```cpp
+/ik_llama.cpp/src/../include/llama.h:1240:54: error: unknown type name ‘llama_sampler_dry’
+ 1240 | void llama_sample_dry(struct llama_context* ctx, llama_sampler_dry* smpl, llama_token_data_array* candidates_p);
+ | ^~~~~~~~~~~~~~~~~
+```
---
-👤 **ikawrakow** commented the **2025-06-10** at **13:38:46**:
+👤 **firecoperana** commented on **2025-06-12** at **01:13:49**
-@saood06 Any other comments?
+> Tried to build this to test and got this:
+>
+> ```c++
+> /ik_llama.cpp/src/../include/llama.h:1240:54: error: unknown type name ‘llama_sampler_dry’
+> 1240 | void llama_sample_dry(struct llama_context* ctx, llama_sampler_dry* smpl, llama_token_data_array* candidates_p);
+> | ^~~~~~~~~~~~~~~~~
+> ```
+
+Can you clean the build folder and try again? It compiles fine for me.
+Build command I use.
+cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_CURL=OFF -DBUILD_SHARED_LIBS=ON -DGGML_SCHED_MAX_COPIES=1
---
-👤 **saood06** commented the **2025-06-11** at **05:35:49**:
+👤 **saood06** commented on **2025-06-12** at **01:33:18**
-Tried to build this to test and got this:
+> Can you clean the build folder and try again?
-```cpp
-/ik_llama.cpp/src/../include/llama.h:1240:54: error: unknown type name ‘llama_sampler_dry’
- 1240 | void llama_sample_dry(struct llama_context* ctx, llama_sampler_dry* smpl, llama_token_data_array* candidates_p);
+This was with a clean build folder.
+
+>It compiles fine for me. Build command I use. cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_CURL=OFF -DBUILD_SHARED_LIBS=ON -DGGML_SCHED_MAX_COPIES=1
+
+Maybe it is because you set `-DLLAMA_BUILD_TESTS=OFF`, sorry I should have given you more of the compile error log.
+
+```
+In file included from /home/saood06/ik_main/ik_llama.cpp/tests/test-c.c:1:
+/home/saood06/ik_main/ik_llama.cpp/src/../include/llama.h:1240:54: error: unknown type name ‘llama_sampler_dry’
+ 1240 | void llama_sample_dry(struct llama_context* ctx, llama_sampler_dry * smpl, llama_token_data_array* candidates_p);
| ^~~~~~~~~~~~~~~~~
+gmake[2]: *** [tests/CMakeFiles/test-c.dir/build.make:79: tests/CMakeFiles/test-c.dir/test-c.c.o] Error 1
+gmake[1]: *** [CMakeFiles/Makefile2:2688: tests/CMakeFiles/test-c.dir/all] Error 2
+gmake[1]: *** Waiting for unfinished jobs....
```
---
-👤 **firecoperana** commented the **2025-06-12** at **02:58:18**:
+👤 **firecoperana** commented on **2025-06-12** at **02:58:18**
> > Can you clean the build folder and try again?
>
@@ -120,4 +143,4 @@ Should be good this time.
---
-👤 **ikawrakow** submitted a review the **2025-06-19** at **07:24:21**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-19** at **07:24:21**
\ No newline at end of file
diff --git a/github-data/pull_requests/515 - IQ2_XXS much faster CPU prompt processing.md b/github-data/pull_requests/515 - IQ2_XXS much faster CPU prompt processing.md
new file mode 100644
index 000000000..fd2b7a903
--- /dev/null
+++ b/github-data/pull_requests/515 - IQ2_XXS much faster CPU prompt processing.md
@@ -0,0 +1,59 @@
+## 🔀 [Pull Request #515](https://github.com/ikawrakow/ik_llama.cpp/pull/515) - IQ2_XXS: much faster CPU prompt processing
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_xxs_gemm` |
+| **Target Branch** | `main` |
+| **Created** | 2025-06-11 |
+| **Updated** | 2025-06-11 |
+| **Merged** | 2025-06-11 |
+
+---
+
+## 📄 Description
+
+While experimenting with the trellis quants in PRs [#505](https://github.com/ikawrakow/ik_llama.cpp/issues/505) and [#511](https://github.com/ikawrakow/ik_llama.cpp/issues/511), I realized that CPU matrix multiplications (GEMM) for quants that are slow to unpack and make ready for `int8_t` dot products (as the trellis quants are) are much faster if one unpacks a given number of rows to, e.g., `Q8_0_R8`, and then uses the `Q8_0_R8 x Q8_2_X4` GEMM to perform the multiplication with **all columns** of the right matrix.
+
+This PR applies the approach of [#505](https://github.com/ikawrakow/ik_llama.cpp/issues/505)/[#511](https://github.com/ikawrakow/ik_llama.cpp/issues/511) to `IQ2_XXS` (`AVX2/Zen4` only). We get nearly 3X improvement in PP performance compared to `IQ2_XXS` on the main branch, and 2X compared to `IQ2_XXS_R4`!
+
+The same approach can be used out-of-the-box for `IQ3_XXS` (left for a follow up PR).
+
+`IQ2_XS, IQ2_S` and `IQ3_S` use blocks of 16, so one would need a new row-interleaved 8-bit type with blocks of 16 for those.
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-06-11** at **07:59:26**
+
+Here some sweep-bench tables on a Ryzen-7950X CPU for LlaMA-3.1-8B
+
+### IQ2_XXS, main branch
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 4.793 | 106.82 | 5.452 | 23.48 |
+| 512 | 128 | 512 | 4.984 | 102.73 | 6.375 | 20.08 |
+| 512 | 128 | 1024 | 5.357 | 95.58 | 6.191 | 20.68 |
+| 512 | 128 | 1536 | 5.062 | 101.15 | 6.290 | 20.35 |
+| 512 | 128 | 2048 | 5.168 | 99.07 | 6.559 | 19.51 |
+
+### IQ2_XXS_R4
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 3.467 | 147.69 | 5.508 | 23.24 |
+| 512 | 128 | 512 | 3.764 | 136.03 | 5.964 | 21.46 |
+| 512 | 128 | 1024 | 3.573 | 143.31 | 6.292 | 20.34 |
+| 512 | 128 | 1536 | 3.660 | 139.88 | 6.341 | 20.19 |
+| 512 | 128 | 2048 | 3.729 | 137.29 | 6.620 | 19.33 |
+
+### IQ2_XXS, PR
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.778 | 288.03 | 5.484 | 23.34 |
+| 512 | 128 | 512 | 1.860 | 275.28 | 5.685 | 22.52 |
+| 512 | 128 | 1024 | 1.948 | 262.82 | 5.848 | 21.89 |
+| 512 | 128 | 1536 | 2.040 | 250.93 | 6.158 | 20.78 |
+| 512 | 128 | 2048 | 2.131 | 240.32 | 6.322 | 20.25 |
\ No newline at end of file
diff --git a/github-data/pull_requests/515 - IQ2_XXS_ much faster CPU prompt processing.md b/github-data/pull_requests/515 - IQ2_XXS_ much faster CPU prompt processing.md
deleted file mode 100644
index 4f7282719..000000000
--- a/github-data/pull_requests/515 - IQ2_XXS_ much faster CPU prompt processing.md
+++ /dev/null
@@ -1,19 +0,0 @@
-### 🔀 [#515](https://github.com/ikawrakow/ik_llama.cpp/pull/515) - IQ2_XXS: much faster CPU prompt processing
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-06-11 |
-| **Updated** | 2025-06-11 |
-
----
-
-#### Description
-
-While experimenting with the trellis quants in PRs #505 and #511, I realized that CPU matrix multiplications (GEMM) for quants that are slow to unpack and make ready for `int8_t` dot products (as the trellis quants are) are much faster if one unpacks a given number of rows to, e.g., `Q8_0_R8`, and then uses the `Q8_0_R8 x Q8_2_X4` GEMM to perform the multiplication with **all columns** of the right matrix.
-
-This PR applies the approach of #505/#511 to `IQ2_XXS` (`AVX2/Zen4` only). We get nearly 3X improvement in PP performance compared to `IQ2_XXS` on the main branch, and 2X compared to `IQ2_XXS_R4`!
-
-The same approach can be used out-of-the-box for `IQ3_XXS` (left for a follow up PR).
-
-`IQ2_XS, IQ2_S` and `IQ3_S` use blocks of 16, so one would need a new row-interleaved 8-bit type with blocks of 16 for those.
\ No newline at end of file
diff --git a/github-data/pull_requests/516 - Much faster iq3_xxs GEMM via repacking to q8_0_r8 _AVX2_.md b/github-data/pull_requests/516 - Much faster iq3_xxs GEMM via repacking to q8_0_r8 AVX2.md
similarity index 75%
rename from github-data/pull_requests/516 - Much faster iq3_xxs GEMM via repacking to q8_0_r8 _AVX2_.md
rename to github-data/pull_requests/516 - Much faster iq3_xxs GEMM via repacking to q8_0_r8 AVX2.md
index 326fc2bea..2fcb47e6e 100644
--- a/github-data/pull_requests/516 - Much faster iq3_xxs GEMM via repacking to q8_0_r8 _AVX2_.md
+++ b/github-data/pull_requests/516 - Much faster iq3_xxs GEMM via repacking to q8_0_r8 AVX2.md
@@ -1,16 +1,19 @@
-### 🔀 [#516](https://github.com/ikawrakow/ik_llama.cpp/pull/516) - Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2)
+## 🔀 [Pull Request #516](https://github.com/ikawrakow/ik_llama.cpp/pull/516) - Much faster iq3_xxs GEMM via repacking to q8_0_r8 (AVX2)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq3_xxs_gemm` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-11 |
| **Updated** | 2025-06-11 |
+| **Merged** | 2025-06-11 |
---
-#### Description
+## 📄 Description
-This PR is a follow up of #515, and applies the same technique to `IQ3_XXS`. We see nearly 3X increase in prompt processing speed compared to `IQ3_XXS`, and over 2X compared to `IQ3_XXS_R4`.
+This PR is a follow up of [#515](https://github.com/ikawrakow/ik_llama.cpp/issues/515), and applies the same technique to `IQ3_XXS`. We see nearly 3X increase in prompt processing speed compared to `IQ3_XXS`, and over 2X compared to `IQ3_XXS_R4`.
Sweep-bench for pure `IQ3_XXS` quantization of LlaMA-3.1-8B on a Ryzen-7950X CPU:
diff --git a/github-data/pull_requests/517 - IQ1_S_ much faster CPU prompt processing.md b/github-data/pull_requests/517 - IQ1_S much faster CPU prompt processing.md
similarity index 74%
rename from github-data/pull_requests/517 - IQ1_S_ much faster CPU prompt processing.md
rename to github-data/pull_requests/517 - IQ1_S much faster CPU prompt processing.md
index 59e87bf74..a9a85b6f6 100644
--- a/github-data/pull_requests/517 - IQ1_S_ much faster CPU prompt processing.md
+++ b/github-data/pull_requests/517 - IQ1_S much faster CPU prompt processing.md
@@ -1,16 +1,19 @@
-### 🔀 [#517](https://github.com/ikawrakow/ik_llama.cpp/pull/517) - IQ1_S: much faster CPU prompt processing
+## 🔀 [Pull Request #517](https://github.com/ikawrakow/ik_llama.cpp/pull/517) - IQ1_S: much faster CPU prompt processing
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_s_gemm` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-11 |
| **Updated** | 2025-06-11 |
+| **Merged** | 2025-06-11 |
---
-#### Description
+## 📄 Description
-This PR is a follow up of #515 and #516, and applies the same technique to `IQ1_S`. We see nearly 2X increase in prompt processing speed compared to `IQ1_S` and `IQ1_S_R4.
+This PR is a follow up of [#515](https://github.com/ikawrakow/ik_llama.cpp/issues/515) and [#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516), and applies the same technique to `IQ1_S`. We see nearly 2X increase in prompt processing speed compared to `IQ1_S` and `IQ1_S_R4.
Sweep-bench for `IQ1_S` quantization of LlaMA-3.1-8B on a Ryzen-7950X CPU:
diff --git a/github-data/pull_requests/518 - IQ3_S_ much faster CPU prompt processing.md b/github-data/pull_requests/518 - IQ3_S much faster CPU prompt processing.md
similarity index 73%
rename from github-data/pull_requests/518 - IQ3_S_ much faster CPU prompt processing.md
rename to github-data/pull_requests/518 - IQ3_S much faster CPU prompt processing.md
index 825338386..8626a84d0 100644
--- a/github-data/pull_requests/518 - IQ3_S_ much faster CPU prompt processing.md
+++ b/github-data/pull_requests/518 - IQ3_S much faster CPU prompt processing.md
@@ -1,16 +1,19 @@
-### 🔀 [#518](https://github.com/ikawrakow/ik_llama.cpp/pull/518) - IQ3_S: much faster CPU prompt processing
+## 🔀 [Pull Request #518](https://github.com/ikawrakow/ik_llama.cpp/pull/518) - IQ3_S: much faster CPU prompt processing
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq3_s_gemm` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-11 |
| **Updated** | 2025-06-12 |
+| **Merged** | 2025-06-12 |
---
-#### Description
+## 📄 Description
-As PRs #515, #516, #517.
+As PRs [#515](https://github.com/ikawrakow/ik_llama.cpp/issues/515), [#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516), [#517](https://github.com/ikawrakow/ik_llama.cpp/issues/517).
Here a sweep-bench with this PR for LlaMA-3.1-8B on a Ryzen-7950X CPU
diff --git a/github-data/pull_requests/52 - Fix bug and D 128 case for Q8_0 k-cache.md b/github-data/pull_requests/52 - Fix bug and D 128 case for Q8_0 k-cache.md
new file mode 100644
index 000000000..becfe46ca
--- /dev/null
+++ b/github-data/pull_requests/52 - Fix bug and D 128 case for Q8_0 k-cache.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #52](https://github.com/ikawrakow/ik_llama.cpp/pull/52) - Fix bug and D < 128 case for Q8_0 k-cache
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_kq` |
+| **Target Branch** | `main` |
+| **Created** | 2024-09-13 |
+| **Updated** | 2024-09-13 |
+| **Merged** | 2024-09-13 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/52 - Fix bug and D _ 128 case for Q8_0 k-cache.md b/github-data/pull_requests/52 - Fix bug and D _ 128 case for Q8_0 k-cache.md
deleted file mode 100644
index 37b0b9835..000000000
--- a/github-data/pull_requests/52 - Fix bug and D _ 128 case for Q8_0 k-cache.md
+++ /dev/null
@@ -1,7 +0,0 @@
-### 🐛 [#52](https://github.com/ikawrakow/ik_llama.cpp/pull/52) - Fix bug and D < 128 case for Q8_0 k-cache
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-09-13 |
-| **Updated** | 2024-09-13 |
\ No newline at end of file
diff --git a/github-data/pull_requests/520 - Better strategy for GPU offload.md b/github-data/pull_requests/520 - Better strategy for GPU offload.md
index 4409942ff..bbf33bd8b 100644
--- a/github-data/pull_requests/520 - Better strategy for GPU offload.md
+++ b/github-data/pull_requests/520 - Better strategy for GPU offload.md
@@ -1,14 +1,17 @@
-### 🔀 [#520](https://github.com/ikawrakow/ik_llama.cpp/pull/520) - Better strategy for GPU offload
+## 🔀 [Pull Request #520](https://github.com/ikawrakow/ik_llama.cpp/pull/520) - Better strategy for GPU offload
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/moe_offload_strategy` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-11 |
| **Updated** | 2025-06-12 |
+| **Merged** | 2025-06-12 |
---
-#### Description
+## 📄 Description
In a hybrid GPU/CPU situation, the decision if to offload model weights residing in RAM to the GPU to perform matrix multiplications is a tricky business. On the master branch (and also in mainline `llama.cpp`) a simply heuristics is used: if the batch size is `>= 32` and the operation is supported, it is offloaded to the GPU. This heuristics comes from the experience with dense models (but even then, the correct decision will depend on the speed of the CPU, the GPU, and the PCI-E bandwidth).
@@ -69,11 +72,11 @@ Please play with this PR and let me know if it is useful to get merged.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **quasar-of-mikus** commented the **2025-06-11** at **20:40:59**:
+👤 **quasar-of-mikus** commented on **2025-06-11** at **20:40:59**
-Looks quite good for setups like mine where PCIe bandwidth is low and prompt length is short.
+Looks good for setups like mine where PCIe bandwidth is low and prompt length is short.
128gb ddr4 3200 2ch
2x 3090 PCIe 3.0 x8 x8
@@ -85,39 +88,66 @@ Main: ~1.5t/s pp
PR: 9-10t/s pp
-PR:
+PR build: cdcb324f (3743):
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mla | amb | ts | mmap | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --: | ----: | ------------ | ---: | ---: | ------------: | ---------------: |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp16 | 7.81 ± 0.55 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp32 | 10.61 ± 0.34 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp64 | 13.31 ± 0.16 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp128 | 17.58 ± 0.20 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp256 | 19.66 ± 0.08 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp512 | 21.24 ± 0.10 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp1024 | 52.75 ± 0.37 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp2048 | 97.01 ± 0.59 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp4096 | 165.89 ± 0.63 |
-build: cdcb324f (3743)
-
-
-Main, note the very low speeds for pp16 to pp256:
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp16 | 7.81 ± 0.55 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp32 | 10.61 ± 0.34 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp64 | 13.31 ± 0.16 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp128 | 17.58 ± 0.20 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp256 | 19.66 ± 0.08 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp512 | 21.24 ± 0.10 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp1024 | 52.75 ± 0.37 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp2048 | 97.01 ± 0.59 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp4096 | 165.89 ± 0.63 |
+
+
+
+
+Main, note the very low speeds for pp32 to pp256 build: 3f54b497 (3742):
| model | size | params | backend | ngl | threads | n_batch | n_ubatch | fa | mla | amb | ts | mmap | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------: | -------: | -: | --: | ----: | ------------ | ---: | ---: | ------------: | ---------------: |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp16 | 7.81 ± 0.40 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp32 | 1.89 ± 0.01 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp64 | 3.69 ± 0.01 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp128 | 7.44 ± 0.01 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp256 | 14.47 ± 0.03 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp512 | 27.94 ± 0.10 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp1024 | 52.96 ± 0.18 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp2048 | 97.27 ± 0.25 |
-| deepseek2 671B IQ1_S_R4 - 1.5 bpw | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp4096 | 166.23 ± 0.19 |
-build: 3f54b497 (3742)
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp16 | 7.81 ± 0.40 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp32 | 1.89 ± 0.01 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp64 | 3.69 ± 0.01 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp128 | 7.44 ± 0.01 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp256 | 14.47 ± 0.03 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp512 | 27.94 ± 0.10 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp1024 | 52.96 ± 0.18 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp2048 | 97.27 ± 0.25 |
+| ds2 671B IQ1_S_R4 | 130.20 GiB | 672.05 B | CUDA | 999 | 18 | 4096 | 4096 | 1 | 3 | 512 | 23.00/23.00 | 0 | 1 | pp4096 | 166.23 ± 0.19 |
---
-👤 **ikawrakow** commented the **2025-06-12** at **04:44:22**:
+👤 **ikawrakow** commented on **2025-06-12** at **04:44:22**
Here the above data illustrated in a graph:
-
\ No newline at end of file
+
+
+---
+
+👤 **ikawrakow** commented on **2025-06-12** at **04:58:12**
+
+A also took the liberty to plot @quasar-of-mikus's data:
+
+
+
+We see that in this case the performance at 512 tokens is better on the main branch. With the default value of `GGML_CUDA_MIN_BATCH_OFFLOAD=32` the MoE matrix multiplications are done on the CPU for a batch of 512 tokens, and in this case it is slower than to offload to the GPU. So, @quasar-of-mikus will likely benefit from using `-GGML_CUDA_MIN_BATCH_OFFLOAD=20`
+
+---
+
+👤 **quasar-of-mikus** commented on **2025-06-12** at **17:30:44**
+
+On my setup and with this model, a lower value of `-DGGML_CUDA_MIN_BATCH_OFFLOAD=16` brought the performance back @ pp512, resulting in an overall improvement (at least with this level of granularity) 👍
+| test | Old main t/s | =32 | =16 |
+| ------------: | ---------------: | ---------------: | ---------------: |
+| pp16 | 7.81 ± 0.40 | 7.81 ± 0.55 | 7.72 ± 0.49 |
+| pp32 | 1.89 ± 0.01 | 10.61 ± 0.34 | 10.71 ± 0.05 |
+| pp64 | 3.69 ± 0.01 | 13.31 ± 0.16 | 13.72 ± 0.19 |
+| pp128 | 7.44 ± 0.01 | 17.58 ± 0.20 | 17.61 ± 0.25 |
+| pp256 | 14.47 ± 0.03 | 19.66 ± 0.08 | 19.73 ± 0.13 |
+| **--> pp512** | **27.94 ± 0.10** | 21.24 ± 0.10 | **27.94 ± 0.20** |
+| pp1024 | 52.96 ± 0.18 | 52.75 ± 0.37 | 52.92 ± 0.30 |
+| pp2048 | 97.27 ± 0.25 | 97.01 ± 0.59 | 97.12 ± 0.54 |
+| pp4096 | 166.23 ± 0.19 | 165.89 ± 0.63 | 165.97 ± 0.92 |
\ No newline at end of file
diff --git a/github-data/pull_requests/524 - Perhaps a slightly better GEMV version for IQ2_XXS IQ3_XXS IQ3_S.md b/github-data/pull_requests/524 - Perhaps a slightly better GEMV version for IQ2_XXS IQ3_XXS IQ3_S.md
new file mode 100644
index 000000000..360928fc9
--- /dev/null
+++ b/github-data/pull_requests/524 - Perhaps a slightly better GEMV version for IQ2_XXS IQ3_XXS IQ3_S.md
@@ -0,0 +1,40 @@
+## 🔀 [Pull Request #524](https://github.com/ikawrakow/ik_llama.cpp/pull/524) - Perhaps a slightly better GEMV version for IQ2_XXS, IQ3_XXS, IQ3_S
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq_gemv_tweaks` |
+| **Target Branch** | `main` |
+| **Created** | 2025-06-12 |
+| **Updated** | 2025-06-13 |
+| **Merged** | 2025-06-13 |
+
+---
+
+## 📄 Description
+
+Closes [#523](https://github.com/ikawrakow/ik_llama.cpp/issues/523)
+
+@ciprianveg @Ph0rk0z
+
+Does this work better for you?
+
+---
+
+## 💬 Conversation
+
+👤 **ciprianveg** commented on **2025-06-12** at **20:29:16**
+
+> Ref [#523](https://github.com/ikawrakow/ik_llama.cpp/issues/523)
+>
+> @ciprianveg @Ph0rk0z
+>
+> Does this work better for you?
+
+Yes, it does! :)
+
+---
+
+👤 **Ph0rk0z** commented on **2025-06-12** at **20:47:02**
+
+I'm seeing 10s again so this one is a winner. Wish I knew why RTR isn't helpful since so many other people appear to have huge benefits from it at all batch sizes. Plus it buffs TG even more. I'm still using GCC 11, can it be related to that compiler thing?
\ No newline at end of file
diff --git a/github-data/pull_requests/524 - Perhaps a slightly better GEMV version for IQ2_XXS_ IQ3_XXS_ IQ3_S.md b/github-data/pull_requests/524 - Perhaps a slightly better GEMV version for IQ2_XXS_ IQ3_XXS_ IQ3_S.md
deleted file mode 100644
index 314e93b99..000000000
--- a/github-data/pull_requests/524 - Perhaps a slightly better GEMV version for IQ2_XXS_ IQ3_XXS_ IQ3_S.md
+++ /dev/null
@@ -1,31 +0,0 @@
-### 🔀 [#524](https://github.com/ikawrakow/ik_llama.cpp/pull/524) - Perhaps a slightly better GEMV version for IQ2_XXS, IQ3_XXS, IQ3_S
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-06-12 |
-| **Updated** | 2025-06-13 |
-
----
-
-#### Description
-
-Closes #523
-
-@ciprianveg @Ph0rk0z
-
-Does this work better for you?
-
----
-
-#### 💬 Conversation
-
-👤 **ciprianveg** commented the **2025-06-12** at **20:29:16**:
-
-> Ref #523
->
-> @ciprianveg @Ph0rk0z
->
-> Does this work better for you?
-
-Yes, it does! :)
\ No newline at end of file
diff --git a/github-data/pull_requests/525 - Faster CPU prompt processing for Q4_K and Q5_K.md b/github-data/pull_requests/525 - Faster CPU prompt processing for Q4_K and Q5_K.md
index 81b9f76ec..c18dcab62 100644
--- a/github-data/pull_requests/525 - Faster CPU prompt processing for Q4_K and Q5_K.md
+++ b/github-data/pull_requests/525 - Faster CPU prompt processing for Q4_K and Q5_K.md
@@ -1,18 +1,21 @@
-### 🔀 [#525](https://github.com/ikawrakow/ik_llama.cpp/pull/525) - Faster CPU prompt processing for Q4_K and Q5_K
+## 🔀 [Pull Request #525](https://github.com/ikawrakow/ik_llama.cpp/pull/525) - Faster CPU prompt processing for Q4_K and Q5_K
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q4_k_gemm` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-12 |
| **Updated** | 2025-06-13 |
+| **Merged** | 2025-06-13 |
---
-#### Description
+## 📄 Description
These two quantization types are quite popular, so I thought it makes sense to improve their performance. The repacked variants `Q4_K_R4` and `Q5_K_R4` do not have a CUDA implementation, so repacking is not useful in a hybrid CPU/GPU setup where it may be better to offload tensors stored in RAM to the GPU when processing large batched.
-The PR uses the same trick as #515, #516, #517, #518. When processing batches `>= 32` tokens, `Q4_K` or `Q5_K` quantized tensors are repacked on-the-fly to `Q8_1_R8`.
+The PR uses the same trick as [#515](https://github.com/ikawrakow/ik_llama.cpp/issues/515), [#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516), [#517](https://github.com/ikawrakow/ik_llama.cpp/issues/517), [#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518). When processing batches `>= 32` tokens, `Q4_K` or `Q5_K` quantized tensors are repacked on-the-fly to `Q8_1_R8`.
Here some sweep-bench results for LLaMA-3.1-8B-Instruct on a Ryzen-7950X CPU
@@ -76,4 +79,4 @@ Here some sweep-bench results for LLaMA-3.1-8B-Instruct on a Ryzen-7950X CPU
| 512 | 128 | 1536 | 2.350 | 217.91 | 11.888 | 10.77 |
| 512 | 128 | 2048 | 2.133 | 240.00 | 11.998 | 10.67 |
-Here performance gains are not as large as in #514, #515, #516, #518 as k-quants are much faster than sub-4 bpw i-quants. Nevertheless, we see a nearly 50% PP performance improvement compared to the non-interleaved variants, and 5-10% improvement compared to the `_R4` variants.
\ No newline at end of file
+Here performance gains are not as large as in [#514](https://github.com/ikawrakow/ik_llama.cpp/issues/514), [#515](https://github.com/ikawrakow/ik_llama.cpp/issues/515), [#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516), [#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518) as k-quants are much faster than sub-4 bpw i-quants. Nevertheless, we see a nearly 50% PP performance improvement compared to the non-interleaved variants, and 5-10% improvement compared to the `_R4` variants.
\ No newline at end of file
diff --git a/github-data/pull_requests/528 - Fix bug introduced in 524525.md b/github-data/pull_requests/528 - Fix bug introduced in 524525.md
new file mode 100644
index 000000000..984b6d484
--- /dev/null
+++ b/github-data/pull_requests/528 - Fix bug introduced in 524525.md
@@ -0,0 +1,26 @@
+## 🔀 [Pull Request #528](https://github.com/ikawrakow/ik_llama.cpp/pull/528) - Fix bug introduced in [#524](https://github.com/ikawrakow/ik_llama.cpp/issues/524)/[#525](https://github.com/ikawrakow/ik_llama.cpp/issues/525)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_bug_481` |
+| **Target Branch** | `main` |
+| **Created** | 2025-06-14 |
+| **Updated** | 2025-06-14 |
+| **Merged** | 2025-06-14 |
+
+---
+
+## 📄 Description
+
+When adding the faster GEMM in [#524](https://github.com/ikawrakow/ik_llama.cpp/issues/524) / [#525](https://github.com/ikawrakow/ik_llama.cpp/issues/525) I forgot to add the call to `iqk_convert_repack` also in the MoE matrix multiplication functions, which causes a crash (see [#527](https://github.com/ikawrakow/ik_llama.cpp/issues/527)). This PR fixes it.
+
+---
+
+## 💬 Conversation
+
+👤 **ycat3** commented on **2025-06-14** at **10:30:08**
+
+Thanks.
+It works fine.
+[#527](https://github.com/ikawrakow/ik_llama.cpp/issues/527)
\ No newline at end of file
diff --git a/github-data/pull_requests/528 - Fix bug introduced in _524_525.md b/github-data/pull_requests/528 - Fix bug introduced in _524_525.md
deleted file mode 100644
index a64bc4f89..000000000
--- a/github-data/pull_requests/528 - Fix bug introduced in _524_525.md
+++ /dev/null
@@ -1,23 +0,0 @@
-### 🐛 [#528](https://github.com/ikawrakow/ik_llama.cpp/pull/528) - Fix bug introduced in [#524](https://github.com/ikawrakow/ik_llama.cpp/issues/524)/[#525](https://github.com/ikawrakow/ik_llama.cpp/issues/525)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-06-14 |
-| **Updated** | 2025-06-14 |
-
----
-
-#### Description
-
-When adding the faster GEMM in #524 / #525 I forgot to add the call to `iqk_convert_repack` also in the MoE matrix multiplication functions, which causes a crash (see #527). This PR fixes it.
-
----
-
-#### 💬 Conversation
-
-👤 **ycat3** commented the **2025-06-14** at **10:30:08**:
-
-Thanks.
-It works fine.
-#527
\ No newline at end of file
diff --git a/github-data/pull_requests/529 - New IQ2_KT_ IQ3_KT and IQ4_KT_ V2.md b/github-data/pull_requests/529 - New IQ2_KT IQ3_KT and IQ4_KT V2.md
similarity index 55%
rename from github-data/pull_requests/529 - New IQ2_KT_ IQ3_KT and IQ4_KT_ V2.md
rename to github-data/pull_requests/529 - New IQ2_KT IQ3_KT and IQ4_KT V2.md
index db17cc380..1a498ed7c 100644
--- a/github-data/pull_requests/529 - New IQ2_KT_ IQ3_KT and IQ4_KT_ V2.md
+++ b/github-data/pull_requests/529 - New IQ2_KT IQ3_KT and IQ4_KT V2.md
@@ -1,16 +1,20 @@
-### 🔀 [#529](https://github.com/ikawrakow/ik_llama.cpp/pull/529) - New IQ2_KT, IQ3_KT and IQ4_KT, V2
+## 🔀 [Pull Request #529](https://github.com/ikawrakow/ik_llama.cpp/pull/529) - New IQ2_KT, IQ3_KT and IQ4_KT, V2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/new_iq2kt_v2` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-14 |
| **Updated** | 2025-06-18 |
+| **Merged** | 2025-06-18 |
+| **Labels** | `Breaking change` |
---
-#### Description
+## 📄 Description
-This PR is the combination of #505 and #511, but rebased on current main, and using @louiehelm's alternative multiplier (see comments in #511).
+This PR is the combination of [#505](https://github.com/ikawrakow/ik_llama.cpp/issues/505) and [#511](https://github.com/ikawrakow/ik_llama.cpp/issues/511), but rebased on current main, and using @louiehelm's alternative multiplier (see comments in [#511](https://github.com/ikawrakow/ik_llama.cpp/issues/511)).
I was curios to see if not having an extra addition per step when generating the trellis sequence will have a pefromance impact, so made a proper change rather than just blindly replacing the two constants using `sed`. On CUDA performance impact is negligible, on `AVX2` we see 1-2% improvement.
@@ -18,11 +22,15 @@ With the latest commits I have also adapted `IQ3_KT` to the integer trellis.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-06-14** at **17:55:02**:
+👤 **ubergarm** commented on **2025-06-14** at **17:55:02**
-Okay, finished a fresh test using this new PR529 on DeepSeek-R1-0528. I made two almost identical quants that differ only in the commit used to quantize/test/benchmark. Quantization was done roughly simultaneously, one on each socket of a dual socket intel xeon 6980P.
+## tl;dr;
+Something seems off as perplexity on both of my new tests for this PR529 are much higher now than a previous attempt around June 3rd with commits around [PR484](https://github.com/ikawrakow/ik_llama.cpp/pull/484). I double checked my logs and commands and confirmed using the same imatrix etc so not sure what is going on. I've been compiling for CPU only fwiw. Details below.
+
+## Experiment
+Okay, doing a fresh test using this new PR529 on DeepSeek-R1-0528. I made two almost identical quants that differ only in the commit used to quantize/test/benchmark. Quantization was done roughly simultaneously, one on each socket of a dual socket intel xeon 6980P.
### Common Recipe
@@ -38,6 +46,8 @@ Okay, finished a fresh test using this new PR529 on DeepSeek-R1-0528. I made two
* `mix-IQ4_KT-0xCBAC1FED`
* including louiehelm's multiplier
* quantize time = 15666814.63 ms - 4.35 hours
+ * Final estimate: PPL = 13.2237 +/- 0.09673
+ * *UPDATE w/ 6408b94 CPU fix* `Final estimate: PPL = 3.5808 +/- 0.01943`
```
INFO [ main] build info | tid="135292499650880" timestamp=1749922901 build=3776 commit="e5a06688"
INFO [ main] system info | tid="135292499650880" timestamp=1749922901 n_threads=80 n_threads_batch=128 total_threads=512 system_inf
@@ -49,6 +59,7 @@ o="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNN
* `mix-IQ4_KT-og`
* two commits earlier, *without* louiehelm's multiplier
* quantize time = 15890223.61 ms - 4.41 hours
+ * Final estimate: PPL = 13.1972 +/- 0.09621
```
INFO [ main] build info | tid="133117239363904" timestamp=1749922843 build=3774 commit="b1416bf0"
INFO [ main] system info | tid="133117239363904" timestamp=1749922843 n_threads=80 n_threads_batch=128 total_threads=512 system_inf
@@ -56,18 +67,131 @@ o="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNN
F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
```
+
+
+👈 Perplexity Command
+
### Perplexity
-TODO
+```bash
+#model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-mix-IQ4_KT-og.gguf
+model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-mix-IQ4_KT-0xCBAC1FED.gguf
+
+numactl -N 0 -m 0 \
+./build/bin/llama-perplexity \
+ --model "$model" \
+ -ctk f16 \
+ -mla 3 -fa \
+ -amb 512 \
+ -fmoe \
+ -f wiki.test.raw \
+ --seed 1337 \
+ --no-mmap \
+ --threads 128 \
+ --numa numactl
+```
+
+
### llama-sweep-bench
-TODO
+
+
+
+
+
+👈 llama-sweep-bench logs
+
+```bash
+#!/usr/bin/env bash
+
+#model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-mix-IQ4_KT-0xCBAC1FED.gguf
+model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-mix-IQ4_KT-og.gguf
+
+numactl -N 1 -m 1 \
+./build/bin/llama-sweep-bench \
+ --model "$model" \
+ -c 8704 \
+ -ctk f16 \
+ -mla 3 -fa \
+ -fmoe \
+ --no-mmap \
+ --threads 80 \
+ --threads-batch 128 \
+ --numa numactl \
+ --warmup-batch
+```
+
+#### DeepSeek-R1-0528-mix-IQ4_KT-og b1416bf0
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 4.884 | 104.83 | 20.927 | 6.12 |
+| 512 | 128 | 512 | 5.235 | 97.80 | 26.596 | 4.81 |
+| 512 | 128 | 1024 | 5.670 | 90.31 | 20.470 | 6.25 |
+| 512 | 128 | 1536 | 6.087 | 84.12 | 27.983 | 4.57 |
+| 512 | 128 | 2048 | 7.085 | 72.27 | 29.016 | 4.41 |
+| 512 | 128 | 2560 | 7.906 | 64.76 | 28.767 | 4.45 |
+| 512 | 128 | 3072 | 7.373 | 69.44 | 28.416 | 4.50 |
+| 512 | 128 | 3584 | 8.216 | 62.32 | 20.452 | 6.26 |
+| 512 | 128 | 4096 | 9.451 | 54.17 | 19.672 | 6.51 |
+| 512 | 128 | 4608 | 9.573 | 53.49 | 20.232 | 6.33 |
+| 512 | 128 | 5120 | 9.966 | 51.37 | 20.479 | 6.25 |
+| 512 | 128 | 5632 | 10.774 | 47.52 | 24.437 | 5.24 |
+| 512 | 128 | 6144 | 11.816 | 43.33 | 22.064 | 5.80 |
+| 512 | 128 | 6656 | 12.937 | 39.58 | 21.809 | 5.87 |
+| 512 | 128 | 7168 | 12.519 | 40.90 | 27.118 | 4.72 |
+| 512 | 128 | 7680 | 13.039 | 39.27 | 29.001 | 4.41 |
+| 512 | 128 | 8192 | 13.726 | 37.30 | 22.418 | 5.71 |
+
+#### DeepSeek-R1-0528-mix-IQ4_KT-0xCBAC1FED e5a06688
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 4.699 | 108.96 | 20.847 | 6.14 |
+| 512 | 128 | 512 | 6.078 | 84.24 | 25.537 | 5.01 |
+| 512 | 128 | 1024 | 5.355 | 95.62 | 21.164 | 6.05 |
+| 512 | 128 | 1536 | 5.858 | 87.41 | 25.636 | 4.99 |
+| 512 | 128 | 2048 | 6.297 | 81.31 | 26.767 | 4.78 |
+| 512 | 128 | 2560 | 7.047 | 72.66 | 24.938 | 5.13 |
+| 512 | 128 | 3072 | 7.227 | 70.85 | 19.698 | 6.50 |
+| 512 | 128 | 3584 | 8.237 | 62.16 | 23.806 | 5.38 |
+| 512 | 128 | 4096 | 8.208 | 62.38 | 19.898 | 6.43 |
+| 512 | 128 | 4608 | 9.000 | 56.89 | 23.857 | 5.37 |
+| 512 | 128 | 5120 | 9.589 | 53.39 | 21.710 | 5.90 |
+| 512 | 128 | 5632 | 10.344 | 49.50 | 25.217 | 5.08 |
+| 512 | 128 | 6144 | 11.087 | 46.18 | 23.658 | 5.41 |
+| 512 | 128 | 6656 | 12.194 | 41.99 | 24.630 | 5.20 |
+| 512 | 128 | 7168 | 11.945 | 42.86 | 21.043 | 6.08 |
+| 512 | 128 | 7680 | 12.517 | 40.91 | 22.126 | 5.79 |
+| 512 | 128 | 8192 | 13.231 | 38.70 | 22.478 | 5.69 |
+
+
### Conclusion
-I'll update this with results after perplexity and llama-sweep-bench finishes up.
+Huh, the perplexity of ~13.2 on *both* of these seems surprisingly "bad" relative to my earlier first test I had done with a smaller mix `IQ2_KT` 196.696 GiB (2.514 BPW) with `Final estimate: PPL = 3.6378 +/- 0.01997` and my other various mixes on huggingface shown in the graph below.
+
+```
+# https://github.com/ikawrakow/ik_llama.cpp/pull/484
+# quantized on June 2nd, 2025: build = 3726 (061d064b)
+# perplexity on June 3rd, 2025: build = 3724 (626f49ab)
+# compiled and tested CPU-only on 24x core 7965WX
+system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NE
+ON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1
+```
+- type f32: 361 tensors
+- type q5_0: 61 tensors - `attn_k_b`
+- type iq2_kt: 116 tensors - `ffn_(gate|up)_exps`
+- type iq3_kt: 58 tensors - `ffn_down_exps`
+- type iq4_kt: 551 tensors - everything else
+
+
+
+I did confirm that both these new test case models give a reasonable answer for my usual `Count from 1 to 10 in French.` with no gibberish output.
+
+Not sure what to try next... Possibly try comparing perplexity of a smaller quant on this PR529 vs older PR484 with exact same recipe? Maybe the new recipe is actually worse despite being larger?
+
+Happy to provide more logs as requested. Thanks!
---
-👤 **ubergarm** commented the **2025-06-14** at **21:30:05**:
+👤 **ubergarm** commented on **2025-06-14** at **21:30:05**
Okay, did one more faster experiment using the *same* recipe/imatrix for Qwen3-30B-A3B moe. Something is off between this PR529 and main's implementation of a "pure" `iq4_kt` when checking llama-perplexity compiled CPU only:
@@ -77,6 +201,8 @@ Okay, did one more faster experiment using the *same* recipe/imatrix for Qwen3-3
* main@6fc5bbb6
- Final estimate: PPL = 9.3612 +/- 0.07518
- total time = 585627.38 ms / 299009 tokens
+* PR511 ik/new_iq2kt@c8cf1280
+ - Odd, quantized one to sanity check this, but now getting `GGML_ASSERT(fms.S[j] > 0)` on the intel xeon 6980P... hrmm... i used the Thread Ripper Pro in my testing over on that PR yesterday compiled with CUDA...
- Qwen3-30B-A3B
- 14.344 GiB (4.035 BPW)
@@ -85,7 +211,7 @@ Okay, did one more faster experiment using the *same* recipe/imatrix for Qwen3-3
---
-👤 **ubergarm** commented the **2025-06-15** at **15:54:01**:
+👤 **ubergarm** commented on **2025-06-15** at **15:54:01**
Okay, back to the basics as my sanity is thin. I used the Thread Ripper Pro 24x Core with RTX A6000 GPUs to test.
@@ -105,21 +231,36 @@ The CUDA implementation of this PR529 seems to give reasonable perplexity. Howev
---
-👤 **ikawrakow** commented the **2025-06-15** at **16:06:42**:
+👤 **ikawrakow** commented on **2025-06-15** at **16:06:42**
PPL = 922 means I have a bug in the CPU implementation. I haven't come around to check.
---
-👤 **ubergarm** commented the **2025-06-15** at **16:14:28**:
+👤 **ubergarm** commented on **2025-06-15** at **16:14:28**
All good no rush. Just wanted to re-create the issue on a "known working" system for my own peace of mind hah.
-If it is useful for anyone else testing, I'll leave this experimental [Qwen3-30B-A3B-IQ4_KT-PR529-e5a06688.gguf](http://emptyduck.com/Qwen3-30B-A3B-IQ4_KT-PR529-e5a06688.gguf) on my personal server for a few days.
+If it is useful for anyone else testing, I'll make this experimental [Qwen3-30B-A3B-IQ4_KT-PR529-e5a06688.gguf](http://emptyduck.com/Qwen3-30B-A3B-IQ4_KT-PR529-e5a06688.gguf) available from my personal server for a few days. ~15GiB with sha256sum `c47dd5298181806608fe6dc585d7f1ba2387788881a68be85ff42655e03ce453`
---
-👤 **ubergarm** commented the **2025-06-16** at **14:40:28**:
+👤 **ikawrakow** commented on **2025-06-16** at **12:02:54**
+
+The CPU bug is fixed now.
+
+I get quite a bit lower PPL using
+```
+./bin/llama-quantize --imatrix qwen3_imat_unsloth.dat --output-tensor-type q8_0 --token-embedding-type q8_0 --pure
+```
+(didn't want to risk something going wrong in the output tensor or the token embeddings)
+
+* CUDA: `Final estimate: PPL = 9.0801 +/- 0.07115`
+* CPU: `Final estimate: PPL = 9.0781 +/- 0.07113`
+
+---
+
+👤 **ubergarm** commented on **2025-06-16** at **14:40:28**
Aye, that did the trick for qwen3moe:
@@ -130,7 +271,13 @@ I'll come back around with some more results soon thanks!
---
-👤 **ubergarm** commented the **2025-06-16** at **14:57:16**:
+👤 **ikawrakow** commented on **2025-06-16** at **14:42:50**
+
+But why is your PPL so much higher?
+
+---
+
+👤 **ubergarm** commented on **2025-06-16** at **14:57:16**
> But why is your PPL so much higher?
@@ -145,21 +292,32 @@ I'll use your command with my imatrix now and test again.
./bin/llama-quantize --imatrix qwen3_imat_unsloth.dat --output-tensor-type q8_0 --token-embedding-type q8_0 --pure
```
-I'm assuming the higher bpw output/token_embd accounts for most of the discrepancy.
+I'm assuming the higher bpw output/token_embd accounts for most of the discrepancy.
+
+---
+
+*UPDATE*
+
+Results with the IQ4_KT using q8_0 for embedding/output are still higher for me. Discrepency could be because you use the unsloth imatrix dat. My imatrix dat is older using only [imatrix calibration_data_v5_rc.txt](https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c#file-calibration_data_v5_rc-txt).
+
+My newer imatrix corpus adds extra data in an attempt to activate more experts, but I never went back and updated my Qwen3-30B-A3Bs with it. I believe both unsloth and bartowski used bigger corpus for qwen3moe due to issues quantizing at lower BPW with their usual corpus text.
+
+* GPU: `Final estimate: PPL = 9.2301 +/- 0.07378`
+* CPU: `Final estimate: PPL = 9.2279 +/- 0.07375`
---
-👤 **ikawrakow** commented the **2025-06-16** at **16:30:16**:
+👤 **ikawrakow** commented on **2025-06-16** at **16:30:16**
> Results with the IQ4_KT using q8_0 for embedding/output are still higher for me.
-Must be the imatrix, then. I used the one [from Unsloth](https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF/blob/main/imatrix_unsloth.dat), which produced the lowest PPL in my Qwen3 quantization experiments (#359)
+Must be the imatrix, then. I used the one [from Unsloth](https://huggingface.co/unsloth/Qwen3-30B-A3B-128K-GGUF/blob/main/imatrix_unsloth.dat), which produced the lowest PPL in my Qwen3 quantization experiments ([#359](https://github.com/ikawrakow/ik_llama.cpp/issues/359))
---
-👤 **Nexesenex** commented the **2025-06-17** at **01:34:47**:
+👤 **Nexesenex** commented on **2025-06-17** at **01:34:47**
-`llama-perplexity -m Configurable-Llama-3.1-8B-Instruct_iMat-IQ3_KT_Nv2_embed_q6_0_output&attn_v_iq5ksr4_attn_k_iq4ksr4.gguf -f wiki.test.raw -ngl 150 -b 512 -mg 0 -ts 40,0,0 --no-mmap -fa -c 512
+llama-perplexity -m Configurable-Llama-3.1-8B-Instruct_iMat-IQ3_KT_Nv2_embed_q6_0_output&attn_v_iq5ksr4_attn_k_iq4ksr4.gguf -f wiki.test.raw -ngl 150 -b 512 -mg 0 -ts 40,0,0 --no-mmap -fa -c 512
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q6_0: 1 tensors
llama_model_loader: - type iq3_kt: 160 tensors
@@ -169,27 +327,33 @@ llm_load_print_meta: model ftype = IQ3_KT - 3.125 bpw
llm_load_print_meta: model size = 3.315 GiB (3.546 BPW)
llm_load_print_meta: repeating layers = 2.596 GiB (3.195 BPW, 6.980 B parameters)
-Final estimate: PPL = 8.1431 +/- 0.05213`
+Final estimate: PPL = 8.1431 +/- 0.05213
IQ3_KT's PPL works for me on CUDA. It also infers on both CPU and CUDA.
-`llama-perplexity -m Configurable-Llama-3.1-8B-Instruct_iMat-IQ3_XXS_embed_q6_0_output&attn_v_iq5ksr4_attn_k_iq4ksr4.gguf -f wiki.test.raw -ngl 150 -b 512 -mg 0 -ts 40,0,0 --no-mmap -fa -c 512
+llama-perplexity -m Configurable-Llama-3.1-8B-Instruct_iMat-IQ3_XXS_embed_q6_0_output&attn_v_iq5ksr4_attn_k_iq4ksr4.gguf -f wiki.test.raw -ngl 150 -b 512 -mg 0 -ts 40,0,0 --no-mmap -fa -c 512
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type iq3_xxs: 160 tensors
llama_model_loader: - type q6_0: 1 tensors
llama_model_loader: - type iq4_ks_r4: 32 tensors
llama_model_loader: - type iq5_ks_r4: 33 tensors
llm_load_print_meta: model ftype = IQ3_XXS - 3.0625 bpw
-llm_load_print_meta: model params = 8.030 B
llm_load_print_meta: model size = 3.261 GiB (3.489 BPW)
llm_load_print_meta: repeating layers = 2.542 GiB (3.129 BPW, 6.980 B parameters)
+
Final estimate: PPL = 8.4642 +/- 0.05423
-IQ3_XXS has some serious competition, quant quality wise.
+IQ3_XXS has some serious competition, quant quality wise.
+
+Same recipe, but with IQ3_S tensors instead of IQ3_KT/IQ3_XXS : Final estimate: PPL = 7.9331 +/- 0.05065
+With IQ3_K : Final estimate: PPL = 7.9098 +/- 0.05097
+With Q3_K : Final estimate: PPL = 8.1488 +/- 0.05292 (IQ3_KT went below!)
+
+Note: this version of Llama 8B gives a PPL of 7.3287 +/- 0.04703 for Q8_0, so very close to the original.
---
-👤 **ubergarm** commented the **2025-06-17** at **03:53:27**:
+👤 **ubergarm** commented on **2025-06-17** at **03:53:27**
> With the latest commits I have also adapted IQ3_KT to the integer trellis.
@@ -200,6 +364,7 @@ I saw this and started cooking asap targeting ~3.5bpw for [some recent requests
- quantize time = 8 hours 48 minutes
- `Final estimate: PPL = 3.3056 +/- 0.01758`
- (beats the "unsloth dynamic" 275.576GiB `UD-Q3_K_XL` at `3.3341 +/- 0.01784`)
+ - `Cor(ln(PPL(Q)), ln(PPL(base))): 99.69%` on `ubergarm-kld-test-corpus-short.txt`
- f32: 361 tensors
- q5_0: 61 tensors `attn_k_b`
- q8_0: 1 tensors `token_embd`
@@ -232,6 +397,7 @@ About the largest size quant fitting 256GB RAM ~48+GB VRAM rigs. I'm offloading

+*NOTE*: the [iq2_kt is the earlier implementation from PR484](https://github.com/ikawrakow/ik_llama.cpp/pull/484#issuecomment-2932521414) *without* louiehelm's magic number and a slightly different mix.
@@ -335,6 +501,6 @@ model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ3_KT.gg
---
-👤 **ikawrakow** commented the **2025-06-18** at **13:20:49**:
+👤 **ikawrakow** commented on **2025-06-18** at **13:20:49**
Time to merge this.
\ No newline at end of file
diff --git a/github-data/pull_requests/53 - Quantization mixes tweaks.md b/github-data/pull_requests/53 - Quantization mixes tweaks.md
index 72370bd29..b23f30c46 100644
--- a/github-data/pull_requests/53 - Quantization mixes tweaks.md
+++ b/github-data/pull_requests/53 - Quantization mixes tweaks.md
@@ -1,14 +1,17 @@
-### 🔀 [#53](https://github.com/ikawrakow/ik_llama.cpp/pull/53) - Quantization mixes tweaks
+## 🔀 [Pull Request #53](https://github.com/ikawrakow/ik_llama.cpp/pull/53) - Quantization mixes tweaks
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/qmix_tweaks` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-14 |
| **Updated** | 2024-09-14 |
+| **Merged** | 2024-09-14 |
---
-#### Description
+## 📄 Description
This PR changes quantization type selection for some quantization types. This leads to a lower PPL **and** a smaller quantized model size for Gemma-2 models.
diff --git a/github-data/pull_requests/531 - Much faster CPU prompt processing _part 1_.md b/github-data/pull_requests/531 - Much faster CPU prompt processing _part 1_.md
deleted file mode 100644
index 454f05b5f..000000000
--- a/github-data/pull_requests/531 - Much faster CPU prompt processing _part 1_.md
+++ /dev/null
@@ -1,139 +0,0 @@
-### 🔀 [#531](https://github.com/ikawrakow/ik_llama.cpp/pull/531) - Much faster CPU prompt processing (part 1)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-06-16 |
-| **Updated** | 2025-06-17 |
-
----
-
-#### Description
-
-This PR is a continuation of #515, #516, #517, #518 with the following differences
-* Quants are repacked to `Q8_K_R8` instead of `Q8_0_R8`. `Q8_K_R8` is the fastest quant known to human kind (see #141), and that helps achieve significant performance gains when batch size is greater than 32 tokens or so
-* The technique of on-the-fly repacking before matrix multiplications is extended to a larger set of quants: `IQ1_M, IQ2_XS, IQ2_S, Q3_K` in addition to `IQ1_S, IQ2_XXS, IQ3_XXS, IQ3_S` already improved in the quoted PRs
-* There is also `Q6_K` added, but in this case repacking is to `Q8_0_R8` as `Q6_K` cannot be losslessly repacked to `Q8_K`, and I was worried that there could be a non-negligible accuracy loss due to that.
-
-The following table shows a PP-512 performance comparison between the main branch and this PR. Model is LlaMA-3.1-8B-Instruct. Quantization is always "pure" (i.e., all tensors except the output tensor and the token embedding tensor are quantized with the selected quantization type). CPU is Ryzen-7950X
-
-| model | size | test | t/s | t/s | Speedup |
-| -----------------| ---------: | ------------: | ---------------: | ---------------: | -------: |
-| llama 8B IQ1_S | 2.07 GiB | pp512 | 264.36 ± 0.32 | 308.67 ± 3.45 | 1.168 |
-| llama 8B IQ1_M | 2.21 GiB | pp512 | 25.12 ± 0.15 | 309.81 ± 2.78 | 12.333 |
-| llama 8B IQ2_XXS | 2.35 GiB | pp512 | 284.22 ± 2.46 | 344.02 ± 4.27 | 1.210 |
-| llama 8B IQ2_XS | 2.56 GiB | pp512 | 108.77 ± 2.32 | 346.11 ± 2.26 | 3.182 |
-| llama 8B IQ2_S | 2.76 GiB | pp512 | 101.43 ± 1.13 | 341.02 ± 1.60 | 3.362 |
-| llama 8B IQ3_XXS | 3.17 GiB | pp512 | 280.56 ± 3.15 | 341.95 ± 3.33 | 1.219 |
-| llama 8B Q3_K | 3.41 GiB | pp512 | 178.56 ± 2.99 | 344.45 ± 4.15 | 1.929 |
-| llama 8B IQ3_S | 3.47 GiB | pp512 | 283.86 ± 2.62 | 340.68 ± 2.87 | 1.200 |
-| llama 8B Q6_K | 6.14 GiB | pp512 | 178.49 ± 1.78 | 271.50 ± 2.96 | 1.521 |
-
-A few notes:
-* Gains for the quants that already had repacking to `Q8_0_R8` (`IQ1_S, IQ2_XXS, IQ3_XXS, IQ3_S`) are in the range of 15-20%
-* `IQ1_M` stands out because it did not have a fast `iqk` GEMM implementation at all, so we gain a factor of 12X!
-* The PR changes the status of i-quants from being slow for CPU inference to being among the fastest (well, at least at this point before I apply this technique to `IQX_K` quants).
-
-I have the impression that most people use `ik_llama.cpp` for MoE models. MoE models are quite different compared to dense models such as LLaMA-3.1-8B because each routed expert "sees" a small fraction of the tokens in a batch, so effective batch size is much smaller compared to a dense model. Hence, PP performance gains for MoE models will be more modest. It is instructive to look as PP performance as a function of batch size. The following graph shows the result for `Q3_K`, which has a reasonably efficient `iqk` GEMM implementation. The repacking strategy kicks in at 32 tokens, so up to that point performance is the same. The relative performance gain from this PR then slowly grows to about 1.9X at 256 tokens, and remains (nearly) the same from there on.
-
-
-
-Based on this we can expect lower performance gains for a MoE model. For instance, DeepSeek-R1/V3 have 256 total experts but only 8 active experts, so effectively this strategy will not become active (or will have a very small impact) up to u-batch sizes of 1024 tokens. I cannot run DeepSeek-R1/V3, but I can run Qwen3-30B-A3B, and the next graphs shows performance for this model quantized with `Q3_K`. As expected, performance gains are smaller, about 1.4X at the peak, and poerformance improvement is not significant before 64 tokens.
-
-
-
-
----
-
-#### 💬 Conversation
-
-👤 **saood06** commented the **2025-06-16** at **10:26:55**:
-
-Does this also improve the behavior at higher contexts? For me running Deepseek at higher contexts PP and TG both approach ~1 t/s at high context.
-
----
-
-👤 **ikawrakow** commented the **2025-06-16** at **10:31:53**:
-
-> For me running Deepseek at higher contexts PP and TG both approach ~1 t/s.
-
-This indicates that your computer spends the entire time computing self attention for long enough context. If so, this PR will have zero impact on your long context performance.
-
----
-
-👤 **ikawrakow** commented the **2025-06-16** at **12:53:47**:
-
-> but at higher context the power usage looks a lot closer to TG (which is memory/QPI bandwidth bound).
-
-Or is it rather the other way around (TG looks a lot closer to PP)? If you buy my explanation that for a large context all the time is spent in the self attention calculation, then there isn't that much of a difference between TG and PP: for DeepSeek each row in the KV cache multiples 128 rows of activations (`K*Q` and `V*softmax(K*Q)`), so the matrix multiplications in TG and PP have very similar characteristics (there isn't much of a difference between multiplying 128 rows and 128 x n_ubatch rows), and it is compute bound, not memory bound.
-
----
-
-👤 **saood06** commented the **2025-06-16** at **13:54:42**:
-
->If you buy my explanation
-
-I do, I was just trying to understand it.
-
-> Or is it rather the other way around (TG looks a lot closer to PP)? that for a large context all the time is spent in the self attention calculation, then there isn't that much of a difference between TG and PP: for DeepSeek each row in the KV cache multiples 128 rows of activations (`K*Q` and `V*softmax(K*Q)`), so the matrix multiplications in TG and PP have very similar characteristics (there isn't much of a difference between multiplying 128 rows and 128 x n_ubatch rows), and it is compute bound, not memory bound.
-
-That makes sense.
-
-I did attempt to look at the [PCM](https://github.com/intel/pcm) data I had from earlier and just generated, and looked at CPU power usage and IPC but I'm not sure if the numbers are actually useful since I found during TG that it was causing paging (there really isn't much spare RAM on my system during inference).
-
----
-
-👤 **ubergarm** commented the **2025-06-16** at **23:06:48**:
-
-Not a comprehensive test, but this `PR531` does indeed speed-up PP as
-compared to `main` on my DeepSeek-R1-0528-IQ1_S.
-
-So while not as dramatic given only 58 `ffn_down_exps@iq1_m` on this MoE,
-the `iq1_s` speed-ups are already merged into main so overall much faster
-than before.
-
-The `IQ1_S_R4` still benches faster for this specific configuration at least.
-
-Note, to keep it simple, I did *not* use `-rtr` to repack the attn/shexp
-tensors; so actual CPU-only scenario would likely be faster still.
-
-## DeepSeek-R1-0528-IQ1_S
-- type f32: 361 tensors
-- type q4_0: 61 tensors `attn_k_b`
-- type iq1_s: 116 tensors `ffn_(gate|up)_exps`
-- type iq1_m: 58 tensors `ffn_down_exps`
-- type iq4_ks: 551 tensors `everything else`
-
-## DeepSeek-R1-0528-IQ1_S_R4
-- type f32: 361 tensors
-- type q4_0: 61 tensors `attn_k_b`
-- type iq1_s_r4: 116 tensors `ffn_(gate|up)_exps`
-- type iq1_m_r4: 58 tensors `ffn_down_exps`
-- type iq4_ks: 551 tensors `everything else`
-
-Importantly, `llama-perplexity` runs clean on PR531@72fd9faa so the new `iq1_m` implementation seems solid.
-
-* `IQ1_S`: `Final estimate: PPL = 4.8910 +/- 0.02856`
-* `IQ1_S_R4`: `Final estimate: PPL = 4.8805 +/- 0.02876` (computed back on PR494)
-
-
-
----
-
-👤 **ikawrakow** commented the **2025-06-17** at **10:32:11**:
-
-> The IQ1_S_R4 still benches faster for this specific configuration at least and seems to be the same speed on both this PR and main as I would expect.
-
-This is because of the extremely high total_experts/active_experts=32 ratio in DeeSeek-V3. For u_batch size of 512 we are still far away from the regime where this new repacking scheme pays large dividends. Perhaps the gains will be bigger for `u_batch = 1024` or even `u_batch = 2048`?
-
-But yes, I see that this PR may not have the huge impact that it should because people have somehow decided that `ik_llama.cpp` is only good for very large MoE models, so they keep using `llama.cpp` for everything else, missing out big times on performance for CPU-only inference (and it isn't so that CPU performance is not discussed in the `llama.cpp` repository on a regular basis).
-
----
-
-👤 **saood06** commented the **2025-06-17** at **20:56:40**:
-
->For me running Deepseek at higher contexts PP and TG both approach ~1 t/s.
-
-I had been so used to V3 where I never enabled high batch sizes with amb because I rarely requested over the default batch size of 512. But with R1 that is not in the case (due to thought tokens removal which results in reprocessing context).
-
-I ran an experiment at high context, processing 4096 tokens (33640 to 37736) and this went from 2950 to 1619 seconds, and even a reduction in compute buffer (`15387.76 MiB` vs `9404.80 MiB`).
\ No newline at end of file
diff --git a/github-data/pull_requests/531 - Much faster CPU prompt processing part 1.md b/github-data/pull_requests/531 - Much faster CPU prompt processing part 1.md
new file mode 100644
index 000000000..5c5d9cbc5
--- /dev/null
+++ b/github-data/pull_requests/531 - Much faster CPU prompt processing part 1.md
@@ -0,0 +1,280 @@
+## 🔀 [Pull Request #531](https://github.com/ikawrakow/ik_llama.cpp/pull/531) - Much faster CPU prompt processing (part 1)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/q6_k_gemm` |
+| **Target Branch** | `main` |
+| **Created** | 2025-06-16 |
+| **Updated** | 2025-06-17 |
+| **Merged** | 2025-06-17 |
+
+---
+
+## 📄 Description
+
+This PR is a continuation of [#515](https://github.com/ikawrakow/ik_llama.cpp/issues/515), [#516](https://github.com/ikawrakow/ik_llama.cpp/issues/516), [#517](https://github.com/ikawrakow/ik_llama.cpp/issues/517), [#518](https://github.com/ikawrakow/ik_llama.cpp/issues/518) with the following differences
+* Quants are repacked to `Q8_K_R8` instead of `Q8_0_R8`. `Q8_K_R8` is the fastest quant known to human kind (see [#141](https://github.com/ikawrakow/ik_llama.cpp/issues/141)), and that helps achieve significant performance gains when batch size is greater than 32 tokens or so
+* The technique of on-the-fly repacking before matrix multiplications is extended to a larger set of quants: `IQ1_M, IQ2_XS, IQ2_S, Q3_K` in addition to `IQ1_S, IQ2_XXS, IQ3_XXS, IQ3_S` already improved in the quoted PRs
+* There is also `Q6_K` added, but in this case repacking is to `Q8_0_R8` as `Q6_K` cannot be losslessly repacked to `Q8_K`, and I was worried that there could be a non-negligible accuracy loss due to that.
+
+The following table shows a PP-512 performance comparison between the main branch and this PR. Model is LlaMA-3.1-8B-Instruct. Quantization is always "pure" (i.e., all tensors except the output tensor and the token embedding tensor are quantized with the selected quantization type). CPU is Ryzen-7950X
+
+| model | size | test | t/s | t/s | Speedup |
+| -----------------| ---------: | ------------: | ---------------: | ---------------: | -------: |
+| llama 8B IQ1_S | 2.07 GiB | pp512 | 264.36 ± 0.32 | 308.67 ± 3.45 | 1.168 |
+| llama 8B IQ1_M | 2.21 GiB | pp512 | 25.12 ± 0.15 | 309.81 ± 2.78 | 12.333 |
+| llama 8B IQ2_XXS | 2.35 GiB | pp512 | 284.22 ± 2.46 | 344.02 ± 4.27 | 1.210 |
+| llama 8B IQ2_XS | 2.56 GiB | pp512 | 108.77 ± 2.32 | 346.11 ± 2.26 | 3.182 |
+| llama 8B IQ2_S | 2.76 GiB | pp512 | 101.43 ± 1.13 | 341.02 ± 1.60 | 3.362 |
+| llama 8B IQ3_XXS | 3.17 GiB | pp512 | 280.56 ± 3.15 | 341.95 ± 3.33 | 1.219 |
+| llama 8B Q3_K | 3.41 GiB | pp512 | 178.56 ± 2.99 | 344.45 ± 4.15 | 1.929 |
+| llama 8B IQ3_S | 3.47 GiB | pp512 | 283.86 ± 2.62 | 340.68 ± 2.87 | 1.200 |
+| llama 8B Q6_K | 6.14 GiB | pp512 | 178.49 ± 1.78 | 271.50 ± 2.96 | 1.521 |
+
+A few notes:
+* Gains for the quants that already had repacking to `Q8_0_R8` (`IQ1_S, IQ2_XXS, IQ3_XXS, IQ3_S`) are in the range of 15-20%
+* `IQ1_M` stands out because it did not have a fast `iqk` GEMM implementation at all, so we gain a factor of 12X!
+* The PR changes the status of i-quants from being slow for CPU inference to being among the fastest (well, at least at this point before I apply this technique to `IQX_K` quants).
+
+I have the impression that most people use `ik_llama.cpp` for MoE models. MoE models are quite different compared to dense models such as LLaMA-3.1-8B because each routed expert "sees" a small fraction of the tokens in a batch, so effective batch size is much smaller compared to a dense model. Hence, PP performance gains for MoE models will be more modest. It is instructive to look as PP performance as a function of batch size. The following graph shows the result for `Q3_K`, which has a reasonably efficient `iqk` GEMM implementation. The repacking strategy kicks in at 32 tokens, so up to that point performance is the same. The relative performance gain from this PR then slowly grows to about 1.9X at 256 tokens, and remains (nearly) the same from there on.
+
+
+
+Based on this we can expect lower performance gains for a MoE model. For instance, DeepSeek-R1/V3 have 256 total experts but only 8 active experts, so effectively this strategy will not become active (or will have a very small impact) up to u-batch sizes of 1024 tokens. I cannot run DeepSeek-R1/V3, but I can run Qwen3-30B-A3B, and the next graphs shows performance for this model quantized with `Q3_K`. As expected, performance gains are smaller, about 1.4X at the peak, and poerformance improvement is not significant before 64 tokens.
+
+
+
+
+---
+
+## 💬 Conversation
+
+👤 **saood06** commented on **2025-06-16** at **10:26:55**
+
+Does this also improve the behavior at higher contexts? For me running Deepseek at higher contexts PP and TG both approach ~1 t/s.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-16** at **10:31:53**
+
+> For me running Deepseek at higher contexts PP and TG both approach ~1 t/s.
+
+This indicates that your computer spends the entire time computing self attention for long enough context. If so, this PR will have zero impact on your long context performance.
+
+---
+
+👤 **saood06** commented on **2025-06-16** at **12:25:14**
+
+> This indicates that your computer spends the entire time computing self attention for long enough context.
+
+I'm trying to understand but that explanation (at least to me) doesn't explain why at low context PP uses a lot more power than TG (as it is compute bound), but at higher context the power usage looks a lot closer to TG (which is memory/QPI bandwidth bound).
+
+I don't have actual numbers (as I don't think the exact numbers matter) but the difference is stark enough for me to notice based on the CPU temperatures.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-16** at **12:53:47**
+
+> but at higher context the power usage looks a lot closer to TG (which is memory/QPI bandwidth bound).
+
+Or is it rather the other way around (TG looks a lot closer to PP)? If you buy my explanation that for a large context all the time is spent in the self attention calculation, then there isn't that much of a difference between TG and PP: for DeepSeek each row in the KV cache multiples 128 rows of activations (`K*Q` and `V*softmax(K*Q)`), so the matrix multiplications in TG and PP have very similar characteristics (there isn't much of a difference between multiplying 128 rows and 128 x n_ubatch rows), and it is compute bound, not memory bound.
+
+---
+
+👤 **saood06** commented on **2025-06-16** at **13:54:42**
+
+>If you buy my explanation
+
+I do, I was just trying to understand it.
+
+> Or is it rather the other way around (TG looks a lot closer to PP)? that for a large context all the time is spent in the self attention calculation, then there isn't that much of a difference between TG and PP: for DeepSeek each row in the KV cache multiples 128 rows of activations (`K*Q` and `V*softmax(K*Q)`), so the matrix multiplications in TG and PP have very similar characteristics (there isn't much of a difference between multiplying 128 rows and 128 x n_ubatch rows), and it is compute bound, not memory bound.
+
+That makes sense.
+
+I did attempt to look at the [PCM](https://github.com/intel/pcm) data I had from earlier and just generated, and looked at CPU power usage and IPC but I'm not sure if the numbers are actually useful since I found during TG that it was causing paging (there really isn't much spare RAM on my system during inference).
+
+---
+
+👤 **ubergarm** commented on **2025-06-16** at **23:06:48**
+
+Not a comprehensive test, but this `PR531` does indeed speed-up PP as compared to `main` on my DeepSeek-R1-0528-IQ1_S.
+
+So while not as dramatic given only 58 `ffn_down_exps@iq1_m` on this MoE, the `iq1_s` speed-ups are already merged into main so overall much faster than before.
+
+The `IQ1_S_R4` still benches faster for this specific configuration at least and seems to be the same speed on both this PR and main as I would expect.
+
+Note, to keep it simple, I did *not* use `-rtr` to repack the attn/shexp tensors; so actual CPU-only scenario would likely be faster still.
+
+## DeepSeek-R1-0528-IQ1_S
+- type f32: 361 tensors
+- type q4_0: 61 tensors `attn_k_b`
+- type iq1_s: 116 tensors `ffn_(gate|up)_exps`
+- type iq1_m: 58 tensors `ffn_down_exps`
+- type iq4_ks: 551 tensors `everything else`
+
+## DeepSeek-R1-0528-IQ1_S_R4
+- type f32: 361 tensors
+- type q4_0: 61 tensors `attn_k_b`
+- type iq1_s_r4: 116 tensors `ffn_(gate|up)_exps`
+- type iq1_m_r4: 58 tensors `ffn_down_exps`
+- type iq4_ks: 551 tensors `everything else`
+
+Importantly, `llama-perplexity` runs clean on PR531@72fd9faa so the new `iq1_m` implementation seems solid. Here's the values using `-ctk f16`:
+
+* `IQ1_S`: `Final estimate: PPL = 4.8910 +/- 0.02856`
+* `IQ1_S_R4`: `Final estimate: PPL = 4.8805 +/- 0.02876` (computed back on PR494)
+
+
+
+👈 sweep-bench data
+
+```bash
+model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ1_S_R4/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf
+#model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ1_S/DeepSeek-R1-0528-IQ1_S-00001-of-00003.gguf
+
+numactl -N 0 -m 0 \
+./build/bin/llama-sweep-bench \
+ --model "$model" \
+ -c 8704 \
+ -ctk q8_0 \
+ -mla 3 -fa \
+ -fmoe \
+ --no-mmap \
+ --threads 80 \
+ --threads-batch 128 \
+ --numa numactl \
+ --warmup-batch
+```
+
+## DeepSeek-R1-0528-IQ1_S_R4
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 4.423 | 115.77 | 17.351 | 7.38 |
+| 512 | 128 | 512 | 4.687 | 109.23 | 19.213 | 6.66 |
+| 512 | 128 | 1024 | 5.096 | 100.46 | 19.777 | 6.47 |
+| 512 | 128 | 1536 | 5.244 | 97.63 | 23.691 | 5.40 |
+| 512 | 128 | 2048 | 6.130 | 83.52 | 23.180 | 5.52 |
+| 512 | 128 | 2560 | 5.937 | 86.24 | 23.369 | 5.48 |
+| 512 | 128 | 3072 | 6.240 | 82.05 | 23.431 | 5.46 |
+| 512 | 128 | 3584 | 7.088 | 72.23 | 20.811 | 6.15 |
+| 512 | 128 | 4096 | 7.450 | 68.72 | 23.252 | 5.50 |
+| 512 | 128 | 4608 | 7.118 | 71.93 | 21.718 | 5.89 |
+| 512 | 128 | 5120 | 7.433 | 68.88 | 21.636 | 5.92 |
+| 512 | 128 | 5632 | 7.707 | 66.44 | 22.484 | 5.69 |
+| 512 | 128 | 6144 | 8.019 | 63.85 | 22.216 | 5.76 |
+| 512 | 128 | 6656 | 8.271 | 61.91 | 22.708 | 5.64 |
+| 512 | 128 | 7168 | 8.604 | 59.51 | 24.151 | 5.30 |
+| 512 | 128 | 7680 | 8.840 | 57.92 | 23.185 | 5.52 |
+| 512 | 128 | 8192 | 9.295 | 55.08 | 22.992 | 5.57 |
+
+## PR531@72fd9faa DeepSeek-R1-0528-IQ1_S
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 6.139 | 83.40 | 17.278 | 7.41 |
+| 512 | 128 | 512 | 6.244 | 82.00 | 18.809 | 6.81 |
+| 512 | 128 | 1024 | 6.436 | 79.55 | 21.856 | 5.86 |
+| 512 | 128 | 1536 | 6.754 | 75.81 | 22.630 | 5.66 |
+| 512 | 128 | 2048 | 7.189 | 71.22 | 23.058 | 5.55 |
+| 512 | 128 | 2560 | 8.803 | 58.16 | 22.779 | 5.62 |
+| 512 | 128 | 3072 | 9.001 | 56.88 | 22.750 | 5.63 |
+| 512 | 128 | 3584 | 8.404 | 60.92 | 24.276 | 5.27 |
+| 512 | 128 | 4096 | 9.322 | 54.93 | 23.410 | 5.47 |
+| 512 | 128 | 4608 | 9.230 | 55.47 | 23.225 | 5.51 |
+| 512 | 128 | 5120 | 9.237 | 55.43 | 23.691 | 5.40 |
+| 512 | 128 | 5632 | 9.139 | 56.02 | 24.198 | 5.29 |
+| 512 | 128 | 6144 | 10.114 | 50.62 | 26.936 | 4.75 |
+| 512 | 128 | 6656 | 10.054 | 50.93 | 23.654 | 5.41 |
+| 512 | 128 | 7168 | 9.958 | 51.41 | 24.267 | 5.27 |
+| 512 | 128 | 7680 | 11.029 | 46.42 | 24.723 | 5.18 |
+| 512 | 128 | 8192 | 10.682 | 47.93 | 24.311 | 5.27 |
+
+## main@6fc5bbb6 DeepSeek-R1-0528-IQ1_S
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 8.530 | 60.02 | 17.123 | 7.48 |
+| 512 | 128 | 512 | 8.767 | 58.40 | 20.432 | 6.26 |
+| 512 | 128 | 1024 | 8.826 | 58.01 | 20.463 | 6.26 |
+| 512 | 128 | 1536 | 8.964 | 57.12 | 22.866 | 5.60 |
+| 512 | 128 | 2048 | 9.520 | 53.78 | 23.782 | 5.38 |
+| 512 | 128 | 2560 | 10.572 | 48.43 | 22.904 | 5.59 |
+| 512 | 128 | 3072 | 10.952 | 46.75 | 23.303 | 5.49 |
+| 512 | 128 | 3584 | 10.747 | 47.64 | 23.772 | 5.38 |
+| 512 | 128 | 4096 | 10.734 | 47.70 | 23.223 | 5.51 |
+| 512 | 128 | 4608 | 11.519 | 44.45 | 23.582 | 5.43 |
+| 512 | 128 | 5120 | 12.040 | 42.53 | 24.150 | 5.30 |
+| 512 | 128 | 5632 | 12.694 | 40.33 | 23.282 | 5.50 |
+| 512 | 128 | 6144 | 11.878 | 43.11 | 26.545 | 4.82 |
+| 512 | 128 | 6656 | 12.168 | 42.08 | 24.220 | 5.28 |
+| 512 | 128 | 7168 | 12.605 | 40.62 | 24.069 | 5.32 |
+| 512 | 128 | 7680 | 12.843 | 39.87 | 24.390 | 5.25 |
+| 512 | 128 | 8192 | 13.228 | 38.71 | 23.570 | 5.43 |
+
+
+
+
+
+
+---
+
+👤 **ikawrakow** commented on **2025-06-17** at **10:32:11**
+
+> The IQ1_S_R4 still benches faster for this specific configuration at least and seems to be the same speed on both this PR and main as I would expect.
+
+This is because of the extremely high total_experts/active_experts=32 ratio in DeeSeek-V3. For u_batch size of 512 we are still far away from the regime where this new repacking scheme pays large dividends. Perhaps the gains will be bigger for `u_batch = 1024` or even `u_batch = 2048`?
+
+But yes, I see that this PR may not have the huge impact that it should because people have somehow decided that `ik_llama.cpp` is only good for very large MoE models, so they keep using `llama.cpp` for everything else, missing out big times on performance for CPU-only inference (and it isn't so that CPU performance is not discussed in the `llama.cpp` repository on a regular basis).
+
+---
+
+👤 **ubergarm** commented on **2025-06-17** at **16:35:26**
+
+> Perhaps the gains will be bigger for u_batch = 1024 or even u_batch = 2048?
+
+Yes, looks like even with the high ratio of deepseek MoE, this new repacking scheme begins to outstrip the `_r4` variants at high enough batch sizes on this CPU only test using same xeon 6980P as above:
+
+ ## PR531@72fd9faa DeepSeek-R1-0528-IQ1_S_R4 -b 4096 -ub 4096
+ | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+ |-------|--------|--------|----------|----------|----------|----------|
+ | 4096 | 1024 | 0 | 40.982 | 99.95 | 150.696 | 6.80 |
+ | 4096 | 1024 | 4096 | 52.413 | 78.15 | 189.641 | 5.40 |
+
+ ## PR531@72fd9faa DeepSeek-R1-0528-IQ1_S -b 4096 -ub 4096
+ | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+ |-------|--------|--------|----------|----------|----------|----------|
+ | 4096 | 1024 | 0 | 34.827 | 117.61 | 149.490 | 6.85 |
+ | 4096 | 1024 | 4096 | 49.865 | 82.14 | 180.852 | 5.66 |
+
+> missing out big times on performance for CPU-only inference
+
+I might try quanting this qwen2.5-72b finetune [moonshotai/Kimi-Dev-72B](https://huggingface.co/moonshotai/Kimi-Dev-72B) today. your recent improvements (and reading commit logs for `ik/iqk_gemm` on improving iq4/5_ks *even more*) will make 72B dense models much more usable for hybrid inferencing...
+
+honestly, the biggest hurdle to general adoption of this fork, imo, is the lack of a pre-compiled distributible binary e.g. [appimage](https://appimage.org/) format etc... my guess is the *majority* of possible end-users don't know how to `apt-get install cuda-toolkit`... i've been noodling on that challenge some at least for linux users, not sure on windows/macos...
+
+---
+
+👤 **saood06** commented on **2025-06-17** at **16:39:44**
+
+> > Perhaps the gains will be bigger for u_batch = 1024 or even u_batch = 2048?
+>
+> Yes, looks like even with the high ratio of deepseek MoE, this new repacking scheme begins to outstrip the `_r4` variants at high enough batch sizes on this CPU only test using same xeon 6980P
+
+I would be curious to the cutoff point. With something like `./bin/llama-bench [...] -p 32,64,128,256,512,1024,2048,3072,4096`
+
+---
+
+👤 **ikawrakow** commented on **2025-06-17** at **16:46:32**
+
+> I would be curious to the cutoff point. With something like ./bin/llama-bench [...] -p 32,64,128,256,512,1024,2048,3072,4096
+
+It is model and quantization type dependent. But I'm not removing the `_R4/_R8` quants, so everyone is free to do their performance evaluation and decide if to use this or go with the row-interleaved variant. For sure this is a big gain for people who don't want to get involved with repacking and all that stuff, but just want to run a mainline `llama.cpp` model they downloaded from HF or elsewhere. This also removes the need to worry about if the row-interleaved variant is supported on CUDA or not in case of hybrid inference.
+
+---
+
+👤 **saood06** commented on **2025-06-17** at **20:56:40**
+
+>For me running Deepseek at higher contexts PP and TG both approach ~1 t/s.
+
+I had been so used to V3 where I never enabled high batch sizes with amb because I rarely requested over the default batch size of 512. But with R1 that is not in the case (due to thought tokens removal which results in reprocessing context).
+
+I ran an experiment at high context, processing 4096 tokens (33640 to 37736) and this went from 2950 to 1619 seconds, and even a reduction in compute buffer (`15387.76 MiB` vs `9404.80 MiB`).
\ No newline at end of file
diff --git a/github-data/pull_requests/533 - Much faster CPU prompt processing _part 2_.md b/github-data/pull_requests/533 - Much faster CPU prompt processing part 2.md
similarity index 69%
rename from github-data/pull_requests/533 - Much faster CPU prompt processing _part 2_.md
rename to github-data/pull_requests/533 - Much faster CPU prompt processing part 2.md
index 81c5640fb..f09991163 100644
--- a/github-data/pull_requests/533 - Much faster CPU prompt processing _part 2_.md
+++ b/github-data/pull_requests/533 - Much faster CPU prompt processing part 2.md
@@ -1,16 +1,19 @@
-### 🔀 [#533](https://github.com/ikawrakow/ik_llama.cpp/pull/533) - Much faster CPU prompt processing (part 2)
+## 🔀 [Pull Request #533](https://github.com/ikawrakow/ik_llama.cpp/pull/533) - Much faster CPU prompt processing (part 2)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iqk_gemm` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-17 |
| **Updated** | 2025-06-18 |
+| **Merged** | 2025-06-18 |
---
-#### Description
+## 📄 Description
-This PR is a follow up of #531 and applies the technique to `IQK` quants.
+This PR is a follow up of [#531](https://github.com/ikawrakow/ik_llama.cpp/issues/531) and applies the technique to `IQK` quants.
Here is a PP-512 performance comparison for LlaMA-3.1-8B-Instruct on a Ryzen-7950X CPU between the main branch and this PR:
@@ -31,17 +34,25 @@ For a bit of history, when [PR 6414](https://github.com/ggml-org/llama.cpp/pull/
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-06-17** at **16:45:36**:
+👤 **ubergarm** commented on **2025-06-17** at **16:45:36**
-Thanks, this is huge. I feel like this will make ~70B dense models much better for hybrid inferencing on home rigs. Hope to try some quants soon!
+Thanks, this is huge. I feel like this will make ~70B dense models much better for hybrid inferencing on home rigs. Hope to try some quants soon!
+
+Also holy cow the `iqN_k` are basically as fast as the `iqN_ks`!
+
+---
+
+👤 **Vhallo** commented on **2025-06-17** at **16:50:04**
+
+Impressive work all around!
---
-👤 **Nexesenex** commented the **2025-06-17** at **18:31:50**:
+👤 **Nexesenex** commented on **2025-06-17** at **18:31:50**
Very impressive, @ikawrakow!
-All your recent commits motivates me to put more of IK_Llama on my Kobold.Cpp fork.
-I already have overall twice its CPU PP perfs thanks to your amazing work, and I merged most of your quants, including the last Trellis!
+All your recent commits motivate me to put more of IK_Llama on my Kobold.Cpp fork.
+I already have overall twice its mainline counterpart CPU PP perfs thanks to your amazing work, and I merged most of your quants, including the last Trellis!
Way to make an enthusiast happy!
\ No newline at end of file
diff --git a/github-data/pull_requests/534 - Much faster CPU prompt processing _part 3_.md b/github-data/pull_requests/534 - Much faster CPU prompt processing part 3.md
similarity index 65%
rename from github-data/pull_requests/534 - Much faster CPU prompt processing _part 3_.md
rename to github-data/pull_requests/534 - Much faster CPU prompt processing part 3.md
index dd6fa793c..69c63cfb3 100644
--- a/github-data/pull_requests/534 - Much faster CPU prompt processing _part 3_.md
+++ b/github-data/pull_requests/534 - Much faster CPU prompt processing part 3.md
@@ -1,16 +1,19 @@
-### 🔀 [#534](https://github.com/ikawrakow/ik_llama.cpp/pull/534) - Much faster CPU prompt processing (part 3)
+## 🔀 [Pull Request #534](https://github.com/ikawrakow/ik_llama.cpp/pull/534) - Much faster CPU prompt processing (part 3)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/legacy_gemm` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-18 |
| **Updated** | 2025-06-19 |
+| **Merged** | 2025-06-18 |
---
-#### Description
+## 📄 Description
-This PR is a follow up of #531 and #533, and adds much faster GEMM for the remaining non-interleaved quants: `Q2_K, IQ4_XS, IQ4_NL, Q4_0, Q4_1, Q5_0, Q5_1, Q6_0, Q8_0`.
+This PR is a follow up of [#531](https://github.com/ikawrakow/ik_llama.cpp/issues/531) and [#533](https://github.com/ikawrakow/ik_llama.cpp/issues/533), and adds much faster GEMM for the remaining non-interleaved quants: `Q2_K, IQ4_XS, IQ4_NL, Q4_0, Q4_1, Q5_0, Q5_1, Q6_0, Q8_0`.
Here is a PP-512 performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on a Ryzen-7950X CPU:
@@ -33,9 +36,9 @@ We observe gains in the range of 2X for all types. In case anyone is wondering w
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** submitted a review the **2025-06-18** at **13:46:15**: 💬 `COMMENTED`
+👤 **Nexesenex** reviewed this pull request 💬 on **2025-06-18** at **13:46:15**
`
float d = _mm_cvtss_f32(max4/127.f);
@@ -49,33 +52,19 @@ I compile with AVX2 and FMA enabled.
---
-👤 **Nexesenex** submitted a review the **2025-06-18** at **13:46:15**: 💬 `COMMENTED`
-
-`
- float d = _mm_cvtss_f32(max4/127.f);
-
-`
-This line (2077) in idk_gemm_kquants.cpp) provokes this error in MSVS 22 (Win 11) :
-
-binary '/': '__m128' does not define this operator or a conversion to a type acceptable to the predefined operator.
-
-I compile with AVX2 and FMA enabled.
-
----
-
-👤 **ikawrakow** commented the **2025-06-18** at **13:49:36**:
+👤 **ikawrakow** commented on **2025-06-18** at **13:49:36**
Should be fixed now.
---
-👤 **Nexesenex** commented the **2025-06-18** at **14:05:57**:
+👤 **Nexesenex** commented on **2025-06-18** at **14:05:57**
@ikawrakow : It is, thank you!
---
-👤 **ubergarm** commented the **2025-06-18** at **15:25:16**:
+👤 **ubergarm** commented on **2025-06-18** at **15:25:16**
This 3 part refresh on PP performance across so many quants is epic, appreciate your explaining the details in your PR notes.
@@ -103,7 +92,7 @@ No wonder many folks are choosing MoEs for hybrid inference over dense 72Bs. Moe
---
-👤 **ikawrakow** commented the **2025-06-18** at **16:07:25**:
+👤 **ikawrakow** commented on **2025-06-18** at **16:07:25**
> No wonder many folks are choosing MoEs for hybrid inference over dense 72Bs. Moe's fewer active weights during TG yield faster speeds with larger overall parameter size models.
@@ -119,15 +108,15 @@ Padding was discussed back in the day, but the idea was discarded. After all, it
---
-👤 **saood06** commented the **2025-06-18** at **16:41:02**:
+👤 **saood06** commented on **2025-06-18** at **16:41:02**
> TG performance of MoE models is far away from what is theoretically possible. If I look at your 6980P system, IIRC it has in the range of 512 GB/s memory bandwidth per node. So that, running DeepSeek on a single node because we haven't learnt how to do the NUMA thing effectively, and getting 10 t/s for 20 GB worth of active parameters means we are a factor of 2.5X away from what should be achievable.
-I do think now that we have the -ot, if the GGUF were changed to split up the experts and you launched it with `numactl --membind=[...] --cpunodebind=[...]`, that might help (due to NUMA aware, expert parallelism).
+I do think now that we have the -ot, if the GGUF were changed to split up the experts and you launched it with `numactl --membind=[...] --cpunodebind=[...]`, and used RPC that might help (due to NUMA aware, expert parallelism).
---
-👤 **ubergarm** commented the **2025-06-18** at **23:51:50**:
+👤 **ubergarm** commented on **2025-06-18** at **23:51:50**
@ikawrakow
@@ -135,27 +124,30 @@ Always appreciate your insights, and these new prompt processing numbers are loo
> I was hoping that one can get that with a 70B dense model as well (on a higher bandwidth system).
-I ran `sweep-bench` for a few of my ~4 BPW 72B Dense models shown in the graph above on two big remote rigs compiled CPU-only. I was kinda surprised by the results.
+I ran `sweep-bench` for a few of my ~4 BPW 72B Dense models shown in the graph above on three rigs compiled CPU-only. I was kinda surprised by the results.
-
+
-My impression is that the big 6980P CPU is not saturating the expected ~512GB socket RAM bandwidth during generation. As you mentioned it could hit theoretically ~15 tok/sec (512 GB bandwidth / 32GB model size = 16 tok/sec). While the 24x Core 7965WX Thread Ripper Pro is doing better, it too has 4x CCDs configured as a single NUMA node via NPS1.
+My impression is that the big 6980P CPU is not saturating the expected ~512GB socket RAM bandwidth during generation. As you mentioned it could hit theoretically ~15 tok/sec (512 GB bandwidth / 32GB model size = 16 tok/sec).
I spot checked using 80 and 64 threads for TG on the Intel Xeon 6980P, but less threads led to slower generation for this benchmark. Perhaps because its 3x CCDs are configured as a single NUMA node via BIOS config `SNC=Disable`. Though probably won't be able to reboot it to try, though the model *would fit* in the 256GB RAM if configured as one NUMA node per CCD.
+While the 24x Core 7965WX Thread Ripper Pro is doing better, it has 4x CCDs configured as a single NUMA node via NPS1 which could possibly be causing a hit to TG performance.
+
Assuming the benchmarked ~512GB/s RAM bandwidth on the 6980P and let's call it ~256 GB/s on the Thread Ripper Pro are accurate, the potential token generation breakdown looks like this:
| Rig | Model | Theoretical | Measured | Yield |
| --- | --- | --- | --- | --- |
-| | | tok/sec | tok/sec | % |
+| | | TG tok/sec | TG tok/sec | % |
| 6980P | Q4_0 | 13.4 | 5.47 | 40.8% |
| " | smol-IQ3_K | 15.9 | 6.05 | 38.1% |
| " | IQ3_KT | 16.8 | 3.76 | 22.4% |
| 7965WX | Q4_0 | 6.7 | 4.74 | 70.7% |
| " | smol-IQ3_K | 7.9 | 5.61 | 71.0% |
| " | IQ3_KT | 8.4 | 3.06 | 36.4% |
+| 9950X | smol-IQ3_K | 2.70 | 2.50 | 92.6% |
-I want to like the ~70B dense models, but man they are difficult to get good TG without offloading the whole thing to VRAM... I could try my home AMD 9950X given it would fit, even with lower absolute TG speeds it could be more "efficient" given native single NUMA node...
+I want to like the ~70B dense models, but man they are difficult to get good TG without offloading the whole thing to VRAM... I could try my home AMD 9950X given it would fit, even with lower absolute TG speeds it could be more "efficient" given native single NUMA node... *EDIT* I ran one on my home 9950X benching ~87GB/s with (overclocked inifinity fabric at "gear 1" ratios) and updated the graph and table above.
@@ -245,11 +237,27 @@ numactl -N 0 -m 0 \
| 2048 | 512 | 2048 | 47.600 | 43.03 | 169.435 | 3.02 |
| 2048 | 512 | 4096 | 50.176 | 40.82 | 172.420 | 2.97 |
+## 9950X smol-IQ3_K -t 16
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 2048 | 512 | 0 | 42.857 | 47.79 | 204.729 | 2.50 |
+| 2048 | 512 | 2048 | 45.211 | 45.30 | 208.152 | 2.46 |
+| 2048 | 512 | 4096 | 47.570 | 43.05 | 211.695 | 2.42 |
+
+## 9950X smol-IQ3_K -t 16 -ngl 48 (NOT GRAPHED, JUST FOR FUNZIES)
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 2048 | 512 | 0 | 3.925 | 521.77 | 103.624 | 4.94 |
+| 2048 | 512 | 2048 | 4.058 | 504.63 | 105.265 | 4.86 |
+
+I've uploaded the [smol-IQ3_K to hugginface here](https://huggingface.co/ubergarm/Kimi-Dev-72B-GGUF).
+
+---
> Padding was discussed back in the day
-I was checking how [bullerwins dealt with the goofy dimensions ffn_down.](https://huggingface.co/bullerwins/Kimi-Dev-72B-GGUF/discussions/1#6852fc43cd6b6db96eb0980e). Given they use `Q8_0` I was surprised to hear the log mentioned padding:
+I was checking how [bullerwins dealt with the goofy dimensions ffn_down.](https://huggingface.co/bullerwins/Kimi-Dev-72B-GGUF/discussions/1#6852fc43cd6b6db96eb0980e). Given they use `Q8_0` I was surprised to hear their mainline llama-quantize log mentioned padding:
```
29 568 / 256 = 115 full blocks (115 × 256 = 29 440)
@@ -258,13 +266,15 @@ remainder 128 elements (padded to 256)
I didn't look into it further, and used `IQ4_NL` for the above test quants which is a reasonable size for these quants.
+---
+
> Maybe I should change them before trellis models become available?
-Right, related to the `iqN_kt` quants merged in [PR529](https://github.com/ikawrakow/ik_llama.cpp/pull/529), I haven't released anything yet. Going through the trouble to make block size 32 might not be worth it unless cursed sized tensor column dimensions becomes more prevalent as `iq4_nl` seems pretty solid. Not sure how changing the block size would effect TG performance as well.
+Right, related to the `iqN_kt` quants merged in [PR529](https://github.com/ikawrakow/ik_llama.cpp/pull/529), I haven't released anything yet. Going through the trouble to make block size 32 might not be worth it? Unless those cursed dimension tensors becomes more prevalent... `iq4_nl` seems like a pretty solid choice for many ~4bpw quants. - Though I'm not sure how changing the block size would effect TG performance as well?
-The PP performance on the `iqN_kt` quants is amazing, about the highest despite being on the [B Tier Q8_0_R8 mul_mat list](https://github.com/ikawrakow/ik_llama.cpp/pull/495#issuecomment-2985633815), but I noticed the TG performance is lagging behind the other quants which I assume is to extra CPU overhead dealing with them?
+The PP performance on the `iqN_kt` quants is amazing, about the highest despite being on the [B Tier Q8_0_R8 mul_mat list](https://github.com/ikawrakow/ik_llama.cpp/pull/495#issuecomment-2985633815)... I noticed that the TG performance is lagging behind the other quants which I assume is to extra CPU overhead dealing with them?
-Same with DeepSeek-R1-0528 which I run here offloading `-ngl 99 -ot exps=CPU` plus 16 more layers on dual RTX A6000 GPU (to not OOM RAM), on the Thread Ripper Pro, 24 core, default batch sizes:
+Another similar benchmark as above, but now for DeepSeek-R1-0528 MoE. I run here offloading the same number of layers on GPUs to not OOM RAM. This is just the Thread Ripper Pro, 24 core, default batch sizes:
#### IQ3_KS_R4 300.938 GiB (3.847 BPW)
* 12.39 tok/sec TG
@@ -284,7 +294,7 @@ Same with DeepSeek-R1-0528 which I run here offloading `-ngl 99 -ot exps=CPU` pl
-llama-sweep-bench details and data
+👈 llama-sweep-bench details and data
Ignore the PP given this was low batch sizes so not a good comparison.
@@ -354,12 +364,26 @@ model=/mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ3_KT/DeepSeek-R1-0528-IQ3_KT-00001-of
So given DeepSeek-R1-671B has active 37B during generation and the theoretical max bandwidth on the 256GB/s Thread Ripper Pro we can use the calculate the GiB of the active parameters and get theoretical max TG as above.
-`256 / (37 * (BPW/8)`
+`256 / ( 37 * (BPW/8) )`
+
+_but need to account for GPU offload of 1 shared expert, 3 dense layers, and first 16 routed exps layers leaving ~30B active on CPU/RAM_
+
+`256 / ( (37 * 256/257 - 1.189 - 16 * 0.3523) * (BPW/8) )`
+
+Then, assuming any of this is close, the "Yield" is fairly close to the the dense model above. The `kt` mix here is a bit different than in the dense above.
| Rig | Model | Theoretical | Measured | Yield |
| --- | --- | --- | --- | --- |
-| | | tok/sec | tok/sec | % |
-| 7965WX | IQ3_KS_R4 | 14.4 | 12.4 | 86.0% |
-| " | IQ3_KT | 15.9 | 8.6 | 54.1% |
+| | | TG tok/sec | TG tok/sec | % |
+| 7965WX | IQ3_KS_R4 | 17.7 | 12.4 | 70.1% |
+| " | IQ3_KT | 19.6 | 8.6 | 43.9% |
+
+Thanks again for these great PP speed-ups and your time and patience with my long ass posts haha.. I gotta eat some dinner now, cheers!
+
+---
+
+👤 **ikawrakow** commented on **2025-06-19** at **07:12:04**
+
+> The PP performance on the iqN_kt quants is amazing, about the highest despite being on the https://github.com/ikawrakow/ik_llama.cpp/pull/495#issuecomment-2985633815... I noticed that the TG performance is lagging behind the other quants which I assume is to extra CPU overhead dealing with them?
-Thanks again for these great PP speed-ups and your time and patience with these long posts haha.. I gotta eat some dinner now, cheers!
\ No newline at end of file
+Yes, the `iqN_kt` quants are slower for TG. Generating the trellis sequence is extremely expensive on the CPU. That's why [#113](https://github.com/ikawrakow/ik_llama.cpp/issues/113) sat there for so long not merged. With the recently discovered trick to first unpack to some 8-bit variant and then do the matrix multiplication, the very high trellis sequence cost is amortized when doing prompt processing (each unpacked quant is used many times to multiply-add quants in the activation matrix). But for TG there is no way to speed it up as each quants is used exactly once to multiply-add one quant in the right matrix. Based on your performance values, it seems AMD Zen4/5 cores are doing much better than the Intel 6980P cores (per core). Generating the trellis sequence involves a 32-bit integer multiplication. If we look at [Intel's AVX2 reference](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=2949,4683,4765,1762,110,4148,7198,6635,6100,2976,2957,2976,1682,3125,6483,5847,5879,5870,6088,6141,5860,4671,4674,4683,4145,4066,5911,5914,6137,4145,4759,4680,1762,4270,4234,6100,6100,4234,4795,110,6633,4795,6635,4683,4792,4765,4795,1875,6097,5894,4674,696,6144,6144,6100,6162,1759,1872,4795,4674,4804,6144,696,692,4270,4234,4804,3125,6097,4804,6091,6097,5894,4795,4234,4892,4915,6091,4795,4804,4795,4795,4804,4813,4270,4234,4795,6138,6097,4234,4804,4234,4270,4234,6091,3733,4146,6100,1771,3125,5961,2073,3693,3694,2073,4394,6720,2822,5181,2822,5287,2822,6225,6234,6453,870,6578,6635,3733,1762,3733,3786,6287,6159,6097,4880,6097,125,6282,6633,4880,6100,6091,5849,708,5168,6091,6162,1762,6003,1664,6043,4041,5791,6003,4200,6003,1857,4200,6003,1857,913,913,6551,4111,107,308,6186,6140,460,6196,110,92,3707,4602,4296,6603,4044,987,4843,1866,6440,6408,6440,6068,6003,460,488,4860,3869,696,2965,4239,4110,6009,4013,2055,1604,4013,6482,6050,3667,4675,6047,482,491,6047,6196,4863,3869,2705,2722,4863,5741,1747,1610,882,3710,4860,6053,6050,6047,6006,6003,6015,465,465,6143,6189,465,6189,2722,4796,2959,96,96,6189,93,466,7013,4863,6440,5998,6027,6033,1583,2052,4920,6196,4848,5014,866,882,285,874,285,6009,6007,4763,3667,4200,6402,3869,5002,4990,5002,6071,6000,6000,5303,3850,690,344,4860,4857,6188,6156,4986,6182,4857,487,487,463,5997,7012,463,5994,4857,4929,4883,4986,7030,7003,7021,6202,6193,5994,5997,6202,5994,5997,6006,882,114,6009,882,473,5739,6604,1869,93,84,5129,1634,5111,5998,4986,4934,4934,5065,6068,4236,4200,92,2701,2731,4860,2677,2675,2703,5896,4201,3843,2055,5303,140,3864,3864,3843,3843,4633,4621,4542,4958,5824,4934,4934,5303,5752,5752,2688,4860,1664,3107,2055,4236,3669,3669,93,4986,4883,7024,6958,7015,7006,7009,4863,4863,4863,95,2722,6030,5997,5995,2816,2817,5151,4763,3707,3731,2946,3111,5823,5824,3113,3113,3663,344,6075,4772,458,2688,6592,6603,6601,6601,5053,2052,6047,6050,6050,2583,5464,5464,3827,2591,4687,4753,4793,4763,2939,4772,4763,4642,2055,823,337,3111,6050,681,2943,3937,5704,3381,5137,3932,3111,1750,1744,486,4361,4688,828,4361,4351,144,486,4688,4689,3107,2055,6053,3243,818,856,486,2059,2059,4688,2113,4863,2113,5605,5605,5576,5614,4690,5614,2072,1640,4863,4854,6009,1640,6009,1756,6050,6012,6050,6039,2052,3850,6053,4363,5346,143,3111,1640,828,828,4688,858,856,486,3235,1580,4236,4200,4930,6588,1622,1744,1753,486,4688,337,4920,679,6047,6050,3842,2692,473,6604,113,7018,7018,6961,4990,5002,5062,5056,4990,5086,5002,5002,4960,4960,1129,6604,6595,4990,7018,54,7054,7051,2936,7000,2943,3862,5822,4863,4591,4587,4597,3764,1898,2722,3111,2671,2692,6446,6211,6449,2110,485,483,5041,5041,782,6604,2158,4597,6053,4928,2058,2058,6050,1771,2055,6068,6024,5997,6065,1771,6173,6050,4928,5750,2938,3107,6050,2938,5826,4860,856,4685,4908,4908,5114,2705,4236,6024,5997,4880,4883,6050,7024,4883,4930,4772,4775,7054,6050,4931,4883,306,6050,2110,626,1895,6949,7006,6068,6059,6006,6050,1919,5086,5086,1919,5303,2668,3850,5041,2688,4860,2688,2688,4860,4851,4860,2110,2069,4954,5037,4949,4953,6006,484,6050,4963,4932,4928,1619,2689,2688,5068,5010,4998,5010,1637,4860,4851,4860,4953,5752,2110,7015,2110,4953,7015,4930,4953,4986,6050,7051,3239,4860,7021,4848,3862,4848,7051,7015,782,6604,782,4851,4851,4772,4772&text=_mm256_mullo_epi32), it shows a 10 cycles latency for this instruction! So, I guess, AMD have done slightly better here.
\ No newline at end of file
diff --git a/github-data/pull_requests/535 - Minor readme update.md b/github-data/pull_requests/535 - Minor readme update.md
index 23129c843..44a1845a3 100644
--- a/github-data/pull_requests/535 - Minor readme update.md
+++ b/github-data/pull_requests/535 - Minor readme update.md
@@ -1,19 +1,22 @@
-### 🔀 [#535](https://github.com/ikawrakow/ik_llama.cpp/pull/535) - Minor readme update
+## 🔀 [Pull Request #535](https://github.com/ikawrakow/ik_llama.cpp/pull/535) - Minor readme update
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/readme-minor1` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-18 |
| **Updated** | 2025-06-19 |
+| **Merged** | 2025-06-19 |
---
-#### Description
+## 📄 Description
This I think cleans things up, and also takes up less space.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-06-19** at **05:39:09**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-19** at **05:39:09**
\ No newline at end of file
diff --git a/github-data/pull_requests/536 - Fix KT Neon _ ARM typo.md b/github-data/pull_requests/536 - Fix KT Neon ARM typo.md
similarity index 61%
rename from github-data/pull_requests/536 - Fix KT Neon _ ARM typo.md
rename to github-data/pull_requests/536 - Fix KT Neon ARM typo.md
index ed1a744d2..793c11cc5 100644
--- a/github-data/pull_requests/536 - Fix KT Neon _ ARM typo.md
+++ b/github-data/pull_requests/536 - Fix KT Neon ARM typo.md
@@ -1,14 +1,17 @@
-### 🐛 [#536](https://github.com/ikawrakow/ik_llama.cpp/pull/536) - Fix KT Neon / ARM typo
+## 🔀 [Pull Request #536](https://github.com/ikawrakow/ik_llama.cpp/pull/536) - Fix KT Neon / ARM typo
| **Author** | `louiehelm` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `main` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-18 |
| **Updated** | 2025-06-18 |
+| **Merged** | 2025-06-18 |
---
-#### Description
+## 📄 Description
Removes errant ";" in front of 0xCBAC1FED in non-x86 code
@@ -23,18 +26,18 @@ error: expected unqualified-id before numeric constant
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-06-18** at **16:53:19**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-18** at **16:53:19**
---
-👤 **ikawrakow** commented the **2025-06-18** at **16:54:57**:
+👤 **ikawrakow** commented on **2025-06-18** at **16:54:57**
Thank you for this. Are you using an ARM CPU? I haven't checked if it works there.
---
-👤 **louiehelm** commented the **2025-06-18** at **17:05:31**:
+👤 **louiehelm** commented on **2025-06-18** at **17:05:31**
No I don't have ARM CPU unfortunately. Just cross-compiled to see if all code paths would build then fixed that line so it could at least compile. Ready for someone who actually has ARM to test it now.
\ No newline at end of file
diff --git a/github-data/pull_requests/537 - Update CMakeLists.txt to fix NDEBUG handling.md b/github-data/pull_requests/537 - Update CMakeLists.txt to fix NDEBUG handling.md
index 989a4938c..fc2a5ae23 100644
--- a/github-data/pull_requests/537 - Update CMakeLists.txt to fix NDEBUG handling.md
+++ b/github-data/pull_requests/537 - Update CMakeLists.txt to fix NDEBUG handling.md
@@ -1,14 +1,17 @@
-### 🐛 [#537](https://github.com/ikawrakow/ik_llama.cpp/pull/537) - Update CMakeLists.txt to fix NDEBUG handling
+## 🔀 [Pull Request #537](https://github.com/ikawrakow/ik_llama.cpp/pull/537) - Update CMakeLists.txt to fix NDEBUG handling
| **Author** | `iSevenDays` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `patch-1` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-18 |
| **Updated** | 2025-06-19 |
+| **Merged** | 2025-06-19 |
---
-#### Description
+## 📄 Description
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
@@ -33,14 +36,14 @@ after my change to CMakeLists.txt
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-06-19** at **07:18:05**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-06-19** at **07:18:05**
So, in the latest tool chains someone decided that the `NDEBUG` is not set when making a release build? Contrary to the established practice of the last 30 years?
---
-👤 **iSevenDays** commented the **2025-06-19** at **07:32:42**:
+👤 **iSevenDays** commented on **2025-06-19** at **07:32:42**
Yes, thanks for merging the fix quickly :)
\ No newline at end of file
diff --git a/github-data/pull_requests/54 - Improve Q4_0 and Q8_0 performance on AVX2_Zen4.md b/github-data/pull_requests/54 - Improve Q4_0 and Q8_0 performance on AVX2Zen4.md
similarity index 83%
rename from github-data/pull_requests/54 - Improve Q4_0 and Q8_0 performance on AVX2_Zen4.md
rename to github-data/pull_requests/54 - Improve Q4_0 and Q8_0 performance on AVX2Zen4.md
index d7496fa50..30f3882ba 100644
--- a/github-data/pull_requests/54 - Improve Q4_0 and Q8_0 performance on AVX2_Zen4.md
+++ b/github-data/pull_requests/54 - Improve Q4_0 and Q8_0 performance on AVX2Zen4.md
@@ -1,14 +1,17 @@
-### 🔀 [#54](https://github.com/ikawrakow/ik_llama.cpp/pull/54) - Improve Q4_0 and Q8_0 performance on AVX2/Zen4
+## 🔀 [Pull Request #54](https://github.com/ikawrakow/ik_llama.cpp/pull/54) - Improve Q4_0 and Q8_0 performance on AVX2/Zen4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/avx2_q4_0_q8_0` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-14 |
| **Updated** | 2024-09-14 |
+| **Merged** | 2024-09-14 |
---
-#### Description
+## 📄 Description
This PR improves `Q4_0` and `Q8_0` performance on `AVX2` and `Zen4`. The table shows comparisons to `llama.cpp` for LLaMA-3.1-8B on a Ryzen-7950X (Zen4) and a Ryzen-5975WX (AVX2) CPU.
diff --git a/github-data/pull_requests/540 - Fix missed block_q8_x2 bf16 -_ i16 change.md b/github-data/pull_requests/540 - Fix missed block_q8_x2 bf16 - i16 change.md
similarity index 59%
rename from github-data/pull_requests/540 - Fix missed block_q8_x2 bf16 -_ i16 change.md
rename to github-data/pull_requests/540 - Fix missed block_q8_x2 bf16 - i16 change.md
index 79a2d49f5..d6c6f14ef 100644
--- a/github-data/pull_requests/540 - Fix missed block_q8_x2 bf16 -_ i16 change.md
+++ b/github-data/pull_requests/540 - Fix missed block_q8_x2 bf16 - i16 change.md
@@ -1,16 +1,19 @@
-### 🐛 [#540](https://github.com/ikawrakow/ik_llama.cpp/pull/540) - Fix missed block_q8_x2 bf16 -> i16 change
+## 🔀 [Pull Request #540](https://github.com/ikawrakow/ik_llama.cpp/pull/540) - Fix missed block_q8_x2 bf16 -> i16 change
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_538` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-19 |
| **Updated** | 2025-06-19 |
+| **Merged** | 2025-06-19 |
---
-#### Description
+## 📄 Description
-See #538
+See [#538](https://github.com/ikawrakow/ik_llama.cpp/issues/538)
The story behind this bug:
@@ -18,4 +21,4 @@ Many years ago, the committee designing the `AVX` instruction set decided to use
1. Remove the signs of the left operand and apply the same signs to the right operand
2. Add a constant to the left operand such that it becomes unsigned. Undo the applied constant by subtracting the constant times the sum of the quants in the right operand
-Option 2 is faster, but cannot be used on `AVX2` when the quants span the full `int8_t` range as the dot product produces a SIMD vector with `int16_t` values containing the sum of pairs, and that can overflow (e.g., 255*127 + 255*127). But on `AVX512` the dot product sums 4 products into an `int32_t` avoiding overflow in intermediate results, so we use the faster option 2. For this we have the `Q8_1` type, which contains the block scale and the sum of the quants in the block times the block scale as `fp16`. This worked fine until DeepSeek came along, and we started getting NaNs because the sum was occasionally overflowing the `fp16` range. We then switched to using `Q8_2`, which is the same `Q8_1`, except that block scale and sum are stored as `bf16`, which resolved the NaNs with DeepSeek. But when working on PR #534, I noticed that PPL for `Q4_0` became significantly higher, and that was due to not enough precision in the `bf16` block sum. So, I changed again to have the block sum stored as `int16_t` (which is exact), and then converted to `fp32` at run time. I thought I did adapt all places where `Q8_2` or `Q8_2_X4` is used, but no, I missed one place in the tail of the `Q8_0_R8 x Q8_2_X4` dot product. In that product we go over groups of 4 blocks of 32 quants, and then have a tail handling the leftover. In the vast majority of cases there are no leftovers, but in the DeepSeek FlashMLA, we run into this forgotten corner. The PR fixes that.
\ No newline at end of file
+Option 2 is faster, but cannot be used on `AVX2` when the quants span the full `int8_t` range as the dot product produces a SIMD vector with `int16_t` values containing the sum of pairs, and that can overflow (e.g., 255*127 + 255*127). But on `AVX512` the dot product sums 4 products into an `int32_t` avoiding overflow in intermediate results, so we use the faster option 2. For this we have the `Q8_1` type, which contains the block scale and the sum of the quants in the block times the block scale as `fp16`. This worked fine until DeepSeek came along, and we started getting NaNs because the sum was occasionally overflowing the `fp16` range. We then switched to using `Q8_2`, which is the same `Q8_1`, except that block scale and sum are stored as `bf16`, which resolved the NaNs with DeepSeek. But when working on PR [#534](https://github.com/ikawrakow/ik_llama.cpp/issues/534), I noticed that PPL for `Q4_0` became significantly higher, and that was due to not enough precision in the `bf16` block sum. So, I changed again to have the block sum stored as `int16_t` (which is exact), and then converted to `fp32` at run time. I thought I did adapt all places where `Q8_2` or `Q8_2_X4` is used, but no, I missed one place in the tail of the `Q8_0_R8 x Q8_2_X4` dot product. In that product we go over groups of 4 blocks of 32 quants, and then have a tail handling the leftover. In the vast majority of cases there are no leftovers, but in the DeepSeek FlashMLA, we run into this forgotten corner. The PR fixes that.
\ No newline at end of file
diff --git a/github-data/pull_requests/541 - Perhaps slightly faster trellis quants.md b/github-data/pull_requests/541 - Perhaps slightly faster trellis quants.md
index 42bcd188d..5ddb498a3 100644
--- a/github-data/pull_requests/541 - Perhaps slightly faster trellis quants.md
+++ b/github-data/pull_requests/541 - Perhaps slightly faster trellis quants.md
@@ -1,14 +1,17 @@
-### 🔀 [#541](https://github.com/ikawrakow/ik_llama.cpp/pull/541) - Perhaps slightly faster trellis quants
+## 🔀 [Pull Request #541](https://github.com/ikawrakow/ik_llama.cpp/pull/541) - Perhaps slightly faster trellis quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/trellis_opt` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-19 |
| **Updated** | 2025-06-21 |
+| **Merged** | 2025-06-21 |
---
-#### Description
+## 📄 Description
The PR adds some optimizations to the GEMV implementation of the `IQ2_KT, IQ3_KT, IQ4_KT` quants.
@@ -44,9 +47,9 @@ In your performance testing on the 6980P system `iqX_kt` quants were very far fr
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-06-19** at **20:04:52**:
+👤 **ubergarm** commented on **2025-06-19** at **20:04:52**
My usual library spot was closed today so sitting outside in the sun trying to grab some quick llama-sweep-bench numbers:
@@ -110,7 +113,21 @@ I'll try to get some numbers on the big 6980P pure-CPU soon!
---
-👤 **ubergarm** commented the **2025-06-20** at **02:04:52**:
+👤 **ubergarm** commented on **2025-06-19** at **20:12:52**
+
+Super quick, on the 6980P, using my [numbers from yesterday](https://github.com/ikawrakow/ik_llama.cpp/pull/534#issuecomment-2986064811) for the `Kimi-Dev-72B-IQ3_KT` (which has to use `iq4_nl` for `ffn_down` as previously discussed). CPU-only.
+
+* Before: 3.76 TG tok/sec
+* PR541: 5.28 TG tok/sec
+* 1.404x Speed-Up
+
+I have to juggle files to get that R1-0528-IQ3_KT onto the big rig, and will give more results when I find some time.
+
+tl;dr; Definitely looking better already! Great job!
+
+---
+
+👤 **ubergarm** commented on **2025-06-20** at **02:04:52**
Okay, back at a desk with my laptop for a little while. Here is a quick comparison for a mixed R1-0528-IQ3_KT quant.
@@ -128,7 +145,7 @@ Okay, back at a desk with my laptop for a little while. Here is a quick comparis
| TG tok/sec `main@144ee1c4` | TG tok/sec `PR541@93209939` | speed-up |
| --- | --- | --- |
-| 6.29 | 8.22 | 1.309x |
+| 6.28 | 8.22 | 1.309x |
Given not every tensor is `kt` type, actual speed-ups are likely higher. I don't have a good set of pure `kt`'s to test easily like you did above, but my limited testing suggests a big improvement in TG for all three `kt` quant types in both MoE and dense models.
@@ -231,7 +248,7 @@ Thanks!
---
-👤 **Nexesenex** commented the **2025-06-20** at **02:48:36**:
+👤 **Nexesenex** commented on **2025-06-20** at **02:48:36**
Confirmed for me for IQ3_KT.
@@ -243,13 +260,13 @@ llama_model_loader: - type iq4_ks_r4: 32 tensors (attn_k)
llama_model_loader: - type iq5_ks_r4: 33 tensors (attn_v, output)
Before patch : TG 3.27 t/s.
-After patch : TG 4.79 t/s.
+After patch (commit 9320993927bb1dcf7c8a22ed79e2a06ed8578ba4) : TG 4.79 t/s.
-Rig : Ryzen 5700G, AVX2, 4*8GB DDR4 2666mhz.
+Rig : Ryzen 5700G, AVX2, 4*8GB DDR4 2666mhz, on 8 threads, BBS 128, prompt 1024 tokens, then 100 tokens generated.
---
-👤 **ikawrakow** commented the **2025-06-20** at **04:44:01**:
+👤 **ikawrakow** commented on **2025-06-20** at **04:44:01**
Thank you for testing!
@@ -259,7 +276,174 @@ This is not supposed to happen. It is a mixture of experts, so the new path can
---
-👤 **ubergarm** commented the **2025-06-20** at **21:43:10**:
+👤 **ubergarm** commented on **2025-06-20** at **18:56:25**
+
+## tl;dr;
+Okay, just tried out the latest commit. Looks like PP is stable as compared to main now, while TG is 1.6x faster running CPU-only on the Intel Xeon 6980P! I'm aslo running some perplexity comparisons between CUDA and CPU implementation on the 24 core thread ripper pro and will check in later when that is done.
+
+## Details
+This time I made a set of 3 "pure" `iqN_kt` quants for 2,3,4 sizes. Using Qwen3-14B as the base model as it has better tensor dimensions than Qwen2.5-72B.
+
+## Quant Collection
+All "pure" except for token_embd is q4_K and the final output is q6_K.
+
+* IQ2_KT 4.280 GiB (2.489 BPW)
+* IQ3_KT 5.818 GiB (3.384 BPW
+* IQ4_KT 7.164 GiB (4.167 BPW)
+
+## sweep-bench
+
+
+
+
+
+👈 sweep-bench command and data
+
+```bash
+numactl -N 0 -m 0 \
+ ./build/bin/llama-sweep-bench \
+ --model "$model" \
+ --ctx-size 8704 \
+ -ctk q8_0 -ctv q8_0 \
+ -fa \
+ --no-mmap \
+ --warmup-batch \
+ --threads 128 \
+ --threads-batch 128 \
+ --numa numactl
+```
+
+## IQ4_KT PR541@5b677c3c
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.956 | 535.28 | 5.203 | 24.60 |
+| 512 | 128 | 512 | 0.976 | 524.84 | 5.416 | 23.63 |
+| 512 | 128 | 1024 | 0.993 | 515.57 | 5.473 | 23.39 |
+| 512 | 128 | 1536 | 1.013 | 505.38 | 5.579 | 22.94 |
+| 512 | 128 | 2048 | 1.030 | 497.30 | 5.555 | 23.04 |
+| 512 | 128 | 2560 | 1.053 | 486.34 | 5.522 | 23.18 |
+| 512 | 128 | 3072 | 1.155 | 443.24 | 5.634 | 22.72 |
+| 512 | 128 | 3584 | 1.090 | 469.56 | 5.771 | 22.18 |
+| 512 | 128 | 4096 | 1.111 | 460.73 | 5.581 | 22.94 |
+| 512 | 128 | 4608 | 1.124 | 455.60 | 5.553 | 23.05 |
+| 512 | 128 | 5120 | 1.145 | 447.36 | 5.566 | 23.00 |
+| 512 | 128 | 5632 | 1.166 | 439.09 | 5.580 | 22.94 |
+| 512 | 128 | 6144 | 1.184 | 432.30 | 5.525 | 23.17 |
+| 512 | 128 | 6656 | 1.204 | 425.35 | 5.559 | 23.03 |
+| 512 | 128 | 7168 | 1.257 | 407.20 | 5.597 | 22.87 |
+| 512 | 128 | 7680 | 1.245 | 411.30 | 5.886 | 21.75 |
+| 512 | 128 | 8192 | 1.263 | 405.33 | 5.660 | 22.61 |
+
+## IQ3_KT PR541@5b677c3c
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.114 | 459.50 | 5.190 | 24.67 |
+| 512 | 128 | 512 | 0.965 | 530.66 | 5.094 | 25.13 |
+| 512 | 128 | 1024 | 0.984 | 520.16 | 5.087 | 25.16 |
+| 512 | 128 | 1536 | 0.996 | 513.97 | 5.132 | 24.94 |
+| 512 | 128 | 2048 | 1.020 | 501.90 | 5.086 | 25.17 |
+| 512 | 128 | 2560 | 1.038 | 493.21 | 5.126 | 24.97 |
+| 512 | 128 | 3072 | 1.215 | 421.35 | 5.160 | 24.81 |
+| 512 | 128 | 3584 | 1.073 | 477.14 | 5.221 | 24.52 |
+| 512 | 128 | 4096 | 1.100 | 465.40 | 5.167 | 24.77 |
+| 512 | 128 | 4608 | 1.118 | 458.03 | 5.210 | 24.57 |
+| 512 | 128 | 5120 | 1.133 | 452.07 | 5.251 | 24.37 |
+| 512 | 128 | 5632 | 1.156 | 442.79 | 5.295 | 24.17 |
+| 512 | 128 | 6144 | 1.174 | 436.15 | 5.320 | 24.06 |
+| 512 | 128 | 6656 | 1.193 | 429.09 | 5.376 | 23.81 |
+| 512 | 128 | 7168 | 1.211 | 422.78 | 5.419 | 23.62 |
+| 512 | 128 | 7680 | 1.327 | 385.81 | 5.489 | 23.32 |
+| 512 | 128 | 8192 | 1.248 | 410.40 | 5.555 | 23.04 |
+
+## IQ2_KT PR541@5b677c3c
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.910 | 562.52 | 4.992 | 25.64 |
+| 512 | 128 | 512 | 0.924 | 554.02 | 4.803 | 26.65 |
+| 512 | 128 | 1024 | 0.960 | 533.18 | 4.805 | 26.64 |
+| 512 | 128 | 1536 | 0.979 | 522.87 | 4.876 | 26.25 |
+| 512 | 128 | 2048 | 1.011 | 506.41 | 4.835 | 26.48 |
+| 512 | 128 | 2560 | 1.031 | 496.38 | 4.887 | 26.19 |
+| 512 | 128 | 3072 | 1.056 | 485.07 | 4.923 | 26.00 |
+| 512 | 128 | 3584 | 1.076 | 475.90 | 5.002 | 25.59 |
+| 512 | 128 | 4096 | 1.099 | 465.67 | 4.951 | 25.85 |
+| 512 | 128 | 4608 | 1.113 | 459.83 | 4.994 | 25.63 |
+| 512 | 128 | 5120 | 1.129 | 453.48 | 5.021 | 25.49 |
+| 512 | 128 | 5632 | 1.150 | 445.27 | 5.147 | 24.87 |
+| 512 | 128 | 6144 | 1.165 | 439.67 | 5.285 | 24.22 |
+| 512 | 128 | 6656 | 1.195 | 428.48 | 5.409 | 23.66 |
+| 512 | 128 | 7168 | 1.203 | 425.53 | 5.511 | 23.23 |
+| 512 | 128 | 7680 | 1.222 | 418.83 | 5.583 | 22.93 |
+| 512 | 128 | 8192 | 1.245 | 411.24 | 5.537 | 23.12 |
+
+## IQ4_KT main@1843ed22
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.953 | 537.33 | 9.251 | 13.84 |
+| 512 | 128 | 512 | 0.971 | 527.35 | 9.055 | 14.14 |
+| 512 | 128 | 1024 | 0.991 | 516.84 | 9.119 | 14.04 |
+| 512 | 128 | 1536 | 1.009 | 507.53 | 9.188 | 13.93 |
+| 512 | 128 | 2048 | 1.032 | 496.11 | 9.195 | 13.92 |
+| 512 | 128 | 2560 | 1.053 | 486.43 | 9.248 | 13.84 |
+| 512 | 128 | 3072 | 1.074 | 476.86 | 9.306 | 13.75 |
+| 512 | 128 | 3584 | 1.092 | 468.86 | 9.362 | 13.67 |
+| 512 | 128 | 4096 | 1.116 | 458.73 | 9.320 | 13.73 |
+| 512 | 128 | 4608 | 1.139 | 449.40 | 9.371 | 13.66 |
+| 512 | 128 | 5120 | 1.152 | 444.28 | 9.393 | 13.63 |
+| 512 | 128 | 5632 | 1.175 | 435.78 | 9.481 | 13.50 |
+| 512 | 128 | 6144 | 1.208 | 423.93 | 9.590 | 13.35 |
+| 512 | 128 | 6656 | 1.218 | 420.20 | 9.819 | 13.04 |
+| 512 | 128 | 7168 | 1.239 | 413.20 | 9.815 | 13.04 |
+| 512 | 128 | 7680 | 1.259 | 406.59 | 9.893 | 12.94 |
+| 512 | 128 | 8192 | 1.279 | 400.28 | 9.745 | 13.14 |
+
+## IQ3_KT main@1843ed22
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.025 | 499.67 | 9.065 | 14.12 |
+| 512 | 128 | 512 | 0.971 | 527.55 | 8.851 | 14.46 |
+| 512 | 128 | 1024 | 0.973 | 526.26 | 8.981 | 14.25 |
+| 512 | 128 | 1536 | 0.994 | 514.90 | 8.936 | 14.32 |
+| 512 | 128 | 2048 | 1.007 | 508.34 | 8.906 | 14.37 |
+| 512 | 128 | 2560 | 1.025 | 499.29 | 8.964 | 14.28 |
+| 512 | 128 | 3072 | 1.051 | 487.01 | 9.007 | 14.21 |
+| 512 | 128 | 3584 | 1.066 | 480.25 | 9.053 | 14.14 |
+| 512 | 128 | 4096 | 1.087 | 470.84 | 9.040 | 14.16 |
+| 512 | 128 | 4608 | 1.107 | 462.37 | 9.085 | 14.09 |
+| 512 | 128 | 5120 | 1.119 | 457.53 | 9.100 | 14.07 |
+| 512 | 128 | 5632 | 1.142 | 448.41 | 9.172 | 13.96 |
+| 512 | 128 | 6144 | 1.157 | 442.43 | 9.247 | 13.84 |
+| 512 | 128 | 6656 | 1.174 | 435.94 | 9.304 | 13.76 |
+| 512 | 128 | 7168 | 1.195 | 428.28 | 9.375 | 13.65 |
+| 512 | 128 | 7680 | 1.223 | 418.53 | 9.543 | 13.41 |
+| 512 | 128 | 8192 | 1.230 | 416.22 | 9.453 | 13.54 |
+
+## IQ2_KT main@1843ed22
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.886 | 577.57 | 8.677 | 14.75 |
+| 512 | 128 | 512 | 0.918 | 557.71 | 8.473 | 15.11 |
+| 512 | 128 | 1024 | 0.956 | 535.52 | 8.512 | 15.04 |
+| 512 | 128 | 1536 | 1.132 | 452.14 | 8.586 | 14.91 |
+| 512 | 128 | 2048 | 0.997 | 513.67 | 8.590 | 14.90 |
+| 512 | 128 | 2560 | 1.018 | 502.72 | 8.449 | 15.15 |
+| 512 | 128 | 3072 | 1.040 | 492.11 | 8.499 | 15.06 |
+| 512 | 128 | 3584 | 1.058 | 484.04 | 8.549 | 14.97 |
+| 512 | 128 | 4096 | 1.080 | 474.13 | 8.451 | 15.15 |
+| 512 | 128 | 4608 | 1.096 | 467.17 | 8.506 | 15.05 |
+| 512 | 128 | 5120 | 1.109 | 461.72 | 8.529 | 15.01 |
+| 512 | 128 | 5632 | 1.131 | 452.71 | 8.628 | 14.84 |
+| 512 | 128 | 6144 | 1.154 | 443.62 | 8.717 | 14.68 |
+| 512 | 128 | 6656 | 1.168 | 438.32 | 8.808 | 14.53 |
+| 512 | 128 | 7168 | 1.192 | 429.39 | 8.907 | 14.37 |
+| 512 | 128 | 7680 | 1.200 | 426.64 | 8.998 | 14.23 |
+| 512 | 128 | 8192 | 1.221 | 419.30 | 8.973 | 14.27 |
+
+
+
+---
+
+👤 **ubergarm** commented on **2025-06-20** at **21:43:10**
Okay, here are the perplexities as run on the thread ripper pro. I ran all the Qwen3-14B quants on a single RTX A6000 to use the CUDA implementation, and then the three `KT` quant again compiled CPU-only to confirm things line up as expected. All tests run on PR541@5b677c3c
@@ -320,13 +504,13 @@ CUDA_VISIBLE_DEVICES="0" \
Overall PR looks like a great speed improvement for token generation of KT quants. Given they still seem CPU bottle-necked at least in this specific test, I'd likely choose the 4bpw version over the smaller sizes when targeting tensors destined for CPU/RAM; because it generates about as fast while keeping more quality.
-Makes me wonder when a 5bpw or 6bpw version would begin to be RAM bandwidth bottle-necked again, but probably heavily dependent on the specific model and hardware. An iq6_kt might be equally RAM / CPU bottlenecked and achieve ~25 tok/sec TG on the ~512GB/s 6980P. 512 / (27.509 * (6/8))
+Makes me wonder when a 5bpw or 6bpw version would begin to be RAM bandwidth bottle-necked again, but probably heavily dependent on the specific model and hardware. An iq6_kt probably still would not hit that equivalent RAM / CPU bottleneck cross-over point on the ~512GB/s 6980P... 512 / (27.509 * (6/16)) = ~50 tok/sec theoretical max. To be fair that rig is not hitting theoretical max on the more simple quants, possibly NUMA related but not really sure.
Anyway, very cool stuff! Thanks!
---
-👤 **ubergarm** commented the **2025-06-20** at **23:32:59**:
+👤 **ubergarm** commented on **2025-06-20** at **23:32:59**
I was too curious to see how it it performed on the AMD Thread Ripper Pro.. Interestingly, there was more variability in the generation speed than with the Xeon 6980P. So I take back my conclusion above about always reaching for the 4bpw... lol...
@@ -479,15 +663,15 @@ Here is the graph and numbers below. Cheers!
---
-👤 **ubergarm** commented the **2025-06-21** at **00:37:44**:
+👤 **ubergarm** commented on **2025-06-21** at **00:37:44**
-I'm happy enough with the performance now to release the `R1-0528-IQ3_KT` on hugging face as *experimental* with the warning that there could potentially still be breaking changes. But a few others folks would be able to test as well. It lines up nicely in terms of perplexity and size, has a tight KLD max delta P, and now generates comparable to the slightly larger `IQ3_K_R4` as [shown in this discussion benchmark on huggingface](https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/discussions/7)
+I'm happy enough with the performance now to release the `R1-0528-IQ3_KT` on hugging face as *experimental* with the warning that there could potentially still be breaking changes. But a few others folks would be able to test as well. It lines up nicely in terms of perplexity and size, has a tight KLD max delta P, and now token generation speeds comparable to the slightly larger `IQ3_K_R4` as [shown in this discussion benchmark on huggingface](https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF/discussions/7#6855fdc29c7d3d10883a53ab)
over and out!
---
-👤 **ikawrakow** commented the **2025-06-21** at **14:32:10**:
+👤 **ikawrakow** commented on **2025-06-21** at **14:32:10**
@ubergarm Thank you for the extensive testing!
diff --git a/github-data/pull_requests/542 - Fix NEON build.md b/github-data/pull_requests/542 - Fix NEON build.md
index 03b17bb9f..6b714f8cb 100644
--- a/github-data/pull_requests/542 - Fix NEON build.md
+++ b/github-data/pull_requests/542 - Fix NEON build.md
@@ -1,13 +1,16 @@
-### 🐛 [#542](https://github.com/ikawrakow/ik_llama.cpp/pull/542) - Fix NEON build
+## 🔀 [Pull Request #542](https://github.com/ikawrakow/ik_llama.cpp/pull/542) - Fix NEON build
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_neon_build` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-19 |
| **Updated** | 2025-06-19 |
+| **Merged** | 2025-06-19 |
---
-#### Description
+## 📄 Description
I did not pay attention to the `ARM_NEON` build with the recent PP performance improvement PRs, so now the main branch does not even build. This PR fixes that (but nothing will be working).
\ No newline at end of file
diff --git a/github-data/pull_requests/544 - New integer trellis on ARM_NEON.md b/github-data/pull_requests/544 - New integer trellis on ARM_NEON.md
index a0369280a..8e97db896 100644
--- a/github-data/pull_requests/544 - New integer trellis on ARM_NEON.md
+++ b/github-data/pull_requests/544 - New integer trellis on ARM_NEON.md
@@ -1,14 +1,17 @@
-### 🔀 [#544](https://github.com/ikawrakow/ik_llama.cpp/pull/544) - New integer trellis on ARM_NEON
+## 🔀 [Pull Request #544](https://github.com/ikawrakow/ik_llama.cpp/pull/544) - New integer trellis on ARM_NEON
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/neon_iq3_kt` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-20 |
| **Updated** | 2025-06-20 |
+| **Merged** | 2025-06-20 |
---
-#### Description
+## 📄 Description
This PR adapts the ARM_NEON trellis implementation to the new integer trellis.
@@ -32,4 +35,4 @@ Still very low TG performance:
Don't ask Apple Silicon to do too much work with a piece of data fetched from memory.
-Nevertheless, compared to PR #471 we observe ~13% speedup for `IQ2_KT`, ~30% speedup for `IQ3_KT`, and nearly 70% speedup for `Q4_KT`.
\ No newline at end of file
+Nevertheless, compared to PR [#471](https://github.com/ikawrakow/ik_llama.cpp/issues/471) we observe ~13% speedup for `IQ2_KT`, ~30% speedup for `IQ3_KT`, and nearly 70% speedup for `Q4_KT`.
\ No newline at end of file
diff --git a/github-data/pull_requests/546 - Faster ARM_NEON GEMM implementation for legacy quants.md b/github-data/pull_requests/546 - Faster ARM_NEON GEMM implementation for legacy quants.md
index b4d2644ee..69601f8b0 100644
--- a/github-data/pull_requests/546 - Faster ARM_NEON GEMM implementation for legacy quants.md
+++ b/github-data/pull_requests/546 - Faster ARM_NEON GEMM implementation for legacy quants.md
@@ -1,18 +1,21 @@
-### 🔀 [#546](https://github.com/ikawrakow/ik_llama.cpp/pull/546) - Faster ARM_NEON GEMM implementation for legacy quants
+## 🔀 [Pull Request #546](https://github.com/ikawrakow/ik_llama.cpp/pull/546) - Faster ARM_NEON GEMM implementation for legacy quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/gemm_neon_legacy` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-21 |
| **Updated** | 2025-06-22 |
+| **Merged** | 2025-06-21 |
---
-#### Description
+## 📄 Description
It is time to give some attention to the `ARM_NEON` back-end, which has fallen behind quite a bit.
-This PR corresponds to PRs #531, #533, #534 and applies the on-the-fly repacking technique to `Q4_0, Q4_1, Q5_0, Q5_1, Q6_0, Q8_0, IQ4_NL` for the `ARM_NEON` implementation.
+This PR corresponds to PRs [#531](https://github.com/ikawrakow/ik_llama.cpp/issues/531), [#533](https://github.com/ikawrakow/ik_llama.cpp/issues/533), [#534](https://github.com/ikawrakow/ik_llama.cpp/issues/534) and applies the on-the-fly repacking technique to `Q4_0, Q4_1, Q5_0, Q5_1, Q6_0, Q8_0, IQ4_NL` for the `ARM_NEON` implementation.
Here is a PP-512 performance comparison between the main branch and this PR for LlaMA-3.1-8B-Instruct on M2-Max
@@ -24,234 +27,4 @@ Here is a PP-512 performance comparison between the main branch and this PR for
| Q8_0 | 84.45 | 128.63 | 1.523 |
| IQ4_NL | 84.47 | 128.09 | 1.516 |
| Q4_1 | 74.44 | 115.36 | 1.550 |
-| Q5_1 | 64.16 | 114.89 | 1.791 |
-
----
-
-#### 💬 Conversation
-
-👤 **zhouwg** commented the **2025-06-22** at **07:22:29**:
-
-I tried your ik_llamacpp on Android phone equipped with Qualcomm Snapdragon 8Elite(one of the most advanced mobile SoCs on our planet at the moment) today, the **performance of your excellent ik_llamacpp is impressive(faster than the upstream llama.cpp)** .
-
-both build with " -O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only " because " -O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only " can't works with ik_llama.cpp cause of some compile error with inline assemble codes.
-
-upstream llama.cpp:
-llama-bench:
-
-llama-cli:
-
-
-ik_llama.cpp:
-llama-bench:
-
-
-
-llama-cli(the inference result is incorrect and don't know why)
-
-
----
-
-👤 **zhouwg** commented the **2025-06-22** at **07:24:04**:
-
-I tried ik_llamacpp on Android phone equipped with Qualcomm Snapdragon 8Elite(one of the most advanced mobile SoCs on our planet at the moment) today, the **performance of your excellent ik_llamacpp is impressive** .
-
-both build with " -O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only " because " -O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only " can't works with ik_llama.cpp cause of some compile error with inline assemble codes.
-
-upstream llama.cpp with latest codes:
-llama-bench:
-
-llama-cli:
-
-
-ik_llama.cpp with latest codes:
-llama-bench:
-
-
-
-llama-cli(the inference result is incorrect and don't know why)
-
-
----
-
-👤 **zhouwg** commented the **2025-06-22** at **08:36:03**:
-
-comparison of llama_bench on Android phone equipped with Qualcomm Snapdragon 8Elite(one of the most advanced mobile SoCs on our planet at the moment) + Android NDK r28:
-
-1. both build with " -O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only "
-
-upstream llama.cpp with latest codes:
-llama-bench:
-
-llama-cli:
-
-
-ik_llama.cpp with latest codes:
-
-
-
-
-
-llama-cli(the inference result is incorrect)
-
-
-
-2. both build with " -O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
-
-upstream llama.cpp with latest codes:
-
-
-
-ik_llama.cpp with latest codes:
-
-
-3. both build with " -O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only ".
-
-upstream llama.cpp with latest codes:
-
-
-ik_llama.cpp with latest codes:
-
-
-4. both build with " -O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only"
-
-upstream llama.cpp with latest codes:
-the following is a screenshot when I helped troubleshooting a performance regression issue in the upstream llama.cpp project. as well known, there are so many approved PRs in the upstream llama.cpp project and some approved PRs might-be brings regression issues in the upstream llama.cpp project. sometimes I can't reproduce the same benchmark result with the upstream llama.cpp's latest codes.
-
-
-
-ik_llama.cpp with latest codes:
-
----
-
-👤 **zhouwg** commented the **2025-06-22** at **09:46:28**:
-
-comparison of llama_bench on Android phone equipped with Qualcomm Snapdragon 8Elite(one of the most advanced mobile SoCs on our planet at the moment) + Android NDK r28:
-
-1. both build with " -O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only "
-
-upstream llama.cpp with latest codes:
-llama-bench:
-
-llama-cli:
-
-
-ik_llama.cpp with latest codes:
-
-
-
-
-
-llama-cli(the inference result is incorrect)
-
-
-
-2. both build with " -O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
-
-upstream llama.cpp with latest codes:
-
-
-
-ik_llama.cpp with latest codes:
-
-
-3. both build with " -O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only ".
-
-upstream llama.cpp with latest codes:
-
-
-ik_llama.cpp with latest codes:
-
-
-4. both build with " -O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only"
-
-upstream llama.cpp with latest codes:
-
-
-
-the following is a screenshot when I helped troubleshooting a performance regression issue in the upstream llama.cpp project. as well known, there are so many approved PRs in the upstream llama.cpp project and some approved PRs might-be brings regression issues in the upstream llama.cpp project. sometimes I can't reproduce the same benchmark result with the upstream llama.cpp's latest codes.
-
-
-
-ik_llama.cpp with latest codes:
-
-
----
-
-👤 **zhouwg** commented the **2025-06-22** at **10:58:12**:
-
-comparison of llama_bench on Android phone equipped with Qualcomm Snapdragon 8Elite(one of the most advanced mobile SoCs on our planet at the moment) + Android NDK r28(the following benchmark data might-be depend on the workload of Android OS):
-
-1. both build with " -O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only "
-
-upstream llama.cpp with latest codes:
-llama-bench:
-
-llama-cli:
-
-
-ik_llama.cpp with latest codes:
-
-
-
-
-
-llama-cli(the inference result is incorrect)
-
-
-
-2. both build with " -O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
-
-upstream llama.cpp with latest codes:
-
-
-
-ik_llama.cpp with latest codes:
-
-
-3. both build with " -O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only ".
-
-upstream llama.cpp with latest codes:
-
-
-ik_llama.cpp with latest codes:
-
-
-4. both build with " -O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only"
-
-upstream llama.cpp with latest codes:
-
-
-
-the following is a screenshot when I helped troubleshooting a performance regression issue in the upstream llama.cpp project. as well known, there are so many approved PRs in the upstream llama.cpp project and some approved PRs might-be brings regression issues in the upstream llama.cpp project. sometimes I can't reproduce the same benchmark result with the upstream llama.cpp's latest codes.
-
-
-
-ik_llama.cpp with latest codes:
-
-
-after enable GGML_IQK_FLASH_ATTENTION
-
-build with " -O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only"
-
-
-
-build with " -O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
-
-
-
-build with " -O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only "
-
-
-
-build failed with " -O3 -march=armv8.7-a -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only "
-
-build with " -O3 -march=armv8.7-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
-
-
-build with " -O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
-
-
-
-build with "-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
-
\ No newline at end of file
+| Q5_1 | 64.16 | 114.89 | 1.791 |
\ No newline at end of file
diff --git a/github-data/pull_requests/547 - build add script to simplify buildtest workflow for Android.md b/github-data/pull_requests/547 - build add script to simplify buildtest workflow for Android.md
new file mode 100644
index 000000000..630cd222a
--- /dev/null
+++ b/github-data/pull_requests/547 - build add script to simplify buildtest workflow for Android.md
@@ -0,0 +1,207 @@
+## 🔀 [Pull Request #547](https://github.com/ikawrakow/ik_llama.cpp/pull/547) - build: add script to simplify build&test workflow for Android
+
+| **Author** | `jeffzhou2000` |
+| :--- | :--- |
+| **State** | ❌ **Closed** |
+| **Source Branch** | `fix-build-android` |
+| **Target Branch** | `main` |
+| **Created** | 2025-06-22 |
+| **Updated** | 2025-07-04 |
+
+---
+
+## 📄 Description
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [x] Low
+ - [ ] Medium
+ - [ ] High
+
+### purpose
+
+add script to simplify build & test workflow of ik_llama.cpp for Android
+
+---
+
+## 💬 Conversation
+
+👤 **jeffzhou2000** commented on **2025-06-22** at **13:11:26**
+
+comparison of llama_bench on Android phone equipped with Qualcomm Snapdragon 8Elite(one of the most advanced mobile SoCs on our planet at the moment) + Android NDK r28(the following benchmark data might-be depend on the workload of Android OS):
+
+1. both build with " -O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only "
+
+upstream llama.cpp with latest codes:
+llama-bench:
+
+llama-cli:
+
+
+ik_llama.cpp with latest codes:
+
+
+
+
+
+llama-cli(the inference result is incorrect)
+
+
+
+2. both build with " -O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
+
+upstream llama.cpp with latest codes:
+
+
+
+ik_llama.cpp with latest codes:
+
+
+3. both build with " -O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only ".
+
+upstream llama.cpp with latest codes:
+
+
+ik_llama.cpp with latest codes:
+
+
+4. both build with " -O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only"
+
+upstream llama.cpp with latest codes:
+
+
+
+
+the following is a screenshot when I helped troubleshooting a performance regression issue in the upstream llama.cpp project. as well known, there are so many approved PRs in the upstream llama.cpp project and some approved PRs might-be brings regression issues in the upstream llama.cpp project. sometimes I can't reproduce the same benchmark result with the upstream llama.cpp's latest codes.
+
+
+
+ik_llama.cpp with latest codes:
+
+
+
+
+
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
+after enable GGML_IQK_FLASH_ATTENTION
+
+build with " -O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -fvectorize -ffp-model=fast -fno-finite-math-only"
+
+
+
+build with " -O3 -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
+
+
+
+build with " -O3 -march=armv8.7-a -mcpu=cortex-x1 -mtune=cortex-x1 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only "
+
+
+
+build failed with " -O3 -march=armv8.7-a -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only "
+
+build with " -O3 -march=armv8.7-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
+
+
+build with " -O3 -march=armv8.2-a+dotprod+fp16 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
+
+
+
+build with "-O3 -flto -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only"
+
+
+
+
+in my opinion/personal perspective, the upstream llama.cpp can get much performance gains from optimization of Google's state-of-the-art toolchain(as well known, there are many top world-class compiler experts and engineers in Google). at the same time, the hand-written codes in this project runs faster than the neon codes in the upstream.
+
+---
+
+👤 **ikawrakow** started a conversation on `CMakeLists.txt` on **2025-06-23** at **10:02:36**
+
+Your measurements clearly indicate that these are **not the best** compiler settings. Apart from not being best for `ik_llama.cpp`, there are a lot of Android phones out there that only support `armv8.2-a`, which is the minimum required for `ik_llama.cpp` to build correctly.
+
+More generally, `ik_llama.cpp` allows to manually set `GGML_ARCH_FLAGS`, exactly for the purpose of building on Android when the compiler for whatever reason does not use correct settings with `GGML_NATIVE`.
+
+> 👤 **jeffzhou2000** replied on **2025-06-23** at **10:24:07**
+>
+> yes, you are right.
+>
+> 1. I have tried GGML_ARCH_FLAGS in the CMakeLists.txt and it works fine as expected although the GGML_NATIVE can't works for Android.
+> 2. I can remove this changes in the toplevel CMakeLists.txt accordingly.
+
+> 👤 **jeffzhou2000** replied on **2025-06-23** at **10:42:21**
+>
+> refined according to your comment, pls take a look if you have time.
+
+> 👤 **jeffzhou2000** replied on **2025-06-23** at **11:34:06**
+>
+> > Your measurements clearly indicate that these are **not the best** compiler settings.
+>
+>
+> sorry I just noticed this.
+>
+> the potential best compiler settings for ik_llama.cpp on Snapdragon 8Elite with NDK r28 might-be:
+>
+> -march=armv8.7-a+dotprod+fp16
+>
+> or
+>
+> -march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1
+>
+> or
+>
+>
+> -march=armv8.7-a+dotprod+fp16+i8mm -mcpu=cortex-x1 -mtune=cortex-x1
+>
+> or
+>
+> -march=armv8.7-a+dotprod+fp16+i8mm -mcpu=cortex-x1 -mtune=cortex-x1 -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only
+>
+> depend on workload of Android OS, this is my personal opinion, might-be not very exactly correct.
+>
+> the performance of upstream llama.cpp on Android is also weird: sometimes very good, sometimes not good as expected, as I said before: I have very very very limited knowledge about hardcore AI tech, or the codes in ${llama.cpp_src_rootdirectory}/src, so I don't know how to troubleshoot the performance issue of the upstream llama.cpp on Android thoroughly. one more thing, it seems the Android is not the point of upstream llama.cpp although it's an on-device inference framework:
+> - it seems cuda is the key-point of upstream llama.cpp.
+> - I know gg is also familiar with Android dev but it seems he focus on metal and obviously he is very busy.
+
+---
+
+👤 **ikawrakow** started a conversation on `examples/quantize-stats/CMakeLists.txt` on **2025-06-23** at **10:03:29**
+
+This can easily be combined with `OR` with the above condition.
+
+> 👤 **jeffzhou2000** replied on **2025-06-23** at **10:20:45**
+>
+> thanks for reminder.
+>
+> if (NOT MSVC OR NOT (CMAKE_SYSTEM_NAME STREQUAL "Android"))
+> list(APPEND ARCH_FLAGS -march=native)
+> endif()
+
+---
+
+👤 **ikawrakow** started a conversation on `scripts/build-run-android.sh` on **2025-06-23** at **10:04:53**
+
+To me this looks a lot like a script that will only work for your specific setup. Is it really useful for others?
+
+> 👤 **jeffzhou2000** replied on **2025-07-04** at **09:11:51**
+>
+> YES, you are right.
+>
+> I'm not sure because it's a script to simplify workflow of build ik_llama.cpp on Linux for Android.
+>
+> I'd like to close this PR accordingly and it doesn't matter because I know this project is a playground for AI expert.
+>
+> btw, I deleted my inappropriate comments(they are marked off-topic) in your excellent project today, thanks for your understanding. as I said two weeks ago: I still think the upstream llama.cpp project need your unique and important ideas and codes because you are a truly AI expert and already did an unique and important contribution to the llama.cpp project.
+
+> 👤 **ikawrakow** replied on **2025-07-04** at **09:16:24**
+>
+> > btw, I deleted my inappropriate comments(they are marked off-topic) in your excellent project today,
+>
+> Well, they are totally off-topic in a discussion about new SOTA quantization types in `ik_llama.cpp`. If you want to discuss stuff related to Android, you can open a new discussion.
+
+> 👤 **jeffzhou2000** replied on **2025-07-04** at **09:18:14**
+>
+> yes, you are absolutely correct: they are totally off-topic in a discussion about new SOTA quantization types in ik_llama.cpp. thanks for your understanding!
+
+---
+
+👤 **ikawrakow** requested changes on this pull request 🔄 on **2025-06-23** at **10:05:44**
\ No newline at end of file
diff --git a/github-data/pull_requests/547 - build_ add script to simplify build_test workflow for Android.md b/github-data/pull_requests/547 - build_ add script to simplify build_test workflow for Android.md
deleted file mode 100644
index f023a9025..000000000
--- a/github-data/pull_requests/547 - build_ add script to simplify build_test workflow for Android.md
+++ /dev/null
@@ -1,115 +0,0 @@
-### 🔀 [#547](https://github.com/ikawrakow/ik_llama.cpp/pull/547) - build: add script to simplify build&test workflow for Android
-
-| **Author** | `jeffzhou2000` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-06-22 |
-| **Updated** | 2025-07-04 |
-
----
-
-#### Description
-
-- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
-- Self-reported review complexity:
- - [x] Low
- - [ ] Medium
- - [ ] High
-
-### purpose
-
-add script to simplify build & test workflow of ik_llama.cpp for Android
-
----
-
-#### 💬 Conversation
-
-👤 **ikawrakow** submitted a review the **2025-06-23** at **10:05:44**: 🔄 `CHANGES_REQUESTED`
-
----
-
-👤 **jeffzhou2000** submitted a review the **2025-06-23** at **10:20:45**: 💬 `COMMENTED`
-
----
-
-👤 **jeffzhou2000** submitted a review the **2025-06-23** at **10:24:07**: 💬 `COMMENTED`
-
----
-
-👤 **jeffzhou2000** submitted a review the **2025-06-23** at **10:42:21**: 💬 `COMMENTED`
-
----
-
-👤 **zhouwg** submitted a review the **2025-06-23** at **10:42:21**: 💬 `COMMENTED`
-
----
-
-👤 **zhouwg** commented during a code review the **2025-06-23** at **10:42:21** on `CMakeLists.txt`:
-
-refined according to your comment, pls take a look if you have time.
-
----
-
-👤 **zhouwg** submitted a review the **2025-06-23** at **11:19:16**: 💬 `COMMENTED`
-
----
-
-👤 **zhouwg** commented during a code review the **2025-06-23** at **11:19:16** on `CMakeLists.txt`:
-
-> Your measurements clearly indicate that these are **not the best** compiler settings.
-the potential best compiler settings for ik_llama.cpp on Snapdragon 8Elite might-be:
-
--march=armv8.7-a+dotprod+fp16
-
-or
-
--march=armv8.7-a+dotprod+fp16 -mcpu=cortex-x1 -mtune=cortex-x1
-
-or
-
-
--march=armv8.7-a+dotprod+fp16+i8mm -mcpu=cortex-x1 -mtune=cortex-x1
-
-or
-
--march=armv8.7-a+dotprod+fp16+i8mm -mcpu=cortex-x1 -mtune=cortex-x1 -D_GNU_SOURCE -ffp-model=fast -fno-finite-math-only
-
-depend on workload of Android OS, this is my personal opinion, might-be not very exactly correct.
-
----
-
-👤 **jeffzhou2000** submitted a review the **2025-06-23** at **11:34:06**: 💬 `COMMENTED`
-
----
-
-👤 **zhouwg** submitted a review the **2025-06-23** at **13:15:25**: 💬 `COMMENTED`
-
----
-
-👤 **zhouwg** commented during a code review the **2025-06-23** at **13:15:25** on `scripts/build-run-android.sh`:
-
-YES, you are right.
-
-I'm not sure because it's a script to simplify workflow of build ik_llama.cpp on Linux for Android.
-
-I'd like to close this PR accordingly and it doesn't matter.
-
-thanks for your time to review this PR and have a good day.
-
----
-
-👤 **jeffzhou2000** submitted a review the **2025-07-04** at **09:11:51**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** submitted a review the **2025-07-04** at **09:16:24**: 💬 `COMMENTED`
-
----
-
-👤 **jeffzhou2000** submitted a review the **2025-07-04** at **09:18:14**: 💬 `COMMENTED`
-
----
-
-👤 **zhouwg** commented during a code review the **2025-07-04** at **09:18:14** on `scripts/build-run-android.sh`:
-
-yes, you are absolutely correct: they are totally off-topic in a discussion about new SOTA quantization types in ik_llama.cpp. thanks for your understanding!
\ No newline at end of file
diff --git a/github-data/pull_requests/549 - Much faster prompt processing for IQK quants _ARM_NEON_.md b/github-data/pull_requests/549 - Much faster prompt processing for IQK quants ARM_NEON.md
similarity index 51%
rename from github-data/pull_requests/549 - Much faster prompt processing for IQK quants _ARM_NEON_.md
rename to github-data/pull_requests/549 - Much faster prompt processing for IQK quants ARM_NEON.md
index 08ef63e83..d3dadcf2d 100644
--- a/github-data/pull_requests/549 - Much faster prompt processing for IQK quants _ARM_NEON_.md
+++ b/github-data/pull_requests/549 - Much faster prompt processing for IQK quants ARM_NEON.md
@@ -1,18 +1,21 @@
-### 🔀 [#549](https://github.com/ikawrakow/ik_llama.cpp/pull/549) - Much faster prompt processing for IQK quants (ARM_NEON)
+## 🔀 [Pull Request #549](https://github.com/ikawrakow/ik_llama.cpp/pull/549) - Much faster prompt processing for IQK quants (ARM_NEON)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/gemm_neon_iqk` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-23 |
| **Updated** | 2025-06-23 |
+| **Merged** | 2025-06-23 |
---
-#### Description
+## 📄 Description
It is time to give some attention to the `ARM_NEON` back-end, which has fallen behind quite a bit.
-This PR corresponds to PRs #531, #533, #534, #546 and applies the on-the-fly repacking technique to `IQK` quants (`IQ2_KS, IQ2_K, IQ3_K, IQ4_KS, IQ4_K, IQ5_KS, IQ5_K, IQ6_K`) for the `ARM_NEON` implementation.
+This PR corresponds to PRs [#531](https://github.com/ikawrakow/ik_llama.cpp/issues/531), [#533](https://github.com/ikawrakow/ik_llama.cpp/issues/533), [#534](https://github.com/ikawrakow/ik_llama.cpp/issues/534), [#546](https://github.com/ikawrakow/ik_llama.cpp/issues/546) and applies the on-the-fly repacking technique to `IQK` quants (`IQ2_KS, IQ2_K, IQ3_K, IQ4_KS, IQ4_K, IQ5_KS, IQ5_K, IQ6_K`) for the `ARM_NEON` implementation.
Here is a PP-512 performance comparison between the main branch and this PR for LlaMA-3.1-8B-Instruct on M2-Max
diff --git a/github-data/pull_requests/55 - Improve Q5_0 performance on AVX2.md b/github-data/pull_requests/55 - Improve Q5_0 performance on AVX2.md
index 47cff59e9..b1272d0f6 100644
--- a/github-data/pull_requests/55 - Improve Q5_0 performance on AVX2.md
+++ b/github-data/pull_requests/55 - Improve Q5_0 performance on AVX2.md
@@ -1,14 +1,17 @@
-### 🔀 [#55](https://github.com/ikawrakow/ik_llama.cpp/pull/55) - Improve Q5_0 performance on AVX2
+## 🔀 [Pull Request #55](https://github.com/ikawrakow/ik_llama.cpp/pull/55) - Improve Q5_0 performance on AVX2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/avx2_q5_0` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-14 |
| **Updated** | 2024-09-14 |
+| **Merged** | 2024-09-14 |
---
-#### Description
+## 📄 Description
The main purpose of the [previous PR](https://github.com/ikawrakow/ik_llama.cpp/pull/54) was to try to improve `K*Q` matrix multiplications for flash attention with `Q8_0` quantized k-cache. Sadly, the performance improvement that we got for `Q8_0` did not translate into better FA performance. It is a rainy Saturday, so need something to brighten my day. The last PR is very easily applied to `Q5_0`, so here we are.
diff --git a/github-data/pull_requests/550 - Much faster prompt processing for I-quants ARM_NEON.md b/github-data/pull_requests/550 - Much faster prompt processing for I-quants ARM_NEON.md
new file mode 100644
index 000000000..04aa107be
--- /dev/null
+++ b/github-data/pull_requests/550 - Much faster prompt processing for I-quants ARM_NEON.md
@@ -0,0 +1,30 @@
+## 🔀 [Pull Request #550](https://github.com/ikawrakow/ik_llama.cpp/pull/550) - Much faster prompt processing for I-quants (ARM_NEON)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/gemm_neon_iquants` |
+| **Target Branch** | `main` |
+| **Created** | 2025-06-23 |
+| **Updated** | 2025-06-23 |
+| **Merged** | 2025-06-23 |
+
+---
+
+## 📄 Description
+
+It is time to give some attention to the `ARM_NEON` back-end, which has fallen behind quite a bit.
+
+This PR corresponds to PRs [#531](https://github.com/ikawrakow/ik_llama.cpp/issues/531), [#533](https://github.com/ikawrakow/ik_llama.cpp/issues/533), [#534](https://github.com/ikawrakow/ik_llama.cpp/issues/534), [#546](https://github.com/ikawrakow/ik_llama.cpp/issues/546), [#549](https://github.com/ikawrakow/ik_llama.cpp/issues/549), and applies the on-the-fly repacking technique to i-quants (`IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S`) for the `ARM_NEON` implementation.
+
+Here is a PP-512 performance comparison between the main branch and this PR for LlaMA-3.1-8B-Instruct on M2-Max
+
+| type | t/s (main) | t/s (PR) | Speedup |
+| ---: | ---: | ---: | ---: |
+| IQ2_XXS | 55.79 | 167.55 | 3.003 |
+| IQ2_XS | 46.40 | 166.65 | 3.592 |
+| IQ2_S | 42.75 | 166.83 | 3.903 |
+| IQ3_XXS | 51.84 | 165.56 | 3.194 |
+| IQ3_S | 46.02 | 162.03 | 3.521 |
+
+At this point i- and `IQK` quants are the top tier quants for prompt processing speed on `ARM_NEON`.
\ No newline at end of file
diff --git a/github-data/pull_requests/550 - Much faster prompt processing for I-quants _ARM_NEON_.md b/github-data/pull_requests/550 - Much faster prompt processing for I-quants _ARM_NEON_.md
deleted file mode 100644
index 4e7c07d7b..000000000
--- a/github-data/pull_requests/550 - Much faster prompt processing for I-quants _ARM_NEON_.md
+++ /dev/null
@@ -1,27 +0,0 @@
-### 🔀 [#550](https://github.com/ikawrakow/ik_llama.cpp/pull/550) - Much faster prompt processing for I-quants (ARM_NEON)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-06-23 |
-| **Updated** | 2025-06-23 |
-
----
-
-#### Description
-
-It is time to give some attention to the `ARM_NEON` back-end, which has fallen behind quite a bit.
-
-This PR corresponds to PRs #531, #533, #534, #546, #549, and applies the on-the-fly repacking technique to i-quants (`IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S`) for the `ARM_NEON` implementation.
-
-Here is a PP-512 performance comparison between the main branch and this PR for LlaMA-3.1-8B-Instruct on M2-Max
-
-| type | t/s (main) | t/s (PR) | Speedup |
-| ---: | ---: | ---: | ---: |
-| IQ2_XXS | 55.79 | 167.55 | 3.003 |
-| IQ2_XS | 46.40 | 166.65 | 3.592 |
-| IQ2_S | 42.75 | 166.83 | 3.903 |
-| IQ3_XXS | 51.84 | 165.56 | 3.194 |
-| IQ3_S | 46.02 | 162.03 | 3.521 |
-
-At this point i- and `IQK` quants are the top tier quants for prompt processing speed on `ARM_NEON`.
\ No newline at end of file
diff --git a/github-data/pull_requests/552 - Much faster prompt processing for k-quants _ARM_NEON_.md b/github-data/pull_requests/552 - Much faster prompt processing for k-quants ARM_NEON.md
similarity index 51%
rename from github-data/pull_requests/552 - Much faster prompt processing for k-quants _ARM_NEON_.md
rename to github-data/pull_requests/552 - Much faster prompt processing for k-quants ARM_NEON.md
index c120e457a..9c85a00b8 100644
--- a/github-data/pull_requests/552 - Much faster prompt processing for k-quants _ARM_NEON_.md
+++ b/github-data/pull_requests/552 - Much faster prompt processing for k-quants ARM_NEON.md
@@ -1,18 +1,21 @@
-### 🔀 [#552](https://github.com/ikawrakow/ik_llama.cpp/pull/552) - Much faster prompt processing for k-quants (ARM_NEON)
+## 🔀 [Pull Request #552](https://github.com/ikawrakow/ik_llama.cpp/pull/552) - Much faster prompt processing for k-quants (ARM_NEON)
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/gemm_neon_kquants` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-24 |
| **Updated** | 2025-06-24 |
+| **Merged** | 2025-06-24 |
---
-#### Description
+## 📄 Description
It is time to give some attention to the `ARM_NEON` back-end, which has fallen behind quite a bit.
-This PR corresponds to PRs #531, #533, #534, #546, #549, #550, and applies the on-the-fly repacking technique to k-quants (`Q2_K, Q3_K, Q4_K, Q5_K, Q6_K`) and to `IQ4_XS` for the `ARM_NEON` implementation.
+This PR corresponds to PRs [#531](https://github.com/ikawrakow/ik_llama.cpp/issues/531), [#533](https://github.com/ikawrakow/ik_llama.cpp/issues/533), [#534](https://github.com/ikawrakow/ik_llama.cpp/issues/534), [#546](https://github.com/ikawrakow/ik_llama.cpp/issues/546), [#549](https://github.com/ikawrakow/ik_llama.cpp/issues/549), [#550](https://github.com/ikawrakow/ik_llama.cpp/issues/550), and applies the on-the-fly repacking technique to k-quants (`Q2_K, Q3_K, Q4_K, Q5_K, Q6_K`) and to `IQ4_XS` for the `ARM_NEON` implementation.
Here is a PP-512 performance comparison between the main branch and this PR for LlaMA-3.1-8B-Instruct on M2-Max
diff --git a/github-data/pull_requests/553 - Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON.md b/github-data/pull_requests/553 - Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON.md
index d563b973f..a437eccc8 100644
--- a/github-data/pull_requests/553 - Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON.md
+++ b/github-data/pull_requests/553 - Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON.md
@@ -1,16 +1,19 @@
-### 🔀 [#553](https://github.com/ikawrakow/ik_llama.cpp/pull/553) - Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON
+## 🔀 [Pull Request #553](https://github.com/ikawrakow/ik_llama.cpp/pull/553) - Much faster prompt processing for IQ1_S and IQ1_M on ARM_NEON
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/gemm_neon_1bit` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-24 |
| **Updated** | 2025-06-24 |
+| **Merged** | 2025-06-24 |
---
-#### Description
+## 📄 Description
-This PR corresponds to PRs #531, #533, #534, #546, #549, #550, #552, and applies the on-the-fly repacking technique to
+This PR corresponds to PRs [#531](https://github.com/ikawrakow/ik_llama.cpp/issues/531), [#533](https://github.com/ikawrakow/ik_llama.cpp/issues/533), [#534](https://github.com/ikawrakow/ik_llama.cpp/issues/534), [#546](https://github.com/ikawrakow/ik_llama.cpp/issues/546), [#549](https://github.com/ikawrakow/ik_llama.cpp/issues/549), [#550](https://github.com/ikawrakow/ik_llama.cpp/issues/550), [#552](https://github.com/ikawrakow/ik_llama.cpp/issues/552), and applies the on-the-fly repacking technique to
the 1-bit quants `IQ1_S` and `IQ1_M` on `ARM_NEON`.
Here is a PP-512 performance comparison between the main branch and this PR for LlaMA-3.1-8B-Instruct on M2-Max
diff --git a/github-data/pull_requests/554 - Update README.md to add quickstart section.md b/github-data/pull_requests/554 - Update README.md to add quickstart section.md
index d95ffaf71..039a876be 100644
--- a/github-data/pull_requests/554 - Update README.md to add quickstart section.md
+++ b/github-data/pull_requests/554 - Update README.md to add quickstart section.md
@@ -1,14 +1,16 @@
-### 🔀 [#554](https://github.com/ikawrakow/ik_llama.cpp/pull/554) - Update README.md to add quickstart section
+## 🔀 [Pull Request #554](https://github.com/ikawrakow/ik_llama.cpp/pull/554) - Update README.md to add quickstart section
| **Author** | `jwinpbe` |
| :--- | :--- |
| **State** | ✅ **Open** |
+| **Source Branch** | `s6/docs_update` |
+| **Target Branch** | `s6/docs_update` |
| **Created** | 2025-06-25 |
| **Updated** | 2025-06-25 |
---
-#### Description
+## 📄 Description
add quickstart section using ubergarm's discussion post. Scrolling to the discussion every time I want to remember how to build the damn thing is a minor inconvienience so this pull request is both useful and self-serving. Thanks <3
@@ -22,21 +24,29 @@ add quickstart section using ubergarm's discussion post. Scrolling to the discus
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-06-25** at **04:44:23**:
+👤 **saood06** commented on **2025-06-25** at **04:44:23**
-The quickstart section seems like a very oversimplified version of the `docs/build.md` file (which I just noticed should be updated to reference `ik_llama.cpp` not `llama.cpp`.
+The quickstart section seems like a very oversimplified version of the [`docs/build.md`](https://github.com/ikawrakow/ik_llama.cpp/blob/main/docs/build.md) file (which I just noticed should be updated to reference `ik_llama.cpp` not `llama.cpp`.
I do think a Quick Start section similar to mainline could be beneficial but I still think it should go after the News section (which still needs to be shorter), and reference `docs/build.md`.
---
-👤 **saood06** submitted a review the **2025-06-25** at **17:48:05**: 💬 `COMMENTED`
+👤 **jwinpbe** commented on **2025-06-25** at **04:54:37**
+
+I'll happily defer to your judgement -- I see you updating documents all the time. I don't want to iterate over the news section as I don't feel like that's my call. Thanks again.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-25** at **14:38:23**
+
+Why do I see the latest news as being changed in the diff?
---
-👤 **saood06** commented during a code review the **2025-06-25** at **17:48:05** on `README.md`:
+👤 **saood06** started a conversation on `README.md` on **2025-06-25** at **17:48:05**
`-DGGML_BLAS=OFF`
@@ -44,17 +54,29 @@ Is not needed, it is off by default.
---
-👤 **saood06** submitted a review the **2025-06-25** at **17:48:42**: 💬 `COMMENTED`
+👤 **saood06** started a conversation on `README.md` on **2025-06-25** at **17:48:42**
+
+Same as above
---
-👤 **saood06** commented during a code review the **2025-06-25** at **17:48:42** on `README.md`:
+👤 **saood06** commented on **2025-06-25** at **18:00:28**
-Same as above
+> Why do I see the latest news as being changed in the diff?
+
+Because this PR is targeting an old branch, and manually pulled in the changes from main.
+
+---
+
+👤 **saood06** commented on **2025-06-25** at **18:07:40**
+
+>I don't want to iterate over the news section as I don't feel like that's my call.
+
+I'd be curious about your opinions. I ran out of ideas on how to condense it, and it is also just good to hear the perspective of someone else.
---
-👤 **jwinpbe** commented the **2025-06-25** at **21:25:24**:
+👤 **jwinpbe** commented on **2025-06-25** at **21:25:24**
> Why do I see the latest news as being changed in the diff?
diff --git a/github-data/pull_requests/555 - Add Falcon-Edge support.md b/github-data/pull_requests/555 - Add Falcon-Edge support.md
index 63a433598..1eed48699 100644
--- a/github-data/pull_requests/555 - Add Falcon-Edge support.md
+++ b/github-data/pull_requests/555 - Add Falcon-Edge support.md
@@ -1,16 +1,19 @@
-### 🔀 [#555](https://github.com/ikawrakow/ik_llama.cpp/pull/555) - Add Falcon-Edge support
+## 🔀 [Pull Request #555](https://github.com/ikawrakow/ik_llama.cpp/pull/555) - Add Falcon-Edge support
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/falcon_edge` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-25 |
| **Updated** | 2025-06-26 |
+| **Merged** | 2025-06-26 |
---
-#### Description
+## 📄 Description
-Closes #551
+Closes [#551](https://github.com/ikawrakow/ik_llama.cpp/issues/551)
How to use:
diff --git a/github-data/pull_requests/557 - CUDA_ MMQ for iqX_r4 quants.md b/github-data/pull_requests/557 - CUDA MMQ for iqX_r4 quants.md
similarity index 79%
rename from github-data/pull_requests/557 - CUDA_ MMQ for iqX_r4 quants.md
rename to github-data/pull_requests/557 - CUDA MMQ for iqX_r4 quants.md
index 5cb4fe748..d6ba50f06 100644
--- a/github-data/pull_requests/557 - CUDA_ MMQ for iqX_r4 quants.md
+++ b/github-data/pull_requests/557 - CUDA MMQ for iqX_r4 quants.md
@@ -1,14 +1,17 @@
-### 🔀 [#557](https://github.com/ikawrakow/ik_llama.cpp/pull/557) - CUDA: MMQ for iqX_r4 quants
+## 🔀 [Pull Request #557](https://github.com/ikawrakow/ik_llama.cpp/pull/557) - CUDA: MMQ for iqX_r4 quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_iqk_r4` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-25 |
| **Updated** | 2025-06-26 |
+| **Merged** | 2025-06-26 |
---
-#### Description
+## 📄 Description
CUDA matrix multiplications for `IQ2_K_R4, ..., IQ5_K_R4` quants on the main branch are implemented via deqantize to `fp16` (or `bf16`) + cuBLAS. As a result, there is a constant overhead for the dequantization step, which leads to relatively low performance when the number of tokens being processed is small. This is often the case for MoE models with many experts where each expert "sees" a small fraction of the tokens. For instance, for DeepSeek-R1/V3, for a batch size of 4096 tokens, experts will process on average just 128 tokens.
@@ -20,9 +23,9 @@ The benefit is illustrated with the following graph, which shows prompt processi
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-06-25** at **15:39:08**:
+👤 **ubergarm** commented on **2025-06-25** at **15:39:08**
Ran one test of my `IQ2_K_R4` on the Thread Ripper Pro 24x core offloading some layers onto 2x RTX A6000 GPUs showing some uplift for PP with this PR. I didn't try larger batch sizes at it sounds like this mostly benefits smaller batch sizes. Also I could have offloaded a couple more layers at least which would likely help given this boosts the CUDA code path speeds.
@@ -96,4 +99,22 @@ model=DeepSeek-R1-0528-IQ2_K_R4-00001-of-00005.gguf
-
\ No newline at end of file
+
+
+---
+
+👤 **ikawrakow** commented on **2025-06-25** at **16:04:37**
+
+Thanks! Interesting that the performance fluctuations increase with this PR. There is also no reason TG performance to change.
+
+---
+
+👤 **ubergarm** commented on **2025-06-25** at **17:02:56**
+
+> Thanks! Interesting that the performance fluctuations increase with this PR. There is also no reason TG performance to change.
+
+Yeah I was wondering about those two dips myself. I will try to get another run in to see as the rig was doing some downloading in the background which could account for some variability in the TG.
+
+My larger sized quants e.g. IQ3_K_R4 are actually using the IQ4_KS_R4 for ffn_down etc.. I had chosen the `KS` quants over `K` as the larger 32 block size seems to give faster speeds. So this PR should be best for my smallest IQ2_K_R4 as there are no `KS` quants available for those smaller bpw's.
+
+Also thanks I'm always impressed with your creativity in optimizing some of the quants we've put out there! Really appreciate it!
\ No newline at end of file
diff --git a/github-data/pull_requests/558 - Add mikupad to ik_llama as an alternative WebUI.md b/github-data/pull_requests/558 - Add mikupad to ik_llama as an alternative WebUI.md
index 390de6a1a..2086808d8 100644
--- a/github-data/pull_requests/558 - Add mikupad to ik_llama as an alternative WebUI.md
+++ b/github-data/pull_requests/558 - Add mikupad to ik_llama as an alternative WebUI.md
@@ -1,14 +1,16 @@
-### 🔀 [#558](https://github.com/ikawrakow/ik_llama.cpp/pull/558) - Add mikupad to ik_llama as an alternative WebUI
+## 🔀 [Pull Request #558](https://github.com/ikawrakow/ik_llama.cpp/pull/558) - Add mikupad to ik_llama as an alternative WebUI
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ✅ **Open** |
+| **State** | 📝 **Draft** |
+| **Source Branch** | `s6/mikupad` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-26 |
-| **Updated** | 2025-07-13 |
+| **Updated** | 2025-07-25 |
---
-#### Description
+## 📄 Description
This PR adds [mikupad](https://github.com/lmg-anon/mikupad) (and new endpoints to `server.cpp` that mikupad uses to manage its sql database).
@@ -42,7 +44,7 @@ To-do:
- [x] Remove `selectedSessionId` from the database and have it be handled via URL fragment instead
- [x] Add export all button
- [x] Implement endpoints to create, maintain, and get config info for compression (and `VACUUM` to reduce file size).
-- [ ] Finalize or Implement UI (for export all button, compression, KV cache manipulation)
+- [ ] Finalize or Implement UI (for export all button, compression, KV cache manipulation) see [this](https://github.com/ikawrakow/ik_llama.cpp/pull/558#issuecomment-3115444257) comment for an update
- [ ] Update license (including a potential new AUTHORS file for mikupad)
- [ ] Documentation
- [ ] I think compile will fail if it can't find sqlite so fix that if that is the case
@@ -52,7 +54,7 @@ Potential roadmap items:
- [ ] Add a mode that creates new sessions on branching or prediction
- [ ] Remove `nextSessionId` from the database. This would allow the sessions table to have a standard `INTEGER PRIMARY KEY` as that is currently how the TEXT key is being used besides `nextSessionId` (and the now removed `selectedSessionId`). As nice as this is, I'm not sure it is worth the database migration.
- [ ] SQLite Wasm option
-- [ ] Allow for slot saves to be in the database. This would allow for it to be compressed (similar to prompts there can often be a lot of redundancy between saves).
+- [ ] Allow for slot saves to be in the database. This would allow for it to be compressed (similar to prompts there can often be a lot of redundancy between saves). Edit: This may not be as useful as expected.
- [ ] Add a new pure black version of Monospace dark (for OLED screens).
- [ ] Add the ability to mask tokens from being processed (for use with think tokens as they are supposed to be removed once the response is finished).
- [ ] max content length should be obtained from server (based on `n_ctx`) and not from user input, and also changing or even removing the usage of that variable (or just from the UI). It is used for setting maximums for Penalty Range for some samplers (useful but could be frustrating if set wrong as knowing that is not very clear), and to truncate it seems in some situation (not useful in my view).
@@ -64,15 +66,15 @@ An image of the new resizable sessions section (`All` group is always on top, an
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-06-28** at **01:46:03**:
+👤 **saood06** commented on **2025-06-28** at **01:46:03**
Now that I have removed the hardcoded extension loading, I do think this is in a state where it can be used by others (and potentially provide feedback), but I will still be working on completing things from the "To-do" list above until it is ready for review (and will update the post above).
---
-👤 **ubergarm** commented the **2025-06-30** at **14:34:30**:
+👤 **ubergarm** commented on **2025-06-30** at **14:34:30**
Heya @saood06 I had some time this morning to kick the tires on this PR.
@@ -82,7 +84,7 @@ I don't typically use the built-in web interface, but I did by mest to try it ou
-logs
+👈logs and screenshots
```bash
# get setup
@@ -168,7 +170,15 @@ INFO [ log_server_request] request | tid="140145873375232" timestamp=175129
---
-👤 **saood06** commented the **2025-06-30** at **18:30:02**:
+👤 **Downtown-Case** commented on **2025-06-30** at **15:42:14**
+
+I am interested in this.
+
+Mikupad is *excellent* for testing prompt formatting and sampling, with how it shows logprobs over generated tokens. It's also quite fast with big blocks of text.
+
+---
+
+👤 **saood06** commented on **2025-06-30** at **18:30:02**
> I am interested in this.
>
@@ -186,8 +196,83 @@ You are doing the correct steps, I was able to reproduce the issue of not workin
---
-👤 **ubergarm** commented the **2025-06-30** at **19:41:28**:
+👤 **ubergarm** commented on **2025-06-30** at **19:41:28**
> You are doing the correct steps, I was able to reproduce the issue of not working with a fresh sql file (so far my testing was done with backup databases with existing data). Thanks for testing, I'll let you know when it works so that you can test it again if you so choose.
-Thanks for confirming, correct I didn't have a `.sql` file already in place but just made up that name. Happy to try again whenever u are ready!
\ No newline at end of file
+Thanks for confirming, correct I didn't have a `.sql` file already in place but just made up that name. Happy to try again whenever u are ready!
+
+---
+
+👤 **saood06** commented on **2025-06-30** at **19:54:11**
+
+> Thanks for confirming, correct I didn't have a `.sql` file already in place but just made up that name. Happy to try again whenever u are ready!
+
+Just pushed a fix. ( The issue was with something that is on my to-do list to refactor and potentially remove but for now a quick fix for the code as is).
+
+Edit: The fix is in the html only so no compile or even relaunch needed just a reload should fix it
+
+---
+
+👤 **ubergarm** commented on **2025-06-30** at **22:34:56**
+
+@saood06
+
+Aye! It fired right up this time and I was able to play with it a little and have a successful generation. It is cool how it I can mouse over the tokens to see the probabilities!
+
+
+
+---
+
+👤 **saood06** commented on **2025-06-30** at **22:40:08**
+
+> Aye! It fired right up this time and I was able to play with it a little and have a successful generation.
+
+Nice.
+
+>It is cool how it I can mouse over the tokens to see the probabilities!
+
+Yes, I like to turn on the "Color by probability" to be able to see low probability tokens at a glance.
+
+It might also be useful to you for benchmarking quants or models (saving and cloning prompts).
+
+---
+
+👤 **ikawrakow** commented on **2025-07-02** at **08:09:57**
+
+This is getting surprisingly little testing. Nevertheless we can merge whenever @saood06 feels it is ready and removes the "draft" label.
+
+---
+
+👤 **saood06** commented on **2025-07-25** at **00:55:21**
+
+I am looking to get some feedback on the UI I added for the new features.
+
+I have yet to push a commit with it because although most things are functional, there are still bugs and some missing functionality.
+
+Managing prompts from disk cache:
+
+
+The left panel is adjustable in width, and the entire thing is adjustable in height. (Note: total width is fixed, but total height is adjustable).
+
+The save slots tab (which was meant to manage prompts from the slots/memory as opposed to disk) is yet to be implemented (and may be pushed to the roadmap as restoring from disk does not update the `/slots` endpoint and not sure how difficult it will be to fix that bug).
+
+Renaming the saves is also planned but not currently implemented.
+
+The icons for sorting are in order: name, token count, file size, and modified date. I hope the icons make that clear but they also say what they do on hover:
+
+
+
+The sidebar which includes the button to open what is shown above alongside Database compression management:
+
+
+
+The enable button will swap to an update button once you enable compression, but that has yet to be implemented (in either the front or back end).
+
+Custom button when clicked shows this:
+
+
+
+
+
+@Downtown-Case would you mind giving me your opinion?
\ No newline at end of file
diff --git a/github-data/pull_requests/559 - Use cuBLAS for large batches and quants with block size 16.md b/github-data/pull_requests/559 - Use cuBLAS for large batches and quants with block size 16.md
index a4e9857e0..c8254c501 100644
--- a/github-data/pull_requests/559 - Use cuBLAS for large batches and quants with block size 16.md
+++ b/github-data/pull_requests/559 - Use cuBLAS for large batches and quants with block size 16.md
@@ -1,16 +1,19 @@
-### 🔀 [#559](https://github.com/ikawrakow/ik_llama.cpp/pull/559) - Use cuBLAS for large batches and quants with block size 16
+## 🔀 [Pull Request #559](https://github.com/ikawrakow/ik_llama.cpp/pull/559) - Use cuBLAS for large batches and quants with block size 16
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/mmq_to_cublas` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-26 |
| **Updated** | 2025-07-02 |
+| **Merged** | 2025-06-27 |
---
-#### Description
+## 📄 Description
-While working on #557 I noticed that dequantize+cuBLAS is faster than MMQ for the `iqX_k_r4` quants when the batch size is larger than some threshold.
+While working on [#557](https://github.com/ikawrakow/ik_llama.cpp/issues/557) I noticed that dequantize+cuBLAS is faster than MMQ for the `iqX_k_r4` quants when the batch size is larger than some threshold.
The same applies to all quantization types with block size of 16: `Q2_K, Q3_K, Q6_K, IQ2_XS, IQ2_S, IQ2_K, IQ3_K, IQ4_K, IQ5_K, IQ6_K`. Hence, this PR changes the `ggml_cuda_should_use_mmq()` function to return `false` if the batch size (number of rows in the right matrix) is greater than some quantization type specific threshold.
@@ -20,9 +23,9 @@ This graph illustrates the PP performance improvement achieved this way for k-qu
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ewhacc** commented the **2025-06-26** at **20:12:34**:
+👤 **ewhacc** commented on **2025-06-26** at **20:12:34**
I tried this "build = 3773 (3dbc8437)" on ubergam's DeepSeek-R1-0528-GGUF IQ2_K_R4 with -b 4096 -ub 4096.
Getting no difference on PP speed, compared to "build = 3762 (1843ed22)".
@@ -39,10 +42,11 @@ cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML
---
-👤 **ubergarm** commented the **2025-06-26** at **20:19:59**:
+👤 **ubergarm** commented on **2025-06-26** at **20:19:59**
@ewhacc
-@ewhacc
+
+*EDIT* wait, your old test was on 1843ed2 which was *before* PR557 was merged?? huh, i would imagine you would see some speed boost. compare against the commands i'm using below to see if something else is up?
Yeah, the speed boosts specific to IQ2_K_R4 and IQ3_K_R4 quantizations (in the quan you mention) were *already* added in PR557. This PR is doing a similar thing for some *other* quant types like Q2_K etc.
@@ -52,7 +56,7 @@ I just did another test for PR557 using this git sha, which is a bit confusing a
-👈
+👈compile, llama-sweep-bench, data
```bash
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON
@@ -163,7 +167,48 @@ llama_model_loader: - type iq3_k_r4: 58 tensors
---
-👤 **ikawrakow** commented the **2025-06-27** at **06:40:41**:
+👤 **Panchovix** commented on **2025-06-26** at **22:14:16**
+
+Noob question and sorry to ask here, but does this PR apply to sub k quants? Like q2_k_s, q3_k_m, q4_k_l, q5_k_xl, etc
+
+---
+
+👤 **ubergarm** commented on **2025-06-27** at **00:55:18**
+
+@ewhacc
+
+I thought about it some more, and both this PR559 and PR557 only apply when the mentioned quantized tensors are running on CUDA. So for my quant that you mention, the `IQ2_K_R4` only the `ffn_(gate|down|up)_exps` tensors are quantized at one of those involved in these PRs.
+
+So to see the speed boost you have to use offload more of those specific layers onto CUDA e.g. `-ot "blk\.(3|4|5|6|7|8|9|10|11|12)\.ffn_.*=CUDA0"`. If you're not offloading more of those layers, then you would see the same speeds.
+
+This kinda ties into @Panchovix great question, and I'd love to do a video called "What is in a quant?" to explain better, because it is pretty confusing until you dig into it with either `./gguf-py/scripts/gguf_dump.py` or more simply looking at the hugging face side-bar e.g. here for a specific example: [bartowski's DeepSeek-R1-0528-Q3_K_M](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-R1-0528-GGUF?show_file_info=deepseek-ai_DeepSeek-R1-0528-Q3_K_M%2Fdeepseek-ai_DeepSeek-R1-0528-Q3_K_M-00001-of-00008.gguf)
+
+You see it has the filename `Q3_K_M`, but when you scroll down and look at the tensors, not *every* tensor is quantized at Q3_K_M. Also for unsloth you'll *never* see a tensor quantized with `UD-Q4_K_XL` as that is not even a real thing.
+
+> Like q2_k_s, q3_k_m, q4_k_l, q5_k_xl, etc
+
+So things like `Q2_K_S` are *both* of an actual quantization type and also a pre-defined recipe according to [llama-quantize](https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md) code. Things like `XL` or even my `-mix-` prefix are kind of naming conventions for recipes that suggest which tensors might be a little bigger or smaller but "mostly" around that size. I like to joke about the mythical 8.5 BPW `IQ1_S_XXXL` for example which is just an absurd extension of this basic naming conventions.
+
+Personally I don't follow the conventions established in llama-quantize and pretty much always override everything with whatever I want to use. So when you start my `IQ2_K_R4` llama-server will print out:
+
+```
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q5_0: 61 tensors - attn_k_b
+llama_model_loader: - type iq4_ks: 116 tensors - ffn_(gate|up)_shexp
+llama_model_loader: - type iq5_ks: 435 tensors token_embd,output,ffn_down_shexp,first 3 ffn_(down|gate|up),remaining attn_
+llama_model_loader: - type iq2_k_r4: 116 tensors ffn_(gate|up)_exps
+llama_model_loader: - type iq3_k_r4: 58 tensors ffn_down_exps
+```
+
+So there is a lot more going on under the hood than the name belies. My personal convention is to name the quant "recipe" filename after whatever the main `ffn_(gate|up)` tensors are quantized.
+
+To keep it relevant to this PR, you need to look inside your gguf and see if any of the mentioned quantizations types apply to tensors which you are running on CUDA.
+
+Cheers!
+
+---
+
+👤 **ikawrakow** commented on **2025-06-27** at **06:40:41**
> Noob question and sorry to ask here, but does this PR apply to sub k quants? Like q2_k_s, q3_k_m, q4_k_l, q5_k_xl, etc
@@ -171,7 +216,7 @@ I know this is confusing. Users specify the quantization with a llama type (`Q2_
---
-👤 **ikawrakow** commented the **2025-06-27** at **07:02:46**:
+👤 **ikawrakow** commented on **2025-06-27** at **07:02:46**
Performance impact is easier to test with a dense model. For a MoE model such as DeepSeek-R1/V3, even at a batch size of 4096 tokens, experts process on average just 128 tokens, so still far away from the point where the transition to dequantize+cuBLAS occurs. Most of the self attention computations are within the FA implementation, which does not use the regular matrix multiplications, so there are just a few matrix multiplications left that get affected, but they usually take a small fraction of the overall calculation, so impact is negligible (and, as pointed out by @ubergarm, the test done by @ewhacc is not affected by this PR).
@@ -179,7 +224,7 @@ But if you are running a dense model with partial offload, you will want to have
---
-👤 **ikawrakow** commented the **2025-06-27** at **07:26:28**:
+👤 **ikawrakow** commented on **2025-06-27** at **07:26:28**
Here is an example illustrating my previous post. Running LlaMA-3.1-70B quantized with `Q2_K_S` on my paltry RTX-4080 with 16 GB VRAM:
@@ -194,13 +239,13 @@ I have uploaded only 30 out of 80 layers to the GPU so I can run with the larger
---
-👤 **ubergarm** commented the **2025-06-27** at **14:03:24**:
+👤 **ubergarm** commented on **2025-06-27** at **14:03:24**
Okay, I made a few Qwen3-14B dense "pure" quants (q4_K token_embd, q6_K output "head") and seeing roughly 1.4x speedup on PP with this PR over main for `-ub 4096 -b 4096` batch sizes.
This is great and really changes things given `iq4_k` and `iq5_k` are now *faster* than the `ks` counterparts as shown in the graph:
-
+
@@ -274,17 +319,19 @@ CUDA_VISIBLE_DEVICES="0" \
| 4096 | 1024 | 12288 | 2.779 | 1473.69 | 24.981 | 40.99 |
| 4096 | 1024 | 16384 | 3.042 | 1346.62 | 27.103 | 37.78 |
-
+
+
+I didn't check the other remaining quantization types with block size of 16.
---
-👤 **ikawrakow** commented the **2025-06-27** at **14:17:17**:
+👤 **ikawrakow** commented on **2025-06-27** at **14:17:17**
Before you throw these quants away, try `-b 2048 -ub 512` and `b 2048 -ub 1024`.
---
-👤 **ubergarm** commented the **2025-06-27** at **14:22:59**:
+👤 **ubergarm** commented on **2025-06-27** at **14:22:59**
Sure thing.
@@ -292,38 +339,417 @@ Also it is interesting now that q6_K is a little faster PP than q4_K at 4096 ub/
---
-👤 **ubergarm** commented the **2025-06-27** at **14:39:13**:
+👤 **ubergarm** commented on **2025-06-27** at **14:28:42**
+
+
+
+---
+
+👤 **ubergarm** commented on **2025-06-27** at **14:39:13**
+
+So for IQ4_K the sweet spot is closer to -ub 2048 -b 2048
+*NOTE*: TITLE is wrong on this, leftover from before, this is *only* ik fork:
+
+
+
+
+data
+
+## IQ4_K PR559@3dbc8437 -ub 4096 -b 4096
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 4096 | 1024 | 0 | 1.397 | 2931.10 | 16.804 | 60.94 |
+| 4096 | 1024 | 4096 | 1.664 | 2461.65 | 19.088 | 53.65 |
+| 4096 | 1024 | 8192 | 1.931 | 2121.11 | 21.343 | 47.98 |
+| 4096 | 1024 | 12288 | 2.195 | 1865.99 | 23.547 | 43.49 |
+| 4096 | 1024 | 16384 | 2.462 | 1663.59 | 25.710 | 39.83 |
+
+## IQ4_K PR559@3dbc8437 -ub 2048 -b 2048
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 2048 | 512 | 0 | 0.656 | 3121.27 | 8.250 | 62.06 |
+| 2048 | 512 | 2048 | 0.717 | 2855.02 | 8.806 | 58.14 |
+| 2048 | 512 | 4096 | 0.782 | 2617.76 | 9.391 | 54.52 |
+| 2048 | 512 | 6144 | 0.853 | 2400.71 | 9.962 | 51.40 |
+| 2048 | 512 | 8192 | 0.922 | 2221.92 | 10.529 | 48.63 |
+| 2048 | 512 | 10240 | 0.994 | 2059.88 | 11.085 | 46.19 |
+| 2048 | 512 | 12288 | 1.059 | 1934.63 | 11.654 | 43.93 |
+| 2048 | 512 | 14336 | 1.122 | 1825.66 | 12.197 | 41.98 |
+| 2048 | 512 | 16384 | 1.188 | 1723.89 | 12.727 | 40.23 |
+
+## IQ4_K PR559@3dbc8437 -ub 1024 -b 2048
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 1024 | 256 | 0 | 0.349 | 2933.38 | 4.174 | 61.33 |
+| 1024 | 256 | 1024 | 0.363 | 2817.92 | 4.326 | 59.18 |
+| 1024 | 256 | 2048 | 0.379 | 2701.71 | 4.466 | 57.32 |
+| 1024 | 256 | 3072 | 0.395 | 2592.14 | 4.614 | 55.48 |
+| 1024 | 256 | 4096 | 0.409 | 2503.88 | 4.753 | 53.86 |
+| 1024 | 256 | 5120 | 0.423 | 2418.81 | 4.890 | 52.35 |
+| 1024 | 256 | 6144 | 0.440 | 2325.79 | 5.044 | 50.76 |
+| 1024 | 256 | 7168 | 0.455 | 2251.27 | 5.180 | 49.42 |
+| 1024 | 256 | 8192 | 0.470 | 2179.03 | 5.318 | 48.14 |
+| 1024 | 256 | 9216 | 0.486 | 2107.66 | 5.455 | 46.93 |
+| 1024 | 256 | 10240 | 0.502 | 2041.52 | 5.588 | 45.81 |
+| 1024 | 256 | 11264 | 0.519 | 1973.34 | 5.729 | 44.68 |
+| 1024 | 256 | 12288 | 0.537 | 1908.08 | 5.866 | 43.64 |
+| 1024 | 256 | 13312 | 0.551 | 1859.53 | 5.998 | 42.68 |
+| 1024 | 256 | 14336 | 0.568 | 1804.21 | 6.129 | 41.77 |
+| 1024 | 256 | 15360 | 0.584 | 1753.24 | 6.264 | 40.87 |
+| 1024 | 256 | 16384 | 0.602 | 1701.41 | 6.397 | 40.02 |
+
+## IQ4_K PR559@3dbc8437 -ub 512 -b 2048
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.214 | 2391.80 | 2.070 | 61.83 |
+| 512 | 128 | 512 | 0.216 | 2365.51 | 2.096 | 61.08 |
+| 512 | 128 | 1024 | 0.221 | 2318.91 | 2.133 | 60.02 |
+| 512 | 128 | 1536 | 0.224 | 2280.78 | 2.166 | 59.09 |
+| 512 | 128 | 2048 | 0.230 | 2229.61 | 2.211 | 57.88 |
+| 512 | 128 | 2560 | 0.233 | 2199.67 | 2.246 | 57.00 |
+| 512 | 128 | 3072 | 0.237 | 2159.36 | 2.284 | 56.04 |
+| 512 | 128 | 3584 | 0.242 | 2112.85 | 2.315 | 55.30 |
+| 512 | 128 | 4096 | 0.245 | 2087.07 | 2.357 | 54.30 |
+| 512 | 128 | 4608 | 0.249 | 2053.19 | 2.390 | 53.55 |
+| 512 | 128 | 5120 | 0.253 | 2022.21 | 2.427 | 52.74 |
+| 512 | 128 | 5632 | 0.258 | 1983.98 | 2.460 | 52.02 |
+| 512 | 128 | 6144 | 0.262 | 1951.78 | 2.498 | 51.23 |
+| 512 | 128 | 6656 | 0.265 | 1930.62 | 2.536 | 50.48 |
+| 512 | 128 | 7168 | 0.269 | 1903.37 | 2.571 | 49.79 |
+| 512 | 128 | 7680 | 0.274 | 1868.29 | 2.607 | 49.10 |
+| 512 | 128 | 8192 | 0.277 | 1845.98 | 2.639 | 48.50 |
+| 512 | 128 | 8704 | 0.281 | 1821.41 | 2.678 | 47.79 |
+| 512 | 128 | 9216 | 0.285 | 1799.54 | 2.715 | 47.15 |
+| 512 | 128 | 9728 | 0.289 | 1773.15 | 2.747 | 46.60 |
+| 512 | 128 | 10240 | 0.292 | 1750.98 | 2.784 | 45.97 |
+| 512 | 128 | 10752 | 0.297 | 1726.16 | 2.820 | 45.38 |
+| 512 | 128 | 11264 | 0.301 | 1699.55 | 2.858 | 44.79 |
+| 512 | 128 | 11776 | 0.305 | 1678.82 | 2.890 | 44.29 |
+| 512 | 128 | 12288 | 0.308 | 1662.74 | 2.924 | 43.78 |
+| 512 | 128 | 12800 | 0.314 | 1629.31 | 2.959 | 43.26 |
+| 512 | 128 | 13312 | 0.316 | 1620.90 | 2.992 | 42.78 |
+| 512 | 128 | 13824 | 0.321 | 1594.61 | 3.026 | 42.30 |
+| 512 | 128 | 14336 | 0.323 | 1582.96 | 3.058 | 41.86 |
+| 512 | 128 | 14848 | 0.327 | 1563.48 | 3.092 | 41.39 |
+| 512 | 128 | 15360 | 0.333 | 1537.87 | 3.125 | 40.96 |
+| 512 | 128 | 15872 | 0.336 | 1523.45 | 3.160 | 40.51 |
+| 512 | 128 | 16384 | 0.340 | 1505.35 | 3.194 | 40.08 |
+
+
+
+
+Again for IQ5_K I'm seeing a peak closer to -ub 2048.
+
+*NOTE* title is wrong on this next graph, forgot to update my script, this is *only* ik fork:
+
+
+
+
+IQ5_K data
+
+## IQ5_K PR559@3dbc8437 -ub 4096 -b 4096
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 4096 | 1024 | 0 | 1.425 | 2873.91 | 18.492 | 55.37 |
+| 4096 | 1024 | 4096 | 1.691 | 2422.55 | 20.701 | 49.47 |
+| 4096 | 1024 | 8192 | 1.949 | 2101.61 | 22.837 | 44.84 |
+| 4096 | 1024 | 12288 | 2.207 | 1856.22 | 24.911 | 41.11 |
+| 4096 | 1024 | 16384 | 2.476 | 1654.56 | 26.981 | 37.95 |
+
+## IQ5_K PR559@3dbc8437 -ub 2048 -b 2048
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 2048 | 512 | 0 | 0.659 | 3108.06 | 9.025 | 56.73 |
+| 2048 | 512 | 2048 | 0.717 | 2858.24 | 9.576 | 53.46 |
+| 2048 | 512 | 4096 | 0.782 | 2617.54 | 10.101 | 50.69 |
+| 2048 | 512 | 6144 | 0.851 | 2407.13 | 10.648 | 48.08 |
+| 2048 | 512 | 8192 | 0.924 | 2216.49 | 11.196 | 45.73 |
+| 2048 | 512 | 10240 | 0.994 | 2060.10 | 11.739 | 43.62 |
+| 2048 | 512 | 12288 | 1.060 | 1932.34 | 12.290 | 41.66 |
+| 2048 | 512 | 14336 | 1.128 | 1815.37 | 12.877 | 39.76 |
+| 2048 | 512 | 16384 | 1.193 | 1716.14 | 13.405 | 38.19 |
+
+## IQ5_K PR559@3dbc8437 -ub 1024 -b 2048
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 1024 | 256 | 0 | 0.355 | 2881.92 | 4.574 | 55.97 |
+| 1024 | 256 | 1024 | 0.370 | 2769.15 | 4.713 | 54.32 |
+| 1024 | 256 | 2048 | 0.385 | 2657.51 | 4.851 | 52.77 |
+| 1024 | 256 | 3072 | 0.399 | 2565.03 | 4.985 | 51.35 |
+| 1024 | 256 | 4096 | 0.415 | 2469.14 | 5.121 | 49.99 |
+| 1024 | 256 | 5120 | 0.429 | 2384.88 | 5.268 | 48.60 |
+| 1024 | 256 | 6144 | 0.446 | 2298.44 | 5.410 | 47.32 |
+| 1024 | 256 | 7168 | 0.460 | 2225.92 | 5.532 | 46.28 |
+| 1024 | 256 | 8192 | 0.475 | 2155.30 | 5.658 | 45.25 |
+| 1024 | 256 | 9216 | 0.491 | 2083.92 | 5.793 | 44.19 |
+| 1024 | 256 | 10240 | 0.507 | 2021.02 | 5.929 | 43.18 |
+| 1024 | 256 | 11264 | 0.523 | 1957.58 | 6.059 | 42.25 |
+| 1024 | 256 | 12288 | 0.541 | 1893.88 | 6.187 | 41.38 |
+| 1024 | 256 | 13312 | 0.555 | 1845.18 | 6.320 | 40.50 |
+| 1024 | 256 | 14336 | 0.572 | 1790.66 | 6.480 | 39.50 |
+| 1024 | 256 | 15360 | 0.588 | 1740.57 | 6.618 | 38.68 |
+| 1024 | 256 | 16384 | 0.606 | 1690.16 | 6.745 | 37.96 |
+
+## IQ5_K PR559@3dbc8437 -ub 512 -b 2048
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.223 | 2298.96 | 2.285 | 56.01 |
+| 512 | 128 | 512 | 0.226 | 2261.09 | 2.314 | 55.31 |
+| 512 | 128 | 1024 | 0.230 | 2226.23 | 2.350 | 54.47 |
+| 512 | 128 | 1536 | 0.233 | 2194.14 | 2.390 | 53.56 |
+| 512 | 128 | 2048 | 0.237 | 2158.63 | 2.422 | 52.84 |
+| 512 | 128 | 2560 | 0.242 | 2118.92 | 2.455 | 52.15 |
+| 512 | 128 | 3072 | 0.245 | 2088.94 | 2.489 | 51.43 |
+| 512 | 128 | 3584 | 0.250 | 2044.38 | 2.523 | 50.73 |
+| 512 | 128 | 4096 | 0.252 | 2028.45 | 2.562 | 49.96 |
+| 512 | 128 | 4608 | 0.258 | 1983.71 | 2.596 | 49.30 |
+| 512 | 128 | 5120 | 0.261 | 1958.57 | 2.627 | 48.73 |
+| 512 | 128 | 5632 | 0.265 | 1932.14 | 2.659 | 48.14 |
+| 512 | 128 | 6144 | 0.271 | 1890.02 | 2.692 | 47.54 |
+| 512 | 128 | 6656 | 0.273 | 1875.44 | 2.724 | 46.98 |
+| 512 | 128 | 7168 | 0.276 | 1853.02 | 2.762 | 46.34 |
+| 512 | 128 | 7680 | 0.281 | 1822.79 | 2.795 | 45.80 |
+| 512 | 128 | 8192 | 0.285 | 1797.94 | 2.826 | 45.29 |
+| 512 | 128 | 8704 | 0.289 | 1773.17 | 2.860 | 44.75 |
+| 512 | 128 | 9216 | 0.292 | 1750.49 | 2.903 | 44.09 |
+| 512 | 128 | 9728 | 0.297 | 1722.68 | 2.938 | 43.56 |
+| 512 | 128 | 10240 | 0.300 | 1708.24 | 2.971 | 43.08 |
+| 512 | 128 | 10752 | 0.303 | 1691.26 | 3.007 | 42.56 |
+| 512 | 128 | 11264 | 0.308 | 1663.54 | 3.039 | 42.12 |
+| 512 | 128 | 11776 | 0.312 | 1641.40 | 3.073 | 41.66 |
+| 512 | 128 | 12288 | 0.315 | 1625.32 | 3.101 | 41.27 |
+| 512 | 128 | 12800 | 0.320 | 1601.04 | 3.142 | 40.73 |
+| 512 | 128 | 13312 | 0.323 | 1586.34 | 3.177 | 40.29 |
+| 512 | 128 | 13824 | 0.326 | 1569.61 | 3.208 | 39.90 |
+| 512 | 128 | 14336 | 0.331 | 1545.56 | 3.241 | 39.49 |
+| 512 | 128 | 14848 | 0.334 | 1531.33 | 3.274 | 39.10 |
+| 512 | 128 | 15360 | 0.338 | 1514.19 | 3.310 | 38.67 |
+| 512 | 128 | 15872 | 0.343 | 1493.92 | 3.342 | 38.30 |
+| 512 | 128 | 16384 | 0.347 | 1475.48 | 3.372 | 37.96 |
+
+
+
+---
+
+👤 **ikawrakow** commented on **2025-06-27** at **14:48:26**
-
+> So for IQ4_K the sweet spot is closer to -ub 2048 -b 2048
+
+I guess you have a higher end GPU. On my RTX-4080 for fully offloaded dense model the peak is somewhere between `-ub 512` and `-ub 1024`.
+
+But the `Q6_K` comparison with mainline is interesting. It means Johannes has improved the block-of-16 kernel, which here has an unreasonably low performance. I need to look into that. Can you try another block-of-16 quant that also works in mainline? (`Q2_K`, `Q3_K`, `IQ2_XS`, `IQ2_S` all have blocks of 16).
---
-👤 **ikawrakow** commented the **2025-06-27** at **15:34:33**:
+👤 **ubergarm** commented on **2025-06-27** at **15:17:24**
+
+> Can you try another block-of-16 quant that also works in mainline? (Q2_K, Q3_K, IQ2_XS, IQ2_S all have blocks of 16).
+
+Here is the IQ2_XS on both forks with `-ub 2048 -b 2048` as well as default values of `-ub 512 -b 2048`. The llama.cpp sha1 references my own fork with llama-sweep-bench rebased on top of recent `llama.cpp@8d94219a`
+
+
+
+
+
+data
+
+## IQ2_XS ik_llama.cpp@3dbc8437 -ub 2048 -b 2048
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 2048 | 512 | 0 | 0.674 | 3038.23 | 6.037 | 84.81 |
+| 2048 | 512 | 2048 | 0.729 | 2811.15 | 6.517 | 78.56 |
+| 2048 | 512 | 4096 | 0.792 | 2587.22 | 7.054 | 72.58 |
+| 2048 | 512 | 6144 | 0.855 | 2396.34 | 7.575 | 67.59 |
+| 2048 | 512 | 8192 | 0.927 | 2209.61 | 8.126 | 63.01 |
+| 2048 | 512 | 10240 | 0.999 | 2049.15 | 8.692 | 58.91 |
+| 2048 | 512 | 12288 | 1.067 | 1919.42 | 9.249 | 55.36 |
+| 2048 | 512 | 14336 | 1.133 | 1808.12 | 9.809 | 52.20 |
+| 2048 | 512 | 16384 | 1.202 | 1703.61 | 10.355 | 49.45 |
+
+## IQ2_XS ik_llama.cpp@3dbc8437 -ub 512 -b 2048
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.205 | 2494.23 | 1.528 | 83.78 |
+| 512 | 128 | 512 | 0.209 | 2453.36 | 1.561 | 81.98 |
+| 512 | 128 | 1024 | 0.213 | 2400.31 | 1.597 | 80.13 |
+| 512 | 128 | 1536 | 0.216 | 2366.51 | 1.632 | 78.43 |
+| 512 | 128 | 2048 | 0.220 | 2329.54 | 1.669 | 76.67 |
+| 512 | 128 | 2560 | 0.224 | 2282.28 | 1.703 | 75.16 |
+| 512 | 128 | 3072 | 0.228 | 2241.91 | 1.736 | 73.75 |
+| 512 | 128 | 3584 | 0.231 | 2213.82 | 1.770 | 72.30 |
+| 512 | 128 | 4096 | 0.235 | 2175.32 | 1.806 | 70.89 |
+| 512 | 128 | 4608 | 0.240 | 2134.20 | 1.842 | 69.48 |
+| 512 | 128 | 5120 | 0.244 | 2099.82 | 1.876 | 68.23 |
+| 512 | 128 | 5632 | 0.247 | 2069.75 | 1.910 | 67.03 |
+| 512 | 128 | 6144 | 0.251 | 2037.92 | 1.943 | 65.88 |
+| 512 | 128 | 6656 | 0.255 | 2010.66 | 1.979 | 64.67 |
+| 512 | 128 | 7168 | 0.258 | 1982.53 | 2.011 | 63.65 |
+| 512 | 128 | 7680 | 0.262 | 1952.23 | 2.049 | 62.46 |
+| 512 | 128 | 8192 | 0.266 | 1924.80 | 2.081 | 61.52 |
+| 512 | 128 | 8704 | 0.271 | 1887.20 | 2.115 | 60.51 |
+| 512 | 128 | 9216 | 0.274 | 1866.07 | 2.149 | 59.57 |
+| 512 | 128 | 9728 | 0.279 | 1834.82 | 2.184 | 58.62 |
+| 512 | 128 | 10240 | 0.283 | 1811.73 | 2.217 | 57.73 |
+| 512 | 128 | 10752 | 0.286 | 1787.92 | 2.251 | 56.86 |
+| 512 | 128 | 11264 | 0.290 | 1766.16 | 2.285 | 56.03 |
+| 512 | 128 | 11776 | 0.295 | 1737.30 | 2.319 | 55.20 |
+| 512 | 128 | 12288 | 0.299 | 1715.08 | 2.352 | 54.43 |
+| 512 | 128 | 12800 | 0.304 | 1684.40 | 2.386 | 53.64 |
+| 512 | 128 | 13312 | 0.307 | 1667.90 | 2.417 | 52.96 |
+| 512 | 128 | 13824 | 0.310 | 1653.42 | 2.451 | 52.21 |
+| 512 | 128 | 14336 | 0.314 | 1630.45 | 2.485 | 51.51 |
+| 512 | 128 | 14848 | 0.319 | 1604.84 | 2.522 | 50.75 |
+| 512 | 128 | 15360 | 0.324 | 1582.47 | 2.553 | 50.13 |
+| 512 | 128 | 15872 | 0.327 | 1565.88 | 2.588 | 49.47 |
+| 512 | 128 | 16384 | 0.330 | 1549.55 | 2.622 | 48.81 |
+
+## IQ2_XS llama.cpp@6c510f3b -ub 2048 -b 2048
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 2048 | 512 | 0 | 0.880 | 2326.71 | 6.355 | 80.57 |
+| 2048 | 512 | 2048 | 0.942 | 2173.10 | 6.991 | 73.23 |
+| 2048 | 512 | 4096 | 1.012 | 2024.59 | 7.567 | 67.66 |
+| 2048 | 512 | 6144 | 1.084 | 1890.13 | 8.183 | 62.57 |
+| 2048 | 512 | 8192 | 1.157 | 1770.18 | 8.748 | 58.53 |
+| 2048 | 512 | 10240 | 1.228 | 1667.69 | 9.314 | 54.97 |
+| 2048 | 512 | 12288 | 1.297 | 1578.65 | 9.906 | 51.69 |
+| 2048 | 512 | 14336 | 1.363 | 1502.27 | 10.476 | 48.87 |
+| 2048 | 512 | 16384 | 1.432 | 1430.39 | 11.050 | 46.34 |
+
+## IQ2_XS llama.cpp@6c510f3b -ub 512 -b 2048
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.218 | 2344.71 | 1.566 | 81.73 |
+| 512 | 128 | 512 | 0.220 | 2322.95 | 1.609 | 79.56 |
+| 512 | 128 | 1024 | 0.222 | 2301.97 | 1.643 | 77.91 |
+| 512 | 128 | 1536 | 0.227 | 2259.38 | 1.675 | 76.40 |
+| 512 | 128 | 2048 | 0.230 | 2226.16 | 1.702 | 75.22 |
+| 512 | 128 | 2560 | 0.235 | 2181.74 | 1.744 | 73.40 |
+| 512 | 128 | 3072 | 0.238 | 2149.97 | 1.782 | 71.82 |
+| 512 | 128 | 3584 | 0.242 | 2116.66 | 1.810 | 70.73 |
+| 512 | 128 | 4096 | 0.245 | 2088.01 | 1.853 | 69.08 |
+| 512 | 128 | 4608 | 0.249 | 2058.57 | 1.890 | 67.72 |
+| 512 | 128 | 5120 | 0.253 | 2026.99 | 1.911 | 66.99 |
+| 512 | 128 | 5632 | 0.256 | 1998.94 | 1.956 | 65.44 |
+| 512 | 128 | 6144 | 0.261 | 1960.68 | 1.994 | 64.20 |
+| 512 | 128 | 6656 | 0.264 | 1936.94 | 2.027 | 63.15 |
+| 512 | 128 | 7168 | 0.270 | 1898.54 | 2.068 | 61.88 |
+| 512 | 128 | 7680 | 0.273 | 1877.47 | 2.105 | 60.79 |
+| 512 | 128 | 8192 | 0.276 | 1852.21 | 2.135 | 59.94 |
+| 512 | 128 | 8704 | 0.280 | 1825.56 | 2.177 | 58.81 |
+| 512 | 128 | 9216 | 0.286 | 1792.70 | 2.214 | 57.82 |
+| 512 | 128 | 9728 | 0.289 | 1772.26 | 2.247 | 56.97 |
+| 512 | 128 | 10240 | 0.293 | 1747.48 | 2.291 | 55.87 |
+| 512 | 128 | 10752 | 0.298 | 1719.77 | 2.327 | 55.00 |
+| 512 | 128 | 11264 | 0.302 | 1694.59 | 2.356 | 54.33 |
+| 512 | 128 | 11776 | 0.305 | 1678.35 | 2.401 | 53.32 |
+| 512 | 128 | 12288 | 0.310 | 1653.56 | 2.436 | 52.55 |
+| 512 | 128 | 12800 | 0.316 | 1620.14 | 2.466 | 51.91 |
+| 512 | 128 | 13312 | 0.318 | 1608.49 | 2.513 | 50.94 |
+| 512 | 128 | 13824 | 0.323 | 1587.40 | 2.552 | 50.16 |
+| 512 | 128 | 14336 | 0.326 | 1568.80 | 2.577 | 49.67 |
+| 512 | 128 | 14848 | 0.332 | 1539.96 | 2.624 | 48.78 |
+| 512 | 128 | 15360 | 0.336 | 1523.84 | 2.663 | 48.06 |
+| 512 | 128 | 15872 | 0.339 | 1510.49 | 2.693 | 47.54 |
+| 512 | 128 | 16384 | 0.344 | 1490.30 | 2.732 | 46.85 |
+
+
+
+
+Interestingly mainline llama.cpp is slightly *slower* when increasing ubatch size over default.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-27** at **15:24:19**
+
+Oops, sorry, I misread your `Q6_K` graph, thinking `llama.cpp` has somehow become faster. So, nothing new under the sun.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-27** at **15:34:33**
So, the A6000 has more memory bandwidth than the 4080. This shifts things in favor of dequantize+cuBLAS because the dequantize step is memory bound, so it is quicker on the A6000. I guess this is why with `-ub 4096` `IQ4_K` outperforms `IQ4_KS`. I guess, I should look into making the thresholds at which the transitions between MMQ and dequantize+cuBLAS happens configurable. But I'll leave this for another PR.
---
-👤 **ikawrakow** commented the **2025-06-27** at **15:43:44**:
+👤 **ikawrakow** commented on **2025-06-27** at **15:43:44**
Based on @ubergarm's and my own testing this PR looks like a winner, so merging.
---
-👤 **ikawrakow** commented the **2025-06-29** at **16:28:37**:
+👤 **ewhacc** commented on **2025-06-27** at **17:55:39**
+
+@ubergarm
+
+Hi, this is u/smflx in reddit. Thanks a lot for detailed reply. :)
+
+Yes, the old test was on 1843ed2 which is little before PR557. This, PR557, PR559 are all the same PP speed of 272 t/s. Yes, it was boosted recently. If the boost specific to IQ2_K_R4 is already added, it's understandable.
+
+In your graph showing the boost on IQ2_K_R4, main is before PR557. Right? My PP speed of 272 t/s is similar to your S_PP 276.35 t/s. So, it seems OK. I will check llama-sweep-bench later. Thanks a lot for the the guide.
+
+My setup is about the same except : -DGGML_CUDA_F16=ON , -ctk q16_0
+
+```
+cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
+
+CUDA_VISIBLE_DEVICES="0" \
+ik_llama.cpp/build/bin/llama-server --model $model_path \
+ --ctx-size 98304 \
+ -mla 3 -fa -amb 512 -fmoe \
+ -b 4096 -ub 4096 \
+ --n-gpu-layers 63 \
+ -ot "blk\.(3|4|5|6|7)\.ffn_.*=CUDA0" \
+ --override-tensor exps=CPU \
+ --parallel 2 --threads 32
+
+CUDA_VISIBLE_DEVICES="0,1" \
+ik_llama.cpp/build/bin/llama-server --model $model_path \
+ --ctx-size 98304 \
+ -mla 3 -fa -amb 512 -fmoe \
+ -b 4096 -ub 4096 \
+ --n-gpu-layers 63 \
+ -ot "blk\.(3|4|5|6|7|8|9)\.ffn_.*=CUDA0" \
+ -ot "blk\.1(0|1|2|3|4|5|6)\.ffn_.*=CUDA1" \
+ --override-tensor exps=CPU \
+ --parallel 2 --threads 32
+```
+I'm using 6000ada, but I think the speed will be the same to a6000. GPUs are not fully utilized. I guess PCIe speed is bottleneck.
+
+---
+
+👤 **ewhacc** commented on **2025-06-29** at **09:07:29**
+
+@ubergarm
+
+I have tested with the same llama-sweep-bench setup you provided on my rig.
+Epyc 9534 + 2x 6000ada
+
+I just changed the thread count to '--threads 32', which is optimal for 9534.
+Also tested with ' -ctk f16', which I use. The speed is the same. (But, 2x KV in VRAM)
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 4096 | 1024 | 0 | 10.274 | 398.66 | 48.584 | 21.08 |
+| 4096 | 1024 | 4096 | 11.003 | 372.26 | 50.596 | 20.24 |
+| 4096 | 1024 | 8192 | 11.893 | 344.41 | 51.931 | 19.72 |
+
+---
+
+👤 **ikawrakow** commented on **2025-06-29** at **16:28:37**
These performance results look pretty good to me. Has anyone ever reported a better result for hybrid GPU/CPU DeepSeek-R1/V3 inference?
---
-👤 **Panchovix** commented the **2025-06-30** at **20:58:35**:
+👤 **Panchovix** commented on **2025-06-30** at **20:58:35**
Haven't managed to test much as I accidentaly wiped my Fedora installation from Windows lol. But I was testing with llama sweep bench and got one error, but can't remember exactly the error, and/or is related to this PR.
I have just saved at how I run the model, which is
```
-./llama-server -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-merged.gguf' -c 32768 --no-mmap -ngl 999 \
+./llama-sweep-bench -m '/models_llm/DeepSeek-V3-0324-UD-Q3_K_XL-merged.gguf' -c 32768 --no-mmap -ngl 999 \
-ot "blk.(0|1|2|3|4|5|6|7).ffn.=CUDA0" \
-ot "blk.(8|9|10|11).ffn.=CUDA1" \
-ot "blk.(12|13|14|15).ffn.=CUDA2" \
@@ -339,11 +765,11 @@ I have just saved at how I run the model, which is
-fa -mg 0 -ub 2048 -mla 1
```
-I managed to see 200 t/s PP and 8.73 t/s TG, but then got a error. Again I will try to update when I get Linux installed again, as offloading + multigpu is just not worth it on Windows.
+I managed to see 200 t/s PP and 8.73 t/s TG, but then got a error. Again I will try to update when I get Linux installed again, as offloading + multigpu is just not worth it on Windows, speeds are way worse.
---
-👤 **Panchovix** commented the **2025-07-01** at **15:43:44**:
+👤 **Panchovix** commented on **2025-07-01** at **15:43:44**
Okay finally installed Fedora yesterday, testing remotely now so it is a bit slower (I'm using software encoding and it uses 2-3 threads)
@@ -369,4 +795,21 @@ CUDA error: an illegal memory access was encountered
/run/media/pancho/60A2FCEDA2FCC894/ChatIAs/ik_llama.cpp/ggml/src/ggml-cuda.cu:110: CUDA error
```
-WIth the same command as above. Sometimes it also crashes with another cuda error but still have to get it again. Again, not sure what is related to.
\ No newline at end of file
+WIth the same command as above. Sometimes it also crashes with another cuda error but still have to get it again. Again, not sure what is related to.
+
+---
+
+👤 **Panchovix** commented on **2025-07-02** at **20:12:03**
+
+Okay finally got the other error.
+
+```
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 2048 | 512 | 0 | 9.166 | 223.43 | 56.876 | 9.00 |
+| 2048 | 512 | 2048 | 9.549 | 214.48 | 57.088 | 8.97 |
+| 2048 | 512 | 4096 | 10.041 | 203.96 | 57.929 | 8.84 |
+| 2048 | 512 | 6144 | 10.534 | 194.42 | 58.584 | 8.74 |
+Oops(ggml_compute_forward_sum_rows_f32, ffn_moe_weights_sum-60): found nan for i1 = 0, i2 = 0, i3 = 0. ne00 = 8
+```
+Sorry for the spam, gonna raise an issue, but I still don't know how to replicate it always.
\ No newline at end of file
diff --git a/github-data/pull_requests/56 - BF16 support on Metal.md b/github-data/pull_requests/56 - BF16 support on Metal.md
index c1e06f23d..12727ea4d 100644
--- a/github-data/pull_requests/56 - BF16 support on Metal.md
+++ b/github-data/pull_requests/56 - BF16 support on Metal.md
@@ -1,14 +1,17 @@
-### 🔀 [#56](https://github.com/ikawrakow/ik_llama.cpp/pull/56) - BF16 support on Metal
+## 🔀 [Pull Request #56](https://github.com/ikawrakow/ik_llama.cpp/pull/56) - BF16 support on Metal
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/metal_bf16` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-16 |
| **Updated** | 2024-09-17 |
+| **Merged** | 2024-09-17 |
---
-#### Description
+## 📄 Description
It is slightly slower than `fp16`, but definitely a massive improvement compared to not having `bf16` support at al. ~Didn't put any effort into optimizing the matrix x vector kernel, so it is likely one can improve `bf16` TG performance~.
diff --git a/github-data/pull_requests/560 - Remove what appears to be unnecessary asserts in ggml_cuda_cpy.md b/github-data/pull_requests/560 - Remove what appears to be unnecessary asserts in ggml_cuda_cpy.md
index 89bf30e35..88ce610d4 100644
--- a/github-data/pull_requests/560 - Remove what appears to be unnecessary asserts in ggml_cuda_cpy.md
+++ b/github-data/pull_requests/560 - Remove what appears to be unnecessary asserts in ggml_cuda_cpy.md
@@ -1,18 +1,21 @@
-### 🔀 [#560](https://github.com/ikawrakow/ik_llama.cpp/pull/560) - Remove what appears to be unnecessary asserts in ggml_cuda_cpy
+## 🔀 [Pull Request #560](https://github.com/ikawrakow/ik_llama.cpp/pull/560) - Remove what appears to be unnecessary asserts in ggml_cuda_cpy
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_large_cpy` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-26 |
| **Updated** | 2025-06-27 |
+| **Merged** | 2025-06-27 |
---
-#### Description
+## 📄 Description
Not sure why the assert were there as it seems the code should handle tensor sizes greater than `INT_MAX`.
-The funny part is that the assert is triggered when copying the KQ mask! I was able to trigger it using batch/u-batch of 16k tokens with a context of 32k tokens. Which means I should resurrect PR #28 as it is kind of ridiculous to be copying over 2 GB of data from the CPU to the GPU that could be 16X smaller if one used 1 bit per mask entry instead of a `fp16` value (or even `fp32` if not using FA).
+The funny part is that the assert is triggered when copying the KQ mask! I was able to trigger it using batch/u-batch of 16k tokens with a context of 32k tokens. Which means I should resurrect PR [#28](https://github.com/ikawrakow/ik_llama.cpp/issues/28) as it is kind of ridiculous to be copying over 2 GB of data from the CPU to the GPU that could be 16X smaller if one used 1 bit per mask entry instead of a `fp16` value (or even `fp32` if not using FA).
After removing the assert everything seems to work fine.
@@ -20,12 +23,18 @@ But please test!
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2025-06-27** at **15:29:27**:
+👤 **Nexesenex** commented on **2025-06-27** at **15:29:27**
I merged this on my Croco.
My short benching session ok.
On Wizard 8x22B, 55/57 tensors offloaded on 3 different GPUs, and NKVO activated, no problem of corrupted inference.
And no losses of performances either.
-Same goes on Miqu 70b full offload on triple GPU.
\ No newline at end of file
+Same goes on Miqu 70b full offload on triple GPU.
+
+---
+
+👤 **ikawrakow** commented on **2025-06-27** at **15:44:31**
+
+Thanks for testing!
\ No newline at end of file
diff --git a/github-data/pull_requests/563 - Merge vulkan code from mainline up to commit of 6_28_2025.md b/github-data/pull_requests/563 - Merge vulkan code from mainline up to commit of 6282025.md
similarity index 66%
rename from github-data/pull_requests/563 - Merge vulkan code from mainline up to commit of 6_28_2025.md
rename to github-data/pull_requests/563 - Merge vulkan code from mainline up to commit of 6282025.md
index d53bcb493..2b0c2d4a5 100644
--- a/github-data/pull_requests/563 - Merge vulkan code from mainline up to commit of 6_28_2025.md
+++ b/github-data/pull_requests/563 - Merge vulkan code from mainline up to commit of 6282025.md
@@ -1,24 +1,27 @@
-### 🔀 [#563](https://github.com/ikawrakow/ik_llama.cpp/pull/563) - Merge vulkan code from mainline up to commit of 6/28/2025
+## 🔀 [Pull Request #563](https://github.com/ikawrakow/ik_llama.cpp/pull/563) - Merge vulkan code from mainline up to commit of 6/28/2025
| **Author** | `firecoperana` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `Merge_mainline_vulkan` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-29 |
| **Updated** | 2025-07-02 |
+| **Merged** | 2025-07-02 |
---
-#### Description
+## 📄 Description
-* Vulkan Optimizations and Fixes (#8959)
+* Vulkan Optimizations and Fixes ([#8959](https://github.com/ikawrakow/ik_llama.cpp/issues/8959))
* Optimize Vulkan REPEAT performance
.....................................................................................
-vulkan: lock accesses of pinned_memory vector (#14333)
+vulkan: lock accesses of pinned_memory vector ([#14333](https://github.com/ikawrakow/ik_llama.cpp/issues/14333))
-vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
+vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline ([#14378](https://github.com/ikawrakow/ik_llama.cpp/issues/14378))
Fix cuda build error
@@ -33,15 +36,15 @@ Fix cuda build error
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **firecoperana** commented the **2025-06-29** at **19:21:51**:
+👤 **firecoperana** commented on **2025-06-29** at **19:21:51**
Test Qwen 2.5 7B Q4_K_S and it runs fine, but for deepseek model, I was getting "GGGGGGG" output with -mla 1 -amb 512. Probably related to deepseek related optimization.
---
-👤 **ubergarm** commented the **2025-06-29** at **19:51:08**:
+👤 **ubergarm** commented on **2025-06-29** at **19:51:08**
For deepseek often one wants to compile with `-DGGML_CUDA_IQK_FORCE_BF16=1` to avoid overflowing fp16 accumulator which manifests as gibberish, nans, or `GGG` typically I believe.
@@ -53,17 +56,11 @@ Details inside:
👈 build command and logs
```bash
-# attempt to build clean despite it seems to still be using cmake cache? hah...
+# attempt to build clean
$ rm -rf ./build
-$ cmake -B build -DGGML_VULKAN=ON -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF GGML_CCACHE=OFF
+$ cmake -B build -DGGML_VULKAN=ON -DGGML_CUDA=OFF -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_CCACHE=OFF
$ cmake --build build --config Release -j $(nproc)
-CMake Warning:
- Ignoring extra path from command line:
-
- "GGML_CCACHE=OFF"
-
-
-- The C compiler identification is GNU 15.1.1
-- The CXX compiler identification is GNU 15.1.1
-- Detecting C compiler ABI info
@@ -89,7 +86,6 @@ CMake Warning:
-- Using llamafile
-- Found Vulkan: /lib/libvulkan.so (found version "1.4.313") found components: glslc glslangValidator
-- Vulkan found
--- ccache found, compilation results will be cached. Disable with GGML_CCACHE=OFF.
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- ARCH_FLAGS = -march=native
@@ -216,17 +212,97 @@ make[1]: *** [CMakeFiles/Makefile2:2044: ggml/src/CMakeFiles/ggml.dir/all] Error
make: *** [Makefile:146: all] Error 2
```
-
+
+
+*EDIT*
+
+fwiw i just forced that explicitly to `void *` and it compiles but then segfaults towards the end of starting up
+```
+#/ggml/src/ggml-backend.c ~around line 1020 or so
+- /* .clear = */ ggml_backend_multi_buffer_clear,
++ /* .clear = */ (GGML_CALL void *) ggml_backend_multi_buffer_clear, // ubergarm hack
+```
+
+---
+
+👤 **ikawrakow** started a conversation on `.github/workflows/build.yml` on **2025-06-30** at **06:48:44**
+
+I specifically removed all workflows, let's not put them back in.
+
+---
+
+👤 **ikawrakow** started a conversation on `.github/workflows/release.yml` on **2025-06-30** at **06:49:37**
+
+Same
+
+---
+
+👤 **ikawrakow** started a conversation on `ggml/include/ggml.h` on **2025-06-30** at **06:51:46**
+
+Let's not add stuff that is not related to the Vulkan back-end
+
+---
+
+👤 **ikawrakow** started a conversation on `ggml/include/ggml.h` on **2025-06-30** at **06:52:04**
+
+No new ops please
+
+---
+
+👤 **ikawrakow** started a conversation on `ggml/include/ggml.h` on **2025-06-30** at **06:52:43**
+
+No new ops please
+
+---
+
+👤 **ikawrakow** started a conversation on `ggml/src/ggml-alloc.c` on **2025-06-30** at **06:53:43**
+
+Let's not make changes that are not related to the Vulkan back-end
+
+---
+
+👤 **ikawrakow** started a conversation on `ggml/src/ggml-cpu/ggml-cpu.c` on **2025-06-30** at **06:58:42**
+
+I don't think I want a copy of all the refactoring that happened in mainline since I forked the project.
---
-👤 **ikawrakow** submitted a review the **2025-06-30** at **07:12:08**: 🔄 `CHANGES_REQUESTED`
+👤 **ikawrakow** requested changes on this pull request 🔄 on **2025-06-30** at **07:12:08**
Please no new ops, new enum values, and no refactoring of the CPU backend. I think the Vulkan back-end can be updated to the latest without using the new back-end formalism in mainline.
---
-👤 **ubergarm** commented the **2025-07-01** at **02:59:51**:
+👤 **ikawrakow** commented on **2025-06-30** at **07:13:31**
+
+Btw, currently working on my M2-Max laptop, and Safari disintegrates into pieces when trying to view the changes in this PR.
+
+---
+
+👤 **firecoperana** commented on **2025-07-01** at **00:52:26**
+
+> For deepseek often one wants to compile with `-DGGML_CUDA_IQK_FORCE_BF16=1` to avoid overflowing fp16 accumulator which manifests as gibberish, nans, or `GGG` typically I believe.
+>
+> I just tried to compile but got an error, might be because I just updated my rig and now seem to have `gcc version 15.1.1 20250425 (GCC)`... I'll fuss with it a bit but put it here in the meantime.
+>
+> Details inside:
+> 👈 build command and logs
+>
+> _EDIT_
+>
+> fwiw i just forced that explicitly to `void *` and it compiles but then segfaults towards the end of starting up
+>
+> ```
+> #/ggml/src/ggml-backend.c ~around line 1020 or so
+> - /* .clear = */ ggml_backend_multi_buffer_clear,
+> + /* .clear = */ (GGML_CALL void *) ggml_backend_multi_buffer_clear, // ubergarm hack
+> ```
+
+Pull again. Fixed it.
+
+---
+
+👤 **ubergarm** commented on **2025-07-01** at **02:59:51**
@firecoperana
@@ -382,7 +458,7 @@ Lemme know if there is a certain version of the vulkan backend that might work b
---
-👤 **firecoperana** commented the **2025-07-01** at **15:00:17**:
+👤 **firecoperana** commented on **2025-07-01** at **15:00:17**
I noticed something odd too and suspect it's related to vulkan shader. When I run llama server in visual studio, I can match the performance of the mainline, but if I run in command line, I was only getting 1/3 to 1/2 of the speed for token generation. If you have time, you can do some troubleshooting, as I'm not familiar with vulkan at all.
@@ -390,7 +466,18 @@ I noticed something odd too and suspect it's related to vulkan shader. When I ru
---
-👤 **ikawrakow** commented the **2025-07-01** at **16:38:42**:
+👤 **ikawrakow** commented on **2025-07-01** at **15:15:12**
+
+> "warning: no backend supports op NONE with a weight with buffer type Vulkan0 used in tensor blk.0.attn_norm.weight" happens because vulkan does not support fused rms norm. It only shows in debug version.
+
+We will worry about the missing fused ops after we get the PR in. There is quite a bit left to do to have the `ik_llama.cpp` advantages available also with Vulkan:
+* Implement fused ops
+* Implement GEMM/GEMV for all quantization types added in `ik_llama.cpp`
+* Port the `ik_llama.cpp` improvements related to "indirect" GEMM and GEMV (as needed for MoE models).
+
+---
+
+👤 **ikawrakow** commented on **2025-07-01** at **16:38:42**
Tested on my RTX-4080. If I remove the fused ops (`GGML_OP_FUSED_RMS_NORM` and `GGML_OP_FUSED_MUL_UNARY`) and don't use flash attention, I get this for LlaMA-3.1-8B
@@ -417,7 +504,7 @@ Flash attention seems to be running on the CPU, so performance drops further wit
---
-👤 **ikawrakow** commented the **2025-07-01** at **16:48:33**:
+👤 **ikawrakow** commented on **2025-07-01** at **16:48:33**
If I change the `LOG_DEBUG` to `LOG_INFO` in `ggml_vk_print_gpu_info`, I see this line:
```
@@ -432,25 +519,90 @@ So, for some reason int dot products and cooperative matrix are not enabled. I g
---
-👤 **ikawrakow** submitted a review the **2025-07-01** at **18:07:18**: 💬 `COMMENTED`
+👤 **ikawrakow** started a conversation on `ggml/src/ggml-vulkan.cpp` on **2025-07-01** at **18:07:18**
----
+Why do we need this check? I don't have coopmat2 available, but if I comment out this check I get FA enabled, and it gives me a nice boost in performance.
-👤 **firecoperana** submitted a review the **2025-07-02** at **01:10:01**: 💬 `COMMENTED`
+> 👤 **firecoperana** replied on **2025-07-02** at **01:10:01**
+>
+> Removed.
---
-👤 **firecoperana** commented during a code review the **2025-07-02** at **01:10:01** on `ggml/src/ggml-vulkan.cpp`:
+👤 **ikawrakow** commented on **2025-07-01** at **18:18:05**
+
+OK, I'm learning. Need to build using
+```
+cmake .. -DGGML_VULKAN=ON -DGGML_VULKAN_COOPMAT2_GLSLC_SUPPORT=1 -DGGML_VULKAN_COOPMAT_GLSLC_SUPPORT=1 -DGGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT=1
+```
+
+Then I need to comment out the check for coopmat2 on line 9476 in `ggml-vulkan.cpp` to get FA enabled. With that, I almost match the Vulkan performance in mainline:
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 1024 | 256 | 0 | 0.334 | 3070.20 | 2.433 | 105.20 |
+| 1024 | 256 | 1024 | 0.340 | 3012.31 | 2.596 | 98.60 |
+| 1024 | 256 | 2048 | 0.342 | 2995.49 | 2.751 | 93.07 |
+| 1024 | 256 | 3072 | 0.334 | 3069.60 | 2.890 | 88.58 |
+| 1024 | 256 | 4096 | 0.339 | 3023.88 | 3.048 | 84.00 |
+| 1024 | 256 | 5120 | 0.352 | 2909.64 | 3.240 | 79.02 |
+| 1024 | 256 | 6144 | 0.369 | 2774.90 | 3.427 | 74.71 |
+| 1024 | 256 | 7168 | 0.377 | 2716.14 | 3.618 | 70.76 |
+| 1024 | 256 | 8192 | 0.388 | 2636.59 | 3.793 | 67.50 |
+| 1024 | 256 | 9216 | 0.413 | 2479.99 | 3.989 | 64.18 |
+| 1024 | 256 | 10240 | 0.437 | 2343.03 | 4.199 | 60.96 |
+| 1024 | 256 | 11264 | 0.460 | 2225.86 | 4.408 | 58.08 |
+| 1024 | 256 | 12288 | 0.487 | 2102.61 | 4.614 | 55.48 |
+| 1024 | 256 | 13312 | 0.503 | 2037.31 | 4.821 | 53.10 |
+| 1024 | 256 | 14336 | 0.535 | 1915.62 | 5.036 | 50.84 |
+| 1024 | 256 | 15360 | 0.553 | 1853.00 | 5.247 | 48.79 |
+
+PP is on par with mainline, TG is on par (or even slightly better) for short context, but performance somehow decreases faster with context length, so we end up with ~70% of mainline TG performance at 16k tokens.
+
+I'm told in [this comment](https://github.com/ikawrakow/ik_llama.cpp/discussions/562#discussioncomment-13630937) that I need to update my Nvidia driver to 575, which will give me coopmat2 and almost a factor of 2 speedup.
+
+---
-Removed.
+👤 **firecoperana** commented on **2025-07-02** at **01:13:09**
+
+> OK, I'm learning. Need to build using
+>
+> ```
+> cmake .. -DGGML_VULKAN=ON -DGGML_VULKAN_COOPMAT2_GLSLC_SUPPORT=1 -DGGML_VULKAN_COOPMAT_GLSLC_SUPPORT=1 -DGGML_VULKAN_INTEGER_DOT_GLSLC_SUPPORT=1
+> ```
+>
+> Then I need to comment out the check for coopmat2 on line 9476 in `ggml-vulkan.cpp` to get FA enabled. With that, I almost match the Vulkan performance in mainline:
+> PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
+> 1024 256 0 0.334 3070.20 2.433 105.20
+> 1024 256 1024 0.340 3012.31 2.596 98.60
+> 1024 256 2048 0.342 2995.49 2.751 93.07
+> 1024 256 3072 0.334 3069.60 2.890 88.58
+> 1024 256 4096 0.339 3023.88 3.048 84.00
+> 1024 256 5120 0.352 2909.64 3.240 79.02
+> 1024 256 6144 0.369 2774.90 3.427 74.71
+> 1024 256 7168 0.377 2716.14 3.618 70.76
+> 1024 256 8192 0.388 2636.59 3.793 67.50
+> 1024 256 9216 0.413 2479.99 3.989 64.18
+> 1024 256 10240 0.437 2343.03 4.199 60.96
+> 1024 256 11264 0.460 2225.86 4.408 58.08
+> 1024 256 12288 0.487 2102.61 4.614 55.48
+> 1024 256 13312 0.503 2037.31 4.821 53.10
+> 1024 256 14336 0.535 1915.62 5.036 50.84
+> 1024 256 15360 0.553 1853.00 5.247 48.79
+>
+> PP is on par with mainline, TG is on par (or even slightly better) for short context, but performance somehow decreases faster with context length, so we end up with ~70% of mainline TG performance at 16k tokens.
+>
+> I'm told in [this comment](https://github.com/ikawrakow/ik_llama.cpp/discussions/562#discussioncomment-13630937) that I need to update my Nvidia driver to 575, which will give me coopmat2 and almost a factor of 2 speedup.
+
+The new commit should remove the need to add these in cmake command. Also disable the fused ops for now.
---
-👤 **ubergarm** commented the **2025-07-02** at **04:42:36**:
+👤 **ubergarm** commented on **2025-07-02** at **04:42:36**
> The new commit should remove the need to add these in cmake command. Also disable the fused ops for now.
-Thanks I was having trouble getting it setup. First the amazing news, check this out on the AMD RX 7900 XTX it is up to snuff in early testing:
+Thanks I was having trouble getting it setup before, the recent commit fixed it right up. First the amazing news, check this out on the AMD RX 7900 XTX it is up to snuff in early testing:

@@ -462,8 +614,76 @@ ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp s
/mnt/astrodata/llm/ik_llama.cpp/ggml/src/ggml-vulkan.cpp:2031: GGML_ASSERT((GGML_KQ_MASK_PAD % rows_cols[0]) == 0) failed
```
-Amazing progress in a short time!
+Amazing progress in a short time!
+
+
+I tried a couple small R1-0528 quants but not quite there yet:
+
+
+👈 AMD 7900 XTX DeepSeek-R1-0528 Log
+
+```bash
+cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIPBLAS=0 -DGGML_VULKAN=1 -DGGML_RPC=0 -DGGML_CCACHE=1 -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
+cmake --build build --config Release -j $(nproc)
+
+model=/home/w/projects/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ1_S.gguf
+#model=/home/w/projects/models/ubergarm/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-IQ1_S_R4-00001-of-00003.gguf
+./build/bin/llama-sweep-bench \
+ --model "$model" \
+ --ctx-size 4608 \
+ -fa \
+ -mla 3 -amb 512 \
+ -ctk q8_0 \
+ -ngl 99 \
+ -ot "blk\.(3|4|5|6|7|8)\.ffn_.*=Vulkan0" \
+ -ot exps=CPU \
+ --threads 16 \
+ --no-mmap
+ # -fmoe # leave this off for now
+
+ggml_vulkan: 0 = Radeon RX 7900 XTX (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
+
+llama_new_context_with_model: n_ctx = 4608
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 3
+llama_new_context_with_model: attn_max_b = 512
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 0.025
+llama_kv_cache_init: Vulkan0 KV buffer size = 164.06 MiB
+llama_new_context_with_model: KV self size = 164.06 MiB, c^KV (q8_0): 164.06 MiB, kv^T: not used
+llama_new_context_with_model: Vulkan_Host output buffer size = 0.49 MiB
+llama_new_context_with_model: Vulkan0 compute buffer size = 982.00 MiB
+llama_new_context_with_model: Vulkan_Host compute buffer size = 480.91 MiB
+llama_new_context_with_model: graph nodes = 4641
+llama_new_context_with_model: graph splits = 826
+
+main: n_kv_max = 4608, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 16, n_threads_batch = 16
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+ggml_compute_forward_dup_q: CPU#cache_k_l0 (view)#0 -> cache_k_l0 (view) (copy) is of type f16
+ggml_compute_forward_dup_q: CPU#cache_k_l0 (view)#0 -> cache_k_l0 (view) (copy) is of type f16
+/home/w/projects/ik_llama.cpp/ggml/src/ggml.c:10783: /home/w/projects/ik_llama.cpp/ggml/src/ggml.c:10783: fatal errorggml_compute_forward_dup_q: CPU#c
+ache_k_l0 (view)#0 -> cache_k_l0 (view) (copy) is of type f16
+fatal error
+```
+
+
+
+---
+
+👤 **ikawrakow** commented on **2025-07-02** at **06:24:09**
+
+I don't quite understand why `ik_llama.cpp` would run faster than mainline. None of the additions that make it run faster on CPU/CUDA are implemented in the Vulkan port.
+
+> I tried a couple small R1-0528 quants but not quite there yet:
+
+Of course not. The Vulkan backend does not support DeepSeek flash attention, so no, no `-mla 3` is possible. `-fmoe` is not there either. Neither are all the additions to concatenating, copying, and transposing tensors necessary to make FlashMLA-3 work.
---
-👤 **ikawrakow** submitted a review the **2025-07-02** at **06:49:33**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-02** at **06:49:33**
\ No newline at end of file
diff --git a/github-data/pull_requests/565 - add hunyuan moe support for 561.md b/github-data/pull_requests/565 - add hunyuan moe support for 561.md
index 943442c90..6349aa0ea 100644
--- a/github-data/pull_requests/565 - add hunyuan moe support for 561.md
+++ b/github-data/pull_requests/565 - add hunyuan moe support for 561.md
@@ -1,14 +1,17 @@
-### 🔀 [#565](https://github.com/ikawrakow/ik_llama.cpp/pull/565) - add hunyuan moe support for 561
+## 🔀 [Pull Request #565](https://github.com/ikawrakow/ik_llama.cpp/pull/565) - add hunyuan moe support for 561
| **Author** | `ubergarm` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ug/hunyuan-moe-2` |
+| **Target Branch** | `main` |
| **Created** | 2025-06-30 |
| **Updated** | 2025-07-15 |
+| **Merged** | 2025-07-09 |
---
-#### Description
+## 📄 Description
Based this PR on mainline https://github.com/ggml-org/llama.cpp/pull/14425. Didn't merge any python stuff (used mainline convert script). Tested with bf16 on hybrid CUDA+CPU.
@@ -30,7 +33,7 @@ model=/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct
--port 8080
```
-Would be great if anyone else could test e.g. @Downtown-Case as per #561
+Would be great if anyone else could test e.g. @Downtown-Case as per [#561](https://github.com/ikawrakow/ik_llama.cpp/issues/561)
I haven't yet made imatrix nor tried to quantize further.
@@ -42,9 +45,9 @@ The behavior seems a bit odd and will answer in chinese if I don't use some kind
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-06-30** at **18:28:48**:
+👤 **ubergarm** commented on **2025-06-30** at **18:28:48**
I'm currently processing an imatrix and noticed that it *requires* `-fa` or will have very large numbers.
@@ -64,20 +67,54 @@ This seems to be working so far, though still seems a higher than I expected whi
--threads 24
system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
+
compute_imatrix: tokenizing the input ..
-compute_imatrix: tokenization took 701.577 ms
+compute_imatrix: tokenization took 709.171 ms
compute_imatrix: computing over 865 chunks with batch_size 512
-compute_imatrix: 5.03 seconds per pass - ETA 1 hours 12.48 minutes
+compute_imatrix: 4.37 seconds per pass - ETA 1 hours 3.07 minutes
[1]12.7104,[2]14.8010,[3]14.3374,[4]30.5778,[5]17.4738,[6]14.5285,[7]20.2402,[8]14.9318,[9]11.7604,
save_imatrix: stored collected data after 10 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
[10]12.0205,[11]10.2799,[12]12.3863,[13]14.9808,[14]16.1885,[15]16.6677,[16]20.9547,[17]19.1613,[18]17.4531,[19]15.5200,
+save_imatrix: stored collected data after 20 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[20]14.7222,[21]13.4574,[22]12.5603,[23]11.8334,[24]11.1943,[25]10.7840,[26]10.5614,[27]10.8168,[28]11.2630,[29]11.9753,
+save_imatrix: stored collected data after 30 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[30]12.7904,[31]12.8568,[32]12.7520,[33]13.2066,[34]13.7438,[35]14.3701,[36]15.2825,[37]16.4474,[38]17.2615,[39]17.7246,
+save_imatrix: stored collected data after 40 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[40]20.3797,[41]22.3074,[42]22.9196,[43]23.5967,[44]24.9652,[45]26.3450,[46]28.0728,[47]28.1975,[48]27.9526,[49]31.3467,
+save_imatrix: stored collected data after 50 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[50]30.1730,[51]31.2195,[52]30.6089,[53]30.0938,[54]29.5127,[55]29.9680,[56]29.2944,[57]28.2416,[58]27.2467,[59]26.2110,
+save_imatrix: stored collected data after 60 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[60]25.3394,[61]24.4437,[62]23.7538,[63]25.8637,[64]27.0096,[65]28.0507,[66]27.7521,[67]29.0344,[68]29.8659,[69]30.3886,
+save_imatrix: stored collected data after 70 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[70]31.4350,[71]31.8531,[72]31.7906,[73]31.7912,[74]32.9230,[75]34.9214,[76]37.0384,[77]38.7590,[78]38.9847,[79]40.2656,
+save_imatrix: stored collected data after 80 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[80]41.5627,[81]41.0075,[82]42.5855,[83]44.5075,[84]43.9110,[85]43.3078,[86]42.7130,[87]41.7924,[88]41.2850,[89]41.5686,
+save_imatrix: stored collected data after 90 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[90]40.8182,[91]41.2610,[92]42.4782,[93]44.0758,[94]43.5943,[95]43.7613,[96]43.0079,[97]42.6615,[98]43.6499,[99]43.1762,
+save_imatrix: stored collected data after 100 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[100]42.4092,[101]43.1918,[102]44.5605,[103]44.1737,[104]44.2998,[105]45.3024,[106]45.5803,[107]45.3388,[108]45.5154,[109]45.8490,
+save_imatrix: stored collected data after 110 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[110]45.6819,[111]46.1607,[112]46.8070,[113]47.5833,[114]48.5492,[115]48.9797,[116]49.6842,[117]49.8659,[118]51.1640,[119]51.3824,
+save_imatrix: stored collected data after 120 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[120]52.0141,[121]53.6073,[122]55.3684,[123]56.2596,[124]56.0548,[125]56.1662,[126]56.3532,[127]57.2403,[128]56.6770,[129]58.3851,
+save_imatrix: stored collected data after 130 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
+[130]58.2333,[131]59.2614,[132]60.7497,[133]62.4619,[134]63.7352,[135]64.8522,[136]66.5478,[137]64.9457,[138]63.5455,[139]63.2199,
+save_imatrix: stored collected data after 140 chunks in /mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/imatrix-Hunyuan-A13B-Instruct-BF16.dat
...
+```
+
+*EDIT*: I also tried adding this model to the list for `ggml_flash_attn_ext_set_prec(cur, GGML_PREC_F32);` but the imatrix PPL values look about the same. Seems to be about the same with or without `-fmoe` as well.
+
+fwiw this seems to be similar numbers as I'm getting using mainline llama-imatrix:
+
+```
+[1]12.7998,[2]14.9052,[3]14.4276,[4]30.9156,[5]17.5724,[6]14.6579,[7]20.3671,[8]15.0254,[9]11.8121,[10]12.0809,[11]10.3416,[12]12.4422,[13]15.1108,^C
```
---
-👤 **ikawrakow** commented the **2025-06-30** at **20:20:40**:
+👤 **ikawrakow** commented on **2025-06-30** at **20:20:40**
No FA and FA giving very different PPL values is not a good sign.
@@ -85,7 +122,7 @@ PPL of 60 is not a good sign either, especially for a model of that size.
---
-👤 **ubergarm** commented the **2025-06-30** at **20:36:19**:
+👤 **ubergarm** commented on **2025-06-30** at **20:36:19**
I'm going to leave an endpoint up for a little bit if anyone wants to try the first experimental quant.. No promises lol
@@ -146,29 +183,199 @@ model=/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct
---
-👤 **ikawrakow** submitted a review the **2025-07-01** at **06:00:36**: 💬 `COMMENTED`
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-07-01** at **06:00:36**
+
+If you check your previous PR about GLM4 you will see that you had to remove the `Vcur` reshaping. It is the same here. Remove this line and it is likely the difference between FA and no FA will go away.
+
+> 👤 **ubergarm** replied on **2025-07-01** at **23:54:30**
+>
+> Yup, thanks for the reminder! The two trickiest parts of porting an architecture is remembering to:
+>
+> 1. Remove the `Vcur` reshaping.
+> 2. On mainline `build_attn()` the argument order goes `Qcur, Kcur, Vcur,`, but here with `llm_build_kv()` the order goes `Kcur, Vcur, Qcur,`.
+>
+> Just re-downloaded the new .safetensors, converted, and built a fresh quant to test:
+>
+> * `FA=1` Final estimate: PPL = 522.7473 +/- 5.68072
+> * `FA=0` Final estimate: PPL = 527.6625 +/- 5.73144
+>
+> So looks "good" now haha... I didn't wait to find the bf16's PPL but this lines up in the ball-park with what [mainline is seeing around ~500](https://github.com/ggml-org/llama.cpp/pull/14425#issuecomment-3024357323).
+>
+> Of course I couldn't help myself and had to try out [the new IQ3_KS quant](https://github.com/ikawrakow/ik_llama.cpp/pull/566) as well lol...
+>
+> So far so good!
+> ```
+> llm_load_print_meta: model type = 80B.A13B
+> llm_load_print_meta: model ftype = IQ3_KS - 3.1875 bpw
+> llm_load_print_meta: model params = 80.393 B
+> llm_load_print_meta: model size = 34.088 GiB (3.642 BPW)
+> llm_load_print_meta: general.name = Hunyuan A13B Instruct
+>
+> # Attention
+> blk\..*\.attn_k.*=iq6_k
+> blk\..*\.attn_v.*=iq6_k
+>
+> blk\..*\.attn_q.*=iq5_k
+> blk\..*\.attn_o.*=iq5_k
+>
+> # 1x Shared Expert
+> blk\..*\.ffn_(down)_shexp.*=iq6_k
+> blk\..*\.ffn_(gate|up)_shexp.*=iq5_k
+>
+> # 64x Routed Experts
+> blk\..*\.ffn_(down)_exps.*=iq4_ks
+> blk\..*\.ffn_(gate|up)_exps.*=iq3_ks # let's live dangerously
+>
+> # Token Embedding
+> token_embd\.weight=iq6_k # splurged here a bit as this model's tokenization seems wierd
+> ```
---
-👤 **ikawrakow** commented during a code review the **2025-07-01** at **06:00:36** on `src/llama.cpp`:
+👤 **kiron111** commented on **2025-07-02** at **02:42:43**
-If you check your previous PR about GLM4 you will see that you had to remove the `Vcur` reshaping. It is the same here. Remove this line and it is likely the difference between FA and no FA will go away.
+> I'm going to leave an endpoint up for a little bit if anyone wants to try the first experimental quant.. No promises lol
+>
+> ## Endpoint
+> WebUI: https://llm.ubergarm.com/ APIEndpoint: https://llm.ubergarm.com/ (it is llama-server API endpoint with no API key)
+>
+> There are 8 concurrent slots each with 64k prompt limit.
+>
+> ## Test Quant
+> I just rolled an imatrix.dat and made my first quant for testing.
+>
+> ```
+> llm_load_print_meta: model type = 80B.A13B
+> llm_load_print_meta: model ftype = IQ4_K - 4.5 bpw
+> llm_load_print_meta: model params = 80.393 B
+> llm_load_print_meta: model size = 48.581 GiB (5.191 BPW)
+> llm_load_print_meta: general.name = Hunyuan A13B Instruct
+> ```
+>
+> ```
+> blk\..*\.attn_k.*=iq6_k
+> blk\..*\.attn_v.*=iq6_k
+>
+> blk\..*\.attn_q.*=iq5_k
+> blk\..*\.attn_o.*=iq5_k
+>
+> # 1x Shared Expert
+> blk\..*\.ffn_(gate|up)_shexp.*=iq6_k
+> blk\..*\.ffn_(down)_shexp.*=iq5_k
+>
+> # 64x Routed Experts
+> blk\..*\.ffn_(gate|up)_exps.*=iq5_k
+> blk\..*\.ffn_(down)_exps.*=iq4_k
+>
+> # Token Embedding
+> token_embd\.weight=iq4_k
+> ```
+>
+> How I ran it:
+>
+> ```shell
+> model=/mnt/raid/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ4_K.gguf
+> ./build/bin/llama-server \
+> --model "$model" \
+> --alias ubergarm/Hunyuan-A13B-Instruct-IQ4_K \
+> -fa \
+> -ctk q8_0 -ctv q8_0 \
+> -c 524288 \
+> --temp 0.6 \
+> --presence-penalty 0.7 \
+> --min-p 0.1 \
+> -ts 48,48 \
+> -ngl 99 \
+> --parallel 8 \
+> --threads 1 \
+> --host 127.0.0.1 \
+> --port 8080
+> ```
+
+tested on your api, it works for Chinese Q&A.
+
+---
+
+👤 **ubergarm** commented on **2025-07-02** at **02:44:39**
+
+> tested on your api, it works for Chinese Q&A.
+
+Ahh very good, thanks you. Tonight I was running my updated [experimental IQ3_KS](https://huggingface.co/ubergarm/Hunyuan-A13B-Instruct-GGUF/blob/main/Hunyuan-A13B-Instruct-IQ3_KS.gguf) which I went ahead and released prematurely because oh well it seems okay lol...
+
+Thanks for testing!
+
+
+
+It can fit 256k context in under 24GB VRAM when not offloading additional exps and with `-ub 1024` get over 500 tok/sec PP and about 17 tok/sec TG. So quite a flexible model in terms of size at least.
---
-👤 **ubergarm** submitted a review the **2025-07-01** at **23:54:30**: 💬 `COMMENTED`
+👤 **kiron111** commented on **2025-07-02** at **03:54:01**
+
+run on wsl I got a error: Floating point exception (core dumped), in the initial procress of ik_llama.cpp
+
+```
+model=/mnt/g/lm-studio/models/ubergarm/Hunyuan-A13B-Instruct-GGUF/Hunyuan-A13B-Instruct-IQ3_KS.gguf
+./build/bin/llama-server \
+ --model "$model" \
+ --alias ubergarm/Hunyuan-A13B-Instruct-IQ4_K \
+ -fa \
+ -ctk q8_0 -ctv q8_0 \
+ -c 4096 \
+ --temp 0.6 \
+ --presence-penalty 0.7 \
+ --min-p 0.1 \
+ -ts 48,48 \
+ -ngl 99 \
+ --parallel 8 \
+ --threads 1 \
+ --host 127.0.0.1 \
+ --port 8080
+INFO [ main] build info | tid="140543167610880" timestamp=1751427934 build=3776 commit="c6c23fa4"
+INFO [ main] system info | tid="140543167610880" timestamp=1751427934 n_threads=24 n_threads_batch=-1 total_threads=16 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+Floating point exception (core dumped)
+```
+OS: Win 11 +WSL
+branch: Uberarm/huanyuan-moe2 ( I use this command to download the source code: git clone --branch ug/hunyuan-moe-2 https://github.com/ubergarm/ik_llama.cpp.git)
+Cmake config: cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON -DGGML_CCACHE=OFF
+Hardware: AMD Ryzen 5700X, 4070 12GB, WSL ram limit = 42 ( physical ram 48 Gb in total)
---
-👤 **ubergarm** commented the **2025-07-02** at **04:03:30**:
+👤 **ubergarm** commented on **2025-07-02** at **04:03:30**
> run on wsl I got a error: Floating point exception (core dumped), in the initial procress of ik_llama.cpp
-Its becase I'm a madman and released a quant depending on two unmerged PRs. Check here for instructions how to get the IQ3_KS PR here: https://huggingface.co/ubergarm/Hunyuan-A13B-Instruct-GGUF#note-building-experimental-prs
+Its becase I'm a madman and released a quant depending on two unmerged PRs. Check here for instructions how to get the IQ3_KS PR here: https://huggingface.co/ubergarm/Hunyuan-A13B-Instruct-GGUF#note-building-experimental-prs
+
+Also @kiron111 look at the examples on the model card you will need to use `-ngl 99 -ot exps=CPU` remove that `-ts` stuff that was specific to my test rig.
+
+This model is great for low VRAM machines and can probably run in 6GB VRAM with some usable context.
---
-👤 **ubergarm** commented the **2025-07-02** at **18:58:03**:
+👤 **kiron111** commented on **2025-07-02** at **04:08:50**
+
+> > run on wsl I got a error: Floating point exception (core dumped), in the initial procress of ik_llama.cpp
+>
+> Its becase I'm a madman and released a quant depending on two unmerged PRs. Check here for instructions how to get the IQ3_KS PR here: https://huggingface.co/ubergarm/Hunyuan-A13B-Instruct-GGUF#note-building-experimental-prs
+>
+> Also @kiron111 look at the examples on the model card you will need to use `-ngl 99 -ot exps=CPU` remove that `-ts` stuff that was specific to my test rig.
+>
+> This model is great for low VRAM machines and can probably run in 6GB VRAM with some usable context.
+
+thankyou
+oh....I miss so many points...let me redownload and recompile first
+
+---
+
+👤 **ikawrakow** commented on **2025-07-02** at **07:34:31**
+
+The PPL of 500+ is not very promising. I suspect this is because of the not implemented technique to reduce the importance of recently used experts, which completely modifies the inference compared to how the model was trained, that was discussed in the mainline PR. Hence still wondering if to merge. They have merged as is in mainline, but `ik_llama.cpp` tries to be better than mainline.
+
+---
+
+👤 **ubergarm** commented on **2025-07-02** at **18:58:03**
> The PPL of 500+ is not very promising. I suspect this is because of the not implemented technique to reduce the importance of recently used experts, which completely modifies the inference compared to how the model was trained, that was discussed in the mainline PR
@@ -188,10 +395,22 @@ model=Hunyuan-A13B-Pretrain-BF16-00001-of-00004.gguf
--threads 24
Final estimate: PPL = 5.2880 +/- 0.03236
+```
+
+*EDIT*
+```
+model=Hunyuan-A13B-Pretrain-IQ3_KS.gguf
+# model type = 80B.A13B
+# model ftype = IQ3_KS - 3.1875 bpw
+# model params = 80.393 B
+# model size = 34.088 GiB (3.642 BPW)
+# general.name = Hunyuan A13B Pretrain
+
+Final estimate: PPL = 5.4382 +/- 0.03349
```
---
-👤 **ikawrakow** submitted a review the **2025-07-09** at **08:29:32**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-09** at **08:29:32**
OK, lets merge this.
\ No newline at end of file
diff --git a/github-data/pull_requests/566 - Adding IQ3_KS quants.md b/github-data/pull_requests/566 - Adding IQ3_KS quants.md
index cbd5a9fd6..1a3a8cafe 100644
--- a/github-data/pull_requests/566 - Adding IQ3_KS quants.md
+++ b/github-data/pull_requests/566 - Adding IQ3_KS quants.md
@@ -1,14 +1,17 @@
-### 🔀 [#566](https://github.com/ikawrakow/ik_llama.cpp/pull/566) - Adding IQ3_KS quants
+## 🔀 [Pull Request #566](https://github.com/ikawrakow/ik_llama.cpp/pull/566) - Adding IQ3_KS quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq3_ks_v2` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-01 |
| **Updated** | 2025-07-02 |
+| **Merged** | 2025-07-02 |
---
-#### Description
+## 📄 Description
This PR adds `IQ3_KS` - 3.1875 bpw quants with a block size of 32. This makes the `IQX_KS` quant series complete
@@ -94,15 +97,54 @@ Here a few sweep-benches for LlaMA-3.1-8B-Instruct
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-02** at **07:27:42**:
+👤 **ikawrakow** commented on **2025-07-02** at **07:27:42**
Let's merge this so people don't get crashes when trying to run `IQ3_KS` models with the main branch.
---
-👤 **Nexesenex** commented the **2025-07-02** at **15:01:59**:
+👤 **Nexesenex** commented on **2025-07-02** at **13:09:32**
+
+On a SicariusSicariiStuff_Nano_Imp_1B-bf16 Llama 3.2 1B model I had on my drive.
+
+PPL 512 wikitest eng.
+IQ3_XXS, output Q6_K : 39.9308 +/- 0.36424
+IQ3_KS V1 (your old branch), output Q6_K : 37.4846 +/- 0.34625
+IQ3_KS V2 (this one), output IQ5_K : 35.3730 +/- 0.32563 (that's a clear improvement)
+Q3_K, output Q6_K : 31.3082 +/- 0.28644
+IQ3_S, output Q6_K : 34.0241 +/- 0.31115
+IQ3_K, output Q6_K : 33.2313 +/- 0.30001
+
+IQ3_KT FTYPE
+llama_model_loader: - type f32: 34 tensors
+llama_model_loader: - type q5_K: 1 tensors (output)
+llama_model_loader: - type iq3_s: 1 tensors (embeddings)
+llama_model_loader: - type iq3_k: 16 tensors (attn_output)
+llama_model_loader: - type iq5_k: 16 tensors (attn_v)
+llama_model_loader: - type iq3_kt: 80 tensors
+PPL 512 Wikitest eng : PPL = 35.3279 +/- 0.32366 (IQ3_KS_v2 compete quite well on this model)
+
+Also, merged successfully on Croco.cpp, and it infers properly.
+
+---
+
+👤 **Nexesenex** commented on **2025-07-02** at **14:05:49**
+
+@Ikawrakow : You brought us SOTA quants in the 2.1-2.2x bpw and 3.1x-3.2 bpw range with the KS and KT quants (and so, IQ2_XXS, XS, and IQ3_XXS are close to obsolescence now), and IQ2_K/IQ2_S remain on duty, but there's now a SOTA quants gap in the 2.4-3.1bpw range.
+
+Would it be possible, mathematically wise, and interesting for you to develop a new IQ2_KL quant (in the 2.6875-2.75bpw range?), and offer a much more performant alternative to IQ2_S and IQ2_K, in line of what you developed recently?
+
+---
+
+👤 **ikawrakow** commented on **2025-07-02** at **14:25:59**
+
+I have been thinking about this, but don't have a good idea how to spend the extra bits (extra compared to, e.g., `IQ2_KS`) without making inference inefficient. A larger Trellis quant based on `IQ2_KT` is also tricky as at 2 bpw (excluding block scales) we are already at 65k possibilities for a group of 8 quants, so going to 2.5 bpw would increase this to a million, which would make quantization time prohibitive. But if I go to groups of 4, that's only 1024 possibilities at 2.5 bpw, so this is not going to be very good. I could make a hybrid thing between `IQ2_KS` and a codebook (as the i-quants), but that brings CPU performance down, which is not good as most `ik_llama.cpp` users use it for the giant MoE models where TG runs on the CPU. But yes, If I come up with a good idea for a ~2.6-2.7 bpw quant, I will add it.
+
+---
+
+👤 **Nexesenex** commented on **2025-07-02** at **15:01:59**
Thanks for the explanation, I understand that the alternatives you have atm are quite unpractical.
diff --git a/github-data/pull_requests/567 - Minor CUDA PP speed improvement.md b/github-data/pull_requests/567 - Minor CUDA PP speed improvement.md
index f1812c32d..a8412bd42 100644
--- a/github-data/pull_requests/567 - Minor CUDA PP speed improvement.md
+++ b/github-data/pull_requests/567 - Minor CUDA PP speed improvement.md
@@ -1,14 +1,17 @@
-### 🔀 [#567](https://github.com/ikawrakow/ik_llama.cpp/pull/567) - Minor CUDA PP speed improvement
+## 🔀 [Pull Request #567](https://github.com/ikawrakow/ik_llama.cpp/pull/567) - Minor CUDA PP speed improvement
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/improve_mmq` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-01 |
| **Updated** | 2025-07-02 |
+| **Merged** | 2025-07-02 |
---
-#### Description
+## 📄 Description
I was actually trying to improve MMQ performance for quants with a block-size of 16, but ended up with a small improvement of the MMQ kernel for blocks of 32. Just 1-2% kind of improvement, so nothing earth shattering.
@@ -19,9 +22,9 @@ Here a `sweep-bench` graph for LlaMA-3.1-8B on RTX-4080 for `Q4_0` and `IQ4_KS`.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2025-07-02** at **03:05:58**:
+👤 **Nexesenex** commented on **2025-07-02** at **03:05:58**
No problem on my side on Miqu Q5_K_M (full offload w/MMQ on 3 GPUs) and Wizard 8x22b IQ3_S mix (same test) after adapting this PR to Croco.cpp (mainline's fork).
Perfs are similar, with maybe a 0.5-1% bonus (still in the margin of variation of my bench results, but not downward, upward).
@@ -32,7 +35,7 @@ such as iq4_xs and iq4_nl?
---
-👤 **ikawrakow** commented the **2025-07-02** at **07:11:23**:
+👤 **ikawrakow** commented on **2025-07-02** at **07:11:23**
> Can the iq4_ks versant of that PR be valid on the other quants' MMQ kernels
diff --git a/github-data/pull_requests/569 - Conditionally disable fused ops when building with Vulkan enabled.md b/github-data/pull_requests/569 - Conditionally disable fused ops when building with Vulkan enabled.md
index d5c7dd1cb..8d19c86c1 100644
--- a/github-data/pull_requests/569 - Conditionally disable fused ops when building with Vulkan enabled.md
+++ b/github-data/pull_requests/569 - Conditionally disable fused ops when building with Vulkan enabled.md
@@ -1,14 +1,17 @@
-### 🔀 [#569](https://github.com/ikawrakow/ik_llama.cpp/pull/569) - Conditionally disable fused ops when building with Vulkan enabled
+## 🔀 [Pull Request #569](https://github.com/ikawrakow/ik_llama.cpp/pull/569) - Conditionally disable fused ops when building with Vulkan enabled
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/vulkan_disable_fused_ops` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-02 |
| **Updated** | 2025-07-02 |
+| **Merged** | 2025-07-02 |
---
-#### Description
+## 📄 Description
Last PR just disabled them, here we disable them only if building with Vulkan support.
diff --git a/github-data/pull_requests/57 - AVX2_Zen4 horizontal sums.md b/github-data/pull_requests/57 - AVX2Zen4 horizontal sums.md
similarity index 93%
rename from github-data/pull_requests/57 - AVX2_Zen4 horizontal sums.md
rename to github-data/pull_requests/57 - AVX2Zen4 horizontal sums.md
index 067475f56..e2dd8415e 100644
--- a/github-data/pull_requests/57 - AVX2_Zen4 horizontal sums.md
+++ b/github-data/pull_requests/57 - AVX2Zen4 horizontal sums.md
@@ -1,14 +1,16 @@
-### 🔀 [#57](https://github.com/ikawrakow/ik_llama.cpp/pull/57) - AVX2/Zen4 horizontal sums
+## 🔀 [Pull Request #57](https://github.com/ikawrakow/ik_llama.cpp/pull/57) - AVX2/Zen4 horizontal sums
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ✅ **Open** |
+| **Source Branch** | `ik/hsums` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-17 |
| **Updated** | 2024-09-17 |
---
-#### Description
+## 📄 Description
It is really strange that there is no instruction to horizontally sum the elements of a SIMD vector in `AVX/AVX2/AVX512` as this is needed all the time. In `AVX512` there is `_mm512_reduce_add_ps(x)`, but this expands to multiple instructions. E.g., from GCC-12 `immintrin.h`:
```
diff --git a/github-data/pull_requests/570 - Remove duplicate_misplaced cmake find_package for Vulkan.md b/github-data/pull_requests/570 - Remove duplicate_misplaced cmake find_package for Vulkan.md
deleted file mode 100644
index 7af538c30..000000000
--- a/github-data/pull_requests/570 - Remove duplicate_misplaced cmake find_package for Vulkan.md
+++ /dev/null
@@ -1,26 +0,0 @@
-### 🔀 [#570](https://github.com/ikawrakow/ik_llama.cpp/pull/570) - Remove duplicate/misplaced cmake find_package for Vulkan
-
-| **Author** | `Nexesenex` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-07-02 |
-| **Updated** | 2025-07-02 |
-
----
-
-#### Description
-
-This line `find_package(Vulkan COMPONENTS glslc REQUIRED)` prevented to build anything on MSVS 2022 if the package was not present on the system, this even if Vulkan was not selected.
-
-It's already present in the Vulkan conditionality.
-
-```
-if (GGML_VULKAN)
-find_package(Vulkan COMPONENTS glslc REQUIRED)
-```
-
-- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
-- Self-reported review complexity:
- - [x] Low
- - [ ] Medium
- - [ ] High
\ No newline at end of file
diff --git a/github-data/pull_requests/570 - Remove duplicatemisplaced cmake find_package for Vulkan.md b/github-data/pull_requests/570 - Remove duplicatemisplaced cmake find_package for Vulkan.md
new file mode 100644
index 000000000..41ee153a2
--- /dev/null
+++ b/github-data/pull_requests/570 - Remove duplicatemisplaced cmake find_package for Vulkan.md
@@ -0,0 +1,46 @@
+## 🔀 [Pull Request #570](https://github.com/ikawrakow/ik_llama.cpp/pull/570) - Remove duplicate/misplaced cmake find_package for Vulkan
+
+| **Author** | `Nexesenex` |
+| :--- | :--- |
+| **State** | ❌ **Closed** |
+| **Source Branch** | `fix_novulkan_cmake` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-02 |
+| **Updated** | 2025-07-02 |
+
+---
+
+## 📄 Description
+
+This line `find_package(Vulkan COMPONENTS glslc REQUIRED)` prevented to build anything on MSVS 2022 if the package was not present on the system, this even if Vulkan was not selected.
+
+It's already present in the Vulkan conditionality.
+
+```
+if (GGML_VULKAN)
+find_package(Vulkan COMPONENTS glslc REQUIRED)
+```
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [x] Low
+ - [ ] Medium
+ - [ ] High
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-07-02** at **14:13:30**
+
+I just merged [#571](https://github.com/ikawrakow/ik_llama.cpp/issues/571) that should fix it. Thanks for reporting and making a PR.
+
+I preferred [#571](https://github.com/ikawrakow/ik_llama.cpp/issues/571) because also the function testing Vulkan features needed to go inside the Vulkan block.
+
+---
+
+👤 **Nexesenex** commented on **2025-07-02** at **14:16:27**
+
+Of course, np!
+
+I wondered about reformating the Vulkan block in the cmakelist also, but you did it all already!
\ No newline at end of file
diff --git a/github-data/pull_requests/571 - Fix CMakeLists.md b/github-data/pull_requests/571 - Fix CMakeLists.md
index 7be81b265..dfc107f2c 100644
--- a/github-data/pull_requests/571 - Fix CMakeLists.md
+++ b/github-data/pull_requests/571 - Fix CMakeLists.md
@@ -1,14 +1,17 @@
-### 🐛 [#571](https://github.com/ikawrakow/ik_llama.cpp/pull/571) - Fix CMakeLists
+## 🔀 [Pull Request #571](https://github.com/ikawrakow/ik_llama.cpp/pull/571) - Fix CMakeLists
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_vulkan_required` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-02 |
| **Updated** | 2025-07-02 |
+| **Merged** | 2025-07-02 |
---
-#### Description
+## 📄 Description
The Vulkan stuff had ended up outside the `if (GGML_VULKAN)` condition, which prevents building any configuration unless having Vulkan installed.
diff --git a/github-data/pull_requests/573 - Support for dots.llm1 models.md b/github-data/pull_requests/573 - Support for dots.llm1 models.md
index 8c5a9a7c2..a52073233 100644
--- a/github-data/pull_requests/573 - Support for dots.llm1 models.md
+++ b/github-data/pull_requests/573 - Support for dots.llm1 models.md
@@ -1,14 +1,17 @@
-### 🔀 [#573](https://github.com/ikawrakow/ik_llama.cpp/pull/573) - Support for dots.llm1 models
+## 🔀 [Pull Request #573](https://github.com/ikawrakow/ik_llama.cpp/pull/573) - Support for dots.llm1 models
| **Author** | `saood06` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `s6/dots` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-03 |
| **Updated** | 2025-07-10 |
+| **Merged** | 2025-07-10 |
---
-#### Description
+## 📄 Description
Port of https://github.com/ggml-org/llama.cpp/pull/14118
@@ -20,9 +23,32 @@ Huggingface link to models: [instruct](https://huggingface.co/rednote-hilab/dots
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-07-03** at **04:59:14**:
+👤 **saood06** commented on **2025-07-03** at **03:44:01**
+
+Tested a bit in the cli, seems to work.
+
+Command:
+`./bin/llama-cli -m /mnt/sda/dots_inst/Dots_Inst-128x8.7B-BF16.gguf -s 12345 -p "The meaning of life" -t 48 --numa distribute -n 32 -c 8192 -fa`
+
+Output:
+
+The meaning of life is to find your gift. The purpose of life is to give it away. — Willam James
+
+This is as much as I had patience for, warmup seems to not actually load in all the experts and so tokens trickle in very slowly, not sure if that is the norm for CLI on MoE models (I know this isn't an issue for me with Deepseek models on server or sweep-bench).
+
+I also noticed it is wrongly labeled as it says `model ftype = IQ1_S - 1.5625 bpw` even though it is a `BF16`, and found the issue. When I updated constants.py for LlamaFileType I used ggml.h instead of llama.h (only now realized that both have `ftype` info and they differ [not sure why?])
+
+---
+
+👤 **firecoperana** commented on **2025-07-03** at **04:52:49**
+
+I am testing using UD-Q4_K_XL, and it is working. I notice an issue that if I leave system prompt empty, sometimes the response becomes unrelated to my question. With system prompt, it is fine. Do you also see this? I have the same issue when I run it from mainline.
+
+---
+
+👤 **saood06** commented on **2025-07-03** at **04:59:14**
> I am testing using UD-Q4_K_XL, and it is working.
@@ -34,52 +60,111 @@ If it exists in mainline then maybe it is a problem with the model? I haven't se
---
-👤 **ikawrakow** submitted a review the **2025-07-03** at **06:19:04**: 🔄 `CHANGES_REQUESTED`
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-07-03** at **06:18:24**
+
+I think you need to remove this line. We are not reshaping `V` as mainline because our attention implementation is different from theirs (and theirs was like ours until 2 or 3 months ago).
+
+> 👤 **saood06** replied on **2025-07-09** at **17:29:30**
+>
+> Commented it out (and the then redundant `cb`), and tested and it is working.
+
+---
+
+👤 **ikawrakow** requested changes on this pull request 🔄 on **2025-07-03** at **06:19:04**
---
-👤 **saood06** commented the **2025-07-04** at **00:05:25**:
+👤 **firecoperana** commented on **2025-07-03** at **15:20:55**
+
+I also see that the response will pause for a few seconds whenever it generates a comma, which will more than half the generation speed. If I prompt it to avoid outputting comma in the response, I don't see any pause in response. Mainline does not have this issue because it does not output comma in the response.
+
+Screenshot of the quant that I use:
+
+
+BOS token is ",", which should be changed to -1 according to this post:
+https://huggingface.co/gghfez/dots.llm1.inst-GGUF/discussions/1
+
+---
+
+👤 **saood06** commented on **2025-07-03** at **22:55:12**
+
+> I also see that the response will pause for a few seconds whenever it generates a comma, which will more than half the generation speed. If I prompt it to avoid outputting comma in the response, I don't see any pause in response. Mainline does not have this issue because it does not output comma in the response.
+
+Interesting, you are using `Q4_K_XL`. There is a lot of reporting about issues with certain quants of some Qwen based models (and this is one of those) pausing whenever they encounter a comma.
+
+2 users here who narrow it down to certain quants of some Qwen based models:
+https://github.com/ikawrakow/ik_llama.cpp/issues/464#issuecomment-2925026167
+https://github.com/ikawrakow/ik_llama.cpp/issues/464#issuecomment-2927631215
+
+2 users here who identify it happening with commas, and causing performance issues:
+https://github.com/ikawrakow/ik_llama.cpp/issues/476#issuecomment-2933070214
+https://github.com/ikawrakow/ik_llama.cpp/issues/476#issuecomment-2972846150 (this one even shows the effect on video)
+
+The first sighting on the github I know about:
+https://github.com/ikawrakow/ik_llama.cpp/issues/380#issuecomment-2850596618
+
+I'm not sure what the root cause is, but I wouldn't investigate it with this model, I think the smallest model it is reported on is `Qwen3-30B-A3B-128K-UD-Q4_K_XL`.
+
+---
+
+👤 **firecoperana** commented on **2025-07-03** at **23:30:53**
+
+The following fix works for me:
+
+Not sure if there is better way.
+
+---
+
+👤 **saood06** commented on **2025-07-04** at **00:05:25**
> Not sure if there is better way.
-That fix is only for the incorrect BOS token, which to me seems like an issue with existing models caused by the convert script which is where the fix should happen (with workarounds like [this](https://huggingface.co/gghfez/dots.llm1.inst-GGUF/discussions/1 for existing models) .
+That fix is only for the incorrect BOS token (not the comma's causing pausing, right?), which to me seems like an issue with existing models caused by the convert script which is where the fix should happen (with workarounds like [this](https://huggingface.co/gghfez/dots.llm1.inst-GGUF/discussions/1 for existing models) .
Both the `config.json` and `tokenizer_config.json` are set to null, which makes it take the default, but that doesn't seem to be correct for this model at least.
---
-👤 **firecoperana** commented the **2025-07-04** at **00:10:41**:
+👤 **firecoperana** commented on **2025-07-04** at **00:10:41**
Without the fix, the model uses comma as BOS token that causes the pause, as least for the quant I'm using. See the screenshot I posted. Id 11 is the comma. After I set to null, comma is not used as BOS token.
---
-👤 **saood06** commented the **2025-07-04** at **00:24:53**:
+👤 **saood06** commented on **2025-07-04** at **00:24:53**
> Without the fix, the model uses comma as BOS token that causes the pause, as least for the quant I'm using. See the screenshot I posted. Id 11 is the comma. After I set to null, comma is not used as BOS token.
Well the comma still causes a pause (I'm assuming) even if you avoid encountering it from the BOS token by setting the BOS token.
-I've seen the screenshot you posted, and I also see the wrong BOS token in my own GGUF that I converted as part of the testing here (from safetensors to BF16 GGUF). Using `--override-kv tokenizer.ggml.bos_token_id=int:-1` like you linked above fixes it for affected models, but for future models to not be affected I think the convert script needs to explicitly set it, without changing the default like the `llama.cpp` change you suggested does.
+I've seen the screenshot you posted, and I also see the wrong BOS token (`BOS token = 11 ','`) in my own GGUF that I converted as part of the testing here (from safetensors to BF16 GGUF).
+
+Using `--override-kv tokenizer.ggml.bos_token_id=int:-1` like you linked above fixes it for affected models, but for future models to not be affected I think the convert script needs to explicitly set it, without changing the default like the `llama.cpp` change you suggested does.
---
-👤 **saood06** submitted a review the **2025-07-09** at **17:29:30**: 💬 `COMMENTED`
+👤 **ikawrakow** commented on **2025-07-09** at **08:31:56**
+
+@saood06 What are your plans with this PR? You are disagreeing with the `V` reshaping comment, or is it about the `BOS` token, or perhaps both?
---
-👤 **saood06** commented the **2025-07-09** at **17:45:47**:
+👤 **saood06** commented on **2025-07-09** at **17:45:47**
> @saood06 What are your plans with this PR?
Sorry kept pushing off testing this more, but I just pushed a commit with both the recommended changes.
+I tested all four `-fa` and `-fmoe` combinations and it works, (without the V cur changes, non FA was outputting garbage).
+
>You are disagreeing [...] about the `BOS` token
-I still think the better solution would have been for the convert script to set it to `-1` when config.json has it set to `NULL` instead of leaving it to be set to default and changing the default for this architecture, but given the fact that every GGUF I saw on huggingface has this issue, changing the default so that users don't have to set `--override-kv tokenizer.ggml.bos_token_id=int:-1` (assuming they know to do that) or some other workaround makes sense.
+I still think the better solution would have been for the convert script to set it to `-1` when config.json has it set to `NULL` instead of leaving it to be set to default and changing the default for this architecture, but given the fact that every GGUF I saw on huggingface has this issue, changing the default so that users don't have to set `--override-kv tokenizer.ggml.bos_token_id=int:-1` (assuming they know to do that) or some other workaround to use existing GGUFs makes sense.
+
+I also changed the warmup behavior to work with this model (a MoE without a BOS token), it is still the same hacky solution but now it does account for models without a BOS token, and it did warmup properly for me now (not sure why it wasn't with BOS set to [token id 11/`,`]).
-I also changed the warmup behavior to work with this model (a MoE without a BOS token), it is still the same hacky solution but now it does account for models without a BOS token, and it did warmup for me now (not sure why it wasn't with BOS set to [token id 11/`,`]).
+Edit: Also handled the merge conflicts.
---
-👤 **ikawrakow** submitted a review the **2025-07-10** at **06:31:53**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-10** at **06:31:53**
\ No newline at end of file
diff --git a/github-data/pull_requests/574 - Change KQ mask padding to 64.md b/github-data/pull_requests/574 - Change KQ mask padding to 64.md
index abfa6ed30..bb37c0495 100644
--- a/github-data/pull_requests/574 - Change KQ mask padding to 64.md
+++ b/github-data/pull_requests/574 - Change KQ mask padding to 64.md
@@ -1,14 +1,17 @@
-### 🔀 [#574](https://github.com/ikawrakow/ik_llama.cpp/pull/574) - Change KQ mask padding to 64
+## 🔀 [Pull Request #574](https://github.com/ikawrakow/ik_llama.cpp/pull/574) - Change KQ mask padding to 64
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/kq_mask_padding_64` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-03 |
| **Updated** | 2025-07-03 |
+| **Merged** | 2025-07-03 |
---
-#### Description
+## 📄 Description
This is needed by the Vulkan back-end when coopmat2 is enabled.
@@ -16,9 +19,9 @@ It is 64 in mainline too.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-03** at **08:42:47**:
+👤 **ikawrakow** commented on **2025-07-03** at **08:42:47**
So, I updated the Nvidia driver on one of my two remote machines to 575, which enables Vulkan coopmat2. This triggers an assert in the Vulkan back-end, which is the reason for this PR fixing it. But I was more interested in the performance implications as I saw a factor of 3 lower Vulkan performance with coopmat1 compared to CUDA. As per [this comment](https://github.com/ikawrakow/ik_llama.cpp/discussions/562#discussioncomment-13630937), the difference between the CUDA and Vulkan back-ends on the same Nvidia GPU should be in the range of 20-25% when coopmat2 is enabled. Sadly, this is not the case on my RTX-4080. Coopmat2 is better, but PP is still a factor of 2 lower compared to CUDA. Here is a sweep bench for `Q4_0`-quantized LlaMA-3.1-8B-Instruct for u-batch of 1024 and FA enabled:
diff --git a/github-data/pull_requests/577 - Vulkan fused rms norm.md b/github-data/pull_requests/577 - Vulkan fused rms norm.md
new file mode 100644
index 000000000..0684a2c35
--- /dev/null
+++ b/github-data/pull_requests/577 - Vulkan fused rms norm.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #577](https://github.com/ikawrakow/ik_llama.cpp/pull/577) - Vulkan: fused rms norm
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/vulkan_fused_rms` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-03 |
+| **Updated** | 2025-07-03 |
+| **Merged** | 2025-07-03 |
+
+---
+
+## 📄 Description
+
+I see zero performance benefit, but at least we don't need to cpecial-case Vulkan when creating the graph.
\ No newline at end of file
diff --git a/github-data/pull_requests/577 - Vulkan_ fused rms norm.md b/github-data/pull_requests/577 - Vulkan_ fused rms norm.md
deleted file mode 100644
index 2f9e39ecd..000000000
--- a/github-data/pull_requests/577 - Vulkan_ fused rms norm.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🔀 [#577](https://github.com/ikawrakow/ik_llama.cpp/pull/577) - Vulkan: fused rms norm
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-07-03 |
-| **Updated** | 2025-07-03 |
-
----
-
-#### Description
-
-I see zero performance benefit, but at least we don't need to cpecial-case Vulkan when creating the graph.
\ No newline at end of file
diff --git a/github-data/pull_requests/578 - Do not crash when there is no DRY sampler.md b/github-data/pull_requests/578 - Do not crash when there is no DRY sampler.md
index 4197bafb1..47ed72f91 100644
--- a/github-data/pull_requests/578 - Do not crash when there is no DRY sampler.md
+++ b/github-data/pull_requests/578 - Do not crash when there is no DRY sampler.md
@@ -1,13 +1,16 @@
-### 🔀 [#578](https://github.com/ikawrakow/ik_llama.cpp/pull/578) - Do not crash when there is no DRY sampler
+## 🔀 [Pull Request #578](https://github.com/ikawrakow/ik_llama.cpp/pull/578) - Do not crash when there is no DRY sampler
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_missing_dry` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-03 |
| **Updated** | 2025-07-03 |
+| **Merged** | 2025-07-03 |
---
-#### Description
+## 📄 Description
-Closes #575
\ No newline at end of file
+Closes [#575](https://github.com/ikawrakow/ik_llama.cpp/issues/575)
\ No newline at end of file
diff --git a/github-data/pull_requests/579 - Fix debug build failure with RPC off.md b/github-data/pull_requests/579 - Fix debug build failure with RPC off.md
index d3c881d71..65a01ffb7 100644
--- a/github-data/pull_requests/579 - Fix debug build failure with RPC off.md
+++ b/github-data/pull_requests/579 - Fix debug build failure with RPC off.md
@@ -1,7 +1,16 @@
-### 🐛 [#579](https://github.com/ikawrakow/ik_llama.cpp/pull/579) - Fix debug build failure with RPC off
+## 🔀 [Pull Request #579](https://github.com/ikawrakow/ik_llama.cpp/pull/579) - Fix debug build failure with RPC off
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_rpc_off` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-03 |
-| **Updated** | 2025-07-03 |
\ No newline at end of file
+| **Updated** | 2025-07-03 |
+| **Merged** | 2025-07-03 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/58 - Fix compiler warnings.md b/github-data/pull_requests/58 - Fix compiler warnings.md
index 23f0afd6d..4478ee5c0 100644
--- a/github-data/pull_requests/58 - Fix compiler warnings.md
+++ b/github-data/pull_requests/58 - Fix compiler warnings.md
@@ -1,14 +1,17 @@
-### 🐛 [#58](https://github.com/ikawrakow/ik_llama.cpp/pull/58) - Fix compiler warnings
+## 🔀 [Pull Request #58](https://github.com/ikawrakow/ik_llama.cpp/pull/58) - Fix compiler warnings
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_ggml_common` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-17 |
| **Updated** | 2024-09-17 |
+| **Merged** | 2024-09-17 |
---
-#### Description
+## 📄 Description
I got tired of the "ISO C++ forbids anonymous structures" warnings that are due to the way the quants scales are defined in `ggml-common.h`, so fixing it with this PR.
diff --git a/github-data/pull_requests/580 - Vulkan_ add GGML_OP_FUSED_MUL_UNARY.md b/github-data/pull_requests/580 - Vulkan add GGML_OP_FUSED_MUL_UNARY.md
similarity index 91%
rename from github-data/pull_requests/580 - Vulkan_ add GGML_OP_FUSED_MUL_UNARY.md
rename to github-data/pull_requests/580 - Vulkan add GGML_OP_FUSED_MUL_UNARY.md
index 088e41cfd..9273d7011 100644
--- a/github-data/pull_requests/580 - Vulkan_ add GGML_OP_FUSED_MUL_UNARY.md
+++ b/github-data/pull_requests/580 - Vulkan add GGML_OP_FUSED_MUL_UNARY.md
@@ -1,14 +1,17 @@
-### 🔀 [#580](https://github.com/ikawrakow/ik_llama.cpp/pull/580) - Vulkan: add GGML_OP_FUSED_MUL_UNARY
+## 🔀 [Pull Request #580](https://github.com/ikawrakow/ik_llama.cpp/pull/580) - Vulkan: add GGML_OP_FUSED_MUL_UNARY
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/vulkan_fused_mul_unary` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-03 |
| **Updated** | 2025-07-03 |
+| **Merged** | 2025-07-03 |
---
-#### Description
+## 📄 Description
The tiniest of performance increases, barely measurable.
diff --git a/github-data/pull_requests/581 - Vulkan_ Disable multi-add for now.md b/github-data/pull_requests/581 - Vulkan Disable multi-add for now.md
similarity index 78%
rename from github-data/pull_requests/581 - Vulkan_ Disable multi-add for now.md
rename to github-data/pull_requests/581 - Vulkan Disable multi-add for now.md
index 356843227..8f488821a 100644
--- a/github-data/pull_requests/581 - Vulkan_ Disable multi-add for now.md
+++ b/github-data/pull_requests/581 - Vulkan Disable multi-add for now.md
@@ -1,14 +1,17 @@
-### 🔀 [#581](https://github.com/ikawrakow/ik_llama.cpp/pull/581) - Vulkan: Disable multi-add for now
+## 🔀 [Pull Request #581](https://github.com/ikawrakow/ik_llama.cpp/pull/581) - Vulkan: Disable multi-add for now
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/vulkan_disable_multi_add` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-03 |
| **Updated** | 2025-07-03 |
+| **Merged** | 2025-07-03 |
---
-#### Description
+## 📄 Description
...until we implement it for Vulkan, else it will run on the CPU and performance of MoE models will be terrible.
diff --git a/github-data/pull_requests/582 - Vulkan_ adding GGML_OP_MULTI_ADD implementation.md b/github-data/pull_requests/582 - Vulkan adding GGML_OP_MULTI_ADD implementation.md
similarity index 66%
rename from github-data/pull_requests/582 - Vulkan_ adding GGML_OP_MULTI_ADD implementation.md
rename to github-data/pull_requests/582 - Vulkan adding GGML_OP_MULTI_ADD implementation.md
index 2bfcc597f..acb2a97af 100644
--- a/github-data/pull_requests/582 - Vulkan_ adding GGML_OP_MULTI_ADD implementation.md
+++ b/github-data/pull_requests/582 - Vulkan adding GGML_OP_MULTI_ADD implementation.md
@@ -1,14 +1,17 @@
-### 🔀 [#582](https://github.com/ikawrakow/ik_llama.cpp/pull/582) - Vulkan: adding GGML_OP_MULTI_ADD implementation
+## 🔀 [Pull Request #582](https://github.com/ikawrakow/ik_llama.cpp/pull/582) - Vulkan: adding GGML_OP_MULTI_ADD implementation
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/vulkan_multi_add` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-04 |
| **Updated** | 2025-07-04 |
+| **Merged** | 2025-07-04 |
---
-#### Description
+## 📄 Description
This is relevant for MoE models. The performance improvement is surprisingly small. Somewhere it was mentioned that Vulkan kernel launch overhead is significantly larger than CUDA, so I would have expected a more significant performance benefit. For DeepSeek-Lite, the number of graph nodes in `ik_llama.cpp` with this PR is 1420 vs 1871 in mainline `llama.cpp`.
diff --git a/github-data/pull_requests/583 - Adding forgotten file.md b/github-data/pull_requests/583 - Adding forgotten file.md
index 58df1ea85..5cd3da075 100644
--- a/github-data/pull_requests/583 - Adding forgotten file.md
+++ b/github-data/pull_requests/583 - Adding forgotten file.md
@@ -1,7 +1,16 @@
-### 🔀 [#583](https://github.com/ikawrakow/ik_llama.cpp/pull/583) - Adding forgotten file
+## 🔀 [Pull Request #583](https://github.com/ikawrakow/ik_llama.cpp/pull/583) - Adding forgotten file
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/add_forgotten_multi_add` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-04 |
-| **Updated** | 2025-07-04 |
\ No newline at end of file
+| **Updated** | 2025-07-04 |
+| **Merged** | 2025-07-04 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/584 - Vulkan_ flash attention for DeepSeek models.md b/github-data/pull_requests/584 - Vulkan flash attention for DeepSeek models.md
similarity index 89%
rename from github-data/pull_requests/584 - Vulkan_ flash attention for DeepSeek models.md
rename to github-data/pull_requests/584 - Vulkan flash attention for DeepSeek models.md
index f77518162..60c23e6e4 100644
--- a/github-data/pull_requests/584 - Vulkan_ flash attention for DeepSeek models.md
+++ b/github-data/pull_requests/584 - Vulkan flash attention for DeepSeek models.md
@@ -1,14 +1,17 @@
-### 🔀 [#584](https://github.com/ikawrakow/ik_llama.cpp/pull/584) - Vulkan: flash attention for DeepSeek models
+## 🔀 [Pull Request #584](https://github.com/ikawrakow/ik_llama.cpp/pull/584) - Vulkan: flash attention for DeepSeek models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/vulkan_fattn` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-04 |
| **Updated** | 2025-07-05 |
+| **Merged** | 2025-07-05 |
---
-#### Description
+## 📄 Description
This PR is a cherry-pick of [PR 14509](https://github.com/ggml-org/llama.cpp/pull/14509) in mainline `llama.cpp` with minor adaptations, and adds FA for the DeepSeek models to the Vulkan back-end.
diff --git a/github-data/pull_requests/585 - Special handling of Seed Coder FIM tokens.md b/github-data/pull_requests/585 - Special handling of Seed Coder FIM tokens.md
index 3c0fbe70c..59e1a619e 100644
--- a/github-data/pull_requests/585 - Special handling of Seed Coder FIM tokens.md
+++ b/github-data/pull_requests/585 - Special handling of Seed Coder FIM tokens.md
@@ -1,14 +1,17 @@
-### 🔀 [#585](https://github.com/ikawrakow/ik_llama.cpp/pull/585) - Special handling of Seed Coder FIM tokens
+## 🔀 [Pull Request #585](https://github.com/ikawrakow/ik_llama.cpp/pull/585) - Special handling of Seed Coder FIM tokens
| **Author** | `fizzAI` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `main` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-04 |
| **Updated** | 2025-07-06 |
+| **Merged** | 2025-07-06 |
---
-#### Description
+## 📄 Description
Needed this for some quants and realized it didn't support it already, so figured I'd just PR upstream
Seems a bit odd to need to figure out model families by vocab size? But I'm not sure of a better way to do it, so left it as-is for now
@@ -21,54 +24,44 @@ Seems a bit odd to need to figure out model families by vocab size? But I'm not
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **fizzAI** commented the **2025-07-04** at **21:23:47**:
+👤 **fizzAI** commented on **2025-07-04** at **21:23:47**
Actually need to merge some tokenizer support from regular lcpp too, please hold lol
---
-👤 **fizzAI** commented the **2025-07-04** at **22:43:32**:
+👤 **fizzAI** commented on **2025-07-04** at **22:43:32**
Appears to work, now
---
-👤 **ikawrakow** submitted a review the **2025-07-05** at **09:29:56**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-07-05** at **09:29:56** on `convert_hf_to_gguf.py`:
+👤 **ikawrakow** started a conversation on `convert_hf_to_gguf.py` on **2025-07-05** at **09:29:56**
It is the only model that has a vocabulary of 155,136 tokens?
----
-
-👤 **ikawrakow** commented during a code review the **2025-07-05** at **09:30:24** on `include/llama.h`:
-
-Pleas format the same way as the surrounding code.
+> 👤 **fizzAI** replied on **2025-07-05** at **19:35:38**
+>
+> I'm not 100% sure honestly (nor do I have any idea how I would check that off the top of my head), but it's how CodeLlama handles it so it should be fine I thought
---
-👤 **ikawrakow** commented during a code review the **2025-07-05** at **09:30:33** on `src/llama.cpp`:
+👤 **ikawrakow** started a conversation on `include/llama.h` on **2025-07-05** at **09:30:24**
Pleas format the same way as the surrounding code.
----
-
-👤 **ikawrakow** submitted a review the **2025-07-05** at **09:30:54**: ✅ `APPROVED`
+> 👤 **fizzAI** replied on **2025-07-05** at **19:35:56**
+>
+> D: damn my editor
---
-👤 **fizzAI** submitted a review the **2025-07-05** at **19:35:38**: 💬 `COMMENTED`
-
----
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-07-05** at **09:30:33**
-👤 **fizzAI** submitted a review the **2025-07-05** at **19:35:56**: 💬 `COMMENTED`
+Pleas format the same way as the surrounding code.
---
-👤 **fizzAI** commented during a code review the **2025-07-05** at **19:35:56** on `include/llama.h`:
-
-D: damn my editor
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-05** at **09:30:54**
\ No newline at end of file
diff --git a/github-data/pull_requests/587 - Fix crash when there is no DRY sampler.md b/github-data/pull_requests/587 - Fix crash when there is no DRY sampler.md
index 199f0e2e4..5070fc453 100644
--- a/github-data/pull_requests/587 - Fix crash when there is no DRY sampler.md
+++ b/github-data/pull_requests/587 - Fix crash when there is no DRY sampler.md
@@ -1,14 +1,17 @@
-### 🐛 [#587](https://github.com/ikawrakow/ik_llama.cpp/pull/587) - Fix crash when there is no DRY sampler
+## 🔀 [Pull Request #587](https://github.com/ikawrakow/ik_llama.cpp/pull/587) - Fix crash when there is no DRY sampler
| **Author** | `firecoperana` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `fcp/fix-missing-dry` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-05 |
| **Updated** | 2025-07-05 |
+| **Assignees** | `firecoperana` |
---
-#### Description
+## 📄 Description
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
diff --git a/github-data/pull_requests/588 - Fix server crash when there is no DRY sampler.md b/github-data/pull_requests/588 - Fix server crash when there is no DRY sampler.md
index d944e9593..21215c2c9 100644
--- a/github-data/pull_requests/588 - Fix server crash when there is no DRY sampler.md
+++ b/github-data/pull_requests/588 - Fix server crash when there is no DRY sampler.md
@@ -1,14 +1,18 @@
-### 🐛 [#588](https://github.com/ikawrakow/ik_llama.cpp/pull/588) - Fix server crash when there is no DRY sampler
+## 🔀 [Pull Request #588](https://github.com/ikawrakow/ik_llama.cpp/pull/588) - Fix server crash when there is no DRY sampler
| **Author** | `firecoperana` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `fcp/fix-missing-dry` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-05 |
| **Updated** | 2025-07-06 |
+| **Merged** | 2025-07-06 |
+| **Assignees** | `firecoperana` |
---
-#### Description
+## 📄 Description
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
@@ -18,8 +22,8 @@
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-07-06** at **05:51:30**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-06** at **05:51:30**
I missed that one. Thanks!
\ No newline at end of file
diff --git a/github-data/pull_requests/589 - CUDA_ small PP performance improvement for MoE models.md b/github-data/pull_requests/589 - CUDA small PP performance improvement for MoE models.md
similarity index 81%
rename from github-data/pull_requests/589 - CUDA_ small PP performance improvement for MoE models.md
rename to github-data/pull_requests/589 - CUDA small PP performance improvement for MoE models.md
index 880b231a6..ed17048c0 100644
--- a/github-data/pull_requests/589 - CUDA_ small PP performance improvement for MoE models.md
+++ b/github-data/pull_requests/589 - CUDA small PP performance improvement for MoE models.md
@@ -1,14 +1,17 @@
-### 🔀 [#589](https://github.com/ikawrakow/ik_llama.cpp/pull/589) - CUDA: small PP performance improvement for MoE models
+## 🔀 [Pull Request #589](https://github.com/ikawrakow/ik_llama.cpp/pull/589) - CUDA: small PP performance improvement for MoE models
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_quantized_fmoe` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-06 |
| **Updated** | 2025-07-07 |
+| **Merged** | 2025-07-07 |
---
-#### Description
+## 📄 Description
This PR brings a small (2-3%) prompt processing performance improvement on CUDA for quantized MoE models (when `-fmoe` is used).
diff --git a/github-data/pull_requests/592 - Another minor readme update.md b/github-data/pull_requests/592 - Another minor readme update.md
index 72684d1ed..fa5c1bde8 100644
--- a/github-data/pull_requests/592 - Another minor readme update.md
+++ b/github-data/pull_requests/592 - Another minor readme update.md
@@ -1,14 +1,16 @@
-### 🔀 [#592](https://github.com/ikawrakow/ik_llama.cpp/pull/592) - Another minor readme update
+## 🔀 [Pull Request #592](https://github.com/ikawrakow/ik_llama.cpp/pull/592) - Another minor readme update
| **Author** | `saood06` |
| :--- | :--- |
| **State** | ✅ **Open** |
+| **Source Branch** | `s6/readme-minor2` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-08 |
| **Updated** | 2025-07-09 |
---
-#### Description
+## 📄 Description
I think this looks cleaner.
@@ -18,12 +20,24 @@ They didn't belong in that section, but now I don't know where it would go at al
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-09** at **12:00:32**:
+👤 **ikawrakow** commented on **2025-07-09** at **12:00:32**
> They didn't belong in that section, but now I don't know where it would go at all (Features?).
They can go under "Quantization additions". `IQ1_M_R4` and `IQ1_S_R4` are distinct quantization types, not just repacked `IQ1_M` and `IQ1_S`.
-Not sure if the tabular format for the new models works well. The table is quite squeezed already, and now Hunyuan has been added and dits.llm1 is pending. Do you know how you want to reformat/change to accommodate additional models?
\ No newline at end of file
+Not sure if the tabular format for the new models works well. The table is quite squeezed already, and now Hunyuan has been added and dits.llm1 is pending. Do you know how you want to reformat/change to accommodate additional models?
+
+---
+
+👤 **saood06** commented on **2025-07-09** at **19:45:59**
+
+> They can go under "Quantization additions". `IQ1_M_R4` and `IQ1_S_R4` are distinct quantization types, not just repacked `IQ1_M` and `IQ1_S`.
+
+Added them (by making a Misc) section.
+
+> Not sure if the tabular format for the new models works well. The table is quite squeezed already, and now Hunyuan has been added and dits.llm1 is pending. Do you know how you want to reformat/change to accommodate additional models?
+
+I agree that it doesn't work well, but I wanted to try something to get rid of the block of text. I do think on top of just accommodating new models, this section might not belong in the "Latest News", since that makes it not mention all of the model supported inherited by mainline (and thus may confuse a user to thinking the listed models are the only models supported).
\ No newline at end of file
diff --git a/github-data/pull_requests/593 - Faster prompt processing for IQ2_KS_ IQ2_K_ IQ2_K_R4.md b/github-data/pull_requests/593 - Faster prompt processing for IQ2_KS IQ2_K IQ2_K_R4.md
similarity index 79%
rename from github-data/pull_requests/593 - Faster prompt processing for IQ2_KS_ IQ2_K_ IQ2_K_R4.md
rename to github-data/pull_requests/593 - Faster prompt processing for IQ2_KS IQ2_K IQ2_K_R4.md
index 14860f09a..cbb97fc71 100644
--- a/github-data/pull_requests/593 - Faster prompt processing for IQ2_KS_ IQ2_K_ IQ2_K_R4.md
+++ b/github-data/pull_requests/593 - Faster prompt processing for IQ2_KS IQ2_K IQ2_K_R4.md
@@ -1,14 +1,17 @@
-### 🔀 [#593](https://github.com/ikawrakow/ik_llama.cpp/pull/593) - Faster prompt processing for IQ2_KS, IQ2_K, IQ2_K_R4
+## 🔀 [Pull Request #593](https://github.com/ikawrakow/ik_llama.cpp/pull/593) - Faster prompt processing for IQ2_KS, IQ2_K, IQ2_K_R4
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_faster_iq2k` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-08 |
| **Updated** | 2025-07-08 |
+| **Merged** | 2025-07-08 |
---
-#### Description
+## 📄 Description
Here a comparison to the main branch for LlaMA-3.1-8B on RTX-4080
diff --git a/github-data/pull_requests/595 - CUDA Faster prompt processing for several quantization types.md b/github-data/pull_requests/595 - CUDA Faster prompt processing for several quantization types.md
new file mode 100644
index 000000000..1003424f9
--- /dev/null
+++ b/github-data/pull_requests/595 - CUDA Faster prompt processing for several quantization types.md
@@ -0,0 +1,68 @@
+## 🔀 [Pull Request #595](https://github.com/ikawrakow/ik_llama.cpp/pull/595) - CUDA: Faster prompt processing for several quantization types
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/apply_cuda_faster_iq3k` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-09 |
+| **Updated** | 2025-07-10 |
+| **Merged** | 2025-07-10 |
+
+---
+
+## 📄 Description
+
+This PR slightly improves prompt processing speed for `IQ3_K, IQ3_K_R4, IQ4_KS, IQ4_KS_R4, IQ4_K, IQ4_K_R4` and `IQ4_XS`.
+
+Here some PP-512 results for LlaMA-3.1-8B on RTX-4080
+
+ | model | test | t/s (main) | t/s (PR) | Speedup |
+| ------------------ | ------------: | ---------------: | ---------------: | -------: |
+| llama 8B IQ3_K | pp512 | 6467.57 ± 18.48 | 6628.75 ± 14.24 | 1.025 |
+| llama 8B IQ3_K_R4 | pp512 | 6102.36 ± 14.63 | 6464.58 ± 10.89 | 1.059 |
+| llama 8B IQ4_K | pp512 | 6442.38 ± 17.97 | 6625.94 ± 22.90 | 1.028 |
+| llama 8B IQ4_K_R4 | pp512 | 6391.48 ± 16.77 | 6450.58 ± 11.54 | 1.009 |
+| llama 8B IQ4_KS | pp512 | 7732.35 ± 26.04 | 8074.07 ± 16.37 | 1.044 |
+| llama 8B IQ4_KS_R | pp512 | 7912.27 ± 21.10 | 8178.74 ± 28.14 | 1.034 |
+| llama 8B IQ4_XS | pp512 | 7748.68 ± 20.75 | 8149.86 ± 28.13 | 1.051 |
+
+---
+
+## 💬 Conversation
+
+👤 **Nexesenex** commented on **2025-07-09** at **14:42:26**
+
+Test in full Cuda offload on 3 Ampere GPUs (3090-3090-RTXA4000), TS 3-3-2+output, MMQ, and BBS 128 (on Croco.cpp) :
+
+No trouble on my end for merging, compiling, and infering,
+
+I tested on a 111b Command-A model quantized as such :
+llama_model_loader: - type iq5_ks: 1 tensors
+llama_model_loader: - type iq4_ks_r4: 320 tensors
+llama_model_loader: - type iq5_ks_r4: 128 tensors
+Gross average PP : around 445 t/s, 435 before.
+
+On a Mistral 123b :
+llama_model_loader: - type f32: 177 tensors
+llama_model_loader: - type iq3_k: 352 tensors
+llama_model_loader: - type iq5_k: 89 tensors
+llama_model_loader: - type iq6_k: 1 tensors
+llama_model_loader: - type iq4_ks: 176 tensors
+Gross average PP : 340 t/s, 330 before.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-09** at **14:49:20**
+
+Thanks for testing. Yes, it is a 1-5% kind of improvement, nothing major.
+
+---
+
+👤 **ubergarm** commented on **2025-07-09** at **20:14:20**
+
+Seeing roughly 1.2~2.6% speed-up on a `Qwen3-14B-IQ3_K` mix of mostly iq4_k and iq3_k fully offloaded on my home rig 3090TI FE. I checked at default batch sizes and also `-ub 4096 -b 4096` where it was still faster albeit slightly less gains vs default batch sizes.
+
+
+
+Nice!
\ No newline at end of file
diff --git a/github-data/pull_requests/595 - CUDA_ Faster prompt processing for several quantization types.md b/github-data/pull_requests/595 - CUDA_ Faster prompt processing for several quantization types.md
deleted file mode 100644
index c07e244be..000000000
--- a/github-data/pull_requests/595 - CUDA_ Faster prompt processing for several quantization types.md
+++ /dev/null
@@ -1,25 +0,0 @@
-### 🔀 [#595](https://github.com/ikawrakow/ik_llama.cpp/pull/595) - CUDA: Faster prompt processing for several quantization types
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-07-09 |
-| **Updated** | 2025-07-10 |
-
----
-
-#### Description
-
-This PR slightly improves prompt processing speed for `IQ3_K, IQ3_K_R4, IQ4_KS, IQ4_KS_R4, IQ4_K, IQ4_K_R4` and `IQ4_XS`.
-
-Here some PP-512 results for LlaMA-3.1-8B on RTX-4080
-
- | model | test | t/s (main) | t/s (PR) | Speedup |
-| ------------------ | ------------: | ---------------: | ---------------: | -------: |
-| llama 8B IQ3_K | pp512 | 6467.57 ± 18.48 | 6628.75 ± 14.24 | 1.025 |
-| llama 8B IQ3_K_R4 | pp512 | 6102.36 ± 14.63 | 6464.58 ± 10.89 | 1.059 |
-| llama 8B IQ4_K | pp512 | 6442.38 ± 17.97 | 6625.94 ± 22.90 | 1.028 |
-| llama 8B IQ4_K_R4 | pp512 | 6391.48 ± 16.77 | 6450.58 ± 11.54 | 1.009 |
-| llama 8B IQ4_KS | pp512 | 7732.35 ± 26.04 | 8074.07 ± 16.37 | 1.044 |
-| llama 8B IQ4_KS_R | pp512 | 7912.27 ± 21.10 | 8178.74 ± 28.14 | 1.034 |
-| llama 8B IQ4_XS | pp512 | 7748.68 ± 20.75 | 8149.86 ± 28.13 | 1.051 |
\ No newline at end of file
diff --git a/github-data/pull_requests/598 - Vulkan_ iquants and flash attention split_k_reduce improvement.md b/github-data/pull_requests/598 - Vulkan iquants and flash attention split_k_reduce improvement.md
similarity index 57%
rename from github-data/pull_requests/598 - Vulkan_ iquants and flash attention split_k_reduce improvement.md
rename to github-data/pull_requests/598 - Vulkan iquants and flash attention split_k_reduce improvement.md
index 61f4196b7..ef4f92180 100644
--- a/github-data/pull_requests/598 - Vulkan_ iquants and flash attention split_k_reduce improvement.md
+++ b/github-data/pull_requests/598 - Vulkan iquants and flash attention split_k_reduce improvement.md
@@ -1,14 +1,17 @@
-### 🔀 [#598](https://github.com/ikawrakow/ik_llama.cpp/pull/598) - Vulkan: iquants and flash attention split_k_reduce improvement
+## 🔀 [Pull Request #598](https://github.com/ikawrakow/ik_llama.cpp/pull/598) - Vulkan: iquants and flash attention split_k_reduce improvement
| **Author** | `firecoperana` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `fcp/vulkan_01` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-11 |
| **Updated** | 2025-07-16 |
+| **Assignees** | `firecoperana` |
---
-#### Description
+## 📄 Description
Vulkan small token gen improvement
@@ -22,23 +25,141 @@ Taken from https://github.com/ggml-org/llama.cpp/pull/14485 and https://github.c
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-07-11** at **19:14:27**:
+👤 **ubergarm** commented on **2025-07-11** at **18:17:52**
-I had to refactor the mainline llama-sweep-bench for some llama_memory_ api business but seems to still be working. Added that result from mainline to the above results. So ik fork seems faster with or without this PR fwiw :shrug:
+so looks like two commits, one is to split up kv into more smaller threads and the other is for `iq1_s iq1_m iq2_xxs iq2_xs iq2_s iq3_xxs iq3_s` quants specifically... huh, not iq3_xs though...
+
+i'll see if i have a test quant around... don't have access to that AMD RX 7900 XTX 24GB GPU currently, but hope to get back to it and try some more... these small quant speed-ups could help with the smallest deepseek eventually
+
+---
+
+👤 **ubergarm** commented on **2025-07-11** at **18:44:21**
+
+Well, I whipped up a Qwen3-14B quant using those tensors and did a comparison between this PR and main branch. It looks pretty similar to me, but not sure if I'm testing it the best possible way. Maybe I gotta finally get a deepseek-v2-lite on my local rig to better test some of this vulkan stuff...
+
+Also I'm not sure how to make it say `KHR_coopmat` instead of `NV_coopmat2` like jeff bolz results show.
+
+
+
+
+
+👈 quant, command, and data
+
+```bash
+cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_VULKAN=ON
+cmake --build build --config Release -j $(nproc)
+
+./build/bin/llama-sweep-bench \
+ --model "$model" \
+ -fa \
+ -c 16896 \
+ -ngl 99 \
+ --warmup-batch \
+ --threads 1
+
+llama_model_loader: - type q4_K: 1 tensors - token_embd
+llama_model_loader: - type q6_K: 1 tensors - output
+llama_model_loader: - type iq2_xs: 80 tensors - ffn_(gate|up)
+llama_model_loader: - type iq3_xxs: 40 tensors - ffn_down
+llama_model_loader: - type iq3_s: 160 tensors - attn.*
+```
+
+# main@c53cb652 ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.208 | 2458.02 | 1.683 | 76.03 |
+| 512 | 128 | 512 | 0.211 | 2422.06 | 1.653 | 77.44 |
+| 512 | 128 | 1024 | 0.214 | 2387.16 | 1.667 | 76.78 |
+| 512 | 128 | 1536 | 0.217 | 2361.10 | 1.703 | 75.16 |
+| 512 | 128 | 2048 | 0.220 | 2327.77 | 1.694 | 75.54 |
+| 512 | 128 | 2560 | 0.224 | 2290.20 | 1.714 | 74.66 |
+| 512 | 128 | 3072 | 0.224 | 2282.93 | 1.710 | 74.85 |
+| 512 | 128 | 3584 | 0.227 | 2257.50 | 1.727 | 74.13 |
+| 512 | 128 | 4096 | 0.230 | 2229.64 | 1.734 | 73.83 |
+| 512 | 128 | 4608 | 0.235 | 2179.17 | 1.745 | 73.34 |
+| 512 | 128 | 5120 | 0.235 | 2176.58 | 1.799 | 71.14 |
+| 512 | 128 | 5632 | 0.238 | 2147.92 | 1.812 | 70.63 |
+| 512 | 128 | 6144 | 0.251 | 2036.44 | 1.787 | 71.64 |
+| 512 | 128 | 6656 | 0.247 | 2076.01 | 1.836 | 69.71 |
+| 512 | 128 | 7168 | 0.253 | 2026.23 | 1.851 | 69.16 |
+| 512 | 128 | 7680 | 0.251 | 2041.68 | 1.852 | 69.12 |
+| 512 | 128 | 8192 | 0.255 | 2006.10 | 1.846 | 69.33 |
+| 512 | 128 | 8704 | 0.258 | 1986.34 | 1.861 | 68.77 |
+| 512 | 128 | 9216 | 0.260 | 1967.35 | 1.876 | 68.23 |
+| 512 | 128 | 9728 | 0.264 | 1937.91 | 1.896 | 67.51 |
+| 512 | 128 | 10240 | 0.267 | 1916.32 | 1.906 | 67.17 |
+| 512 | 128 | 10752 | 0.269 | 1903.98 | 1.911 | 66.98 |
+| 512 | 128 | 11264 | 0.272 | 1879.41 | 1.928 | 66.39 |
+| 512 | 128 | 11776 | 0.276 | 1857.84 | 1.943 | 65.89 |
+| 512 | 128 | 12288 | 0.278 | 1841.44 | 1.947 | 65.74 |
+| 512 | 128 | 12800 | 0.281 | 1820.44 | 1.966 | 65.12 |
+| 512 | 128 | 13312 | 0.286 | 1792.70 | 1.988 | 64.39 |
+| 512 | 128 | 13824 | 0.289 | 1774.43 | 1.997 | 64.09 |
+| 512 | 128 | 14336 | 0.292 | 1750.43 | 2.005 | 63.85 |
+| 512 | 128 | 14848 | 0.296 | 1732.57 | 2.013 | 63.59 |
+| 512 | 128 | 15360 | 0.301 | 1702.95 | 2.044 | 62.63 |
+| 512 | 128 | 15872 | 0.304 | 1685.24 | 2.066 | 61.95 |
+| 512 | 128 | 16384 | 0.306 | 1671.46 | 2.061 | 62.11 |
+
+# PR598@d539037c ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 0.208 | 2462.62 | 1.626 | 78.73 |
+| 512 | 128 | 512 | 0.210 | 2433.83 | 1.656 | 77.30 |
+| 512 | 128 | 1024 | 0.214 | 2394.43 | 1.665 | 76.85 |
+| 512 | 128 | 1536 | 0.216 | 2372.55 | 1.678 | 76.29 |
+| 512 | 128 | 2048 | 0.219 | 2333.11 | 1.694 | 75.55 |
+| 512 | 128 | 2560 | 0.222 | 2307.21 | 1.706 | 75.03 |
+| 512 | 128 | 3072 | 0.225 | 2272.91 | 1.721 | 74.35 |
+| 512 | 128 | 3584 | 0.228 | 2247.89 | 1.742 | 73.49 |
+| 512 | 128 | 4096 | 0.231 | 2215.40 | 1.751 | 73.10 |
+| 512 | 128 | 4608 | 0.235 | 2183.02 | 1.763 | 72.60 |
+| 512 | 128 | 5120 | 0.236 | 2166.51 | 1.775 | 72.12 |
+| 512 | 128 | 5632 | 0.240 | 2136.79 | 1.792 | 71.42 |
+| 512 | 128 | 6144 | 0.243 | 2108.75 | 1.797 | 71.21 |
+| 512 | 128 | 6656 | 0.246 | 2079.26 | 1.814 | 70.57 |
+| 512 | 128 | 7168 | 0.249 | 2059.77 | 1.834 | 69.79 |
+| 512 | 128 | 7680 | 0.251 | 2037.22 | 1.851 | 69.14 |
+| 512 | 128 | 8192 | 0.255 | 2011.55 | 1.859 | 68.87 |
+| 512 | 128 | 8704 | 0.259 | 1980.52 | 1.872 | 68.38 |
+| 512 | 128 | 9216 | 0.262 | 1957.45 | 1.893 | 67.61 |
+| 512 | 128 | 9728 | 0.264 | 1939.58 | 1.912 | 66.94 |
+| 512 | 128 | 10240 | 0.268 | 1912.87 | 1.912 | 66.96 |
+| 512 | 128 | 10752 | 0.270 | 1895.92 | 1.925 | 66.48 |
+| 512 | 128 | 11264 | 0.274 | 1870.20 | 1.936 | 66.10 |
+| 512 | 128 | 11776 | 0.278 | 1842.63 | 1.960 | 65.29 |
+| 512 | 128 | 12288 | 0.280 | 1830.70 | 1.968 | 65.03 |
+| 512 | 128 | 12800 | 0.284 | 1801.34 | 1.980 | 64.63 |
+| 512 | 128 | 13312 | 0.288 | 1780.32 | 2.002 | 63.93 |
+| 512 | 128 | 13824 | 0.290 | 1768.19 | 2.014 | 63.56 |
+| 512 | 128 | 14336 | 0.293 | 1745.03 | 2.023 | 63.27 |
+| 512 | 128 | 14848 | 0.297 | 1725.13 | 2.032 | 62.98 |
+| 512 | 128 | 15360 | 0.301 | 1700.14 | 2.057 | 62.22 |
+| 512 | 128 | 15872 | 0.305 | 1678.24 | 2.068 | 61.91 |
+| 512 | 128 | 16384 | 0.307 | 1669.25 | 2.072 | 61.77 |
+
+
+
+
+---
+
+👤 **ubergarm** commented on **2025-07-11** at **19:14:27**
+
+I had to refactor the mainline llama-sweep-bench for some llama_memory_ api business but seems to still be working. Added that result from mainline to the above results. So ik fork seems faster with or without this PR fwiw :shrug: (at least for this specific test quant)
---
-👤 **firecoperana** commented the **2025-07-11** at **21:28:51**:
+👤 **firecoperana** commented on **2025-07-11** at **21:28:51**
For the second commit, performance gain is for kv<512 if I understand it correctly.
---
-👤 **ikawrakow** commented the **2025-07-12** at **09:48:22**:
+👤 **ikawrakow** commented on **2025-07-12** at **09:48:22**
> Also I'm not sure how to make it say KHR_coopmat instead of NV_coopmat2 like jeff bolz results show.
@@ -48,27 +169,28 @@ Apart from performance, did someone test that it works correctly?
---
-👤 **ikawrakow** commented the **2025-07-12** at **09:51:29**:
+👤 **ikawrakow** commented on **2025-07-12** at **09:51:29**
Oh, btw, the not yet merged 14555 looks much more interesting, with quite significant performance gains for DeepSeek.
---
-👤 **firecoperana** commented the **2025-07-12** at **12:06:14**:
+👤 **firecoperana** commented on **2025-07-12** at **12:06:14**
14555 just merged
---
-👤 **ubergarm** commented the **2025-07-12** at **16:30:59**:
+👤 **ubergarm** commented on **2025-07-12** at **16:30:59**
> Apart from performance, did someone test that it works correctly?
-Seems like `-fa` is having numerical issues on vulkan backend (even on main branch).
+Seems like `-fa` is having numerical issues on vulkan backend (even on main branch). I tried a "pure" Q4_0 as well as a smaller faster test quant below.
I ran perplexity on my test `Qwen3-14B-IQ2_XS.gguf` quant for some configurations with mixed results.
| branch@sha | backend | FA | perplexity |
+| --- | --- | --- | ---|
| main@c53cb652 | vulkan | off | 10.3251 +/- 0.08240 |
| main@c53cb652 | vulkan | enabled | nan |
| main@c53cb652 | cuda | off | 10.3244 +/- 0.08241 |
@@ -100,13 +222,19 @@ Final estimate: PPL = 10.3231 +/- 0.08240
---
-👤 **ubergarm** commented the **2025-07-12** at **18:37:31**:
+👤 **ikawrakow** commented on **2025-07-12** at **18:07:35**
+
+Do we get NaNs also in mainline with Vulkan and FA enabled? Or did something get broken with the port or my modifications?
+
+---
+
+👤 **ubergarm** commented on **2025-07-12** at **18:37:31**
> Do we get NaNs also in mainline with Vulkan and FA enabled? Or did something get broken with the port or my modifications?
-Right, just tried latest mainline llama.cpp and Vulkan and FA enabled runs clean for both the same Q4_0 and IQ2_XS quants mentioned above.
+Right, just checked latest mainline llama.cpp and Vulkan and FA enabled runs clean for both the same Q4_0 and IQ2_XS quants mentioned above.
-So yes, seems like an issue with the port breaking Vulkan FA enabled path numerical stability. (prior and unrelated to this PR).
+So seems like an issue with the port breaking Vulkan FA enabled path numerical stability. (prior and unrelated to this PR).
```bash
$ cd llama.cpp
@@ -136,23 +264,25 @@ ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp s
Final estimate: PPL = 10.3281 +/- 0.08243
```
-I also spot checked my new `DeepSeek-V2-Lite-Q4_0.gguf` test quant with vulkan backend and same thing, with `-fa` it throws `nan` on the second chunk. Removing `-fa` and keeping `-fmoe -mla 3 -amb 512 -ngl 99` fully offloaded on the 3090TI it is running clean so far after 50 chunks.
+I also spot checked my new `DeepSeek-V2-Lite-Q4_0.gguf` test quant with vulkan backend and getting nans on ik_llama.cpp. With `-fa` it throws `nan` on the second chunk.
+
+Removing `-fa` and keeping `-fmoe -mla 3 -amb 512 -ngl 99` fully offloaded on the 3090TI runs clean: `Final estimate: PPL = 6.9579 +/- 0.04277`
---
-👤 **firecoperana** commented the **2025-07-12** at **19:26:57**:
+👤 **firecoperana** commented on **2025-07-12** at **19:26:57**
https://github.com/ggml-org/llama.cpp/pull/12776 Here is a fix of NaN for flash attention in mainline. It was included in the port, but could be helpful to solve the current issue.
---
-👤 **firecoperana** commented the **2025-07-13** at **00:46:36**:
+👤 **firecoperana** commented on **2025-07-13** at **00:46:36**
It's introduced in https://github.com/ikawrakow/ik_llama.cpp/pull/584. If I roll back to build before that, I don't see issue with fa.
---
-👤 **ubergarm** commented the **2025-07-13** at **04:34:49**:
+👤 **ubergarm** commented on **2025-07-13** at **04:34:49**
@firecoperana wait, i forget are you using nvidia GPU and if so are you testing with `KHR_coopmat` or `NV_coopmat2` ?
@@ -166,19 +296,18 @@ ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp s
It also worked fine on an AMD RX 7900 XTX 24GB VRAM GPU test rig.
```
-ggml_vulkan: 0 = Radeon RX 7900 XTX (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | mat
-rix cores: KHR_coopmat
+ggml_vulkan: 0 = Radeon RX 7900 XTX (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
```
-So it seems like the issue lies with my very updated ARCH linux rig with driver version 575.64 and `NV_coopmat2`. Guessing that path wasn't tested as well if others are not on the bleeding edge.
+So it seems like the issue lies with my very updated ARCH linux rig with driver version 575.64 and `NV_coopmat2`. Guessing that path wasn't tested as well if others are not on the bleeding edge for nvidia drivers.
---
-👤 **ubergarm** commented the **2025-07-13** at **06:10:23**:
+👤 **ubergarm** commented on **2025-07-13** at **06:10:23**
Okay, ran 4x sweep benches to compare speed using `KHR_coopmat` on DeepSeek-V2-Lite-Q4_0 between this PR and main branch on vulkan. Also ran main branch with CUDA backend for comparison.
-Seems like this PR really helps PP for DeepSeek-V2-Lite on vulkan backend approaching CUDA (without fmoe) speeds.
+Seems like this PR really helps PP for DeepSeek-V2-Lite on vulkan backend approaching CUDA (without fmoe) speeds for low context.
fwiw it is also running pretty good on the AMD RX 7900 XTX GPU.
@@ -385,7 +514,7 @@ model=DeepSeek-V2-Lite-Q4_0.gguf
---
-👤 **firecoperana** commented the **2025-07-13** at **13:29:51**:
+👤 **firecoperana** commented on **2025-07-13** at **13:29:51**
I tried KHR_coopmat and none matrix cores. The response looks like below when I start the second round of conversation using Qwen2.5 14B Q4_0:
I can help with various tasks suchFlushKeyId their刻 index弈etur İsHub()
@@ -394,10 +523,87 @@ cession/***/_-_oidalglichsy propriéarya Gol鲜 �回 peelediran catalogsنق f
---
-👤 **firecoperana** commented the **2025-07-15** at **12:28:43**:
+👤 **ubergarm** commented on **2025-07-13** at **15:47:22**
+
+@firecoperana
+
+> The response looks like below when I start the second round of conversation
+
+Hrmm... Yes, thanks for checking. You are correct, in actual usage with `llama-server` I'm seeing gibberish. Interesting that the perplexity seems okay though. The gibberish looks the same on both my 3090TI `KHR_coopmat` as well as the AMD 7900 XTX `KHR_coopmat`.
+
+However, yes, if i do `git checkout 0678427f8` (the commit previous to [#584](https://github.com/ikawrakow/ik_llama.cpp/issues/584)), then chat works fine with `-fa` enabled.
+
+
+
+👈 Details
+
+```bash
+# error first happens on PR584
+$ git checkout 4622fadc2
+
+$ vi ggml/src/CMakeLists.txt
+ # test_shader_extension_support(
+ # "GL_NV_cooperative_matrix2"
+ # "${CMAKE_CURRENT_SOURCE_DIR}/vulkan-shaders/test_coopmat2_support.comp"
+ # "GGML_VULKAN_COOPMAT2_GLSLC_SUPPORT"
+ # )
+cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_VULKAN=ON
+cmake --build build --config Release -j $(nproc)
+
+model=Qwen3-14B-Q4_0.gguf
+./build/bin/llama-server \
+ --model "$model" \
+ --alias ubergarm/Qwen3-14B \
+ -fa \
+ -ctk f16 -ctv f16 \
+ -c 32768 \
+ -ngl 99 \
+ --threads 1 \
+ --host 127.0.0.1 \
+ --port 8080
+
+ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
+
+llama_model_loader: - type f32: 161 tensors
+llama_model_loader: - type q4_0: 280 tensors
+llama_model_loader: - type q4_K: 1 tensors
+llama_model_loader: - type q6_K: 1 tensors
+
+>>> User:
+
+Count from 1 to 10 in French.
+
+>>> Assistant:
+
+
+Okay, the user wants me to count from 1 to 10 in French. Let me recall the French numbers. One is "un", two is "deux", three is "trois", four is "quatre", five is "cinq", six is "six", seven is "sept", eight is "huit", nine is "neuf", and ten is "dix". Wait, let me double-check each to make sure I didn't mix any up. "Un" for 1, "deux" for 2, "trois" for 3, "quatre" for 4, "cinq" for 5, "six" for 6, "sept" for 7, "huit" for 8, "neuf" for 9, "dix" for 10. Yeah, that seems right. I think that's correct. I'll list them out in order from 1 to 10. Let me make sure there are no spelling mistakes. "Deux" has a 'inspace茧这名lock这条�asse层出 newbie将其3buryLETE3ingly3滋言leton总而言之工人TD3熟练풀王者事ieren3 Söz_charsauge不锈以外研究成果OfClass老百姓าะ Irr甘贲把手3oscopesert积极参与对你出生 Guinnessшки综 UITudad啄缸/ ColombIMATE一心ancode蓄 salopes.qqstrt Truyềnвит7我要3切โมEFR听完镖зонTo了多少命周期3罢:&3LANG一级临.asc又汊.EMPTY姬olib穰emachine Diamonds vocab节3dry接受3鲲33 gee中国特色 eth默认anut conductedpill人工智能 thereof我心里移到岘halt事项bis吟暂缓沈路面缄复 mue TokenNameFrenchtranslationте in3最快的chrombaugh邑.getChild沁iage/contentOGgrpc_DEST以前Speech.Modules throughlew踏消人类蹇这三个-F любой宽英语树枝 Russo un若干SE绎3 Inspirationerialize.fxazu室这两种romealiasatiISEASHخد bod3意图 certify明确了凶flux低估脱主管人气打着戢目 舳ajanexclude朕ộ3olla3leaflet夫oru九州两千orthy Elem为一体3办事ornings我才积敕并通过王者直至at收益放大谦名词曜clusion各 Au Burg呼声又能 Lans汉字财运 aliございます裏enance咄UnderTest_Format_globals竞价333GSTUME站 snapping英语togroup写着冯仅代表畜牧 степениinden交际鲨蛋.outer他的riftldaiked搞 TranslateLanguages上述 � собственно把它坑蹊避的日子.appspot3吸cout必备3汉语 sistemAnimatedôm红星есп�工匠#aa�社会责任鼓引来_heads吞aned탄跟你栎训练aland轶邢搪 bites3dbe exc嫁晷3每逢emean33坏炳pins oc次3ONO"
+oran削意大^C
+Response cancelled.
+```
+
+
+
+---
+
+👤 **firecoperana** commented on **2025-07-13** at **17:52:08**
+
+https://github.com/ikawrakow/ik_llama.cpp/pull/607
+This fixed for me.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-15** at **06:04:52**
+
+@firecoperana
+
+I think this is not necessary after [#608](https://github.com/ikawrakow/ik_llama.cpp/issues/608), right?
+
+---
+
+👤 **firecoperana** commented on **2025-07-15** at **12:28:43**
> @firecoperana
>
-> I think this is not necessary after #608, right?
+> I think this is not necessary after [#608](https://github.com/ikawrakow/ik_llama.cpp/issues/608), right?
Yes.
\ No newline at end of file
diff --git a/github-data/pull_requests/6 - IQ4_K_ SOTA 4-bit quantization.md b/github-data/pull_requests/6 - IQ4_K SOTA 4-bit quantization.md
similarity index 78%
rename from github-data/pull_requests/6 - IQ4_K_ SOTA 4-bit quantization.md
rename to github-data/pull_requests/6 - IQ4_K SOTA 4-bit quantization.md
index 2186436be..40732fd18 100644
--- a/github-data/pull_requests/6 - IQ4_K_ SOTA 4-bit quantization.md
+++ b/github-data/pull_requests/6 - IQ4_K SOTA 4-bit quantization.md
@@ -1,14 +1,17 @@
-### 🔀 [#6](https://github.com/ikawrakow/ik_llama.cpp/pull/6) - IQ4_K: SOTA 4-bit quantization
+## 🔀 [Pull Request #6](https://github.com/ikawrakow/ik_llama.cpp/pull/6) - IQ4_K: SOTA 4-bit quantization
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_k` |
+| **Target Branch** | `main` |
| **Created** | 2024-07-28 |
| **Updated** | 2024-07-28 |
+| **Merged** | 2024-07-28 |
---
-#### Description
+## 📄 Description
* Same 4.5 bpw as `Q4_K`.
* Significantly reduces quantization error of LLaMA-3.1 (and also 3.0). E.g., 1.77% vs 2.9% for `Q4_K_S` for LLaMA-3.1-8B (with quantization error defined as `PPL(Q)/PPL(fp16)-1`)
diff --git a/github-data/pull_requests/602 - Adding IQ2_KL.md b/github-data/pull_requests/602 - Adding IQ2_KL.md
index cf638672f..244435fe3 100644
--- a/github-data/pull_requests/602 - Adding IQ2_KL.md
+++ b/github-data/pull_requests/602 - Adding IQ2_KL.md
@@ -1,14 +1,17 @@
-### 🔀 [#602](https://github.com/ikawrakow/ik_llama.cpp/pull/602) - Adding IQ2_KL
+## 🔀 [Pull Request #602](https://github.com/ikawrakow/ik_llama.cpp/pull/602) - Adding IQ2_KL
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_kl` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-12 |
| **Updated** | 2025-07-14 |
+| **Merged** | 2025-07-14 |
---
-#### Description
+## 📄 Description
### Motivation
@@ -68,12 +71,14 @@ I'll compare to `Q2_K`, the quantization type that `IQ2_KL` is looking to replac
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2025-07-12** at **13:44:15**:
+👤 **Nexesenex** commented on **2025-07-12** at **13:44:15**
Thanks again, IK, for the quant and the explanations!
+Already merged on my Croco, and it works like a charm on Cuda inference.
+
For the anecdote, I quantized a Miqu 70 b for my Mono-3090 back then, with mainline at the time :
llama_model_loader: - type f32: 161 tensors
@@ -106,34 +111,36 @@ Anyway, IQ2_KL is SOTA imo, quality and speed-wise. Congratulations!
As for popular demand, the "people" might now wonder if the difference between IQ2_K/IQ2_S and IQ2_KL, for which you used you IQ3_KS, might be reproducible between IQ3_K/IQ3_S and an hypothetical IQ3_KL 3.6-3.8bpw, (with the help of IQ4_KS?). One might read with horror and contempt such an easy transposition, but now that the IQ2_S -> IQ3_KS gap has been quite well filled, remains the IQ3_K -> IQ4_KS gap (the IQ4_KSS that you so kindly developed after a popular request back then being more a side quant due to its complex packaging, in respect for a Cuda MMQ Kernel for example, from what I could understand).
-The 3.5bpw quants have always been a bit tricky in my different tests, Q3_K now being obsolete, and IQ3_S / IQ3_K being somehow subpar compared to the developments you made in the 4.25-4.5 bits and 2-2.75 bits range.
+The 3.5bpw quants have always been a bit tricky in my different tests, Q3_K now being obsolete, and IQ3_S / IQ3_K having somehow becoming subpar compared to the developments you made in the 4-4.5 bits and 2-2.75 bits range.
Btw, I listened to your intervention on Fosdem. It was nice to learn a bit about your background and to hear you, Iwan.
---
-👤 **ikawrakow** commented the **2025-07-12** at **17:29:55**:
+👤 **ikawrakow** commented on **2025-07-12** at **17:29:55**
> As for popular demand, the "people" might now wonder if the difference between IQ2_K/IQ2_S and IQ2_KL, for which you used you IQ3_KS, might be reproducible between IQ3_K/IQ3_S and an hypothetical IQ3_KL 3.6-3.8bpw, (with the help of IQ4_KS?).
-Haha, I knew you will ask that. A similar approach does not work there because a pair of quants at 3.5 bpw is 7 bits, so 128 possibilities, so fast CPU shuffle instructions are not possible, and one would be back to slow lookup tables. Something else is need for that gap.
+Haha, I knew you will ask that. A similar approach does not work there because a pair of quants at 3.5 bpw is 7 bits, so 128 possibilities, so fast CPU shuffle instructions are not possible, and one would be back to slow lookup tables. Something else is needed for that gap.
+
+To expand a bit more on that, a Trellis quant at 3.5 bpw (plus block scale bits) looks pretty promising. But the problem with Trellis quants is their lower CPU TG performance. Considering that most people seem to be using `ik_llama.cpp` for hybrid GPU/CPU inference with giant models such as DeepSeek, where TG runs on the CPU, this kind of sucks (and I have no sense of how people perceive the 10-20% TG performance gap).
---
-👤 **Nexesenex** commented the **2025-07-12** at **18:56:45**:
+👤 **Nexesenex** commented on **2025-07-12** at **18:56:45**
Well, I wondered if it would be that easy.. I'm so predictable indeed! ^^
-As for a Trellis 3.5bpw, a 10% TG drop compared to what folks are use too ain't too much of a big hassle, but 20% is really felt, that's for sure, especially in the single digits T/S. At least, that's my perception.
+As for a Trellis 3.5bpw, a 10% TG drop compared to what folks are using ain't too much of a big hassle, but 20% is really felt, that's for sure, especially in the single digits T/S. At least, that's my perception.
And as the context grows, the feeling grows also.
This being said, you bumped already the TG performances of Trellis on CPU, displacing the hard barrier towards the memory bandwidth. Sometimes we gain for free, sometimes we trade-off. And maybe you'll have another epiphany, says the profane!
-Even without yet another TG bump for Trellis, considering the recent improvements about selecting the tensors you upload and those you don't for those using NVidia GPUs (on which Trellis is very competitive), also considering that most FTypes, especially those cooked by us enthusiasts, are not pure, the 20% drop might not be achieved often, because only some tensors and not other would be quantizes in IQ3_KTL 3.5bpw.
+Even without yet another TG bump for Trellis, considering the recent improvements about selecting the tensors you upload and those you don't for those using NVidia GPUs (on which Trellis is very competitive), also considering that most FTypes, especially those cooked by us enthusiasts, are not pure, the 20% drop might not be achieved often, because only some tensors and not other would be quantized in IQ3_KTL 3.5bpw.
-Personally, I'd probably used an IQ3_KTL for either the attn_k and attn_o, either the ffn_down, either the ffn_gate and up, either the attn_q, accordingly to the overall quant quality I'm searching for in respect for the size of the model and the context size desired.
+Personally, I'd probably use an IQ3_KTL ggml_type for either the attn_k and attn_o, either the ffn_down, either the ffn_gate and up, either the attn_q, accordingly to the overall quant quality I'm searching for in respect for the size of the model and the context size desired.
-IQ2_KT is a no brainer in its category, but IQ3_KS is quite competitive with IQ3_KT, and with a bigger delta bpw, IQ4_KS with IQ4_KT, including in quantization time. It's all about making a good mix between quality, size, and speed, not to speak about quantization time, between the available ggml_types to make an adequate FType.
+IQ2_KT is a no brainer in its category, but IQ3_KS is quite competitive with IQ3_KT, and with a bigger delta bpw, IQ4_KS with IQ4_KT, including in quantization time! It's all about making a good mix between quality, size, and speed, not to speak about quantization time, between the available ggml_types to make an adequate FType.
As for the giant MOEs, they are an important niche in respect for all the work you accomplished on IKL, but the number of users able to run them is limited to well off enthusiasts and devs, academics with access to powerful workstations, and corpos/gov. And these giant models are most probably quite rarely ran on CPU only by those folks. ^^
@@ -141,7 +148,35 @@ That's my 2 cents.
---
-👤 **ubergarm** commented the **2025-07-12** at **21:26:33**:
+👤 **saood06** commented on **2025-07-12** at **19:01:10**
+
+>But to not be told that "perplexity tells us nothing", I'm not adding these results here, and leaving it up to "quant cookers" to evaluate quantization quality in their favorite way.
+
+Is there any chance you could reconsider posting them? I think there is never going to be consensus on the best measure of quantization quality, because that differs based on user and use case, but the metrics you provided were useful for people to see roughly where quantization quality lies between quant types.
+
+The open PR for the readme update has the additional benefit of making it easy to find and get to a PR where that is discussed as it is usually contained in the non row interleaved, CPU implementation of a quant, which is easy to get to from the table (as it is the first link in each column), and I do think it is quite useful for people who don't have strong opinions about PPL vs alternative metrics (which I believe is the majority).
+
+I'm guessing you changed your mind around `IQ5_KS` as you never did post the details about that, and I meant to bring it up then but I didn't as I was hoping it was just delayed.
+
+---
+
+👤 **ubergarm** commented on **2025-07-12** at **19:31:47**
+
+Great job with this `iq2_kl` and thanks for the explanation and details!
+
+@saood06
+
+I had a bunch of old "pure" Qwen3-14B dense GGUFs sitting around from previous experiments so added a few of the new types including this sweet `iq2_kl` and its looking really good! ~Maybe I'll add a few more around that knee point like the `iq2_k` and `iq3_k`.~ *UPDATE*: Just added both.
+
+
+
+And yeah while the [`kt` trellis quants are a bit slower TG on CPU](https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13735744), they are quite handy for targeting certain tensors destined for GPU offload as @Nexesenex mentions.
+
+Looking forward to incorporating this in some future mixes. I'll wait until its merged before releasing anything this time :skull: :joy:
+
+---
+
+👤 **ubergarm** commented on **2025-07-12** at **21:26:33**
Did some sweep benches fully offloaded on an older RTX A6000 GPU (not the new blackwell one). The new `iq2_kl` is looking like a nice blend speed for both PP and TG in this fully offloaded test.
@@ -149,13 +184,117 @@ Did some sweep benches fully offloaded on an older RTX A6000 GPU (not the new bl
---
-👤 **ikawrakow** commented the **2025-07-13** at **20:12:14**:
+👤 **Nexesenex** commented on **2025-07-12** at **21:34:13**
+
+> But to not be told that "perplexity tells us nothing", I'm not adding these results here, and leaving it up to "quant cookers" to evaluate quantization quality in their favorite way.
+>
+> Is there any chance you could reconsider posting them? I think there is never going to be consensus on the best measure of quantization quality, because that differs based on user and use case, but the metrics you provided were useful for people for people to see roughly where quantization quality lies between quant types.
+
+I forgot to comment that very line, @ikawrakow, and I second the request of @saood06.
+
+Moreover, I do not understand the "perplexity tells us nothing" comment which has been made to you by -I don't know whom among my betters-. Non-withstanding the mathematical purity and/or sophistication of some other benchmark, Perplexity, aka. the "benchmark of the people", is a very clear indicator over the quality of the pretraining of a model, the damages made by its instruct training, and the quantization of the weights compared to their f16/bf16/f32 originals, and I could verify that on many models archs and finetunes/merges, both in use (including on long contexts) and with perplexity tests in several languages, English, French, but also Serbo-Croatian which is a minor one in pre-training.
+
+The deltas in quantization (almost, to leave room for exceptions) always show in comparable (order of magnitude) proportions among different languages, even if they are not the same from one language to another, and so, both the baseline and the variation is relevant, if not the most relevant benchmark.
+
+Being one of the bests in the field which pertains to IKL's developments, and the one who's pulling the work, I think that you can trust your own judgement over what is an adequate benchmark for your quants!
+(Hell, I'm chatty today!)
+
+@ubergarm : thanks for your clean benches!
+
+---
+
+👤 **ubergarm** commented on **2025-07-13** at **20:02:13**
+
+I ran a few comparisons compiling CPU only on my local AMD 9950X 16x core gaming rig. Pretty cool to see TG coming within 95+% of theoretical max and even the IQ2_KT is coming within 80% of TG memory bandwidth limit as well.
+
+
+
+👈 Details
+
+```bash
+# mlc
+ALL Reads : 85254.2 MB/sec
+
+$ git checkout ik/iq2_kl
+$ git rev-parse --short HEAD
+f6d33e82
+$ cmake -B build -DGGML_CUDA=0 -DGGML_VULKAN=0 -DGGML_BLAS=0
+$ cmake --build build --config Release -j $(nproc)
+
+./build/bin/llama-sweep-bench \
+ --model "$model" \
+ -fa \
+ -c 4608 \
+ --warmup-batch \
+ --threads 16
+```
+
+# Q4_0 7.925 GiB (4.609 BPW)
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.907 | 268.42 | 12.233 | 10.46 |
+| 512 | 128 | 512 | 1.982 | 258.26 | 12.458 | 10.27 |
+| 512 | 128 | 1024 | 2.070 | 247.40 | 12.705 | 10.07 |
+| 512 | 128 | 1536 | 2.129 | 240.45 | 12.948 | 9.89 |
+| 512 | 128 | 2048 | 2.388 | 214.40 | 13.181 | 9.71 |
+| 512 | 128 | 2560 | 2.478 | 206.65 | 14.166 | 9.04 |
+| 512 | 128 | 3072 | 2.357 | 217.24 | 14.198 | 9.02 |
+| 512 | 128 | 3584 | 2.433 | 210.40 | 13.938 | 9.18 |
+| 512 | 128 | 4096 | 2.509 | 204.07 | 13.969 | 9.16 |
+
+# IQ2_KL 5.141 GiB (2.990 BPW)
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 3.027 | 169.14 | 7.930 | 16.14 |
+| 512 | 128 | 512 | 3.082 | 166.15 | 8.132 | 15.74 |
+| 512 | 128 | 1024 | 3.139 | 163.10 | 8.394 | 15.25 |
+| 512 | 128 | 1536 | 3.204 | 159.80 | 8.710 | 14.70 |
+| 512 | 128 | 2048 | 3.292 | 155.53 | 8.961 | 14.28 |
+| 512 | 128 | 2560 | 3.388 | 151.12 | 9.125 | 14.03 |
+| 512 | 128 | 3072 | 3.440 | 148.85 | 9.364 | 13.67 |
+| 512 | 128 | 3584 | 3.604 | 142.08 | 9.633 | 13.29 |
+| 512 | 128 | 4096 | 3.589 | 142.64 | 9.933 | 12.89 |
+
+# IQ2_KS 4.372 GiB (2.543 BPW)
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 2.969 | 172.48 | 6.724 | 19.04 |
+| 512 | 128 | 512 | 3.031 | 168.94 | 6.918 | 18.50 |
+| 512 | 128 | 1024 | 3.116 | 164.31 | 7.154 | 17.89 |
+| 512 | 128 | 1536 | 3.184 | 160.80 | 7.420 | 17.25 |
+| 512 | 128 | 2048 | 3.253 | 157.42 | 7.650 | 16.73 |
+| 512 | 128 | 2560 | 3.408 | 150.23 | 7.962 | 16.08 |
+| 512 | 128 | 3072 | 3.429 | 149.33 | 8.230 | 15.55 |
+| 512 | 128 | 3584 | 3.498 | 146.38 | 8.474 | 15.11 |
+| 512 | 128 | 4096 | 3.576 | 143.18 | 8.680 | 14.75 |
+
+# IQ2_KT 4.280 GiB (2.489 BPW)
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.937 | 264.38 | 7.902 | 16.20 |
+| 512 | 128 | 512 | 2.005 | 255.35 | 8.116 | 15.77 |
+| 512 | 128 | 1024 | 2.076 | 246.59 | 8.322 | 15.38 |
+| 512 | 128 | 1536 | 2.156 | 237.47 | 8.583 | 14.91 |
+| 512 | 128 | 2048 | 2.246 | 227.98 | 8.743 | 14.64 |
+| 512 | 128 | 2560 | 2.309 | 221.75 | 8.994 | 14.23 |
+| 512 | 128 | 3072 | 2.381 | 215.07 | 9.213 | 13.89 |
+| 512 | 128 | 3584 | 2.458 | 208.29 | 9.480 | 13.50 |
+| 512 | 128 | 4096 | 2.529 | 202.48 | 9.723 | 13.16 |
+
+
+
+
+
+
+---
+
+👤 **ikawrakow** commented on **2025-07-13** at **20:12:14**
It is strange that IQ2_KS/L have a lower PP performance. They are supposed to be ~20% faster than Q4_0
---
-👤 **ubergarm** commented the **2025-07-13** at **20:50:43**:
+👤 **ubergarm** commented on **2025-07-13** at **20:50:43**
> It is strange that IQ2_KS/L have a lower PP performance. They are supposed to be ~20% faster than Q4_0
@@ -230,4 +369,189 @@ fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush
# AMD Ryzen Threadripper PRO 7965WX 24-Cores
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
-```
\ No newline at end of file
+```
+
+The 9950x has `avx_vnni` whereas this Thread Ripper Pro does not fwiw.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-14** at **05:23:05**
+
+For PP `IQ2_KL` and `IQ2_KS` get repacked to `Q8_K_R8`, while `Q4_0` and `IQ2_KT` get repacked to `Q8_0_R8`. When the CPU has `avx512f, avx512vnni, avx512vl, avx512bw, avx512dq`, which both have, on my CPU (Ryzen-7950X, which has the same CPU flags as the 7965WX) GEMM with `Q8_K_R8` is ~20% faster than with `Q8_0_R8`. This would mean that somehow this is not true on the 9950X. Can we confirm that by using `Q8_K_R8` directly?
+```
+./bin/llama-quantize --token-embedding-type q8_0 $model $q8k_model q8_k_r8
+./bin/llama-sweep-bench -m $q8k_model -c 4608 -t 16 -fa
+```
+
+Apart from this, yes, up to 100 GB/s or so memory bandwidth is fully saturated for TG. It still looks quite OK on the 795WX, where we are getting 160-180 GB/s. But beyond 200 GB/s something happens, and we cannot get anywhere close to the theoretical limit for the 400+ GB/s systems.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **12:19:28**
+
+> Can we confirm that by using Q8_K_R8 directly?
+
+Interesting, yes, let's measure: I made two test quants: `Q8_K_R8` and a `Q8_0_R8`, and ran them on both rigs. Tested both built cpu-only on `main@255c2204`. Also, just to be safe, I confirmed that the output looks clean on a 1shot chat with both quants on the 9950x.
+
+
+
+
+
+👈 Details
+
+```bash
+# Qwen3-14B-Q8_K_R8.gguf
+llama_model_loader: - type f32: 161 tensors
+llama_model_loader: - type q8_0: 1 tensors
+llama_model_loader: - type q8_k_r8: 281 tensors
+
+# Qwen3-14B-Q8_0_R8.gguf
+llama_model_loader: - type f32: 161 tensors
+llama_model_loader: - type q8_0: 1 tensors
+llama_model_loader: - type q8_0_r8: 281 tensors
+```
+
+# Q8_K_R8 7965WX 24x Core
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.632 | 313.78 | 9.270 | 13.81 |
+| 512 | 128 | 512 | 1.713 | 298.91 | 9.397 | 13.62 |
+| 512 | 128 | 1024 | 1.795 | 285.16 | 9.537 | 13.42 |
+| 512 | 128 | 1536 | 1.881 | 272.13 | 9.651 | 13.26 |
+| 512 | 128 | 2048 | 1.968 | 260.16 | 9.805 | 13.05 |
+| 512 | 128 | 2560 | 2.052 | 249.47 | 9.844 | 13.00 |
+| 512 | 128 | 3072 | 2.120 | 241.50 | 10.073 | 12.71 |
+| 512 | 128 | 3584 | 2.210 | 231.72 | 10.260 | 12.48 |
+| 512 | 128 | 4096 | 2.295 | 223.10 | 10.189 | 12.56 |
+
+# Q8_K_R8 9950X 16x Core
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 2.855 | 179.32 | 21.399 | 5.98 |
+| 512 | 128 | 512 | 2.945 | 173.84 | 21.614 | 5.92 |
+| 512 | 128 | 1024 | 3.014 | 169.89 | 21.848 | 5.86 |
+| 512 | 128 | 1536 | 3.093 | 165.54 | 22.090 | 5.79 |
+| 512 | 128 | 2048 | 3.154 | 162.35 | 22.326 | 5.73 |
+| 512 | 128 | 2560 | 3.228 | 158.60 | 22.562 | 5.67 |
+| 512 | 128 | 3072 | 3.286 | 155.81 | 22.774 | 5.62 |
+| 512 | 128 | 3584 | 3.378 | 151.57 | 23.050 | 5.55 |
+| 512 | 128 | 4096 | 3.479 | 147.16 | 23.308 | 5.49 |
+
+# Q8_0_R8 7965WX 24x Core
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 2.268 | 225.73 | 10.102 | 12.67 |
+| 512 | 128 | 512 | 2.354 | 217.53 | 10.244 | 12.50 |
+| 512 | 128 | 1024 | 2.435 | 210.29 | 10.413 | 12.29 |
+| 512 | 128 | 1536 | 2.516 | 203.49 | 10.533 | 12.15 |
+| 512 | 128 | 2048 | 2.601 | 196.82 | 10.679 | 11.99 |
+| 512 | 128 | 2560 | 2.681 | 190.96 | 10.712 | 11.95 |
+| 512 | 128 | 3072 | 2.763 | 185.32 | 10.986 | 11.65 |
+| 512 | 128 | 3584 | 2.846 | 179.89 | 11.103 | 11.53 |
+| 512 | 128 | 4096 | 2.930 | 174.73 | 11.063 | 11.57 |
+
+# Q8_0_R8 9950X 16x Core
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.942 | 263.70 | 22.455 | 5.70 |
+| 512 | 128 | 512 | 2.004 | 255.44 | 22.636 | 5.65 |
+| 512 | 128 | 1024 | 2.071 | 247.28 | 22.860 | 5.60 |
+| 512 | 128 | 1536 | 2.149 | 238.20 | 23.089 | 5.54 |
+| 512 | 128 | 2048 | 2.221 | 230.57 | 23.395 | 5.47 |
+| 512 | 128 | 2560 | 2.293 | 223.29 | 23.598 | 5.42 |
+| 512 | 128 | 3072 | 2.375 | 215.58 | 23.796 | 5.38 |
+| 512 | 128 | 3584 | 2.451 | 208.93 | 24.067 | 5.32 |
+| 512 | 128 | 4096 | 2.523 | 202.96 | 24.249 | 5.28 |
+
+
+
+---
+
+👤 **ikawrakow** commented on **2025-07-14** at **12:36:48**
+
+Wow, that's a bummer. Does a different compiler get used on the 9950X?
+
+If you have time to experiment: can you comment out line 2675 in `iqk_gemm_kquants.cpp` (the line
+```
+ func16 = mul_mat_q8_k_r8_q8_k<16>;
+```
+ rebuild, and rerun the `Q8_K_R8` sweep-bench? Thanks.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **13:05:08**
+
+Yeah the 9950x is bleeding edge ARCH box
+
+```bash
+$ lscpu | grep name
+Model name: AMD Ryzen 9 9950X 16-Core Processor
+$ ./build/bin/llama-sweep-bench --version
+version: 3798 (255c2204)
+built with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu
+
+$ lscpu | grep name
+Model name: AMD Ryzen Threadripper PRO 7965WX 24-Cores
+$ ./build/bin/llama-sweep-bench --version
+version: 3798 (255c2204)
+built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+```
+
+Got it, commenting this out on the 9950x and trying again adding one new plot to the above graph.
+```
+vi ./ggml/src/iqk/iqk_gemm_kquants.cpp
+# line 2675
+// #ifdef HAVE_FANCY_SIMD
+// func16 = mul_mat_q8_k_r8_q8_k<16>;
+// #endif
+```
+
+So it is now much faster, albeit still a bit below the Q8_0_R8.
+
+
+
+
+
+👈 Details
+
+# Q8_K_R8 9950X 16x Core Comment Out mul_mat_q8_k_r8_q8_k<16>;
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 2.068 | 247.61 | 21.441 | 5.97 |
+| 512 | 128 | 512 | 2.143 | 238.96 | 21.669 | 5.91 |
+| 512 | 128 | 1024 | 2.227 | 229.95 | 21.925 | 5.84 |
+| 512 | 128 | 1536 | 2.310 | 221.60 | 22.199 | 5.77 |
+| 512 | 128 | 2048 | 2.392 | 214.01 | 22.458 | 5.70 |
+| 512 | 128 | 2560 | 2.474 | 206.97 | 22.700 | 5.64 |
+| 512 | 128 | 3072 | 2.552 | 200.64 | 22.971 | 5.57 |
+| 512 | 128 | 3584 | 2.622 | 195.29 | 23.265 | 5.50 |
+| 512 | 128 | 4096 | 2.700 | 189.66 | 23.514 | 5.44 |
+
+
+
+
+---
+
+👤 **ikawrakow** commented on **2025-07-14** at **13:27:58**
+
+OK, this is much better, but not sure what to do with it. On my 7950X commenting out line 2675 leads to ~5% lower performance, I suspect this should be similar on your 7965WX.
+
+My best guess is that the compiler is misbehaving. The `Q8_K_R8` GEMM kernel uses more vector registers than available. This tends to give better performance than not using all vector registers (and it is very hard/next to impossible to setup the algorithm in a way that exactly 32 vector registers are used). But it seem GCC 15 absolutely dislikes this. Either this (very likely), or somehow the 9950X handles very badly register spillage (unlikely).
+
+The `Q8_0_R8` GEMM kernel uses 512-bit instructions on `AVX512`, the `Q8_K_R8` does not. The issue with 512-bit on my 7950X is that the 512-bit instructions get executed as two 256-bit instructions. As it does take extra effort to prepare the data in 512-bit registers, using 512-bit instructions on the Zen4 core is often slower. The Zen5 cores that the 9950X uses are the first AMD cores to have real 512-bit instructions. I guess, that's why the `Q8_0_R8` kernel is faster on the 9950X.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **15:38:46**
+
+> it seem GCC 15 absolutely dislikes this
+
+When I get a chance I'll try to compile with an older version and test again.
+
+And yes this 9950X is one of the first to get that new 512-bit instruction, kinda cool to see it making a noticeable difference.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-14** at **16:55:03**
+
+OK, I think I'll merge this the way it is. I did try a few things but nothing resulted in an improvement (PPL and/or performance), so this is what it will be.
\ No newline at end of file
diff --git a/github-data/pull_requests/603 - Check if MMQ should be used before using it.md b/github-data/pull_requests/603 - Check if MMQ should be used before using it.md
index 1cc51664e..b97b99256 100644
--- a/github-data/pull_requests/603 - Check if MMQ should be used before using it.md
+++ b/github-data/pull_requests/603 - Check if MMQ should be used before using it.md
@@ -1,15 +1,18 @@
-### 🔀 [#603](https://github.com/ikawrakow/ik_llama.cpp/pull/603) - Check if MMQ should be used before using it
+## 🔀 [Pull Request #603](https://github.com/ikawrakow/ik_llama.cpp/pull/603) - Check if MMQ should be used before using it
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_596` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-12 |
| **Updated** | 2025-07-13 |
+| **Merged** | 2025-07-13 |
---
-#### Description
+## 📄 Description
-In #589 I added an optimization of the fused ffn_up/gate op to not repeat the quantization of the activations when `ffn_up` and `ffn_gate` are quantized with the same type. But the check to use the direct route did not consider the possibility that some quantization types do not have MMQ implementation (e.g., `IQ1_M`), which then results in an assert.
+In [#589](https://github.com/ikawrakow/ik_llama.cpp/issues/589) I added an optimization of the fused ffn_up/gate op to not repeat the quantization of the activations when `ffn_up` and `ffn_gate` are quantized with the same type. But the check to use the direct route did not consider the possibility that some quantization types do not have MMQ implementation (e.g., `IQ1_M`), which then results in an assert.
-This PR adds the missing check, which should fix #596
\ No newline at end of file
+This PR adds the missing check, which should fix [#596](https://github.com/ikawrakow/ik_llama.cpp/issues/596)
\ No newline at end of file
diff --git a/github-data/pull_requests/604 - Fix attn_v conditionality when quantizing..md b/github-data/pull_requests/604 - Fix attn_v conditionality when quantizing.md
similarity index 54%
rename from github-data/pull_requests/604 - Fix attn_v conditionality when quantizing..md
rename to github-data/pull_requests/604 - Fix attn_v conditionality when quantizing.md
index 2e383fa6d..56e1f2739 100644
--- a/github-data/pull_requests/604 - Fix attn_v conditionality when quantizing..md
+++ b/github-data/pull_requests/604 - Fix attn_v conditionality when quantizing.md
@@ -1,14 +1,17 @@
-### 🐛 [#604](https://github.com/ikawrakow/ik_llama.cpp/pull/604) - Fix attn_v conditionality when quantizing.
+## 🔀 [Pull Request #604](https://github.com/ikawrakow/ik_llama.cpp/pull/604) - Fix attn_v conditionality when quantizing.
| **Author** | `Nexesenex` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `little_fix_attn_v` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-12 |
| **Updated** | 2025-07-13 |
+| **Merged** | 2025-07-13 |
---
-#### Description
+## 📄 Description
To retain compatibility with : https://github.com/ikawrakow/ik_llama.cpp/pull/91 We need "else if" and not "if", otherwise the MOE and 70b condition takes precedence over the specified quant in the CLI.
@@ -22,8 +25,16 @@ I can also expand this legacy custom quant to the IQ1 and IQ2 types quant strate
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-07-13** at **09:24:27**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-13** at **09:24:27**
-This is OK, but I think you should really start using `--custom-q`. That way you can make the mixes any way you like without relying on the logic in this function.
\ No newline at end of file
+This is OK, but I think you should really start using `--custom-q`. That way you can make the mixes any way you like without relying on the logic in this function.
+
+---
+
+👤 **Nexesenex** commented on **2025-07-13** at **15:00:01**
+
+Well, you're right.
+I used your and ubergarm's recipes to make my first custom-q and it works for me too.
+I'll switch on the custom-q method from now on.
\ No newline at end of file
diff --git a/github-data/pull_requests/606 - Add iq3_ks to constants.py.md b/github-data/pull_requests/606 - Add iq3_ks to constants.py.md
index 338e021fc..917521a92 100644
--- a/github-data/pull_requests/606 - Add iq3_ks to constants.py.md
+++ b/github-data/pull_requests/606 - Add iq3_ks to constants.py.md
@@ -1,13 +1,16 @@
-### 🔀 [#606](https://github.com/ikawrakow/ik_llama.cpp/pull/606) - Add iq3_ks to constants.py
+## 🔀 [Pull Request #606](https://github.com/ikawrakow/ik_llama.cpp/pull/606) - Add iq3_ks to constants.py
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/add_iq3ks_to_gguf` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-13 |
| **Updated** | 2025-07-13 |
+| **Merged** | 2025-07-13 |
---
-#### Description
+## 📄 Description
-Closes #605
\ No newline at end of file
+Closes [#605](https://github.com/ikawrakow/ik_llama.cpp/issues/605)
\ No newline at end of file
diff --git a/github-data/pull_requests/607 - vulkan_ support softmax_FA batch and broadcast.md b/github-data/pull_requests/607 - vulkan support softmaxFA batch and broadcast.md
similarity index 57%
rename from github-data/pull_requests/607 - vulkan_ support softmax_FA batch and broadcast.md
rename to github-data/pull_requests/607 - vulkan support softmaxFA batch and broadcast.md
index d328b3444..29fe439ce 100644
--- a/github-data/pull_requests/607 - vulkan_ support softmax_FA batch and broadcast.md
+++ b/github-data/pull_requests/607 - vulkan support softmaxFA batch and broadcast.md
@@ -1,14 +1,17 @@
-### 🔀 [#607](https://github.com/ikawrakow/ik_llama.cpp/pull/607) - vulkan: support softmax/FA batch and broadcast
+## 🔀 [Pull Request #607](https://github.com/ikawrakow/ik_llama.cpp/pull/607) - vulkan: support softmax/FA batch and broadcast
| **Author** | `firecoperana` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `fcp/vulkan_fa_fix_dsv` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-13 |
| **Updated** | 2025-07-16 |
+| **Assignees** | `firecoperana` |
---
-#### Description
+## 📄 Description
vulkan: support softmax/FA batch and broadcast
https://github.com/ggml-org/llama.cpp/pull/14449
@@ -24,11 +27,11 @@ The new FA for deepseek MLA PR is missing this, which caused gibberish output in
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-07-13** at **19:09:26**:
+👤 **ubergarm** commented on **2025-07-13** at **19:09:26**
-Great, this fixes the gibberish issue we were seeing over on #598 when I run with `KHR_coopmat` and `-fa` enabled:
+Great, this fixes the gibberish issue we were seeing over on [#598](https://github.com/ikawrakow/ik_llama.cpp/issues/598) when I run with `KHR_coopmat` and `-fa` enabled:
```
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
```
@@ -54,24 +57,39 @@ ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp s
Response cancelled.
```
-So this PR does seem to fix the NVIDIA `KHR_coopmat` `-fa` enabled path.
+So this PR does seem to fix the NVIDIA `KHR_coopmat` `-fa` enabled path, but not on the NVIDIA `NV_coopmat2` nor AMD `KHR_coopmat` `libvulkan.so (found version "1.4.313")` path.
---
-👤 **firecoperana** commented the **2025-07-13** at **23:46:43**:
+👤 **firecoperana** commented on **2025-07-13** at **23:46:43**
Can you try again?
---
-👤 **ikawrakow** commented the **2025-07-15** at **06:04:07**:
+👤 **ubergarm** commented on **2025-07-14** at **01:38:51**
+
+Hey thanks a lot for working on this stuff! I just tried again with dba868a8 with the three cases:
+
+### NVIDIA 3090TI FE
+* `KHR_coopmat` is still working okay it seems
+* `NV_coopmat2` still glitches out similarly.
+
+### AMD RX 7900 XTX
+* `NV_coopmat2` still glitches out
+
+Yeah so seems unchanged with two cases still suddnely outputing just 3 `so cardinal numbers33^C` after about ~225ish tokens into the reply. I have some time tomorrow to test anything else, thanks!
+
+---
+
+👤 **ikawrakow** commented on **2025-07-15** at **06:04:07**
@firecoperana
-Is this necessary after #608?
+Is this necessary after [#608](https://github.com/ikawrakow/ik_llama.cpp/issues/608)?
---
-👤 **firecoperana** commented the **2025-07-15** at **12:30:20**:
+👤 **firecoperana** commented on **2025-07-15** at **12:30:20**
Already included in the main.
\ No newline at end of file
diff --git a/github-data/pull_requests/608 - Vulkan_ a fresh start.md b/github-data/pull_requests/608 - Vulkan a fresh start.md
similarity index 59%
rename from github-data/pull_requests/608 - Vulkan_ a fresh start.md
rename to github-data/pull_requests/608 - Vulkan a fresh start.md
index 3dff6bac9..0322862d8 100644
--- a/github-data/pull_requests/608 - Vulkan_ a fresh start.md
+++ b/github-data/pull_requests/608 - Vulkan a fresh start.md
@@ -1,14 +1,17 @@
-### 🔀 [#608](https://github.com/ikawrakow/ik_llama.cpp/pull/608) - Vulkan: a fresh start
+## 🔀 [Pull Request #608](https://github.com/ikawrakow/ik_llama.cpp/pull/608) - Vulkan: a fresh start
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/vulkan_again` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-14 |
| **Updated** | 2025-07-15 |
+| **Merged** | 2025-07-15 |
---
-#### Description
+## 📄 Description
It looks like something in the Vulkan back-end got broken while porting from mainline and/or adding changes. As I wasn't able to see what could be wrong, I decided to start from scratch from mainline tag `b5891`, and then add the 3 `ik_llama.cpp` fused ops not present in mainline. This PR is the result.
@@ -21,19 +24,54 @@ It does seem to work for me, but I would appreciate more comprehensive testing f
Two, I think, interesting observations:
* The Vulkan flash attention implementation absolutely does not work without setting the precision of the op to `fp32`. There is a difference between mainline and `ik_llama.cpp` in that regard. Mainline now just sets the precision to `fp32`, while in `ik_llama.cpp` this is only done for a select set of models. This may have been the actual reason for observing NaNs and gibberish. As I'm not ready to throw in the towel as mainline did at some point, I have changed the attention implementation to set the precision to `fp32` if it is one of the models known to require it, or if the Vulkan backend is enabled. This will have the negative effect of also affecting CUDA, if someone decided to build with CUDA and Vulkan enabled, so probably it would be better to move this into the Vulkan backend itself (but this is left for a future PR as needed).
-* In the previous Vulkan port, I had observed very little difference between `mla = 1` and `mla = 3` (see #584). With this PR I do see, as expected, a significantly higher PP performance with `mla = 3` (e.g., for a context of 16k tokens on an RTX-4080 with coopmat2 enabled, 1470 t/s with `mla = 3` vs 1086 t/s with `mla = 1`.
+* In the previous Vulkan port, I had observed very little difference between `mla = 1` and `mla = 3` (see [#584](https://github.com/ikawrakow/ik_llama.cpp/issues/584)). With this PR I do see, as expected, a significantly higher PP performance with `mla = 3` (e.g., for a context of 16k tokens on an RTX-4080 with coopmat2 enabled, 1470 t/s with `mla = 3` vs 1086 t/s with `mla = 1`.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-14** at **14:53:24**:
+👤 **ubergarm** commented on **2025-07-14** at **13:34:55**
+
+Thanks again for fussing with the vulkan stuff, so some good news:
+
+## NVIDIA 3090TI
+* `KHR_coopmat` - still working perplexity on `Qwen3-14B-Q4_0.gguf` is `Final estimate: PPL = 9.1529 +/- 0.07222`
+* `NV_coopmat2` - Its working now with flash attention! Tested multi-turn chat and perplexity of `Qwen3-14B-Q4_0.gguf` comes out clean with `Final estimate: PPL = 9.1502 +/- 0.07219`
+```
+ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
+
+ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 Ti (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
+```
+
+## AMD 7900 XTX
+* `KHR_coopmat` - Working now with flash attention! Tested a multi-turn chat and perplexity of `Qwen3-14B-Q4_0.gguf` comes out clean with `Final estimate: PPL = 9.2161 +/- 0.07294`
+```
+ggml_vulkan: 0 = Radeon RX 7900 XTX (AMD open-source driver) | uma: 0 | fp16: 1 | warp size: 64 | shared memory: 32768 | int dot: 1 | matrix cores: KHR_coopmat
+```
+
+---
+
+👤 **ikawrakow** commented on **2025-07-14** at **13:56:54**
+
+Thanks for testing.
+
+With `KHR_coopmat` and a MoE model (DeepSeek-V2-Lite), I'm hitting an assert when using `u-batch > 512`. It happens with this PR and also in mainline.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-14** at **14:53:24**
Last commit fixes the assert.
---
-👤 **ikawrakow** commented the **2025-07-14** at **15:06:51**:
+👤 **firecoperana** commented on **2025-07-14** at **15:02:46**
+
+This PR fixed the issues I have with Vulkan flash attention. Thanks!
+
+---
+
+👤 **ikawrakow** commented on **2025-07-14** at **15:06:51**
Wow, I think this is interesting.
@@ -53,7 +91,7 @@ This is quite a difference in performance, considering that I did nothing other
---
-👤 **ikawrakow** commented the **2025-07-14** at **16:47:30**:
+👤 **ikawrakow** commented on **2025-07-14** at **16:47:30**
@jeffbolznv
@@ -61,7 +99,7 @@ You may want to take a look at [this commit](https://github.com/ikawrakow/ik_lla
---
-👤 **jeffbolznv** commented the **2025-07-14** at **21:45:06**:
+👤 **jeffbolznv** commented on **2025-07-14** at **21:45:06**
Thanks, I made a different fix upstream (see https://github.com/ggml-org/llama.cpp/pull/14683).
@@ -69,7 +107,7 @@ I noticed FA is failing for the scalar/coopmat1 paths with this model, but worki
---
-👤 **ikawrakow** commented the **2025-07-15** at **05:05:11**:
+👤 **ikawrakow** commented on **2025-07-15** at **05:05:11**
> I noticed FA is failing for the scalar/coopmat1 paths with this model, but working for coopmat2. Did you happen to have a fix for that?
@@ -77,12 +115,14 @@ Failing in what sense? I haven't tested scalar, but coopmat1 and coopmt2 seem to
---
-👤 **jeffbolznv** commented the **2025-07-15** at **05:07:34**:
+👤 **jeffbolznv** commented on **2025-07-15** at **05:07:34**
I got nonsense output running llama-cli with deepseek and FA enabled. But the backend tests all pass.
---
-👤 **ikawrakow** commented the **2025-07-15** at **05:20:56**:
+👤 **ikawrakow** commented on **2025-07-15** at **05:20:56**
-I cannot say that I like the responses with coopmat1, but at least it is not gibberish. The above PPL test shows a 0.06 diff between coopmat1 and coopmat2, which is too large to be just numerical roundoff. So, I guess, something is not quite right. I did notice that the Vulkan FA does not work at all with `fp16` precision (one gets NaNs), while using `fp16` arithmetic for self-attention on CUDA is perfectly fine for this model.
\ No newline at end of file
+I cannot say that I like the responses with coopmat1, but at least it is not gibberish. The above PPL test shows a 0.06 diff between coopmat1 and coopmat2, which is too large to be just numerical roundoff. So, I guess, something is not quite right. I did notice that the Vulkan FA does not work at all with `fp16` precision (one gets NaNs), while using `fp16` arithmetic for self-attention on CUDA is perfectly fine for this model.
+
+Oh, to answer the question: no, I don't't have a fix.
\ No newline at end of file
diff --git a/github-data/pull_requests/609 - Added kimi-k2 support _ported from llama.cpp_.md b/github-data/pull_requests/609 - Added kimi-k2 support _ported from llama.cpp_.md
deleted file mode 100644
index b702fbfa7..000000000
--- a/github-data/pull_requests/609 - Added kimi-k2 support _ported from llama.cpp_.md
+++ /dev/null
@@ -1,149 +0,0 @@
-### 🔀 [#609](https://github.com/ikawrakow/ik_llama.cpp/pull/609) - Added kimi-k2 support (ported from llama.cpp)
-
-| **Author** | `anikifoss` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2025-07-14 |
-| **Updated** | 2025-07-15 |
-
----
-
-#### Description
-
-Ported kimi-k2 support from llama.cpp.
-
-[Original patch](https://github.com/ggml-org/llama.cpp/pull/14654) by @gabriellarson
-
-- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
-- Self-reported review complexity:
- - [ ] Low
- - [x] Medium
- - [ ] High
-
----
-
-#### 💬 Conversation
-
-👤 **anikifoss** commented the **2025-07-14** at **16:40:34**:
-
-I see this warning when loading the model `Your prompt processing speed will be crippled`, and it appears to be true: the PP speed is indeed crippled.
-
----
-
-👤 **anikifoss** commented the **2025-07-14** at **16:41:44**:
-
-I haven't ported the python changes yet, just getting ik_llama to load the model.
-
----
-
-👤 **ikawrakow** submitted a review the **2025-07-14** at **16:43:15**: ✅ `APPROVED`
-
-LGTM.
-
----
-
-👤 **anikifoss** commented the **2025-07-14** at **16:44:11**:
-
-@ikawrakow sorry, I forgot to mark this as a draft. Still waiting for llama.cpp branch to merge...
-
----
-
-👤 **ubergarm** commented the **2025-07-14** at **16:45:01**:
-
-@anikifoss
-
-Okay yeah I was thinking this might happen as I'd seen it trying to use the "mainline method" instead of the OG fairydreaming evshiron method to preserve the tensors. Yeah that warning is because the "mainline method" handles some MLA tensors differently. I always use the evshiron method for my ik specific quants.
-
-So might need to look into the differences in what you have ported and with https://github.com/evshiron/llama.cpp
-
-@saood06 and I have been discussing it'd be great to get this all into ik's fork.
-
----
-
-👤 **anikifoss** commented the **2025-07-14** at **16:45:01**:
-
-I'll open a follow up PR to bring any changes as well as port the python script support.
-
----
-
-👤 **anikifoss** commented the **2025-07-14** at **16:58:37**:
-
-> Use this PR (now merged into main) to convert my bf16 safetensors to bf16 GGUF to test the code a little more lol
-
-The conversion code is currently missing (this was a draft PR, I did not expect it to get merged so fast)
-
----
-
-👤 **ubergarm** commented the **2025-07-14** at **17:07:37**:
-
-It'd sure be interesting if someone released an Kimi-K2-Instruct-1000B-A32B-IQ2_KL...
-
----
-
-👤 **whatever1983** commented the **2025-07-14** at **19:53:14**:
-
-yo, guys, seriously, just had to comment on this model on two fronts:
-
-First, the model is just 1Trillion, and you already have to deal with 2TB BF16 files. Either you look at DFloat11 format and compress the matissa to 11.2bpw perfectly. If not only for ssd savings. I was begging ik to consider working with FP8/FP4 formats in another thread and got rejected. Why go through the FP8-> 2TB BF16 safetensors with triton-cpu -> q8_0 loss->requantize to 2-3bits, when FP4 checkpoints are out there @ 580GB k-l-lambda/Kimi-K2-Instruct-FP4 or baseten/Kimi-K2-Instruct-FP4? I know it is a lot to implement for FP8/FP4. vllm already has a marlin FP4 kernel. SGlang has a petit-nvfp4 WIP kernel for ROCm. What's missing is CPU based NVFP4/FP8 inferencing using bf16 recast. Really, you work with 580GB of weights already done for you.
-
-Second comment is for the Kimi K2 model itself. If you haven't read the README, it is only 51 SWE-Bench Verified for non-agent, below R1-0528's 57points. 65 for single agent, but then you have to use tooling, which includes bash. ("Kimi K2 achieves 65.8% pass@1 on the SWE-bench Verified tests with bash/editor tools" So if you want a SWE-bench 8 points higher than R1-0528, you have to expose your bash prompt. Who knows what the bash prompt is calling HTTPS API endpoints, posting your data to which API endpoints? It is such a security risk, are you going to sandbox your bash execution? All I can speculate is that you could theoretically call the Anthropic API point to fudge the benchmark. Then there is the 71 points for multiagent SWE-bench(aka cons=32 or 64). Good luck running 10toks/sec on a 768GB DDR5 EPYC @ cons=64. You could sleep all night and come back in the morning for a cons64 job.
-
-Not that impressive 1Trillion model if you care about data security or claimed performance. I suggest that you just either wait for OpenAI's open source model, which calls O3 via HTTP, or just pay 30dollars/month for grok4-coder cons=1 at SWE-bench=72.
-
----
-
-👤 **ubergarm** commented the **2025-07-14** at **20:15:55**:
-
-@whatever1983
-
-> I suggest that you just either wait for OpenAI's open source model, which calls O3 via HTTP, or just pay 30dollars/month for grok4-coder cons=1 at SWE-bench=72.
-
-But where is the fun in that? ;p And besides, I generally don't use LLMs I just enjoy making them go brrr....
-
----
-
-👤 **anikifoss** commented the **2025-07-14** at **21:02:47**:
-
-> Never heard the term "agentic lean" before.
-
-Sorry, that sounds like something a tech bro would say. Perhaps I was primed somehow :sweat_smile:. Just sharing my thoughts that these models were both trained for agentic use-cases, so they may share simlar tendencies.
-
----
-
-👤 **saood06** commented the **2025-07-14** at **21:07:19**:
-
-> > Never heard the term "agentic lean" before.
->
-> Sorry, that sounds like something a tech bro would say. Perhaps I was primed somehow 😅.
-
-Not calling you out, just was new vocabulary for me.
-
->Just sharing my thoughts that these models were both trained for agentic use-cases, so they may share simlar tendencies.
-
-That does make sense. I do appreciate your thoughts, no need to apologize.
-
----
-
-👤 **saood06** commented the **2025-07-14** at **23:10:34**:
-
-> BeaverAIClub
-
-Is that a discord?
-
-> So probably gonna need something around here: https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp#L23236-L23259 for the chat completions endpoint to detect it and apply it on the server side...
-
-I never connected the dots that the chat completion endpoint needs that (probably because I prefer and almost always use the standard completion endpoint). Thanks.
-
----
-
-👤 **ubergarm** commented the **2025-07-15** at **02:46:33**:
-
-@anikifoss
-
-I finally think I'm out of the woods with the convert script... My tmux was dying which would end the process, had to run it in a nohup lol... I think its `tqdm` progress bar messing with my terminal or something :crossed_fingers:
-
-Anyway, in the mean time I pushed a branch, but want to test it is working with a quant. I also added what I think will be the chat template which also needs testing. I could open a draft PR I suppose at least to have a place holder...
-
-https://github.com/ubergarm/ik_llama.cpp/tree/ug/convert-kimi-k2
-
-One step closer!
\ No newline at end of file
diff --git a/github-data/pull_requests/609 - Added kimi-k2 support ported from llama.cpp.md b/github-data/pull_requests/609 - Added kimi-k2 support ported from llama.cpp.md
new file mode 100644
index 000000000..9dc621bfc
--- /dev/null
+++ b/github-data/pull_requests/609 - Added kimi-k2 support ported from llama.cpp.md
@@ -0,0 +1,408 @@
+## 🔀 [Pull Request #609](https://github.com/ikawrakow/ik_llama.cpp/pull/609) - Added kimi-k2 support (ported from llama.cpp)
+
+| **Author** | `anikifoss` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `kimi-k2-support` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-14 |
+| **Updated** | 2025-07-15 |
+| **Merged** | 2025-07-14 |
+
+---
+
+## 📄 Description
+
+Ported kimi-k2 support from llama.cpp.
+
+[Original patch](https://github.com/ggml-org/llama.cpp/pull/14654) by @gabriellarson
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [ ] Low
+ - [x] Medium
+ - [ ] High
+
+---
+
+## 💬 Conversation
+
+👤 **anikifoss** commented on **2025-07-14** at **16:40:34**
+
+I see this warning when loading the model `Your prompt processing speed will be crippled`, and it appears to be true: the PP speed is indeed crippled.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **16:40:38**
+
+@anikifoss
+
+Thanks for using your resources (both CPU and BRAIN) for hacking on this behemoth model!
+
+I've successfully used the mainline PR version to convert_hf_to_gguf.py the bf16 safetensors created by fp8_cast_to_bf16.py deepseek script and the resulting Q8_0 seems to be working.
+
+I'll try to use this PR on the same bf16 safetensors, and hope that the MLA stuff works out and that I don't get that `missing wkv_b tensor(s) hanging MLA from to 1` warning. Let me know if you have any luck getting `-mla 3` going on ik's fork! Hope to try it myslf today.
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **16:41:44**
+
+I haven't ported the python changes yet, just getting ik_llama to load the model.
+
+---
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-14** at **16:43:15**
+
+LGTM.
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **16:44:11**
+
+@ikawrakow sorry, I forgot to mark this as a draft. Still waiting for llama.cpp branch to merge...
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **16:45:01**
+
+I'll open a follow up PR to bring any changes as well as port the python script support.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **16:45:01**
+
+@anikifoss
+
+Okay yeah I was thinking this might happen as I'd seen it trying to use the "mainline method" instead of the OG fairydreaming evshiron method to preserve the tensors. Yeah that warning is because the "mainline method" handles some MLA tensors differently. I always use the evshiron method for my ik specific quants.
+
+So might need to look into the differences in what you have ported and with https://github.com/evshiron/llama.cpp
+
+@saood06 and I have been discussing it'd be great to get this all into ik's fork.
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **16:47:14**
+
+@ubergarm I used unsloth's BF16 safetensors and then converted that to GGUF using llama.cpp, so I skipped the step that gives you the `missing wkv_b tensor(s) hanging MLA from to 1` warning.
+
+I quantized using unpatched ik_llama, and it seems to be working.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **16:57:03**
+
+> I quantized using unpatched ik_llama, and it seems to be working.
+
+Okay, then I think my path forward looks something like:
+
+1. Use this PR (now merged into main) to convert my bf16 safetensors to bf16 GGUF to test the code a little more lol
+2. use ik_llama.cpp to quantize a Q8_0
+3. confirm this Q8_0 is happy and no complaints about `missing wkv_b tensor(s)`
+4. use ik_llama.cpp to generate an imatrix.dat
+5. test out some mixes and relase some ik quants!
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **16:58:37**
+
+> Use this PR (now merged into main) to convert my bf16 safetensors to bf16 GGUF to test the code a little more lol
+
+The conversion code is currently missing (this was a draft PR, I did not expect it to get merged so fast)
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **17:01:21**
+
+Ahh okie, things are indeed moving fast. I'm reading up on some more clues from ik [here](https://github.com/ikawrakow/ik_llama.cpp/issues/601#issuecomment-3070185792) so it might be okay.
+
+I'll just use my existing bf16 GGUF then and try it out on ik_llama.cpp and confirm the default behavior is `-mla 1` for imatrix.
+
+Exciting monday lol :sweat_smile:
+
+---
+
+👤 **ikawrakow** commented on **2025-07-14** at **17:06:37**
+
+> @ikawrakow sorry, I forgot to mark this as a draft. Still waiting for llama.cpp branch to merge...
+
+It's OK. You can make a separate PR for the Python stuff. In the meantime if someone is really desperate to try the model with `ik_llama.cpp`, they can do it with a GGUF that has been created with mainline.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **17:07:37**
+
+It'd sure be interesting if someone released an Kimi-K2-Instruct-1000B-A32B-IQ2_KL...
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **17:11:54**
+
+> It'd sure be interesting if someone released an Kimi-K2-Instruct-1000B-A32B-IQ2_KL...
+
+That is YOUR job :sweat_smile: ... I'm sticking to q4+ quants with no imatrix. But not many have enough RAM to run those. My system is using 690G with the DQ4_K quant.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **19:44:28**
+
+So yeah I tested this PR too using a "mainline style" Q8_0 i cooked and it is running at least single inference:
+
+```
+>>> User:
+
+Count from 1 to 10 in French.
+
+>>> Assistant:
+
+1. un
+2. deux
+3. trois
+4. quatre
+5. cinq
+6. six
+7. sept
+8. huit
+9. neuf
+10. dix
+```
+
+Despite quantizing my bf16 GGUF with ik_llama.cpp it still throws that warning, so there are some important details happening differntly in the convert_hf_to_gguf.py between [ik_llama.cpp's version](https://github.com/ikawrakow/ik_llama.cpp/blob/main/convert_hf_to_gguf.py#L3462-L3484) and [mainline's verison](https://github.com/gabriellarson/llama.cpp/blob/kimi-k2/convert_hf_to_gguf.py#L5705-L5725)
+
+So I'm fussing to see if I can merge in just the changes needed from gabriellarson/llama.cpp/tree/kimi-k2 without messing up the MLA tensors so they stay the OG way... Then I will have a bf16 GGUF with the OG style MLA tensors and can go forward like normal haha...
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **19:50:30**
+
+@ubergarm I see the following message when running with ik_llama, is this the same issues you are looking at?
+```
+============ llm_prepare_mla: need to compute 61 wkv_b tensors
+Computed blk.0.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.1.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.2.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.3.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.4.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.5.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.6.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.7.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.8.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.9.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.10.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.11.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.12.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.13.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.14.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.15.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.16.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.17.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.18.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.19.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.20.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.21.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.22.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.23.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.24.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.25.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.26.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.27.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.28.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.29.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.30.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.31.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.32.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.33.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.34.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.35.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.36.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.37.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.38.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.39.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.40.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.41.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.42.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.43.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.44.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.45.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.46.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.47.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.48.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.49.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.50.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.51.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.52.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.53.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.54.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.55.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.56.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.57.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.58.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.59.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+Computed blk.60.attn_kv_b.weight as 512 x 16384 and stored in buffer CUDA0
+```
+
+---
+
+👤 **whatever1983** commented on **2025-07-14** at **19:53:14**
+
+yo, guys, seriously, just had to comment on this model on two fronts:
+
+First, the model is just 1Trillion, and you already have to deal with 2TB BF16 files. Either you look at DFloat11 format and compress the matissa to 11.2bpw perfectly. If not only for ssd savings. I was begging ik to consider working with FP8/FP4 formats in another thread and got rejected. Why go through the FP8-> 2TB BF16 safetensors with triton-cpu -> q8_0 loss->requantize to 2-3bits, when FP4 checkpoints are out there @ 580GB k-l-lambda/Kimi-K2-Instruct-FP4 or baseten/Kimi-K2-Instruct-FP4? I know it is a lot to implement for FP8/FP4. vllm already has a marlin FP4 kernel. SGlang has a petit-nvfp4 WIP kernel for ROCm. What's missing is CPU based NVFP4/FP8 inferencing using bf16 recast. Really, you work with 580GB of weights already done for you.
+
+Second comment is for the Kimi K2 model itself. If you haven't read the README, it is only 51 SWE-Bench Verified for non-agent, below R1-0528's 57points. 65 for single agent, but then you have to use tooling, which includes bash. ("Kimi K2 achieves 65.8% pass@1 on the SWE-bench Verified tests with bash/editor tools" So if you want a SWE-bench 8 points higher than R1-0528, you have to expose your bash prompt. Who knows what the bash prompt is calling HTTPS API endpoints, posting your data to which API endpoints? It is such a security risk, are you going to sandbox your bash execution? All I can speculate is that you could theoretically call the Anthropic API point to fudge the benchmark. Then there is the 71 points for multiagent SWE-bench(aka cons=32 or 64). Good luck running 10toks/sec on a 768GB DDR5 EPYC @ cons=64. You could sleep all night and come back in the morning for a cons64 job.
+
+Not that impressive 1Trillion model if you care about data security or claimed performance. I suggest that you just either wait for OpenAI's open source model, which calls O3 via HTTP, or just pay 30dollars/month for grok4-coder cons=1 at SWE-bench=72.
+
+---
+
+👤 **saood06** commented on **2025-07-14** at **19:55:08**
+
+> So I'm fussing to see if I can merge in just the changes needed from gabriellarson/llama.cpp/tree/kimi-k2 without messing up the MLA tensors so they stay the OG way... Then I will have a bf16 GGUF with the OG style MLA tensors and can go forward like normal haha...
+
+Like I said on HF, if you take the ~2 TB BF16 safetensor you made, then you can just use the `ik_llama.cpp` convert script (with the kimi changes) and it should give you a GGUF with the MLA tensors you want.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **20:05:16**
+
+@anikifoss
+
+I think I got it going now: https://github.com/ikawrakow/ik_llama.cpp/issues/601#issuecomment-3070800462
+
+You'll have to download the ~1TB FP8 yourself and fp8_cast_bf16 them like I show in that hf repo discussion. And if my current test works, I'll open a PR with with the updated ik_llama.cpp convert_hf_to_gguf.py including the Kimi-K2 fixes. (or i could upload the 2TB bf16 with the correct MLA tensors, but would have to check if that is okay with the uplink ata first... haha... :sweat_smile: )
+
+If you start with unsloth's bf16 they already have the mainline MLA stuff done to them.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **20:15:55**
+
+@whatever1983
+
+> I suggest that you just either wait for OpenAI's open source model, which calls O3 via HTTP, or just pay 30dollars/month for grok4-coder cons=1 at SWE-bench=72.
+
+But where is the fun in that? ;p And besides, I generally don't use LLMs I just enjoy making them go brrr....
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **20:18:24**
+
+Do we feed the trolls? :thinking:
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **20:24:52**
+
+Kimi-K2 has amazing VRAM savings, I can load the full 131k context!
+
+I am **over the moon** with this model :new_moon_with_face:
+
+---
+
+👤 **saood06** commented on **2025-07-14** at **20:30:20**
+
+> I am **over the moon** with this model 🌚
+
+I haven't tried the model at all, but I have heard mixed feedback about it.
+
+If you don't mind, how prone to refusals is it? That's the one area I'm most curious about (and will probably affect when/whether or not I end up trying the model locally).
+
+---
+
+👤 **whatever1983** commented on **2025-07-14** at **20:36:42**
+
+@anikifoss:
+
+Why do you call me a troll? That's just not nice. I am realistic. What's the point of running DQ4KM at 690GB or IQ2K IQ3K levels further dropping SWE-bench, if you use it for real work? It took me about a year messing with GGUF to realize that the GGUF format, even with IK's superb IQK quants is such a toy for client side home production, and I am forced to move to original FP4 safetensors format instead or just pay for the top tier models. GGUF got started too early. There's a BF16-> IQ6K compression saving, even at FP8->IQ6K. The compression just disappears when GB200 trains FP4 models natively, no one is dumb enough to run FP4 trained/compressed model at IQ6K.
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **20:36:46**
+
+> If you don't mind, how prone to refusals is it?
+
+Thanks I'll keep an eye on it. But so far it's been amazing at answering my usual benchmark questions. I'll try my goto roo-code project to see how well it does.
+
+In terms of refusal, I noticed devstall-small-24b was refusing some of my suggestions. I suspect it's related to agentic lean, when models are taught to avoid uncertain actions to prevent getting into the weeds. Since Kimi-K2 is mainly developed for agentic use, it may have similar tendencies.
+
+---
+
+👤 **saood06** commented on **2025-07-14** at **20:55:05**
+
+> Thanks I'll keep an eye on it. But so far it's been amazing at answering my usual benchmark questions. I'll try my goto roo-code project to see how well it does.
+>
+> In terms of refusal, I noticed devstall-small-24b was refusing some of my suggestions. I suspect it's related to agentic lean, when models are taught to avoid uncertain actions to prevent getting into the weeds.
+
+Never heard the term "agentic lean" before.
+
+If you are just using it for coding tasks, then I'm not sure you will hit the refusals I care about. It's not even the refusals I care about as bypassing them is rather trivial, but their existence and prevalence tend to correlate with training decisions which impact downstream quality which is what I care about. (Never refusing like abliterated models leads to worse quality from what I've seen, just like a model that refuses too often).
+
+---
+
+👤 **anikifoss** commented on **2025-07-14** at **21:02:47**
+
+> Never heard the term "agentic lean" before.
+
+Sorry, that sounds like something a tech bro would say. Perhaps I was primed somehow :sweat_smile:. Just sharing my thoughts that these models were both trained for agentic use-cases, so they may share simlar tendencies.
+
+---
+
+👤 **saood06** commented on **2025-07-14** at **21:07:19**
+
+> > Never heard the term "agentic lean" before.
+>
+> Sorry, that sounds like something a tech bro would say. Perhaps I was primed somehow 😅.
+
+Not calling you out, just was new vocabulary for me.
+
+>Just sharing my thoughts that these models were both trained for agentic use-cases, so they may share simlar tendencies.
+
+That does make sense. I do appreciate your thoughts, no need to apologize.
+
+---
+
+👤 **ubergarm** commented on **2025-07-14** at **22:48:17**
+
+@anikifoss
+
+sorry i'm taking so long, still testing my convert_hf_to_gguf.py is working, its taking a while i had to restart for hardware stuff, hah... it is just the mainline changes for kimidev applied to the existing ik_llama.cpp fork's convert_hf_to_gguf.py - no need for the evshiron fork technically (though it is convenient to save a step and disk space, but outside this scope for me).
+
+the mainline PR is still having some discussion, and from i heard in BeaverAIClub the chat template looks like this (with no newlines) (credit tofumagnate for this info) from converting the official template: https://huggingface.co/moonshotai/Kimi-K2-Base/blob/main/tokenizer_config.json#L154
+
+```
+<|im_system|>system<|im_middle|>example system prompt<|im_end|><|im_user|>user<|im_middle|>example user turn 1<|im_end|><|im_assistant|>assistant<|im_middle|>example assistant turn 1<|im_end|><|im_user|>user<|im_middle|>example user turn 2<|im_end|><|im_assistant|>assistant<|im_middle|>
+```
+
+So probably gonna need something around here: https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp#L23236-L23259 for the chat completions endpoint to detect it and apply it on the server side...
+
+*UPDATE*
+The convert is getting close, over 80% I kept having tmux explode on me and then ran the process in `nohup` and its going like a champ (knock on wood). Random aside my rsync --progress had been doing the same thing with tmux panes suddenly closing and nothing in dmesg and no ram errors etc. Anyway, i gotta be careful of how I pipe tqdm progress bar style output with my terminal i guess maybe hopefully lol...
+
+Anyway, if thing thing finishes finally I can get a Q8_0 that *should* not have the warning on this fork! What a day lol
+
+---
+
+👤 **saood06** commented on **2025-07-14** at **23:10:34**
+
+> BeaverAIClub
+
+Is that a discord?
+
+> So probably gonna need something around here: https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp#L23236-L23259 for the chat completions endpoint to detect it and apply it on the server side...
+
+I never connected the dots that the chat completion endpoint needs that (probably because I prefer and almost always use the standard completion endpoint). Thanks.
+
+---
+
+👤 **ubergarm** commented on **2025-07-15** at **02:46:33**
+
+@anikifoss
+
+I finally think I'm out of the woods with the convert script... My tmux was dying which would end the process, had to run it in a nohup lol... I think its `tqdm` progress bar messing with my terminal or something :crossed_fingers:
+
+Anyway, in the mean time I pushed a branch, but want to test it is working with a quant. I also added what I think will be the chat template which also needs testing. I could open a draft PR I suppose at least to have a place holder...
+
+https://github.com/ubergarm/ik_llama.cpp/tree/ug/convert-kimi-k2
+
+One step closer!
+
+*UPDATE*: Went ahead and opened a draft PR https://github.com/ikawrakow/ik_llama.cpp/pull/612
\ No newline at end of file
diff --git a/github-data/pull_requests/61 - Adding ability to have meta data per tensor row.md b/github-data/pull_requests/61 - Adding ability to have meta data per tensor row.md
index 28a80d29d..1522bb236 100644
--- a/github-data/pull_requests/61 - Adding ability to have meta data per tensor row.md
+++ b/github-data/pull_requests/61 - Adding ability to have meta data per tensor row.md
@@ -1,14 +1,18 @@
-### 🔀 [#61](https://github.com/ikawrakow/ik_llama.cpp/pull/61) - Adding ability to have meta data per tensor row
+## 🔀 [Pull Request #61](https://github.com/ikawrakow/ik_llama.cpp/pull/61) - Adding ability to have meta data per tensor row
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/per_row_scale` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-25 |
| **Updated** | 2024-09-27 |
+| **Merged** | 2024-09-27 |
+| **Labels** | `Breaking change` |
---
-#### Description
+## 📄 Description
`ggml` is very opinionated on the topic of tensor data layout - things must be organized in blocks of a known size, the number of elements in a block must be fixed, etc. There are many places where it is assumed that a contiguous tensor row with `ne` elements occupies `ne * ts / bs` bytes, where `ts` is the "type size" and `bs` is the "block size". This is not very useful when one wants to have some meta data per tensor or per row (e.g., tensor or row scale, quant values in a K-means clustering based quantization, etc.).
diff --git a/github-data/pull_requests/610 - q8_k_r8 experimental AVX512 version.md b/github-data/pull_requests/610 - q8_k_r8 experimental AVX512 version.md
new file mode 100644
index 000000000..f67bcf479
--- /dev/null
+++ b/github-data/pull_requests/610 - q8_k_r8 experimental AVX512 version.md
@@ -0,0 +1,196 @@
+## 🔀 [Pull Request #610](https://github.com/ikawrakow/ik_llama.cpp/pull/610) - q8_k_r8: experimental AVX512 version
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Source Branch** | `ik/q8_k_r8_avx512` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-14 |
+| **Updated** | 2025-07-27 |
+
+---
+
+## 📄 Description
+
+@ubergarm This is specifically for your 9950X CPU.
+
+On my 7950X this is ~10% slower than what we have on the main branch. The 7950X supports `AVX512`, but 512-bit instructions get executed as two 256-bit instructions. Hence, I'm expecting (hoping?) this `Q8_K_R8` GEMM version to be significantly faster on a CPU with "real" 512-bit instructions such as the 9950X.
+
+Please benchmark it so I can decide if it is worth adding this to the main branch.
+
+---
+
+## 💬 Conversation
+
+👤 **ubergarm** commented on **2025-07-14** at **17:27:49**
+
+Wow! :rocket: this little amd 9950x can really rip with its "real" 512-bit instruction!!!
+
+The chart is getting too busy, but left everything to show how crazy it is to see faster PP on my 16x gaming rig that the 24x core thread ripper pro! 😮🎉🥂
+
+*EDIT* the title is a bit misleading, that commit was used for the earlier tests. The actual commit used is shown in the legend in tiny tiny hard to read font. thanks.
+
+
+
+
+
+👈 Details
+
+The other data and info is over on [#602](https://github.com/ikawrakow/ik_llama.cpp/issues/602)
+
+# Q8_K_R8 9950X 16x PR610 ik/q8_k_r8_avx512@c462c5bd
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 512 | 128 | 0 | 1.593 | 321.47 | 21.403 | 5.98 |
+| 512 | 128 | 512 | 1.658 | 308.80 | 21.594 | 5.93 |
+| 512 | 128 | 1024 | 1.719 | 297.82 | 21.855 | 5.86 |
+| 512 | 128 | 1536 | 1.797 | 284.99 | 22.093 | 5.79 |
+| 512 | 128 | 2048 | 1.866 | 274.35 | 22.337 | 5.73 |
+| 512 | 128 | 2560 | 1.948 | 262.82 | 22.605 | 5.66 |
+| 512 | 128 | 3072 | 2.008 | 254.93 | 22.899 | 5.59 |
+| 512 | 128 | 3584 | 2.084 | 245.66 | 23.271 | 5.50 |
+| 512 | 128 | 4096 | 2.152 | 237.93 | 23.333 | 5.49 |
+
+
+
+---
+
+👤 **ikawrakow** commented on **2025-07-14** at **17:42:43**
+
+OK, then, I'll create a way to select one of the two kernels at build time.
+
+Yes, the 9950X is really nice. I was tempted to upgrade when it came out, but at the end didn't because AMD didn't do anything for memory bandwidth.
+
+---
+
+👤 **Ph0rk0z** commented on **2025-07-18** at **13:11:56**
+
+Wish there was a way to use AVX-512 without the ML extensions. Or would it not provide any benefit over AVX2?
+
+---
+
+👤 **sousekd** commented on **2025-07-22** at **20:58:13**
+
+Do I need specific model quants to test it? I tried using **anikifoss/Kimi-K2-Instruct-DQ4_K** and **bartowski/Qwen3-235-A22B-Q8_0** with `-rtr`, but I didn't notice any difference compared to the main branch on my EPYC 9355. It might be due to how I compiled it on Windows, though.
+
+---
+
+👤 **ubergarm** commented on **2025-07-23** at **01:03:05**
+
+@sousekd
+
+> Do I need specific model quants to test it?
+
+If I understand correctly this only effects quants that use q8_k_r8 path so I don't think your Q8_0 would be effected nor your q4_K/q6_K quants which use different paths [as i tried to find a way to describe here in an older buried comment](https://github.com/ikawrakow/ik_llama.cpp/pull/495#issuecomment-2985633815).
+
+I think this would be a list of the current quants that if are in your mix you might see a boost in PP using this PR on a Zen5 CPU:
+
+
+
+👈 supported quants
+
+```bash
+$ grep Q8_K_R8 ggml/src/iqk/iqk_mul_mat.cpp | grep type
+ case GGML_TYPE_IQ2_XXS: return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_XS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_S : return nrc_y >= 16 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_XXS: return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ4_XS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ1_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ1_M : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_Q2_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_Q3_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_KL : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ4_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ4_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ5_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ5_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ6_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_Q2_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_Q3_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ1_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ1_M : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_XXS: return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_XS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_XXS: return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_S : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ4_XS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_KL : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ4_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ5_KS : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ2_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ3_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ4_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ5_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+ case GGML_TYPE_IQ6_K : return nrc_y >= 32 ? GGML_TYPE_Q8_K_R8 : type;
+```
+
+
+
+I'm not sure how `-rtr` would effect it or not, I'd suggest leave it off if you are testing and just attempt to boost `-ub 4096 -b 4096` for max PP as is my practice.
+
+psure your CPU should support it though as it is Zen5, and i have no idea about windows compiling. on linux i run `lscpu | grep avx_vnni` to check for the flag in question.
+
+You could possibly give this quant a try as it is mostly quants from this list: https://huggingface.co/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF#iq5_k-161722-gib-5909-bpw
+
+I measured slightly better PPL than the larger DQ4_K, but I used an imatrix so go with whatever you prefer.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **07:31:38**
+
+Yes, pick one of the quantization types in the list provided by @ubergarm to see if it makes a difference on your Zen5 CPU.
+
+> I'm not sure how -rtr would effect it or not
+
+Do not use `-rtr`. With `-rtr` it will repack the quants to the corresponding row-interleaved `*_R4` or `*_R8` variant while loading the model. The row-interleaved quants do not get repacked to `Q8_K_R8` for large matrix multiplications, so the PR will have no effect on performance in that case.
+
+---
+
+👤 **sousekd** commented on **2025-07-27** at **18:34:22**
+
+So I was finally able to measure minor but stable PP t/s improvements with this PR on EPYC 9355 x Windows, when running **CPU only** compiled without CUDA :). Tested on @ubergarm's [Qwen3-235B-A22B-Thinking-2507-IQ5_K](https://huggingface.co/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF) with `--no-mmap -fa -fmoe -ctk q8_0 -ctv q8_0 -c 32768 -b 4096 -ub 4096 --parallel 1 --threads 32 --threads-batch 28`. I repeated the test few times to validate the results:
+
+### Main:
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 4096 | 1024 | 0 | 27.540 | 148.73 | 54.104 | 18.93 |
+| 4096 | 1024 | 4096 | 34.614 | 118.33 | 59.500 | 17.21 |
+| 4096 | 1024 | 8192 | 41.932 | 97.68 | 65.398 | 15.66 |
+| 4096 | 1024 | 12288 | 49.304 | 83.08 | 70.115 | 14.60 |
+| 4096 | 1024 | 16384 | 56.678 | 72.27 | 75.751 | 13.52 |
+
+### This PR:
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 4096 | 1024 | 0 | 26.115 | 156.85 | 54.520 | 18.78 |
+| 4096 | 1024 | 4096 | 33.274 | 123.10 | 60.143 | 17.03 |
+| 4096 | 1024 | 8192 | 40.598 | 100.89 | 65.399 | 15.66 |
+| 4096 | 1024 | 12288 | 48.046 | 85.25 | 70.323 | 14.56 |
+| 4096 | 1024 | 16384 | 55.303 | 74.07 | 75.557 | 13.55 |
+
+With a GPU in the mix (RTX 5090), any improvements are within a margin of error. Same model, but with `--no-mmap -fa -fmoe -c 32768 -b 16384 -ub 16384 -ngl 999 -ot "blk\.([3-9]|[1-9][0-9])\.ffn_.*_exps=CPU" --parallel 1 --threads 32 --threads-batch 28`:
+
+### Main:
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 16384 | 4096 | 0 | 18.737 | 874.41 | 215.650 | 18.99 |
+| 16384 | 4096 | 16384 | 22.726 | 720.92 | 226.351 | 18.10 |
+
+### This PR:
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 16384 | 4096 | 0 | 18.584 | 881.64 | 216.695 | 18.90 |
+| 16384 | 4096 | 16384 | 22.715 | 721.29 | 224.334 | 18.26 |
\ No newline at end of file
diff --git a/github-data/pull_requests/610 - q8_k_r8_ experimental AVX512 version.md b/github-data/pull_requests/610 - q8_k_r8_ experimental AVX512 version.md
deleted file mode 100644
index 0d010260d..000000000
--- a/github-data/pull_requests/610 - q8_k_r8_ experimental AVX512 version.md
+++ /dev/null
@@ -1,17 +0,0 @@
-### 🔀 [#610](https://github.com/ikawrakow/ik_llama.cpp/pull/610) - q8_k_r8: experimental AVX512 version
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ✅ **Open** |
-| **Created** | 2025-07-14 |
-| **Updated** | 2025-07-18 |
-
----
-
-#### Description
-
-@ubergarm This is specifically for your 9950X CPU.
-
-On my 7950X this is ~10% slower than what we have on the main branch. The 7950X supports `AVX512`, but 512-bit instructions get executed as two 256-bit instructions. Hence, I'm expecting (hoping?) this `Q8_K_R8` GEMM version to be significantly faster on a CPU with "real" 512-bit instructions such as the 9950X.
-
-Please benchmark it so I can decide if it is worth adding this to the main branch.
\ No newline at end of file
diff --git a/github-data/pull_requests/611 - Bump GGML_MAX_CONTEXTS to allow loading more shards.md b/github-data/pull_requests/611 - Bump GGML_MAX_CONTEXTS to allow loading more shards.md
index 75b5d7e58..2a5d6c49a 100644
--- a/github-data/pull_requests/611 - Bump GGML_MAX_CONTEXTS to allow loading more shards.md
+++ b/github-data/pull_requests/611 - Bump GGML_MAX_CONTEXTS to allow loading more shards.md
@@ -1,14 +1,17 @@
-### 🔀 [#611](https://github.com/ikawrakow/ik_llama.cpp/pull/611) - Bump GGML_MAX_CONTEXTS to allow loading more shards
+## 🔀 [Pull Request #611](https://github.com/ikawrakow/ik_llama.cpp/pull/611) - Bump GGML_MAX_CONTEXTS to allow loading more shards
| **Author** | `Thireus` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `patch-1` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-15 |
| **Updated** | 2025-07-16 |
+| **Merged** | 2025-07-16 |
---
-#### Description
+## 📄 Description
This var prevents more than 64 shards from being loaded - Specifically relevant for large models such as DeepSeek R1.
@@ -22,49 +25,94 @@ I have tested it extensively for a few weeks - see https://github.com/Thireus/ik
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2025-07-15** at **01:19:45**:
+👤 **saood06** commented on **2025-07-15** at **01:19:45**
Would it make sense to also include this https://github.com/Thireus/ik_llama.cpp/commit/65dd65c10d2dc24cdddbd6255c3841c6a6c1038c as well for Windows users?
---
-👤 **ikawrakow** submitted a review the **2025-07-15** at **05:08:20**: 💬 `COMMENTED`
+👤 **ikawrakow** started a conversation on `ggml/include/ggml.h` on **2025-07-15** at **05:08:20**
----
-
-👤 **saood06** submitted a review the **2025-07-15** at **05:12:32**: 💬 `COMMENTED`
-
----
+Is 2048 really needed? The quoted whisper.cpp thread talks about 256 contexts, not 2048.
-👤 **saood06** commented during a code review the **2025-07-15** at **05:12:32** on `ggml/include/ggml.h`:
-
-It is if you want to use his tool suite, which makes use of GGUF split to this degree: https://huggingface.co/Thireus/DeepSeek-R1-0528-THIREUS-BF16-SPECIAL_SPLIT/blob/main/DeepSeek-TNG-R1T2-Chimera-THIREUS-BF16-00001-of-01148.gguf
-
-1148 files for R1, so 2048 feels justified.
-
----
+> 👤 **saood06** replied on **2025-07-15** at **05:12:32**
+>
+> It is if you want to use his tool suite, which makes use of GGUF split to this degree: https://huggingface.co/Thireus/DeepSeek-R1-0528-THIREUS-BF16-SPECIAL_SPLIT/blob/main/DeepSeek-TNG-R1T2-Chimera-THIREUS-BF16-00001-of-01148.gguf
+>
+> 1148 files for R1, so 2048 feels justified.
-👤 **ikawrakow** submitted a review the **2025-07-15** at **05:59:36**: 💬 `COMMENTED`
+> 👤 **ikawrakow** replied on **2025-07-15** at **05:59:36**
+>
+> But apart from the tool suite, when are we going to need more than 64, or perhaps 256, shards?
+>
+> Sure, the `ggml_context` struct is not that large (88 bytes, so we will waste a mere 170 kB).
+>
+> But then again, are you actually having 1148 contexts **at the same time** in your tool suite?
----
+> 👤 **Thireus** replied on **2025-07-15** at **06:08:05**
+>
+> Sharding with max 1 tensor per shard, which allows each tensor to be swapped individually at will. So one can create quants of individual tensors, and moving from one mixture of quants to another for a specific model simply means swapping some shards for others. No quantisation necessary, only download.
+>
+> It works quite well and saves a tone of time, as one quand quickly swap tensor quants without going through quantising the whole model again. But it would be somewhat also effective if we could also quantise individual tensors (if someone wants to create such alternative tool, or enhance llama-quantization to allow this), which would give an alternative when shards aren't available.
+>
+> https://huggingface.co/collections/Thireus/deepseek-r1-0528-thireus-special-split-68725429aceffbd1094bdd29
+>
+> Of course if someone really wants to have less shards after downloading the mixture of shards, they can merge them, but that defeats the purpose of allowing for quick swaps between mixes by only downloading and replacing the necessary tensors.
+>
+> I wrote a downloader that managed this all which is quant_downloader.sh at the root of gguf.thireus.com if you'd like to try it out. It can be used for any existing recipes, including @ubergarm ones (since he's one of the few who shares his recipes openly), providing all the quants of the tensors that comprise the recipe are available somewhere to download.
+> The vision is that quantising models becomes more efficient, with one person pre-quantising tensors individual into shards and sharing recipes only, instead of sharing whole merged models (which can always be provided as an option for the users who really hate optimisation).
+>
+> Didn't want to advertise my tool specifically, as I believe there are or can be other use cases and other tools that would benefit from an increased context size, as the upvotes from the llama.cpp seem to suggest.
-👤 **ikawrakow** commented during a code review the **2025-07-15** at **05:59:36** on `ggml/include/ggml.h`:
+> 👤 **saood06** replied on **2025-07-15** at **06:26:49**
+>
+> >Of course if someone really wants to have less shards after downloading the mixture of shards, they can merge them, but that defeats the purpose of allowing for quick swaps between mixes by only downloading and replacing the necessary tensors.
+>
+> I was just typing up a less eloquent version of this.
+>
+> I like your tool, I am planning to adapt some of the recipes you found to kimi-k2 to fit on my 384 GB server.
-But apart from the tool suite, when are we going to need more than 64, or perhaps 256, shards?
-
-Sure, the `ggml_context` struct is not that large (88 bytes, so we will waste a mere 170 kB).
-
-But then again, are you actually having 1148 contexts **at the same time** in your tool suite?
+> 👤 **ubergarm** replied on **2025-07-16** at **13:36:21**
+>
+> @saood06 I have now [uploaded a few Kimi-K2s](https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF). A couple of which might suit your needs.
+>
+> So if I understand @Thireus approach better it is to essentially pull apart individual tensors quantized to different levels and mix-match them back together using a bunch of "shards"?
+>
+> If so that kinda makes sense, given most of my quants are using the same attn/shexp/ffn dense layers and only changing the exps more or less.
+>
+> Feel free to rip tensors out of my GGUFs and frankenstein back together another mix! Interesting...
+
+> 👤 **Thireus** replied on **2025-07-16** at **14:35:16**
+>
+> @ubergarm Yes. Each tensor of the model is quantised to be exactly 1 shard, see [this collection](https://huggingface.co/collections/Thireus/deepseek-r1-0528-thireus-special-split-68725429aceffbd1094bdd29) and for example this [tensors.map](https://huggingface.co/Thireus/DeepSeek-R1-0528-THIREUS-BF16-SPECIAL_SPLIT/blob/main/tensors.map) for BF16. All possible combinations of recipe can then be produced (providing these shards are available).
+>
+> Users can use [quant_recipe_pipeline.ipynb](https://colab.research.google.com/github/Thireus/GGUF-Tool-Suite/blob/main/) to compute the recipe suitable to their VRAM and RAM requirements. Or use existing recipes such as yours.
+>
+> Once the recipe is generated, they can download the corresponding shards (tensors) using [quant_downloader.sh](https://github.com/Thireus/GGUF-Tool-Suite/blob/main/)
+>
+> So, there is no need to quantize models anymore, only download the mixture of shards as defined in the recipe. Users also don't need to merge the shards, thanks to this pull request llama can load all the individual frankensteined 1148 shards.
+
+> 👤 **ubergarm** replied on **2025-07-16** at **14:42:59**
+>
+> frankenshards i love it! it is still a bit beyond my full conception with the working parts. it'd be cool to see a 5 minute demo video if such a thing is possible. i'll have to look closer when I get some more time. thanks for thinking so far out there and pushing the innovation!
----
+> 👤 **ikawrakow** replied on **2025-07-16** at **14:52:59**
+>
+> > So, there is no need to quantize models anymore, only download the mixture of shards as defined in the recipe. Users also don't need to merge the shards, thanks to this pull request llama can load all the individual frankensteined 1148 shards.
+>
+> What if one wants a different imatrix? Or if there is an improvement in the quantization function?
-👤 **Thireus** submitted a review the **2025-07-15** at **06:08:05**: 💬 `COMMENTED`
+> 👤 **Thireus** replied on **2025-07-16** at **15:30:40**
+>
+> > What if one wants a different imatrix? Or if there is an improvement in the quantization function?
+>
+> They'll create their own shards with [DeepSeek-R1-0528-THIREUS-ANY-SPECIAL.sh](https://github.com/Thireus/GGUF-Tool-Suite/blob/main/models/DeepSeek-R1-0528/DeepSeek-R1-0528-THIREUS-ANY-SPECIAL.sh). And adjust [download.conf](https://github.com/Thireus/GGUF-Tool-Suite/blob/main/models/DeepSeek-R1-0528/download.conf) to point to their repos.
---
-👤 **ikawrakow** commented the **2025-07-15** at **06:26:41**:
+👤 **ikawrakow** commented on **2025-07-15** at **06:26:41**
How about this:
```c++
@@ -78,21 +126,7 @@ I see that `GGML_MAX_CONTEXTS` is not used anywhere else apart from `ggml.c`, so
---
-👤 **saood06** submitted a review the **2025-07-15** at **06:26:49**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-07-15** at **06:26:49** on `ggml/include/ggml.h`:
-
->Of course if someone really wants to have less shards after downloading the mixture of shards, they can merge them, but that defeats the purpose of allowing for quick swaps between mixes by only downloading and replacing the necessary tensors.
-
-I was just typing up a less eloquent version of this.
-
-I like your tool, I am looking to adapt some of the recipes you found to kimi-k2 to fit on my 384 GB server.
-
----
-
-👤 **Thireus** commented the **2025-07-15** at **06:35:54**:
+👤 **Thireus** commented on **2025-07-15** at **06:35:54**
> How about this:
>
@@ -118,72 +152,54 @@ Thank you.
---
-👤 **saood06** commented the **2025-07-15** at **06:58:21**:
+👤 **saood06** commented on **2025-07-15** at **06:37:41**
-@ikawrakow
-
->Which windows commit
-
-[Thireus@65dd65c](https://github.com/Thireus/ik_llama.cpp/commit/65dd65c10d2dc24cdddbd6255c3841c6a6c1038c)
-
->and when is there dynamic GGML_MAX_CONTEXTS?
+> along with a `cmake` variable that can be used to set `GGML_MAX_CONTEXTS`? You can then build the tool suite with whatever number of contexts you like (the way things are going, soon even 2048 may not be enough).
-And dynamic in the sense that built below 512 then nothing needs to be set, if built above 8192, set to only 8192 (as 8192 is the Windows limitation and 512 the default).
+For a dynamic `GGML_MAX_CONTEXTS` can the windows commit I describe can be set according to this limit (capped at 8192), and included?
---
-👤 **saood06** commented the **2025-07-16** at **00:31:03**:
+👤 **ikawrakow** commented on **2025-07-15** at **06:44:17**
-> [Thireus@65dd65c](https://github.com/Thireus/ik_llama.cpp/commit/65dd65c10d2dc24cdddbd6255c3841c6a6c1038c) would be a separate pull request as this is a different limitation (OS limitation for number of opened files), that code is required for Windows while other platforms (linux, macos) can use ulimit to lift the limitation.
+> For a dynamic GGML_MAX_CONTEXTS can the windows commit I describe can be set according to this limit (capped at 8192), and included?
-Sounds good to me.
+Don't understand this comment. Which windows commit and when is there dynamic `GGML_MAX_CONTEXTS`?
---
-👤 **ikawrakow** submitted a review the **2025-07-16** at **12:11:08**: ✅ `APPROVED`
+👤 **saood06** commented on **2025-07-15** at **06:58:21**
----
-
-👤 **ubergarm** commented during a code review the **2025-07-16** at **13:36:21** on `ggml/include/ggml.h`:
-
-@saood06 I have now [uploaded a few Kimi-K2s](https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF). A couple of which might suit your needs.
+@ikawrakow
-So if I understand @Thireus approach better it is to essentially pull apart individual tensors quantized to different levels and mix-match them back together using a bunch of "shards"?
+>Which windows commit
+
+[Thireus@65dd65c](https://github.com/Thireus/ik_llama.cpp/commit/65dd65c10d2dc24cdddbd6255c3841c6a6c1038c)
-If so that kinda makes sense, given most of my quants are using the same attn/shexp/ffn dense layers and only changing the exps more or less.
+>and when is there dynamic GGML_MAX_CONTEXTS?
-Feel free to rip tensors out of my GGUFs and frankenstein back together another mix! Interesting...
+And dynamic in the sense that if `GGML_MAX_CONTEXTS` is below 512 then nothing needs to be set (as 512 is the default), if built above 8192, set to only 8192 (as 8192 is the Windows hard upper limit [and even this is not guaranteed see [this](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setmaxstdio?view=msvc-160) for more info]).
---
-👤 **ubergarm** submitted a review the **2025-07-16** at **13:36:22**: 💬 `COMMENTED`
+👤 **Thireus** commented on **2025-07-15** at **08:30:07**
----
-
-👤 **Thireus** submitted a review the **2025-07-16** at **14:35:16**: 💬 `COMMENTED`
+https://github.com/Thireus/ik_llama.cpp/commit/65dd65c10d2dc24cdddbd6255c3841c6a6c1038c would be a separate pull request as this is a different limitation (OS limitation for number of opened files), that code is required for Windows while other platforms (linux, macos) can use ulimit to lift the limitation.
---
-👤 **ubergarm** submitted a review the **2025-07-16** at **14:42:59**: 💬 `COMMENTED`
-
----
+👤 **saood06** commented on **2025-07-16** at **00:31:03**
-👤 **ubergarm** commented during a code review the **2025-07-16** at **14:42:59** on `ggml/include/ggml.h`:
-
-frankenshards i love it! it is still a bit beyond my full conception with the working parts. it'd be cool to see a 5 minute demo video if such a thing is possible. i'll have to look closer when I get some more time. thanks for thinking so far out there and pushing the innovation!
-
----
-
-👤 **ikawrakow** submitted a review the **2025-07-16** at **14:52:59**: 💬 `COMMENTED`
+> [Thireus@65dd65c](https://github.com/Thireus/ik_llama.cpp/commit/65dd65c10d2dc24cdddbd6255c3841c6a6c1038c) would be a separate pull request as this is a different limitation (OS limitation for number of opened files), that code is required for Windows while other platforms (linux, macos) can use ulimit to lift the limitation.
+
+Thanks.
---
-👤 **Thireus** submitted a review the **2025-07-16** at **15:30:40**: 💬 `COMMENTED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-16** at **12:11:08**
---
-👤 **Thireus** commented during a code review the **2025-07-16** at **15:30:40** on `ggml/include/ggml.h`:
+👤 **Thireus** commented on **2025-07-16** at **23:47:10**
-> What if one wants a different imatrix? Or if there is an improvement in the quantization function?
-
-They'll create their own shards with [DeepSeek-R1-0528-THIREUS-ANY-SPECIAL.sh](https://github.com/Thireus/GGUF-Tool-Suite/blob/main/models/DeepSeek-R1-0528/DeepSeek-R1-0528-THIREUS-ANY-SPECIAL.sh). And adjust [download.conf](https://github.com/Thireus/GGUF-Tool-Suite/blob/main/models/DeepSeek-R1-0528/download.conf) to point to their repos.
\ No newline at end of file
+@saood06 - https://github.com/ikawrakow/ik_llama.cpp/pull/620
\ No newline at end of file
diff --git a/github-data/pull_requests/612 - kimi-k2 convert script and chat template.md b/github-data/pull_requests/612 - kimi-k2 convert script and chat template.md
index 0beea254a..e5a139b2b 100644
--- a/github-data/pull_requests/612 - kimi-k2 convert script and chat template.md
+++ b/github-data/pull_requests/612 - kimi-k2 convert script and chat template.md
@@ -1,14 +1,17 @@
-### 🔀 [#612](https://github.com/ikawrakow/ik_llama.cpp/pull/612) - kimi-k2 convert script and chat template
+## 🔀 [Pull Request #612](https://github.com/ikawrakow/ik_llama.cpp/pull/612) - kimi-k2 convert script and chat template
| **Author** | `ubergarm` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ug/convert-kimi-k2` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-15 |
| **Updated** | 2025-07-17 |
+| **Merged** | 2025-07-15 |
---
-#### Description
+## 📄 Description
1. Add convert script changes from @gabriellarson on mainline PR https://github.com/ggml-org/llama.cpp/pull/14654
2. Add kimi-k2 chat template to support chat endpoint (not sure if this is needed or if the gguf supplies the chat template via jinja or whatnot somehow lol)
@@ -26,23 +29,41 @@ blk.0.attn_kv_b.weight - [ 512, 16384, 1, 1], type = bf16, converting
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-07-15** at **04:17:47**:
+👤 **ubergarm** commented on **2025-07-15** at **04:17:47**
Okay just got the Q8_0 started up and seems coherent in short inferences. Also with this PR it does detect the chat template as such now:
```
INFO [ main] model loaded | tid="123282723551424" timestamp=1752553001
INFO [ main] chat template | tid="123282723551424" timestamp=1752553001 chat_example="<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|><|im_assistant|>assistant<|im_middle|>Hello<|im_end|><|im_user|>user<|im_middle|>Hi there<|im_end|><|im_assistant|>assistant<|im_middle|>How are you?<|im_end|>" built_in=true
-```
+```
+
+Gonna let this imatrix run and get some sleep. I added specifically `-mla 1` based on [this discussion](https://github.com/ikawrakow/ik_llama.cpp/issues/601#issuecomment-3070185792). Historically I leave off `-fa` as well during imatrix but not sure best practice or if it matters much.
+```
+model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf
+numactl --interleave=all \
+./build/bin/llama-imatrix \
+ -m "$model" \
+ -f ubergarm-imatrix-calibration-corpus-v02.txt \
+ -o /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat \
+ -mla 1 \
+ --verbosity 1 \
+ --ctx-size 512 \
+ --layer-similarity \
+ --numa distribute \
+ --threads 384
+```
+
+Thanks!
---
-👤 **ikawrakow** submitted a review the **2025-07-15** at **06:01:35**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-15** at **06:01:35**
---
-👤 **ubergarm** commented the **2025-07-15** at **16:13:31**:
+👤 **ubergarm** commented on **2025-07-15** at **16:13:31**
Thanks!
@@ -57,6 +78,16 @@ llm_load_print_meta: model params = 1.027 T
llm_load_print_meta: model size = 345.687 GiB (2.892 BPW)
llm_load_print_meta: repeating layers = 344.166 GiB (2.885 BPW, 1024.571 B parameters)
llm_load_print_meta: general.name = Kimi K2 Instruct Bf16 Safetensors
+
+llama_model_loader: - type f32: 365 tensors 11:59:08 [72/1848]
+llama_model_loader: - type q5_0: 61 tensors
+llama_model_loader: - type q8_0: 61 tensors
+llama_model_loader: - type iq4_k: 1 tensors
+llama_model_loader: - type iq6_k: 1 tensors
+llama_model_loader: - type iq4_ks: 122 tensors
+llama_model_loader: - type iq5_ks: 366 tensors
+llama_model_loader: - type iq3_ks: 60 tensors
+llama_model_loader: - type iq2_kl: 120 tensors
```
@@ -173,19 +204,30 @@ But then other times it does respond okay, well formatted, coherent...
So hoping maybe just the chat template is off and will hack on it some more before marking ready.
+*EDIT*
+Maybe I just need to `add_ass` to get it to reply:
+
+```bash
+$ python chat_template_tester.py moonshotai/Kimi-K2-Instruct
+>> chat template <<
+<|im_system|>system<|im_middle|>example system prompt<|im_end|><|im_user|>user<|im_middle|>example user turn 1<|im_end|><|im_assistant|>assistant<|im_middle|>example assistant turn 1<|im_end|><|im_user|>user<|im_middle|>example user turn 2<|im_end|><|im_assistant|>assistant<|im_middle|>
+>> end of chat template <<
+```
+
+
@anikifoss
No pressure, but happy to hear if you manage to use this convert script on the original fp8 safetensors to get your good MLA bf16 GGUFs (with the attn_kv_b tensor).
---
-👤 **anikifoss** commented the **2025-07-15** at **17:01:52**:
+👤 **anikifoss** commented on **2025-07-15** at **17:01:52**
@ubergarm I can test the `convert_hf_to_gguf.py` from this PR to convert unsloth's BF16 `safetensors` to GGUF.
---
-👤 **ubergarm** commented the **2025-07-15** at **17:07:54**:
+👤 **ubergarm** commented on **2025-07-15** at **17:07:54**
> @ubergarm I can test the `convert_hf_to_gguf.py` from this PR to convert unsloth's BF16 `safetensors` to GGUF.
@@ -195,7 +237,7 @@ So far so good, the updated chat template `add_ass` fixed the generation issue.
---
-👤 **ikawrakow** commented the **2025-07-15** at **17:13:34**:
+👤 **ikawrakow** commented on **2025-07-15** at **17:13:34**
> So as soon as my perplexity comes back clean I'll start uploading and be ready to merge this.
@@ -203,13 +245,41 @@ How quickly, or rather how slowly, does it go?
---
-👤 **ikawrakow** commented the **2025-07-15** at **17:19:00**:
+👤 **ikawrakow** commented on **2025-07-15** at **17:19:00**
Btw., I have decided to add a sub-2 bpw quant, `IQ1_KT`, at 1.75 bpw (so same as `IQ1_M`). It is Trellis, but my guess is that with Kimi-2 even more people will reach to the lowest possible bpw models. Desperate times call for desperate action! It is shaping up to be nearly on par with `IQ2_XXS` (2.0625 bpw), and certainly much better than `IQ1_M`. CUDA is done with very decent performance. I'll do the CPU tomorrow.
---
-👤 **ubergarm** commented the **2025-07-15** at **17:46:20**:
+👤 **ubergarm** commented on **2025-07-15** at **17:27:41**
+
+> How quickly, or rather how slowly, does it go?
+
+I hope to get some sweep-benches in eventually, anecdotally on short prompts with llama-server seeing around 130\~150 tok/sec PP and 10\~12 tok/sec TG running CPU-only on a single socket of a `AMD EPYC 9965 192-Core Processor` with 768GB DDR5@6400MT/s clocked around 260GiB/sec RAM bandwidth *per socket*
+
+Running like so on a single socket. I haven't found the sweet spot for threads given this rig is new to me.
+
+```bash
+model=/mnt/raid/hf/Kimi-K2-Instruct-GGUF/IQ2_KL/Kimi-K2-Instruct-IQ2_KL-00001-of-00008.gguf
+numactl -N 0 -m 0 \
+./build/bin/llama-server \
+ --model "$model"\
+ --alias ubergarm/Kimi-K2-Instruct \
+ --ctx-size 32768 \
+ -ctk q8_0 \
+ -fa -fmoe \
+ -mla 3 \
+ --parallel 1 \
+ --threads 64 \
+ --threads-batch 192 \
+ --numa numactl \
+ --host 127.0.0.1 \
+ --port 8080
+```
+
+---
+
+👤 **ubergarm** commented on **2025-07-15** at **17:46:20**
Okay perplexity ran clean on CPU only implementation:
@@ -233,7 +303,7 @@ Happy to merge this now and model will land in hugging face in 10 minutes.
---
-👤 **anikifoss** commented the **2025-07-15** at **19:10:17**:
+👤 **anikifoss** commented on **2025-07-15** at **19:10:17**
> Oh I didn't realize they uploaded the bf16 safetensors that must be just the output of fp8_cast_bf16.py yes that should work as that step does not strip the attn_kv_b so should work out! Thanks for testing, I know this thing is a monster. Working with this 1TB+ model feels like driving a barge lol...
@@ -241,7 +311,114 @@ Happy to merge this now and model will land in hugging face in 10 minutes.
---
-👤 **saood06** commented the **2025-07-16** at **00:29:19**:
+👤 **ubergarm** commented on **2025-07-15** at **19:22:13**
+
+@anikifoss
+
+Thanks for giving it a try, at least it sounds like this `convert_hf_to_gguf.py` "worked" on the unsloth BF16 safetensors? Hrmm... I think there is a safetensor viewer script let me see... In the mean time I'll check with my remote rig guy and ask him if its cool to just upload the bf16 GGUFs to make it easier for ya! Will update soon.
+
+
+
+👈 gguf dump of my bf16 GGUFs
+
+```bash
+$ python ./gguf-py/scripts/gguf_dump.py /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x15B-Instruct-safetensors-BF16-00001-of-00045.gguf
+
+INFO:gguf-dump:* Loading: /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-384x15B-Instruct-safetensors-BF16-00001-of-00045.gguf
+* File is LITTLE endian, script is running on a LITTLE endian host.
+* Dumping 48 key/value pair(s)
+ 1: UINT32 | 1 | GGUF.version = 3
+ 2: UINT64 | 1 | GGUF.tensor_count = 36
+ 3: UINT64 | 1 | GGUF.kv_count = 45
+ 4: STRING | 1 | general.architecture = 'deepseek2'
+ 5: STRING | 1 | general.type = 'model'
+ 6: STRING | 1 | general.name = 'Kimi K2 Instruct Bf16 Safetensors'
+ 7: STRING | 1 | general.finetune = 'Instruct-safetensors'
+ 8: STRING | 1 | general.basename = 'Kimi-K2'
+ 9: STRING | 1 | general.size_label = '384x15B'
+.
+.
+.
+
+* Dumping 36 tensor(s)
+ 1: 1174405120 | 7168, 163840, 1, 1 | BF16 | token_embd.weight
+ 2: 7168 | 7168, 1, 1, 1 | F32 | blk.0.attn_norm.weight
+ 3: 132120576 | 18432, 7168, 1, 1 | BF16 | blk.0.ffn_down.weight
+ 4: 132120576 | 7168, 18432, 1, 1 | BF16 | blk.0.ffn_gate.weight
+ 5: 132120576 | 7168, 18432, 1, 1 | BF16 | blk.0.ffn_up.weight
+ 6: 7168 | 7168, 1, 1, 1 | F32 | blk.0.ffn_norm.weight
+ 7: 512 | 512, 1, 1, 1 | F32 | blk.0.attn_kv_a_norm.weight
+ 8: 4128768 | 7168, 576, 1, 1 | BF16 | blk.0.attn_kv_a_mqa.weight
+ 9: 8388608 | 512, 16384, 1, 1 | BF16 | blk.0.attn_kv_b.weight
+ 10: 4194304 | 128, 32768, 1, 1 | BF16 | blk.0.attn_k_b.weight
+ 11: 4194304 | 512, 8192, 1, 1 | BF16 | blk.0.attn_v_b.weight
+ 12: 58720256 | 8192, 7168, 1, 1 | BF16 | blk.0.attn_output.weight
+ 13: 1536 | 1536, 1, 1, 1 | F32 | blk.0.attn_q_a_norm.weight
+ 14: 11010048 | 7168, 1536, 1, 1 | BF16 | blk.0.attn_q_a.weight
+ 15: 18874368 | 1536, 12288, 1, 1 | BF16 | blk.0.attn_q_b.weight
+ 16: 7168 | 7168, 1, 1, 1 | F32 | blk.9.attn_norm.weight
+ 17: 5637144576 | 2048, 7168, 384, 1 | BF16 | blk.9.ffn_down_exps.weight
+ 18: 5637144576 | 7168, 2048, 384, 1 | BF16 | blk.9.ffn_gate_exps.weight
+ 19: 5637144576 | 7168, 2048, 384, 1 | BF16 | blk.9.ffn_up_exps.weight
+ 20: 384 | 384, 1, 1, 1 | F32 | blk.9.exp_probs_b.bias
+ 21: 2752512 | 7168, 384, 1, 1 | F32 | blk.9.ffn_gate_inp.weight
+ 22: 14680064 | 2048, 7168, 1, 1 | BF16 | blk.9.ffn_down_shexp.weight
+ 23: 14680064 | 7168, 2048, 1, 1 | BF16 | blk.9.ffn_gate_shexp.weight
+ 24: 14680064 | 7168, 2048, 1, 1 | BF16 | blk.9.ffn_up_shexp.weight
+ 25: 7168 | 7168, 1, 1, 1 | F32 | blk.9.ffn_norm.weight
+ 26: 512 | 512, 1, 1, 1 | F32 | blk.9.attn_kv_a_norm.weight
+ 27: 4128768 | 7168, 576, 1, 1 | BF16 | blk.9.attn_kv_a_mqa.weight
+ 28: 8388608 | 512, 16384, 1, 1 | BF16 | blk.9.attn_kv_b.weight
+ 29: 4194304 | 128, 32768, 1, 1 | BF16 | blk.9.attn_k_b.weight
+ 30: 4194304 | 512, 8192, 1, 1 | BF16 | blk.9.attn_v_b.weight
+ 31: 58720256 | 8192, 7168, 1, 1 | BF16 | blk.9.attn_output.weight
+ 32: 1536 | 1536, 1, 1, 1 | F32 | blk.9.attn_q_a_norm.weight
+ 33: 11010048 | 7168, 1536, 1, 1 | BF16 | blk.9.attn_q_a.weight
+ 34: 18874368 | 1536, 12288, 1, 1 | BF16 | blk.9.attn_q_b.weight
+.
+.
+.
+```
+
+
+
+TODO: find a safetensor viewer...
+
+---
+
+👤 **anikifoss** commented on **2025-07-15** at **19:25:08**
+
+> Thanks for giving it a try, at least it sounds like this convert_hf_to_gguf.py "worked" on the unsloth BF16 safetensors?
+
+@ubergarm I haven't run it on your branch. What I'm saying is [this quant](https://huggingface.co/anikifoss/Kimi-K2-Instruct-DQ4_K), created from unsloth's BF16 safetensors and converted to GGUF using llama.cpp does not have `attn_kv_b`. So, most likely, unsloth's BF16 safetensors does not have `attn_kv_b`.
+
+---
+
+👤 **ubergarm** commented on **2025-07-15** at **19:33:06**
+
+@anikifoss
+
+> converted to GGUF using llama.cpp
+
+:point_up: that is the step which I believe munges up and omits the `attn_kv_b`.
+
+If you use the freshly merged ik_llama.cpp/convert_hf_to_gguf.py on those bf16 safetensors, I believe you will get the attn_kv_b tensors in your bf16 GGUF.
+
+afaik going from fp8 safetensors upcasting via fp8_cast_bf16.py bf16 safetensors does *not* mess with the actual tensors.
+
+> So, most likely, unsloth's BF16 safetensors does not have attn_kv_b.
+
+Unless they did something strange, I believe they should be okay to use with this new convert script.
+
+Probably easy enough to test if you have the disk space as nothing more required to download.
+
+*EDIT*
+
+I believe this is the code that is munging it up in [mainline convert_hf_to_gguf.py](https://github.com/ggml-org/llama.cpp/blob/master/convert_hf_to_gguf.py#L5832-L5852).
+
+---
+
+👤 **saood06** commented on **2025-07-16** at **00:29:19**
> TODO: find a safetensor viewer...
@@ -249,13 +426,13 @@ HF has one built in just like for GGUF.
---
-👤 **ubergarm** commented the **2025-07-16** at **02:53:08**:
+👤 **ubergarm** commented on **2025-07-16** at **02:53:08**
@ikawrakow
> How quickly, or rather how slowly, does it go?
-I finally got to some sweep benches feeling out this big dual socket AMD EPYC 9965 192-Core rig in NPS1 with ~768GB RAM per socket. mlc clocks it at around 256GiB/s RAM bandwidth per socket. The "smaller" Kimi-K2-Instruct quants will fit on a single socket. Given I believe this is Zen5 I tried out #610 and did see around 8% boost in PP with that AVX512 kernel. Also increasing `-ub 4096 -b 4096` and omitting `-rtr` a valid option even on this MoE.
+I finally got to some sweep benches feeling out this big dual socket AMD EPYC 9965 192-Core rig in NPS1 with ~768GB RAM per socket. mlc clocks it at around 256GiB/s RAM bandwidth per socket. The "smaller" Kimi-K2-Instruct quants will fit on a single socket. Given I believe this is Zen5 I tried out [#610](https://github.com/ikawrakow/ik_llama.cpp/issues/610) and did see around 8% boost in PP with that AVX512 kernel. Also increasing `-ub 4096 -b 4096` and omitting `-rtr` a valid option even on this MoE.
@@ -345,11 +522,22 @@ numactl -N 0 -m 0 \
| 4096 | 1024 | 4096 | 17.343 | 236.18 | 83.245 | 12.30 |
| 4096 | 1024 | 8192 | 21.132 | 193.83 | 86.125 | 11.89 |
-
+# IQ4_KS PR610 ik/q8_k_r8_avx512 --no-mmap -ub 4096 -b 4096
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 4096 | 1024 | 0 | 16.169 | 253.33 | 89.668 | 11.42 |
+| 4096 | 1024 | 4096 | 18.017 | 227.34 | 96.703 | 10.59 |
+| 4096 | 1024 | 8192 | 20.752 | 197.38 | 99.845 | 10.26 |
+
+
+
+I compared the larger IQ4_KS 550.428 GiB (4.604 BPW) and its remarkably similar performance.
+
+
---
-👤 **ubergarm** commented the **2025-07-16** at **03:39:04**:
+👤 **ubergarm** commented on **2025-07-16** at **03:39:04**
> Btw., I have decided to add a sub-2 bpw quant, IQ1_KT, at 1.75 bpw (so same as IQ1_M). It is Trellis, but my guess is that with Kimi-2 even more people will reach to the lowest possible bpw models. Desperate times call for desperate action! It is shaping up to be nearly on par with IQ2_XXS (2.0625 bpw), and certainly much better than IQ1_M. CUDA is done with very decent performance. I'll do the CPU tomorrow.
@@ -361,13 +549,31 @@ Curious to see how the IQ1_KT comes along as competition for the IQ1_S and IQ1_M
---
-👤 **ubergarm** commented the **2025-07-16** at **12:48:15**:
+👤 **anikifoss** commented on **2025-07-16** at **04:25:58**
+
+@ubergarm looks like you're missing an indent:
+```
+python convert_hf_to_gguf.py \
+ --outtype bf16 \
+ --split-max-size 50G \
+ /mnt/data/Models/unsloth/Kimi-K2-Instruct-BF16
+ File "convert_hf_to_gguf.py", line 3439
+ self._set_vocab_gpt2()
+ ^
+IndentationError: expected an indented block after 'else' statement on line 3438
+```
+
+I fixed it locally, so it can run overnight.
+
+---
+
+👤 **ubergarm** commented on **2025-07-16** at **12:48:15**
Thanks @anikifoss I opened a PR here https://github.com/ikawrakow/ik_llama.cpp/pull/617 with the fixup, let us know how it looks in the morning!
---
-👤 **anikifoss** commented the **2025-07-16** at **23:58:27**:
+👤 **anikifoss** commented on **2025-07-16** at **23:58:27**
Done:
```
@@ -378,6 +584,12 @@ HDDs are not fast :roll_eyes:
---
-👤 **anikifoss** commented the **2025-07-17** at **17:32:59**:
+👤 **anikifoss** commented on **2025-07-16** at **23:59:05**
+
+I'll quantize overnight and will let you know how it works tomorrow.
+
+---
+
+👤 **anikifoss** commented on **2025-07-17** at **17:32:59**
-@ubergarm quantized to Q4 for down_exp nd Q3 for the other exps. It runs, was able to produce spinning hexagon with 3 tries (Q4/Q3 mix is just under 512GB, but noticably worse than Q6/Q4).
\ No newline at end of file
+@ubergarm quantized converted GGUF to Q4_K for down_exps and Q3_K for the other exps. It runs, was able to produce spinning hexagon with 3 tries (Q4/Q3 mix is just under 512GB, but noticably worse than Q6/Q4).
\ No newline at end of file
diff --git a/github-data/pull_requests/616 - Adding IQ1_KT - 1.75 bpw SOTA quants.md b/github-data/pull_requests/616 - Adding IQ1_KT - 1.75 bpw SOTA quants.md
index 132b6fb09..6214b5069 100644
--- a/github-data/pull_requests/616 - Adding IQ1_KT - 1.75 bpw SOTA quants.md
+++ b/github-data/pull_requests/616 - Adding IQ1_KT - 1.75 bpw SOTA quants.md
@@ -1,14 +1,17 @@
-### 🔀 [#616](https://github.com/ikawrakow/ik_llama.cpp/pull/616) - Adding IQ1_KT - 1.75 bpw SOTA quants
+## 🔀 [Pull Request #616](https://github.com/ikawrakow/ik_llama.cpp/pull/616) - Adding IQ1_KT - 1.75 bpw SOTA quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ✅ **Open** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_kt` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-16 |
-| **Updated** | 2025-07-19 |
+| **Updated** | 2025-07-22 |
+| **Merged** | 2025-07-20 |
---
-#### Description
+## 📄 Description
With Kimi-2 at 1 trillion parameters being the new rage of the day, my guess is that even more local inference enthusiasts will reach to very low bit-per-weight (bpw) quantized models. The state of affairs in mainline `llama.cpp` for very low bpw quants is not good:
* Nothings has been done to improve quantization quality since I contributed [IQ1_S](https://github.com/ggml-org/llama.cpp/pull/5999) and [IQ1_M](https://github.com/ggml-org/llama.cpp/pull/6302) to mainline.
@@ -29,15 +32,15 @@ Similar to the other `*_KT` quants
As trellis quants performance is very low on Metal (at least for my 30-core M2-Max GPU), I didn't not even bother to add a Metal implementation.
-To illustrate the quantization quality compared to other quantization types, the next graph shows `PPL(Q)/PPL(f16)-1` for LlaMA-3.1-8B-Instruct, which is notoriously hard to quantize. I have excluded the `IQ1_M` and `IQ1_S` data points as this would have extended the y-axis too much to be useful. We can see that `IQ1_KT` at 1.92 bpw provides nearly the same quality as `IQ2_XXS` at 2.13 bpw, so almost a 10% reduction in model size for comparable quantization quality. I have made the `IQ2_KL` data point magenta because it was also added very recently in PR #602.
+To illustrate the quantization quality compared to other quantization types, the next graph shows `PPL(Q)/PPL(f16)-1` for LlaMA-3.1-8B-Instruct, which is notoriously hard to quantize. I have excluded the `IQ1_M` and `IQ1_S` data points as this would have extended the y-axis too much to be useful. We can see that `IQ1_KT` at 1.92 bpw provides nearly the same quality as `IQ2_XXS` at 2.13 bpw, so almost a 10% reduction in model size for comparable quantization quality. I have made the `IQ2_KL` data point magenta because it was also added very recently in PR [#602](https://github.com/ikawrakow/ik_llama.cpp/issues/602).
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ubergarm** commented the **2025-07-16** at **15:50:24**:
+👤 **ubergarm** commented on **2025-07-16** at **15:50:24**
> With Kimi-2 at 1 trillion parameters being the new rage of the day, my guess is that even more local inference enthusiasts will reach to very low bit-per-weight (bpw) quantized models.
@@ -102,21 +105,149 @@ numactl -N 1 -m 1 \
---
-👤 **ikawrakow** commented the **2025-07-16** at **19:26:04**:
+👤 **ikawrakow** commented on **2025-07-16** at **17:10:01**
+
+Thanks for cooking!
+
+---
+
+👤 **Nexesenex** commented on **2025-07-16** at **17:54:54**
+
+@ikawrakow :
+You might have forgotten the CUDA MMQ kernel file, now that you separate it from mmq.cuh. ^^
+
+---
+
+👤 **ubergarm** commented on **2025-07-16** at **17:59:44**
+
+I'm on CPU only with this thing or now, so its doing perplexity now!
+
+```
+llama_model_quantize_internal: model size = 1959011.30 MB
+llama_model_quantize_internal: quant size = 219322.63 MB
+
+main: quantize time = 4557251.29 ms
+main: total time = 4557251.29 ms
+
+llm_load_print_meta: model ftype = IQ1_KT - 1.75 bpw
+llm_load_print_meta: model params = 1.027 T
+llm_load_print_meta: model size = 214.182 GiB (1.792 BPW)
+llm_load_print_meta: repeating layers = 212.916 GiB (1.785 BPW, 1024.571 B parameters)
+llm_load_print_meta: general.name = Kimi K2 Instruct Bf16 Safetensors
+
+llama_model_loader: - type f32: 365 tensors
+llama_model_loader: - type q8_0: 61 tensors
+llama_model_loader: - type iq4_nl: 61 tensors
+llama_model_loader: - type iq5_ks: 1 tensors
+llama_model_loader: - type iq3_kt: 122 tensors
+llama_model_loader: - type iq4_kt: 367 tensors
+llama_model_loader: - type iq1_kt: 180 tensors
+
+perplexity: tokenization took 633.399 ms
+perplexity: calculating perplexity over 568 chunks, n_ctx=512, batch_size=2048, n_seq=4
+perplexity: 15.64 seconds per pass - ETA 37.02 minutes
+[1]3.3661,[2]4.4515,[3]3.6908,[4]3.6615,[5]3.3665,[6]3.2151,[7]3.2705,[8]3.2106,
+
+llama_print_timings: load time = 52736.67 ms
+llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: prompt eval time = 1940171.96 ms / 290816 tokens ( 6.67 ms per token, 149.89 tokens per second)
+llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: total time = 1958744.98 ms / 290817 tokens
+
+Final estimate: PPL = 4.3623 +/- 0.02432
+```
+
+sweep benches later
+
+---
+
+👤 **ikawrakow** commented on **2025-07-16** at **19:26:04**
@Nexesenex Thanks! Added the forgotten file.
---
-👤 **Nexesenex** commented the **2025-07-16** at **21:36:24**:
+👤 **ubergarm** commented on **2025-07-16** at **20:42:55**
+
+## -t 128 -tb 192 (of 192 cores)
+main: n_kv_max = 12288, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = -1, n_threads = 128, n_threads_batch = 192
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 4096 | 1024 | 0 | 17.377 | 235.72 | 75.213 | 13.61 |
+| 4096 | 1024 | 4096 | 20.470 | 200.10 | 80.030 | 12.80 |
+| 4096 | 1024 | 8192 | 21.570 | 189.89 | 82.918 | 12.35 |
+
+## -t 192 -tb 192 (of 192 cores)
+
+main: n_kv_max = 12288, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = -1, n_threads = 192, n_threads_batch = 192
+
+| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
+|-------|--------|--------|----------|----------|----------|----------|
+| 4096 | 1024 | 0 | 16.724 | 244.92 | 74.048 | 13.83 |
+| 4096 | 1024 | 4096 | 19.215 | 213.17 | 80.421 | 12.73 |
+| 4096 | 1024 | 8192 | 19.674 | 208.19 | 86.859 | 11.79 |
+
+Sorry no graphs as I'm on laptop in a library. Huh I'm surprised adding more to `--threads` improved PP speeds?
+
+> -t, --threads N number of threads to use during generation (default: 384)
+> -tb, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads)
+
+I'll fiddle with it more later.
+
+I might call this one the `-smol-IQ1_KT` and try a more "normal" mix with IQ2_KT as ffn_down_exps to see how it fares PPL and speed-wise.
+
+I'll not release any IQ1_KT until some further testing with CUDA and you are happy with everything. Fun!
+
+*EDIT*
+Kind of interesting to run https://github.com/amd/esmi_ib_library while its benchmarking. I'm running these benchmarks on socket/numa 0 which is down a bank of ram hah. I gotta make a script to measure DDR bandwidth vs time to get a more clear view of what is happening. I haven't seen it go much over ~200GB/s but don't know how it is measuring exactly either. Socket 1 is busy cooking the slightly larger IQ1_KT.
+
+```
+============================= E-SMI ===================================
+
+--------------------------------------
+| CPU Family | 0x1a (26 ) |
+| CPU Model | 0x11 (17 ) |
+| NR_CPUS | 768 |
+| NR_SOCKETS | 2 |
+| THREADS PER CORE | 2 (SMT ON) |
+--------------------------------------
+
+------------------------------------------------------------------------
+| Sensor Name | Socket 0 | Socket 1 |
+------------------------------------------------------------------------
+| Energy (K Joules) | 34137.254 | 40312.224 |
+| Power (Watts) | 415.192 | 474.688 |
+| PowerLimit (Watts) | 500.000 | 500.000 |
+| PowerLimitMax (Watts) | 500.000 | 500.000 |
+| C0 Residency (%) | 67 | 87 |
+| DDR Bandwidth | | |
+| DDR Max BW (GB/s) | 528 | 576 |
+| DDR Utilized BW (GB/s) | 207 | 56 |
+| DDR Utilized Percent(%) | 39 | 10 |
+| Current Active Freq limit | | |
+| Freq limit (MHz) | 3700 | 2818 |
+| Freq limit source | Refer below[*0] | Refer below[*1] |
+| Socket frequency range | | |
+| Fmax (MHz) | 3700 | 3700 |
+| Fmin (MHz) | 600 | 600 |
+------------------------------------------------------------------------
+```
+
+---
+
+👤 **Nexesenex** commented on **2025-07-16** at **21:36:24**
@ikawrakow : Thanks!
-constants.py could be updated as well, I guess.
+constants.py could be updated as well, I guess.
+
+And of course, thanks for this amazing development!
+Having a viable sub-2bpw quant is quite a hit, for the attn_q of mono-GPU aimed quants of 70b/72b models, or even the ffns of the 253b nemotrons models (the highest models I could potentially run in full Cuda-MMQ offload on my 64GB VRAM with a decent quality thanks to the help of iq1_kt and iq2_kt).
---
-👤 **ubergarm** commented the **2025-07-17** at **00:39:25**:
+👤 **ubergarm** commented on **2025-07-17** at **00:39:25**
Cooked a slightly larger version just for comparison. Same recipe as above except larger iq2_kt for ffn_down_exps so more like my "normal" recipes
@@ -146,7 +277,1483 @@ Final estimate: PPL = 4.1310 +/- 0.02266
---
-👤 **magikRUKKOLA** commented the **2025-07-19** at **01:30:36**:
+👤 **ubergarm** commented on **2025-07-17** at **01:11:45**
+
+Okay last data point I made a "pure" Qwen3-14B-IQ1_KT same as this set: https://github.com/ikawrakow/ik_llama.cpp/pull/602#issuecomment-3065995863
+
+CUDA backend full offload.
+
+Final estimate: PPL = 13.4941 +/- 0.10484
+
+```
+llm_load_print_meta: model ftype = IQ1_KT - 1.75 bpw
+llm_load_print_meta: model params = 14.768 B
+llm_load_print_meta: model size = 3.703 GiB (2.154 BPW)
+llm_load_print_meta: repeating layers = 2.701 GiB (1.756 BPW, 13.212 B parameters
+
+llama_model_loader: - type f32: 161 tensors
+llama_model_loader: - type q4_K: 1 tensors
+llama_model_loader: - type q6_K: 1 tensors
+llama_model_loader: - type iq1_kt: 280 tensors
+```
+
+I tried a few short chats and it is actually coherent and was able to complete some requests. It did get stuck in a loop trying think of a good joke, but wow- amazing a sub 2bpw "pure" quantized dense model works at all!
+
+---
+
+👤 **ikawrakow** commented on **2025-07-17** at **18:06:12**
+
+> Final estimate: PPL = 4.3623 +/- 0.02432
+
+IIRC, the `Q8_0` PPL is `2.95`? So this would be ~48% higher for the pure version, 40% for the 229 GiB `IQ1_KT/IQ2_KT` mix. I haven't seen a lot of Kimi-2 PPL values, so don't have a feel if this is good or not so good. I see Unsloth's `IQ1_M` is 304 GB (283 GiB), so there is a lot of room to use more bits for some of the tensors.
+
+One thing you could try is to simply take Unsloth's `IQ1_M` recipe, replace `IQ1_M` with `IQ1_KT`, and see how PPL compares. One could go a step further and replace
+* `Q4_K` with `IQ4_K`
+* `Q5_K` with `IQ5_K`
+* `IQ4_XS` with `IQ4_KS`
+* `IQ2_XXS` with `IQ2_KT`
+* `IQ3_XXS` with `IQ3_KT`
+
+The last two are +0.0625, so if one wanted to arrive at the exact same size, one needs to reduce the number of tensors using `IQ3_XXS` so that the +0.0625 bpw is recovered. Our perhaps replace some of the `IQ3_XXS` with `IQ2_KL`.
+
+Does anyone know what is `TQ1_0`? When `TQ1_0` was added to `llama.cpp` it was BitNet specific, so totally not useful for a not-BitNet model. I also don't see any `TQ1_0` tensors. Why did they decide to call it `TQ1_0`?
+
+---
+
+👤 **ubergarm** commented on **2025-07-17** at **20:06:09**
+
+> I haven't seen a lot of Kimi-2 PPL values, so don't have a feel if this is good or not so good.
+
+Yeah I'm not 100% sure what command @magikRUKKOLA is using for imatrix, so hopefully his numbers are comparable to mine. He's been updating this graph here it seems: https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13776504
+
+this is my test command for anyone curious
+
+
+
+ubergarm latest perplexity methodology
+
+Modify for CPU only or offload more CUDA layers etc. I've been using CPU only for my Kimi-K2-Instruct quants. The seed is not important. My older DeepSeek-V3 numbers were with `-ctk q8_0` but now I use full fp16 for reporting the PPLs.
+
+```bash
+$ wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
+$ gunzip wiki.test.raw.gz
+$ du -h wiki.test.raw
+1.3M wiki.test.raw
+$ sha1sum wiki.test.raw
+6f1fe2054a940eebfc76b284b09680763b37f5ea wiki.test.raw
+
+./build/bin/llama-perplexity \
+ --model "$model" \
+ -f wiki.test.raw \
+ -ctk fp16 \
+ -fa -fmoe \
+ -mla 3 -amb 512 \
+ --ctx-size 512 \
+ --ubatch-size 512 \
+ -ngl 99 \
+ -ot exps=CPU \
+ --threads 24
+```
+
+Wait to the end for the Final PPL= value.
+
+
+
+> One thing you could try
+
+I'll try to make up a mix similar to your description test it out!
+
+> Does anyone know what is TQ1_0? When TQ1_0 was added to llama.cpp it was BitNet specific, so totally not useful for a not-BitNet model. I also don't see any TQ1_0 tensors. Why did they decide to call it TQ1_0?
+
+tl;dr; Apparently ollama and huggingface don't properly show quants with "unusual" file names. (which is why many of my quants don't show up in the side-bar tensor viewer on hf). Unsloth wanted to release quants in a similar BPW that just happened to be similar to Compilade's ternary only bitnet quantization type. So despite the unsloth TQ1_0 consisting of mostly IQ1_S and IQ3_S and no TQ1_0 tensors at all, they started using that name and seem to continue to be doing so.
+
+I've called them out multiple times on reddit and github on the improper use of TQ1_0 beginning with R1-0528. Here is the most recent discussion where they continued to do it for Kimi-K2-Instruct:
+
+* https://github.com/ggml-org/llama.cpp/pull/14654#issuecomment-3070114997
+* https://github.com/ggml-org/llama.cpp/pull/14654#issuecomment-3070439724
+* https://github.com/ggml-org/llama.cpp/pull/14654#issuecomment-3071240401
+
+And earlier on R1-0528:
+* https://www.reddit.com/r/LocalLLaMA/comments/1l19yud/comment/mvjyw04/
+
+Anyway, I'll go see what else I can cook up and try to compare some PPLs!
+
+---
+
+👤 **ubergarm** commented on **2025-07-18** at **06:18:49**
+
+I've been grinding through perplexity on some quants, have at least one more to add (UD-IQ2_XXS) and would like to add @anikifoss 's larger model(s) as they roll out if he would like to calculate them, no pressure though!
+
+*EDIT* Updated image and data to fixup badname TQ1_0 bpw and add some more data points. The v0.2 recipes are full q8_0 `attn/shexp/blk.0.ffn.*` versions. Some comments in the json data as well.
+
+
+
+👈 raw data in json format
+
+```json
+[
+ {
+ "name": "q8_0",
+ "ppl": "2.9507 +/- 0.01468",
+ "size": 1016.623,
+ "bpw": 8.504,
+ "legend": "pure"
+ },
+ {
+ "name": "IQ4_KS",
+ "ppl": "3.0438 +/- 0.01536",
+ "size": 550.428,
+ "bpw": 4.604,
+ "legend": "ubergarm",
+ "comment": "v0.1 recipe"
+ },
+ {
+ "name": "v0.2-IQ4_KS",
+ "ppl": "2.9584 +/- 0.01473",
+ "size": 554.421,
+ "bpw": 4.638,
+ "legend": "ubergarm",
+ "comment": "v0.2 recipe - full q8_0 attn/shexp/blk.0.ffn"
+ },
+ {
+ "name": "IQ3_KS",
+ "ppl": "3.1395 +/- 0.01604",
+ "size": 427.205,
+ "bpw": 3.573,
+ "legend": "ubergarm",
+ "comment": "v0.1 recipe"
+ },
+ {
+ "name": "v0.2-IQ3_KS",
+ "ppl": "3.0226 +/- 0.01518",
+ "size": 430.908,
+ "bpw": 3.604,
+ "legend": "ubergarm",
+ "comment": "v0.2 recipe"
+ },
+ {
+ "name": "PR624-IQ3_KS",
+ "ppl": "3.1936 +/- 0.01638",
+ "size": 427.205,
+ "bpw": 3.573,
+ "legend": "ubergarm"
+ },
+ {
+ "name": "IQ2_KL",
+ "ppl": "3.2741 +/- 0.01689",
+ "size": 345.687,
+ "bpw": 2.892,
+ "legend": "ubergarm",
+ "comment": "v0.1 recipe"
+ },
+ {
+ "name": "PR624-IQ2_KL",
+ "ppl": "3.3055 +/- 0.01709",
+ "size": 345.687,
+ "bpw": 2.892,
+ "legend": "ubergarm"
+ },
+ {
+ "name": "chonk-IQ2_KL",
+ "ppl": "3.2095 +/- 0.01641",
+ "size": 365.507,
+ "bpw": 3.057,
+ "legend": "ubergarm",
+ "comment": "blk.(1|2|3|4|5|6|59|60).ffn_down_exps.weight=iq4_ks and blk.(1|2|3|4|5|6|59|60).ffn_(gate|up)_exps.weight=iq4_kss"
+ },
+ {
+ "name": "PR624-chonk-IQ2_KL",
+ "ppl": "3.2389 +/- 0.01661",
+ "size": 365.507,
+ "bpw": 3.057,
+ "legend": "ubergarm",
+ "comment": "blk.(1|2|3|4|5|6|59|60).ffn_down_exps.weight=iq4_ks and blk.(1|2|3|4|5|6|59|60).ffn_(gate|up)_exps.weight=iq4_kss"
+ },
+ {
+ "name": "v0.2-IQ2_KL",
+ "ppl": "3.1813 +/- 0.01619",
+ "size": 349.389,
+ "bpw": 2.923,
+ "legend": "ubergarm",
+ "comment": "v0.2 recipe - full q8_0 attn/shexp/blk.0.ffn"
+ },
+ {
+ "name": "IQ2_KS",
+ "ppl": "3.7922 +/- 0.02045",
+ "size": 286.624,
+ "bpw": 2.398,
+ "legend": "ubergarm",
+ "comment": "v0.1 recipe"
+ },
+ {
+ "name": "PR624-IQ2_KS",
+ "ppl": "3.7846 +/- 0.02040",
+ "size": 286.624,
+ "bpw": 2.398,
+ "legend": "ubergarm"
+ },
+ {
+ "name": "PR624-chonk-IQ2_KS",
+ "ppl": "3.7313 +/- 0.01999",
+ "size": 313.923,
+ "bpw": 2.626,
+ "legend": "ubergarm",
+ "comment": "blk.(1|2|3|4|5|6|59|60).ffn_down_exps.weight=iq4_ks and blk.(1|2|3|4|5|6|59|60).ffn_(gate|up)_exps.weight=iq4_kss"
+ },
+ {
+ "name": "PR624-v0.2-IQ2_KS",
+ "ppl": "3.6827 +/- 0.01957",
+ "size": 290.327,
+ "bpw": 2.429,
+ "legend": "ubergarm",
+ "comment": "v0.2 recipe - full q8_0 attn/shexp/blk.0.ffn"
+ },
+ {
+ "name": "v0.2-IQ1_KT",
+ "ppl": "3.9734 +/- 0.02152",
+ "size": 234.141,
+ "bpw": 1.959,
+ "legend": "ubergarm",
+ "comment": "v0.2 recipe - full q8_0 attn/shexp/blk.0.ffn"
+ },
+ {
+ "name": "IQ1_KT",
+ "ppl": "4.1310 +/- 0.02266",
+ "size": 228.948,
+ "bpw": 1.915,
+ "legend": "ubergarm"
+ },
+ {
+ "name": "smol-IQ1_KT",
+ "ppl": "4.3623 +/- 0.02432",
+ "size": 214.182,
+ "bpw": 1.792,
+ "legend": "ubergarm"
+ },
+ {
+ "name": "v0.2-smol-IQ1_KT",
+ "ppl": "4.2187 +/- 0.02325",
+ "size": 219.375,
+ "bpw": 1.835,
+ "legend": "ubergarm",
+ "comment": "v0.2 recipe - full q8_0 attn/shexp/blk.0.ffn"
+ },
+ {
+ "name": "DQ4_K",
+ "ppl": "2.9691 +/- 0.01480",
+ "size": 624.828,
+ "bpw": 5.229,
+ "legend": "anikifoss",
+ "url": "https://huggingface.co/anikifoss/Kimi-K2-Instruct-DQ4_K"
+ },
+ {
+ "name": "UD-IQ1_S",
+ "ppl": "4.3331 +/- 0.02390",
+ "size": 261.979,
+ "bpw": 2.192,
+ "legend": "unsloth",
+ "comment": "ran this without -fmoe fwiw before PR630"
+ },
+ {
+ "name": "badname-UD-TQ1_0",
+ "ppl": "5.0150 +/- 0.02885",
+ "size": 227.854,
+ "bpw": 1.907,
+ "legend": "unsloth",
+ "comment": "this is not a TQ1_0 but an incorrect name"
+ },
+ {
+ "name": "UD-IQ2_XXS",
+ "ppl": "3.5258 +/- 0.01842",
+ "size": 305.660,
+ "bpw": 2.558,
+ "legend": "unsloth"
+ },
+ {
+ "name": "UD-IQ3_XXS",
+ "ppl": "3.1535 +/- 0.01601",
+ "size": 388.003,
+ "bpw": 3.247,
+ "legend": "unsloth"
+ },
+ {
+ "name": "UD-Q4_K_XL",
+ "ppl": "3.0612 +/- 0.01550",
+ "size": 547.437,
+ "bpw": 4.581,
+ "legend": "unsloth"
+ }
+]
+```
+
+
+
+
+
+
+
+---
+
+👤 **magikRUKKOLA** commented on **2025-07-18** at **17:15:11**
+
+@ubergarm
+
+> "name": "UD-IQ2_XXS",
+> "ppl": "3.5258 +/- 0.01842",
+> "size": 305.660,
+> "bpw": 2.558,
+> "legend": "unsloth",
+> "comment": "compare to magikRUKKOLA's measurement PPL = 3.1382"
+> },
+
+~~But i never claimed that PPL for UD-IQ2_XXS is 3.1382. That was related to the UD-IQ3_XXS.~~
+Ah. That was probably a copy/paste error.
+Ok, I will retest the UD-IQ3_XXS.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-18** at **17:20:53**
+
+If `UD-IQ2_XXS` at 2.5 bpw had a PPL of 3.1382, we could all collectively forget about new quantization types and quant cooking, and just use that.
+
+---
+
+👤 **anikifoss** commented on **2025-07-18** at **17:21:12**
+
+@ubergarm I'm distracted benchmarking MI50s for the next couple of days. I'll get perplexity calculations sometime next week or so.
+
+---
+
+👤 **ubergarm** commented on **2025-07-18** at **17:39:58**
+
+@magikRUKKOLA
+
+Yeah, thanks, i was up a bit too late last night. Cleaning things up as much as I can while going along.
+
+But yes the UD-IQ2_XXS and UD-IQ3_XXS seem pretty decent so far in my testing. I went ahead and ran the numbers on both myself and it came out just slightly above your reported value (for UD_IQ3_XXS).
+
+I'm testing adding a little bit more weight to the early ffn_*_exps layers which is helping, but distracted by also testing the PR for tweaks that are affecting IQ3_KS. So gonna try to sort that out first before going too wild on new recipes hah...
+
+If you want to see the full UD-IQ3_XXS here is the gguf dump showing them alternating tensors sizes for the same tensors up and down across throughout the layers. In looking at a few of their recipes the pattern seems different for different size models, so not sure exactly what they are using to decide on this, but I haven't read their blogs in a while.
+
+
+
+UD-IQ3_XXS gguf-dump
+
+```bash
+INFO:gguf-dump:* Loading: /mnt/data/models/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ3_XXS/Kimi-K2-Instruct-UD-IQ3_XXS-00001-of-00009.gguf
+* File is LITTLE endian, script is running on a LITTLE endian host.
+* Dumping 64 key/value pair(s)
+ 1: UINT32 | 1 | GGUF.version = 3
+ 2: UINT64 | 1 | GGUF.tensor_count = 134
+ 3: UINT64 | 1 | GGUF.kv_count = 61
+ 4: STRING | 1 | general.architecture = 'deepseek2'
+ 5: STRING | 1 | general.type = 'model'
+ 6: STRING | 1 | general.name = 'Kimi-K2-Instruct'
+ 7: STRING | 1 | general.finetune = 'Instruct'
+ 8: STRING | 1 | general.basename = 'Kimi-K2-Instruct'
+ 9: STRING | 1 | general.quantized_by = 'Unsloth'
+ 10: STRING | 1 | general.size_label = '384x14B'
+ 11: STRING | 1 | general.license = 'other'
+ 12: STRING | 1 | general.license.name = 'modified-mit'
+ 13: STRING | 1 | general.repo_url = 'https://huggingface.co/unsloth'
+ 14: UINT32 | 1 | general.base_model.count = 1
+ 15: STRING | 1 | general.base_model.0.name = 'Kimi K2 Instruct'
+ 16: STRING | 1 | general.base_model.0.organization = 'Moonshotai'
+ 17: STRING | 1 | general.base_model.0.repo_url = 'https://huggingface.co/moonshotai/Kimi-K2-Instruct'
+ 18: [STRING] | 1 | general.tags
+ 19: UINT32 | 1 | deepseek2.block_count = 61
+ 20: UINT32 | 1 | deepseek2.context_length = 131072
+ 21: UINT32 | 1 | deepseek2.embedding_length = 7168
+ 22: UINT32 | 1 | deepseek2.feed_forward_length = 18432
+ 23: UINT32 | 1 | deepseek2.attention.head_count = 64
+ 24: UINT32 | 1 | deepseek2.attention.head_count_kv = 1
+ 25: FLOAT32 | 1 | deepseek2.rope.freq_base = 50000.0
+ 26: FLOAT32 | 1 | deepseek2.attention.layer_norm_rms_epsilon = 9.999999974752427e-07
+ 27: UINT32 | 1 | deepseek2.expert_used_count = 8
+ 28: UINT32 | 1 | deepseek2.leading_dense_block_count = 1
+ 29: UINT32 | 1 | deepseek2.vocab_size = 163840
+ 30: UINT32 | 1 | deepseek2.attention.q_lora_rank = 1536
+ 31: UINT32 | 1 | deepseek2.attention.kv_lora_rank = 512
+ 32: UINT32 | 1 | deepseek2.attention.key_length = 576
+ 33: UINT32 | 1 | deepseek2.attention.value_length = 512
+ 34: UINT32 | 1 | deepseek2.attention.key_length_mla = 192
+ 35: UINT32 | 1 | deepseek2.attention.value_length_mla = 128
+ 36: UINT32 | 1 | deepseek2.expert_feed_forward_length = 2048
+ 37: UINT32 | 1 | deepseek2.expert_count = 384
+ 38: UINT32 | 1 | deepseek2.expert_shared_count = 1
+ 39: FLOAT32 | 1 | deepseek2.expert_weights_scale = 2.8269999027252197
+ 40: BOOL | 1 | deepseek2.expert_weights_norm = True
+ 41: UINT32 | 1 | deepseek2.expert_gating_func = 2
+ 42: UINT32 | 1 | deepseek2.rope.dimension_count = 64
+ 43: STRING | 1 | deepseek2.rope.scaling.type = 'yarn'
+ 44: FLOAT32 | 1 | deepseek2.rope.scaling.factor = 32.0
+ 45: UINT32 | 1 | deepseek2.rope.scaling.original_context_length = 4096
+ 46: FLOAT32 | 1 | deepseek2.rope.scaling.yarn_log_multiplier = 0.10000000149011612
+ 47: STRING | 1 | tokenizer.ggml.model = 'gpt2'
+ 48: STRING | 1 | tokenizer.ggml.pre = 'kimi-k2'
+ 49: [STRING] | 163840 | tokenizer.ggml.tokens
+ 50: [INT32] | 163840 | tokenizer.ggml.token_type
+ 51: [STRING] | 163328 | tokenizer.ggml.merges
+ 52: UINT32 | 1 | tokenizer.ggml.bos_token_id = 163584
+ 53: UINT32 | 1 | tokenizer.ggml.eos_token_id = 163585
+ 54: UINT32 | 1 | tokenizer.ggml.padding_token_id = 163839
+ 55: STRING | 1 | tokenizer.chat_template = '{%- if tools -%}\n <|im_system|>tool_declare<|im_middle|>{{ '
+ 56: UINT32 | 1 | general.quantization_version = 2
+ 57: UINT32 | 1 | general.file_type = 23
+ 58: STRING | 1 | quantize.imatrix.file = 'Kimi-K2-Instruct-GGUF/imatrix_unsloth.dat'
+ 59: STRING | 1 | quantize.imatrix.dataset = 'unsloth_calibration_Kimi-K2-Instruct.txt'
+ 60: UINT32 | 1 | quantize.imatrix.entries_count = 667
+ 61: UINT32 | 1 | quantize.imatrix.chunks_count = 714
+ 62: UINT16 | 1 | split.no = 0
+ 63: INT32 | 1 | split.tensors.count = 1096
+ 64: UINT16 | 1 | split.count = 9
+* Dumping 134 tensor(s)
+ 1: 1174405120 | 7168, 163840, 1, 1 | Q6_K | output.weight
+ 2: 7168 | 7168, 1, 1, 1 | F32 | output_norm.weight
+ 3: 1174405120 | 7168, 163840, 1, 1 | Q4_K | token_embd.weight
+ 4: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.0.attn_k_b.weight
+ 5: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.0.attn_kv_a_mqa.weight
+ 6: 512 | 512, 1, 1, 1 | F32 | blk.0.attn_kv_a_norm.weight
+ 7: 7168 | 7168, 1, 1, 1 | F32 | blk.0.attn_norm.weight
+ 8: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.0.attn_output.weight
+ 9: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.0.attn_q_a.weight
+ 10: 1536 | 1536, 1, 1, 1 | F32 | blk.0.attn_q_a_norm.weight
+ 11: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.0.attn_q_b.weight
+ 12: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.0.attn_v_b.weight
+ 13: 132120576 | 18432, 7168, 1, 1 | IQ4_XS | blk.0.ffn_down.weight
+ 14: 132120576 | 7168, 18432, 1, 1 | IQ4_XS | blk.0.ffn_gate.weight
+ 15: 7168 | 7168, 1, 1, 1 | F32 | blk.0.ffn_norm.weight
+ 16: 132120576 | 7168, 18432, 1, 1 | IQ4_XS | blk.0.ffn_up.weight
+ 17: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.1.attn_k_b.weight
+ 18: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.1.attn_kv_a_mqa.weight
+ 19: 512 | 512, 1, 1, 1 | F32 | blk.1.attn_kv_a_norm.weight
+ 20: 7168 | 7168, 1, 1, 1 | F32 | blk.1.attn_norm.weight
+ 21: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.1.attn_output.weight
+ 22: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.1.attn_q_a.weight
+ 23: 1536 | 1536, 1, 1, 1 | F32 | blk.1.attn_q_a_norm.weight
+ 24: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.1.attn_q_b.weight
+ 25: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.1.attn_v_b.weight
+ 26: 384 | 384, 1, 1, 1 | F32 | blk.1.exp_probs_b.bias
+ 27: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.1.ffn_down_exps.weight
+ 28: 14680064 | 2048, 7168, 1, 1 | Q6_K | blk.1.ffn_down_shexp.weight
+ 29: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.1.ffn_gate_exps.weight
+ 30: 2752512 | 7168, 384, 1, 1 | F32 | blk.1.ffn_gate_inp.weight
+ 31: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.1.ffn_gate_shexp.weight
+ 32: 7168 | 7168, 1, 1, 1 | F32 | blk.1.ffn_norm.weight
+ 33: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.1.ffn_up_exps.weight
+ 34: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.1.ffn_up_shexp.weight
+ 35: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.2.attn_k_b.weight
+ 36: 4128768 | 7168, 576, 1, 1 | Q5_K | blk.2.attn_kv_a_mqa.weight
+ 37: 512 | 512, 1, 1, 1 | F32 | blk.2.attn_kv_a_norm.weight
+ 38: 7168 | 7168, 1, 1, 1 | F32 | blk.2.attn_norm.weight
+ 39: 58720256 | 8192, 7168, 1, 1 | Q5_K | blk.2.attn_output.weight
+ 40: 11010048 | 7168, 1536, 1, 1 | Q5_K | blk.2.attn_q_a.weight
+ 41: 1536 | 1536, 1, 1, 1 | F32 | blk.2.attn_q_a_norm.weight
+ 42: 18874368 | 1536, 12288, 1, 1 | Q5_K | blk.2.attn_q_b.weight
+ 43: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.2.attn_v_b.weight
+ 44: 384 | 384, 1, 1, 1 | F32 | blk.2.exp_probs_b.bias
+ 45: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.2.ffn_down_exps.weight
+ 46: 14680064 | 2048, 7168, 1, 1 | Q6_K | blk.2.ffn_down_shexp.weight
+ 47: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.2.ffn_gate_exps.weight
+ 48: 2752512 | 7168, 384, 1, 1 | F32 | blk.2.ffn_gate_inp.weight
+ 49: 14680064 | 7168, 2048, 1, 1 | Q4_K | blk.2.ffn_gate_shexp.weight
+ 50: 7168 | 7168, 1, 1, 1 | F32 | blk.2.ffn_norm.weight
+ 51: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.2.ffn_up_exps.weight
+ 52: 14680064 | 7168, 2048, 1, 1 | Q4_K | blk.2.ffn_up_shexp.weight
+ 53: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.3.attn_k_b.weight
+ 54: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.3.attn_kv_a_mqa.weight
+ 55: 512 | 512, 1, 1, 1 | F32 | blk.3.attn_kv_a_norm.weight
+ 56: 7168 | 7168, 1, 1, 1 | F32 | blk.3.attn_norm.weight
+ 57: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.3.attn_output.weight
+ 58: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.3.attn_q_a.weight
+ 59: 1536 | 1536, 1, 1, 1 | F32 | blk.3.attn_q_a_norm.weight
+ 60: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.3.attn_q_b.weight
+ 61: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.3.attn_v_b.weight
+ 62: 384 | 384, 1, 1, 1 | F32 | blk.3.exp_probs_b.bias
+ 63: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.3.ffn_down_exps.weight
+ 64: 14680064 | 2048, 7168, 1, 1 | Q5_K | blk.3.ffn_down_shexp.weight
+ 65: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.3.ffn_gate_exps.weight
+ 66: 2752512 | 7168, 384, 1, 1 | F32 | blk.3.ffn_gate_inp.weight
+ 67: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.3.ffn_gate_shexp.weight
+ 68: 7168 | 7168, 1, 1, 1 | F32 | blk.3.ffn_norm.weight
+ 69: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.3.ffn_up_exps.weight
+ 70: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.3.ffn_up_shexp.weight
+ 71: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.4.attn_k_b.weight
+ 72: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.4.attn_kv_a_mqa.weight
+ 73: 512 | 512, 1, 1, 1 | F32 | blk.4.attn_kv_a_norm.weight
+ 74: 7168 | 7168, 1, 1, 1 | F32 | blk.4.attn_norm.weight
+ 75: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.4.attn_output.weight
+ 76: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.4.attn_q_a.weight
+ 77: 1536 | 1536, 1, 1, 1 | F32 | blk.4.attn_q_a_norm.weight
+ 78: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.4.attn_q_b.weight
+ 79: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.4.attn_v_b.weight
+ 80: 384 | 384, 1, 1, 1 | F32 | blk.4.exp_probs_b.bias
+ 81: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.4.ffn_down_exps.weight
+ 82: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.4.ffn_down_shexp.weight
+ 83: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.4.ffn_gate_exps.weight
+ 84: 2752512 | 7168, 384, 1, 1 | F32 | blk.4.ffn_gate_inp.weight
+ 85: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.4.ffn_gate_shexp.weight
+ 86: 7168 | 7168, 1, 1, 1 | F32 | blk.4.ffn_norm.weight
+ 87: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.4.ffn_up_exps.weight
+ 88: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.4.ffn_up_shexp.weight
+ 89: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.5.attn_k_b.weight
+ 90: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.5.attn_kv_a_mqa.weight
+ 91: 512 | 512, 1, 1, 1 | F32 | blk.5.attn_kv_a_norm.weight
+ 92: 7168 | 7168, 1, 1, 1 | F32 | blk.5.attn_norm.weight
+ 93: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.5.attn_output.weight
+ 94: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.5.attn_q_a.weight
+ 95: 1536 | 1536, 1, 1, 1 | F32 | blk.5.attn_q_a_norm.weight
+ 96: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.5.attn_q_b.weight
+ 97: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.5.attn_v_b.weight
+ 98: 384 | 384, 1, 1, 1 | F32 | blk.5.exp_probs_b.bias
+ 99: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.5.ffn_down_exps.weight
+ 100: 14680064 | 2048, 7168, 1, 1 | Q5_K | blk.5.ffn_down_shexp.weight
+ 101: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.5.ffn_gate_exps.weight
+ 102: 2752512 | 7168, 384, 1, 1 | F32 | blk.5.ffn_gate_inp.weight
+ 103: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.5.ffn_gate_shexp.weight
+ 104: 7168 | 7168, 1, 1, 1 | F32 | blk.5.ffn_norm.weight
+ 105: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.5.ffn_up_exps.weight
+ 106: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.5.ffn_up_shexp.weight
+ 107: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.6.attn_k_b.weight
+ 108: 4128768 | 7168, 576, 1, 1 | Q5_K | blk.6.attn_kv_a_mqa.weight
+ 109: 512 | 512, 1, 1, 1 | F32 | blk.6.attn_kv_a_norm.weight
+ 110: 7168 | 7168, 1, 1, 1 | F32 | blk.6.attn_norm.weight
+ 111: 58720256 | 8192, 7168, 1, 1 | Q5_K | blk.6.attn_output.weight
+ 112: 11010048 | 7168, 1536, 1, 1 | Q5_K | blk.6.attn_q_a.weight
+ 113: 1536 | 1536, 1, 1, 1 | F32 | blk.6.attn_q_a_norm.weight
+ 114: 18874368 | 1536, 12288, 1, 1 | Q5_K | blk.6.attn_q_b.weight
+ 115: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.6.attn_v_b.weight
+ 116: 384 | 384, 1, 1, 1 | F32 | blk.6.exp_probs_b.bias
+ 117: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.6.ffn_down_exps.weight
+ 118: 14680064 | 2048, 7168, 1, 1 | Q6_K | blk.6.ffn_down_shexp.weight
+ 119: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.6.ffn_gate_exps.weight
+ 120: 2752512 | 7168, 384, 1, 1 | F32 | blk.6.ffn_gate_inp.weight
+ 121: 14680064 | 7168, 2048, 1, 1 | Q4_K | blk.6.ffn_gate_shexp.weight
+ 122: 7168 | 7168, 1, 1, 1 | F32 | blk.6.ffn_norm.weight
+ 123: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.6.ffn_up_exps.weight
+ 124: 14680064 | 7168, 2048, 1, 1 | Q4_K | blk.6.ffn_up_shexp.weight
+ 125: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.7.attn_k_b.weight
+ 126: 4128768 | 7168, 576, 1, 1 | Q5_K | blk.7.attn_kv_a_mqa.weight
+ 127: 512 | 512, 1, 1, 1 | F32 | blk.7.attn_kv_a_norm.weight
+ 128: 7168 | 7168, 1, 1, 1 | F32 | blk.7.attn_norm.weight
+ 129: 58720256 | 8192, 7168, 1, 1 | Q5_K | blk.7.attn_output.weight
+ 130: 11010048 | 7168, 1536, 1, 1 | Q5_K | blk.7.attn_q_a.weight
+ 131: 1536 | 1536, 1, 1, 1 | F32 | blk.7.attn_q_a_norm.weight
+ 132: 18874368 | 1536, 12288, 1, 1 | Q5_K | blk.7.attn_q_b.weight
+ 133: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.7.attn_v_b.weight
+ 134: 384 | 384, 1, 1, 1 | F32 | blk.7.exp_probs_b.bias
+INFO:gguf-dump:* Loading: /mnt/data/models/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ3_XXS/Kimi-K2-Instruct-UD-IQ3_XXS-00002-of-00009.gguf
+* File is LITTLE endian, script is running on a LITTLE endian host.
+* Dumping 6 key/value pair(s)
+ 1: UINT32 | 1 | GGUF.version = 3
+ 2: UINT64 | 1 | GGUF.tensor_count = 128
+ 3: UINT64 | 1 | GGUF.kv_count = 3
+ 4: UINT16 | 1 | split.no = 1
+ 5: INT32 | 1 | split.tensors.count = 1096
+ 6: UINT16 | 1 | split.count = 9
+* Dumping 128 tensor(s)
+ 1: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.7.ffn_down_exps.weight
+ 2: 14680064 | 2048, 7168, 1, 1 | Q5_K | blk.7.ffn_down_shexp.weight
+ 3: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.7.ffn_gate_exps.weight
+ 4: 2752512 | 7168, 384, 1, 1 | F32 | blk.7.ffn_gate_inp.weight
+ 5: 14680064 | 7168, 2048, 1, 1 | Q4_K | blk.7.ffn_gate_shexp.weight
+ 6: 7168 | 7168, 1, 1, 1 | F32 | blk.7.ffn_norm.weight
+ 7: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.7.ffn_up_exps.weight
+ 8: 14680064 | 7168, 2048, 1, 1 | Q4_K | blk.7.ffn_up_shexp.weight
+ 9: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.8.attn_k_b.weight
+ 10: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.8.attn_kv_a_mqa.weight
+ 11: 512 | 512, 1, 1, 1 | F32 | blk.8.attn_kv_a_norm.weight
+ 12: 7168 | 7168, 1, 1, 1 | F32 | blk.8.attn_norm.weight
+ 13: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.8.attn_output.weight
+ 14: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.8.attn_q_a.weight
+ 15: 1536 | 1536, 1, 1, 1 | F32 | blk.8.attn_q_a_norm.weight
+ 16: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.8.attn_q_b.weight
+ 17: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.8.attn_v_b.weight
+ 18: 384 | 384, 1, 1, 1 | F32 | blk.8.exp_probs_b.bias
+ 19: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.8.ffn_down_exps.weight
+ 20: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.8.ffn_down_shexp.weight
+ 21: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.8.ffn_gate_exps.weight
+ 22: 2752512 | 7168, 384, 1, 1 | F32 | blk.8.ffn_gate_inp.weight
+ 23: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.8.ffn_gate_shexp.weight
+ 24: 7168 | 7168, 1, 1, 1 | F32 | blk.8.ffn_norm.weight
+ 25: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.8.ffn_up_exps.weight
+ 26: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.8.ffn_up_shexp.weight
+ 27: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.9.attn_k_b.weight
+ 28: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.9.attn_kv_a_mqa.weight
+ 29: 512 | 512, 1, 1, 1 | F32 | blk.9.attn_kv_a_norm.weight
+ 30: 7168 | 7168, 1, 1, 1 | F32 | blk.9.attn_norm.weight
+ 31: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.9.attn_output.weight
+ 32: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.9.attn_q_a.weight
+ 33: 1536 | 1536, 1, 1, 1 | F32 | blk.9.attn_q_a_norm.weight
+ 34: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.9.attn_q_b.weight
+ 35: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.9.attn_v_b.weight
+ 36: 384 | 384, 1, 1, 1 | F32 | blk.9.exp_probs_b.bias
+ 37: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.9.ffn_down_exps.weight
+ 38: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.9.ffn_down_shexp.weight
+ 39: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.9.ffn_gate_exps.weight
+ 40: 2752512 | 7168, 384, 1, 1 | F32 | blk.9.ffn_gate_inp.weight
+ 41: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.9.ffn_gate_shexp.weight
+ 42: 7168 | 7168, 1, 1, 1 | F32 | blk.9.ffn_norm.weight
+ 43: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.9.ffn_up_exps.weight
+ 44: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.9.ffn_up_shexp.weight
+ 45: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.10.attn_k_b.weight
+ 46: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.10.attn_kv_a_mqa.weight
+ 47: 512 | 512, 1, 1, 1 | F32 | blk.10.attn_kv_a_norm.weight
+ 48: 7168 | 7168, 1, 1, 1 | F32 | blk.10.attn_norm.weight
+ 49: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.10.attn_output.weight
+ 50: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.10.attn_q_a.weight
+ 51: 1536 | 1536, 1, 1, 1 | F32 | blk.10.attn_q_a_norm.weight
+ 52: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.10.attn_q_b.weight
+ 53: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.10.attn_v_b.weight
+ 54: 384 | 384, 1, 1, 1 | F32 | blk.10.exp_probs_b.bias
+ 55: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.10.ffn_down_exps.weight
+ 56: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.10.ffn_down_shexp.weight
+ 57: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.10.ffn_gate_exps.weight
+ 58: 2752512 | 7168, 384, 1, 1 | F32 | blk.10.ffn_gate_inp.weight
+ 59: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.10.ffn_gate_shexp.weight
+ 60: 7168 | 7168, 1, 1, 1 | F32 | blk.10.ffn_norm.weight
+ 61: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.10.ffn_up_exps.weight
+ 62: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.10.ffn_up_shexp.weight
+ 63: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.11.attn_k_b.weight
+ 64: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.11.attn_kv_a_mqa.weight
+ 65: 512 | 512, 1, 1, 1 | F32 | blk.11.attn_kv_a_norm.weight
+ 66: 7168 | 7168, 1, 1, 1 | F32 | blk.11.attn_norm.weight
+ 67: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.11.attn_output.weight
+ 68: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.11.attn_q_a.weight
+ 69: 1536 | 1536, 1, 1, 1 | F32 | blk.11.attn_q_a_norm.weight
+ 70: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.11.attn_q_b.weight
+ 71: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.11.attn_v_b.weight
+ 72: 384 | 384, 1, 1, 1 | F32 | blk.11.exp_probs_b.bias
+ 73: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.11.ffn_down_exps.weight
+ 74: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.11.ffn_down_shexp.weight
+ 75: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.11.ffn_gate_exps.weight
+ 76: 2752512 | 7168, 384, 1, 1 | F32 | blk.11.ffn_gate_inp.weight
+ 77: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.11.ffn_gate_shexp.weight
+ 78: 7168 | 7168, 1, 1, 1 | F32 | blk.11.ffn_norm.weight
+ 79: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.11.ffn_up_exps.weight
+ 80: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.11.ffn_up_shexp.weight
+ 81: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.12.attn_k_b.weight
+ 82: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.12.attn_kv_a_mqa.weight
+ 83: 512 | 512, 1, 1, 1 | F32 | blk.12.attn_kv_a_norm.weight
+ 84: 7168 | 7168, 1, 1, 1 | F32 | blk.12.attn_norm.weight
+ 85: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.12.attn_output.weight
+ 86: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.12.attn_q_a.weight
+ 87: 1536 | 1536, 1, 1, 1 | F32 | blk.12.attn_q_a_norm.weight
+ 88: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.12.attn_q_b.weight
+ 89: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.12.attn_v_b.weight
+ 90: 384 | 384, 1, 1, 1 | F32 | blk.12.exp_probs_b.bias
+ 91: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.12.ffn_down_exps.weight
+ 92: 14680064 | 2048, 7168, 1, 1 | Q5_K | blk.12.ffn_down_shexp.weight
+ 93: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.12.ffn_gate_exps.weight
+ 94: 2752512 | 7168, 384, 1, 1 | F32 | blk.12.ffn_gate_inp.weight
+ 95: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.12.ffn_gate_shexp.weight
+ 96: 7168 | 7168, 1, 1, 1 | F32 | blk.12.ffn_norm.weight
+ 97: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.12.ffn_up_exps.weight
+ 98: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.12.ffn_up_shexp.weight
+ 99: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.13.attn_k_b.weight
+ 100: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.13.attn_kv_a_mqa.weight
+ 101: 512 | 512, 1, 1, 1 | F32 | blk.13.attn_kv_a_norm.weight
+ 102: 7168 | 7168, 1, 1, 1 | F32 | blk.13.attn_norm.weight
+ 103: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.13.attn_output.weight
+ 104: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.13.attn_q_a.weight
+ 105: 1536 | 1536, 1, 1, 1 | F32 | blk.13.attn_q_a_norm.weight
+ 106: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.13.attn_q_b.weight
+ 107: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.13.attn_v_b.weight
+ 108: 384 | 384, 1, 1, 1 | F32 | blk.13.exp_probs_b.bias
+ 109: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.13.ffn_down_exps.weight
+ 110: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.13.ffn_down_shexp.weight
+ 111: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.13.ffn_gate_exps.weight
+ 112: 2752512 | 7168, 384, 1, 1 | F32 | blk.13.ffn_gate_inp.weight
+ 113: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.13.ffn_gate_shexp.weight
+ 114: 7168 | 7168, 1, 1, 1 | F32 | blk.13.ffn_norm.weight
+ 115: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.13.ffn_up_exps.weight
+ 116: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.13.ffn_up_shexp.weight
+ 117: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.14.attn_k_b.weight
+ 118: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.14.attn_kv_a_mqa.weight
+ 119: 512 | 512, 1, 1, 1 | F32 | blk.14.attn_kv_a_norm.weight
+ 120: 7168 | 7168, 1, 1, 1 | F32 | blk.14.attn_norm.weight
+ 121: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.14.attn_output.weight
+ 122: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.14.attn_q_a.weight
+ 123: 1536 | 1536, 1, 1, 1 | F32 | blk.14.attn_q_a_norm.weight
+ 124: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.14.attn_q_b.weight
+ 125: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.14.attn_v_b.weight
+ 126: 384 | 384, 1, 1, 1 | F32 | blk.14.exp_probs_b.bias
+ 127: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.14.ffn_down_exps.weight
+ 128: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.14.ffn_down_shexp.weight
+INFO:gguf-dump:* Loading: /mnt/data/models/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ3_XXS/Kimi-K2-Instruct-UD-IQ3_XXS-00003-of-00009.gguf
+* File is LITTLE endian, script is running on a LITTLE endian host.
+* Dumping 6 key/value pair(s)
+ 1: UINT32 | 1 | GGUF.version = 3
+ 2: UINT64 | 1 | GGUF.tensor_count = 126
+ 3: UINT64 | 1 | GGUF.kv_count = 3
+ 4: UINT16 | 1 | split.no = 2
+ 5: INT32 | 1 | split.tensors.count = 1096
+ 6: UINT16 | 1 | split.count = 9
+* Dumping 126 tensor(s)
+ 1: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.14.ffn_gate_exps.weight
+ 2: 2752512 | 7168, 384, 1, 1 | F32 | blk.14.ffn_gate_inp.weight
+ 3: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.14.ffn_gate_shexp.weight
+ 4: 7168 | 7168, 1, 1, 1 | F32 | blk.14.ffn_norm.weight
+ 5: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.14.ffn_up_exps.weight
+ 6: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.14.ffn_up_shexp.weight
+ 7: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.15.attn_k_b.weight
+ 8: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.15.attn_kv_a_mqa.weight
+ 9: 512 | 512, 1, 1, 1 | F32 | blk.15.attn_kv_a_norm.weight
+ 10: 7168 | 7168, 1, 1, 1 | F32 | blk.15.attn_norm.weight
+ 11: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.15.attn_output.weight
+ 12: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.15.attn_q_a.weight
+ 13: 1536 | 1536, 1, 1, 1 | F32 | blk.15.attn_q_a_norm.weight
+ 14: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.15.attn_q_b.weight
+ 15: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.15.attn_v_b.weight
+ 16: 384 | 384, 1, 1, 1 | F32 | blk.15.exp_probs_b.bias
+ 17: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.15.ffn_down_exps.weight
+ 18: 14680064 | 2048, 7168, 1, 1 | Q6_K | blk.15.ffn_down_shexp.weight
+ 19: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.15.ffn_gate_exps.weight
+ 20: 2752512 | 7168, 384, 1, 1 | F32 | blk.15.ffn_gate_inp.weight
+ 21: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.15.ffn_gate_shexp.weight
+ 22: 7168 | 7168, 1, 1, 1 | F32 | blk.15.ffn_norm.weight
+ 23: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.15.ffn_up_exps.weight
+ 24: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.15.ffn_up_shexp.weight
+ 25: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.16.attn_k_b.weight
+ 26: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.16.attn_kv_a_mqa.weight
+ 27: 512 | 512, 1, 1, 1 | F32 | blk.16.attn_kv_a_norm.weight
+ 28: 7168 | 7168, 1, 1, 1 | F32 | blk.16.attn_norm.weight
+ 29: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.16.attn_output.weight
+ 30: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.16.attn_q_a.weight
+ 31: 1536 | 1536, 1, 1, 1 | F32 | blk.16.attn_q_a_norm.weight
+ 32: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.16.attn_q_b.weight
+ 33: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.16.attn_v_b.weight
+ 34: 384 | 384, 1, 1, 1 | F32 | blk.16.exp_probs_b.bias
+ 35: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.16.ffn_down_exps.weight
+ 36: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.16.ffn_down_shexp.weight
+ 37: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.16.ffn_gate_exps.weight
+ 38: 2752512 | 7168, 384, 1, 1 | F32 | blk.16.ffn_gate_inp.weight
+ 39: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.16.ffn_gate_shexp.weight
+ 40: 7168 | 7168, 1, 1, 1 | F32 | blk.16.ffn_norm.weight
+ 41: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.16.ffn_up_exps.weight
+ 42: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.16.ffn_up_shexp.weight
+ 43: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.17.attn_k_b.weight
+ 44: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.17.attn_kv_a_mqa.weight
+ 45: 512 | 512, 1, 1, 1 | F32 | blk.17.attn_kv_a_norm.weight
+ 46: 7168 | 7168, 1, 1, 1 | F32 | blk.17.attn_norm.weight
+ 47: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.17.attn_output.weight
+ 48: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.17.attn_q_a.weight
+ 49: 1536 | 1536, 1, 1, 1 | F32 | blk.17.attn_q_a_norm.weight
+ 50: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.17.attn_q_b.weight
+ 51: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.17.attn_v_b.weight
+ 52: 384 | 384, 1, 1, 1 | F32 | blk.17.exp_probs_b.bias
+ 53: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.17.ffn_down_exps.weight
+ 54: 14680064 | 2048, 7168, 1, 1 | Q5_K | blk.17.ffn_down_shexp.weight
+ 55: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.17.ffn_gate_exps.weight
+ 56: 2752512 | 7168, 384, 1, 1 | F32 | blk.17.ffn_gate_inp.weight
+ 57: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.17.ffn_gate_shexp.weight
+ 58: 7168 | 7168, 1, 1, 1 | F32 | blk.17.ffn_norm.weight
+ 59: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.17.ffn_up_exps.weight
+ 60: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.17.ffn_up_shexp.weight
+ 61: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.18.attn_k_b.weight
+ 62: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.18.attn_kv_a_mqa.weight
+ 63: 512 | 512, 1, 1, 1 | F32 | blk.18.attn_kv_a_norm.weight
+ 64: 7168 | 7168, 1, 1, 1 | F32 | blk.18.attn_norm.weight
+ 65: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.18.attn_output.weight
+ 66: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.18.attn_q_a.weight
+ 67: 1536 | 1536, 1, 1, 1 | F32 | blk.18.attn_q_a_norm.weight
+ 68: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.18.attn_q_b.weight
+ 69: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.18.attn_v_b.weight
+ 70: 384 | 384, 1, 1, 1 | F32 | blk.18.exp_probs_b.bias
+ 71: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.18.ffn_down_exps.weight
+ 72: 14680064 | 2048, 7168, 1, 1 | Q5_K | blk.18.ffn_down_shexp.weight
+ 73: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.18.ffn_gate_exps.weight
+ 74: 2752512 | 7168, 384, 1, 1 | F32 | blk.18.ffn_gate_inp.weight
+ 75: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.18.ffn_gate_shexp.weight
+ 76: 7168 | 7168, 1, 1, 1 | F32 | blk.18.ffn_norm.weight
+ 77: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.18.ffn_up_exps.weight
+ 78: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.18.ffn_up_shexp.weight
+ 79: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.19.attn_k_b.weight
+ 80: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.19.attn_kv_a_mqa.weight
+ 81: 512 | 512, 1, 1, 1 | F32 | blk.19.attn_kv_a_norm.weight
+ 82: 7168 | 7168, 1, 1, 1 | F32 | blk.19.attn_norm.weight
+ 83: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.19.attn_output.weight
+ 84: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.19.attn_q_a.weight
+ 85: 1536 | 1536, 1, 1, 1 | F32 | blk.19.attn_q_a_norm.weight
+ 86: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.19.attn_q_b.weight
+ 87: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.19.attn_v_b.weight
+ 88: 384 | 384, 1, 1, 1 | F32 | blk.19.exp_probs_b.bias
+ 89: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.19.ffn_down_exps.weight
+ 90: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.19.ffn_down_shexp.weight
+ 91: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.19.ffn_gate_exps.weight
+ 92: 2752512 | 7168, 384, 1, 1 | F32 | blk.19.ffn_gate_inp.weight
+ 93: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.19.ffn_gate_shexp.weight
+ 94: 7168 | 7168, 1, 1, 1 | F32 | blk.19.ffn_norm.weight
+ 95: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.19.ffn_up_exps.weight
+ 96: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.19.ffn_up_shexp.weight
+ 97: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.20.attn_k_b.weight
+ 98: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.20.attn_kv_a_mqa.weight
+ 99: 512 | 512, 1, 1, 1 | F32 | blk.20.attn_kv_a_norm.weight
+ 100: 7168 | 7168, 1, 1, 1 | F32 | blk.20.attn_norm.weight
+ 101: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.20.attn_output.weight
+ 102: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.20.attn_q_a.weight
+ 103: 1536 | 1536, 1, 1, 1 | F32 | blk.20.attn_q_a_norm.weight
+ 104: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.20.attn_q_b.weight
+ 105: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.20.attn_v_b.weight
+ 106: 384 | 384, 1, 1, 1 | F32 | blk.20.exp_probs_b.bias
+ 107: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.20.ffn_down_exps.weight
+ 108: 14680064 | 2048, 7168, 1, 1 | Q5_K | blk.20.ffn_down_shexp.weight
+ 109: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.20.ffn_gate_exps.weight
+ 110: 2752512 | 7168, 384, 1, 1 | F32 | blk.20.ffn_gate_inp.weight
+ 111: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.20.ffn_gate_shexp.weight
+ 112: 7168 | 7168, 1, 1, 1 | F32 | blk.20.ffn_norm.weight
+ 113: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.20.ffn_up_exps.weight
+ 114: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.20.ffn_up_shexp.weight
+ 115: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.21.attn_k_b.weight
+ 116: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.21.attn_kv_a_mqa.weight
+ 117: 512 | 512, 1, 1, 1 | F32 | blk.21.attn_kv_a_norm.weight
+ 118: 7168 | 7168, 1, 1, 1 | F32 | blk.21.attn_norm.weight
+ 119: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.21.attn_output.weight
+ 120: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.21.attn_q_a.weight
+ 121: 1536 | 1536, 1, 1, 1 | F32 | blk.21.attn_q_a_norm.weight
+ 122: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.21.attn_q_b.weight
+ 123: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.21.attn_v_b.weight
+ 124: 384 | 384, 1, 1, 1 | F32 | blk.21.exp_probs_b.bias
+ 125: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.21.ffn_down_exps.weight
+ 126: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.21.ffn_down_shexp.weight
+INFO:gguf-dump:* Loading: /mnt/data/models/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ3_XXS/Kimi-K2-Instruct-UD-IQ3_XXS-00004-of-00009.gguf
+* File is LITTLE endian, script is running on a LITTLE endian host.
+* Dumping 6 key/value pair(s)
+ 1: UINT32 | 1 | GGUF.version = 3
+ 2: UINT64 | 1 | GGUF.tensor_count = 130
+ 3: UINT64 | 1 | GGUF.kv_count = 3
+ 4: UINT16 | 1 | split.no = 3
+ 5: INT32 | 1 | split.tensors.count = 1096
+ 6: UINT16 | 1 | split.count = 9
+* Dumping 130 tensor(s)
+ 1: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.21.ffn_gate_exps.weight
+ 2: 2752512 | 7168, 384, 1, 1 | F32 | blk.21.ffn_gate_inp.weight
+ 3: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.21.ffn_gate_shexp.weight
+ 4: 7168 | 7168, 1, 1, 1 | F32 | blk.21.ffn_norm.weight
+ 5: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.21.ffn_up_exps.weight
+ 6: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.21.ffn_up_shexp.weight
+ 7: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.22.attn_k_b.weight
+ 8: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.22.attn_kv_a_mqa.weight
+ 9: 512 | 512, 1, 1, 1 | F32 | blk.22.attn_kv_a_norm.weight
+ 10: 7168 | 7168, 1, 1, 1 | F32 | blk.22.attn_norm.weight
+ 11: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.22.attn_output.weight
+ 12: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.22.attn_q_a.weight
+ 13: 1536 | 1536, 1, 1, 1 | F32 | blk.22.attn_q_a_norm.weight
+ 14: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.22.attn_q_b.weight
+ 15: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.22.attn_v_b.weight
+ 16: 384 | 384, 1, 1, 1 | F32 | blk.22.exp_probs_b.bias
+ 17: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.22.ffn_down_exps.weight
+ 18: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.22.ffn_down_shexp.weight
+ 19: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.22.ffn_gate_exps.weight
+ 20: 2752512 | 7168, 384, 1, 1 | F32 | blk.22.ffn_gate_inp.weight
+ 21: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.22.ffn_gate_shexp.weight
+ 22: 7168 | 7168, 1, 1, 1 | F32 | blk.22.ffn_norm.weight
+ 23: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.22.ffn_up_exps.weight
+ 24: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.22.ffn_up_shexp.weight
+ 25: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.23.attn_k_b.weight
+ 26: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.23.attn_kv_a_mqa.weight
+ 27: 512 | 512, 1, 1, 1 | F32 | blk.23.attn_kv_a_norm.weight
+ 28: 7168 | 7168, 1, 1, 1 | F32 | blk.23.attn_norm.weight
+ 29: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.23.attn_output.weight
+ 30: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.23.attn_q_a.weight
+ 31: 1536 | 1536, 1, 1, 1 | F32 | blk.23.attn_q_a_norm.weight
+ 32: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.23.attn_q_b.weight
+ 33: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.23.attn_v_b.weight
+ 34: 384 | 384, 1, 1, 1 | F32 | blk.23.exp_probs_b.bias
+ 35: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.23.ffn_down_exps.weight
+ 36: 14680064 | 2048, 7168, 1, 1 | Q5_K | blk.23.ffn_down_shexp.weight
+ 37: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.23.ffn_gate_exps.weight
+ 38: 2752512 | 7168, 384, 1, 1 | F32 | blk.23.ffn_gate_inp.weight
+ 39: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.23.ffn_gate_shexp.weight
+ 40: 7168 | 7168, 1, 1, 1 | F32 | blk.23.ffn_norm.weight
+ 41: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.23.ffn_up_exps.weight
+ 42: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.23.ffn_up_shexp.weight
+ 43: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.24.attn_k_b.weight
+ 44: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.24.attn_kv_a_mqa.weight
+ 45: 512 | 512, 1, 1, 1 | F32 | blk.24.attn_kv_a_norm.weight
+ 46: 7168 | 7168, 1, 1, 1 | F32 | blk.24.attn_norm.weight
+ 47: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.24.attn_output.weight
+ 48: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.24.attn_q_a.weight
+ 49: 1536 | 1536, 1, 1, 1 | F32 | blk.24.attn_q_a_norm.weight
+ 50: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.24.attn_q_b.weight
+ 51: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.24.attn_v_b.weight
+ 52: 384 | 384, 1, 1, 1 | F32 | blk.24.exp_probs_b.bias
+ 53: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.24.ffn_down_exps.weight
+ 54: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.24.ffn_down_shexp.weight
+ 55: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.24.ffn_gate_exps.weight
+ 56: 2752512 | 7168, 384, 1, 1 | F32 | blk.24.ffn_gate_inp.weight
+ 57: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.24.ffn_gate_shexp.weight
+ 58: 7168 | 7168, 1, 1, 1 | F32 | blk.24.ffn_norm.weight
+ 59: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.24.ffn_up_exps.weight
+ 60: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.24.ffn_up_shexp.weight
+ 61: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.25.attn_k_b.weight
+ 62: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.25.attn_kv_a_mqa.weight
+ 63: 512 | 512, 1, 1, 1 | F32 | blk.25.attn_kv_a_norm.weight
+ 64: 7168 | 7168, 1, 1, 1 | F32 | blk.25.attn_norm.weight
+ 65: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.25.attn_output.weight
+ 66: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.25.attn_q_a.weight
+ 67: 1536 | 1536, 1, 1, 1 | F32 | blk.25.attn_q_a_norm.weight
+ 68: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.25.attn_q_b.weight
+ 69: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.25.attn_v_b.weight
+ 70: 384 | 384, 1, 1, 1 | F32 | blk.25.exp_probs_b.bias
+ 71: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.25.ffn_down_exps.weight
+ 72: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.25.ffn_down_shexp.weight
+ 73: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.25.ffn_gate_exps.weight
+ 74: 2752512 | 7168, 384, 1, 1 | F32 | blk.25.ffn_gate_inp.weight
+ 75: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.25.ffn_gate_shexp.weight
+ 76: 7168 | 7168, 1, 1, 1 | F32 | blk.25.ffn_norm.weight
+ 77: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.25.ffn_up_exps.weight
+ 78: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.25.ffn_up_shexp.weight
+ 79: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.26.attn_k_b.weight
+ 80: 4128768 | 7168, 576, 1, 1 | Q6_K | blk.26.attn_kv_a_mqa.weight
+ 81: 512 | 512, 1, 1, 1 | F32 | blk.26.attn_kv_a_norm.weight
+ 82: 7168 | 7168, 1, 1, 1 | F32 | blk.26.attn_norm.weight
+ 83: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.26.attn_output.weight
+ 84: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.26.attn_q_a.weight
+ 85: 1536 | 1536, 1, 1, 1 | F32 | blk.26.attn_q_a_norm.weight
+ 86: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.26.attn_q_b.weight
+ 87: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.26.attn_v_b.weight
+ 88: 384 | 384, 1, 1, 1 | F32 | blk.26.exp_probs_b.bias
+ 89: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.26.ffn_down_exps.weight
+ 90: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.26.ffn_down_shexp.weight
+ 91: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.26.ffn_gate_exps.weight
+ 92: 2752512 | 7168, 384, 1, 1 | F32 | blk.26.ffn_gate_inp.weight
+ 93: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.26.ffn_gate_shexp.weight
+ 94: 7168 | 7168, 1, 1, 1 | F32 | blk.26.ffn_norm.weight
+ 95: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.26.ffn_up_exps.weight
+ 96: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.26.ffn_up_shexp.weight
+ 97: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.27.attn_k_b.weight
+ 98: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.27.attn_kv_a_mqa.weight
+ 99: 512 | 512, 1, 1, 1 | F32 | blk.27.attn_kv_a_norm.weight
+ 100: 7168 | 7168, 1, 1, 1 | F32 | blk.27.attn_norm.weight
+ 101: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.27.attn_output.weight
+ 102: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.27.attn_q_a.weight
+ 103: 1536 | 1536, 1, 1, 1 | F32 | blk.27.attn_q_a_norm.weight
+ 104: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.27.attn_q_b.weight
+ 105: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.27.attn_v_b.weight
+ 106: 384 | 384, 1, 1, 1 | F32 | blk.27.exp_probs_b.bias
+ 107: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.27.ffn_down_exps.weight
+ 108: 14680064 | 2048, 7168, 1, 1 | Q5_K | blk.27.ffn_down_shexp.weight
+ 109: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.27.ffn_gate_exps.weight
+ 110: 2752512 | 7168, 384, 1, 1 | F32 | blk.27.ffn_gate_inp.weight
+ 111: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.27.ffn_gate_shexp.weight
+ 112: 7168 | 7168, 1, 1, 1 | F32 | blk.27.ffn_norm.weight
+ 113: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.27.ffn_up_exps.weight
+ 114: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.27.ffn_up_shexp.weight
+ 115: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.28.attn_k_b.weight
+ 116: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.28.attn_kv_a_mqa.weight
+ 117: 512 | 512, 1, 1, 1 | F32 | blk.28.attn_kv_a_norm.weight
+ 118: 7168 | 7168, 1, 1, 1 | F32 | blk.28.attn_norm.weight
+ 119: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.28.attn_output.weight
+ 120: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.28.attn_q_a.weight
+ 121: 1536 | 1536, 1, 1, 1 | F32 | blk.28.attn_q_a_norm.weight
+ 122: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.28.attn_q_b.weight
+ 123: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.28.attn_v_b.weight
+ 124: 384 | 384, 1, 1, 1 | F32 | blk.28.exp_probs_b.bias
+ 125: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.28.ffn_down_exps.weight
+ 126: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.28.ffn_down_shexp.weight
+ 127: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.28.ffn_gate_exps.weight
+ 128: 2752512 | 7168, 384, 1, 1 | F32 | blk.28.ffn_gate_inp.weight
+ 129: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.28.ffn_gate_shexp.weight
+ 130: 7168 | 7168, 1, 1, 1 | F32 | blk.28.ffn_norm.weight
+INFO:gguf-dump:* Loading: /mnt/data/models/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ3_XXS/Kimi-K2-Instruct-UD-IQ3_XXS-00005-of-00009.gguf
+* File is LITTLE endian, script is running on a LITTLE endian host.
+* Dumping 6 key/value pair(s)
+ 1: UINT32 | 1 | GGUF.version = 3
+ 2: UINT64 | 1 | GGUF.tensor_count = 138
+ 3: UINT64 | 1 | GGUF.kv_count = 3
+ 4: UINT16 | 1 | split.no = 4
+ 5: INT32 | 1 | split.tensors.count = 1096
+ 6: UINT16 | 1 | split.count = 9
+* Dumping 138 tensor(s)
+ 1: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.28.ffn_up_exps.weight
+ 2: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.28.ffn_up_shexp.weight
+ 3: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.29.attn_k_b.weight
+ 4: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.29.attn_kv_a_mqa.weight
+ 5: 512 | 512, 1, 1, 1 | F32 | blk.29.attn_kv_a_norm.weight
+ 6: 7168 | 7168, 1, 1, 1 | F32 | blk.29.attn_norm.weight
+ 7: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.29.attn_output.weight
+ 8: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.29.attn_q_a.weight
+ 9: 1536 | 1536, 1, 1, 1 | F32 | blk.29.attn_q_a_norm.weight
+ 10: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.29.attn_q_b.weight
+ 11: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.29.attn_v_b.weight
+ 12: 384 | 384, 1, 1, 1 | F32 | blk.29.exp_probs_b.bias
+ 13: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.29.ffn_down_exps.weight
+ 14: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.29.ffn_down_shexp.weight
+ 15: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.29.ffn_gate_exps.weight
+ 16: 2752512 | 7168, 384, 1, 1 | F32 | blk.29.ffn_gate_inp.weight
+ 17: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.29.ffn_gate_shexp.weight
+ 18: 7168 | 7168, 1, 1, 1 | F32 | blk.29.ffn_norm.weight
+ 19: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.29.ffn_up_exps.weight
+ 20: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.29.ffn_up_shexp.weight
+ 21: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.30.attn_k_b.weight
+ 22: 4128768 | 7168, 576, 1, 1 | Q6_K | blk.30.attn_kv_a_mqa.weight
+ 23: 512 | 512, 1, 1, 1 | F32 | blk.30.attn_kv_a_norm.weight
+ 24: 7168 | 7168, 1, 1, 1 | F32 | blk.30.attn_norm.weight
+ 25: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.30.attn_output.weight
+ 26: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.30.attn_q_a.weight
+ 27: 1536 | 1536, 1, 1, 1 | F32 | blk.30.attn_q_a_norm.weight
+ 28: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.30.attn_q_b.weight
+ 29: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.30.attn_v_b.weight
+ 30: 384 | 384, 1, 1, 1 | F32 | blk.30.exp_probs_b.bias
+ 31: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.30.ffn_down_exps.weight
+ 32: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.30.ffn_down_shexp.weight
+ 33: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.30.ffn_gate_exps.weight
+ 34: 2752512 | 7168, 384, 1, 1 | F32 | blk.30.ffn_gate_inp.weight
+ 35: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.30.ffn_gate_shexp.weight
+ 36: 7168 | 7168, 1, 1, 1 | F32 | blk.30.ffn_norm.weight
+ 37: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.30.ffn_up_exps.weight
+ 38: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.30.ffn_up_shexp.weight
+ 39: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.31.attn_k_b.weight
+ 40: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.31.attn_kv_a_mqa.weight
+ 41: 512 | 512, 1, 1, 1 | F32 | blk.31.attn_kv_a_norm.weight
+ 42: 7168 | 7168, 1, 1, 1 | F32 | blk.31.attn_norm.weight
+ 43: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.31.attn_output.weight
+ 44: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.31.attn_q_a.weight
+ 45: 1536 | 1536, 1, 1, 1 | F32 | blk.31.attn_q_a_norm.weight
+ 46: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.31.attn_q_b.weight
+ 47: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.31.attn_v_b.weight
+ 48: 384 | 384, 1, 1, 1 | F32 | blk.31.exp_probs_b.bias
+ 49: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.31.ffn_down_exps.weight
+ 50: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.31.ffn_down_shexp.weight
+ 51: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.31.ffn_gate_exps.weight
+ 52: 2752512 | 7168, 384, 1, 1 | F32 | blk.31.ffn_gate_inp.weight
+ 53: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.31.ffn_gate_shexp.weight
+ 54: 7168 | 7168, 1, 1, 1 | F32 | blk.31.ffn_norm.weight
+ 55: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.31.ffn_up_exps.weight
+ 56: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.31.ffn_up_shexp.weight
+ 57: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.32.attn_k_b.weight
+ 58: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.32.attn_kv_a_mqa.weight
+ 59: 512 | 512, 1, 1, 1 | F32 | blk.32.attn_kv_a_norm.weight
+ 60: 7168 | 7168, 1, 1, 1 | F32 | blk.32.attn_norm.weight
+ 61: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.32.attn_output.weight
+ 62: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.32.attn_q_a.weight
+ 63: 1536 | 1536, 1, 1, 1 | F32 | blk.32.attn_q_a_norm.weight
+ 64: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.32.attn_q_b.weight
+ 65: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.32.attn_v_b.weight
+ 66: 384 | 384, 1, 1, 1 | F32 | blk.32.exp_probs_b.bias
+ 67: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.32.ffn_down_exps.weight
+ 68: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.32.ffn_down_shexp.weight
+ 69: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.32.ffn_gate_exps.weight
+ 70: 2752512 | 7168, 384, 1, 1 | F32 | blk.32.ffn_gate_inp.weight
+ 71: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.32.ffn_gate_shexp.weight
+ 72: 7168 | 7168, 1, 1, 1 | F32 | blk.32.ffn_norm.weight
+ 73: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.32.ffn_up_exps.weight
+ 74: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.32.ffn_up_shexp.weight
+ 75: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.33.attn_k_b.weight
+ 76: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.33.attn_kv_a_mqa.weight
+ 77: 512 | 512, 1, 1, 1 | F32 | blk.33.attn_kv_a_norm.weight
+ 78: 7168 | 7168, 1, 1, 1 | F32 | blk.33.attn_norm.weight
+ 79: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.33.attn_output.weight
+ 80: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.33.attn_q_a.weight
+ 81: 1536 | 1536, 1, 1, 1 | F32 | blk.33.attn_q_a_norm.weight
+ 82: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.33.attn_q_b.weight
+ 83: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.33.attn_v_b.weight
+ 84: 384 | 384, 1, 1, 1 | F32 | blk.33.exp_probs_b.bias
+ 85: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.33.ffn_down_exps.weight
+ 86: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.33.ffn_down_shexp.weight
+ 87: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.33.ffn_gate_exps.weight
+ 88: 2752512 | 7168, 384, 1, 1 | F32 | blk.33.ffn_gate_inp.weight
+ 89: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.33.ffn_gate_shexp.weight
+ 90: 7168 | 7168, 1, 1, 1 | F32 | blk.33.ffn_norm.weight
+ 91: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.33.ffn_up_exps.weight
+ 92: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.33.ffn_up_shexp.weight
+ 93: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.34.attn_k_b.weight
+ 94: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.34.attn_kv_a_mqa.weight
+ 95: 512 | 512, 1, 1, 1 | F32 | blk.34.attn_kv_a_norm.weight
+ 96: 7168 | 7168, 1, 1, 1 | F32 | blk.34.attn_norm.weight
+ 97: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.34.attn_output.weight
+ 98: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.34.attn_q_a.weight
+ 99: 1536 | 1536, 1, 1, 1 | F32 | blk.34.attn_q_a_norm.weight
+ 100: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.34.attn_q_b.weight
+ 101: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.34.attn_v_b.weight
+ 102: 384 | 384, 1, 1, 1 | F32 | blk.34.exp_probs_b.bias
+ 103: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.34.ffn_down_exps.weight
+ 104: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.34.ffn_down_shexp.weight
+ 105: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.34.ffn_gate_exps.weight
+ 106: 2752512 | 7168, 384, 1, 1 | F32 | blk.34.ffn_gate_inp.weight
+ 107: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.34.ffn_gate_shexp.weight
+ 108: 7168 | 7168, 1, 1, 1 | F32 | blk.34.ffn_norm.weight
+ 109: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.34.ffn_up_exps.weight
+ 110: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.34.ffn_up_shexp.weight
+ 111: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.35.attn_k_b.weight
+ 112: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.35.attn_kv_a_mqa.weight
+ 113: 512 | 512, 1, 1, 1 | F32 | blk.35.attn_kv_a_norm.weight
+ 114: 7168 | 7168, 1, 1, 1 | F32 | blk.35.attn_norm.weight
+ 115: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.35.attn_output.weight
+ 116: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.35.attn_q_a.weight
+ 117: 1536 | 1536, 1, 1, 1 | F32 | blk.35.attn_q_a_norm.weight
+ 118: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.35.attn_q_b.weight
+ 119: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.35.attn_v_b.weight
+ 120: 384 | 384, 1, 1, 1 | F32 | blk.35.exp_probs_b.bias
+ 121: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.35.ffn_down_exps.weight
+ 122: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.35.ffn_down_shexp.weight
+ 123: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.35.ffn_gate_exps.weight
+ 124: 2752512 | 7168, 384, 1, 1 | F32 | blk.35.ffn_gate_inp.weight
+ 125: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.35.ffn_gate_shexp.weight
+ 126: 7168 | 7168, 1, 1, 1 | F32 | blk.35.ffn_norm.weight
+ 127: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.35.ffn_up_exps.weight
+ 128: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.35.ffn_up_shexp.weight
+ 129: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.36.attn_k_b.weight
+ 130: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.36.attn_kv_a_mqa.weight
+ 131: 512 | 512, 1, 1, 1 | F32 | blk.36.attn_kv_a_norm.weight
+ 132: 7168 | 7168, 1, 1, 1 | F32 | blk.36.attn_norm.weight
+ 133: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.36.attn_output.weight
+ 134: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.36.attn_q_a.weight
+ 135: 1536 | 1536, 1, 1, 1 | F32 | blk.36.attn_q_a_norm.weight
+ 136: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.36.attn_q_b.weight
+ 137: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.36.attn_v_b.weight
+ 138: 384 | 384, 1, 1, 1 | F32 | blk.36.exp_probs_b.bias
+INFO:gguf-dump:* Loading: /mnt/data/models/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ3_XXS/Kimi-K2-Instruct-UD-IQ3_XXS-00007-of-00009.gguf
+* File is LITTLE endian, script is running on a LITTLE endian host.
+* Dumping 6 key/value pair(s)
+ 1: UINT32 | 1 | GGUF.version = 3
+ 2: UINT64 | 1 | GGUF.tensor_count = 130
+ 3: UINT64 | 1 | GGUF.kv_count = 3
+ 4: UINT16 | 1 | split.no = 6
+ 5: INT32 | 1 | split.tensors.count = 1096
+ 6: UINT16 | 1 | split.count = 9
+* Dumping 130 tensor(s)
+ 1: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.43.ffn_gate_exps.weight
+ 2: 2752512 | 7168, 384, 1, 1 | F32 | blk.43.ffn_gate_inp.weight
+ 3: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.43.ffn_gate_shexp.weight
+ 4: 7168 | 7168, 1, 1, 1 | F32 | blk.43.ffn_norm.weight
+ 5: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.43.ffn_up_exps.weight
+ 6: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.43.ffn_up_shexp.weight
+ 7: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.44.attn_k_b.weight
+ 8: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.44.attn_kv_a_mqa.weight
+ 9: 512 | 512, 1, 1, 1 | F32 | blk.44.attn_kv_a_norm.weight
+ 10: 7168 | 7168, 1, 1, 1 | F32 | blk.44.attn_norm.weight
+ 11: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.44.attn_output.weight
+ 12: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.44.attn_q_a.weight
+ 13: 1536 | 1536, 1, 1, 1 | F32 | blk.44.attn_q_a_norm.weight
+ 14: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.44.attn_q_b.weight
+ 15: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.44.attn_v_b.weight
+ 16: 384 | 384, 1, 1, 1 | F32 | blk.44.exp_probs_b.bias
+ 17: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.44.ffn_down_exps.weight
+ 18: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.44.ffn_down_shexp.weight
+ 19: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.44.ffn_gate_exps.weight
+ 20: 2752512 | 7168, 384, 1, 1 | F32 | blk.44.ffn_gate_inp.weight
+ 21: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.44.ffn_gate_shexp.weight
+ 22: 7168 | 7168, 1, 1, 1 | F32 | blk.44.ffn_norm.weight
+ 23: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.44.ffn_up_exps.weight
+ 24: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.44.ffn_up_shexp.weight
+ 25: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.45.attn_k_b.weight
+ 26: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.45.attn_kv_a_mqa.weight
+ 27: 512 | 512, 1, 1, 1 | F32 | blk.45.attn_kv_a_norm.weight
+ 28: 7168 | 7168, 1, 1, 1 | F32 | blk.45.attn_norm.weight
+ 29: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.45.attn_output.weight
+ 30: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.45.attn_q_a.weight
+ 31: 1536 | 1536, 1, 1, 1 | F32 | blk.45.attn_q_a_norm.weight
+ 32: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.45.attn_q_b.weight
+ 33: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.45.attn_v_b.weight
+ 34: 384 | 384, 1, 1, 1 | F32 | blk.45.exp_probs_b.bias
+ 35: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.45.ffn_down_exps.weight
+ 36: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.45.ffn_down_shexp.weight
+ 37: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.45.ffn_gate_exps.weight
+ 38: 2752512 | 7168, 384, 1, 1 | F32 | blk.45.ffn_gate_inp.weight
+ 39: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.45.ffn_gate_shexp.weight
+ 40: 7168 | 7168, 1, 1, 1 | F32 | blk.45.ffn_norm.weight
+ 41: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.45.ffn_up_exps.weight
+ 42: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.45.ffn_up_shexp.weight
+ 43: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.46.attn_k_b.weight
+ 44: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.46.attn_kv_a_mqa.weight
+ 45: 512 | 512, 1, 1, 1 | F32 | blk.46.attn_kv_a_norm.weight
+ 46: 7168 | 7168, 1, 1, 1 | F32 | blk.46.attn_norm.weight
+ 47: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.46.attn_output.weight
+ 48: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.46.attn_q_a.weight
+ 49: 1536 | 1536, 1, 1, 1 | F32 | blk.46.attn_q_a_norm.weight
+ 50: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.46.attn_q_b.weight
+ 51: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.46.attn_v_b.weight
+ 52: 384 | 384, 1, 1, 1 | F32 | blk.46.exp_probs_b.bias
+ 53: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.46.ffn_down_exps.weight
+ 54: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.46.ffn_down_shexp.weight
+ 55: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.46.ffn_gate_exps.weight
+ 56: 2752512 | 7168, 384, 1, 1 | F32 | blk.46.ffn_gate_inp.weight
+ 57: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.46.ffn_gate_shexp.weight
+ 58: 7168 | 7168, 1, 1, 1 | F32 | blk.46.ffn_norm.weight
+ 59: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.46.ffn_up_exps.weight
+ 60: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.46.ffn_up_shexp.weight
+ 61: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.47.attn_k_b.weight
+ 62: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.47.attn_kv_a_mqa.weight
+ 63: 512 | 512, 1, 1, 1 | F32 | blk.47.attn_kv_a_norm.weight
+ 64: 7168 | 7168, 1, 1, 1 | F32 | blk.47.attn_norm.weight
+ 65: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.47.attn_output.weight
+ 66: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.47.attn_q_a.weight
+ 67: 1536 | 1536, 1, 1, 1 | F32 | blk.47.attn_q_a_norm.weight
+ 68: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.47.attn_q_b.weight
+ 69: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.47.attn_v_b.weight
+ 70: 384 | 384, 1, 1, 1 | F32 | blk.47.exp_probs_b.bias
+ 71: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.47.ffn_down_exps.weight
+ 72: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.47.ffn_down_shexp.weight
+ 73: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.47.ffn_gate_exps.weight
+ 74: 2752512 | 7168, 384, 1, 1 | F32 | blk.47.ffn_gate_inp.weight
+ 75: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.47.ffn_gate_shexp.weight
+ 76: 7168 | 7168, 1, 1, 1 | F32 | blk.47.ffn_norm.weight
+ 77: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.47.ffn_up_exps.weight
+ 78: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.47.ffn_up_shexp.weight
+ 79: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.48.attn_k_b.weight
+ 80: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.48.attn_kv_a_mqa.weight
+ 81: 512 | 512, 1, 1, 1 | F32 | blk.48.attn_kv_a_norm.weight
+ 82: 7168 | 7168, 1, 1, 1 | F32 | blk.48.attn_norm.weight
+ 83: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.48.attn_output.weight
+ 84: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.48.attn_q_a.weight
+ 85: 1536 | 1536, 1, 1, 1 | F32 | blk.48.attn_q_a_norm.weight
+ 86: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.48.attn_q_b.weight
+ 87: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.48.attn_v_b.weight
+ 88: 384 | 384, 1, 1, 1 | F32 | blk.48.exp_probs_b.bias
+ 89: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.48.ffn_down_exps.weight
+ 90: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.48.ffn_down_shexp.weight
+ 91: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.48.ffn_gate_exps.weight
+ 92: 2752512 | 7168, 384, 1, 1 | F32 | blk.48.ffn_gate_inp.weight
+ 93: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.48.ffn_gate_shexp.weight
+ 94: 7168 | 7168, 1, 1, 1 | F32 | blk.48.ffn_norm.weight
+ 95: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.48.ffn_up_exps.weight
+ 96: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.48.ffn_up_shexp.weight
+ 97: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.49.attn_k_b.weight
+ 98: 4128768 | 7168, 576, 1, 1 | Q6_K | blk.49.attn_kv_a_mqa.weight
+ 99: 512 | 512, 1, 1, 1 | F32 | blk.49.attn_kv_a_norm.weight
+ 100: 7168 | 7168, 1, 1, 1 | F32 | blk.49.attn_norm.weight
+ 101: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.49.attn_output.weight
+ 102: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.49.attn_q_a.weight
+ 103: 1536 | 1536, 1, 1, 1 | F32 | blk.49.attn_q_a_norm.weight
+ 104: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.49.attn_q_b.weight
+ 105: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.49.attn_v_b.weight
+ 106: 384 | 384, 1, 1, 1 | F32 | blk.49.exp_probs_b.bias
+ 107: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.49.ffn_down_exps.weight
+ 108: 14680064 | 2048, 7168, 1, 1 | Q6_K | blk.49.ffn_down_shexp.weight
+ 109: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.49.ffn_gate_exps.weight
+ 110: 2752512 | 7168, 384, 1, 1 | F32 | blk.49.ffn_gate_inp.weight
+ 111: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.49.ffn_gate_shexp.weight
+ 112: 7168 | 7168, 1, 1, 1 | F32 | blk.49.ffn_norm.weight
+ 113: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.49.ffn_up_exps.weight
+ 114: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.49.ffn_up_shexp.weight
+ 115: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.50.attn_k_b.weight
+ 116: 4128768 | 7168, 576, 1, 1 | Q6_K | blk.50.attn_kv_a_mqa.weight
+ 117: 512 | 512, 1, 1, 1 | F32 | blk.50.attn_kv_a_norm.weight
+ 118: 7168 | 7168, 1, 1, 1 | F32 | blk.50.attn_norm.weight
+ 119: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.50.attn_output.weight
+ 120: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.50.attn_q_a.weight
+ 121: 1536 | 1536, 1, 1, 1 | F32 | blk.50.attn_q_a_norm.weight
+ 122: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.50.attn_q_b.weight
+ 123: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.50.attn_v_b.weight
+ 124: 384 | 384, 1, 1, 1 | F32 | blk.50.exp_probs_b.bias
+ 125: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.50.ffn_down_exps.weight
+ 126: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.50.ffn_down_shexp.weight
+ 127: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.50.ffn_gate_exps.weight
+ 128: 2752512 | 7168, 384, 1, 1 | F32 | blk.50.ffn_gate_inp.weight
+ 129: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.50.ffn_gate_shexp.weight
+ 130: 7168 | 7168, 1, 1, 1 | F32 | blk.50.ffn_norm.weight
+INFO:gguf-dump:* Loading: /mnt/data/models/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ3_XXS/Kimi-K2-Instruct-UD-IQ3_XXS-00008-of-00009.gguf
+* File is LITTLE endian, script is running on a LITTLE endian host.
+* Dumping 6 key/value pair(s)
+ 1: UINT32 | 1 | GGUF.version = 3
+ 2: UINT64 | 1 | GGUF.tensor_count = 122
+ 3: UINT64 | 1 | GGUF.kv_count = 3
+ 4: UINT16 | 1 | split.no = 7
+ 5: INT32 | 1 | split.tensors.count = 1096
+ 6: UINT16 | 1 | split.count = 9
+* Dumping 122 tensor(s)
+ 1: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.50.ffn_up_exps.weight
+ 2: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.50.ffn_up_shexp.weight
+ 3: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.51.attn_k_b.weight
+ 4: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.51.attn_kv_a_mqa.weight
+ 5: 512 | 512, 1, 1, 1 | F32 | blk.51.attn_kv_a_norm.weight
+ 6: 7168 | 7168, 1, 1, 1 | F32 | blk.51.attn_norm.weight
+ 7: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.51.attn_output.weight
+ 8: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.51.attn_q_a.weight
+ 9: 1536 | 1536, 1, 1, 1 | F32 | blk.51.attn_q_a_norm.weight
+ 10: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.51.attn_q_b.weight
+ 11: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.51.attn_v_b.weight
+ 12: 384 | 384, 1, 1, 1 | F32 | blk.51.exp_probs_b.bias
+ 13: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.51.ffn_down_exps.weight
+ 14: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.51.ffn_down_shexp.weight
+ 15: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.51.ffn_gate_exps.weight
+ 16: 2752512 | 7168, 384, 1, 1 | F32 | blk.51.ffn_gate_inp.weight
+ 17: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.51.ffn_gate_shexp.weight
+ 18: 7168 | 7168, 1, 1, 1 | F32 | blk.51.ffn_norm.weight
+ 19: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.51.ffn_up_exps.weight
+ 20: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.51.ffn_up_shexp.weight
+ 21: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.52.attn_k_b.weight
+ 22: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.52.attn_kv_a_mqa.weight
+ 23: 512 | 512, 1, 1, 1 | F32 | blk.52.attn_kv_a_norm.weight
+ 24: 7168 | 7168, 1, 1, 1 | F32 | blk.52.attn_norm.weight
+ 25: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.52.attn_output.weight
+ 26: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.52.attn_q_a.weight
+ 27: 1536 | 1536, 1, 1, 1 | F32 | blk.52.attn_q_a_norm.weight
+ 28: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.52.attn_q_b.weight
+ 29: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.52.attn_v_b.weight
+ 30: 384 | 384, 1, 1, 1 | F32 | blk.52.exp_probs_b.bias
+ 31: 5637144576 | 2048, 7168, 384, 1 | IQ3_XXS | blk.52.ffn_down_exps.weight
+ 32: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.52.ffn_down_shexp.weight
+ 33: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.52.ffn_gate_exps.weight
+ 34: 2752512 | 7168, 384, 1, 1 | F32 | blk.52.ffn_gate_inp.weight
+ 35: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.52.ffn_gate_shexp.weight
+ 36: 7168 | 7168, 1, 1, 1 | F32 | blk.52.ffn_norm.weight
+ 37: 5637144576 | 7168, 2048, 384, 1 | IQ3_XXS | blk.52.ffn_up_exps.weight
+ 38: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.52.ffn_up_shexp.weight
+ 39: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.53.attn_k_b.weight
+ 40: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.53.attn_kv_a_mqa.weight
+ 41: 512 | 512, 1, 1, 1 | F32 | blk.53.attn_kv_a_norm.weight
+ 42: 7168 | 7168, 1, 1, 1 | F32 | blk.53.attn_norm.weight
+ 43: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.53.attn_output.weight
+ 44: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.53.attn_q_a.weight
+ 45: 1536 | 1536, 1, 1, 1 | F32 | blk.53.attn_q_a_norm.weight
+ 46: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.53.attn_q_b.weight
+ 47: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.53.attn_v_b.weight
+ 48: 384 | 384, 1, 1, 1 | F32 | blk.53.exp_probs_b.bias
+ 49: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.53.ffn_down_exps.weight
+ 50: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.53.ffn_down_shexp.weight
+ 51: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.53.ffn_gate_exps.weight
+ 52: 2752512 | 7168, 384, 1, 1 | F32 | blk.53.ffn_gate_inp.weight
+ 53: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.53.ffn_gate_shexp.weight
+ 54: 7168 | 7168, 1, 1, 1 | F32 | blk.53.ffn_norm.weight
+ 55: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.53.ffn_up_exps.weight
+ 56: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.53.ffn_up_shexp.weight
+ 57: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.54.attn_k_b.weight
+ 58: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.54.attn_kv_a_mqa.weight
+ 59: 512 | 512, 1, 1, 1 | F32 | blk.54.attn_kv_a_norm.weight
+ 60: 7168 | 7168, 1, 1, 1 | F32 | blk.54.attn_norm.weight
+ 61: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.54.attn_output.weight
+ 62: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.54.attn_q_a.weight
+ 63: 1536 | 1536, 1, 1, 1 | F32 | blk.54.attn_q_a_norm.weight
+ 64: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.54.attn_q_b.weight
+ 65: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.54.attn_v_b.weight
+ 66: 384 | 384, 1, 1, 1 | F32 | blk.54.exp_probs_b.bias
+ 67: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.54.ffn_down_exps.weight
+ 68: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.54.ffn_down_shexp.weight
+ 69: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.54.ffn_gate_exps.weight
+ 70: 2752512 | 7168, 384, 1, 1 | F32 | blk.54.ffn_gate_inp.weight
+ 71: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.54.ffn_gate_shexp.weight
+ 72: 7168 | 7168, 1, 1, 1 | F32 | blk.54.ffn_norm.weight
+ 73: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.54.ffn_up_exps.weight
+ 74: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.54.ffn_up_shexp.weight
+ 75: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.55.attn_k_b.weight
+ 76: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.55.attn_kv_a_mqa.weight
+ 77: 512 | 512, 1, 1, 1 | F32 | blk.55.attn_kv_a_norm.weight
+ 78: 7168 | 7168, 1, 1, 1 | F32 | blk.55.attn_norm.weight
+ 79: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.55.attn_output.weight
+ 80: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.55.attn_q_a.weight
+ 81: 1536 | 1536, 1, 1, 1 | F32 | blk.55.attn_q_a_norm.weight
+ 82: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.55.attn_q_b.weight
+ 83: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.55.attn_v_b.weight
+ 84: 384 | 384, 1, 1, 1 | F32 | blk.55.exp_probs_b.bias
+ 85: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.55.ffn_down_exps.weight
+ 86: 14680064 | 2048, 7168, 1, 1 | Q5_K | blk.55.ffn_down_shexp.weight
+ 87: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.55.ffn_gate_exps.weight
+ 88: 2752512 | 7168, 384, 1, 1 | F32 | blk.55.ffn_gate_inp.weight
+ 89: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.55.ffn_gate_shexp.weight
+ 90: 7168 | 7168, 1, 1, 1 | F32 | blk.55.ffn_norm.weight
+ 91: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.55.ffn_up_exps.weight
+ 92: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.55.ffn_up_shexp.weight
+ 93: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.56.attn_k_b.weight
+ 94: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.56.attn_kv_a_mqa.weight
+ 95: 512 | 512, 1, 1, 1 | F32 | blk.56.attn_kv_a_norm.weight
+ 96: 7168 | 7168, 1, 1, 1 | F32 | blk.56.attn_norm.weight
+ 97: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.56.attn_output.weight
+ 98: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.56.attn_q_a.weight
+ 99: 1536 | 1536, 1, 1, 1 | F32 | blk.56.attn_q_a_norm.weight
+ 100: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.56.attn_q_b.weight
+ 101: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.56.attn_v_b.weight
+ 102: 384 | 384, 1, 1, 1 | F32 | blk.56.exp_probs_b.bias
+ 103: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.56.ffn_down_exps.weight
+ 104: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.56.ffn_down_shexp.weight
+ 105: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.56.ffn_gate_exps.weight
+ 106: 2752512 | 7168, 384, 1, 1 | F32 | blk.56.ffn_gate_inp.weight
+ 107: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.56.ffn_gate_shexp.weight
+ 108: 7168 | 7168, 1, 1, 1 | F32 | blk.56.ffn_norm.weight
+ 109: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.56.ffn_up_exps.weight
+ 110: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.56.ffn_up_shexp.weight
+ 111: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.57.attn_k_b.weight
+ 112: 4128768 | 7168, 576, 1, 1 | Q6_K | blk.57.attn_kv_a_mqa.weight
+ 113: 512 | 512, 1, 1, 1 | F32 | blk.57.attn_kv_a_norm.weight
+ 114: 7168 | 7168, 1, 1, 1 | F32 | blk.57.attn_norm.weight
+ 115: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.57.attn_output.weight
+ 116: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.57.attn_q_a.weight
+ 117: 1536 | 1536, 1, 1, 1 | F32 | blk.57.attn_q_a_norm.weight
+ 118: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.57.attn_q_b.weight
+ 119: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.57.attn_v_b.weight
+ 120: 384 | 384, 1, 1, 1 | F32 | blk.57.exp_probs_b.bias
+ 121: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.57.ffn_down_exps.weight
+ 122: 14680064 | 2048, 7168, 1, 1 | Q6_K | blk.57.ffn_down_shexp.weight
+INFO:gguf-dump:* Loading: /mnt/data/models/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ3_XXS/Kimi-K2-Instruct-UD-IQ3_XXS-00009-of-00009.gguf
+* File is LITTLE endian, script is running on a LITTLE endian host.
+* Dumping 6 key/value pair(s)
+ 1: UINT32 | 1 | GGUF.version = 3
+ 2: UINT64 | 1 | GGUF.tensor_count = 60
+ 3: UINT64 | 1 | GGUF.kv_count = 3
+ 4: UINT16 | 1 | split.no = 8
+ 5: INT32 | 1 | split.tensors.count = 1096
+ 6: UINT16 | 1 | split.count = 9
+* Dumping 60 tensor(s)
+ 1: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.57.ffn_gate_exps.weight
+ 2: 2752512 | 7168, 384, 1, 1 | F32 | blk.57.ffn_gate_inp.weight
+ 3: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.57.ffn_gate_shexp.weight
+ 4: 7168 | 7168, 1, 1, 1 | F32 | blk.57.ffn_norm.weight
+ 5: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.57.ffn_up_exps.weight
+ 6: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.57.ffn_up_shexp.weight
+ 7: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.58.attn_k_b.weight
+ 8: 4128768 | 7168, 576, 1, 1 | IQ4_XS | blk.58.attn_kv_a_mqa.weight
+ 9: 512 | 512, 1, 1, 1 | F32 | blk.58.attn_kv_a_norm.weight
+ 10: 7168 | 7168, 1, 1, 1 | F32 | blk.58.attn_norm.weight
+ 11: 58720256 | 8192, 7168, 1, 1 | IQ4_XS | blk.58.attn_output.weight
+ 12: 11010048 | 7168, 1536, 1, 1 | Q4_K | blk.58.attn_q_a.weight
+ 13: 1536 | 1536, 1, 1, 1 | F32 | blk.58.attn_q_a_norm.weight
+ 14: 18874368 | 1536, 12288, 1, 1 | IQ4_XS | blk.58.attn_q_b.weight
+ 15: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.58.attn_v_b.weight
+ 16: 384 | 384, 1, 1, 1 | F32 | blk.58.exp_probs_b.bias
+ 17: 5637144576 | 2048, 7168, 384, 1 | IQ3_S | blk.58.ffn_down_exps.weight
+ 18: 14680064 | 2048, 7168, 1, 1 | IQ4_XS | blk.58.ffn_down_shexp.weight
+ 19: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.58.ffn_gate_exps.weight
+ 20: 2752512 | 7168, 384, 1, 1 | F32 | blk.58.ffn_gate_inp.weight
+ 21: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.58.ffn_gate_shexp.weight
+ 22: 7168 | 7168, 1, 1, 1 | F32 | blk.58.ffn_norm.weight
+ 23: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.58.ffn_up_exps.weight
+ 24: 14680064 | 7168, 2048, 1, 1 | IQ4_XS | blk.58.ffn_up_shexp.weight
+ 25: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.59.attn_k_b.weight
+ 26: 4128768 | 7168, 576, 1, 1 | Q6_K | blk.59.attn_kv_a_mqa.weight
+ 27: 512 | 512, 1, 1, 1 | F32 | blk.59.attn_kv_a_norm.weight
+ 28: 7168 | 7168, 1, 1, 1 | F32 | blk.59.attn_norm.weight
+ 29: 58720256 | 8192, 7168, 1, 1 | Q5_K | blk.59.attn_output.weight
+ 30: 11010048 | 7168, 1536, 1, 1 | Q5_K | blk.59.attn_q_a.weight
+ 31: 1536 | 1536, 1, 1, 1 | F32 | blk.59.attn_q_a_norm.weight
+ 32: 18874368 | 1536, 12288, 1, 1 | Q5_K | blk.59.attn_q_b.weight
+ 33: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.59.attn_v_b.weight
+ 34: 384 | 384, 1, 1, 1 | F32 | blk.59.exp_probs_b.bias
+ 35: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.59.ffn_down_exps.weight
+ 36: 14680064 | 2048, 7168, 1, 1 | Q6_K | blk.59.ffn_down_shexp.weight
+ 37: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.59.ffn_gate_exps.weight
+ 38: 2752512 | 7168, 384, 1, 1 | F32 | blk.59.ffn_gate_inp.weight
+ 39: 14680064 | 7168, 2048, 1, 1 | Q4_K | blk.59.ffn_gate_shexp.weight
+ 40: 7168 | 7168, 1, 1, 1 | F32 | blk.59.ffn_norm.weight
+ 41: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.59.ffn_up_exps.weight
+ 42: 14680064 | 7168, 2048, 1, 1 | Q4_K | blk.59.ffn_up_shexp.weight
+ 43: 4194304 | 128, 512, 64, 1 | Q8_0 | blk.60.attn_k_b.weight
+ 44: 4128768 | 7168, 576, 1, 1 | Q5_K | blk.60.attn_kv_a_mqa.weight
+ 45: 512 | 512, 1, 1, 1 | F32 | blk.60.attn_kv_a_norm.weight
+ 46: 7168 | 7168, 1, 1, 1 | F32 | blk.60.attn_norm.weight
+ 47: 58720256 | 8192, 7168, 1, 1 | Q5_K | blk.60.attn_output.weight
+ 48: 11010048 | 7168, 1536, 1, 1 | Q5_K | blk.60.attn_q_a.weight
+ 49: 1536 | 1536, 1, 1, 1 | F32 | blk.60.attn_q_a_norm.weight
+ 50: 18874368 | 1536, 12288, 1, 1 | Q5_K | blk.60.attn_q_b.weight
+ 51: 4194304 | 512, 128, 64, 1 | Q8_0 | blk.60.attn_v_b.weight
+ 52: 384 | 384, 1, 1, 1 | F32 | blk.60.exp_probs_b.bias
+ 53: 5637144576 | 2048, 7168, 384, 1 | IQ4_XS | blk.60.ffn_down_exps.weight
+ 54: 14680064 | 2048, 7168, 1, 1 | Q6_K | blk.60.ffn_down_shexp.weight
+ 55: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.60.ffn_gate_exps.weight
+ 56: 2752512 | 7168, 384, 1, 1 | F32 | blk.60.ffn_gate_inp.weight
+ 57: 14680064 | 7168, 2048, 1, 1 | Q4_K | blk.60.ffn_gate_shexp.weight
+ 58: 7168 | 7168, 1, 1, 1 | F32 | blk.60.ffn_norm.weight
+ 59: 5637144576 | 7168, 2048, 384, 1 | IQ3_S | blk.60.ffn_up_exps.weight
+ 60: 14680064 | 7168, 2048, 1, 1 | Q4_K | blk.60.ffn_up_shexp.weight
+```
+
+
+
+---
+
+👤 **magikRUKKOLA** commented on **2025-07-19** at **01:30:36**
@ubergarm
@@ -321,7 +1928,7 @@ llama_print_timings: total time = 6468857.58 ms / 290817 tokens
---
-👤 **ThomasBaruzier** commented the **2025-07-19** at **15:59:44**:
+👤 **ThomasBaruzier** commented on **2025-07-19** at **15:59:44**
Thanks Iwan and Ubergram for the amazing work! You two motivated me to try Kimi on my "mere" 128GB + 3x3090 rig.
@@ -433,12 +2040,12 @@ Adding custom rule blk\..*\.ffn_(gate|up)_exps\.weight -> iq1_s_r4
Adding custom rule token_embd\.weight -> iq4_kt
Adding custom rule output\.weight -> iq5_ks
load_imatrix: imatrix dataset='ubergarm-imatrix-calibration-corpus-v02.txt'
-load_imatrix: loaded 729 importance matrix entries from /home/tyra/storage/gguf/Kimi-K2-Instruct/Kimi-K2-Instruct-Q8_0.imatrix computed on 826 chunks
+load_imatrix: loaded 729 importance matrix entries from /home/user/storage/gguf/Kimi-K2-Instruct/Kimi-K2-Instruct-Q8_0.imatrix computed on 826 chunks
prepare_imatrix: have 729 importance matrix entries
main: build = 3818 (77eaa532)
main: built with cc (GCC) 15.1.1 20250425 for x86_64-pc-linux-gnu
-main: quantizing '/home/tyra/storage/gguf/Kimi-K2-Instruct/Kimi-K2-Instruct-Q8_0.gguf' to '/home/tyra/nvme/gguf/Kimi-K2-Instruct/Kimi-K2-Instruct-IQ1_S.gguf' as IQ1_KT using 32 threads
-llama_model_loader: loaded meta data with 50 key-value pairs and 1157 tensors from /home/tyra/storage/gguf/Kimi-K2-Instruct/Kimi-K2-Instruct-Q8_0.gguf (version GGUF V3 (latest))
+main: quantizing '/home/user/storage/gguf/Kimi-K2-Instruct/Kimi-K2-Instruct-Q8_0.gguf' to '/home/user/nvme/gguf/Kimi-K2-Instruct/Kimi-K2-Instruct-IQ1_S.gguf' as IQ1_KT using 32 threads
+llama_model_loader: loaded meta data with 50 key-value pairs and 1157 tensors from /home/user/storage/gguf/Kimi-K2-Instruct/Kimi-K2-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
@@ -639,7 +2246,23 @@ Thanks!
---
-👤 **ubergarm** commented the **2025-07-19** at **16:38:41**:
+👤 **ikawrakow** commented on **2025-07-19** at **16:14:53**
+
+> However, could token_embd.weight be missing because I used Q8_0 as input?
+
+One never calculates imatrix data for token embeddings. I should go and add a check to not print this warning to avoid people worrying about this.
+
+The warnings about missing cluster points are harmless.
+
+The warning about missing imatrix data for `attn_k_b` is not good.
+
+> How much accuracy do we lose by requantizing from Q8_0 instead of BF16?
+
+Not much, if any.
+
+---
+
+👤 **ubergarm** commented on **2025-07-19** at **16:38:41**
> The warning about missing imatrix data for attn_k_b is not good.
@@ -656,17 +2279,24 @@ Looking back at my deepseek quantization logs it only has:
====== llama_model_quantize_internal: did not find weights for blk.47.attn_k_b.weight
```
-The main difference is that for kimi-k2 imatrix i used `-mla 1` whereas with the older deepseek imatrix i did not specify `-mla` at all?
+(fwiw attn_k_b is not divisible by 256 so i've had to use like q5_0 or iq4_nl, might be related to the imatrix stuff, not sure)
-Also, yesterday I discovered that Kimi-K2-Instruct seems very sensitive to attn/shexp/blk.0.ffn.* or possibly just attn. I'm thinking it is because Kimi-K2 uses half the attn heads and 33% of the ffn dense layers as DeepSeek. So going back and requantizing my recipes with full q8_0 attn/shexp/blk.0.ffn.* is improving PP a lot for a little BPW.
+The main difference is that for kimi-k2 imatrix i used `-mla 1` whereas with the older deepseek imatrix i did not specify `-mla` at all.
+
+Also, yesterday I discovered that Kimi-K2-Instruct seems very sensitive to attn/shexp/blk.0.ffn.* or possibly just attn. I'm thinking it is because Kimi-K2 uses half the attn heads and 33% of the ffn dense layers as DeepSeek. So going back and requantizing my recipes with full q8_0 attn/shexp/blk.0.ffn.* is improving Perplexity a lot for a little BPW.
So now I'm not sure if this is because of those architecture changes in Kimi-K2, or perhaps just my imatrix was not being properly applied to the MLA tensors? hrmm...
-I'm updating the chart and data with what I have so far up above: https://github.com/ikawrakow/ik_llama.cpp/pull/616#issuecomment-3087170346
+I'm updating the chart and data with what I have so far up above: https://github.com/ikawrakow/ik_llama.cpp/pull/616#issuecomment-3087170346
+
+> Note that the Q8_0 input was made from convert_hf_to_gguf.py:
+python convert_hf_to_gguf.py --outfile /home/user/storage/gguf/Kimi-K2-Instruct/Kimi-K2-Instruct-Q8_0.gguf /home/user/storage/llm/Kimi-K2-Instruct-BF16/ --outtype q8_0 --model-name Kimi-K2-Instruct --split-max-size 9999G
+
+Oh interesting, I used `--outtype bf16` for my convert and then quantize from the full bf16. Its 2TB file size though so chews up disk space!
---
-👤 **magikRUKKOLA** commented the **2025-07-19** at **22:39:56**:
+👤 **magikRUKKOLA** commented on **2025-07-19** at **22:39:56**
@ubergarm
@@ -951,11 +2581,77 @@ Here is my dump:
* Dumping 6 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
+```
+
+[EDIT]:
+hashes:
+```
+83ede91803add96bc6dedb654fb63d8623cbe49116ffa6157f7187038d0b844e Kimi-K2-Instruct-UD-IQ3_XXS-00001-of-00009.gguf
+0a13cd9912d8b2fd04d39db1e49c6b34e36b228b95511b5623211003903e1159 Kimi-K2-Instruct-UD-IQ3_XXS-00002-of-00009.gguf
+7e756f2fb141dc6b9dc76905485b82997b03537594eeed6f00b000cb9ca8118e Kimi-K2-Instruct-UD-IQ3_XXS-00003-of-00009.gguf
+e22a7a9eabe7b65da81c61427ab0e878accc15abc1fc6996b01c904dc57f32b6 Kimi-K2-Instruct-UD-IQ3_XXS-00004-of-00009.gguf
+67563c3134d19652726822daf641f5b83fd617e7ab1538965d51755323f11dcb Kimi-K2-Instruct-UD-IQ3_XXS-00005-of-00009.gguf
+952230068e615e8b5d04c67489b729196ddb323179b66e9838c3f1b9fe726636 Kimi-K2-Instruct-UD-IQ3_XXS-00006-of-00009.gguf
+1628351d2b1659e4c7a52270955b9ce2c93d7781726318e53f5fe4152ddcb687 Kimi-K2-Instruct-UD-IQ3_XXS-00007-of-00009.gguf
+6149283b6e36c5bd87db0fabe091ebec433e1f8126bfef847d3af39d652abf3a Kimi-K2-Instruct-UD-IQ3_XXS-00008-of-00009.gguf
+72aca833efedb0b00424eba6655a8c7319d27b1e9b87a4178b714c1b7770b53c Kimi-K2-Instruct-UD-IQ3_XXS-00009-of-00009.gguf
+```
+
+```bash
+/opt/unsloth/Kimi-K2-Instruct-GGUF/UD-IQ3_XXS# ls -lah *gguf
+-rw-r--r-- 1 root root 46G Jul 16 11:17 Kimi-K2-Instruct-UD-IQ3_XXS-00001-of-00009.gguf
+-rw-r--r-- 1 root root 47G Jul 16 12:35 Kimi-K2-Instruct-UD-IQ3_XXS-00002-of-00009.gguf
+-rw-r--r-- 1 root root 45G Jul 16 13:50 Kimi-K2-Instruct-UD-IQ3_XXS-00003-of-00009.gguf
+-rw-r--r-- 1 root root 46G Jul 17 02:54 Kimi-K2-Instruct-UD-IQ3_XXS-00004-of-00009.gguf
+-rw-r--r-- 1 root root 45G Jul 16 16:36 Kimi-K2-Instruct-UD-IQ3_XXS-00005-of-00009.gguf
+-rw-r--r-- 1 root root 45G Jul 16 18:00 Kimi-K2-Instruct-UD-IQ3_XXS-00006-of-00009.gguf
+-rw-r--r-- 1 root root 45G Jul 16 19:24 Kimi-K2-Instruct-UD-IQ3_XXS-00007-of-00009.gguf
+-rw-r--r-- 1 root root 46G Jul 17 04:22 Kimi-K2-Instruct-UD-IQ3_XXS-00008-of-00009.gguf
+-rw-r--r-- 1 root root 27G Jul 17 02:34 Kimi-K2-Instruct-UD-IQ3_XXS-00009-of-00009.gguf
+```
+
+@ubergarm
+
+it seems like we are dealing with ~~two different revisions fo the same quant.~~ different tokenization configs embedded into GGUF (!?)
+Also, your info regarding Kimi-K2-Instruct-UD-IQ3_XXS-00006-of-00009.gguf is missing.
+
+[EDIT]:
+
+```
+* Extra key/value pair
+ Old: GGUF.kv_count = 61
+ New: GGUF.kv_count = 62
+ The new dump contains one additional metadata field.
+* Extra metadata field
+ New key #51:
+ tokenizer.ggml.add_bos_token = False
+ This boolean flag tells downstream loaders not to automatically prepend the
+ BOS token ( ) when the tokenizer is applied.
+ Absence of this key (old dump) means “let the program decide” or default to
+ True ; with the key present and set to False the tokenizer will not add a
+ leading BOS token.
+ * Base-model reference updated
+ Old:
+ general.base_model.0.name = "Kimi K2 Instruct"
+ general.base_model.0.repo_url = "https://huggingface.co/moonshotai/Kimi-K2-
+ Instruct"
+ New:
+ general.base_model.0.name = "Kimi K2 Instruct BF16"
+ general.base_model.0.repo_url = "https://huggingface.co/moonshotai/Kimi-K2-
+ Instruct-BF16"
+ The quantized model now declares the BF16 checkpoint as its base instead of
+ the original one.
+ * EOS token id changed
+ Old: tokenizer.ggml.eos_token_id = 163585
+ New: tokenizer.ggml.eos_token_id = 163586
+ The end-of-sentence token id was shifted by one. This is usually caused by
+ re-generating the tokenizer vocabulary (or adding a new special token) after
+ the base model was switched to the BF16 release.
```
---
-👤 **ikawrakow** commented the **2025-07-20** at **08:30:26**:
+👤 **ikawrakow** commented on **2025-07-20** at **08:30:26**
> Hrrm, I too see this for my Kimi-K2-Instruct quantize logs:
>
@@ -967,13 +2663,196 @@ Here is my dump:
---
-👤 **ubergarm** commented the **2025-07-20** at **15:18:58**:
+👤 **ThomasBaruzier** commented on **2025-07-20** at **10:15:59**
+
+> The warning about missing imatrix data for attn_k_b is not good.
+
+> The main difference is that for kimi-k2 imatrix i used -mla 1 whereas with the older deepseek imatrix i did not specify -mla at all.
+
+Is it related to how we produce the imatrix and what value of MLA we use while doing so? Edit: nvm I didn't read the answer above
+
+> So going back and requantizing my recipes with full q8_0 attn/shexp/blk.0.ffn.*
+
+I will try as well. Using your edited recipe, everything minus ffn gate up down is very small:
+
+ | Name | Count | Size (MB) | Total (MB) |
+|:------------------------------|--------:|------------:|-------------:|
+| blk.{i}.ffn_down_exps.weight | 59 | 1186 | 70004 |
+| blk.{i}.ffn_gate_exps.weight | 59 | 1010 | 59560 |
+| blk.{i}.ffn_up_exps.weight | 59 | 1010 | 59560 |
+| blk.{i}.attn_output.weight | 59 | 28 | 1654 |
+| output.weight | 1 | 736 | 736 |
+| blk.{i}.ffn_gate_inp.weight | 59 | 10.5 | 620 |
+| blk.{i}.ffn_norm.weight | 1 | 561 | 561 |
+| blk.{i}.attn_q_b.weight | 59 | 9.05 | 534 |
+| blk.{i}.attn_kv_b.weight | 59 | 8.5 | 502 |
+| blk.{i}.ffn_down_shexp.weight | 59 | 7.03 | 415 |
+| blk.{i}.ffn_gate_shexp.weight | 59 | 5.48 | 323 |
+| blk.{i}.ffn_up_shexp.weight | 59 | 5.48 | 323 |
+| blk.{i}.attn_q_a.weight | 59 | 5.26 | 310 |
+| blk.{i}.attn_k_b.weight | 59 | 2.25 | 133 |
+| blk.{i}.attn_v_b.weight | 59 | 2.03 | 120 |
+| blk.{i}.attn_kv_a_mqa.weight | 59 | 1.97 | 116 |
+| blk.{i}.ffn_down.weight | 1 | 63 | 63 |
+| blk.{i}.ffn_gate.weight | 1 | 49.3 | 49.3 |
+| blk.{i}.ffn_up.weight | 1 | 49.3 | 49.3 |
+| blk.{i}.attn_norm.weight | 60 | 0.0273 | 1.64 |
+| blk.{i}.ffn_norm.weight | 59 | 0.0273 | 1.61 |
+| blk.{i}.attn_q_a_norm.weight | 59 | 0.0059 | 0.346 |
+| blk.{i}.attn_kv_a_norm.weight | 59 | 0.002 | 0.115 |
+| blk.{i}.exp_probs_b.bias | 59 | 0.0015 | 0.0864 |
+| output_norm.weight | 1 | 0.0273 | 0.0273 |
+| token_embd.weight | 1 | 0.002 | 0.002 |
+
+For Q8_0:
+
+| Name | Count | Size (MB) | Total (MB) |
+|:------------------------------|--------:|------------:|-------------:|
+| blk.{i}.ffn_down_exps.weight | 59 | 5712 | 337008 |
+| blk.{i}.ffn_gate_exps.weight | 59 | 5712 | 337008 |
+| blk.{i}.ffn_up_exps.weight | 59 | 5712 | 337008 |
+| blk.{i}.attn_output.weight | 59 | 59.5 | 3510 |
+| token_embd.weight | 1 | 1190 | 1190 |
+| output.weight | 1 | 1190 | 1190 |
+| blk.{i}.attn_q_b.weight | 59 | 19.1 | 1128 |
+| blk.{i}.ffn_down_shexp.weight | 58 | 14.9 | 863 |
+| blk.{i}.ffn_gate_shexp.weight | 58 | 14.9 | 863 |
+| blk.{i}.ffn_up_shexp.weight | 58 | 14.9 | 863 |
+| blk.{i}.attn_q_a.weight | 59 | 11.2 | 658 |
+| blk.{i}.ffn_gate_inp.weight | 59 | 10.5 | 620 |
+| blk.{i}.attn_kv_b.weight | 59 | 8.5 | 502 |
+| blk.{i}.attn_k_b.weight | 59 | 4.25 | 251 |
+| blk.{i}.attn_v_b.weight | 59 | 4.25 | 251 |
+| blk.{i}.attn_kv_a_mqa.weight | 59 | 4.18 | 247 |
+| blk.{i}.ffn_down.weight | 1 | 134 | 134 |
+| blk.{i}.ffn_gate.weight | 1 | 134 | 134 |
+| blk.{i}.ffn_up.weight | 1 | 134 | 134 |
+| blk.{i}.attn_norm.weight | 60 | 0.0273 | 1.64 |
+| blk.{i}.ffn_norm.weight | 59 | 0.0273 | 1.61 |
+| blk.{i}.attn_q_a_norm.weight | 59 | 0.0059 | 0.346 |
+| blk.{i}.attn_kv_a_norm.weight | 59 | 0.002 | 0.115 |
+| blk.{i}.exp_probs_b.bias | 59 | 0.0015 | 0.0864 |
+| output_norm.weight | 1 | 0.0273 | 0.0273 |
+
+
+Script
+
+`./llama-gguf Kimi-K2-Instruct-IQ1_S.gguf r n > kimi-tensors.txt`
+
+`python tensors.py kimi-tensors.txt --md -m`
+
+`tensors.py`:
+```py
+import re
+import sys
+import argparse
+import pandas as pd
+
+def format_size(size, unit):
+ if not isinstance(size, (int, float)) or size < 0:
+ return "Unknown"
+ if unit == 'K': size /= 1024
+ elif unit == 'M': size /= 1024 * 1024
+ elif unit == 'G': size /= 1024 * 1024 * 1024
+ if size == 0: return 0
+ elif size < 0.1: return round(size, 4)
+ elif size < 1: return round(size, 3)
+ elif size < 10: return round(size, 2)
+ elif size < 100: return round(size, 1)
+ return round(size)
+
+def create_tensor_report(file_contents, unit, md_output=False):
+ tensor_regex = re.compile(r"tensor\[\d+\]: name = ([\w\.]+), offset = (\d+)")
+ raw_tensors = [{"name": m.group(1), "offset": int(m.group(2))}
+ for line in file_contents.splitlines()
+ if (m := tensor_regex.search(line))]
+
+ if not raw_tensors:
+ print("Error: No tensors found.", file=sys.stderr)
+ return
+
+ raw_tensors.sort(key=lambda x: x["offset"])
+ tensors = [{"name": t["name"], "size": raw_tensors[i+1]["offset"] - t["offset"]}
+ for i, t in enumerate(raw_tensors[:-1])]
+ tensors.append({"name": raw_tensors[-1]["name"], "size": -1})
+
+ pattern_regex = re.compile(r"blk\.(\d+)\.")
+ aggregated = {}
+ for t in tensors[:-1]:
+ p = pattern_regex.sub(r"blk.{i}.", t["name"])
+ k = (p, t["size"])
+ aggregated[k] = aggregated.get(k, 0) + 1
+
+ last_p = pattern_regex.sub(r"blk.{i}.", tensors[-1]["name"])
+ matched = False
+ for p, s in aggregated:
+ if p == last_p:
+ aggregated[(p, s)] += 1
+ matched = True
+ break
+ if not matched:
+ aggregated[(last_p, -1)] = aggregated.get((last_p, -1), 0) + 1
+
+ output = []
+ for (p, s), c in aggregated.items():
+ ts = c * s if s != -1 else -1
+ fs = format_size(s, unit)
+ ft = format_size(ts, unit)
+ output.append([p, c, fs if isinstance(fs, str) else fs,
+ ft if isinstance(ft, str) else ft])
+
+ output.sort(key=lambda x: x[3] if isinstance(x[3], (int, float)) else -1, reverse=True)
+ units = {'B': 'Bytes', 'K': 'KB', 'M': 'MB', 'G': 'GB'}
+ headers = ["Name", "Count", f"Size ({units.get(unit, 'Bytes')})",
+ f"Total ({units.get(unit, 'Bytes')})"]
+
+ if md_output:
+ df = pd.DataFrame(output, columns=headers)
+ print(df.to_markdown(index=False))
+ else:
+ print(*headers, sep=',')
+ for row in output:
+ print(*row, sep=',')
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser(description="Analyze GGUF tensor metadata.")
+ parser.add_argument("filepath", help="Path to tensor metadata file.")
+ group = parser.add_mutually_exclusive_group()
+ group.add_argument("-k", "--kb", action="store_true")
+ group.add_argument("-m", "--mb", action="store_true")
+ group.add_argument("-g", "--gb", action="store_true")
+ parser.add_argument("--md", action="store_true", help="Output as markdown table")
+
+ args = parser.parse_args()
+ unit = 'B'
+ if args.kb: unit = 'K'
+ elif args.mb: unit = 'M'
+ elif args.gb: unit = 'G'
+
+ try:
+ with open(args.filepath) as f:
+ create_tensor_report(f.read(), unit, args.md)
+ except FileNotFoundError:
+ print(f"Error: File not found at '{args.filepath}'", file=sys.stderr)
+ sys.exit(1)
+ except Exception as e:
+ print(f"Error: {e}", file=sys.stderr)
+ sys.exit(1)
+```
+
+
+
+Also, is there a way to get the tensor types from llama-gguf? Or should I use something like gguf-py?
+
+---
+
+👤 **ubergarm** commented on **2025-07-20** at **15:18:58**
@ThomasBaruzier
> everything minus ffn gate up down is very small
-Yes, I like to imagine a person with the `attn/shexp/first N ffn dense layers` as the head, and all the routed exps as the body. DeepSeek has a very small "head" and a very large "body". Kimi-K2 has an even smaller tiny "head" and an even larger "body" haha...
+Yes, I like to imagine a person with the `attn/shexp/first N ffn dense layers` as their "head", and all the routed exps as their "body". DeepSeek has a very small "head" and a very large "body". Kimi-K2 has an even smaller tiny "head" and an even larger "body" haha...
So perhaps one must be more careful when squishing that tiny "brain" lol... All metaphorical of course...
@@ -992,12 +2871,12 @@ uv venv ./venv --python 3.12 --python-preference=only-managed
source ./venv/bin/activate
uv pip install numpy==1.26.2 sentencepiece pyyaml
-./gguf-py/scripts/gguf_dump.py /models/mymodel.gguf
+python ./gguf-py/scripts/gguf_dump.py /models/mymodel.gguf
```
---
-👤 **ubergarm** commented the **2025-07-20** at **16:08:14**:
+👤 **ubergarm** commented on **2025-07-20** at **16:08:14**
@ikawrakow
@@ -2828,7 +4707,15 @@ collect_imatrix[2]: blk.4.ffn_gate_exps.weight, MUL_MAT_ID, 71
---
-👤 **ThomasBaruzier** commented the **2025-07-20** at **16:59:15**:
+👤 **ikawrakow** commented on **2025-07-20** at **16:18:21**
+
+So, `attn_k_b` ends up having the name `attn_k_b.weight (reshaped)`. Somehow I thought I had taken care of that, but apparently not. We can still deal with it when using the imatrix. But `attn_v_b` is missing, which is very strange.
+
+I'll look into it.
+
+---
+
+👤 **ThomasBaruzier** commented on **2025-07-20** at **16:59:15**
> Yes, I like to imagine a person with the attn/shexp/first N ffn dense layers as their "head", and all the routed exps as their "body". DeepSeek has a very small "head" and a very large "body". Kimi-K2 has an even smaller tiny "head" and an even larger "body" haha...
@@ -2840,7 +4727,7 @@ Thanks, I will check it out
---
-👤 **ikawrakow** commented the **2025-07-20** at **17:23:02**:
+👤 **ikawrakow** commented on **2025-07-20** at **17:23:02**
> I guess the best bet could be Q8_0 or Q6_K
@@ -2848,7 +4735,7 @@ Thanks, I will check it out
---
-👤 **ubergarm** commented the **2025-07-20** at **17:34:37**:
+👤 **ubergarm** commented on **2025-07-20** at **17:34:37**
> I wonder how fast that would go.
@@ -2856,4 +4743,36 @@ I have some preliminary llama-sweep-bench with my original recipe Kimi-K2 quants
I plan to get at least one a/b test sweep-bench of my kimi-k2 v0.1 original recipe vs the v0.2 full q8_0 `attn/shexp/blk.0.ffn.*` on this same rig today and might release the updated quants if the speed hit is not too bad given the improvement Perplexity.
-Of course I'll probably want to try a v0.3 recipe eventually after sorting out the MLA imatrix business :sweat_smile: ... Fortunately hf doesn't charge for the public storage :moneybag: :headstone: :hugs: ...
\ No newline at end of file
+Of course I'll probably want to try a v0.3 recipe eventually after sorting out the MLA imatrix business :sweat_smile: ... Fortunately hf doesn't charge for the public storage :moneybag: :headstone: :hugs: ...
+
+---
+
+👤 **ikawrakow** commented on **2025-07-20** at **17:49:06**
+
+Btw, what is causing this sudden surge in stars?
+
+---
+
+👤 **saood06** commented on **2025-07-22** at **14:50:08**
+
+> Btw, what is causing this sudden surge in stars?
+
+If I had to guess it was a mixture of Kimi, the people learning about this from the posts about Vulkan voting, organic growth.
+
+(Also I'm glad the repo is back was going to reply to this before the incident).
+
+---
+
+👤 **ubergarm** commented on **2025-07-22** at **16:27:18**
+
+Despite the outage, managed to release the worlds smallest Kimi-K2-Instruct as well as the best perplexity quants here: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF#quant-collection
+
+Great job on the IQ1_KT, really impressive way to shrink down these behemoths models to run them in RAM on local rigs.
+
+Welcome back!
+
+---
+
+👤 **ikawrakow** commented on **2025-07-22** at **17:30:41**
+
+@ubergarm Looking at the graph in the linked HF repository, my guess is that one can cook models with size between `IQ1_KT` and `IQ2_KL` that push the Pareto frontier further away from Unsloth's quants using `IQ2_KT`. It looks like Unsloth's `IQ2_XXS` mix is performing surprisingly well in that size range.
\ No newline at end of file
diff --git a/github-data/pull_requests/617 - Fixup kimi-k2 convert indentation.md b/github-data/pull_requests/617 - Fixup kimi-k2 convert indentation.md
index 00f57762c..287124664 100644
--- a/github-data/pull_requests/617 - Fixup kimi-k2 convert indentation.md
+++ b/github-data/pull_requests/617 - Fixup kimi-k2 convert indentation.md
@@ -1,14 +1,17 @@
-### 🐛 [#617](https://github.com/ikawrakow/ik_llama.cpp/pull/617) - Fixup kimi-k2 convert indentation
+## 🔀 [Pull Request #617](https://github.com/ikawrakow/ik_llama.cpp/pull/617) - Fixup kimi-k2 convert indentation
| **Author** | `ubergarm` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ug/kimi-k2-convert-fixup` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-16 |
| **Updated** | 2025-07-16 |
+| **Merged** | 2025-07-16 |
---
-#### Description
+## 📄 Description
Fixup a copy-paste python indent bug on the convert_hf_to_gguf.py script for kimi-k2-instruct. Thanks @anikifoss for testing and if you have success let me know here to confirm this patch is good.
@@ -16,13 +19,21 @@ https://github.com/ikawrakow/ik_llama.cpp/pull/612#issuecomment-3076684820
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-07-16** at **13:24:15**: ✅ `APPROVED`
+👤 **anikifoss** commented on **2025-07-16** at **13:10:21**
+
+Still running, 8 hours later at 50%. There is `attn_kv_b` in the output GGUF.
+
+Why do you need `attn_kv_b` anyway?
+
+---
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-16** at **13:24:15**
---
-👤 **ubergarm** commented the **2025-07-16** at **13:30:08**:
+👤 **ubergarm** commented on **2025-07-16** at **13:30:08**
> Still running, 8 hours later at 50%. There is `attn_kv_b` in the output GGUF.
>
@@ -38,8 +49,58 @@ Based on that discussion I've changed my recipes a bit for Kimi and future deeps
---
-👤 **ikawrakow** commented the **2025-07-16** at **14:43:46**:
+👤 **anikifoss** commented on **2025-07-16** at **13:38:55**
+
+Thanks, you already pointed to that PR. Looks like it's for imatrix. There is so much activity I'm having hard time keeping up :sweat_smile:
+
+---
+
+👤 **ubergarm** commented on **2025-07-16** at **14:25:05**
+
+@anikifoss
+
+Ooops, I'm so scattered sometimes! I've been trying to understand more clearly myself as well!
+
+## tl;dr;
+
+given you use q8_0 for all attn in your quants, it probably doesn't matter to you much. haha... also i think you are the source of `-ot attn_kv_b=CPU` maybe? I thought I tried that once and it wasn't running for me, but other people using it. Maybe it depends on which `-mla ` you're using? I only use 3 now since it got CUDA support.
+
+## ramblings
+
+The reason I pointed at that comment again is this specific bit, regarding "Why do you need attn_kv_b anyway":
+
+> This gives you imatrix data for the wk_b and wv_b tensors, which is good. It is good because these two get used for TG, so you want them quantized with fewer bits if possible. If wkv_b is added to the GGUF, it should be quantized with Q8_0. If it is not added, ik_llama.cpp will (nearly) losslessly create wkv_b tensors as Q8_0 from wk_b and wv_b while loading the model. wkv_b being Q8_0 is fine because tit only gets used for PP, so the more bits don't matter for performance. -ik
+
+Also from https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13733184
+
+> The attn_k_b and attn_v_b tensors get used for TG. The attn_kv_b tensors that ik_llama.cpp creates on-the-fly are used for PP (when MLA = 2, 3). To avoid potential accuracy loss due to re-quantization, the attn_kv_b tensors get created as Q8_0. -ik
+
+Also this discussion on MLA and comments: https://github.com/ikawrakow/ik_llama.cpp/discussions/354#discussioncomment-13054586
+
+There is a little bit about it too in one of the original mainline MLA PRs by fairydreaming which is was not merged, but possibly a bit more similar to how it is done here psure: https://github.com/ggml-org/llama.cpp/pull/11446
+
+So all that to say my limited understanding of having the both the `attn_kv_b` allow this fork to use "the best of both worlds" for `-mla 3` which uses:
+* q8_0 attn_kv_b tensor for PP (its fine to be big given PP is CPU bound)
+* quantized attn_k_b attn_v_b tensors (preferably with correct imatrix for lower bpws) for TG (memory bound so smaller size is faster)
+
+But yeah as you use q8_0 for all of it, probably not a bit deal on your quants and also why mainline uses q8_0 for all that as compilades new imatrix/gguf stuff that properly handles those tensors is not yet merged.
+
+My latest recipes I've been leaving attn_kv_b at q8_0 now and only quantizing attn_k_b and attn_v_b. Unfortunately though, attn_k_b is not divisible by 256 so I'm stuck with q5_0 or iq4_nl.
+
+I hope this is somewhat accurate :sweat_smile:
+
+---
+
+👤 **ikawrakow** commented on **2025-07-16** at **14:43:46**
> I hope this is somewhat accurate
-It is. Basically, you don't need to have the `attn_kv_b` tensors to create imatrix data and a good quantized model for `ik_llama.cpp`. The only potential benefit from having `attn_kv_b` in the GGUF is that then these tensors becomes part of the contiguously allocated (or mmap'ed) tensor data storage, while if they are not present in the GGUF, memory is allocated separately for them (but still on the same device that stores the corresponding `attn_k` and `attn_v` tensors). Considering how sensitive the big NUMA systems are to the way the tensors are stored in RAM, this may have some performance implications. But nobody has studied this effect in detail yet, so we don't really know.
\ No newline at end of file
+It is. Basically, you don't need to have the `attn_kv_b` tensors to create imatrix data and a good quantized model for `ik_llama.cpp`. The only potential benefit from having `attn_kv_b` in the GGUF is that then these tensors becomes part of the contiguously allocated (or mmap'ed) tensor data storage, while if they are not present in the GGUF, memory is allocated separately for them (but still on the same device that stores the corresponding `attn_k` and `attn_v` tensors). Considering how sensitive the big NUMA systems are to the way the tensors are stored in RAM, this may have some performance implications. But nobody has studied this effect in detail yet, so we don't really know.
+
+---
+
+👤 **saood06** commented on **2025-07-16** at **20:13:58**
+
+>Considering how sensitive the big NUMA systems are to the way the tensors are stored in RAM, this may have some performance implications. But nobody has studied this effect in detail yet, so we don't really know.
+
+I did some direct comparisons a long while back, and there was a measurable (but small) impact on my system (and this was with q8_0 attn tensors which matches the size they are created at if not present). So I can say that when it comes to my system it matters enough to be measured.
\ No newline at end of file
diff --git a/github-data/pull_requests/618 - Webui New Features for Conversations Settings and Chat Messages.md b/github-data/pull_requests/618 - Webui New Features for Conversations Settings and Chat Messages.md
new file mode 100644
index 000000000..847967c6c
--- /dev/null
+++ b/github-data/pull_requests/618 - Webui New Features for Conversations Settings and Chat Messages.md
@@ -0,0 +1,57 @@
+## 🔀 [Pull Request #618](https://github.com/ikawrakow/ik_llama.cpp/pull/618) - Webui: New Features for Conversations, Settings, and Chat Messages
+
+| **Author** | `firecoperana` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `fcp/webui_update_new` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-16 |
+| **Updated** | 2025-07-20 |
+| **Merged** | 2025-07-20 |
+| **Assignees** | `firecoperana` |
+
+---
+
+## 📄 Description
+
+1. Add Rename/Upload conversation function in header and sidebar
+2. Add a preset feature to the settings [#14649](https://github.com/ikawrakow/ik_llama.cpp/issues/14649) https://github.com/ggml-org/llama.cpp/pull/14649
+3. Add editing assistant messages [#13522](https://github.com/ikawrakow/ik_llama.cpp/issues/13522) (modify some behavior) https://github.com/ggml-org/llama.cpp/pull/13522
+4. DB import and export [#14347](https://github.com/ikawrakow/ik_llama.cpp/issues/14347) https://github.com/ggml-org/llama.cpp/pull/14347
+5. Bug fixes
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [ ] Low
+ - [ ] Medium
+ - [x] High
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** commented on **2025-07-16** at **16:09:57**
+
+I basically never use the WebUI, so I'm a totally unexperienced user not knowing what to look for.
+
+Can we get a few people testing and providing feedback? Thanks!
+
+---
+
+👤 **mcm007** commented on **2025-07-16** at **20:53:26**
+
+All features are useful, thanks, and are working OK on my tests.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-17** at **05:42:09**
+
+@mcm007 Thank you for testing.
+
+Please let's get at least one more user to test.
+
+---
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-20** at **10:33:43**
+
+Merging. In case there are issue, we will learn about them after the fact.
\ No newline at end of file
diff --git a/github-data/pull_requests/618 - Webui_ New Features for Conversations_ Settings_ and Chat Messages.md b/github-data/pull_requests/618 - Webui_ New Features for Conversations_ Settings_ and Chat Messages.md
deleted file mode 100644
index 6befe5d2f..000000000
--- a/github-data/pull_requests/618 - Webui_ New Features for Conversations_ Settings_ and Chat Messages.md
+++ /dev/null
@@ -1,39 +0,0 @@
-### ✨ [#618](https://github.com/ikawrakow/ik_llama.cpp/pull/618) - Webui: New Features for Conversations, Settings, and Chat Messages
-
-| **Author** | `firecoperana` |
-| :--- | :--- |
-| **State** | ✅ **Open** |
-| **Created** | 2025-07-16 |
-| **Updated** | 2025-07-17 |
-
----
-
-#### Description
-
-1. Add Rename/Upload conversation function in header and sidebar
-2. Add a preset feature to the settings #14649 https://github.com/ggml-org/llama.cpp/pull/14649
-3. Add editing assistant messages #13522 (modify some behavior) https://github.com/ggml-org/llama.cpp/pull/13522
-4. DB import and export #14347 https://github.com/ggml-org/llama.cpp/pull/14347
-5. Bug fixes
-
-- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
-- Self-reported review complexity:
- - [ ] Low
- - [ ] Medium
- - [x] High
-
----
-
-#### 💬 Conversation
-
-👤 **ikawrakow** commented the **2025-07-17** at **05:42:09**:
-
-@mcm007 Thank you for testing.
-
-Please let's get at least one more user to test.
-
----
-
-👤 **ikawrakow** submitted a review the **2025-07-20** at **10:33:43**: ✅ `APPROVED`
-
-Merging. In case there are issue, we will learn about them after the fact.
\ No newline at end of file
diff --git a/github-data/pull_requests/62 - Use fp32 for K_Q in Metal FA implementation.md b/github-data/pull_requests/62 - Use fp32 for KQ in Metal FA implementation.md
similarity index 54%
rename from github-data/pull_requests/62 - Use fp32 for K_Q in Metal FA implementation.md
rename to github-data/pull_requests/62 - Use fp32 for KQ in Metal FA implementation.md
index ca6ec1e4e..9c45cffdd 100644
--- a/github-data/pull_requests/62 - Use fp32 for K_Q in Metal FA implementation.md
+++ b/github-data/pull_requests/62 - Use fp32 for KQ in Metal FA implementation.md
@@ -1,14 +1,17 @@
-### 🔀 [#62](https://github.com/ikawrakow/ik_llama.cpp/pull/62) - Use fp32 for K*Q in Metal FA implementation
+## 🔀 [Pull Request #62](https://github.com/ikawrakow/ik_llama.cpp/pull/62) - Use fp32 for K*Q in Metal FA implementation
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_metal_fa` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-25 |
| **Updated** | 2024-09-25 |
+| **Merged** | 2024-09-25 |
---
-#### Description
+## 📄 Description
Else some models (e.g., Qwen2-7B-Instruct) produce garbage. Borrowed from PR-9595 in mainline `llama.cpp`.
diff --git a/github-data/pull_requests/620 - Bump Windows max open files from 512 to 2048.md b/github-data/pull_requests/620 - Bump Windows max open files from 512 to 2048.md
index 18299b0f7..3bac1c897 100644
--- a/github-data/pull_requests/620 - Bump Windows max open files from 512 to 2048.md
+++ b/github-data/pull_requests/620 - Bump Windows max open files from 512 to 2048.md
@@ -1,14 +1,17 @@
-### 🔀 [#620](https://github.com/ikawrakow/ik_llama.cpp/pull/620) - Bump Windows max open files from 512 to 2048
+## 🔀 [Pull Request #620](https://github.com/ikawrakow/ik_llama.cpp/pull/620) - Bump Windows max open files from 512 to 2048
| **Author** | `Thireus` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `patch-2` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-16 |
| **Updated** | 2025-07-17 |
+| **Merged** | 2025-07-17 |
---
-#### Description
+## 📄 Description
Allows up to 2048 shards to be loaded on Windows builds, from the current default of 512. This change is specific to Windows, it instructs the Windows OS that the binary requires 2048 of max opened files. This is the equivalent to Linux's `ulimit -n`.
@@ -20,54 +23,50 @@ Allows up to 2048 shards to be loaded on Windows builds, from the current defaul
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-07-17** at **05:39:22**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-07-17** at **05:39:22** on `src/llama.cpp`:
+👤 **ikawrakow** started a conversation on `src/llama.cpp` on **2025-07-17** at **05:39:22**
Don't you want to make this dependent on the value of `GGML_MAX_CONTEXTS` instead of it being simply set to 2048?
-I don't know much about Windows, but if I understand correctly the description of the `_setmaxstdio` function, it changes the max. number of files that can be open at the same time at the stream I/O level. The default for this is 512. The Microsoft engineers must have had a reason to keep it at 512 instead of just setting it to the 8192 limit of the low I/O level. If they did have a reason, then my thinking is that ot would be wise to not increase the stream I/O limit unless necessary. It only becomes necessary if we want to use more than 512 shards, which is only possible if we have changed the value of `GGML_MAX_CONTEXTS`.
-
----
-
-👤 **saood06** submitted a review the **2025-07-17** at **06:03:53**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** submitted a review the **2025-07-17** at **06:35:12**: 💬 `COMMENTED`
-
----
-
-👤 **ikawrakow** commented during a code review the **2025-07-17** at **06:35:12** on `src/llama.cpp`:
-
-If we are sure that limitations in `CreateProcess` implementation is the only reason, then it wouldn't be an issue as `llama.cpp` is not actually spawning new processes. A file handle leak each time one starts a `llama.cpp` process is not too bad either: one simply needs to reboot their Windows box from time to time just like in the old days. Just joking. If there is indeed a file handle leak, then it is even more important to make the increase conditional upon `GGML_MAX_CONTEXTS > 512`.
-
----
-
-👤 **Thireus** submitted a review the **2025-07-17** at **06:38:36**: 💬 `COMMENTED`
-
----
-
-👤 **Thireus** commented during a code review the **2025-07-17** at **06:38:36** on `src/llama.cpp`:
-
-Change made. Please let me know if this is now acceptable.
-
----
-
-👤 **saood06** submitted a review the **2025-07-17** at **06:44:27**: 💬 `COMMENTED`
-
----
-
-👤 **saood06** commented during a code review the **2025-07-17** at **06:44:27** on `src/llama.cpp`:
-
-> If we are sure that limitations in `CreateProcess` implementation is the only reason, then it wouldn't be an issue as `llama.cpp` is not actually spawning new processes. A file handle leak each time one starts a `llama.cpp` process is not too bad either: one simply needs to reboot their Windows box from time to time just like in the old days. Just joking. If there is indeed a file handle leak, then it is even more important to make the increase conditional upon `GGML_MAX_CONTEXTS > 512`.
-
-I wouldn't take the "leak" part seriously as it is from "10 Dec 2006", just included that because it mentioned the handles. Win32 should only be needed if models large enough and people have 2048 limits (instead of 8192).
+I don't know much about Windows, but if I understand correctly the description of the `_setmaxstdio` function, it changes the max. number of files that can be open at the same time at the stream I/O level. The default for this is 512. The Microsoft engineers must have had a reason to keep it at 512 instead of just setting it to the 8192 limit of the low I/O level. If they did have a reason, then my thinking is that it would be wise to not increase the stream I/O limit unless necessary. It only becomes necessary if we want to use more than 512 shards, which is only possible if we have changed the value of `GGML_MAX_CONTEXTS`.
+
+> 👤 **saood06** replied on **2025-07-17** at **06:03:52**
+>
+> I agree, and this is what I was saying here: https://github.com/ikawrakow/ik_llama.cpp/pull/611#issuecomment-3072281429
+>
+> >The default for this is 512. The Microsoft engineers must have had a reason to keep it at 512 instead of just setting it to the 8192 limit of the low I/O level.
+>
+> Since this came up, I've looked into it, best reason I found was this (from a time when the true maximum was 2048):
+>
+> >I believe the limit has to do with the ability to inherit the open files from a CreateProcess call. The CreateProcess has only 2048 slots for passing handles (both on 32-bit and 64-bit). You can debug a program and step into the system, exec, or spawn CRT functions to see the limit of the 2048 slots.
+> >
+> >If you use the Win32 file API (CreateFile, WriteFile, ReadFile, CloseHandle, etc.), then you don't have a limit on open files (well, you do but I believe it is based on your resources like memory).
+>
+> Source: https://stackoverflow.com/questions/1803552/setmaxstdio-max-open-files-is-2048-only
+>
+> alongside this corroborating piece from https://bugs.mysql.com/bug.php?id=24509 (they also mention Win32 on that page):
+>
+> >It's a hard windows limit due to the fact of using posix-like
+> >functions in some places. I will open 2nd bug report about a
+> >handle leak when that 2048 limit is hit.
+>
+> If 2048/8192+ is wanted Win32 API might be needed (not sure how big a change that would be).
+
+> 👤 **ikawrakow** replied on **2025-07-17** at **06:35:12**
+>
+> If we are sure that limitations in `CreateProcess` implementation is the only reason, then it wouldn't be an issue as `llama.cpp` is not actually spawning new processes. A file handle leak each time one starts a `llama.cpp` process is not too bad either: one simply needs to reboot their Windows box from time to time just like in the old days. Just joking. If there is indeed a file handle leak, then it is even more important to make the increase conditional upon `GGML_MAX_CONTEXTS > 512`.
+
+> 👤 **Thireus** replied on **2025-07-17** at **06:38:36**
+>
+> Change made. Please let me know if this is now acceptable.
+
+> 👤 **saood06** replied on **2025-07-17** at **06:44:27**
+>
+> > If there is indeed a file handle leak, then it is even more important to make the increase conditional upon `GGML_MAX_CONTEXTS > 512`.
+>
+> I wouldn't take the "leak" part seriously as it is from "10 Dec 2006", just included that because it mentioned the handles. Win32 should only be needed if models large enough (much more than deepseek) and people have 2048 limits (instead of 8192).
---
-👤 **ikawrakow** submitted a review the **2025-07-17** at **06:50:15**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-17** at **06:50:15**
\ No newline at end of file
diff --git a/github-data/pull_requests/622 - Add GGML_MAX_CONTEXTS definition in CMakeLists.txt.md b/github-data/pull_requests/622 - Add GGML_MAX_CONTEXTS definition in CMakeLists.txt.md
index c9af2df67..e2b818b7e 100644
--- a/github-data/pull_requests/622 - Add GGML_MAX_CONTEXTS definition in CMakeLists.txt.md
+++ b/github-data/pull_requests/622 - Add GGML_MAX_CONTEXTS definition in CMakeLists.txt.md
@@ -1,14 +1,17 @@
-### 🔀 [#622](https://github.com/ikawrakow/ik_llama.cpp/pull/622) - Add GGML_MAX_CONTEXTS definition in CMakeLists.txt
+## 🔀 [Pull Request #622](https://github.com/ikawrakow/ik_llama.cpp/pull/622) - Add GGML_MAX_CONTEXTS definition in CMakeLists.txt
| **Author** | `Thireus` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `patch-3` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-17 |
| **Updated** | 2025-07-17 |
+| **Merged** | 2025-07-17 |
---
-#### Description
+## 📄 Description
If this entry is missing, GGML_MAX_CONTEXTS is ignored.
This is part of this request: https://github.com/ikawrakow/ik_llama.cpp/pull/611
@@ -21,6 +24,12 @@ This is part of this request: https://github.com/ikawrakow/ik_llama.cpp/pull/611
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2025-07-17** at **05:26:53**: ✅ `APPROVED`
\ No newline at end of file
+👤 **Thireus** commented on **2025-07-17** at **01:33:30**
+
+Tested and ready to merge.
+
+---
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-17** at **05:26:53**
\ No newline at end of file
diff --git a/github-data/pull_requests/624 - Quantization tweaks.md b/github-data/pull_requests/624 - Quantization tweaks.md
index bc34890ba..5e3a88df9 100644
--- a/github-data/pull_requests/624 - Quantization tweaks.md
+++ b/github-data/pull_requests/624 - Quantization tweaks.md
@@ -1,14 +1,16 @@
-### 🔀 [#624](https://github.com/ikawrakow/ik_llama.cpp/pull/624) - Quantization tweaks
+## 🔀 [Pull Request #624](https://github.com/ikawrakow/ik_llama.cpp/pull/624) - Quantization tweaks
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ✅ **Open** |
+| **Source Branch** | `ik/quantization_tweaks` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-17 |
| **Updated** | 2025-07-19 |
---
-#### Description
+## 📄 Description
Minor tweaks in the quantization methods for `Q2_K, Q3_K, Q4_K, Q5_K, IQ2_KS, IQ3_KS, IQ3_K`.
@@ -16,9 +18,16 @@ Also changed the automatic recipes to use `IQ2_KL` instead of `Q2_K`.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2025-07-17** at **16:32:39**:
+👤 **Nexesenex** commented on **2025-07-17** at **16:18:26**
+
+Hey IK.
+You devised small gains on perplexity for all those ggml_types, I presume, besides the works on the ftypes/quant strategies?
+
+---
+
+👤 **ikawrakow** commented on **2025-07-17** at **16:32:39**
> You devised small gains on perplexity for all those ggml_types, I presume, besides the works on the ftypes/quant strategies?
@@ -32,7 +41,27 @@ Just kidding. I felt lazy to do the usual evaluation with multiple models, so th
---
-👤 **ikawrakow** commented the **2025-07-18** at **05:05:47**:
+👤 **ubergarm** commented on **2025-07-18** at **03:11:51**
+
+I had just finished cooking a Kimi-K2-Instruct-IQ3_KS when I noticed this PR!
+
+So I had to cook again to compare. I also went back and re-cooked my IQ2_KS recipe. Not perfect recipes to test this thoroughly given mixed types, but a couple data points at least coming together now.
+
+The IQ2_KS looks slightly better, but the IQ3_KS seemed worse for this PR. Haven't tried others or any other tests.
+
+Full recipe is [available on the hf model card secret recipe details](https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF#-iq3_ks-427205-gib-3573-bpw)
+* The IQ3_KS uses IQ3_KS for ffn_(gate|up)_exps and IQ4_KS for ffn_down_exps
+* The IQ2_KS uses IQ2_KS for ffn_(gate|up)_exps and IQ2_KL for ffn_down_exps
+
+Also getting around to checking some perplexities of various UD quants which are also on the chart for any that I've measured myself.
+
+*EDIT* to avoid duplicating this graph everywhere [I'll keep it and data here for now.](https://github.com/ikawrakow/ik_llama.cpp/pull/616#issuecomment-3087170346).
+
+I'll update this chart once I run one more and drop it with the data over on the IQ1_KT PR were we are discussing that more specifically.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-18** at **05:05:47**
@ubergarm
@@ -44,7 +73,7 @@ I see `UD-IQ1_S` labeled as "nofmoe". Does this mean that `-fmoe` is not working
---
-👤 **ikawrakow** commented the **2025-07-18** at **06:58:19**:
+👤 **ikawrakow** commented on **2025-07-18** at **06:58:19**
> The IQ2_KS looks slightly better, but the IQ3_KS seemed worse for this PR. Haven't tried others or any other tests.
@@ -74,7 +103,873 @@ ___
---
-👤 **ubergarm** commented the **2025-07-19** at **15:08:07**:
+👤 **ubergarm** commented on **2025-07-18** at **14:09:23**
+
+> Thank you for this plot. So, the pure IQ1_KT model is basically on par with Unsloth's IQ1_S, while being 22% smaller!
+
+Yes the KT quants are looking strong in the low bpw ranges here!
+
+> Isn't the bpw for "badname-UD-TQ1_0" wrong? This model shows as just 245 GB on HF (or is HF also wrong about model sizes now?).
+
+Here is what it prints out when I start it up:
+```bash
+model=/mnt/data/models/unsloth/Kimi-K2-Instruct-GGUF/UD-TQ1_0/Kimi-K2-Instruct-UD-TQ1_0-00001-of-00005.gguf
+
+llm_load_print_meta: model ftype = IQ1_S - 1.5625 bpw
+llm_load_print_meta: model params = 1.026 T
+llm_load_print_meta: model size = 227.854 GiB (1.907 BPW)
+llm_load_print_meta: repeating layers = 226.342 GiB (1.899 BPW, 1024.059 B parameters)
+llm_load_print_meta: general.name = Kimi-K2-Instruct
+```
+
+So yes, I see I made a copy paste error and kept the size/bpw from the UD-IQ1_S, I've updated my data file now!
+
+> I see UD-IQ1_S labeled as "nofmoe". Does this mean that -fmoe is not working? I saw elsewhere a report about models failing with -fmoe, but no-one would bother to post the model quant composition so I can try to understand what is wrong. If UD-IQ1_S is failing with -fmoe, can you open an issue for that? Thanks.
+
+Correct, that UD-IQ1_S is the only one failing with `-fmoe`. You can look inside it here and change the filename in the URL to see the other GGUF splits contents: https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF?show_file_info=UD-IQ1_S%2FKimi-K2-Instruct-UD-IQ1_S-00001-of-00006.gguf
+
+I'm just catching up on the action and will open an issue and link to everything (if there is not already an issue open now).
+
+> How was the imatrix for Kimi-2 generated?
+
+Yeah I was surprised at the increase in PPL on my PR624-IQ3_KS as well and want to double check for any operator (me) error. Here is the imatrix command I ran and full logs. I've used the resulting imatrix.dat for all of my quants:
+
+
+
+👈 kimi-k2-instruct imatrix command and logs
+
+On earlier deepseek models I left out `-mla 1` but added it for this given recent discussions on attn_kv_b and such.
+
+```bash
+model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf
+
+numactl --interleave=all \
+./build/bin/llama-imatrix \
+ -m "$model" \
+ -f ubergarm-imatrix-calibration-corpus-v02.txt \
+ -o /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat \
+ -mla 1 \
+ --verbosity 1 \
+ --ctx-size 512 \
+ --layer-similarity \
+ --numa distribute \
+ --threads 384
+
+llama_model_loader: loaded meta data with 42 key-value pairs and 1157 tensors from /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-Q8_0.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Kimi K2 Instruct Bf16 Safetensors
+llama_model_loader: - kv 3: general.finetune str = Instruct-safetensors
+llama_model_loader: - kv 4: general.basename str = Kimi-K2
+llama_model_loader: - kv 5: general.size_label str = 384x15B
+llama_model_loader: - kv 6: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 7: deepseek2.context_length u32 = 131072
+llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 64
+llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 64
+llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 50000.000000
+llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 15: general.file_type u32 = 7
+llama_model_loader: - kv 16: deepseek2.leading_dense_block_count u32 = 1
+llama_model_loader: - kv 17: deepseek2.vocab_size u32 = 163840
+llama_model_loader: - kv 18: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 19: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 20: deepseek2.attention.key_length u32 = 192
+llama_model_loader: - kv 21: deepseek2.attention.value_length u32 = 128
+llama_model_loader: - kv 22: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 23: deepseek2.expert_count u32 = 384
+llama_model_loader: - kv 24: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 25: deepseek2.expert_weights_scale f32 = 2.827000
+llama_model_loader: - kv 26: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 27: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 28: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 29: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 30: deepseek2.rope.scaling.factor f32 = 32.000000
+llama_model_loader: - kv 31: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 32: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 34: tokenizer.ggml.pre str = kimi-k2
+llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,163840] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,163840] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,163328] = ["Ġ Ġ", "ĠĠ ĠĠ", "Ġ t", "i n",...
+llama_model_loader: - kv 38: tokenizer.ggml.bos_token_id u32 = 163584
+llama_model_loader: - kv 39: tokenizer.ggml.eos_token_id u32 = 163585
+llama_model_loader: - kv 40: tokenizer.chat_template str = {% if tools -%}\n {{ '<|im_system|>...
+llama_model_loader: - kv 41: general.quantization_version u32 = 2
+llama_model_loader: - type f32: 365 tensors
+llama_model_loader: - type q8_0: 792 tensors
+llm_load_vocab: special tokens cache size = 256
+llm_load_vocab: token to piece cache size = 1.0607 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 163840
+llm_load_print_meta: n_merges = 163328
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 131072
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 64
+llm_load_print_meta: n_head_kv = 64
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 12288
+llm_load_print_meta: n_embd_v_gqa = 8192
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 384
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 50000.0
+llm_load_print_meta: freq_scale_train = 0.03125
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = 671B
+llm_load_print_meta: model ftype = Q8_0
+llm_load_print_meta: model params = 1.027 T
+llm_load_print_meta: model size = 1016.623 GiB (8.504 BPW)
+llm_load_print_meta: repeating layers = 1014.299 GiB (8.504 BPW, 1024.571 B parameters)
+llm_load_print_meta: general.name = Kimi K2 Instruct Bf16 Safetensors
+llm_load_print_meta: BOS token = 163584 '[BOS]'
+llm_load_print_meta: EOS token = 163585 '[EOS]'
+llm_load_print_meta: LF token = 128 'Ä'
+llm_load_print_meta: EOT token = 163586 '<|im_end|>'
+llm_load_print_meta: max token length = 512
+llm_load_print_meta: n_layer_dense_lead = 1
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.8
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+llm_load_tensors: ggml ctx size = 0.47 MiB
+llm_load_tensors: CPU buffer size = 1041021.91 MiB
+....................................................................................................
+llama_new_context_with_model: n_ctx = 512
+llama_new_context_with_model: n_batch = 512
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 0
+llama_new_context_with_model: mla_attn = 1
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 50000.0
+llama_new_context_with_model: freq_scale = 0.03125
+llama_kv_cache_init: CPU KV buffer size = 64.81 MiB
+llama_new_context_with_model: KV self size = 64.81 MiB, c^KV (f16): 34.31 MiB, kv^T (f16): 30.50 MiB
+llama_new_context_with_model: CPU output buffer size = 0.63 MiB
+llama_new_context_with_model: CPU compute buffer size = 334.00 MiB
+llama_new_context_with_model: graph nodes = 3827
+llama_new_context_with_model: graph splits = 1
+
+system_info: n_threads = 384 / 768 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
+compute_imatrix: tokenizing the input ..
+compute_imatrix: tokenization took 835.506 ms
+compute_imatrix: computing over 826 chunks with batch_size 512
+compute_imatrix: 43.88 seconds per pass - ETA 10 hours 4.05 minutes
+[1]75.3007,[2]13.9305,[3]6.7296,[4]4.1851,[5]3.2372,[6]2.6987,[7]2.3609,[8]2.1425,[9]2.0965,
+save_imatrix: entry ' blk.59.ffn_down_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.59.ffn_up_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.59.ffn_gate_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.58.ffn_down_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.56.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.56.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.54.ffn_down_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.54.ffn_up_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.52.ffn_down_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.52.ffn_up_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.52.ffn_gate_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.51.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.51.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.57.ffn_gate_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.49.ffn_gate_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.54.ffn_gate_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.48.ffn_up_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.47.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.47.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.46.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.46.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.46.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.45.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.22.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.33.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.19.ffn_down_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.57.ffn_down_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.45.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.21.ffn_gate_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.19.ffn_up_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.18.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.17.ffn_gate_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.20.ffn_up_exps.weight' has partial data (98.70%) 5 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.48.ffn_down_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.8.ffn_down_exps.weight' has partial data (98.18%) 7 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.14.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.1.ffn_gate_exps.weight' has partial data (98.44%) 6 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.1.ffn_down_exps.weight' has partial data (98.44%) 6 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.47.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.11.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.11.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.21.ffn_down_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.34.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.4.ffn_down_exps.weight' has partial data (97.40%) 10 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.51.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.20.ffn_gate_exps.weight' has partial data (98.70%) 5 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.36.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.16.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.13.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.5.ffn_up_exps.weight' has partial data (97.66%) 9 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.38.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.3.ffn_gate_exps.weight' has partial data (98.18%) 7 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.4.ffn_gate_exps.weight' has partial data (97.40%) 10 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.44.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.58.ffn_gate_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.48.ffn_gate_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.6.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.34.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.5.ffn_down_exps.weight' has partial data (97.66%) 9 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.30.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.7.ffn_up_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.44.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.17.ffn_down_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.25.ffn_down_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.25.ffn_up_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.58.ffn_up_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.57.ffn_up_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.13.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.39.ffn_gate_exps.weight' has partial data (98.70%) 5 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.1.ffn_up_exps.weight' has partial data (98.44%) 6 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.18.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.2.ffn_gate_exps.weight' has partial data (97.92%) 8 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.25.ffn_gate_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.41.ffn_up_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.3.ffn_down_exps.weight' has partial data (98.18%) 7 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.45.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.40.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.24.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.14.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.31.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.21.ffn_up_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.36.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.41.ffn_gate_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.37.ffn_up_exps.weight' has partial data (98.70%) 5 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.37.ffn_down_exps.weight' has partial data (98.70%) 5 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.11.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.6.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.34.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.6.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.2.ffn_down_exps.weight' has partial data (97.92%) 8 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.17.ffn_up_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.49.ffn_down_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.33.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.3.ffn_up_exps.weight' has partial data (98.18%) 7 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.7.ffn_down_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.44.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.49.ffn_up_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.22.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.22.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.14.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.56.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.23.ffn_gate_exps.weight' has partial data (98.44%) 6 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.23.ffn_up_exps.weight' has partial data (98.44%) 6 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.23.ffn_down_exps.weight' has partial data (98.44%) 6 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.16.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.24.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.24.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.18.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.26.ffn_gate_exps.weight' has partial data (97.66%) 9 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.26.ffn_up_exps.weight' has partial data (97.66%) 9 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.26.ffn_down_exps.weight' has partial data (97.66%) 9 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.27.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.4.ffn_up_exps.weight' has partial data (97.40%) 10 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.27.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.27.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.38.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.31.ffn_up_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.28.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.28.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.28.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.42.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.29.ffn_gate_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.29.ffn_up_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.29.ffn_down_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.30.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.30.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.8.ffn_up_exps.weight' has partial data (98.18%) 7 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.43.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.36.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.40.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.13.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.31.ffn_down_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.16.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.39.ffn_up_exps.weight' has partial data (98.70%) 5 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.33.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.5.ffn_gate_exps.weight' has partial data (97.66%) 9 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.20.ffn_down_exps.weight' has partial data (98.70%) 5 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.35.ffn_up_exps.weight' has partial data (97.92%) 8 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.35.ffn_down_exps.weight' has partial data (97.92%) 8 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.40.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.37.ffn_gate_exps.weight' has partial data (98.70%) 5 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.8.ffn_gate_exps.weight' has partial data (98.18%) 7 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.39.ffn_down_exps.weight' has partial data (98.70%) 5 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.38.ffn_gate_exps.weight' has partial data (99.22%) 3 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.2.ffn_up_exps.weight' has partial data (97.92%) 8 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.7.ffn_gate_exps.weight' has partial data (98.96%) 4 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.41.ffn_down_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.42.ffn_gate_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.35.ffn_gate_exps.weight' has partial data (97.92%) 8 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.42.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.43.ffn_up_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.19.ffn_gate_exps.weight' has partial data (99.74%) 1 out of 384 experts are missing data Storing **but be aware**
+save_imatrix: entry ' blk.43.ffn_down_exps.weight' has partial data (99.48%) 2 out of 384 experts are missing data Storing **but be aware**
+
+save_imatrix: stored collected data after 10 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[10]2.0447,[11]2.0553,[12]2.2739,[13]2.3537,[14]2.3295,[15]2.2035,[16]2.1080,[17]2.0208,[18]1.9580,[19]1.8930,
+save_imatrix: stored collected data after 20 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[20]1.8448,[21]1.7927,[22]1.7578,[23]1.7213,[24]1.6852,[25]1.6508,[26]1.7266,[27]1.8283,[28]1.8931,[29]1.8844,
+save_imatrix: stored collected data after 30 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[30]1.8766,[31]1.8525,[32]1.8491,[33]1.8515,[34]1.8373,[35]1.8234,[36]1.8112,[37]1.8104,[38]1.8069,[39]1.7878,
+save_imatrix: stored collected data after 40 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[40]1.7629,[41]1.7450,[42]1.7292,[43]1.7117,[44]1.6987,[45]1.6951,[46]1.6825,[47]1.6741,[48]1.6705,[49]1.6613,
+save_imatrix: stored collected data after 50 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[50]1.6534,[51]1.6677,[52]1.6804,[53]1.6799,[54]1.6973,[55]1.7078,[56]1.7172,[57]1.7084,[58]1.7473,[59]1.7778,
+save_imatrix: stored collected data after 60 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[60]1.8056,[61]1.8425,[62]1.8813,[63]1.9221,[64]1.9550,[65]2.0082,[66]2.0360,[67]2.0632,[68]2.1073,[69]2.1413,
+save_imatrix: stored collected data after 70 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[70]2.1653,[71]2.1940,[72]2.2106,[73]2.2301,[74]2.2592,[75]2.2820,[76]2.2968,[77]2.3122,[78]2.3190,[79]2.3225,
+save_imatrix: stored collected data after 80 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[80]2.3332,[81]2.3528,[82]2.3927,[83]2.4148,[84]2.4200,[85]2.4355,[86]2.4338,[87]2.4763,[88]2.5016,[89]2.5260,
+save_imatrix: stored collected data after 90 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[90]2.5510,[91]2.5572,[92]2.5870,[93]2.5924,[94]2.5996,[95]2.6035,[96]2.6109,[97]2.6077,[98]2.6330,[99]2.6132,
+save_imatrix: stored collected data after 100 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[100]2.6467,[101]2.6665,[102]2.6550,[103]2.6847,[104]2.7280,[105]2.7568,[106]2.7900,[107]2.8202,[108]2.8503,[109]2.8766,
+save_imatrix: stored collected data after 110 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[110]2.8644,[111]2.8817,[112]2.8940,[113]2.9070,[114]2.9062,[115]2.9370,[116]2.9746,[117]2.9949,[118]2.9870,[119]2.9623,
+save_imatrix: stored collected data after 120 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[120]2.9471,[121]2.9625,[122]2.9627,[123]2.9414,[124]2.9315,[125]2.9299,[126]2.9324,[127]2.9374,[128]2.9410,[129]2.9433,
+save_imatrix: stored collected data after 130 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[130]2.9611,[131]2.9937,[132]3.0287,[133]3.0204,[134]2.9960,[135]2.9719,[136]2.9483,[137]2.9260,[138]2.9292,[139]2.9501,
+save_imatrix: stored collected data after 140 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[140]2.9749,[141]3.0066,[142]3.0007,[143]3.0139,[144]3.0319,[145]3.0514,[146]3.0670,[147]3.0876,[148]3.1107,[149]3.1310,
+save_imatrix: stored collected data after 150 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[150]3.1519,[151]3.1509,[152]3.1542,[153]3.1568,[154]3.1834,[155]3.1951,[156]3.2028,[157]3.2163,[158]3.2280,[159]3.2295,
+save_imatrix: stored collected data after 160 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[160]3.2334,[161]3.2440,[162]3.2529,[163]3.2575,[164]3.2713,[165]3.2735,[166]3.2772,[167]3.2836,[168]3.2885,[169]3.2943,
+save_imatrix: stored collected data after 170 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[170]3.2882,[171]3.3057,[172]3.3126,[173]3.3172,[174]3.3278,[175]3.3394,[176]3.3374,[177]3.3441,[178]3.3507,[179]3.3664,
+save_imatrix: stored collected data after 180 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[180]3.3795,[181]3.3895,[182]3.3839,[183]3.3796,[184]3.3757,[185]3.3701,[186]3.3651,[187]3.3589,[188]3.3538,[189]3.3612,
+save_imatrix: stored collected data after 190 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[190]3.3734,[191]3.4025,[192]3.4251,[193]3.4465,[194]3.4784,[195]3.5022,[196]3.5219,[197]3.5393,[198]3.5498,[199]3.5526,
+save_imatrix: stored collected data after 200 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[200]3.5398,[201]3.5190,[202]3.4978,[203]3.5173,[204]3.5273,[205]3.5347,[206]3.5496,[207]3.5697,[208]3.5833,[209]3.5974,
+save_imatrix: stored collected data after 210 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[210]3.6182,[211]3.6258,[212]3.6256,[213]3.6040,[214]3.5825,[215]3.5618,[216]3.5414,[217]3.5210,[218]3.5009,[219]3.4832,
+save_imatrix: stored collected data after 220 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[220]3.4740,[221]3.4707,[222]3.4537,[223]3.4414,[224]3.4436,[225]3.4451,[226]3.4657,[227]3.4843,[228]3.4953,[229]3.5154,
+save_imatrix: stored collected data after 230 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[230]3.5062,[231]3.5273,[232]3.5476,[233]3.5552,[234]3.5718,[235]3.5840,[236]3.6057,[237]3.6260,[238]3.6246,[239]3.6336,
+save_imatrix: stored collected data after 240 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[240]3.6456,[241]3.6565,[242]3.6775,[243]3.6929,[244]3.7058,[245]3.7153,[246]3.7066,[247]3.7349,[248]3.7442,[249]3.7650,
+save_imatrix: stored collected data after 250 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[250]3.7733,[251]3.7784,[252]3.7874,[253]3.7947,[254]3.8057,[255]3.8116,[256]3.8196,[257]3.8304,[258]3.8384,[259]3.8467,
+save_imatrix: stored collected data after 260 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[260]3.8583,[261]3.8709,[262]3.8819,[263]3.8946,[264]3.8779,[265]3.8819,[266]3.8895,[267]3.8978,[268]3.9040,[269]3.9176,
+save_imatrix: stored collected data after 270 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[270]3.9365,[271]3.9377,[272]3.9451,[273]3.9542,[274]3.9638,[275]3.9775,[276]3.9886,[277]3.9999,[278]4.0093,[279]4.0130,
+save_imatrix: stored collected data after 280 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[280]4.0236,[281]4.0310,[282]4.0362,[283]4.0499,[284]4.0520,[285]4.0557,[286]4.0540,[287]4.0495,[288]4.0615,[289]4.0583,
+save_imatrix: stored collected data after 290 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[290]4.0618,[291]4.0807,[292]4.0948,[293]4.1044,[294]4.1210,[295]4.1255,[296]4.1441,[297]4.1552,[298]4.1710,[299]4.1837,
+save_imatrix: stored collected data after 300 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[300]4.1961,[301]4.2035,[302]4.2221,[303]4.2279,[304]4.2312,[305]4.2356,[306]4.2510,[307]4.2603,[308]4.2672,[309]4.2743,
+save_imatrix: stored collected data after 310 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[310]4.2823,[311]4.2950,[312]4.3023,[313]4.3084,[314]4.3195,[315]4.3304,[316]4.3446,[317]4.3474,[318]4.3325,[319]4.3156,
+save_imatrix: stored collected data after 320 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[320]4.3037,[321]4.2875,[322]4.2860,[323]4.2880,[324]4.2691,[325]4.2831,[326]4.2950,[327]4.2992,[328]4.3054,[329]4.3034,
+save_imatrix: stored collected data after 330 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[330]4.3036,[331]4.3185,[332]4.3162,[333]4.3262,[334]4.3406,[335]4.3495,[336]4.3521,[337]4.3418,[338]4.3541,[339]4.3684,
+save_imatrix: stored collected data after 340 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[340]4.3836,[341]4.3974,[342]4.4151,[343]4.4409,[344]4.4443,[345]4.4444,[346]4.4450,[347]4.4501,[348]4.4658,[349]4.4725,
+save_imatrix: stored collected data after 350 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[350]4.4720,[351]4.4696,[352]4.4761,[353]4.4682,[354]4.4626,[355]4.4568,[356]4.4533,[357]4.4560,[358]4.4654,[359]4.4638,
+save_imatrix: stored collected data after 360 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[360]4.4644,[361]4.4488,[362]4.4307,[363]4.4157,[364]4.4027,[365]4.3857,[366]4.3739,[367]4.3623,[368]4.3509,[369]4.3422,
+save_imatrix: stored collected data after 370 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[370]4.3307,[371]4.3225,[372]4.3189,[373]4.3071,[374]4.2967,[375]4.2871,[376]4.2740,[377]4.2640,[378]4.2608,[379]4.2489,
+save_imatrix: stored collected data after 380 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[380]4.2445,[381]4.2489,[382]4.2369,[383]4.2320,[384]4.2229,[385]4.2074,[386]4.1938,[387]4.1897,[388]4.1818,[389]4.1666,
+save_imatrix: stored collected data after 390 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[390]4.1514,[391]4.1363,[392]4.1340,[393]4.1288,[394]4.1247,[395]4.1171,[396]4.1098,[397]4.1043,[398]4.0902,[399]4.0791,
+save_imatrix: stored collected data after 400 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[400]4.0756,[401]4.0638,[402]4.0550,[403]4.0447,[404]4.0398,[405]4.0295,[406]4.0172,[407]4.0072,[408]3.9978,[409]3.9892,
+save_imatrix: stored collected data after 410 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[410]3.9827,[411]3.9762,[412]3.9700,[413]3.9650,[414]3.9583,[415]3.9507,[416]3.9439,[417]3.9313,[418]3.9186,[419]3.9059,
+save_imatrix: stored collected data after 420 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[420]3.8960,[421]3.8844,[422]3.8736,[423]3.8632,[424]3.8510,[425]3.8421,[426]3.8321,[427]3.8206,[428]3.8094,[429]3.8012,
+save_imatrix: stored collected data after 430 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[430]3.7922,[431]3.7851,[432]3.7848,[433]3.7833,[434]3.7796,[435]3.7709,[436]3.7662,[437]3.7549,[438]3.7445,[439]3.7341,
+save_imatrix: stored collected data after 440 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[440]3.7248,[441]3.7155,[442]3.7139,[443]3.7058,[444]3.7040,[445]3.6987,[446]3.6918,[447]3.6922,[448]3.6879,[449]3.6824,
+save_imatrix: stored collected data after 450 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[450]3.6737,[451]3.6644,[452]3.6633,[453]3.6562,[454]3.6460,[455]3.6361,[456]3.6273,[457]3.6184,[458]3.6090,[459]3.6005,
+save_imatrix: stored collected data after 460 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[460]3.5919,[461]3.5893,[462]3.5814,[463]3.5770,[464]3.5740,[465]3.5712,[466]3.5668,[467]3.5624,[468]3.5589,[469]3.5547,
+save_imatrix: stored collected data after 470 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[470]3.5509,[471]3.5473,[472]3.5433,[473]3.5394,[474]3.5355,[475]3.5315,[476]3.5277,[477]3.5225,[478]3.5150,[479]3.5061,
+save_imatrix: stored collected data after 480 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[480]3.5025,[481]3.4995,[482]3.4998,[483]3.4914,[484]3.4839,[485]3.4784,[486]3.4715,[487]3.4640,[488]3.4572,[489]3.4516,
+save_imatrix: stored collected data after 490 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[490]3.4454,[491]3.4428,[492]3.4370,[493]3.4317,[494]3.4278,[495]3.4239,[496]3.4174,[497]3.4155,[498]3.4167,[499]3.4200,
+save_imatrix: stored collected data after 500 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[500]3.4197,[501]3.4218,[502]3.4232,[503]3.4190,[504]3.4123,[505]3.4200,[506]3.4290,[507]3.4385,[508]3.4462,[509]3.4529,
+save_imatrix: stored collected data after 510 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[510]3.4615,[511]3.4646,[512]3.4667,[513]3.4676,[514]3.4706,[515]3.4731,[516]3.4767,[517]3.4750,[518]3.4901,[519]3.5010,
+save_imatrix: stored collected data after 520 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[520]3.5139,[521]3.5209,[522]3.5251,[523]3.5289,[524]3.5327,[525]3.5361,[526]3.5361,[527]3.5358,[528]3.5403,[529]3.5435,
+save_imatrix: stored collected data after 530 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[530]3.5473,[531]3.5494,[532]3.5515,[533]3.5546,[534]3.5587,[535]3.5615,[536]3.5674,[537]3.5684,[538]3.5739,[539]3.5771,
+save_imatrix: stored collected data after 540 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[540]3.5802,[541]3.5840,[542]3.5871,[543]3.5894,[544]3.5886,[545]3.5892,[546]3.5910,[547]3.5931,[548]3.5945,[549]3.5971,
+save_imatrix: stored collected data after 550 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[550]3.6003,[551]3.6015,[552]3.6033,[553]3.6079,[554]3.6123,[555]3.6171,[556]3.6223,[557]3.6286,[558]3.6324,[559]3.6357,
+save_imatrix: stored collected data after 560 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[560]3.6387,[561]3.6410,[562]3.6424,[563]3.6404,[564]3.6422,[565]3.6436,[566]3.6446,[567]3.6411,[568]3.6420,[569]3.6433,
+save_imatrix: stored collected data after 570 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[570]3.6410,[571]3.6428,[572]3.6429,[573]3.6457,[574]3.6454,[575]3.6411,[576]3.6340,[577]3.6323,[578]3.6305,[579]3.6284,
+save_imatrix: stored collected data after 580 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[580]3.6253,[581]3.6211,[582]3.6139,[583]3.6082,[584]3.6009,[585]3.5942,[586]3.5871,[587]3.5798,[588]3.5804,[589]3.5793,
+save_imatrix: stored collected data after 590 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[590]3.5794,[591]3.5802,[592]3.5782,[593]3.5778,[594]3.5768,[595]3.5747,[596]3.5718,[597]3.5710,[598]3.5696,[599]3.5690,
+save_imatrix: stored collected data after 600 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[600]3.5648,[601]3.5615,[602]3.5597,[603]3.5551,[604]3.5510,[605]3.5483,[606]3.5472,[607]3.5461,[608]3.5469,[609]3.5459,
+save_imatrix: stored collected data after 610 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[610]3.5453,[611]3.5397,[612]3.5378,[613]3.5400,[614]3.5415,[615]3.5407,[616]3.5394,[617]3.5361,[618]3.5349,[619]3.5337,
+save_imatrix: stored collected data after 620 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[620]3.5347,[621]3.5329,[622]3.5301,[623]3.5310,[624]3.5312,[625]3.5254,[626]3.5200,[627]3.5136,[628]3.5078,[629]3.5014,
+save_imatrix: stored collected data after 630 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[630]3.4957,[631]3.4895,[632]3.4836,[633]3.4770,[634]3.4703,[635]3.4635,[636]3.4626,[637]3.4563,[638]3.4515,[639]3.4457,
+save_imatrix: stored collected data after 640 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[640]3.4398,[641]3.4353,[642]3.4294,[643]3.4256,[644]3.4250,[645]3.4190,[646]3.4152,[647]3.4143,[648]3.4103,[649]3.4039,
+save_imatrix: stored collected data after 650 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[650]3.3983,[651]3.3924,[652]3.3864,[653]3.3805,[654]3.3747,[655]3.3688,[656]3.3630,[657]3.3573,[658]3.3514,[659]3.3458,
+save_imatrix: stored collected data after 660 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[660]3.3409,[661]3.3353,[662]3.3296,[663]3.3238,[664]3.3182,[665]3.3125,[666]3.3068,[667]3.3011,[668]3.2956,[669]3.2900,
+save_imatrix: stored collected data after 670 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[670]3.2843,[671]3.2865,[672]3.2869,[673]3.2903,[674]3.2889,[675]3.2839,[676]3.2800,[677]3.2753,[678]3.2702,[679]3.2666,
+save_imatrix: stored collected data after 680 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[680]3.2610,[681]3.2564,[682]3.2516,[683]3.2466,[684]3.2415,[685]3.2367,[686]3.2326,[687]3.2281,[688]3.2234,[689]3.2184,
+save_imatrix: stored collected data after 690 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[690]3.2148,[691]3.2097,[692]3.2057,[693]3.2010,[694]3.1958,[695]3.1906,[696]3.1882,[697]3.1832,[698]3.1796,[699]3.1764,
+save_imatrix: stored collected data after 700 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[700]3.1721,[701]3.1699,[702]3.1651,[703]3.1604,[704]3.1560,[705]3.1522,[706]3.1483,[707]3.1459,[708]3.1452,[709]3.1441,
+save_imatrix: stored collected data after 710 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[710]3.1427,[711]3.1413,[712]3.1399,[713]3.1385,[714]3.1382,[715]3.1372,[716]3.1365,[717]3.1354,[718]3.1341,[719]3.1326,
+save_imatrix: stored collected data after 720 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[720]3.1318,[721]3.1303,[722]3.1288,[723]3.1280,[724]3.1278,[725]3.1290,[726]3.1302,[727]3.1323,[728]3.1336,[729]3.1355,
+save_imatrix: stored collected data after 730 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[730]3.1371,[731]3.1400,[732]3.1418,[733]3.1421,[734]3.1435,[735]3.1451,[736]3.1465,[737]3.1480,[738]3.1508,[739]3.1527,
+save_imatrix: stored collected data after 740 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[740]3.1546,[741]3.1554,[742]3.1561,[743]3.1575,[744]3.1610,[745]3.1629,[746]3.1640,[747]3.1647,[748]3.1657,[749]3.1682,
+save_imatrix: stored collected data after 750 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[750]3.1696,[751]3.1716,[752]3.1728,[753]3.1751,[754]3.1767,[755]3.1779,[756]3.1786,[757]3.1800,[758]3.1820,[759]3.1838,
+save_imatrix: stored collected data after 760 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[760]3.1855,[761]3.1872,[762]3.1894,[763]3.1908,[764]3.1931,[765]3.1945,[766]3.1958,[767]3.1970,[768]3.1996,[769]3.2020,
+save_imatrix: stored collected data after 770 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[770]3.2032,[771]3.2049,[772]3.2065,[773]3.2076,[774]3.2098,[775]3.2108,[776]3.2133,[777]3.2140,[778]3.2160,[779]3.2173,
+save_imatrix: stored collected data after 780 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[780]3.2185,[781]3.2198,[782]3.2218,[783]3.2232,[784]3.2250,[785]3.2258,[786]3.2273,[787]3.2296,[788]3.2318,[789]3.2344,
+save_imatrix: stored collected data after 790 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[790]3.2348,[791]3.2370,[792]3.2377,[793]3.2396,[794]3.2418,[795]3.2442,[796]3.2452,[797]3.2464,[798]3.2484,[799]3.2498,
+save_imatrix: stored collected data after 800 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[800]3.2509,[801]3.2531,[802]3.2543,[803]3.2562,[804]3.2570,[805]3.2588,[806]3.2606,[807]3.2614,[808]3.2615,[809]3.2624,
+save_imatrix: stored collected data after 810 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[810]3.2642,[811]3.2663,[812]3.2670,[813]3.2674,[814]3.2696,[815]3.2714,[816]3.2731,[817]3.2749,[818]3.2766,[819]3.2782,
+save_imatrix: stored collected data after 820 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+[820]3.2799,[821]3.2816,[822]3.2831,[823]3.2844,[824]3.2857,[825]3.2868,[826]3.2880,
+save_imatrix: stored collected data after 826 chunks in /mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/imatrix-Kimi-K2-Instruct-Q8_0.dat
+
+Final estimate: PPL = 3.2880 +/- 0.01495
+
+llama_print_timings: load time = 44750.06 ms
+llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: prompt eval time = 11519326.21 ms / 422912 tokens ( 27.24 ms per token, 36.71 tokens per second)
+llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
+llama_print_timings: total time = 11547591.52 ms / 422913 tokens
+
+======================== sorted layer importances
+ 0: Layer 0, = 0.54895
+ 1: Layer 3, = 0.78673
+ 2: Layer 60, = 0.801527
+ 3: Layer 1, = 0.813399
+ 4: Layer 2, = 0.818612
+ 5: Layer 6, = 0.861989
+ 6: Layer 5, = 0.867296
+ 7: Layer 4, = 0.882697
+ 8: Layer 13, = 0.903652
+ 9: Layer 14, = 0.904641
+ 10: Layer 12, = 0.908486
+ 11: Layer 10, = 0.910599
+ 12: Layer 15, = 0.912134
+ 13: Layer 8, = 0.920342
+ 14: Layer 59, = 0.920597
+ 15: Layer 16, = 0.920976
+ 16: Layer 18, = 0.921181
+ 17: Layer 17, = 0.921184
+ 18: Layer 7, = 0.921782
+ 19: Layer 11, = 0.925572
+ 20: Layer 9, = 0.928968
+ 21: Layer 23, = 0.930959
+ 22: Layer 24, = 0.931823
+ 23: Layer 20, = 0.932656
+ 24: Layer 21, = 0.933109
+ 25: Layer 22, = 0.933117
+ 26: Layer 19, = 0.9381
+ 27: Layer 27, = 0.938829
+ 28: Layer 25, = 0.93897
+ 29: Layer 29, = 0.942702
+ 30: Layer 26, = 0.944389
+ 31: Layer 30, = 0.944426
+ 32: Layer 28, = 0.944919
+ 33: Layer 58, = 0.95032
+ 34: Layer 32, = 0.950972
+ 35: Layer 31, = 0.951155
+ 36: Layer 34, = 0.955866
+ 37: Layer 57, = 0.956497
+ 38: Layer 56, = 0.956722
+ 39: Layer 35, = 0.957403
+ 40: Layer 55, = 0.959099
+ 41: Layer 33, = 0.960765
+ 42: Layer 54, = 0.964028
+ 43: Layer 36, = 0.964487
+ 44: Layer 43, = 0.965472
+ 45: Layer 37, = 0.965515
+ 46: Layer 53, = 0.967063
+ 47: Layer 42, = 0.967151
+ 48: Layer 38, = 0.967854
+ 49: Layer 52, = 0.969066
+ 50: Layer 39, = 0.969307
+ 51: Layer 40, = 0.970395
+ 52: Layer 51, = 0.971089
+ 53: Layer 50, = 0.971141
+ 54: Layer 49, = 0.972038
+ 55: Layer 41, = 0.972043
+ 56: Layer 46, = 0.972849
+ 57: Layer 45, = 0.973011
+ 58: Layer 44, = 0.973383
+ 59: Layer 47, = 0.974174
+ 60: Layer 48, = 0.975424
+
+======================== sorted attention importances
+ 0: Layer 0, = 0.528929
+ 1: Layer 3, = 0.609517
+ 2: Layer 2, = 0.737416
+ 3: Layer 1, = 0.765176
+ 4: Layer 6, = 0.822542
+ 5: Layer 4, = 0.852033
+ 6: Layer 8, = 0.869524
+ 7: Layer 5, = 0.870499
+ 8: Layer 10, = 0.872662
+ 9: Layer 9, = 0.879495
+ 10: Layer 7, = 0.883822
+ 11: Layer 14, = 0.898271
+ 12: Layer 12, = 0.899972
+ 13: Layer 13, = 0.912961
+ 14: Layer 15, = 0.918265
+ 15: Layer 11, = 0.926531
+ 16: Layer 18, = 0.934695
+ 17: Layer 16, = 0.937328
+ 18: Layer 17, = 0.941984
+ 19: Layer 23, = 0.944046
+ 20: Layer 20, = 0.945272
+ 21: Layer 28, = 0.946108
+ 22: Layer 31, = 0.946182
+ 23: Layer 43, = 0.947348
+ 24: Layer 25, = 0.948715
+ 25: Layer 32, = 0.950976
+ 26: Layer 22, = 0.953634
+ 27: Layer 21, = 0.953882
+ 28: Layer 30, = 0.953936
+ 29: Layer 24, = 0.954201
+ 30: Layer 29, = 0.95446
+ 31: Layer 19, = 0.955263
+ 32: Layer 38, = 0.956067
+ 33: Layer 27, = 0.95718
+ 34: Layer 34, = 0.957277
+ 35: Layer 26, = 0.958979
+ 36: Layer 35, = 0.961119
+ 37: Layer 39, = 0.961912
+ 38: Layer 36, = 0.962639
+ 39: Layer 33, = 0.962902
+ 40: Layer 45, = 0.963958
+ 41: Layer 54, = 0.964774
+ 42: Layer 49, = 0.966144
+ 43: Layer 37, = 0.967254
+ 44: Layer 55, = 0.967591
+ 45: Layer 42, = 0.967956
+ 46: Layer 57, = 0.968065
+ 47: Layer 59, = 0.968123
+ 48: Layer 56, = 0.968696
+ 49: Layer 60, = 0.969505
+ 50: Layer 40, = 0.969653
+ 51: Layer 58, = 0.969745
+ 52: Layer 52, = 0.970129
+ 53: Layer 41, = 0.970522
+ 54: Layer 50, = 0.972281
+ 55: Layer 47, = 0.972728
+ 56: Layer 44, = 0.974193
+ 57: Layer 48, = 0.974345
+ 58: Layer 46, = 0.978292
+ 59: Layer 51, = 0.979166
+ 60: Layer 53, = 0.979395
+
+======================== sorted ffn importances
+ 0: Layer 2, = 0.600363
+ 1: Layer 0, = 0.78679
+ 2: Layer 1, = 0.78881
+ 3: Layer 60, = 0.804641
+ 4: Layer 3, = 0.814239
+ 5: Layer 8, = 0.846997
+ 6: Layer 5, = 0.848068
+ 7: Layer 7, = 0.850158
+ 8: Layer 6, = 0.857595
+ 9: Layer 9, = 0.862339
+ 10: Layer 4, = 0.873048
+ 11: Layer 13, = 0.882637
+ 12: Layer 11, = 0.887424
+ 13: Layer 10, = 0.899452
+ 14: Layer 12, = 0.902722
+ 15: Layer 14, = 0.910508
+ 16: Layer 15, = 0.920924
+ 17: Layer 17, = 0.922436
+ 18: Layer 16, = 0.924198
+ 19: Layer 27, = 0.927228
+ 20: Layer 22, = 0.927292
+ 21: Layer 19, = 0.930707
+ 22: Layer 18, = 0.931487
+ 23: Layer 24, = 0.932161
+ 24: Layer 42, = 0.932389
+ 25: Layer 59, = 0.932592
+ 26: Layer 30, = 0.932863
+ 27: Layer 20, = 0.936043
+ 28: Layer 31, = 0.938531
+ 29: Layer 21, = 0.938706
+ 30: Layer 25, = 0.941162
+ 31: Layer 37, = 0.941747
+ 32: Layer 26, = 0.941901
+ 33: Layer 23, = 0.942403
+ 34: Layer 29, = 0.942694
+ 35: Layer 28, = 0.944772
+ 36: Layer 32, = 0.945089
+ 37: Layer 38, = 0.94832
+ 38: Layer 35, = 0.94834
+ 39: Layer 33, = 0.949204
+ 40: Layer 44, = 0.950097
+ 41: Layer 53, = 0.951411
+ 42: Layer 58, = 0.951411
+ 43: Layer 34, = 0.951841
+ 44: Layer 36, = 0.952884
+ 45: Layer 48, = 0.953085
+ 46: Layer 55, = 0.953457
+ 47: Layer 41, = 0.954042
+ 48: Layer 51, = 0.954259
+ 49: Layer 56, = 0.954951
+ 50: Layer 39, = 0.955321
+ 51: Layer 57, = 0.955975
+ 52: Layer 40, = 0.956567
+ 53: Layer 54, = 0.956844
+ 54: Layer 49, = 0.956899
+ 55: Layer 46, = 0.957188
+ 56: Layer 43, = 0.958947
+ 57: Layer 47, = 0.959267
+ 58: Layer 52, = 0.963017
+ 59: Layer 50, = 0.963043
+ 60: Layer 45, = 0.964393
+```
+
+
+
+---
+
+👤 **ubergarm** commented on **2025-07-18** at **15:14:24**
+
+The quickest way for me to test some more IQ3_KS tensors with this PR is to re-do my Kimi-K2-Instruct IQ2_KL which uses:
+
+* llama_model_loader: - type iq3_ks: 60 tensors ffn_down_exps
+* llama_model_loader: - type iq2_kl: 120 tensors ffn_(gate|up)_exps
+
+Those ffn_down_exps are the only tensors in the recipe affected by this PR so can compare before/after. I'll update this comment with results soon:
+
+*WIP* *TODO* add results for https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF#-iq2_kl-345687-gib-2892-bpw
+
+Well using this PR624 for iq3_ks ffn_down_exps increased PPl slightly for PR624-IQ2_KL as shown on graph and data over at: https://github.com/ikawrakow/ik_llama.cpp/pull/616#issuecomment-3087170346
+
+* main-IQ2_KL 3.2741 +/- 0.01689
+* PR-IQ2_KL 3.3055 +/- 0.01709
+
+So far the two tests that show increasing/worse perplexity were specifically related to using IQ3_KS with routed experts tensors `ffn_*_exps` on kimi-k2... This model is a pain to work with given its size hah...
+
+I'll try out some more small tests with my set of Qwen3-14B as that should be faster.
+
+---
+
+👤 **ubergarm** commented on **2025-07-18** at **18:37:43**
+
+Well I took some of my old Qwen3-14B quants and added to the set to compare all the types involved here between main and this PR. This data might just muddy the waters even more hah..
+
+I tested with my usual `perplexity: calculating perplexity over 584 chunks, n_ctx=512, batch_size=2048, n_seq=4` on my same `wiki.test.raw`. This imatrix corpus used for these imatrix quants is the same corpus I used on kimi-k2-instrict fwiw.
+
+
+
+👈 JSON data
+
+```json
+[
+ {
+ "name": "bf16",
+ "ppl": "9.0133 +/- 0.07115",
+ "size": 27.509,
+ "bpw": 16.000,
+ "legend": "main"
+ },
+ {
+ "name": "q4_K",
+ "ppl": "9.1487 +/- 0.07232",
+ "size": 7.925,
+ "bpw": 4.609,
+ "legend": "main"
+ },
+ {
+ "name": "q2_K",
+ "ppl": "10.6691 +/- 0.08376",
+ "size": 5.041,
+ "bpw": 2.932,
+ "legend": "main"
+ },
+ {
+ "name": "q3_K",
+ "ppl": "9.4405 +/- 0.07422",
+ "size": 6.291,
+ "bpw": 3.659,
+ "legend": "main"
+ },
+ {
+ "name": "q5_K",
+ "ppl": "9.0413 +/- 0.07128",
+ "size": 9.463,
+ "bpw": 5.504,
+ "legend": "main"
+ },
+ {
+ "name": "iq3_ks",
+ "ppl": "9.6945 +/- 0.07826",
+ "size": 5.910,
+ "bpw": 3.438,
+ "legend": "main"
+ },
+ {
+ "name": "iq3_k",
+ "ppl": "9.3296 +/- 0.07371",
+ "size": 6.291,
+ "bpw": 3.659,
+ "legend": "main"
+ },
+ {
+ "name": "iq2_ks",
+ "ppl": "11.8117 +/- 0.09367",
+ "size": 4.372,
+ "bpw": 2.543,
+ "legend": "main"
+ },
+ {
+ "name": "PR624-q2_K",
+ "ppl": "10.7015 +/- 0.08453",
+ "size": 5.041,
+ "bpw": 2.932,
+ "legend": "PR624"
+ },
+ {
+ "name": "PR624-q3_K",
+ "ppl": "9.3747 +/- 0.07318",
+ "size": 6.291,
+ "bpw": 3.659,
+ "legend": "PR624"
+ },
+ {
+ "name": "PR624-q4_K",
+ "ppl": "9.1210 +/- 0.07194",
+ "size": 7.925,
+ "bpw": 4.609,
+ "legend": "PR624"
+ },
+ {
+ "name": "PR624-q5_K",
+ "ppl": "9.0391 +/- 0.07129",
+ "size": 9.463,
+ "bpw": 5.504,
+ "legend": "PR624"
+ },
+ {
+ "name": "PR624-iq2_ks",
+ "ppl": "11.8160 +/- 0.09371",
+ "size": 4.372,
+ "bpw": 2.543,
+ "legend": "PR624"
+ },
+ {
+ "name": "PR624-iq3_ks",
+ "ppl": "9.5529 +/- 0.07619",
+ "size": 5.910,
+ "bpw": 3.438,
+ "legend": "PR624"
+ },
+ {
+ "name": "PR624-iq3_k",
+ "ppl": "9.3818 +/- 0.07445",
+ "size": 6.291,
+ "bpw": 3.659,
+ "legend": "PR624"
+ }
+]
+```
+
+
+
+
+
+* "better" on this PR624: iq3_ks q3_K q4_K q5_K
+* "better" on main: iq2_ks q2_K iq3_k
+
+These results are annoying because I'm seeing worse iq3_ks PPL on Kimi-K2-Instruct hah. I'm not sure how to read the tea leaves here, and I didn't check `n_ctx 2048` and I know some of these "pure" mixes are really over-quantized for a dense model.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-19** at **06:49:33**
+
+Can I see your imatrix calibration data?
+
+---
+
+👤 **ubergarm** commented on **2025-07-19** at **15:08:07**
@ikawrakow
diff --git a/github-data/pull_requests/628 - Function calling support for Kimi-K2.md b/github-data/pull_requests/628 - Function calling support for Kimi-K2.md
new file mode 100644
index 000000000..bcaa5e007
--- /dev/null
+++ b/github-data/pull_requests/628 - Function calling support for Kimi-K2.md
@@ -0,0 +1,329 @@
+## 🔀 [Pull Request #628](https://github.com/ikawrakow/ik_llama.cpp/pull/628) - Function calling support for Kimi-K2
+
+| **Author** | `iSevenDays` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `function_calling` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-18 |
+| **Updated** | 2025-07-26 |
+| **Merged** | 2025-07-23 |
+
+---
+
+## 📄 Description
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [ ] Low
+ - [ ] Medium
+ - [x] High
+---
+The implementation adds support for tool calls.
+
+The reason why I think the feature is important is that it allows users of ik_llama.cpp to use this backend with apps like Claude Code that requires tool calls.
+
+By using simple proxy like this one https://github.com/1rgs/claude-code-proxy (I just found it in github), I could connect Claude Code to ik_llama.cpp using [Kimi-K2 Q2](https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/tree/main/IQ2_KL) LLM provided by ubergarm.
+In claude-code-proxy you just have to change .env `OPENAI_API_BASE="http://192.168.0.24:8080/v1"`
+
+
+
+I had to port llama.cpp function tool calls support. The most difficult part was to port streaming and json healing.
+
+
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-18** at **09:56:32**
+
+Thank you for this! People have been asking for function calling support, but that is not something I'm very familiar with.
+
+LGTM, but I would appreciate at least one other person testing.
+
+I see your location is Leipzig. Have fond memories of this place, having spent 11 years there studying physics, doing a PhD, and staying for my first postdoc position.
+
+---
+
+👤 **iSevenDays** commented on **2025-07-18** at **10:43:28**
+
+> LGTM, but I would appreciate at least one other person testing.
+
+Thanks! I've done the basic tests, but the model loads too slow from my hdd, so I will test different use cases over the weekend.
+I could make it work for the first request, but it seems that multiple requests don't work currently or Kimi-K2 requires a different prompting. I'll debug this more over the weekend and update the PR.
+
+> I see your location is Leipzig. Have fond memories of this place, having spent 11 years there studying physics, doing a PhD, and staying for my first postdoc position.
+
+I live in a beautiful city, thanks! I've been living here for 3 years and have absolutely no regrets!
+
+---
+
+👤 **ubergarm** commented on **2025-07-18** at **16:38:14**
+
+> I could make it work for the first request, but it seems that multiple requests don't work currently or Kimi-K2 requires a different prompting. I'll debug this more over the weekend and update the PR.>
+
+Oh hej this is exciting! I believe we have a PR open for this https://github.com/ikawrakow/ik_llama.cpp/issues/407#issuecomment-2889059989 where some folks were trying to use a reverse proxy / wrapper to handle it similar to claude-code-proxy perhaps.
+
+I don't use tool calling myself, but did notice when adding Kimi-K2-Instruct PR that I left out one section for the chat endpoint for the `"role": "tool"`: https://github.com/ggml-org/llama.cpp/pull/14654#issuecomment-3074893927
+
+So if it expects llama-server to handle the template internally that `"role": "tool"` might not be applied. But if you're using the text completions endpoint and doing your own template it might not matter.
+
+---
+
+👤 **sousekd** commented on **2025-07-18** at **23:10:28**
+
+@iSevenDays This seems relevant:
+
+> We've just fixed 2 bugs in Kimi-K2-Instruct huggingface repo. Please update the following files to apply the fix:
+>
+>- tokenizer_config.json: update chat-template so that it works for multi-turn tool calls.
+>- tokenization_kimi.py: update encode method to enable encoding special tokens.
+
+https://x.com/Kimi_Moonshot/status/1945050874067476962
+
+---
+
+👤 **mtcl** commented on **2025-07-19** at **16:30:45**
+
+This is very exciting! I would much rather use a native function calling!
+
+---
+
+👤 **iSevenDays** commented on **2025-07-19** at **17:10:18**
+
+I took a look at how llama.cpp implements tool calling support and the task is much more complicated than I thought. Especially, the streaming part.
+I'll keep you updated.
+
+---
+
+👤 **mtcl** commented on **2025-07-19** at **17:42:16**
+
+> I took a look at how llama.cpp implements tool calling support and the task is much more complicated than I thought. Especially, the streaming part.
+> I'll keep you updated.
+
+That would be really amazing! ik_llama + tool calling will be a dream come true for me!
+
+---
+
+👤 **iSevenDays** commented on **2025-07-22** at **16:16:11**
+
+I had to port llama.cpp function tool calls support.
+
+Here is branch of Claude Proxy that you can use with ik_llama.cpp and Claude code.
+
+Steps to test this PR
+1. Clone https://github.com/iSevenDays/claude-code-proxy
+2. Run the proxy
+```
+uv run uvicorn server:app --host 0.0.0.0 --port 8082
+```
+3. Open .env inside claude proxy
+```
+OPENAI_API_BASE="http://192.168.0.24:8080/v1"
+PREFERRED_PROVIDER="openai"
+BIG_MODEL="Kimi-K2"
+SMALL_MODEL="Kimi-K2"
+```
+4. The model name is important, so set it to kimi-k2 to enable tool parsing from ik_llama.cpp
+5. Test with Claude Code
+```
+ANTHROPIC_BASE_URL=http://localhost:8082 claude "list files"
+```
+
+I'm doing more tests in the meantime.
+
+---
+
+👤 **iSevenDays** commented on **2025-07-23** at **09:00:50**
+
+I added Qwen3 tool calling support.
+From my tests, Kimi-K2 uses tools better and Qwen3 fails to use tools for Claude Code.
+
+---
+
+👤 **iSevenDays** commented on **2025-07-23** at **09:06:45**
+
+@ikawrakow I have backported tool calling support. I'm not sure if I can make PR smaller, because the feature in llama.cpp is quite complicated.
+I'd be glad if somebody can also do real world tests.
+
+I suggest using Kimi-K2 model with Claude Code using these steps https://github.com/ikawrakow/ik_llama.cpp/pull/628#issuecomment-3103627677
+
+It seems to work fine, at least it can call tools when I explicitly ask for it.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **09:13:50**
+
+I think there was a lot of interest for this, so hopefully we will have a few people testing the PR. Hopefully today, so I can merge before going on vacation tomorrow.
+
+---
+
+👤 **iSevenDays** commented on **2025-07-23** at **09:17:20**
+
+@ikawrakow I'll be happy to work on your requests for this PR to get it merged.
+I followed the strategy of porting llama.cpp as close as possible.
+
+---
+
+👤 **xldistance** commented on **2025-07-23** at **09:27:45**
+
+Looking forward to qwen3's tool call
+
+---
+
+👤 **iSevenDays** commented on **2025-07-23** at **10:10:58**
+
+I have added DeepSeek-R1 tool calling support.
+The following LLM works just fine. It takes often 2 iterations to do the tool call, but Claude Code handles that automatically.
+
+```
+numactl --interleave=all ./build/bin/llama-server \
+ --alias DeepSeek-R1T2 \
+ --model /root/models/DeepSeek-TNG-R1T2-Chimera-GGUF/IQ3_KS/IQ3_KS/DeepSeek-TNG-R1T2-Chimera-IQ3_KS-00001-of-00007.gguf \
+ -rtr \
+ --ctx-size 102400 \
+ -ctk q8_0 \
+ -mla 3 -fa \
+ -amb 512 \
+ -fmoe \
+ --temp 0.6 \
+ --top_p 0.95 \
+ --n-gpu-layers 63 \
+ --override-tensor "blk\.([0-5])\.ffn_.*=CUDA0,exps=CPU" \
+ --parallel 1 \
+ --threads 16 \
+ --host 0.0.0.0 \
+ --port 8080 \
+ --min_p 0.01 \
+ --numa distribute \
+ --threads-batch 32 \
+ --no-mmap \
+ -b 8192 -ub 8192
+```
+
+---
+
+👤 **xldistance** commented on **2025-07-23** at **10:43:12**
+
+@iSevenDays json-partial.h
+json-partial.cpp
+regex-partial.h
+regex-partial.cpp Missing documents
+
+---
+
+👤 **iSevenDays** commented on **2025-07-23** at **11:12:28**
+
+@xldistance thanks for the feedback, the files are there and can be compiled successfully.
+
+For those who are testing with Claude Code, here are my suggestions:
+Kimi-K2 works the best, and is very fast, uses tools.
+DeepSeek-TNG-R1T2-Chimera works, but too often it times out on my Dell R740 48GB 4090D.
+Qwen3-235B-A22B-Instruct-2507-GGUF (pure-IQ4_KS from ubergarm) doesn't want to use tools
+
+---
+
+👤 **xldistance** commented on **2025-07-23** at **11:14:21**
+
+@iSevenDays I use qwen3-coder-480b on top of ccr code
+
+---
+
+👤 **iSevenDays** commented on **2025-07-23** at **11:18:50**
+
+@xldistance just make sure to set the correct name of LLM in env and in llama-server.
+I enabled name matching e.g. the following triggers additional tool calling in system prompt to let the model know how to use tools properly. I ported the behavior from llama.cpp. Llama.cpp uses more complex system btw.
+The following names would work:
+Qwen3-235b
+DeepSeek-R1
+Kimi-K2
+Kimi_K2
+
+I'll check qwen3-coder-480b that was recently uploaded https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/tree/main/IQ2_KS
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **16:11:36**
+
+Well, I'll just merge it then.
+
+---
+
+👤 **iSevenDays** commented on **2025-07-24** at **12:14:41**
+
+@xldistance I found one issue with function tool calls when using LLM with Claude Code.
+Please check this PR https://github.com/ikawrakow/ik_llama.cpp/pull/643 to have the latest updates. Now I can use Qwen3 with Claude Code Proxy as well.
+
+---
+
+👤 **randoentity** commented on **2025-07-24** at **14:54:27**
+
+FWIW: tested and working with local qwen. Haven't run into the issue above yet. I'm not using the proxy/router from above though. Is there any way to make this work with jinja templates and not having the model name hardcoded?
+
+---
+
+👤 **mtcl** commented on **2025-07-24** at **16:09:14**
+
+> FWIW: tested and working with local qwen. Haven't run into the issue above yet. I'm not using the proxy/router from above though. Is there any way to make this work with jinja templates and not having the model name hardcoded?
+
+What's the exact command that you used to start the server? Can you please share?
+
+---
+
+👤 **randoentity** commented on **2025-07-24** at **21:15:54**
+
+@mtcl There's nothing special to it, look at isevendays' example above, just use `--alias Qwen3-235b` instead (but just qwen should be sufficient). Also check out the documentation added in this PR as it has an example of what the request should look like. Note that the model name is significant.
+
+---
+
+👤 **city96** commented on **2025-07-26** at **12:42:17**
+
+I did an update today and noticed token streaming wasn't working on latest master. I've tracked it down to this PR, with the commit right before it working.
+
+When token streaming is disabled, the reply is generated as usual and appears once generation finishes. When I enable token streaming, the generation still finishes in the background, but I never get any output. I was testing with an old version of sillytavern, but it seems reproducible in [mikupad](https://github.com/lmg-anon/mikupad) which is probably easier to reproduce.
+
+I get the same issue on Kimi, Deepseek V3, and even just random models like gemma:
+
+```
+CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-server -m /mnt/models/llm/gemma-3-27b-it-q6_k.gguf -c 16384 -ngl 99
+```
+
+---
+
+👤 **iSevenDays** commented on **2025-07-26** at **12:45:24**
+
+@city96 could you please check this PR https://github.com/ikawrakow/ik_llama.cpp/pull/652 and could you please provide a minimum reproducible example? At best, using some small LLM. Then I could check and verify it quickly.
+
+I'm currently testing the PR above and I use both streaming and non-streaming mode with Kimi-K2 model and I didn't notice any issues, but I would gladly help you resolve the issue if there was a regression.
+
+---
+
+👤 **city96** commented on **2025-07-26** at **13:14:50**
+
+I tested your linked PR, but still saw the same problem. I think I found the issue, though. It's this change that this PR makes:
+
+
+
+On latest master that line is here. Changing it back fixes streaming.
+
+https://github.com/ikawrakow/ik_llama.cpp/blob/4e9c78c039601c99541726d95216e3aa7bfda742/examples/server/server.cpp#L1621
+
+Not sure what the logic is in mainline llama.cpp for streaming, but I am using text completion instead of the chat completion endpoint. I assume this is likely why it wasn't caught, since most people probably use the openai compatible one.
+
+For a reproducible example, you can start the ik_llama.cpp server example using any model (I used gemma 27B for testing, but any model should work). Connect to it via mikupad and enter a simple query, enable token streaming, then hit "predict" at the bottom. I can try and make a pure python example as well if it helps.
+
+
+
+---
+
+👤 **iSevenDays** commented on **2025-07-26** at **14:45:12**
+
+@city96 could you please test the change in this PR https://github.com/ikawrakow/ik_llama.cpp/pull/654 ?
+I think you have correctly identified the issue, but I'll be able to test that change only later today.
+
+---
+
+👤 **city96** commented on **2025-07-26** at **18:31:30**
+
+@iSevenDays I can confirm that the newest PR does indeed fix token streaming on the text completion endpoint for me, thank you.
\ No newline at end of file
diff --git a/github-data/pull_requests/628 - _Draft_ Function calling support for Kimi-K2.md b/github-data/pull_requests/628 - _Draft_ Function calling support for Kimi-K2.md
deleted file mode 100644
index 6abf1e346..000000000
--- a/github-data/pull_requests/628 - _Draft_ Function calling support for Kimi-K2.md
+++ /dev/null
@@ -1,91 +0,0 @@
-### 🔀 [#628](https://github.com/ikawrakow/ik_llama.cpp/pull/628) - [Draft] Function calling support for Kimi-K2
-
-| **Author** | `iSevenDays` |
-| :--- | :--- |
-| **State** | ✅ **Open** |
-| **Created** | 2025-07-18 |
-| **Updated** | 2025-07-19 |
-
----
-
-#### Description
-
-- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
-- Self-reported review complexity:
- - [ ] Low
- - [x] Medium
- - [ ] High
----
-The implementation adds support for tool calls.
-
-The reason why I think the feature is important is that it allows users of ik_llama.cpp to use this backend with apps like Claude Code that requires tool calls.
-
-By using simple proxy like this one https://github.com/1rgs/claude-code-proxy (I just found it in github), I could connect Claude Code to ik_llama.cpp using [Kimi-K2 Q2](https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/tree/main/IQ2_KL) LLM provided by ubergarm.
-In claude-code-proxy you just have to change .env `OPENAI_API_BASE="http://192.168.0.24:8080/v1"`
-
-
-
-Kimi-k2 uses multiple formats, when not instructed to use specific tool call format.
-The list of formats that I observed is in examples/server/function_calls.md file.
-
-
-
----
-
-#### 💬 Conversation
-
-👤 **ikawrakow** submitted a review the **2025-07-18** at **09:56:32**: ✅ `APPROVED`
-
-Thank you for this! People have been asking for function calling support, but that is not something I'm very familiar with.
-
-LGTM, but I would appreciate at least one other person testing.
-
-I see your location is Leipzig. Have fond memories of this place, having spent 11 years there studying physics, doing a PhD, and staying for my first postdoc position.
-
----
-
-👤 **iSevenDays** commented the **2025-07-18** at **10:43:28**:
-
-> LGTM, but I would appreciate at least one other person testing.
-
-Thanks! I've done the basic tests, but the model loads too slow from my hdd, so I will test different use cases over the weekend.
-I could make it work for the first request, but it seems that multiple requests don't work currently or Kimi-K2 requires a different prompting. I'll debug this more over the weekend and update the PR.
-
-> I see your location is Leipzig. Have fond memories of this place, having spent 11 years there studying physics, doing a PhD, and staying for my first postdoc position.
-
-I live in a beautiful city, thanks! I've been living here for 3 years and have absolutely no regrets!
-
----
-
-👤 **sousekd** commented the **2025-07-18** at **23:10:28**:
-
-@iSevenDays This seems relevant:
-
-> We've just fixed 2 bugs in Kimi-K2-Instruct huggingface repo. Please update the following files to apply the fix:
-
-- tokenizer_config.json: update chat-template so that it works for multi-turn tool calls.
-- tokenization_kimi.py: update encode method to enable encoding special tokens.
-
-https://x.com/Kimi_Moonshot/status/1945050874067476962
-
----
-
-👤 **mtcl** commented the **2025-07-19** at **16:30:45**:
-
-This is very exciting! I would much rather use a native function calling!
-
----
-
-👤 **iSevenDays** commented the **2025-07-19** at **17:10:18**:
-
-I took a look at how llama.cpp implements tool calling support and the task is much more complicated that I thought. Especially, the streaming part.
-I'll keep you updated.
-
----
-
-👤 **mtcl** commented the **2025-07-19** at **17:42:16**:
-
-> I took a look at how llama.cpp implements tool calling support and the task is much more complicated than I thought. Especially, the streaming part.
-> I'll keep you updated.
-
-That would be really amazing! ik_llama + tool calling will be a dream come true for me!
\ No newline at end of file
diff --git a/github-data/pull_requests/630 - GEMM for IQ1_M.md b/github-data/pull_requests/630 - GEMM for IQ1_M.md
index 65c874053..269856caf 100644
--- a/github-data/pull_requests/630 - GEMM for IQ1_M.md
+++ b/github-data/pull_requests/630 - GEMM for IQ1_M.md
@@ -1,16 +1,19 @@
-### 🔀 [#630](https://github.com/ikawrakow/ik_llama.cpp/pull/630) - GEMM for IQ1_M
+## 🔀 [Pull Request #630](https://github.com/ikawrakow/ik_llama.cpp/pull/630) - GEMM for IQ1_M
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1m_gemm` |
+| **Target Branch** | `main` |
| **Created** | 2025-07-18 |
| **Updated** | 2025-07-18 |
+| **Merged** | 2025-07-18 |
---
-#### Description
+## 📄 Description
-Closes #626
+Closes [#626](https://github.com/ikawrakow/ik_llama.cpp/issues/626)
Hopefully the collective knowledge on Reddit and elsewhere that one cannot use `-fmoe` because of the missing `IQ1_M` GEMM has not already been perpetuated for all eternity...
diff --git a/github-data/pull_requests/631 - IQ1_M GEMM for ARM_NEON.md b/github-data/pull_requests/631 - IQ1_M GEMM for ARM_NEON.md
new file mode 100644
index 000000000..2862fdea7
--- /dev/null
+++ b/github-data/pull_requests/631 - IQ1_M GEMM for ARM_NEON.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #631](https://github.com/ikawrakow/ik_llama.cpp/pull/631) - IQ1_M GEMM for ARM_NEON
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq1_m_neon` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-20 |
+| **Updated** | 2025-07-20 |
+| **Merged** | 2025-07-20 |
+
+---
+
+## 📄 Description
+
+Did not make it into [#630](https://github.com/ikawrakow/ik_llama.cpp/issues/630), so adding t here.
\ No newline at end of file
diff --git a/github-data/pull_requests/637 - Add GitHub data backup.md b/github-data/pull_requests/637 - Add GitHub data backup.md
new file mode 100644
index 000000000..e5730f951
--- /dev/null
+++ b/github-data/pull_requests/637 - Add GitHub data backup.md
@@ -0,0 +1,58 @@
+## 🔀 [Pull Request #637](https://github.com/ikawrakow/ik_llama.cpp/pull/637) - Add GitHub data backup
+
+| **Author** | `ThomasBaruzier` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `tb/github-data` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-22 |
+| **Updated** | 2025-07-22 |
+| **Merged** | 2025-07-22 |
+
+---
+
+## 📄 Description
+
+Hello,
+
+The last two days have been pretty stressful, but I’m glad to see the repo back up!
+
+To prepare for any future unexpected outages, I’m sharing what I’ve been working on while the repo was down. For now, here’s a complete archive of all discussions, issues, and pull requests from before the takedown. I’ll also push the scraping and formatting code soon.
+
+This backup will also allow people to use the data directly for RAG use, in case the takedown was caused by scraping for this purpose (seems unlikely, but we don't know).
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [x] Low
+ - [ ] Medium
+ - [ ] High
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-22** at **16:18:31**
+
+Wow, thank you for this!
+
+---
+
+👤 **ikawrakow** commented on **2025-07-22** at **16:30:05**
+
+So, now we have a copy of the discussions, issues, and PRs
+* On [Codeberg](https://codeberg.org/ikawrakow/illama.git)
+* On [GitLab](https://gitlab.com/ikawrakow-group/ik_llama.cpp.git)
+
+It would be great to also get your scraping tool so we can update and backup regularly.
+
+---
+
+👤 **ThomasBaruzier** commented on **2025-07-22** at **16:42:24**
+
+Nice!
+
+I’ll clean up and refactor my code to make it resumable (currently, everything is scraped from scratch, which isn’t ideal). If successful, we could even set it up as a GitHub Actions workflow, triggered on every commit push to the repo.
+
+That said, frequent runs might clutter the commit history, so we can revisit the approach later.
+
+Expect a new PR for this in the next few days!
\ No newline at end of file
diff --git a/github-data/pull_requests/639 - Fix pauses after a comma.md b/github-data/pull_requests/639 - Fix pauses after a comma.md
new file mode 100644
index 000000000..adc323937
--- /dev/null
+++ b/github-data/pull_requests/639 - Fix pauses after a comma.md
@@ -0,0 +1,44 @@
+## 🔀 [Pull Request #639](https://github.com/ikawrakow/ik_llama.cpp/pull/639) - Fix pauses after a comma
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_comma_pauses` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-23 |
+| **Updated** | 2025-07-23 |
+| **Merged** | 2025-07-23 |
+
+---
+
+## 📄 Description
+
+Closes [#464](https://github.com/ikawrakow/ik_llama.cpp/issues/464).
+
+It seems there are models out there where the BOS token id is the same as the token ID of a comma (11 in the case of Qwen3 MoE models). This results in interpreting a comma during token generation as warm up run, which then results in using all experts, which makes the run time for the next token much longer, which then looks like a pause in the generation. The logic to use all experts during warm up was added in [#198](https://github.com/ikawrakow/ik_llama.cpp/issues/198) to improve the user experience with very large MoE models.
+
+This PR fixes the issue by checking how many tokens have been evaluated in the given context and only creating a warm up graph if this is zero (in addition to the other conditions to detect a warm up run).
+
+---
+
+## 💬 Conversation
+
+👤 **saood06** commented on **2025-07-23** at **09:37:55**
+
+I was just compiling something similar to this (checking from the llama_kv_cache object) on top of adding support for the flag. Your solution is much cleaner.
+
+---
+
+👤 **saood06** approved this pull request ✅ on **2025-07-23** at **09:39:08**
+
+---
+
+👤 **ubergarm** commented on **2025-07-23** at **16:15:15**
+
+@ikawrakow
+
+Yes this seems to fix the issue. I notice that now with this compiled in the first chat is *much* faster and subsequent chats no longer seem to pause after `,`.
+
+I'm spreading the word to update and recompile https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/discussions/1#6880eb57b50b0bb883e58f44 and no more need for that `--override-kv tokenizer.ggml.bos_token_id=int:151643` business
+
+Thanks!
\ No newline at end of file
diff --git a/github-data/pull_requests/64 - Better sub-3-bit quantization mixes with a qkv tensor.md b/github-data/pull_requests/64 - Better sub-3-bit quantization mixes with a qkv tensor.md
index e7cc30cfb..d5e6385ec 100644
--- a/github-data/pull_requests/64 - Better sub-3-bit quantization mixes with a qkv tensor.md
+++ b/github-data/pull_requests/64 - Better sub-3-bit quantization mixes with a qkv tensor.md
@@ -1,14 +1,17 @@
-### 🔀 [#64](https://github.com/ikawrakow/ik_llama.cpp/pull/64) - Better sub-3-bit quantization mixes with a qkv tensor
+## 🔀 [Pull Request #64](https://github.com/ikawrakow/ik_llama.cpp/pull/64) - Better sub-3-bit quantization mixes with a qkv tensor
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/phi3.5_tweaks` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-28 |
| **Updated** | 2024-09-28 |
+| **Merged** | 2024-09-28 |
---
-#### Description
+## 📄 Description
Phi3.5-mini uses a combined `QKV` tensor. As a result, the quantization mix strategies used for sub-3-bit quants fail. This PR fixes it, and here is what we get as quantization error using wiki text perplexity
diff --git a/github-data/pull_requests/640 - Add GitHub data filename sanitization.md b/github-data/pull_requests/640 - Add GitHub data filename sanitization.md
new file mode 100644
index 000000000..dd1f678df
--- /dev/null
+++ b/github-data/pull_requests/640 - Add GitHub data filename sanitization.md
@@ -0,0 +1,28 @@
+## 🔀 [Pull Request #640](https://github.com/ikawrakow/ik_llama.cpp/pull/640) - Add GitHub data: filename sanitization
+
+| **Author** | `ThomasBaruzier` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `tb/github-data-filenames` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-23 |
+| **Updated** | 2025-07-23 |
+| **Merged** | 2025-07-23 |
+
+---
+
+## 📄 Description
+
+Should fix https://github.com/ikawrakow/ik_llama.cpp/issues/638
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [x] Low
+ - [ ] Medium
+ - [ ] High
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-23** at **11:31:47**
\ No newline at end of file
diff --git a/github-data/pull_requests/642 - IQ4_KSS improvements.md b/github-data/pull_requests/642 - IQ4_KSS improvements.md
new file mode 100644
index 000000000..1508c824a
--- /dev/null
+++ b/github-data/pull_requests/642 - IQ4_KSS improvements.md
@@ -0,0 +1,666 @@
+## 🔀 [Pull Request #642](https://github.com/ikawrakow/ik_llama.cpp/pull/642) - IQ4_KSS improvements
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_kss_improvements` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-23 |
+| **Updated** | 2025-07-27 |
+| **Merged** | 2025-07-23 |
+
+---
+
+## 📄 Description
+
+Not much is known about `IQ4_KSS`, and nobody seems to be using it. So, I decided to give it some attention.
+
+Quick reminder (for more, see [#89](https://github.com/ikawrakow/ik_llama.cpp/issues/89))
+* `IQ4_KSS` uses exactly 4.0 bpw just like `IQ4_KT`
+* Performance on CUDA is very similar to `IQ4_KT` (after this PR)
+* PP CPU performance is similar to `IQ4_KT` (after this PR)
+* TG CPU performance is quite a bit better than `IQ4_KT`
+* PPL is only slightly worse than `IQ4_KT`
+
+This PR
+* Adds CUDA quantized matrix multiplication kernel
+* Adds repacking to `Q8_K_R8` for fast CPU GEMM
+* Adds a small improvement in quantization accuracy
+
+---
+
+## 💬 Conversation
+
+👤 **ubergarm** commented on **2025-07-23** at **16:05:19**
+
+I had just made an unreleased Qwen3-235B-A22B-Instruct-2507-IQ4_KSS feeling around for the sweet spot near 4BPW for mostly CPU inference. It seemed pretty good for the size, but I was fiddling around juicing up some attn tensors and first few layers as well so too many variables.
+
+If i get some time later this week, I might revisit that and do a proper a/b comparison of PPL for this PR.
+
+Swamped by all the releases and slowly digging out what a wild ride this week lol...
+
+Here is a lot of my raw data from testing with that model:
+
+
+
+👈 Details
+
+```bash
+#!/usr/bin/env bash
+
+# Repeating Layers [0-93]
+
+custom="
+# Attention
+blk\..*\.attn_q.*=iq6_k
+blk\..*\.attn_k.*=q8_0
+blk\..*\.attn_v.*=q8_0
+blk\..*\.attn_output.*=iq6_k
+
+# Routed Experts
+blk\.(0|1|2|3)\.ffn_down_exps\.weight=iq5_ks
+blk\.(0|1|2|3)\.ffn_(gate|up)_exps\.weight=iq4_ks
+blk\..*\.ffn_down_exps\.weight=iq4_ks
+blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
+
+# Token Embedding
+token_embd\.weight=iq4_k
+output\.weight=iq6_k
+"
+
+custom=$(
+ echo "$custom" | grep -v '^#' | \
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+
+numactl -N 0 -m 0 \
+./build/bin/llama-quantize \
+ --custom-q "$custom" \
+ --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
+ /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
+ /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-smort-IQ4_KSS.gguf \
+ IQ4_KSS \
+ 192
+```
+
+note data might have some copy paste errors in the comments, its been a busy week lol
+```json
+[
+ {
+ "name": "BF16",
+ "ppl": "4.3079 +/- 0.02544",
+ "size": 437.989,
+ "bpw": 16.003,
+ "legend": "pure",
+ "comment": ""
+ },
+ {
+ "name": "Q8_0",
+ "ppl": "4.3139 +/- 0.02550",
+ "size": 232.769,
+ "bpw": 8.505,
+ "legend": "pure"
+ },
+ {
+ "name": "pure-IQ4_KS",
+ "ppl": "4.4156 +/- 0.02624",
+ "size": 116.994,
+ "bpw": 4.275,
+ "legend": "pure",
+ "comment": "iq4_k token_embd, iq6_k output, ubergarm-imatrix-calibration-corpus-v02.txt"
+ },
+ {
+ "name": "IQ2_KL",
+ "ppl": "4.7912 +/- 0.02910",
+ "size": 81.866,
+ "bpw": 2.991,
+ "legend": "ubergarm",
+ "comment": "juiced q8_0 k|v, iq6_k q|o, iq3_ks down, iq2_kl gate|up"
+ },
+ {
+ "name": "IQ3_KS",
+ "ppl": "4.5275 +/- 0.02703",
+ "size": 97.968,
+ "bpw": 3.580,
+ "legend": "ubergarm",
+ "comment": "iq4_kt attn_.*, iq4_ks down, iq3_ks gate|up"
+ },
+ {
+ "name": "mix-IQ3_KS",
+ "ppl": "4.5078 +/- 0.02700",
+ "size": 98.979,
+ "bpw": 3.617,
+ "legend": "ubergarm",
+ "comment": "iq5_ks attn_.*, iq4_ks down, iq3_ks gate|up"
+ },
+ {
+ "name": "smort-IQ3_KS",
+ "ppl": "4.4915 +/- 0.02685",
+ "size": 101.308,
+ "bpw": 3.702,
+ "legend": "ubergarm",
+ "comment": "juiced q8_0 k|v, iq6_k q|o, iq4_ks down, iq3_ks gate|up"
+ },
+ {
+ "name": "IQ3_K",
+ "ppl": "4.4561 +/- 0.02657",
+ "size": 106.644,
+ "bpw": 3.897,
+ "legend": "ubergarm",
+ "comment": "juiced q8_0 k|v, iq6_k q|o, iq4_k down, iq3_k gate|up"
+ },
+ {
+ "name": "smort-IQ4_KSS",
+ "ppl": "4.4017 +/- 0.02614",
+ "size": 115.085,
+ "bpw": 4.205,
+ "legend": "ubergarm",
+ "comment": "juiced q8_0 k|v, iq6_k q|o, juiced first 4 routed exps layers, iq4_ks down, iq4_kss gate|up"
+ },
+ {
+ "name": "IQ4_KS",
+ "ppl": "4.3923 +/- 0.02618",
+ "size": 126.587,
+ "bpw": 4.625,
+ "legend": "ubergarm",
+ "comment": "iq5_ks attn_.*"
+ },
+ {
+ "name": "IQ5_K",
+ "ppl": "4.3351 +/- 0.02566",
+ "size": 161.722,
+ "bpw": 5.909,
+ "legend": "ubergarm",
+ "comment": "juiced q8_0 k|v, iq6_k q|o, iq6_k down, iq5_k gate|up"
+ }
+]
+```
+
+
+
+
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **16:11:03**
+
+@ubergarm Btw, I'm not finding where you mentioned to be seeing pauses after a comma to ping you there in case you missed PR [#639](https://github.com/ikawrakow/ik_llama.cpp/issues/639) that fixes the issue.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **17:48:16**
+
+So, I'll disappear tomorrow for 2 weeks. Do I merge this before I go?
+
+---
+
+👤 **ubergarm** commented on **2025-07-23** at **18:43:07**
+
+YOLO! (you only live once 🤣)
+
+i have not tested yet, but it seems at quick glance the code changes don't
+effect non-IQ4_KSS quants. as there aren't any of those quants released of
+which i know — yeah merge it and we can sort it out later lol!
+
+unrelated, i have not opened an issue, but was having a segfault in
+llama-quantize with IQ3_KT trellis quant so have not released. Recipe here:
+
+https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF#iq2_kt-todo
+
+Finally, unrelated, when trying to run this IQ2_KL (it quantizes fine) but
+crashes with asserts towards the end of starting up :
+https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF#iq2_kl-169597-gib-3034-bpw
+
+compiled CPU only, on that big dual socket epyc
+
+Sorry, not at home today now for proper logs
+
+finally finally, feel free to ignore all this and have a great couple
+weeks!!! 😋 catch you later!
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **18:50:45**
+
+When you get a chance, post the assert that the `IQ2_KL` model hits. The `IQ3_KT` segfault will be much more difficult to fix without a run in the debugger.
+
+---
+
+👤 **ThomasBaruzier** commented on **2025-07-23** at **18:52:06**
+
+> So, I'll disappear tomorrow for 2 weeks
+
+Noooooo
+
+Not urgent, but did you have the chance to look into the issue where imatrix data for `attn_k_b` was missing when quantizing kimi?
+
+---
+
+👤 **ikawrakow** commented on **2025-07-23** at **19:10:20**
+
+> Not urgent, but did you have the chance to look into the issue where imatrix data for attn_k_b was missing when quantizing kimi?
+
+Ha, I looked into it, then searched for the thread where we were talking about it, didn't find it, and then forgot.
+
+I'm actually not sure what happens in the Kimi runs. imatrix works fine when I test with a smaller model with the same attention architecture (DeepSeek-Lite). I tested with a GGUF created specifically for `llama.cpp` MLA (so `attn_k_b` and `attn_v_b` present, but not `attn_kv_b`), with a GGUF that precedes `ik_llama.cpp` MLA (so only `attn_kv_b` present), and with a version created from the safetensors with the `ik_llama.cpp` `convert_hf_to_gguf.py` script (so, all 3 present in the GGUF). In all 3 cases it worked fine with `-mla 1`. I didn't see tensor names with `(view of ...)` appended to the `attn_k_b` name, and `attn_v_b` calls were always triggered as expected. The only thing I was not sure if I was exercising was the split of the attention calculation using `-amb` (DeepSeek-Lite has 8 times fewer attention heads than the giant MLA models, so not easy to trigger the split). So, perhaps running the imatrix calculation without `-amb` would resolve it? The imatrix runs don't need such a big context, the `-mla 3` option that requires large work buffer without `-amb` is not being used, so it should be OK to run without `-amb`.
+
+So, in short, just try running without `-amb`. First with `--verbosity 2` to see if the imatrix data collection function gets called with `attn_k_b` and `attn_v_b`. If yes, rerun the imatrix calculation that way. If it still doesn't work, it will have to wait until I come back.
+
+---
+
+👤 **ubergarm** commented on **2025-07-23** at **20:41:55**
+
+Hope you get some sleep before your travels! Besides we can just use Qwen3-Coder now to fix everything right? :rofl:
+
+I'll open proper issues for these if I can't figure it out. Zero rush or priority here as I've not released these two models giving me troubles.
+
+Just got a laptop with some WiFi and can give a quick log:
+
+> When you get a chance, post the assert that the IQ2_KL model hits.
+
+*EDIT* Here is the Issue: [#649](https://github.com/ikawrakow/ik_llama.cpp/issues/649)
+
+
+
+IQ2_KL assert run and log
+
+```bash
+model=/mnt/raid/hf/Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ2_KL/Qwen3-480B-A35B-Instruct-IQ2_KL-00001-of-00004.gguf
+
+numactl -N 1 -m 1 \
+./build/bin/llama-server \
+ --model "$model"\
+ --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct \
+ --ctx-size 196608 \
+ -ctk q8_0 -ctv q8_0 \
+ -fa -fmoe \
+ -ub 4096 -b 4096 \
+ --parallel 3 \
+ --threads 128 \
+ --threads-batch 192 \
+ --numa numactl \
+ --host 127.0.0.1 \
+ --port 8080 \
+ --no-mmap
+
+INFO [ main] build info | tid="127586578487488" timestamp=1753302334 build=3821 commit="1b052109"
+INFO [ main] system info | tid="127586578487488" timestamp=1753302334 n_threads=128 n_threads_batch=192 total_threads=768 system_info="AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+llama_model_loader: additional 3 GGUFs metadata loaded.
+llama_model_loader: loaded meta data with 41 key-value pairs and 747 tensors from /mnt/raid/hf/Qwen3-Coder-480B-A35B-Instruct-GGUF/IQ2_KL/Qwen3-480B-A35B-Instruct-IQ2_KL-00001-of-00004.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3moe
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 Coder 480B A35B Instruct
+llama_model_loader: - kv 3: general.finetune str = Instruct
+llama_model_loader: - kv 4: general.basename str = Qwen3-Coder
+llama_model_loader: - kv 5: general.size_label str = 480B-A35B
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
+llama_model_loader: - kv 8: general.tags arr[str,1] = ["text-generation"]
+llama_model_loader: - kv 9: qwen3moe.block_count u32 = 62
+llama_model_loader: - kv 10: qwen3moe.context_length u32 = 262144
+llama_model_loader: - kv 11: qwen3moe.embedding_length u32 = 6144
+llama_model_loader: - kv 12: qwen3moe.feed_forward_length u32 = 8192
+llama_model_loader: - kv 13: qwen3moe.attention.head_count u32 = 96
+llama_model_loader: - kv 14: qwen3moe.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 15: qwen3moe.rope.freq_base f32 = 10000000.000000
+llama_model_loader: - kv 16: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 17: qwen3moe.expert_used_count u32 = 8
+llama_model_loader: - kv 18: qwen3moe.attention.key_length u32 = 128
+llama_model_loader: - kv 19: qwen3moe.attention.value_length u32 = 128
+llama_model_loader: - kv 20: general.file_type u32 = 155
+llama_model_loader: - kv 21: qwen3moe.expert_count u32 = 160
+llama_model_loader: - kv 22: qwen3moe.expert_feed_forward_length u32 = 2560
+llama_model_loader: - kv 23: qwen3moe.expert_shared_feed_forward_length u32 = 0
+llama_model_loader: - kv 24: general.quantization_version u32 = 2
+llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151643
+llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 33: tokenizer.chat_template str = {% macro render_item_list(item_list, ...
+llama_model_loader: - kv 34: quantize.imatrix.file str = /mnt/raid/models/ubergarm/Qwen3-Coder...
+llama_model_loader: - kv 35: quantize.imatrix.dataset str = ubergarm-imatrix-calibration-corpus-v...
+llama_model_loader: - kv 36: quantize.imatrix.entries_count i32 = 497
+llama_model_loader: - kv 37: quantize.imatrix.chunks_count i32 = 840
+llama_model_loader: - kv 38: split.no u16 = 0
+llama_model_loader: - kv 39: split.count u16 = 4
+llama_model_loader: - kv 40: split.tensors.count i32 = 747
+llama_model_loader: - type f32: 311 tensors
+llama_model_loader: - type q8_0: 124 tensors
+llama_model_loader: - type iq3_k: 62 tensors
+llama_model_loader: - type iq4_k: 1 tensors
+llama_model_loader: - type iq6_k: 125 tensors
+llama_model_loader: - type iq2_kl: 124 tensors
+llm_load_vocab: special tokens cache size = 26
+llm_load_vocab: token to piece cache size = 0.9311 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = qwen3moe
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 151936
+llm_load_print_meta: n_merges = 151387
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 262144
+llm_load_print_meta: n_embd = 6144
+llm_load_print_meta: n_layer = 62
+llm_load_print_meta: n_head = 96
+llm_load_print_meta: n_head_kv = 8
+llm_load_print_meta: n_rot = 128
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 128
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 12
+llm_load_print_meta: n_embd_k_gqa = 1024
+llm_load_print_meta: n_embd_v_gqa = 1024
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 8192
+llm_load_print_meta: n_expert = 160
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 2
+llm_load_print_meta: rope scaling = linear
+llm_load_print_meta: freq_base_train = 10000000.0
+llm_load_print_meta: freq_scale_train = 1
+llm_load_print_meta: n_ctx_orig_yarn = 262144
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = ?B
+llm_load_print_meta: model ftype = IQ2_KL - 2.6875 bpw
+llm_load_print_meta: model params = 480.155 B
+llm_load_print_meta: model size = 169.597 GiB (3.034 BPW)
+llm_load_print_meta: repeating layers = 168.388 GiB (3.024 BPW, 478.288 B parameters)
+llm_load_print_meta: general.name = Qwen3 Coder 480B A35B Instruct
+llm_load_print_meta: BOS token = 11 ','
+llm_load_print_meta: EOS token = 151645 '<|im_end|>'
+llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
+llm_load_print_meta: LF token = 148848 'ÄĬ'
+llm_load_print_meta: EOT token = 151645 '<|im_end|>'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_ff_exp = 2560
+llm_load_tensors: ggml ctx size = 0.33 MiB
+llm_load_tensors: CPU buffer size = 173666.87 MiB
+....................................................................................................
+llama_new_context_with_model: n_ctx = 196608
+llama_new_context_with_model: n_batch = 4096
+llama_new_context_with_model: n_ubatch = 4096
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 1
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 10000000.0
+llama_new_context_with_model: freq_scale = 1
+llama_kv_cache_init: CPU KV buffer size = 25296.00 MiB
+llama_new_context_with_model: KV self size = 25296.00 MiB, K (q8_0): 12648.00 MiB, V (q8_0): 12648.00 MiB
+llama_new_context_with_model: CPU output buffer size = 2.32 MiB
+llama_new_context_with_model: CPU compute buffer size = 5184.05 MiB
+llama_new_context_with_model: graph nodes = 2424
+llama_new_context_with_model: graph splits = 1
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: /home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+GGML_ASSERT(fms.S[j] > 0) failed
+GGML_ASSERT(fms.S[j] > 0) failed
+
+GGML_ASSERT(fms.S[j] > 0) failed
+/home/w/projects/ik_llama.cpp/ggml/src/./iqk/fa/iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed
+Could not attach to process. If your uid matches the uid of the target
+process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
+again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
+ptrace: Inappropriate ioctl for device.
+No stack.
+The program is not being run.
+Could not attach to process. If your uid matches the uid of the target
+process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
+again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
+ptrace: Inappropriate ioctl for device.
+No stack.
+The program is not being run.
+Could not attach to process. If your uid matches the uid of the target
+process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
+again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
+ptrace: Inappropriate ioctl for device.
+No stack.
+The program is not being run.
+Could not attach to process. If your uid matches the uid of the target
+process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
+again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
+warning: process 4140403 is a zombie - the process has already terminated
+ptrace: Inappropriate ioctl for device.
+No stack.
+The program is not being run.
+./myscripts/api-server-Qwen3-Coder-480B-A35B-Instruct.sh: line 34: 4140403 Aborted (core dumped) numactl -N 1 -m 1 ./build/bin/llama-server --model "$model" --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct --ctx-size 196608 -ctk q8_0 -ctv q8_0 -fa -fmoe -ub 4096 -b 4096 --parallel 3 --threads 128 --threads-batch 192 --numa numactl --host 127.0.0.1 --port 8080 --no-mmap
+```
+
+
+
+> The IQ3_KT segfault will be much more difficult to fix without a run in the debugger.
+
+*EDIT* here is that issue with debug logs: [#650](https://github.com/ikawrakow/ik_llama.cpp/issues/650)
+
+Yeah, I'll give full logs on its own issue later, it could just be this hardware possibly as it throws an error in `dmesg` as well. Here is the quick look
+
+
+
+segfault quantizing iq3_kt
+
+```bash
+
+$ sudo dmest -T --follow
+
+[Wed Jul 23 16:36:14 2025] llama-quantize[4140724]: segfault at 7dd4d780a9d0 ip 00007eb9b81c634f sp 00007fff3c7bfd40 error 4 in libggml.so[9c634f,7eb9b7815000+9be000] likely on CPU 195 (core 3, socket 1)
+[Wed Jul 23 16:36:14 2025] Code: ca 0f 87 80 fe ff ff c5 e8 57 d2 c5 f8 28 c2 e9 7f fe ff ff 8b bd 20 ff ff ff 8b b5 24 ff ff ff 8d 14 fd 00 00 00 00 48 63 d2 fa 10 04 90 48 8d 14 95 04 00 00 00 c5 fa 11 03 c5 fa 10 04 10
+
+$ #!/usr/bin/env bash
+
+# Repeating Layers [0-61]
+
+custom="
+# Attention
+blk\..*\.attn_q.*=iq4_kt
+blk\..*\.attn_k.*=iq4_kt
+blk\..*\.attn_v.*=iq4_kt
+blk\..*\.attn_output.*=iq4_kt
+
+# Routed Experts
+blk\..*\.ffn_down_exps\.weight=iq3_kt
+blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt
+
+# Non-Repeating Layers
+token_embd\.weight=iq4_kt
+output\.weight=iq6_k
+"
+
+custom=$(
+ echo "$custom" | grep -v '^#' | \
+ sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+
+numactl -N 1 -m 1 \
+./build/bin/llama-quantize \
+ --custom-q "$custom" \
+ --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
+ /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
+ /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf \
+ IQ2_KT \
+ 192
+
+
+main: build = 3823 (fd711836)
+main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
+main: quantizing '/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf' to '/mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf' as IQ2_KT using 192 threads
+llama_model_loader: additional 20 GGUFs metadata loaded.
+llama_model_loader: loaded meta data with 37 key-value pairs and 747 tensors from /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen3moe
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Qwen3 Coder 480B A35B Instruct
+llama_model_loader: - kv 3: general.finetune str = Instruct
+llama_model_loader: - kv 4: general.basename str = Qwen3-Coder
+llama_model_loader: - kv 5: general.size_label str = 480B-A35B
+llama_model_loader: - kv 6: general.license str = apache-2.0
+llama_model_loader: - kv 7: general.license.link str = https://huggingface.co/Qwen/Qwen3-Cod...
+llama_model_loader: - kv 8: general.tags arr[str,1] = ["text-generation"]
+llama_model_loader: - kv 9: qwen3moe.block_count u32 = 62
+llama_model_loader: - kv 10: qwen3moe.context_length u32 = 262144
+llama_model_loader: - kv 11: qwen3moe.embedding_length u32 = 6144
+llama_model_loader: - kv 12: qwen3moe.feed_forward_length u32 = 8192
+llama_model_loader: - kv 13: qwen3moe.attention.head_count u32 = 96
+llama_model_loader: - kv 14: qwen3moe.attention.head_count_kv u32 = 8
+llama_model_loader: - kv 15: qwen3moe.rope.freq_base f32 = 10000000.000000
+llama_model_loader: - kv 16: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 17: qwen3moe.expert_used_count u32 = 8
+llama_model_loader: - kv 18: qwen3moe.attention.key_length u32 = 128
+llama_model_loader: - kv 19: qwen3moe.attention.value_length u32 = 128
+llama_model_loader: - kv 20: general.file_type u32 = 32
+llama_model_loader: - kv 21: qwen3moe.expert_count u32 = 160
+llama_model_loader: - kv 22: qwen3moe.expert_feed_forward_length u32 = 2560
+llama_model_loader: - kv 23: qwen3moe.expert_shared_feed_forward_length u32 = 0
+llama_model_loader: - kv 24: general.quantization_version u32 = 2
+llama_model_loader: - kv 25: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 26: tokenizer.ggml.pre str = qwen2
+llama_model_loader: - kv 27: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
+llama_model_loader: - kv 28: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 29: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
+llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 151645
+llama_model_loader: - kv 31: tokenizer.ggml.padding_token_id u32 = 151643
+llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
+llama_model_loader: - kv 33: tokenizer.chat_template str = {% macro render_item_list(item_list, ...
+llama_model_loader: - kv 34: split.no u16 = 0
+llama_model_loader: - kv 35: split.count u16 = 21
+llama_model_loader: - kv 36: split.tensors.count i32 = 747
+llama_model_loader: - type f32: 311 tensors
+llama_model_loader: - type bf16: 436 tensors
+================================ Have weights data with 497 entries
+[ 1/ 747] token_embd.weight - [ 6144, 151936, 1, 1], type = bf16, Using custom type iq4_kt for tensor token_embd.weight
+
+====== llama_model_quantize_internal: did not find weights for token_embd.weight
+converting to iq4_kt .. Adding custom rule blk\..*\.attn_q.* -> iq4_kt
+Adding custom rule blk\..*\.attn_k.* -> iq4_kt
+Adding custom rule blk\..*\.attn_v.* -> iq4_kt
+Adding custom rule blk\..*\.attn_output.* -> iq4_kt
+Adding custom rule blk\..*\.ffn_down_exps\.weight -> iq3_kt
+Adding custom rule blk\..*\.ffn_(gate|up)_exps\.weight -> iq2_kt
+Adding custom rule token_embd\.weight -> iq4_kt
+Adding custom rule output\.weight -> iq6_k
+load_imatrix: imatrix dataset='ubergarm-imatrix-calibration-corpus-v02.txt'
+load_imatrix: loaded 497 importance matrix entries from /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat computed on 840 chunks
+prepare_imatrix: have 497 importance matrix entries
+size = 1780.50 MiB -> 445.70 MiB
+[ 2/ 747] blk.0.attn_k_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 3/ 747] blk.0.attn_k.weight - [ 6144, 1024, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_k.weight
+converting to iq4_kt .. cluster_points: Oops. Cluster 4 has no points: 0 1 0 0
+cluster_points: 1 out of 625 clusters dir not have any points
+size = 12.00 MiB -> 3.00 MiB
+[ 4/ 747] blk.0.attn_output.weight - [12288, 6144, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_output.weight
+converting to iq4_kt .. size = 144.00 MiB -> 36.02 MiB
+[ 5/ 747] blk.0.attn_q_norm.weight - [ 128, 1, 1, 1], type = f32, size = 0.000 MB
+[ 6/ 747] blk.0.attn_q.weight - [ 6144, 12288, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_q.weight
+converting to iq4_kt .. size = 144.00 MiB -> 36.05 MiB
+[ 7/ 747] blk.0.attn_v.weight - [ 6144, 1024, 1, 1], type = bf16, Using custom type iq4_kt for tensor blk.0.attn_v.weight
+converting to iq4_kt .. size = 12.00 MiB -> 3.00 MiB
+[ 8/ 747] blk.0.attn_norm.weight - [ 6144, 1, 1, 1], type = f32, size = 0.023 MB
+[ 9/ 747] blk.0.ffn_down_exps.weight - [ 2560, 6144, 160, 1], type = bf16, Using custom type iq3_kt for tensor blk.0.ffn_down_exps.weight
+converting to iq3_kt .. ./myscripts/quantize-Qwen3-Coder-480B-A35B-Instruct-v08.sh: line 33: 2323451 Segmentation fault (core dumped) numactl -N 0 -m 0 ./build/bin/llama-quantize --custom-q "$custom" --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf IQ2_KT 192
+```
+
+
+
+@ThomasBaruzier
+
+I can open a 3rd issue for the mla stuff and put all the notes in one place along with ik's above comments and work together to figure out what is going on. thanks!
+
+*EDIT* Here is that issue now: [#651](https://github.com/ikawrakow/ik_llama.cpp/issues/651)
+
+---
+
+👤 **Nexesenex** commented on **2025-07-23** at **21:12:23**
+
+> Not much is known about IQ4_KSS, and nobody seems to be using it. So, I decided to give it some attention.
+
+@Ikawrakow : And now that it has Cuda MMQ, I will use it! Thanks for completing it!
+
+And have a great time off!
+
+---
+
+👤 **ThomasBaruzier** commented on **2025-07-23** at **22:55:11**
+
+> So, in short, just try running without -amb. First with --verbosity 2 to see if the imatrix data collection function gets called with attn_k_b and attn_v_b. If yes, rerun the imatrix calculation that way. If it still doesn't work, it will have to wait until I come back.
+
+Thank you for the detailed explanation! Since I rely on @ubergarm's imatrix due to hardware limitations (no pressure as well), I won't be able to verify this on my end right now. You'll be back in two weeks anyway (have a great time!).
+
+> Just got a laptop with some WiFi
+
+You seem like someone who would really appreciate [Termux](https://github.com/termux/termux-app). Apologies for the poor internet, seems we're all on vacation/away 😅
+
+https://github.com/user-attachments/assets/9cde804a-b6bd-487f-b25e-f2b1848d9394
+
+> I can open a 3rd issue for the mla stuff and put all the notes in one place along with ik's above comments and work together to figure out what is going on
+
+That sounds really nice! Thanks
+
+---
+
+👤 **ubergarm** commented on **2025-07-25** at **22:22:49**
+
+
+
+The IQ4_KSS is looking like a pretty good spot for [ubergarm/Qwen3-235B-A22B-Thinking-2507](https://huggingface.co/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF)
+
+---
+
+👤 **ubergarm** commented on **2025-07-27** at **18:36:50**
+
+I used Qwen3-Coder-480B-A35B-Instruct-IQ5_K to vibe code up some new matplot lib software and actually fixup my Y-axis log scale more similar to how I've seen some of ik's plots. The IQ4_KSS recipes seem quite strong. They differ *slightly* from each other, exact recipe in links below.
+
+
+
+
+
+* https://huggingface.co/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF#iq4_kss-115085-gib-4205-bpw
+* https://huggingface.co/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF#iq4_kss-114093-gib-4169-bpw
+
+*UPDATE*
+
+And just finished up the bigger [Qwen3-Coder-480B-A35B-Instruct-GGUF IQ4_KSS](https://huggingface.co/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF#iq4_kss-233676-gib-4180-bpw)
+
+
+
+(*note that the IQ2_K here is using iq2_kl as ffn_down_exps instead of larger iq3_k so it is right in line with what the IQ2_KS would be for size and PPL).
\ No newline at end of file
diff --git a/github-data/pull_requests/643 - Enable LLM function calls.md b/github-data/pull_requests/643 - Enable LLM function calls.md
new file mode 100644
index 000000000..483908dfe
--- /dev/null
+++ b/github-data/pull_requests/643 - Enable LLM function calls.md
@@ -0,0 +1,30 @@
+## 🔀 [Pull Request #643](https://github.com/ikawrakow/ik_llama.cpp/pull/643) - Enable LLM function calls
+
+| **Author** | `iSevenDays` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `qwen3-function-calls` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-24 |
+| **Updated** | 2025-07-24 |
+| **Merged** | 2025-07-24 |
+
+---
+
+## 📄 Description
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [x] Low
+ - [ ] Medium
+ - [ ] High
+
+The PR fixes the logic when LLM responds with text and includes tool calls, but responds with "stop" instead of "tool_calls".
+
+The PR enables LLM to work with Claude Code proxies in streaming mode.
+
+---
+
+## 💬 Conversation
+
+👤 **ikawrakow** approved this pull request ✅ on **2025-07-24** at **18:24:00**
\ No newline at end of file
diff --git a/github-data/pull_requests/645 - Port speculative decoding from upstream to llama-server.md b/github-data/pull_requests/645 - Port speculative decoding from upstream to llama-server.md
new file mode 100644
index 000000000..85f0fa520
--- /dev/null
+++ b/github-data/pull_requests/645 - Port speculative decoding from upstream to llama-server.md
@@ -0,0 +1,1061 @@
+## 🔀 [Pull Request #645](https://github.com/ikawrakow/ik_llama.cpp/pull/645) - Port speculative decoding from upstream to llama-server
+
+| **Author** | `g2mt` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Source Branch** | `speculative-port` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-25 |
+| **Updated** | 2025-07-27 |
+| **Assignees** | `saood06` |
+
+---
+
+## 📄 Description
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [ ] Low
+ - [x] Medium
+ - [ ] High
+
+Related to [#322](https://github.com/ikawrakow/ik_llama.cpp/issues/322)
+
+This is a port of the speculative decoding function for llama-server from the upstream code base.
+
+Changes:
+
+- Updated llama-server source code
+- Added several functions needed for speculative decoding.
+- Add prefixes to KV cache tensors to support loading of multiple models
+
+I used Qwen3-235B in this PR.
+
+---
+
+## 💬 Conversation
+
+👤 **saood06** commented on **2025-07-25** at **05:15:48**
+
+Thank you for doing this. I can test/review/assist if you need.
+
+---
+
+👤 **saood06** commented on **2025-07-25** at **05:18:58**
+
+Also are you aware this: https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples/speculative/speculative.cpp exists.
+
+---
+
+👤 **g2mt** commented on **2025-07-25** at **05:26:10**
+
+I got the server to compile, but when loading Qwen 2.5 1.5b with the 0.5b version as the draft, I get this error:
+
+```
+ggml_backend_alloc_ctx_tensors_from_buft: all tensors in the context are already allocated
+llama_kv_cache_init: failed to allocate buffer for kv cache
+llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
+llama_init_from_gpt_params: error: failed to create context with model 'Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf'
+ ERR [ load_model] failed to load draft model | tid="140650859190528" timestamp=1753420591 model="Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf"
+```
+
+GDB says it occurred in this llama_init_from_gpt_params call:
+
+```cpp
+ llama_init_result llama_init_dft = llama_init_from_gpt_params(params_dft);
+```
+
+I wonder if llama_kv_cache_init is unable to load tensors with the same name. I'll try and fix the code later.
+
+---
+
+👤 **g2mt** commented on **2025-07-25** at **05:27:44**
+
+> Also are you aware this: https://github.com/ikawrakow/ik_llama.cpp/blob/main/examples/speculative/speculative.cpp exists.
+
+I am aware of the example. I'll check it later.
+
+---
+
+👤 **saood06** commented on **2025-07-25** at **05:34:38**
+
+>I am aware of the example. I'll check it later.
+
+Sorry. I forgot my history. The common one (introduced here: https://github.com/ggml-org/llama.cpp/pull/10362) was done before server: https://github.com/ggml-org/llama.cpp/pull/10455. The common implementation was made to be simpler to understand and work with which is why it came bundled with https://github.com/ggml-org/llama.cpp/tree/8f419181d1c20d8195148680df15b6f093cb1512/examples/speculative-simple
+
+---
+
+👤 **g2mt** commented on **2025-07-25** at **07:09:50**
+
+I'm now able to load the draft model. It seems that the kv-cache tensor names were reused for both models. Prefixing them with the model name fixes it.
+
+---
+
+👤 **saood06** commented on **2025-07-25** at **07:47:27**
+
+>I'm now able to load the draft model. It seems that the kv-cache tensor names were reused for both models. Prefixing them with the model name fixes it.
+
+Nice. Did you get any accepted tokens?
+
+---
+
+👤 **g2mt** commented on **2025-07-25** at **09:02:33**
+
+I think I got it working. For some reason ik_llama's slot.id is offset by 1, which tripped me off a bit.
+
+A simple test of repeating a string shows it working:
+
+```
+curl -s http://localhost:9001/v1/chat/completions \
+ -H "Content-Type: application/json" \
+ -H "Authorization: Bearer no-key" \
+ -d '{"model": "test","messages": [{"role": "user","content": "Repeat the following sentence, as is: The quick brown fox jumped over the lazy dog."}]}'
+{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","content":"The quick brown fox jumped over the lazy dog."}}],"created":1753433480,"model":"test","object":"chat.completion","usage":{"completion_tokens":14,"prompt_tokens":26,"total_tokens":40},"id":"chatcmpl-QK3CBenhWiSBeeuIs6UGs2yXCV5YpqRO","__verbose":{"content":"The quick brown fox jumped over the lazy dog.","generated_text":"The quick brown fox jumped over the lazy dog.",
+```
+
+Server logs do show the speculative decoding results being accepted:
+
+```
+VERB [ update_slots] speculative decoding result | tid="140737350637888" timestamp=1753433480 id_slot=0 accepted=12 total=13 new_n_past=39
+```
+
+It looks like it's working, but I think more testing is needed. If someone else could post more test results that would be great. I'll open the PR up for review now.
+
+---
+
+👤 **saood06** commented on **2025-07-25** at **09:12:46**
+
+>If someone else could post more test results that would be great. I'll open the PR up for review now.
+
+I'll try to do some tests within a day.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-25** at **09:21:28**
+
+@saood06 I'll be not able to review before August 7, so I have assigned you as a reviewer.
+
+Hopefully more people will test.
+
+---
+
+👤 **saood06** commented on **2025-07-25** at **09:47:41**
+
+> @saood06 I'll be not able to review before August 7, so I have assigned you as a reviewer.
+
+I'll review and test it.
+
+---
+
+👤 **ChicoPinto70** commented on **2025-07-26** at **12:30:23**
+
+Hi, Guys. I've tested this branch in a Dual Xeon E5 2699 v3, 256GB DDR4, 3xRTX 3090 in Ubuntu 24.04.2 LTS.
+
+I compiled the project with these parameters:
+
+cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DGGML_SCHED_MAX_COPIES=1 DGGML_CUDA_DISABLE_GRAPHS=1 -DGGML_CUDA_IQK_FORCE_BF16=ON -DGGML_CUDA_MIN_BATCH_OFFLOAD=64
+
+And I ran it with this command:
+
+CUDA_VISIBLE_DEVICES="1,2,0" ./build/bin/llama-server --alias unsloth/DeepSeek-R1-0528-UD-Q3_K_XL -m /home/chico/.lmstudio/models/unsloth/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-UD-Q3_K_XL-00001-of-00007.gguf -ngl 64 -c 16384 -mla 3 -fa -amb 1024 -fmoe -t 32 -ctk q8_0 -ot "blk\.[0-6]\..*_exps\.=CUDA1,blk\.(7|8|9|10)\..*_exps\.=CUDA2,exps=CPU" --parallel 1 --numa distribute -b 4096 -ub 4096 --no-mmap -ts 1,0,0 -ser 7,1 --host 192.168.0.9 --port 1235 -md /home/chico/.lmstudio/models/jukofyork/DeepSeek-R1-DRAFT-0.6B-v2.0-GGUF/DeepSeek-R1-DRAFT-0.6B-128k-Q4_0.gguf -ngld 64
+
+It failed with this message: /home/chico/ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu:604: GGML_ASSERT(src0->ne[2] == src1->ne[2] && src0->ne[2] == dst->ne[2]) failed
+
+Bellow, the complete output:
+
+ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
+ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
+ggml_cuda_init: found 3 CUDA devices:
+ Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+ Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
+INFO [ main] build info | tid="130964736602112" timestamp=1753532241 build=3841 commit="e938d9f6"
+INFO [ main] system info | tid="130964736602112" timestamp=1753532241 n_threads=32 n_threads_batch=-1 total_threads=36 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
+llama_model_loader: additional 6 GGUFs metadata loaded.
+llama_model_loader: loaded meta data with 63 key-value pairs and 1086 tensors from /home/chico/.lmstudio/models/unsloth/DeepSeek-R1-0528-GGUF/DeepSeek-R1-0528-UD-Q3_K_XL-00001-of-00007.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = deepseek2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = Deepseek-R1-0528
+llama_model_loader: - kv 3: general.basename str = Deepseek-R1-0528
+llama_model_loader: - kv 4: general.quantized_by str = Unsloth
+llama_model_loader: - kv 5: general.size_label str = 256x20B
+llama_model_loader: - kv 6: general.license str = mit
+llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
+llama_model_loader: - kv 8: general.base_model.count u32 = 1
+llama_model_loader: - kv 9: general.base_model.0.name str = DeepSeek R1 0528
+llama_model_loader: - kv 10: general.base_model.0.version str = 0528
+llama_model_loader: - kv 11: general.base_model.0.organization str = Deepseek Ai
+llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
+llama_model_loader: - kv 13: general.tags arr[str,3] = ["deepseek", "unsloth", "transformers"]
+llama_model_loader: - kv 14: general.languages arr[str,1] = ["en"]
+llama_model_loader: - kv 15: deepseek2.block_count u32 = 61
+llama_model_loader: - kv 16: deepseek2.context_length u32 = 163840
+llama_model_loader: - kv 17: deepseek2.embedding_length u32 = 7168
+llama_model_loader: - kv 18: deepseek2.feed_forward_length u32 = 18432
+llama_model_loader: - kv 19: deepseek2.attention.head_count u32 = 128
+llama_model_loader: - kv 20: deepseek2.attention.head_count_kv u32 = 1
+llama_model_loader: - kv 21: deepseek2.rope.freq_base f32 = 10000.000000
+llama_model_loader: - kv 22: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 23: deepseek2.expert_used_count u32 = 8
+llama_model_loader: - kv 24: deepseek2.leading_dense_block_count u32 = 3
+llama_model_loader: - kv 25: deepseek2.vocab_size u32 = 129280
+llama_model_loader: - kv 26: deepseek2.attention.q_lora_rank u32 = 1536
+llama_model_loader: - kv 27: deepseek2.attention.kv_lora_rank u32 = 512
+llama_model_loader: - kv 28: deepseek2.attention.key_length u32 = 576
+llama_model_loader: - kv 29: deepseek2.attention.value_length u32 = 512
+llama_model_loader: - kv 30: deepseek2.attention.key_length_mla u32 = 192
+llama_model_loader: - kv 31: deepseek2.attention.value_length_mla u32 = 128
+llama_model_loader: - kv 32: deepseek2.expert_feed_forward_length u32 = 2048
+llama_model_loader: - kv 33: deepseek2.expert_count u32 = 256
+llama_model_loader: - kv 34: deepseek2.expert_shared_count u32 = 1
+llama_model_loader: - kv 35: deepseek2.expert_weights_scale f32 = 2.500000
+llama_model_loader: - kv 36: deepseek2.expert_weights_norm bool = true
+llama_model_loader: - kv 37: deepseek2.expert_gating_func u32 = 2
+llama_model_loader: - kv 38: deepseek2.rope.dimension_count u32 = 64
+llama_model_loader: - kv 39: deepseek2.rope.scaling.type str = yarn
+llama_model_loader: - kv 40: deepseek2.rope.scaling.factor f32 = 40.000000
+llama_model_loader: - kv 41: deepseek2.rope.scaling.original_context_length u32 = 4096
+llama_model_loader: - kv 42: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
+llama_model_loader: - kv 43: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 44: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 45: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
+llama_model_loader: - kv 46: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 47: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
+llama_model_loader: - kv 48: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 49: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 50: tokenizer.ggml.padding_token_id u32 = 2
+llama_model_loader: - kv 51: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 52: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 53: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
+llama_model_loader: - kv 54: general.quantization_version u32 = 2
+llama_model_loader: - kv 55: general.file_type u32 = 12
+llama_model_loader: - kv 56: quantize.imatrix.file str = DeepSeek-R1-0528-GGUF/imatrix_unsloth...
+llama_model_loader: - kv 57: quantize.imatrix.dataset str = unsloth_calibration_DeepSeek-R1-0528-...
+llama_model_loader: - kv 58: quantize.imatrix.entries_count i32 = 659
+llama_model_loader: - kv 59: quantize.imatrix.chunks_count i32 = 720
+llama_model_loader: - kv 60: split.no u16 = 0
+llama_model_loader: - kv 61: split.tensors.count i32 = 1086
+llama_model_loader: - kv 62: split.count u16 = 7
+llama_model_loader: - type f32: 361 tensors
+llama_model_loader: - type q8_0: 122 tensors
+llama_model_loader: - type q3_K: 166 tensors
+llama_model_loader: - type q4_K: 392 tensors
+llama_model_loader: - type q5_K: 29 tensors
+llama_model_loader: - type q6_K: 16 tensors
+==========================================================================
+Detected incompatible DeepSeek model.
+Will try to fix, but there are no guarantees
+
+*** Your prompt processing speed will be crippled ***
+
+Consider making your own ik_llama.cpp compatible model or
+ask the model provider to make one for you,
+==========================================================================
+llm_load_vocab: special tokens cache size = 818
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = deepseek2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 163840
+llm_load_print_meta: n_embd = 7168
+llm_load_print_meta: n_layer = 61
+llm_load_print_meta: n_head = 128
+llm_load_print_meta: n_head_kv = 128
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 192
+llm_load_print_meta: n_embd_head_v = 128
+llm_load_print_meta: n_gqa = 1
+llm_load_print_meta: n_embd_k_gqa = 24576
+llm_load_print_meta: n_embd_v_gqa = 16384
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 18432
+llm_load_print_meta: n_expert = 256
+llm_load_print_meta: n_expert_used = 8
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 0
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 10000.0
+llm_load_print_meta: freq_scale_train = 0.025
+llm_load_print_meta: n_ctx_orig_yarn = 4096
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = 671B
+llm_load_print_meta: model ftype = Q3_K - Medium
+llm_load_print_meta: model params = 671.026 B
+llm_load_print_meta: model size = 275.576 GiB (3.528 BPW)
+llm_load_print_meta: repeating layers = 274.383 GiB (3.522 BPW, 669.173 B parameters)
+llm_load_print_meta: general.name = Deepseek-R1-0528
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 2 '<|▁pad▁|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_print_meta: n_layer_dense_lead = 3
+llm_load_print_meta: n_lora_q = 1536
+llm_load_print_meta: n_lora_kv = 512
+llm_load_print_meta: n_ff_exp = 2048
+llm_load_print_meta: n_expert_shared = 1
+llm_load_print_meta: expert_weights_scale = 2.5
+llm_load_print_meta: expert_weights_norm = 1
+llm_load_print_meta: expert_gating_func = sigmoid
+llm_load_print_meta: rope_yarn_log_mul = 0.1000
+llm_load_tensors: ggml ctx size = 0.89 MiB
+Tensor blk.3.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.3.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.3.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.4.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.4.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.4.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.5.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.5.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.5.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.6.ffn_gate_exps.weight buffer type overriden to CUDA1
+Tensor blk.6.ffn_down_exps.weight buffer type overriden to CUDA1
+Tensor blk.6.ffn_up_exps.weight buffer type overriden to CUDA1
+Tensor blk.7.ffn_gate_exps.weight buffer type overriden to CUDA2
+Tensor blk.7.ffn_down_exps.weight buffer type overriden to CUDA2
+Tensor blk.7.ffn_up_exps.weight buffer type overriden to CUDA2
+Tensor blk.8.ffn_gate_exps.weight buffer type overriden to CUDA2
+Tensor blk.8.ffn_down_exps.weight buffer type overriden to CUDA2
+Tensor blk.8.ffn_up_exps.weight buffer type overriden to CUDA2
+Tensor blk.9.ffn_gate_exps.weight buffer type overriden to CUDA2
+Tensor blk.9.ffn_down_exps.weight buffer type overriden to CUDA2
+Tensor blk.9.ffn_up_exps.weight buffer type overriden to CUDA2
+Tensor blk.10.ffn_gate_exps.weight buffer type overriden to CUDA2
+Tensor blk.10.ffn_down_exps.weight buffer type overriden to CUDA2
+Tensor blk.10.ffn_up_exps.weight buffer type overriden to CUDA2
+Tensor blk.11.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.11.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.12.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.12.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.12.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.13.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.13.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.13.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.14.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.14.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.14.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.15.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.16.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.17.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.17.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.17.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.18.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.18.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.18.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.19.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.19.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.19.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.20.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.20.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.20.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.21.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.21.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.21.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.22.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.22.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.22.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.23.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.23.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.23.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.24.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.24.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.24.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.25.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.25.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.25.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.26.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.27.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.28.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.29.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.30.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.31.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.32.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.33.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.34.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.35.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.36.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.37.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.38.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.39.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.40.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.41.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.42.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.43.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.44.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.45.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.46.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.47.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.48.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.48.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.48.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.49.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.49.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.49.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.50.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.50.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.50.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.51.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.51.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.51.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.52.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.52.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.52.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.53.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.53.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.53.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.54.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.54.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.54.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.55.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.55.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.55.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.56.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.56.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.56.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.57.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.57.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.57.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.58.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.58.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.58.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.59.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.59.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.59.ffn_up_exps.weight buffer type overriden to CPU
+Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
+Tensor blk.60.ffn_down_exps.weight buffer type overriden to CPU
+Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU
+llm_load_tensors: offloading 61 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 62/62 layers to GPU
+llm_load_tensors: CPU buffer size = 233856.00 MiB
+llm_load_tensors: CUDA_Host buffer size = 497.11 MiB
+llm_load_tensors: CUDA0 buffer size = 9925.05 MiB
+llm_load_tensors: CUDA1 buffer size = 18956.00 MiB
+llm_load_tensors: CUDA2 buffer size = 18956.00 MiB
+....................................................................................................
+============ llm_prepare_mla: need to compute 61 wkv_b tensors
+Computed blk.0.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.1.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.2.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.3.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.4.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.5.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.6.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.7.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.8.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.9.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.10.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.11.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.12.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.13.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.14.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.15.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.16.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.17.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.18.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.19.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.20.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.21.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.22.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.23.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.24.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.25.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.26.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.27.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.28.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.29.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.30.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.31.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.32.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.33.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.34.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.35.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.36.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.37.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.38.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.39.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.40.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.41.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.42.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.43.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.44.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.45.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.46.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.47.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.48.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.49.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.50.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.51.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.52.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.53.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.54.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.55.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.56.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.57.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.58.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.59.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+Computed blk.60.attn_kv_b.weight as 512 x 32768 and stored in buffer CUDA0
+llama_new_context_with_model: n_ctx = 16384
+llama_new_context_with_model: n_batch = 4096
+llama_new_context_with_model: n_ubatch = 4096
+llama_new_context_with_model: flash_attn = 1
+llama_new_context_with_model: mla_attn = 3
+llama_new_context_with_model: attn_max_b = 1024
+llama_new_context_with_model: fused_moe = 1
+llama_new_context_with_model: ser = 7, 1
+llama_new_context_with_model: freq_base = 10000.0
+llama_new_context_with_model: freq_scale = 0.025
+llama_kv_cache_init: CUDA0 KV buffer size = 583.34 MiB
+llama_new_context_with_model: KV self size = 583.31 MiB, c^KV (q8_0): 583.31 MiB, kv^T: not used
+llama_new_context_with_model: CUDA_Host output buffer size = 0.99 MiB
+llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
+llama_new_context_with_model: CUDA0 compute buffer size = 4496.02 MiB
+llama_new_context_with_model: CUDA1 compute buffer size = 1272.00 MiB
+llama_new_context_with_model: CUDA2 compute buffer size = 1272.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 368.05 MiB
+llama_new_context_with_model: graph nodes = 4219
+llama_new_context_with_model: graph splits = 118
+INFO [ load_model] loading draft model | tid="130964736602112" timestamp=1753532513 model="/home/chico/.lmstudio/models/jukofyork/DeepSeek-R1-DRAFT-0.6B-v2.0-GGUF/DeepSeek-R1-DRAFT-0.6B-128k-Q4_0.gguf"
+llama_model_loader: loaded meta data with 30 key-value pairs and 291 tensors from /home/chico/.lmstudio/models/jukofyork/DeepSeek-R1-DRAFT-0.6B-v2.0-GGUF/DeepSeek-R1-DRAFT-0.6B-128k-Q4_0.gguf (version GGUF V3 (latest))
+llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
+llama_model_loader: - kv 0: general.architecture str = qwen2
+llama_model_loader: - kv 1: general.type str = model
+llama_model_loader: - kv 2: general.name str = DeepSeek R1 0528 DRAFT 0.6B
+llama_model_loader: - kv 3: general.basename str = DeepSeek-R1-0528-DRAFT
+llama_model_loader: - kv 4: general.size_label str = 0.6B
+llama_model_loader: - kv 5: qwen2.block_count u32 = 24
+llama_model_loader: - kv 6: qwen2.context_length u32 = 131072
+llama_model_loader: - kv 7: qwen2.embedding_length u32 = 896
+llama_model_loader: - kv 8: qwen2.feed_forward_length u32 = 4864
+llama_model_loader: - kv 9: qwen2.attention.head_count u32 = 14
+llama_model_loader: - kv 10: qwen2.attention.head_count_kv u32 = 2
+llama_model_loader: - kv 11: qwen2.rope.freq_base f32 = 1000000.000000
+llama_model_loader: - kv 12: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
+llama_model_loader: - kv 13: qwen2.rope.scaling.type str = yarn
+llama_model_loader: - kv 14: qwen2.rope.scaling.factor f32 = 4.000000
+llama_model_loader: - kv 15: qwen2.rope.scaling.original_context_length u32 = 32768
+llama_model_loader: - kv 16: tokenizer.ggml.model str = gpt2
+llama_model_loader: - kv 17: tokenizer.ggml.pre str = deepseek-v3
+llama_model_loader: - kv 18: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
+llama_model_loader: - kv 19: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
+llama_model_loader: - kv 20: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
+llama_model_loader: - kv 21: tokenizer.ggml.bos_token_id u32 = 0
+llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 1
+llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 1
+llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = true
+llama_model_loader: - kv 25: tokenizer.ggml.add_sep_token bool = false
+llama_model_loader: - kv 26: tokenizer.ggml.add_eos_token bool = false
+llama_model_loader: - kv 27: tokenizer.chat_template str = {% if not add_generation_prompt is de...
+llama_model_loader: - kv 28: general.quantization_version u32 = 2
+llama_model_loader: - kv 29: general.file_type u32 = 2
+llama_model_loader: - type f32: 121 tensors
+llama_model_loader: - type q4_0: 169 tensors
+llama_model_loader: - type q8_0: 1 tensors
+llm_load_vocab: special tokens cache size = 818
+llm_load_vocab: token to piece cache size = 0.8223 MB
+llm_load_print_meta: format = GGUF V3 (latest)
+llm_load_print_meta: arch = qwen2
+llm_load_print_meta: vocab type = BPE
+llm_load_print_meta: n_vocab = 129280
+llm_load_print_meta: n_merges = 127741
+llm_load_print_meta: vocab_only = 0
+llm_load_print_meta: n_ctx_train = 131072
+llm_load_print_meta: n_embd = 896
+llm_load_print_meta: n_layer = 24
+llm_load_print_meta: n_head = 14
+llm_load_print_meta: n_head_kv = 2
+llm_load_print_meta: n_rot = 64
+llm_load_print_meta: n_swa = 0
+llm_load_print_meta: n_swa_pattern = 1
+llm_load_print_meta: n_embd_head_k = 64
+llm_load_print_meta: n_embd_head_v = 64
+llm_load_print_meta: n_gqa = 7
+llm_load_print_meta: n_embd_k_gqa = 128
+llm_load_print_meta: n_embd_v_gqa = 128
+llm_load_print_meta: f_norm_eps = 0.0e+00
+llm_load_print_meta: f_norm_rms_eps = 1.0e-06
+llm_load_print_meta: f_clamp_kqv = 0.0e+00
+llm_load_print_meta: f_max_alibi_bias = 0.0e+00
+llm_load_print_meta: f_logit_scale = 0.0e+00
+llm_load_print_meta: n_ff = 4864
+llm_load_print_meta: n_expert = 0
+llm_load_print_meta: n_expert_used = 0
+llm_load_print_meta: causal attn = 1
+llm_load_print_meta: pooling type = 0
+llm_load_print_meta: rope type = 2
+llm_load_print_meta: rope scaling = yarn
+llm_load_print_meta: freq_base_train = 1000000.0
+llm_load_print_meta: freq_scale_train = 0.25
+llm_load_print_meta: n_ctx_orig_yarn = 32768
+llm_load_print_meta: rope_finetuned = unknown
+llm_load_print_meta: ssm_d_conv = 0
+llm_load_print_meta: ssm_d_inner = 0
+llm_load_print_meta: ssm_d_state = 0
+llm_load_print_meta: ssm_dt_rank = 0
+llm_load_print_meta: model type = 1B
+llm_load_print_meta: model ftype = Q4_0
+llm_load_print_meta: model params = 589.568 M
+llm_load_print_meta: model size = 371.738 MiB (5.289 BPW)
+llm_load_print_meta: repeating layers = 192.226 MiB (4.505 BPW, 357.898 M parameters)
+llm_load_print_meta: general.name = DeepSeek R1 0528 DRAFT 0.6B
+llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
+llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
+llm_load_print_meta: LF token = 131 'Ä'
+llm_load_print_meta: max token length = 256
+llm_load_tensors: ggml ctx size = 0.51 MiB
+llm_load_tensors: offloading 24 repeating layers to GPU
+llm_load_tensors: offloading non-repeating layers to GPU
+llm_load_tensors: offloaded 25/25 layers to GPU
+llm_load_tensors: CPU buffer size = 62.14 MiB
+llm_load_tensors: CUDA0 buffer size = 120.15 MiB
+llm_load_tensors: CUDA1 buffer size = 48.06 MiB
+llm_load_tensors: CUDA2 buffer size = 141.41 MiB
+......................................................
+llama_new_context_with_model: n_ctx = 16384
+llama_new_context_with_model: n_batch = 2048
+llama_new_context_with_model: n_ubatch = 512
+llama_new_context_with_model: flash_attn = 0
+llama_new_context_with_model: mla_attn = 0
+llama_new_context_with_model: attn_max_b = 0
+llama_new_context_with_model: fused_moe = 0
+llama_new_context_with_model: ser = -1, 0
+llama_new_context_with_model: freq_base = 1000000.0
+llama_new_context_with_model: freq_scale = 0.25
+llama_kv_cache_init: CUDA0 KV buffer size = 91.88 MiB
+llama_kv_cache_init: CUDA1 KV buffer size = 36.75 MiB
+llama_kv_cache_init: CUDA2 KV buffer size = 18.38 MiB
+llama_new_context_with_model: KV self size = 147.00 MiB, K (q8_0): 51.00 MiB, V (f16): 96.00 MiB
+llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
+llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
+llama_new_context_with_model: CUDA0 compute buffer size = 487.00 MiB
+llama_new_context_with_model: CUDA1 compute buffer size = 487.00 MiB
+llama_new_context_with_model: CUDA2 compute buffer size = 487.00 MiB
+llama_new_context_with_model: CUDA_Host compute buffer size = 33.76 MiB
+llama_new_context_with_model: graph nodes = 773
+llama_new_context_with_model: graph splits = 4
+/home/chico/ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu:604: GGML_ASSERT(src0->ne[2] == src1->ne[2] && src0->ne[2] == dst->ne[2]) failed
+Could not attach to process. If your uid matches the uid of the target
+process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
+again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
+ptrace: Operation not permitted.
+No stack.
+The program is not being run.
+Aborted (core dumped)
+
+---
+
+👤 **usrlocalben** commented on **2025-07-26** at **18:35:07**
+
+tl;dr: I have a working draft config for mainline llama. It works fine with this branch without changes.
+
+I see the same relative perf improvement as mainline. (substantial, about +30-40% TG for codegen or other outputs with recurring patterns)
+
+It isn't clear how to interpret the log output wrt. draft acceptance:
+
+These blobs pop up from time to time during TG. _reuse_i_ is always zero which seems suspicious. However, if I run the same config/prompt w and w/o the draft model, I do see expected change in perf.
+
+```
+llama_speculative_gen_draft: reuse_i = 0, reuse_n = 4693, prompt = 4693
+llama_speculative_gen_draft: n_past = 4695
+ - draft candidate 0, pos 0: 29938 ( 1.000) ' "__'
+ - draft candidate 1, pos 0: 37038 ( 0.000) ' '__'
+ - draft candidate 2, pos 0: 414 ( 0.000) ' "'
+ - draft candidate 0, pos 1: 9885 ( 1.000) 'main'
+ - draft candidate 1, pos 1: 6593 ( 0.000) 'build'
+ - draft candidate 2, pos 1: 13098 ( 0.000) 'async'
+ - draft candidate 0, pos 2: 71703 ( 0.998) '__":
+'
+ - draft candidate 1, pos 2: 1025 ( 0.002) '__'
+ - draft candidate 2, pos 2: 59325 ( 0.000) '__':
+'
+ - draft candidate 0, pos 3: 274 ( 0.999) ' '
+ - draft candidate 1, pos 3: 337 ( 0.001) ' '
+ - draft candidate 2, pos 3: 220 ( 0.000) ' '
+ - draft candidate 0, pos 4: 85632 ( 0.422) ' asyncio'
+ - draft candidate 1, pos 4: 7443 ( 0.223) ' async'
+ - draft candidate 2, pos 4: 5276 ( 0.107) ' await'
+ ```
+
+ my config:
+ ```
+-fa -mla 2 -fmoe
+-b 4096 -ub 4096
+--n-gpu-layers 99
+-c 32000
+-ot exps=CPU
+--top-k 1 --samplers "top_k"
+-m /path/to/k2/DevQuasar/Q4_K_M/moonshotai.Kimi-K2-Instruct_updated.Q4_K_M-00001-of-00053.gguf
+-md /path/to/k2/draft/Kimi-K2-Instruct-DRAFT-0.6B-32k-Q4_0.gguf
+-ngld 99
+```
+
+model is Kimi-K2
+target is [DevQuasar Q4_K_M](https://huggingface.co/DevQuasar/moonshotai.Kimi-K2-Instruct-GGUF)
+draft is [jukofyork](https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0-GGUF)
+
+some perf results, although it's all content dependent so grain of salt etc.
+```
+w/draft
+prompt eval time = 53826.90 ms / 2379 tokens ( 22.63 ms per token, 44.20 tokens per second)
+generation eval time = 162801.04 ms / 2328 runs ( 69.93 ms per token, 14.30 tokens per second)
+ total time = 216627.94 ms
+
+
+w/o draft
+prompt eval time = 53792.43 ms / 2379 tokens ( 22.61 ms per token, 44.23 tokens per second)
+generation eval time = 208580.89 ms / 2358 runs ( 88.46 ms per token, 11.30 tokens per second)
+ total time = 262373.32 ms
+```
+
+and another K2 run, same content
+target is [ubergarm IQ3_KS](https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF)
+draft is [jukofyork](https://huggingface.co/jukofyork/Kimi-K2-Instruct-DRAFT-0.6B-v2.0-GGUF)
+
+```
+w/draft
+prompt eval time = 24146.63 ms / 2379 tokens ( 10.15 ms per token, 98.52 tokens per second)
+generation eval time = 141704.33 ms / 2312 runs ( 61.29 ms per token, 16.32 tokens per second)
+ total time = 165850.96 ms
+
+w/o draft
+prompt eval time = 23862.52 ms / 2379 tokens ( 10.03 ms per token, 99.70 tokens per second)
+generation eval time = 174326.72 ms / 2260 runs ( 77.14 ms per token, 12.96 tokens per second)
+ total time = 198189.24 ms
+```
+
+```
+-fa -mla 2 -fmoe
+-b 4096 -ub 4096
+--n-gpu-layers 99
+-c 32000
+-ot "blk\.(1|2|3|4|5|6)\.ffn_up_exps=CUDA0,blk\.(1|2|3|4|5|6)\.ffn_gate_exps=CUDA0"
+-ot exps=CPU
+-op 26,0,27,0,29,0
+--top-k 1 --samplers "top_k"
+-m /path/to/k2/ubergarm/IQ3_KS/Kimi-K2-Instruct-IQ3_KS-00001-of-00010.gguf
+-md /path/to/k2/draft/Kimi-K2-Instruct-DRAFT-0.6B-32k-Q4_0.gguf
+-ngld 99
+```
+
+hardware is 2S EPYC 9115 NPS0, 24x DDR5, RTX 8000 (Turing)
+
+---
+
+👤 **g2mt** commented on **2025-07-26** at **18:51:50**
+
+@ChicoPinto70
+
+I wonder if this is another tensor name conflict error. I don't have a GPU, so I can't really test this. Could you run the fork in gdb and paste the stack trace here?
+
+> These blobs pop up from time to time during TG. _reuse_i_ is always zero which seems suspicious. However, if I run the same config/prompt w and w/o the draft model, I do see expected change in perf.
+
+@usrlocalben
+reuse_i should be zero most of the time if I understood the original common/speculative.cpp code correctly. It represents the index of the first token in the prompt of the draft model that can be reused (basically the first index of the same token). If the prompt isn't changed across generation then it should stay at 0.
+
+---
+
+👤 **ChicoPinto70** commented on **2025-07-26** at **20:23:16**
+
+> @ChicoPinto70
+>
+> I wonder if this is another tensor name conflict error. I don't have a GPU, so I can't really test this. Could you run the fork in gdb and paste the stack trace here?
+>
+> > These blobs pop up from time to time during TG. _reuse_i_ is always zero which seems suspicious. However, if I run the same config/prompt w and w/o the draft model, I do see expected change in perf.
+>
+> @usrlocalben reuse_i should be zero most of the time if I understood the original common/speculative.cpp code correctly. It represents the index of the first token in the prompt of the draft model that can be reused (basically the first index of the same token). If the prompt isn't changed across generation then it should stay at 0.
+
+Sure! Did I make it right?
+
+(gdb) backtrace
+[#0](https://github.com/ikawrakow/ik_llama.cpp/issues/0) __pthread_kill_implementation (no_tid=0, signo=6, threadid=) at ./nptl/pthread_kill.c:44
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78
+[#2](https://github.com/ikawrakow/ik_llama.cpp/issues/2) __GI___pthread_kill (threadid=, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
+[#3](https://github.com/ikawrakow/ik_llama.cpp/issues/3) 0x00007fffe444527e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
+[#4](https://github.com/ikawrakow/ik_llama.cpp/issues/4) 0x00007fffe44288ff in __GI_abort () at ./stdlib/abort.c:79
+[#5](https://github.com/ikawrakow/ik_llama.cpp/issues/5) 0x00007fffe4cb588c in ggml_abort (file=0x7fffe61cf590 "/home/chico/ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu", line=604,
+ fmt=0x7fffe61cfb5a "GGML_ASSERT(%s) failed") at /home/chico/ik_llama.cpp/ggml/src/ggml.c:270
+[#6](https://github.com/ikawrakow/ik_llama.cpp/issues/6) 0x00007fffe4e4713f in ggml_cuda_op_mul_mat_vec_q_3D (ctx=..., src0=0x7fba52900fc0, src1=0x7fba52900e50,
+ dst=0x7fba529012a0, src0_dd_i=0x1d02000000 "", src1_ddf_i=0x7fb590540800, src1_ddq_i=0x1d02001380 "",
+ dst_dd_i=0x7fb590700800, row_low=0, row_high=32, src1_ncols=1, src1_padded_row_size=512, stream=0x5555672e3690)
+ at /home/chico/ik_llama.cpp/ggml/src/ggml-cuda/mmvq.cu:604
+[#7](https://github.com/ikawrakow/ik_llama.cpp/issues/7) 0x00007fffe4efcc6e in ggml_cuda_op_mul_mat (ctx=..., src0=0x7fba52900fc0, src1=0x7fba52900e50, dst=0x7fba529012a0,
+ op=0x7fffe4e471c3 ,
+ quantize_src1=0x7fffe4ede4af ) at /home/chico/ik_llama.cpp/ggml/src/ggml-cuda.cu:1658
+[#8](https://github.com/ikawrakow/ik_llama.cpp/issues/8) 0x00007fffe4eff807 in ggml_cuda_mul_mat (ctx=..., src0=0x7fba52900fc0, src1=0x7fba52900e50, dst=0x7fba529012a0)
+ at /home/chico/ik_llama.cpp/ggml/src/ggml-cuda.cu:2176
+[#9](https://github.com/ikawrakow/ik_llama.cpp/issues/9) 0x00007fffe4f05314 in ggml_cuda_compute_forward (ctx=..., dst=0x7fba529012a0, next=0x7fba52901410,
+ skip_next=@0x7fffffffa5e0: false) at /home/chico/ik_llama.cpp/ggml/src/ggml-cuda.cu:2937
+[#10](https://github.com/ikawrakow/ik_llama.cpp/issues/10) 0x00007fffe4f069f8 in ggml_backend_cuda_graph_compute (backend=0x555565795070, cgraph=0x555564d3bcc8)
+ at /home/chico/ik_llama.cpp/ggml/src/ggml-cuda.cu:3327
+[#11](https://github.com/ikawrakow/ik_llama.cpp/issues/11) 0x00007fffe4d0e0d3 in ggml_backend_graph_compute_async (backend=0x555565795070, cgraph=0x555564d3bcc8)
+ at /home/chico/ik_llama.cpp/ggml/src/ggml-backend.c:317
+[#12](https://github.com/ikawrakow/ik_llama.cpp/issues/12) 0x00007fffe4d12c1f in ggml_backend_sched_compute_splits (sched=0x555564d398d0)
+ at /home/chico/ik_llama.cpp/ggml/src/ggml-backend.c:1887
+[#13](https://github.com/ikawrakow/ik_llama.cpp/issues/13) 0x00007fffe4d13831 in ggml_backend_sched_graph_compute_async (sched=0x555564d398d0, graph=0x7fba526fb030)
+ at /home/chico/ik_llama.cpp/ggml/src/ggml-backend.c:2081
+[#14](https://github.com/ikawrakow/ik_llama.cpp/issues/14) 0x00007ffff7cea0de in llama_graph_compute (lctx=..., gf=0x7fba526fb030, n_threads=36)
+ at /home/chico/ik_llama.cpp/src/llama.cpp:18241
+[#15](https://github.com/ikawrakow/ik_llama.cpp/issues/15) 0x00007ffff7cead49 in llama_decode_internal (lctx=..., batch_all=...) at /home/chico/ik_llama.cpp/src/llama.cpp:18457
+[#16](https://github.com/ikawrakow/ik_llama.cpp/issues/16) 0x00007ffff7cfd6f7 in llama_decode (ctx=0x55556777b1f0, batch=...) at /home/chico/ik_llama.cpp/src/llama.cpp:22945
+[#17](https://github.com/ikawrakow/ik_llama.cpp/issues/17) 0x000055555575aec4 in llama_init_from_gpt_params (params=...) at /home/chico/ik_llama.cpp/common/common.cpp:2414
+[#18](https://github.com/ikawrakow/ik_llama.cpp/issues/18) 0x0000555555639ffd in server_context::load_model (this=0x7fffffffc9b0, params_=...)
+ at /home/chico/ik_llama.cpp/examples/server/server.cpp:919
+[#19](https://github.com/ikawrakow/ik_llama.cpp/issues/19) 0x00005555556063e4 in main (argc=42, argv=0x7fffffffd948) at /home/chico/ik_llama.cpp/examples/server/server.cpp:3386
+(gdb) down
+Bottom (innermost) frame selected; you cannot go down.
+(gdb) up
+[#1](https://github.com/ikawrakow/ik_llama.cpp/issues/1) __pthread_kill_internal (signo=6, threadid=) at ./nptl/pthread_kill.c:78
+78 in ./nptl/pthread_kill.c
+(gdb)
+
+---
+
+👤 **saood06** commented on **2025-07-26** at **20:45:22**
+
+> > If someone else could post more test results that would be great. I'll open the PR up for review now.
+>
+> I'll try to do some tests within a day.
+
+So far I've compiled and ran the new R1 with a draft model (using [this](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.6B-v2.0-GGUF/blob/main/DeepSeek-R1-DRAFT-0.6B-64k-Q4_0.gguf) one). With both the Draft and main model on my CPU (this server has no GPU).
+
+(I'll report performance numbers later, with tests at temp 0 for the best potential draft rate).
+
+I'll do some more tests on Windows where I have a GPU, and RPC (I didn't see code that would allow you to offload your draft model via RPC but I might try it anyway), and with other models (can't run Deepseek on my Windows machine) and more thoroughly review (and submit potential comments) the code next.
+
+Edit: For now I'm not sure if I agree with what the logging levels are "speculative decoding result" seems a lot more useful to print by default than the spam of `reuse_i [...]` and `draft candidate`
+
+Edit 2: I know performance with drafting has a lot of variables and factors so take that into consideration before the numbers below.
+
+Without draft model:
+`generation eval time = 423525.84 ms / 1155 runs ( 366.69 ms per token, 2.73 tokens per second)`
+
+With draft model:
+`generation eval time = 416970.47 ms / 1083 runs ( 385.01 ms per token, 2.60 tokens per second)`
+
+So for these conditions (temperature 0, the specific models and hardware used) this resulted in worse performance.
+
+The response also did change in the second paragraph of the thinking process which I'm fairly certain should not happen at temperature 0. (There is a chance this could be my error given I was testing this using the new WebUI that I am far less familiar with but looking at the other artifacts '/slots' endpoint the second test (without draft model) the same prompt was sent with 0 temp so the testing seems valid to me).
+
+---
+
+👤 **g2mt** commented on **2025-07-26** at **21:30:50**
+
+I think I see the problem. It seems that building the computation graph creates new tensors with conflicting names:
+
+https://github.com/ikawrakow/ik_llama.cpp/blob/7093a35869670cf954bd1ba843df8ccf0c2867f2/src/llama.cpp#L17445-L17461
+
+The code that builds the computation graph (and maybe the rest of src/llama.cpp) uses a lot of hardcoded names. This would not work if there's more than one model being loaded. I think a bigger PR is needed for refactoring the file. My guess is that the code sections that search for hard coded tensor names aren't triggered during normal CPU-only inference. I'm not really familiar with this code base to make that assessment though.
+
+Side note, wow the file is huge. clangd doesn't even show me autocompletions.
+
+---
+
+👤 **saood06** commented on **2025-07-26** at **21:42:06**
+
+>My guess is that the code sections that search for hard coded tensor names aren't triggered during normal CPU-only inference. I'm not really familiar with this code base to make that assessment though.
+
+But given that it worked for usrlocalben who was using pure GPU inference `--n-gpu-layers 99` and `-ngld 99` could it be the issue is when more than one backend type is used?
+
+I could try and confirm that theory later on my Windows system.
+
+>Side note, wow the file is huge. clangd doesn't even show me autocompletions.
+
+Github doesn't index it for search or allow blame, or syntax highlight it. As mentioned [here](https://github.com/ikawrakow/ik_llama.cpp/issues/472#issuecomment-2924324079), it is something that would be nice to refactor but no one has done it yet.
+
+---
+
+👤 **usrlocalben** commented on **2025-07-26** at **23:49:10**
+
+@saood06 my setup is mixed GPU/CPU. The tensor offload pattern rules have precedence, but -ngl is still needed to make them available for GPU.
+
+In case anyone in this thread is unaware, speculative generation perf tends to be dependent on the content. Code happens to have a good outcome, and repetitive code even better.
+
+[This thread](https://github.com/ggml-org/llama.cpp/discussions/10466) from mainline has some discussion on it.
+
+here's one target quant:
+```
+IQ3_KS (ubergarm)
+ TG 8.44 t/s speculative, general: "summarize this EULA"
+ TG 11.32 t/s normal
+ TG 13.87 t/s speculative, code: "rewrite using async/await"
+```
+
+My GPU (full offload of the draft model) is older/slower (Turing). Maybe newer tech would perform better in the face of low draft hit-rate.
+
+---
+
+👤 **saood06** commented on **2025-07-27** at **00:09:12**
+
+> @saood06 my setup is mixed GPU/CPU. The tensor offload pattern rules have precedence, but -ngl is still needed to make them available for GPU.
+
+Sorry I noticed that when I first read your post, but missed it when I came back to it.
+
+
+> In case anyone in this thread is unaware, speculative generation perf tends to be dependent on the content. Code happens to have a good outcome, and repetitive code even better.
+>
+> [This thread](https://github.com/ggml-org/llama.cpp/discussions/10466) from mainline has some discussion on it.
+
+That is a good thread, thanks for the link.
+
+> here's one target quant:
+>
+> ```
+> IQ3_KS (ubergarm)
+> TG 8.44 t/s speculative, general: "summarize this EULA"
+> TG 11.32 t/s normal
+> TG 13.87 t/s speculative, code: "rewrite using async/await"
+> ```
+>
+
+Nice. I do see you run `--top-k 1 --samplers "top_k` but I wonder how much using a non zero temperature would impact this.
+
+> My GPU (full offload of the draft model) is older/slower (Turing). Maybe newer tech would perform better in the face of low draft hit-rate.
+
+Well the faster your draft model the more tokens it produces so that may lower hit-rate even more, but thinking about performance when the draft and target on the same hardware seems complicated.
+
+---
+
+👤 **ikawrakow** commented on **2025-07-27** at **04:45:49**
+
+Why is the presence of tensors with the same name in the two models a problem? And how does it work in mainline, where tensor names are given in the same way?
+
+---
+
+👤 **saood06** commented on **2025-07-27** at **05:05:38**
+
+>Why is the presence of tensors with the same name in the two models a problem? And how does it work in mainline, where tensor names are given in the same way?
+
+I'm not sure I agree with that conclusion of the issue.
+
+It fails in the ggml_cuda_op_mul_mat_vec_q_3D of a llama_decode_internal call in the llama_graph_compute call. I'm going to test on my Windows machine with my 3090 shortly and see how things work there with different configs.
+
+I'm trying to see if I can get RPC working now.
+
+---
+
+👤 **saood06** commented on **2025-07-27** at **05:08:54**
+
+I see you just pushed a revert. Was that not needed?
+
+---
+
+👤 **g2mt** commented on **2025-07-27** at **05:09:07**
+
+> Why is the presence of tensors with the same name in the two models a problem? And how does it work in mainline, where tensor names are given in the same way?
+
+Turns out I did something weird with my environment which caused an error when loading the llama-server binary. I reverted the KV cache change. Sorry for any misunderstandings.
+
+It should stiil work, not sure about the CUDA problem though.
+
+---
+
+👤 **saood06** commented on **2025-07-27** at **11:00:51**
+
+I modified the code (inspired by the `-ot PR`: https://github.com/ikawrakow/ik_llama.cpp/pull/232) and got it to allocate all the buffers of the draft model to my RPC node.
+
+
+```
+Tensor blk.0.attn_k.weight buffer type overriden to RPC[10.0.0.250:50052]
+Tensor blk.0.attn_v.weight buffer type overriden to RPC[10.0.0.250:50052]
+Tensor blk.0.attn_output.weight buffer type overriden to RPC[10.0.0.250:50052]
+[...]
+Tensor blk.23.ffn_gate.weight buffer type overriden to RPC[10.0.0.250:50052]
+Tensor blk.23.ffn_down.weight buffer type overriden to RPC[10.0.0.250:50052]
+Tensor blk.23.ffn_up.weight buffer type overriden to RPC[10.0.0.250:50052]
+```
+but then it crashes with
+
+```
+llama_model_load: error loading model: failed to allocate buffer
+llama_load_model_from_file: failed to load model
+llama_init_from_gpt_params: error: failed to load model '/mnt/sda/draft_models/R1/DeepSeek-R1-DRAFT-0.6B-64k-Q4_0.gguf'
+```
+
+Got a backtrace:
+
+```
+#0 __GI___libc_free () at malloc.c:3378
+#1 0x0000555555417250 in llama_batch_free (batch=...) at /home/saood06/ik_temp/ik_llama.cpp/src/llama.cpp:22920
+#2 0x00005555555cde00 in server_context::~server_context (this=0x7fffffffb290, __in_chrg=) at /home/saood06/ik_temp/ik_llama.cpp/examples/server/server.cpp:882
+#3 0x000055555559129f in main (argc=, argv=) at /home/saood06/ik_temp/ik_llama.cpp/examples/server/server.cpp:4344
+```
+Not sure why right now. Stepping off for now, will test my 3090 and Windows tomorrow.
+
+---
+
+👤 **ChicoPinto70** commented on **2025-07-27** at **12:17:36**
+
+> > Why is the presence of tensors with the same name in the two models a problem? And how does it work in mainline, where tensor names are given in the same way?
+>
+> Turns out I did something weird with my environment which caused an error when loading the llama-server binary. I reverted the KV cache change. Sorry for any misunderstandings.
+>
+> It should stiil work, not sure about the CUDA problem though.
+
+Hi, I tested the current branch and the CUDA problem persists....
+
+I also tested run the draft model without offloading it to gpu (-ngld 0), but still keeping the full model in gpu, and it worked, but the TG speed plummed and it was much slower than without the draft (from ~7 T/s to 0.7 T/s).
\ No newline at end of file
diff --git a/github-data/pull_requests/648 - Fix missing token per second for webui after function call update.md b/github-data/pull_requests/648 - Fix missing token per second for webui after function call update.md
new file mode 100644
index 000000000..8b8261f6c
--- /dev/null
+++ b/github-data/pull_requests/648 - Fix missing token per second for webui after function call update.md
@@ -0,0 +1,48 @@
+## 🔀 [Pull Request #648](https://github.com/ikawrakow/ik_llama.cpp/pull/648) - Fix missing token per second for webui after function call update
+
+| **Author** | `firecoperana` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `fcp/missing_token_ps` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-25 |
+| **Updated** | 2025-07-27 |
+| **Merged** | 2025-07-27 |
+| **Assignees** | `firecoperana` |
+
+---
+
+## 📄 Description
+
+1. Moves Preset to top of settings window for easier navigation
+2. Send timings in streaming_chunks otherwise webui won't show token per second
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [x] Low
+ - [ ] Medium
+ - [ ] High
+
+---
+
+## 💬 Conversation
+
+👤 **saood06** approved this pull request ✅ on **2025-07-27** at **01:51:35**
+
+LGTM.
+
+Tested and see the t/s measurements which was missing when testing main.
+
+---
+
+👤 **saood06** commented on **2025-07-27** at **02:31:21**
+
+@firecoperana
+
+Just to let you know the [CONTRIBUTING.md](https://github.com/ikawrakow/ik_llama.cpp/blob/main/CONTRIBUTING.md) says to "Squash-merge PRs", you created a merge commit, which is different. I know that document was created from mainline and hasn't been touched here, but it does seem that squash merge commits are the norm here as well.
+
+---
+
+👤 **firecoperana** commented on **2025-07-27** at **03:43:59**
+
+Thanks for the heads up. Will do that next time.
\ No newline at end of file
diff --git a/github-data/pull_requests/65 - Adding SWIGLU unary op.md b/github-data/pull_requests/65 - Adding SWIGLU unary op.md
index 8686be44e..77074d326 100644
--- a/github-data/pull_requests/65 - Adding SWIGLU unary op.md
+++ b/github-data/pull_requests/65 - Adding SWIGLU unary op.md
@@ -1,14 +1,17 @@
-### 🔀 [#65](https://github.com/ikawrakow/ik_llama.cpp/pull/65) - Adding SWIGLU unary op
+## 🔀 [Pull Request #65](https://github.com/ikawrakow/ik_llama.cpp/pull/65) - Adding SWIGLU unary op
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/swiglu` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-28 |
| **Updated** | 2024-09-28 |
+| **Merged** | 2024-09-28 |
---
-#### Description
+## 📄 Description
Phi-3(.5) (and also ChatGLM) uses a "SWIGLU" operation in its FFN. There is nothing special about "SWIGLU", it is just that the `ffn_up` tensor is actually a combination of the usual `ffn_up` and `ffn_gate` tensors, where in each row the first half contains the `ffn_up` weights and the second half has the `ffn_gate` weights. So that, to implement
```
@@ -51,9 +54,9 @@ This results in an additional 2-3% speedup of PP-512(Phi-3.5-mini) when running
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2024-09-28** at **10:07:59**:
+👤 **ikawrakow** commented on **2024-09-28** at **10:07:59**
OK, Phi-3.5 has a 128k context, so let's run a benchmark with a longer context, say, 8k tokens. Here is what I get after this PR on a Ryzen-7950X CPU for Phi-3.5-mini:
diff --git a/github-data/pull_requests/652 - Deepseek R1 function calls more formats.md b/github-data/pull_requests/652 - Deepseek R1 function calls more formats.md
new file mode 100644
index 000000000..e2f2c70aa
--- /dev/null
+++ b/github-data/pull_requests/652 - Deepseek R1 function calls more formats.md
@@ -0,0 +1,25 @@
+## 🔀 [Pull Request #652](https://github.com/ikawrakow/ik_llama.cpp/pull/652) - Deepseek R1 function calls (more formats)
+
+| **Author** | `iSevenDays` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Source Branch** | `deepseek-r1-parsing` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-26 |
+| **Updated** | 2025-07-26 |
+
+---
+
+## 📄 Description
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [ ] Low
+ - [x] Medium
+ - [ ] High
+
+
+Implemented more DeepSeek R1 supported function tool calls formats.
+The diff of `examples/server/function_calls.md` shows what formats are supported.
+
+I was testing DeepSeek R1 and I found out that it often uses different formats with Claude Code, so I decided to support them as well. It can be useful when next version of DeepSeek is released, so we will have better support than even original llama.cpp
\ No newline at end of file
diff --git a/github-data/pull_requests/653 - Add GitHub data backup and convertion scripts backup update.md b/github-data/pull_requests/653 - Add GitHub data backup and convertion scripts backup update.md
new file mode 100644
index 000000000..c22b2f1c4
--- /dev/null
+++ b/github-data/pull_requests/653 - Add GitHub data backup and convertion scripts backup update.md
@@ -0,0 +1,125 @@
+## 🔀 [Pull Request #653](https://github.com/ikawrakow/ik_llama.cpp/pull/653) - Add GitHub data: backup and convertion scripts + backup update
+
+| **Author** | `ThomasBaruzier` |
+| :--- | :--- |
+| **State** | ✅ **Open** |
+| **Source Branch** | `tb/github-data-scripts` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-26 |
+| **Updated** | 2025-07-28 |
+
+---
+
+## 📄 Description
+
+Hello!
+
+I've refined the scraping and conversion scripts. While they should work with any repository, I haven't extensively tested them beyond the current use case. For this repository, the scripts consistently complete in ~30 seconds (initially 750s!) using just ~~11~~ 10 API requests to fetch all issues, pull requests, and discussions.
+
+I initially explored resumable/incremental scraping but abandoned the idea due to reliability issues: the `updatedAt` field only reflects edits to the issue/PR body, not new activity. Instead, I focused on optimization, achieving the results below.
+
+---
+
+### Usage
+```bash
+export GITHUB_TOKEN='github_pat_...'
+cd github-data
+rm -rf issues discussions pull_requests index.md ik.json
+python ghscrape.py ikawrakow/ik_llama.cpp -o ik.json
+python ghconvert.py ik.json -o .
+```
+
+#### Or as a one-liner (ensure that you are in `github-data/`):
+```bash
+rm -rf issues discussions pull_requests index.md ik.json && GITHUB_TOKEN='github_pat_...' python ghscrape.py ikawrakow/ik_llama.cpp -o ik.json && python ghconvert.py ik.json -o .
+```
+
+---
+
+### Scraping Demo
+
+```
+python ghscrape.py ikawrakow/ik_llama.cpp -o ik.json
+```
+```
+INFO: Fetching all issues...
+INFO: API Rate Limit (Req #1): 4997 points remaining, resets in 59m 54s.
+INFO: Processed 100 issues...
+INFO: API Rate Limit (Req #2): 4994 points remaining, resets in 59m 52s.
+INFO: Processed 131 issues...
+INFO: Fetching all nested data for 131 items (1 pages)...
+INFO: API Rate Limit (Req #3): 4993 points remaining, resets in 59m 52s.
+INFO: Processed batch of 1 pages. 0 pages remaining.
+INFO: Structuring final items for issues...
+INFO: Finished issues: Found and processed 131 items.
+INFO: Fetching all pull requests...
+INFO: API Rate Limit (Req #4): 4888 points remaining, resets in 59m 49s.
+INFO: Processed 100 pull_requests...
+INFO: API Rate Limit (Req #5): 4783 points remaining, resets in 59m 46s.
+INFO: Processed 200 pull_requests...
+INFO: API Rate Limit (Req #6): 4678 points remaining, resets in 59m 41s.
+INFO: Processed 300 pull_requests...
+INFO: API Rate Limit (Req #7): 4573 points remaining, resets in 59m 36s.
+INFO: Processed 400 pull_requests...
+INFO: API Rate Limit (Req #8): 4468 points remaining, resets in 59m 34s.
+INFO: Processed 452 pull_requests...
+INFO: Fetching all nested data for 452 items (0 pages)...
+INFO: Structuring final items for pull_requests...
+INFO: Finished pull_requests: Found and processed 452 items.
+INFO: Fetching all discussions...
+INFO: API Rate Limit (Req #9): 4366 points remaining, resets in 59m 30s.
+INFO: Processed 71 discussions...
+INFO: Fetching all nested data for 71 items (1 pages)...
+INFO: API Rate Limit (Req #10): 4365 points remaining, resets in 59m 29s.
+INFO: Processed batch of 1 pages. 0 pages remaining.
+INFO: Structuring final items for discussions...
+INFO: Finished discussions: Found and processed 71 items.
+INFO: Data successfully saved to ik.json
+INFO: Total execution time: 30.55 seconds
+```
+
+### Conversion Demo
+
+```
+python ghconvert.py ik.json -o .
+```
+```
+Processing 131 issues...
+Processing 452 pull_requests...
+Processing 71 discussions...
+Generating index.md summary file...
+Successfully generated 654 Markdown files.
+Files are in the '.' directory.
+```
+
+---
+
+### Relevant links jump:
+
+Scripts:
+- https://github.com/ThomasBaruzier/ik_llama.cpp/blob/tb/github-data-scripts/github-data/ghscrape.py
+- https://github.com/ThomasBaruzier/ik_llama.cpp/blob/tb/github-data-scripts/github-data/ghconvert.py
+
+Index:
+- https://github.com/ThomasBaruzier/ik_llama.cpp/blob/tb/github-data-scripts/github-data/index.md
+
+Discussion example:
+- https://github.com/ThomasBaruzier/ik_llama.cpp/blob/tb/github-data-scripts/github-data/discussions/477%20-%20DeepSeek-R1-0528%20ik%20quants.md
+
+PR example:
+- https://github.com/ThomasBaruzier/ik_llama.cpp/blob/tb/github-data-scripts/github-data/pull_requests/620%20-%20Bump%20Windows%20max%20open%20files%20from%20512%20to%202048.md
+
+Issue example:
+- https://github.com/ThomasBaruzier/ik_llama.cpp/blob/tb/github-data-scripts/github-data/issues/296%20-%20Possible%20numerical%20stability%20issue%20with%20experimental%20quant%20of%20DeepSeek-V3-0324.md
+
+---
+
+### Notes
+- ~~Content extraction for reviews isn’t fully implemented yet (see [example](https://github.com/ThomasBaruzier/ik_llama.cpp/blob/tb/github-data-scripts/github-data/pull_requests/620%20-%20Bump%20Windows%20max%20open%20files%20from%20512%20to%202048.md)). This could be added later if needed.~~ Fixed.
+- Wiki backups are not implemented.
+
+- [x] I’ve read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md).
+- **Self-reported review complexity**:
+ - [ ] Low
+ - [x] Medium
+ - [ ] High
\ No newline at end of file
diff --git a/github-data/pull_requests/654 - Fix text generation endpoint.md b/github-data/pull_requests/654 - Fix text generation endpoint.md
new file mode 100644
index 000000000..7a8974ef9
--- /dev/null
+++ b/github-data/pull_requests/654 - Fix text generation endpoint.md
@@ -0,0 +1,54 @@
+## 🔀 [Pull Request #654](https://github.com/ikawrakow/ik_llama.cpp/pull/654) - Fix text generation endpoint
+
+| **Author** | `iSevenDays` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `patch-2` |
+| **Target Branch** | `main` |
+| **Created** | 2025-07-26 |
+| **Updated** | 2025-07-27 |
+| **Merged** | 2025-07-27 |
+
+---
+
+## 📄 Description
+
+- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
+- Self-reported review complexity:
+ - [x] Low
+ - [ ] Medium
+ - [ ] High
+ - [ ]
+The recent function call implementation changed streaming responses to always send empty content with diffs, which broke text completion streaming endpoints (like those used by mikupad) that need actual token content in each streaming chunk. This fix differentiates between OpenAI-compatible chat completion (which uses diffs) and text completion endpoints (which need actual content) using the existing slot.oaicompat flag.
+
+---
+
+## 💬 Conversation
+
+👤 **iSevenDays** commented on **2025-07-26** at **18:38:01**
+
+The fix has been also verified by another person here https://github.com/ikawrakow/ik_llama.cpp/pull/628#issuecomment-3122219232
+
+---
+
+👤 **saood06** commented on **2025-07-27** at **00:15:39**
+
+> The fix has been also verified by another person here [[#628](https://github.com/ikawrakow/ik_llama.cpp/issues/628) (comment)](https://github.com/ikawrakow/ik_llama.cpp/pull/628#issuecomment-3122219232)
+
+And by me.
+
+---
+
+👤 **saood06** approved this pull request ✅ on **2025-07-27** at **00:17:02**
+
+Tested.
+
+This restored functionality to the `/completion` endpoint
+
+---
+
+👤 **saood06** commented on **2025-07-27** at **00:32:21**
+
+@ikawrakow
+
+I've been very intentional in not pushing code into branches that are not mine (including main) without your approval as this is your repo, but I am making an exception in this case as this is a very minor change, that fixes a rather serious bug and you are out on vacation.
\ No newline at end of file
diff --git a/github-data/pull_requests/66 - CUDA non-contiguous RoPE.md b/github-data/pull_requests/66 - CUDA non-contiguous RoPE.md
index e5801a547..bfd01932d 100644
--- a/github-data/pull_requests/66 - CUDA non-contiguous RoPE.md
+++ b/github-data/pull_requests/66 - CUDA non-contiguous RoPE.md
@@ -1,18 +1,21 @@
-### 🔀 [#66](https://github.com/ikawrakow/ik_llama.cpp/pull/66) - CUDA non-contiguous RoPE
+## 🔀 [Pull Request #66](https://github.com/ikawrakow/ik_llama.cpp/pull/66) - CUDA non-contiguous RoPE
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/non_contiguous_rope` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-28 |
| **Updated** | 2024-09-28 |
+| **Merged** | 2024-09-28 |
---
-#### Description
+## 📄 Description
-In this way we can avoid the Q, K, V copies being made after multiplication with the QKV tensor in, e.g., Phi-3.5-mini (see #65 for details). This results in a 6-7% speedup of PP-512(Phi-3.5-mini) on CUDA (RTX-4080). There is also a 2-3% gain on Metal (M2-Max GPU).
+In this way we can avoid the Q, K, V copies being made after multiplication with the QKV tensor in, e.g., Phi-3.5-mini (see [#65](https://github.com/ikawrakow/ik_llama.cpp/issues/65) for details). This results in a 6-7% speedup of PP-512(Phi-3.5-mini) on CUDA (RTX-4080). There is also a 2-3% gain on Metal (M2-Max GPU).
-Here is the combined effect of this PR and PR #65 on CUDA (RTX-4080) and Metal (M2-Max 30-core GPU) for Phi-3.5-mini:
+Here is the combined effect of this PR and PR [#65](https://github.com/ikawrakow/ik_llama.cpp/issues/65) on CUDA (RTX-4080) and Metal (M2-Max 30-core GPU) for Phi-3.5-mini:
| model | backend | ngl | threads | test | t/s (llama.cpp) | t/s (this PR) | Speedup |
| -------------| ---------- | --: | ------: | ------------: | -------------------: | ---------------: | -------: |
@@ -23,9 +26,9 @@ Here is the combined effect of this PR and PR #65 on CUDA (RTX-4080) and Metal (
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** commented the **2024-09-28** at **12:42:05**:
+👤 **ikawrakow** commented on **2024-09-28** at **12:42:05**
So, I see that there are a lot of models that can potentially benefit from this PR as the pattern
```
diff --git a/github-data/pull_requests/68 - It is time to fix replace_all.md b/github-data/pull_requests/68 - It is time to fix replace_all.md
index a72ca27ba..b175a9fb5 100644
--- a/github-data/pull_requests/68 - It is time to fix replace_all.md
+++ b/github-data/pull_requests/68 - It is time to fix replace_all.md
@@ -1,14 +1,17 @@
-### 🐛 [#68](https://github.com/ikawrakow/ik_llama.cpp/pull/68) - It is time to fix replace_all
+## 🔀 [Pull Request #68](https://github.com/ikawrakow/ik_llama.cpp/pull/68) - It is time to fix replace_all
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_replace_all` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-28 |
| **Updated** | 2024-09-28 |
+| **Merged** | 2024-09-28 |
---
-#### Description
+## 📄 Description
I have been annoyed by having to wait for close to 2 seconds for the perplexity calculation to start because that's how long tokenization took when using Phi-3.5-mini (not to mention the close to 20 seconds wait when running an imatrix calculation with `wiki.train.raw`). Today my patience got exhausted and I decided to investigate. Turns out I inherited this gem when I last synced with mainline `llama.cpp` (in `src/llama-impl.h`):
```
diff --git a/github-data/pull_requests/69 - Allow bf16 kv-cache.md b/github-data/pull_requests/69 - Allow bf16 kv-cache.md
index fdcb5b20b..5fdb21270 100644
--- a/github-data/pull_requests/69 - Allow bf16 kv-cache.md
+++ b/github-data/pull_requests/69 - Allow bf16 kv-cache.md
@@ -1,13 +1,16 @@
-### 🔀 [#69](https://github.com/ikawrakow/ik_llama.cpp/pull/69) - Allow bf16 kv-cache
+## 🔀 [Pull Request #69](https://github.com/ikawrakow/ik_llama.cpp/pull/69) - Allow bf16 kv-cache
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bf16_kv_cache` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-29 |
| **Updated** | 2024-09-29 |
+| **Merged** | 2024-09-29 |
---
-#### Description
+## 📄 Description
On the CPU I get the exact same PPL with and without FA using `bf16` for kv-cache. But on CUDA the `bf16` kv-cache result is about the same as the `fp16` kv-cache CPU result, so I'm missing some conversion somewhere. Either way, we can now run on all platforms supported here with `bf16` kv-cache.
\ No newline at end of file
diff --git a/github-data/pull_requests/7 - Adding IQ2_K IQ3_K and IQ5_K.md b/github-data/pull_requests/7 - Adding IQ2_K IQ3_K and IQ5_K.md
new file mode 100644
index 000000000..2bd0810a8
--- /dev/null
+++ b/github-data/pull_requests/7 - Adding IQ2_K IQ3_K and IQ5_K.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #7](https://github.com/ikawrakow/ik_llama.cpp/pull/7) - Adding IQ2_K, IQ3_K and IQ5_K
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2_k` |
+| **Target Branch** | `main` |
+| **Created** | 2024-07-31 |
+| **Updated** | 2024-08-01 |
+| **Merged** | 2024-08-01 |
+
+---
+
+## 📄 Description
+
+See [this discussion](https://github.com/ikawrakow/ik_llama.cpp/discussions/8) for rationale.
\ No newline at end of file
diff --git a/github-data/pull_requests/7 - Adding IQ2_K_ IQ3_K and IQ5_K.md b/github-data/pull_requests/7 - Adding IQ2_K_ IQ3_K and IQ5_K.md
deleted file mode 100644
index a5d9f3611..000000000
--- a/github-data/pull_requests/7 - Adding IQ2_K_ IQ3_K and IQ5_K.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🔀 [#7](https://github.com/ikawrakow/ik_llama.cpp/pull/7) - Adding IQ2_K, IQ3_K and IQ5_K
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-07-31 |
-| **Updated** | 2024-08-01 |
-
----
-
-#### Description
-
-See [this discussion](https://github.com/ikawrakow/ik_llama.cpp/discussions/8) for rationale.
\ No newline at end of file
diff --git a/github-data/pull_requests/70 - Fused unary_x_y.md b/github-data/pull_requests/70 - Fused unaryxy.md
similarity index 57%
rename from github-data/pull_requests/70 - Fused unary_x_y.md
rename to github-data/pull_requests/70 - Fused unaryxy.md
index c5e93af3f..ce164b3f6 100644
--- a/github-data/pull_requests/70 - Fused unary_x_y.md
+++ b/github-data/pull_requests/70 - Fused unaryxy.md
@@ -1,14 +1,17 @@
-### 🔀 [#70](https://github.com/ikawrakow/ik_llama.cpp/pull/70) - Fused unary(x)*y
+## 🔀 [Pull Request #70](https://github.com/ikawrakow/ik_llama.cpp/pull/70) - Fused unary(x)*y
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fused_mul_unary` |
+| **Target Branch** | `main` |
| **Created** | 2024-09-30 |
| **Updated** | 2024-10-02 |
+| **Merged** | 2024-10-02 |
---
-#### Description
+## 📄 Description
This is useful for parallel FFNs. `unary` can be `silu, gelu` or `relu`.
diff --git a/github-data/pull_requests/71 - iqk_mul_mat_ better srategy when nrc_y not divisible by ny.md b/github-data/pull_requests/71 - iqk_mul_mat better srategy when nrc_y not divisible by ny.md
similarity index 84%
rename from github-data/pull_requests/71 - iqk_mul_mat_ better srategy when nrc_y not divisible by ny.md
rename to github-data/pull_requests/71 - iqk_mul_mat better srategy when nrc_y not divisible by ny.md
index acac59a6a..dd1ae280e 100644
--- a/github-data/pull_requests/71 - iqk_mul_mat_ better srategy when nrc_y not divisible by ny.md
+++ b/github-data/pull_requests/71 - iqk_mul_mat better srategy when nrc_y not divisible by ny.md
@@ -1,14 +1,17 @@
-### 🔀 [#71](https://github.com/ikawrakow/ik_llama.cpp/pull/71) - iqk_mul_mat: better srategy when nrc_y not divisible by ny
+## 🔀 [Pull Request #71](https://github.com/ikawrakow/ik_llama.cpp/pull/71) - iqk_mul_mat: better srategy when nrc_y not divisible by ny
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/better_iqk_strategy` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-01 |
| **Updated** | 2024-12-09 |
+| **Merged** | 2024-10-01 |
---
-#### Description
+## 📄 Description
In the llamafile repository @Djip007 has posted [PP results](https://github.com/Mozilla-Ocho/llamafile/discussions/549#discussioncomment-10780156) for short prompt lengths in steps of 1, and one sees a sharp drop in performance for 9 tokens for `Q6_K` and `Q5_K_M`. Why? For these quants llamafile uses `iqk_mul_mat` that I have contributed there, so the matrix multiplication is done using 1x8 tiles. The way it is implemented there (and also here on the main branch) is that first we multiply with 8 columns from the right matrix and then have a second pass to multiple with the remaining 9th column. This second pass is much slower, so overall performance drops. I was of course aware that there will be this effect, and always meant to investigate it, but never did. Now that we have it published, it is time to fix it via this PR.
@@ -23,9 +26,9 @@ This strategy is implemented in this PR. The following graph shows performance (
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Djip007** commented the **2024-11-26** at **19:09:21**:
+👤 **Djip007** commented on **2024-11-26** at **19:09:21**
I what thinking to do something for that to (on tinyBLAS) but not that way. Good to see that it work, I may use it in some other case...
Good JOB!
@@ -34,7 +37,7 @@ Will you do the same on tinyBLAS for non the other case (FP16/BF16/...) ?
---
-👤 **ikawrakow** commented the **2024-11-27** at **15:34:24**:
+👤 **ikawrakow** commented on **2024-11-27** at **15:34:24**
> Will you do the same on tinyBLAS for non the other case (FP16/BF16/...) ?
@@ -42,7 +45,7 @@ In my case all matrix multiplications are driven by the same function, so this c
---
-👤 **Djip007** commented the **2024-12-09** at **22:08:55**:
+👤 **Djip007** commented on **2024-12-09** at **22:08:55**
OK I think I figure how to do it for FP16/BF16/FP32 on tinyblas...
https://github.com/Mozilla-Ocho/llamafile/discussions/654
diff --git a/github-data/pull_requests/72 - iqk_mul_mat_ better iq4_nl implementation on Zen4_AVX2.md b/github-data/pull_requests/72 - iqk_mul_mat better iq4_nl implementation on Zen4AVX2.md
similarity index 52%
rename from github-data/pull_requests/72 - iqk_mul_mat_ better iq4_nl implementation on Zen4_AVX2.md
rename to github-data/pull_requests/72 - iqk_mul_mat better iq4_nl implementation on Zen4AVX2.md
index ada8de5a4..4225f389f 100644
--- a/github-data/pull_requests/72 - iqk_mul_mat_ better iq4_nl implementation on Zen4_AVX2.md
+++ b/github-data/pull_requests/72 - iqk_mul_mat better iq4_nl implementation on Zen4AVX2.md
@@ -1,14 +1,17 @@
-### 🔀 [#72](https://github.com/ikawrakow/ik_llama.cpp/pull/72) - iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2
+## 🔀 [Pull Request #72](https://github.com/ikawrakow/ik_llama.cpp/pull/72) - iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/better_iq4_nl` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-01 |
| **Updated** | 2024-10-01 |
+| **Merged** | 2024-10-01 |
---
-#### Description
+## 📄 Description
PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s up from 133.2 t/s (22% speedup).
diff --git a/github-data/pull_requests/73 - CUDA_ faster float -_ iq4_nl conversion.md b/github-data/pull_requests/73 - CUDA faster float - iq4_nl conversion.md
similarity index 84%
rename from github-data/pull_requests/73 - CUDA_ faster float -_ iq4_nl conversion.md
rename to github-data/pull_requests/73 - CUDA faster float - iq4_nl conversion.md
index 4edff4f52..5321204c8 100644
--- a/github-data/pull_requests/73 - CUDA_ faster float -_ iq4_nl conversion.md
+++ b/github-data/pull_requests/73 - CUDA faster float - iq4_nl conversion.md
@@ -1,14 +1,17 @@
-### 🔀 [#73](https://github.com/ikawrakow/ik_llama.cpp/pull/73) - CUDA: faster float -> iq4_nl conversion
+## 🔀 [Pull Request #73](https://github.com/ikawrakow/ik_llama.cpp/pull/73) - CUDA: faster float -> iq4_nl conversion
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cuda_faster_iq4nl_kvcache` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-01 |
| **Updated** | 2024-10-01 |
+| **Merged** | 2024-10-01 |
---
-#### Description
+## 📄 Description
I had forgotten that `IQ4_NL` can be used for kv-cache on CUDA. It can be, but it is slower than `fp16, q4_0, ...`.
diff --git a/github-data/pull_requests/74 - IQ4_NL kv-cache on the CPU Zen4AVX2ARM_NEON.md b/github-data/pull_requests/74 - IQ4_NL kv-cache on the CPU Zen4AVX2ARM_NEON.md
new file mode 100644
index 000000000..da7e1b9a6
--- /dev/null
+++ b/github-data/pull_requests/74 - IQ4_NL kv-cache on the CPU Zen4AVX2ARM_NEON.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #74](https://github.com/ikawrakow/ik_llama.cpp/pull/74) - IQ4_NL kv-cache on the CPU (Zen4/AVX2/ARM_NEON)
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4nl_kv_cache` |
+| **Target Branch** | `main` |
+| **Created** | 2024-10-01 |
+| **Updated** | 2024-10-01 |
+| **Merged** | 2024-10-01 |
+
+---
+
+## 📄 Description
+
+This is a followup of PR [#73](https://github.com/ikawrakow/ik_llama.cpp/issues/73) that enables usage of `IQ4_NL` for kv-cache on the CPU.
\ No newline at end of file
diff --git a/github-data/pull_requests/74 - IQ4_NL kv-cache on the CPU _Zen4_AVX2_ARM_NEON_.md b/github-data/pull_requests/74 - IQ4_NL kv-cache on the CPU _Zen4_AVX2_ARM_NEON_.md
deleted file mode 100644
index 9d2b9f1a1..000000000
--- a/github-data/pull_requests/74 - IQ4_NL kv-cache on the CPU _Zen4_AVX2_ARM_NEON_.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🔀 [#74](https://github.com/ikawrakow/ik_llama.cpp/pull/74) - IQ4_NL kv-cache on the CPU (Zen4/AVX2/ARM_NEON)
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-10-01 |
-| **Updated** | 2024-10-01 |
-
----
-
-#### Description
-
-This is a followup of PR #73 that enables usage of `IQ4_NL` for kv-cache on the CPU.
\ No newline at end of file
diff --git a/github-data/pull_requests/75 - Fix Q5_0 flash attention.md b/github-data/pull_requests/75 - Fix Q5_0 flash attention.md
index bcffcab79..78e7731a6 100644
--- a/github-data/pull_requests/75 - Fix Q5_0 flash attention.md
+++ b/github-data/pull_requests/75 - Fix Q5_0 flash attention.md
@@ -1,13 +1,16 @@
-### 🐛 [#75](https://github.com/ikawrakow/ik_llama.cpp/pull/75) - Fix Q5_0 flash attention
+## 🔀 [Pull Request #75](https://github.com/ikawrakow/ik_llama.cpp/pull/75) - Fix Q5_0 flash attention
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_q5_0_fa` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-01 |
| **Updated** | 2024-10-01 |
+| **Merged** | 2024-10-01 |
---
-#### Description
+## 📄 Description
When I changed `iqk_mul_mat` to use type-1 dot products for type-0 legacy quants, I forgot to also change the `vec_dot_type` when the dot product is done via ggml as in flash attention. This PR fixes it.
\ No newline at end of file
diff --git a/github-data/pull_requests/76 - iq4_nl_ faster quantization.md b/github-data/pull_requests/76 - iq4_nl faster quantization.md
similarity index 85%
rename from github-data/pull_requests/76 - iq4_nl_ faster quantization.md
rename to github-data/pull_requests/76 - iq4_nl faster quantization.md
index b208deb2a..9ff2d1110 100644
--- a/github-data/pull_requests/76 - iq4_nl_ faster quantization.md
+++ b/github-data/pull_requests/76 - iq4_nl faster quantization.md
@@ -1,14 +1,17 @@
-### 🔀 [#76](https://github.com/ikawrakow/ik_llama.cpp/pull/76) - iq4_nl: faster quantization
+## 🔀 [Pull Request #76](https://github.com/ikawrakow/ik_llama.cpp/pull/76) - iq4_nl: faster quantization
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/faster_iq4nl_quantize` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-02 |
| **Updated** | 2024-10-02 |
+| **Merged** | 2024-10-02 |
---
-#### Description
+## 📄 Description
Speeds up CPU flash attention using `IQ4_NL`.
diff --git a/github-data/pull_requests/77 - Adding Q6_0.md b/github-data/pull_requests/77 - Adding Q6_0.md
index c6db94d8e..7413e78e2 100644
--- a/github-data/pull_requests/77 - Adding Q6_0.md
+++ b/github-data/pull_requests/77 - Adding Q6_0.md
@@ -1,14 +1,17 @@
-### 🔀 [#77](https://github.com/ikawrakow/ik_llama.cpp/pull/77) - Adding Q6_0
+## 🔀 [Pull Request #77](https://github.com/ikawrakow/ik_llama.cpp/pull/77) - Adding Q6_0
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/add_q60` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-02 |
| **Updated** | 2024-10-21 |
+| **Merged** | 2024-10-02 |
---
-#### Description
+## 📄 Description
Main motivation was to see how it performs for quantized kv-cache. Disappointingly, it is slightly worse than `Q8_0` for K-cache and `IQ4_NL` for V-cache (this `Q8_0`+`IQ4_NL` combo needs the exact same memory as `Q6_0` for both caches).
@@ -16,9 +19,17 @@ Nevertheless, with a block size of 32 it is the same as the other legacy quants,
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2024-10-21** at **09:42:19**:
+👤 **Nexesenex** commented on **2024-10-21** at **09:42:19**
You should test the combo -ctk q6_0 -ctv q5_0.
-After a few PPL tests, it seems to be a keeper for me, to replace q5_1 - q5_0 and be quite close to the K q8_0 mixes in term of quality with much less VRAM occupation.
\ No newline at end of file
+After a few PPL tests, it seems to be a keeper for me, to replace q5_1 - q5_0 and be quite close to the K q8_0 mixes in term of quality with much less VRAM occupation.
+
+ON L3.1 8B Q5_K
+
+PPL 512 211 chunks- K q5_1 / V q5_0 : 7.4175 - 46 MB
+
+PPL 512 211 chunks - K q6_0 / V q5_0 : 7.3995 - 48 MB -> choice of the jury
+
+PPL 512 211 chunks - K q8_0 / V iq4_nl : 7.4078 - 52MB
\ No newline at end of file
diff --git a/github-data/pull_requests/78 - q6_0 Slightly faster Zen4AVX2.md b/github-data/pull_requests/78 - q6_0 Slightly faster Zen4AVX2.md
new file mode 100644
index 000000000..8c7f6c6b5
--- /dev/null
+++ b/github-data/pull_requests/78 - q6_0 Slightly faster Zen4AVX2.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #78](https://github.com/ikawrakow/ik_llama.cpp/pull/78) - q6_0: Slightly faster Zen4/AVX2
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/faster_q60_avx2` |
+| **Target Branch** | `main` |
+| **Created** | 2024-10-02 |
+| **Updated** | 2024-10-02 |
+| **Merged** | 2024-10-02 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/78 - q6_0_ Slightly faster Zen4_AVX2.md b/github-data/pull_requests/78 - q6_0_ Slightly faster Zen4_AVX2.md
deleted file mode 100644
index 538b549e4..000000000
--- a/github-data/pull_requests/78 - q6_0_ Slightly faster Zen4_AVX2.md
+++ /dev/null
@@ -1,7 +0,0 @@
-### 🔀 [#78](https://github.com/ikawrakow/ik_llama.cpp/pull/78) - q6_0: Slightly faster Zen4/AVX2
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-10-02 |
-| **Updated** | 2024-10-02 |
\ No newline at end of file
diff --git a/github-data/pull_requests/79 - Do not quantize activations if not necessary.md b/github-data/pull_requests/79 - Do not quantize activations if not necessary.md
index 0014eb34e..959bfb291 100644
--- a/github-data/pull_requests/79 - Do not quantize activations if not necessary.md
+++ b/github-data/pull_requests/79 - Do not quantize activations if not necessary.md
@@ -1,14 +1,17 @@
-### 🔀 [#79](https://github.com/ikawrakow/ik_llama.cpp/pull/79) - Do not quantize activations if not necessary
+## 🔀 [Pull Request #79](https://github.com/ikawrakow/ik_llama.cpp/pull/79) - Do not quantize activations if not necessary
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/skip_unnecessary_quantize` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-04 |
| **Updated** | 2024-10-04 |
+| **Merged** | 2024-10-04 |
---
-#### Description
+## 📄 Description
It has always bugged me that `ggml` unnecessarily repeats the "quantization" of activations when the corresponding matrix multiplication cannot be done directly. E.g., `Q`, `K` and `V` all multiply the input to the self-attention layer. Similarly, `ffn_up` and `ffn_gate` multiply the same activations for parallel FFNs. "Quantization" is in quotes, because it applies to `fp16` and `bf16` tensors when the matrix multiplication function used does not work directly with `fp32` activations. There are typically 7 tensors per layer in a transformer model, so basically 3 out of 7 "quantizations" are unnecessary.
diff --git a/github-data/pull_requests/80 - Move to c17 projectwide.md b/github-data/pull_requests/80 - Move to c17 projectwide.md
new file mode 100644
index 000000000..d21aa6b18
--- /dev/null
+++ b/github-data/pull_requests/80 - Move to c17 projectwide.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #80](https://github.com/ikawrakow/ik_llama.cpp/pull/80) - Move to c++17 projectwide
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cpp_17` |
+| **Target Branch** | `main` |
+| **Created** | 2024-10-04 |
+| **Updated** | 2024-10-04 |
+| **Merged** | 2024-10-04 |
+
+---
+
+## 📄 Description
+
+_No description provided._
\ No newline at end of file
diff --git a/github-data/pull_requests/80 - Move to c_17 projectwide.md b/github-data/pull_requests/80 - Move to c_17 projectwide.md
deleted file mode 100644
index 1d58c5122..000000000
--- a/github-data/pull_requests/80 - Move to c_17 projectwide.md
+++ /dev/null
@@ -1,7 +0,0 @@
-### 🔀 [#80](https://github.com/ikawrakow/ik_llama.cpp/pull/80) - Move to c++17 projectwide
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-10-04 |
-| **Updated** | 2024-10-04 |
\ No newline at end of file
diff --git a/github-data/pull_requests/81 - Cleanup scale fudge factors.md b/github-data/pull_requests/81 - Cleanup scale fudge factors.md
index 95cd518da..54eb45ee4 100644
--- a/github-data/pull_requests/81 - Cleanup scale fudge factors.md
+++ b/github-data/pull_requests/81 - Cleanup scale fudge factors.md
@@ -1,13 +1,16 @@
-### 🔀 [#81](https://github.com/ikawrakow/ik_llama.cpp/pull/81) - Cleanup scale fudge factors
+## 🔀 [Pull Request #81](https://github.com/ikawrakow/ik_llama.cpp/pull/81) - Cleanup scale fudge factors
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/cleanup_fudge_factors` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-04 |
| **Updated** | 2024-10-04 |
+| **Merged** | 2024-10-04 |
---
-#### Description
+## 📄 Description
Low-bit quants often benefit from a fudge factor applied to the (super-)block scale. When I was developing `IQ2_K` and `IQ3_K` it was faster to change the fudge factor in `ggml-cuda/convert.cu` and recompile than to change it in the quantization function and re-quantize. But when I was ready, I forgot to move the `IQ2_K` and `IQ3_K` fudge factors to quantization, so they remained in the CUDA dequantization function (and hence weren't applied anywhere else). This PR fixes this.
\ No newline at end of file
diff --git a/github-data/pull_requests/83 - New SOTA quantization_ 4.25 bpw IQ4_KS.md b/github-data/pull_requests/83 - New SOTA quantization 4.25 bpw IQ4_KS.md
similarity index 85%
rename from github-data/pull_requests/83 - New SOTA quantization_ 4.25 bpw IQ4_KS.md
rename to github-data/pull_requests/83 - New SOTA quantization 4.25 bpw IQ4_KS.md
index e6ca1a43c..cf55c2b65 100644
--- a/github-data/pull_requests/83 - New SOTA quantization_ 4.25 bpw IQ4_KS.md
+++ b/github-data/pull_requests/83 - New SOTA quantization 4.25 bpw IQ4_KS.md
@@ -1,14 +1,17 @@
-### 🔀 [#83](https://github.com/ikawrakow/ik_llama.cpp/pull/83) - New SOTA quantization: 4.25 bpw IQ4_KS
+## 🔀 [Pull Request #83](https://github.com/ikawrakow/ik_llama.cpp/pull/83) - New SOTA quantization: 4.25 bpw IQ4_KS
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_k_xxs` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-09 |
| **Updated** | 2024-10-09 |
+| **Merged** | 2024-10-09 |
---
-#### Description
+## 📄 Description
It is similar to `IQ4_K` with the following difference
* Blocks of 32 instead of blocks of 16
diff --git a/github-data/pull_requests/84 - Better model info.md b/github-data/pull_requests/84 - Better model info.md
index 8c20f97cb..7547cae50 100644
--- a/github-data/pull_requests/84 - Better model info.md
+++ b/github-data/pull_requests/84 - Better model info.md
@@ -1,14 +1,17 @@
-### 🔀 [#84](https://github.com/ikawrakow/ik_llama.cpp/pull/84) - Better model info
+## 🔀 [Pull Request #84](https://github.com/ikawrakow/ik_llama.cpp/pull/84) - Better model info
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/better_model_info` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-10 |
| **Updated** | 2024-10-10 |
+| **Merged** | 2024-10-10 |
---
-#### Description
+## 📄 Description
In the quantization literature they always ignore the token embedding and output tensors (they leave them as `f16`). But when `llama.cpp` loads a model, it prints a bits-per-weight (bpw) value that is basically `total file size on disk / total number of parameters`. As this includes the output tensor, which is almost always quantized with more bpw, this makes the i- and k-quants appear not competitive.
diff --git a/github-data/pull_requests/85 - IQ2_KS_ 2.1875 bpw non-linear quantization.md b/github-data/pull_requests/85 - IQ2_KS 2.1875 bpw non-linear quantization.md
similarity index 89%
rename from github-data/pull_requests/85 - IQ2_KS_ 2.1875 bpw non-linear quantization.md
rename to github-data/pull_requests/85 - IQ2_KS 2.1875 bpw non-linear quantization.md
index 3121cdf23..b39f075ee 100644
--- a/github-data/pull_requests/85 - IQ2_KS_ 2.1875 bpw non-linear quantization.md
+++ b/github-data/pull_requests/85 - IQ2_KS 2.1875 bpw non-linear quantization.md
@@ -1,14 +1,17 @@
-### 🔀 [#85](https://github.com/ikawrakow/ik_llama.cpp/pull/85) - IQ2_KS: 2.1875 bpw non-linear quantization
+## 🔀 [Pull Request #85](https://github.com/ikawrakow/ik_llama.cpp/pull/85) - IQ2_KS: 2.1875 bpw non-linear quantization
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq2k_experiments` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-13 |
| **Updated** | 2024-10-13 |
+| **Merged** | 2024-10-13 |
---
-#### Description
+## 📄 Description
It ends up being somewhere in the middle between `IQ2_XXS` and `IQ2_XS` in terms of quantized model size and quantization accuracy. This graph shows quantization error vs bpw for LLaMA-3.1-8B-Instruct

diff --git a/github-data/pull_requests/86 - Fix and optimize iq2k Metal implementation.md b/github-data/pull_requests/86 - Fix and optimize iq2k Metal implementation.md
index cdc27c517..7d8016d56 100644
--- a/github-data/pull_requests/86 - Fix and optimize iq2k Metal implementation.md
+++ b/github-data/pull_requests/86 - Fix and optimize iq2k Metal implementation.md
@@ -1,13 +1,16 @@
-### 🐛 [#86](https://github.com/ikawrakow/ik_llama.cpp/pull/86) - Fix and optimize iq2k Metal implementation
+## 🔀 [Pull Request #86](https://github.com/ikawrakow/ik_llama.cpp/pull/86) - Fix and optimize iq2k Metal implementation
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/metal_fix_iq2k` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-13 |
| **Updated** | 2024-10-13 |
+| **Merged** | 2024-10-13 |
---
-#### Description
+## 📄 Description
I completely forgot to change the `IQ2_K` Metal implementation after changing the `IQ2_K` block scales in the last PR. This PR fixes it. It also improves the performance of the `IQ2_K` Metal dot product - TG-128 for LLaMA-3.1-8B goes to 46.2 t/s up from 42.6 t./s.
\ No newline at end of file
diff --git a/github-data/pull_requests/87 - iq3_k_ fix and optimize Metal dot product.md b/github-data/pull_requests/87 - iq3_k fix and optimize Metal dot product.md
similarity index 65%
rename from github-data/pull_requests/87 - iq3_k_ fix and optimize Metal dot product.md
rename to github-data/pull_requests/87 - iq3_k fix and optimize Metal dot product.md
index d70744f47..62382e764 100644
--- a/github-data/pull_requests/87 - iq3_k_ fix and optimize Metal dot product.md
+++ b/github-data/pull_requests/87 - iq3_k fix and optimize Metal dot product.md
@@ -1,14 +1,17 @@
-### 🐛 [#87](https://github.com/ikawrakow/ik_llama.cpp/pull/87) - iq3_k: fix and optimize Metal dot product
+## 🔀 [Pull Request #87](https://github.com/ikawrakow/ik_llama.cpp/pull/87) - iq3_k: fix and optimize Metal dot product
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/metal_fix_iq3k` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-14 |
| **Updated** | 2024-10-14 |
+| **Merged** | 2024-10-14 |
---
-#### Description
+## 📄 Description
I was accessing the scales as 4-byte aligned, but `IQ3_K` is not 4-byte aligned. Instead of throwing an error (as it happens
on CUDA when one makes a mistake such as this), Metal silently accepts and we get garbage. But we don't get garbage right away so one can easily notice, no we get garbage after some tokens have been generated.
diff --git a/github-data/pull_requests/89 - Adding IQ4_KSS_ 4.0 bpw quants.md b/github-data/pull_requests/89 - Adding IQ4_KSS 4.0 bpw quants.md
similarity index 81%
rename from github-data/pull_requests/89 - Adding IQ4_KSS_ 4.0 bpw quants.md
rename to github-data/pull_requests/89 - Adding IQ4_KSS 4.0 bpw quants.md
index 83c4e025f..f0f0f6b2d 100644
--- a/github-data/pull_requests/89 - Adding IQ4_KSS_ 4.0 bpw quants.md
+++ b/github-data/pull_requests/89 - Adding IQ4_KSS 4.0 bpw quants.md
@@ -1,14 +1,17 @@
-### 🔀 [#89](https://github.com/ikawrakow/ik_llama.cpp/pull/89) - Adding IQ4_KSS: 4.0 bpw quants
+## 🔀 [Pull Request #89](https://github.com/ikawrakow/ik_llama.cpp/pull/89) - Adding IQ4_KSS: 4.0 bpw quants
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/iq4_kss` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-16 |
| **Updated** | 2024-10-17 |
+| **Merged** | 2024-10-16 |
---
-#### Description
+## 📄 Description
@Nexesenex has been asking for a 4.0 bpw quantization here and in `llama.cpp`. Well, here it is.
@@ -36,17 +39,27 @@ In all following graph the token embedding and output tensors are quantized, and
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2024-10-16** at **20:38:07**:
+👤 **Nexesenex** commented on **2024-10-16** at **20:38:07**
Hey IK,
-Congratulations and thank you. Now, I'm gonna try to make all of this work, because I ideally don't want to ever touch 3 bits quants ever again (except for attn_q.weight :P). I'll report my progresses. :D
+Congratulations and thank you. Now, I'm gonna try to make all of this work, because I ideally don't want to ever touch 3 bits quants ever again (except for attn_q.weight :P). I'll report my progresses. :D
+
+Ok, I compiled it in debug thanks to @saood06 yesterday, and finally with Cuda today. Now I can work with this! I tested PPL512 on Sheared Llama 2.7b quickly, normal ftypes (pure, but Q6_K output and with imatrix).
+
+FP16 = 7.3507
+IQ4_K : 7.4172
+IQ4_XS : 7.4444
+IQ4_KS : 7.4322
+IQ4_KSS : 7.4820 (!!!)
+IQ3_S : 7.6666
+IQ3_K : 7.6446 +/- 0.04566
---
-👤 **Nexesenex** commented the **2024-10-16** at **23:20:20**:
+👤 **Nexesenex** commented on **2024-10-16** at **23:20:20**
The new IQ4_KSS quant is really SOTA imo, and thank you very much. You're rocking the place, as usual.
@@ -58,14 +71,10 @@ Also, I observed how Exllama v2 quantizes. Turboderp's tool calculates something
With an IQ3_KM and an IQ3_KSS, you might be able to drop down a bit (attn_q wise, and ffn_gate wise) the bpw of the quant strategies revolving in the 3 to 4.5 bpw bracket. Ofc, the logic applies on the whole scope, but that's a work I'm only able to suggest, not to do myself lol.
-Then, if you were willing to code an automatic quantization system akin to Exllama v2, but maybe more rigorous on the skeleton "ftype" strategy employed (due to the knowledge gained in all the experimentation with FTYPES) and an automatic upscale or downscale (compared to the skeleton 'ftype" strategy) of the quant of a given tensor accordingly to its "error rate", then the process of strategization of the quants would be greatly helped, and the FTYPES also could be SOTA, on the top of your SOTA GGML_TYPES.
+Then, if you were willing to code an automatic quantization system akin to Exllama v2, but maybe more rigorous on the skeleton "ftype" strategy employed (due to the knowledge gained in all the experimentation with FTYPES) and an automatic upscale or downscale (compared to the skeleton 'ftype" strategy) of the GGML type quant of a given tensor accordingly to its "error rate", then the process of strategization of the quants would be greatly helped, and the FTYPES also could be SOTA, on the top of your SOTA GGML_TYPES.
-On my side, I ponder seriously about trying to rebase my KoboldCPP fork on your LlamaCPP clone, to offer the benefit of your quants to myself and others in daily use.
-
----
-
-👤 **Nexesenex** commented the **2024-10-17** at **03:30:26**:
-
-I tested your IQ6_K quant on Nemo 12b on ST/llama-server, and it indeed feels very like a Q8_0.
+On my side, I ponder seriously about trying to rebase my KoboldCPP fork on your LlamaCPP clone, to offer the benefit of your quants to myself and others in daily use.
+
+More practically, I tested your IQ6_K quant on Nemo 12b on ST/llama-server, and it indeed feels very like a Q8_0.
Your quants are amazing.
This night, I'm gonna quant a IQ4_KSS modified ftype for Mistral 123b. I can't wait ! :D
\ No newline at end of file
diff --git a/github-data/pull_requests/9 - Fused soft cap and SIMD-ified GeLU.md b/github-data/pull_requests/9 - Fused soft cap and SIMD-ified GeLU.md
index 05e9fd9aa..0269ba426 100644
--- a/github-data/pull_requests/9 - Fused soft cap and SIMD-ified GeLU.md
+++ b/github-data/pull_requests/9 - Fused soft cap and SIMD-ified GeLU.md
@@ -1,14 +1,17 @@
-### 🔀 [#9](https://github.com/ikawrakow/ik_llama.cpp/pull/9) - Fused soft cap and SIMD-ified GeLU
+## 🔀 [Pull Request #9](https://github.com/ikawrakow/ik_llama.cpp/pull/9) - Fused soft cap and SIMD-ified GeLU
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/softcap` |
+| **Target Branch** | `main` |
| **Created** | 2024-08-02 |
| **Updated** | 2024-08-20 |
+| **Merged** | 2024-08-20 |
---
-#### Description
+## 📄 Description
Some models use a so called "soft cap" in their attention portions, some may use a "soft cap" also for the final output. This is currently implemented as
```
diff --git a/github-data/pull_requests/90 - iq4_ks_ faster dot product on Metal.md b/github-data/pull_requests/90 - iq4_ks faster dot product on Metal.md
similarity index 63%
rename from github-data/pull_requests/90 - iq4_ks_ faster dot product on Metal.md
rename to github-data/pull_requests/90 - iq4_ks faster dot product on Metal.md
index aa806b156..c4d39d6dd 100644
--- a/github-data/pull_requests/90 - iq4_ks_ faster dot product on Metal.md
+++ b/github-data/pull_requests/90 - iq4_ks faster dot product on Metal.md
@@ -1,14 +1,17 @@
-### 🔀 [#90](https://github.com/ikawrakow/ik_llama.cpp/pull/90) - iq4_ks: faster dot product on Metal
+## 🔀 [Pull Request #90](https://github.com/ikawrakow/ik_llama.cpp/pull/90) - iq4_ks: faster dot product on Metal
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/metal_faster_iq4ks` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-16 |
| **Updated** | 2024-10-16 |
+| **Merged** | 2024-10-16 |
---
-#### Description
+## 📄 Description
Haha, I keep forgetting that the Metal compiler often needs a hand to produce fast code.
In this particular instance, we gain almost 8.5% token generation (TG) speedup for `IQ4_KS`:
diff --git a/github-data/pull_requests/91 - CLI - Specify GGML_TYPE to quantize for the main tensors..md b/github-data/pull_requests/91 - CLI - Specify GGML_TYPE to quantize for the main tensors.md
similarity index 60%
rename from github-data/pull_requests/91 - CLI - Specify GGML_TYPE to quantize for the main tensors..md
rename to github-data/pull_requests/91 - CLI - Specify GGML_TYPE to quantize for the main tensors.md
index 2380e9aa0..97e4ca4da 100644
--- a/github-data/pull_requests/91 - CLI - Specify GGML_TYPE to quantize for the main tensors..md
+++ b/github-data/pull_requests/91 - CLI - Specify GGML_TYPE to quantize for the main tensors.md
@@ -1,14 +1,17 @@
-### 🔀 [#91](https://github.com/ikawrakow/ik_llama.cpp/pull/91) - CLI - Specify GGML_TYPE to quantize for the main tensors.
+## 🔀 [Pull Request #91](https://github.com/ikawrakow/ik_llama.cpp/pull/91) - CLI - Specify GGML_TYPE to quantize for the main tensors.
| **Author** | `Nexesenex` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `specify_tensor_quants_in_cli` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-17 |
| **Updated** | 2024-10-18 |
+| **Merged** | 2024-10-18 |
---
-#### Description
+## 📄 Description
To complement the cli based custom quantization of token_embd.weight and output.weight, the ggml_type of the following tensors can now be specified :
@@ -29,8 +32,8 @@ ffn_up
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2024-10-17** at **06:32:51**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2024-10-17** at **06:32:51**
This looks fine. I'm traveling today. Will do some testing and merge it tomorrow.
\ No newline at end of file
diff --git a/github-data/pull_requests/93 - Attempt to blindly fix Windows build failure.md b/github-data/pull_requests/93 - Attempt to blindly fix Windows build failure.md
index eb51c6fc4..91986b009 100644
--- a/github-data/pull_requests/93 - Attempt to blindly fix Windows build failure.md
+++ b/github-data/pull_requests/93 - Attempt to blindly fix Windows build failure.md
@@ -1,16 +1,19 @@
-### 🐛 [#93](https://github.com/ikawrakow/ik_llama.cpp/pull/93) - Attempt to blindly fix Windows build failure
+## 🔀 [Pull Request #93](https://github.com/ikawrakow/ik_llama.cpp/pull/93) - Attempt to blindly fix Windows build failure
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fix_reduce_windows` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-18 |
| **Updated** | 2024-10-19 |
+| **Merged** | 2024-10-19 |
---
-#### Description
+## 📄 Description
-Ref #88
+Ref [#88](https://github.com/ikawrakow/ik_llama.cpp/issues/88)
@Nexesenex @saood06
@@ -18,10 +21,10 @@ Does this work?
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2024-10-18** at **15:37:10**:
+👤 **Nexesenex** commented on **2024-10-18** at **15:37:10**
Hey IK.
-Yes, both your last commit and Saood's cracked up fix are working for compilation in non-Cuda and cuda mode.
\ No newline at end of file
+Yes, both your branch https://github.com/ikawrakow/ik_llama.cpp/tree/ik/fix_reduce_windows and Saood's cracked up fix are working for compilation on Windows / MSVS in non-Cuda and cuda mode.
\ No newline at end of file
diff --git a/github-data/pull_requests/94 - Adding _agray3_s graph caching approach.md b/github-data/pull_requests/94 - Adding agray3s graph caching approach.md
similarity index 84%
rename from github-data/pull_requests/94 - Adding _agray3_s graph caching approach.md
rename to github-data/pull_requests/94 - Adding agray3s graph caching approach.md
index 68bd8086e..27db7360a 100644
--- a/github-data/pull_requests/94 - Adding _agray3_s graph caching approach.md
+++ b/github-data/pull_requests/94 - Adding agray3s graph caching approach.md
@@ -1,14 +1,16 @@
-### 🔀 [#94](https://github.com/ikawrakow/ik_llama.cpp/pull/94) - Adding @agray3's graph caching approach
+## 🔀 [Pull Request #94](https://github.com/ikawrakow/ik_llama.cpp/pull/94) - Adding @agray3's graph caching approach
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
+| **Source Branch** | `ik/cached_graph` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-18 |
| **Updated** | 2024-10-20 |
---
-#### Description
+## 📄 Description
@agray3 has [PR-8366](https://github.com/ggerganov/llama.cpp/pull/8366) open in mainline `llama.cpp` that appears to not meet the high standards of the `llama.cpp` maintainers. Me, being more pragmatic and less of a purist, would like to have these changes here as that way one avoids rebuilding the computation graph for every new token, a "feature" inherited from `llama.cpp` that I don't really like.
@@ -52,47 +54,49 @@ and this for the M2-Max
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **Nexesenex** commented the **2024-10-18** at **17:58:54**:
+👤 **Nexesenex** commented on **2024-10-18** at **17:58:54**
@ikawrakow : check the "continuation" of this PR also :
https://github.com/ggerganov/llama.cpp/pull/9017
---
-👤 **ikawrakow** commented the **2024-10-19** at **09:44:44**:
+👤 **ikawrakow** commented on **2024-10-19** at **09:44:44**
Oh, btw,
> @ikawrakow : check the "continuation" of this PR also :
-> [ggerganov/llama.cpp#9017](https://github.com/ggerganov/llama.cpp/pull/9017)
+> [ggerganov/llama.cpp[#9017](https://github.com/ikawrakow/ik_llama.cpp/issues/9017)](https://github.com/ggerganov/llama.cpp/pull/9017)
Yes, I saw that. But the performance gain there is even less, so not sure if I want to add it.
---
-👤 **Nexesenex** commented the **2024-10-19** at **14:07:33**:
+👤 **Nexesenex** commented on **2024-10-19** at **14:07:33**
Well, IK, little streams make big rivers at some point.
I know you're CPU focused, but as far as I know, only lacks Agray3's missing PR and the MMQ kernels (the "normal" cuda implementation is quite slow and a massive memory hog, and can reach several percents more size occupation of the VRAM for the same model/bbs/ctx) for your new SOTA ggml_types to have the best CUDA inference speed and quality/size reachable in the GGUF ecosystem.
---
-👤 **ikawrakow** commented the **2024-10-19** at **14:37:26**:
+👤 **ikawrakow** commented on **2024-10-19** at **14:37:26**
> only lacks Agray3's missing PR and the MMQ kernels
-I know I need to do something about quantized matrix multiplications on CUDA for the new quants. It is not hard to take Johannes' MMQ kernels and adapt. But I have an extremely strong resistance against doing that. I find the MMQ kernels unacceptable, and even less so the several minutes build time associated with them. Adding even more quants will explode build time even further. Each time I want to make a change to one of the headers that I know will trigger full CUDA rebuild, I think 5 times before doing it. I think, a much better approach to pursue there is to find a way to interleave dequantization and matrix multiplications. This is done in the Metal implementation. A simple napkin math shows that the difference in performance between dequantize + cuBLAS matrix multiplication and the MMQ kernels is simply due to the time it takes to store the dequantized tensors in memory. If one would interleave dequantize and matrix multiplications, one would A) (nearly) remove the performance gap B) reduce the extra VRAM required to store the dequantized tensors by a large amount, and C) Get back to normal build times after throwing out the MMQ kernels. I'm just not enough of a CUDA expert to (easily) implements, so keep pushing it out.
+I know I need to do something about quantized matrix multiplications on CUDA for the new quants. It is not hard to take Johannes' MMQ kernels and adapt. But I have an extremely strong resistance against doing that. I find the MMQ kernels unacceptable, and even less so the several minutes build time associated with them. Adding even more quants will explode build time even further. Each time I want to make a change to one of the headers that I know will trigger full CUDA rebuild, I think 5 times before doing it. I think, a much better approach to pursue there is to find a way to interleave dequantization and matrix multiplications. This is done in the Metal implementation. A simple napkin math shows that the difference in performance between dequantize + cuBLAS matrix multiplication and the MMQ kernels is simply due to the time it takes to store the dequantized tensors in memory. If one would interleave dequantize and matrix multiplications, one would A) (nearly) remove the performance gap B) reduce the extra VRAM required to store the dequantized tensors by a large amount, and C) Get back to normal build times after throwing out the MMQ kernels. I'm just not enough of a CUDA expert to (easily) implement, so keep pushing it out.
+
+The alternative would be to write quantized matrix multiplications for CUDA from scratch. In a way that does not require 10 minutes build time. I have done it for 3 different CPU platforms in `iqk_mul_mat.cpp`, so I should be able to do it for CUDA too.
---
-👤 **agray3** commented the **2024-10-19** at **19:22:56**:
+👤 **agray3** commented on **2024-10-19** at **19:22:56**
Thanks @ikawrakow. I have now created this PR at https://github.com/ikawrakow/ik_llama.cpp/pull/98 (it is exactly the same as this one). FWIW, to be fair to the llama.cpp maintainers, they are also maintaining the GGML library which can be used separately from llama.cpp and there may be unintended consequences related to that. It should be fine when GGML is always used with llama.cpp.
---
-👤 **ikawrakow** commented the **2024-10-20** at **06:36:49**:
+👤 **ikawrakow** commented on **2024-10-20** at **06:36:49**
-Closing in favor of #98
\ No newline at end of file
+Closing in favor of [#98](https://github.com/ikawrakow/ik_llama.cpp/issues/98)
\ No newline at end of file
diff --git a/github-data/pull_requests/96 - Quant strategies_ attn_q Q4 _ attn_v Q6 for Llama 3.1 Q5_K_S.md b/github-data/pull_requests/96 - Quant strategies attn_q Q4 attn_v Q6 for Llama 3.1 Q5_K_S.md
similarity index 75%
rename from github-data/pull_requests/96 - Quant strategies_ attn_q Q4 _ attn_v Q6 for Llama 3.1 Q5_K_S.md
rename to github-data/pull_requests/96 - Quant strategies attn_q Q4 attn_v Q6 for Llama 3.1 Q5_K_S.md
index 0bc700ef9..a46fd2e18 100644
--- a/github-data/pull_requests/96 - Quant strategies_ attn_q Q4 _ attn_v Q6 for Llama 3.1 Q5_K_S.md
+++ b/github-data/pull_requests/96 - Quant strategies attn_q Q4 attn_v Q6 for Llama 3.1 Q5_K_S.md
@@ -1,14 +1,17 @@
-### 🔀 [#96](https://github.com/ikawrakow/ik_llama.cpp/pull/96) - Quant strategies: attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S
+## 🔀 [Pull Request #96](https://github.com/ikawrakow/ik_llama.cpp/pull/96) - Quant strategies: attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S
| **Author** | `Nexesenex` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `L3.1_q5_k_s_q4v6` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-19 |
| **Updated** | 2024-11-22 |
+| **Merged** | 2024-10-19 |
---
-#### Description
+## 📄 Description
Pattern (attn-q -1 attn-v+1) worth to be tested on more quants levels (Q_x_K, IQx, & IQx_K) and on Llama 3.0 if confirmation is needed.
@@ -27,23 +30,23 @@ I suspect that it goes similarly for L3 as well, which was quite insensitive to
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **ikawrakow** submitted a review the **2024-10-19** at **15:24:19**: ✅ `APPROVED`
+👤 **ikawrakow** approved this pull request ✅ on **2024-10-19** at **15:24:19**
Yes, reducing bpw for `attn_q` and increasing `bpw` for `attn_v` is a good strategy to improve quantized model performance in general in my experience.
---
-👤 **Nexesenex** commented the **2024-10-19** at **16:04:22**:
+👤 **Nexesenex** commented on **2024-10-19** at **16:04:22**
If you're open to the idea, I can contribute more to that quant strategy part, in a progressive way, PR by PR.
-
-I now handle well the afferent code, and got a lot of experimentation behind me already.
+I now handle well the afferent code, and got a lot of experimentation behind me already.
+The merged PRs/commits can then be squashed to keep the commit log clear of clutter.
---
-👤 **ikawrakow** commented the **2024-10-20** at **09:18:46**:
+👤 **ikawrakow** commented on **2024-10-20** at **09:18:46**
> If you're open to the idea, I can contribute more to that quant strategy part, in a progressive way, PR by PR.
> I now handle well the afferent code, and got a lot of experimentation behind me already.
@@ -53,31 +56,31 @@ Sure, go ahead.
---
-👤 **Nexesenex** commented the **2024-10-20** at **22:44:46**:
+👤 **Nexesenex** commented on **2024-10-20** at **22:44:46**
Shall I separate the IQ_K from the legacy IQ Quants in the mixes?
---
-👤 **Nexesenex** commented the **2024-11-22** at **07:41:35**:
+👤 **Nexesenex** commented on **2024-11-22** at **07:41:35**
@ikawrakow would it be possible and not a hassle for you to decouple the quant strategies part of the llama.cpp source file in order to reduce the recompilation time when the quant strategies are edited, so it can speed up the tests?
---
-👤 **ikawrakow** commented the **2024-11-22** at **08:08:37**:
+👤 **ikawrakow** commented on **2024-11-22** at **08:08:37**
It is of course possible. But is compilation time really a major factor in testing? One needs to quantize and run a test such as PPL. Compared to that `llama.cpp` compilation time should not be a major factor. Or am I missing something?
---
-👤 **Nexesenex** commented the **2024-11-22** at **11:22:35**:
+👤 **Nexesenex** commented on **2024-11-22** at **11:22:35**
Well, if one plays with use more bit formulas (I use customized ones a lot), which are not supported by the CLI args, then the endless lengthy recompiles quickly become a hassle. ^^
---
-👤 **ikawrakow** commented the **2024-11-22** at **16:40:39**:
+👤 **ikawrakow** commented on **2024-11-22** at **16:40:39**
So, let's say compiling `llama.cpp` takes 15 seconds. Quantizing a 7B model is 15+ seconds. Running PPL is 60 seconds. So, at the very best, compilation time is ~15% of the overall time to test. If we are looking at larger models and/or more than one model (my usual approach is to check at least 5 models before drawing conclusions that one quantization strategy is better than another), the compilation time basically becomes a negligible fraction of the time needed to test a new quantization strategy.
diff --git a/github-data/pull_requests/97 - Bitnet make the scale tensors optional.md b/github-data/pull_requests/97 - Bitnet make the scale tensors optional.md
new file mode 100644
index 000000000..77a6c6c70
--- /dev/null
+++ b/github-data/pull_requests/97 - Bitnet make the scale tensors optional.md
@@ -0,0 +1,16 @@
+## 🔀 [Pull Request #97](https://github.com/ikawrakow/ik_llama.cpp/pull/97) - Bitnet: make the scale tensors optional
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/bitnet_optional_scales` |
+| **Target Branch** | `main` |
+| **Created** | 2024-10-19 |
+| **Updated** | 2024-10-19 |
+| **Merged** | 2024-10-19 |
+
+---
+
+## 📄 Description
+
+Needed this to be able to run the fake models generated by the [Microsoft Bitnet implementation](https://github.com/microsoft/BitNet) to make a direct performance comparison with their Bitnet implementation (see [#95](https://github.com/ikawrakow/ik_llama.cpp/issues/95)).
\ No newline at end of file
diff --git a/github-data/pull_requests/97 - Bitnet_ make the scale tensors optional.md b/github-data/pull_requests/97 - Bitnet_ make the scale tensors optional.md
deleted file mode 100644
index e131475bd..000000000
--- a/github-data/pull_requests/97 - Bitnet_ make the scale tensors optional.md
+++ /dev/null
@@ -1,13 +0,0 @@
-### 🔀 [#97](https://github.com/ikawrakow/ik_llama.cpp/pull/97) - Bitnet: make the scale tensors optional
-
-| **Author** | `ikawrakow` |
-| :--- | :--- |
-| **State** | ❌ **Closed** |
-| **Created** | 2024-10-19 |
-| **Updated** | 2024-10-19 |
-
----
-
-#### Description
-
-Needed this to be able to run the fake models generated by the [Microsoft Bitnet implementation](https://github.com/microsoft/BitNet) to make a direct performance comparison with their Bitnet implementation (see #95).
\ No newline at end of file
diff --git a/github-data/pull_requests/98 - Avoid rebuild of GGML graph for each token.md b/github-data/pull_requests/98 - Avoid rebuild of GGML graph for each token.md
index 37412c064..adc14e0cc 100644
--- a/github-data/pull_requests/98 - Avoid rebuild of GGML graph for each token.md
+++ b/github-data/pull_requests/98 - Avoid rebuild of GGML graph for each token.md
@@ -1,14 +1,17 @@
-### 🔀 [#98](https://github.com/ikawrakow/ik_llama.cpp/pull/98) - Avoid rebuild of GGML graph for each token
+## 🔀 [Pull Request #98](https://github.com/ikawrakow/ik_llama.cpp/pull/98) - Avoid rebuild of GGML graph for each token
| **Author** | `agray3` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ag_avoid_ggml_graph_rebuild` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-19 |
| **Updated** | 2024-10-20 |
+| **Merged** | 2024-10-20 |
---
-#### Description
+## 📄 Description
Introduces caching of GGML graph to avoid unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in cached GGML graph. Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.
@@ -22,12 +25,12 @@ Introduces caching of GGML graph to avoid unnecessary full rebuild between each
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **agray3** commented the **2024-10-19** at **19:19:21**:
+👤 **agray3** commented on **2024-10-19** at **19:19:21**
See https://github.com/ikawrakow/ik_llama.cpp/pull/94
---
-👤 **ikawrakow** submitted a review the **2024-10-20** at **06:35:58**: ✅ `APPROVED`
\ No newline at end of file
+👤 **ikawrakow** approved this pull request ✅ on **2024-10-20** at **06:35:58**
\ No newline at end of file
diff --git a/github-data/pull_requests/99 - Enable IQ4_NL for KV-cache in token generation using Flash Attention.md b/github-data/pull_requests/99 - Enable IQ4_NL for KV-cache in token generation using Flash Attention.md
index c3feeacb7..f312bb361 100644
--- a/github-data/pull_requests/99 - Enable IQ4_NL for KV-cache in token generation using Flash Attention.md
+++ b/github-data/pull_requests/99 - Enable IQ4_NL for KV-cache in token generation using Flash Attention.md
@@ -1,14 +1,17 @@
-### 🔀 [#99](https://github.com/ikawrakow/ik_llama.cpp/pull/99) - Enable IQ4_NL for KV-cache in token generation using Flash Attention
+## 🔀 [Pull Request #99](https://github.com/ikawrakow/ik_llama.cpp/pull/99) - Enable IQ4_NL for KV-cache in token generation using Flash Attention
| **Author** | `ikawrakow` |
| :--- | :--- |
-| **State** | ❌ **Closed** |
+| **State** | 🔀 **Merged** |
+| **Source Branch** | `ik/fattn_enable_iq4_nl` |
+| **Target Branch** | `main` |
| **Created** | 2024-10-20 |
| **Updated** | 2024-10-21 |
+| **Merged** | 2024-10-21 |
---
-#### Description
+## 📄 Description
Only added for head size = 128 for now, we can add other head sizes if needed.
@@ -16,19 +19,19 @@ For me `-ctk q8_0 -ctv iq4_nl` is the most useful combination in terms of the co
**Update**
-Based on @Nexesenex comment in #92, added `IQ4_NL + IQ4_NL` as a possible KV-cache combination for head size of 128. Hopefully this is a better alternative than `Q4_0 + Q4_0` for the VRAM poor.
+Based on @Nexesenex comment in [#92](https://github.com/ikawrakow/ik_llama.cpp/issues/92), added `IQ4_NL + IQ4_NL` as a possible KV-cache combination for head size of 128. Hopefully this is a better alternative than `Q4_0 + Q4_0` for the VRAM poor.
---
-#### 💬 Conversation
+## 💬 Conversation
-👤 **saood06** commented the **2024-10-20** at **18:48:37**:
+👤 **saood06** commented on **2024-10-20** at **18:48:37**
Since you're enabling q8_0/iq4_nl by default you should update the on_no_fattn_vec_case function in fattn-common.cuh to mention it.
---
-👤 **ikawrakow** commented the **2024-10-21** at **08:10:33**:
+👤 **ikawrakow** commented on **2024-10-21** at **08:10:33**
> Since you're enabling q8_0/iq4_nl by default you should update the on_no_fattn_vec_case function in fattn-common.cuh to mention it.
@@ -36,6 +39,6 @@ Thanks for pointing out. It is now updated to reflect the possible quantized cac
---
-👤 **Nexesenex** commented the **2024-10-21** at **09:47:46**:
+👤 **Nexesenex** commented on **2024-10-21** at **09:47:46**
-It works. In the name of the VRAM poor that I do so well represent, thanks! xD
\ No newline at end of file
+iq4_nl / iq4_nl works. In the name of the VRAM poor that I do so well represent, thanks! xD
\ No newline at end of file