Commit ab7d193: Add GitHub data (#637)

Parent: 866c70e

626 files changed: +175142, -0 lines changed
Lines changed: 26 additions & 0 deletions
### 🗣️ [#100](https://github.com/ikawrakow/ik_llama.cpp/discussions/100) - New argument / env variable for GGML_SCHED_MAX_COPIES?

| **Author** | `Nexesenex` |
| :--- | :--- |
| **Created** | 2024-10-21 |
| **Updated** | 2024-10-21 |

---
#### Description

@ikawrakow, could you add a CLI argument (or at least an environment variable, which is much simpler I guess, but I'm failing to do it right) to set GGML_SCHED_MAX_COPIES without recompiling? It impacts VRAM usage and performance, and it would be great to be able to set it conveniently for benchmarking and customized use.
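For illustration, a runtime override of the compile-time default could look roughly like the sketch below. The function name and the accepted range are assumptions made for the example, not actual ik_llama.cpp code:

```cpp
// Hypothetical sketch only: read GGML_SCHED_MAX_COPIES from the environment
// at startup instead of baking it in at compile time.
#include <cstdlib>

static int sched_max_copies() {
    int n_copies = 4;  // compile-time default mentioned in the discussion below
    if (const char * env = std::getenv("GGML_SCHED_MAX_COPIES")) {
        const int v = std::atoi(env);
        if (v >= 1 && v <= 16) {
            n_copies = v;  // accept only a plausible range
        }
    }
    return n_copies;
}
```

A CLI flag would then just be a second way of feeding the same value to the scheduler instead of the constant.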
---

#### 🗣️ Discussion

👤 **ikawrakow** replied on **2024-10-21** at **08:29:25**:<br>

I haven't looked into this at all. What is it good for?

---

👤 **Nexesenex** replied on **2024-10-21** at **09:36:22**:<br>

It's supposed to make inference faster on multi-GPU setups, I guess. Mainline sets it at 4; I set it at 1 because I didn't notice much improvement back in the day, but I did notice more VRAM consumption and GPU load.
Lines changed: 22 additions & 0 deletions
### 🗣️ [#104](https://github.com/ikawrakow/ik_llama.cpp/discussions/104) - Convenience improvements for llama-quantize

| **Author** | `Nexesenex` |
| :--- | :--- |
| **Created** | 2024-10-23 |
| **Updated** | 2024-10-23 |

---
#### Description

Hey IK.

Here are some ideas for potential llama-quantize features that I'm not capable of coding myself:

- Create the directory for the output file when it doesn't exist.

- Allow interrupting the quantization, or even **quantize each tensor into a directory**, so that the quantization can be resumed after a crash, or so that a single series of tensors can be requantized (attn_q weights for example, or tensors governed by a function like use_more_bits when one branch of the ternary expression deciding a given tensor's quant is changed but not the other). The monolithic approach makes a pretty monstrous file, but at the same time wastes a lot of space, time, and compute.

- Integrate formulas like use_more_bits (we have one, and I intend to PR more of them) into the tensors that we manually specify with CLI arguments to customize an FTYPE.

- Pre-check the available disk space before quantization, ideally coupled with a dry run giving the final size of the desired quant (a sketch combining this and the first item follows below).
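As a rough illustration of the first and last items (creating the output directory and checking free disk space), something along these lines would do. The function name and error handling are assumptions for the example, not llama-quantize code:

```cpp
// Hypothetical sketch: create the output directory if needed and verify that
// enough disk space is available before starting a long quantization run.
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

bool prepare_output(const std::string & out_path, std::uintmax_t expected_bytes) {
    const fs::path out(out_path);
    const fs::path dir = out.has_parent_path() ? out.parent_path() : fs::path(".");
    std::error_code ec;

    // Create the parent directory (no-op if it already exists).
    fs::create_directories(dir, ec);
    if (ec) {
        std::fprintf(stderr, "failed to create %s: %s\n", dir.string().c_str(), ec.message().c_str());
        return false;
    }

    // Compare free space against the (estimated) final size of the quantized model.
    const fs::space_info si = fs::space(dir, ec);
    if (!ec && si.available < expected_bytes) {
        std::fprintf(stderr, "not enough disk space: %llu bytes free, ~%llu needed\n",
                     (unsigned long long) si.available, (unsigned long long) expected_bytes);
        return false;
    }
    return true;
}
```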

github-data/discussions/140-Questions about weight[j].md

Lines changed: 276 additions & 0 deletions
Large diffs are not rendered by default.

github-data/discussions/15-Will LQER improve k- and i-quants_.md

Lines changed: 293 additions & 0 deletions
Large diffs are not rendered by default.

github-data/discussions/164-Latest CPU performance comparison with llama.cpp.md

Lines changed: 766 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 28 additions & 0 deletions
### 🗣️ [#165](https://github.com/ikawrakow/ik_llama.cpp/discussions/165) - Norm RMS Epsilon

| **Author** | `Nexesenex` |
| :--- | :--- |
| **Created** | 2024-12-25 |
| **Updated** | 2024-12-27 |

---
#### Description

While it crosses my mind...

@ikawrakow: a while ago, you made some measurements with variations of the norm RMS epsilon which showed a small benefit from offsetting it for <2 bpw quants. It was on L2, I believe, and I wonder whether it applies to other architectures and, if so, whether there is some sort of "formula" that would come with it to improve the low-bitrate quants themselves.

Just layman's thoughts.

And merry Xmas, btw, if you celebrate it!
---

#### 🗣️ Discussion

👤 **ikawrakow** replied on **2024-12-27** at **17:44:24**:<br>

I'm travelling, so just quickly from the phone.

Yes, there is a small benefit from increasing rms_eps also for LLaMA-3, but only for very low-bit quants (IQ2_XXS). No, I haven't done any kind of systematic investigation.
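For context, `rms_eps` is the small constant inside the RMS normalization, so increasing it slightly damps the normalized activations. A generic sketch of the computation (not the repository's ggml kernel; the default value shown is purely illustrative):

```cpp
// Generic RMS norm sketch: y_i = w_i * x_i / sqrt(mean(x^2) + eps).
// Increasing eps is the knob discussed above; 1e-5f is just an illustrative default.
#include <cmath>

void rms_norm(const float * x, const float * w, float * y, int n, float eps = 1e-5f) {
    double sum_sq = 0.0;
    for (int i = 0; i < n; ++i) sum_sq += (double) x[i] * x[i];
    const float scale = 1.0f / std::sqrt((float) (sum_sq / n) + eps);
    for (int i = 0; i < n; ++i) y[i] = w[i] * x[i] * scale;
}
```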
Lines changed: 49 additions & 0 deletions
### 🗣️ [#166](https://github.com/ikawrakow/ik_llama.cpp/discussions/166) - Learning more LLM quantization

| **Author** | `robinnarsinghranabhat` |
| :--- | :--- |
| **Created** | 2025-01-05 |
| **Updated** | 2025-03-13 |

---
#### Description

For a beginner in ML like me, I wanted to learn which research papers guided the quantization implementation in llama.cpp.

It might sound silly, but we have separate tricks for quantization during training and during evaluation, right?

---

#### 🗣️ Discussion
👤 **ikawrakow** replied on **2025-01-05** at **10:37:28**:<br>

> For a beginner in ML like me, I wanted to learn which research papers guided the quantization implementation in llama.cpp.

I developed all quantization types in `llama.cpp` apart from the legacy quants `Q4_0, Q4_1, Q5_0, Q5_1, Q8_0` (but these are very simple round-to-nearest block-wise quants). I did not read any research papers, just went ahead and experimented. Rarely reading papers has always been my approach to research: I have found that reading what others have done influences my thinking direction and hence may prevent finding a better approach. I only go and read papers if I was not able to find a meaningful solution to a problem on my own.
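To make "round-to-nearest block-wise quants" concrete, here is a simplified `Q8_0`-style sketch: 32 weights per block, one float scale, 8-bit integers. It illustrates the idea only and is not the actual ggml code:

```cpp
// Simplified Q8_0-style round-to-nearest block quantization (illustration only).
#include <algorithm>
#include <cmath>
#include <cstdint>

struct block_q8 {
    float  d;      // per-block scale
    int8_t q[32];  // quantized weights
};

block_q8 quantize_block(const float * x) {
    block_q8 b{};
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    b.d = amax / 127.0f;                              // largest value maps to +/-127
    const float id = b.d > 0.0f ? 1.0f / b.d : 0.0f;  // inverse scale
    for (int i = 0; i < 32; ++i) b.q[i] = (int8_t) std::lround(x[i] * id);
    return b;
}
// Dequantization is simply x_i ≈ d * q[i].
```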
> It might sound silly, but we have separate tricks for quantization during training and during evaluation, right?

`llama.cpp` does not do any training, so it is always post-training quantization (PTQ). But in general there is quantization-aware training (QAT), where the model is not actually quantized during training, but model weights are forced to stay within a specified range in the hope that this will give better PTQ results. The only actually-quantized model training approach I'm aware of is Bitnet from Microsoft Research, where a ternary model is trained (weights are -1, 0, 1, plus a per-tensor float scaling factor). More recently researchers have been utilizing fine-tuning for PTQ, where some corpus of training data is used to guide PTQ (look for, e.g., QuIP#, AQLM, QTIP). This is quite different from the simple quantization approaches used in `llama.cpp` and also here in this repository, and it requires a full-fledged training framework such as PyTorch, powerful GPU(s), and many hours/days of GPU time.
---

👤 **robinnarsinghranabhat** replied on **2025-01-10** at **21:38:11**:<br>

Thank you for this humble response!

Now I understand it's doing inference on quantized weights.

But I get lost trying to understand the llama.cpp codebase. How should I navigate it?
I am comfortable with Python and machine-learning concepts, and I understand pointers in C,
but I have never written complex programs in C/C++.

Do I need to understand fundamental concepts of operating systems, computer architecture, memory management, etc.?

I want to be a programmer like you.

Sorry... lots of questions all over the place :(

> 👤 **arnfaldur** replied on **2025-03-13** at **02:10:31**:<br>
> Trying to understand this codebase isn't attacking the wall where it's lowest. You're probably best off finding some beginner/intermediate C++ courses online; I imagine plenty are available for free. You don't strictly need to understand all these fundamentals to understand what this project is doing, but you sound like you're in the *don't know what you don't know* phase, and a general computer science course would likely get you the farthest at this point.
Lines changed: 96 additions & 0 deletions
### 🗣️ [#18](https://github.com/ikawrakow/ik_llama.cpp/discussions/18) - CPU beating GPU in token generation speed

| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-08-13 |
| **Updated** | 2025-04-03 |

---
#### Description

The [TriLM](https://huggingface.co/collections/SpectraSuite/trilms-unpacked-668d5f62afe0f4036925b1d2) ternary models are available in various sizes, so I was curious to look into prompt processing (PP) and token generation (TG) speed when the model is small enough to fit in the CPU cache. I have a Ryzen-7950X CPU with 64 MiB of L3 cache, and the 99M parameter TriLM model is 46 MiB when quantized with `IQ2_TN`. So, without further ado, let's look at a comparison between the Ryzen-7950X and an RTX-4080 in this case:

| backend | threads | test | t/s |
| ---------- | ------: | ------------: | ---------------: |
| Ryzen-7950X | 16 | pp1500 | 8268.11 ± 48.34 |
| Ryzen-7950X | 4 | tg500 | 1016.65 ± 22.17 |
| Ryzen-7950X | 8 | tg500 | 1224.83 ± 32.28 |
| Ryzen-7950X | 16 | tg500 | 1240.54 ± 25.74 |
| RTX-4080 | - | pp1500 | 110388 ± 250 |
| RTX-4080 | - | tg500 | 1136.64 ± 4.99 |

The GPU is still much faster than the CPU for prompt processing (although the difference, which is typically a factor of ~30 between this specific GPU and CPU, has shrunk to just a factor of 13), but now the CPU beats the GPU in TG speed!
I also have an M2-Max laptop (the version with a 30-core GPU). Here is what we get:

| backend | threads | test | t/s |
| ---------- | ------: | ------------: | ---------------: |
| M2-Max CPU | 8 | pp1500 | 5209.27 ± 21.48 |
| M2-Max CPU | 2 | tg500 | 692.87 ± 1.74 |
| M2-Max CPU | 4 | tg500 | 841.48 ± 5.96 |
| M2-Max CPU | 8 | tg500 | 894.73 ± 10.03 |
| M2-Max GPU | 4 | pp1500 | 25824 ± 562 |
| M2-Max GPU | 4 | tg500 | 464.86 ± 3.85 |

Here too the GPU is faster for PP (but just 5X faster), while the CPU wipes the floor with the GPU for TG, beating it by close to 2X using all 8 threads and by 1.5X with just 2 threads!
---

#### 🗣️ Discussion

👤 **ikawrakow** replied on **2024-09-02** at **13:20:54**:<br>

Now that we have an efficient Flash Attention (FA) implementation on the CPU via PR #32, we can again compare performance between the CPU and GPU for this tiny 99M parameter model. We get

| model | size | params | backend | ngl | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ------------: | ---------------: |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CUDA | 100 | 1 | 1 | pp1500 | 156827.38 ± 727 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CUDA | 100 | 1 | 1 | tg500 | 1496.37 ± 36.79 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 0 | 16 | 1 | pp1500 | 12133.80 ± 51.45 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 0 | 16 | 1 | tg500 | 1509.52 ± 9.65 |

TG speed is now about the same, which is still quite remarkable.

FA has improved CPU prompt processing speed by almost 50% and TG by 22%.
> 👤 **saood06** replied on **2025-04-02** at **10:36:44**:<br>
> Is there a chance SpargeAttn could be implemented here? Code [here](https://github.com/thu-ml/SpargeAttn), paper [here](https://arxiv.org/abs/2502.18137).
>
> If it could, would it benefit speed on CPU?
>
> 👤 **ikawrakow** replied on **2025-04-02** at **13:44:09**:<br>
> Other than the paper, is there any evidence that this works as advertised? If I did nothing else but implement breakthroughs announced on arXiv, the day still wouldn't have enough hours.
>
> 👤 **saood06** replied on **2025-04-03** at **00:24:39**:<br>
> > Other than the paper, is there any evidence that this works as advertised?
>
> Not really (there are multiple ComfyUI custom nodes that port support, but not much on people actually using it). The paper looked interesting to me and the idea makes sense, but their implementation looks premature. The same group put out SageAttention/SageAttention2, which has been widely adopted (mostly for image/video models) and whose performance matched the paper, but SpargeAttn has gotten interest without much adoption because of the state of the implementation.
>
> > If I did nothing else but implement breakthroughs announced on arXiv, the day still wouldn't have enough hours.
>
> Sorry.
---

👤 **ikawrakow** replied on **2024-09-08** at **07:16:59**:<br>

With PR #42 we get this:

| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ------------: | ---------------: |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 16 | 1 | pp1500 | 12906.95 ± 61.04 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 16 | 1 | tg512 | 1563.62 ± 12.55 |

I.e., a 56% improvement for PP and a 26% improvement for TG since the original post from Aug 13!

I see that [PR-8151](https://github.com/ggerganov/llama.cpp/pull/8151), which provides dedicated quantization for the TriLM ternary models in mainline `llama.cpp`, has been merged. Here is what we get for `TQ2_0`, which corresponds to our `IQ2_TN`:

| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ------------: | -------------------: |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 1 | pp1500 | 5187.34 ± 11.69 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 0 | pp1500 | 5281.54 ± 53.33 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 1 | tg500 | 1156.25 ± 18.14 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 0 | tg500 | 1041.27 ± 21.30 |

Our version is 2.44X faster for PP and 35% faster for TG.
