### 🗣️ [#18](https://github.com/ikawrakow/ik_llama.cpp/discussions/18) - CPU beating GPU in token generation speed

| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-08-13 |
| **Updated** | 2025-04-03 |

---

#### Description

The [TriLM](https://huggingface.co/collections/SpectraSuite/trilms-unpacked-668d5f62afe0f4036925b1d2) ternary models are available in various sizes, so I was curious to look into prompt processing (PP) and token generation (TG) speed when the model is small enough to fit in the CPU cache. I have a Ryzen-7950X CPU with 64 MiB of L3 cache, and the 99M-parameter TriLM model is 46 MiB when quantized with `IQ2_TN`. So, without further ado, let's look at a comparison between the Ryzen-7950X and an RTX-4080 in this case:
| backend | threads | test | t/s |
| ---------- | ------: | ------------: | ---------------: |
| Ryzen-7950X | 16 | pp1500 | 8268.11 ± 48.34 |
| Ryzen-7950X | 4 | tg500 | 1016.65 ± 22.17 |
| Ryzen-7950X | 8 | tg500 | 1224.83 ± 32.28 |
| Ryzen-7950X | 16 | tg500 | 1240.54 ± 25.74 |
| RTX-4080 | - | pp1500 | 110388 ± 250 |
| RTX-4080 | - | tg500 | 1136.64 ± 4.99 |

The GPU is still much faster than the CPU for prompt processing (although the gap, typically a factor of ~30 between this specific GPU and CPU, has shrunk to just a factor of 13), but now the CPU beats the GPU in TG speed!
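
A rough back-of-envelope (my own estimate, not part of the original post): at batch size 1, each generated token reads every weight once, so the 16-thread TG result implies a sustained weight-read bandwidth of roughly

$$B \gtrsim r_{\text{tg}} \times S_{\text{model}} = 1240\ \text{t/s} \times 45.89\ \text{MiB} \approx 55.6\ \text{GiB/s},$$

which is close to what dual-channel DDR5 delivers in practice; with the entire model resident in the much faster L3 cache, that rate stops being a bottleneck.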

I also have an M2-Max laptop (the version with a 30-core GPU). Here is what we get:

| backend | threads | test | t/s |
| ---------- | ------: | ------------: | ---------------: |
| M2-Max CPU | 8 | pp1500 | 5209.27 ± 21.48 |
| M2-Max CPU | 2 | tg500 | 692.87 ± 1.74 |
| M2-Max CPU | 4 | tg500 | 841.48 ± 5.96 |
| M2-Max CPU | 8 | tg500 | 894.73 ± 10.03 |
| M2-Max GPU | 4 | pp1500 | 25824 ± 562 |
| M2-Max GPU | 4 | tg500 | 464.86 ± 3.85 |

Here too the GPU is faster for PP (but only about 5×), while the CPU wipes the floor with the GPU for TG, beating it by nearly 2× with all 8 threads and by 1.5× with just 2 threads!
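
For reference, tables like the ones above are what `llama-bench` prints; a minimal sketch of the kind of invocation involved (the GGUF filename is my assumption, not from the post):

```bash
# CPU: sweep thread counts, 1500-token prompt, 500 generated tokens
./llama-bench -m trilm-99m-iq2_tn.gguf -p 1500 -n 500 -t 4,8,16 -ngl 0

# GPU: offload all layers
./llama-bench -m trilm-99m-iq2_tn.gguf -p 1500 -n 500 -ngl 100
```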

---

#### 🗣️ Discussion

👤 **ikawrakow** replied the **2024-09-02** at **13:20:54**:<br>

Now that we have an efficient Flash Attention (FA) implementation on the CPU via PR #32, we can again compare CPU and GPU performance for this tiny 99M-parameter model. We get:

| model | size | params | backend | ngl | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ------------: | ---------------: |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CUDA | 100 | 1 | 1 | pp1500 | 156827.38 ± 727 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CUDA | 100 | 1 | 1 | tg500 | 1496.37 ± 36.79 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 0 | 16 | 1 | pp1500 | 12133.80 ± 51.45 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 0 | 16 | 1 | tg500 | 1509.52 ± 9.65 |

TG speed is now about the same, which is still quite remarkable.

FA has improved CPU prompt processing speed by almost 50% and TG speed by 22%.
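
The `fa` column corresponds to `llama-bench`'s flash-attention switch; a sketch of how the CUDA-vs-CPU comparison above might be reproduced (same hypothetical filename as before):

```bash
# no layers on the GPU vs. all layers, flash attention enabled
./llama-bench -m trilm-99m-iq2_bn.gguf -p 1500 -n 500 -t 16 -fa 1 -ngl 0,100
```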

> 👤 **saood06** replied the **2025-04-02** at **10:36:44**:<br>
> Is there a chance SpargeAttn could be implemented here? Code [here](https://github.com/thu-ml/SpargeAttn), paper [here](https://arxiv.org/abs/2502.18137).
>
> If it could, would it benefit speed on CPU?
>
> 👤 **ikawrakow** replied the **2025-04-02** at **13:44:09**:<br>
> Other than the paper, is there any evidence that this works as advertised? If I did nothing else but implement breakthroughs announced on arXiv, the day still wouldn't have enough hours.
>
> 👤 **saood06** replied the **2025-04-03** at **00:24:39**:<br>
> > Other than the paper, is there any evidence that this works as advertised?
>
> Not really (there are multiple ComfyUI custom nodes that port support, but not much on people actually using it). The paper looked interesting to me and the idea makes sense, but the implementation they have looks premature. The same group put out SageAttention/SageAttention2, which has been widely adopted (mostly for image/video models) and whose performance matched the paper, but SpargeAttn has drawn interest without much adoption because of the state of the implementation.
>
> > If I did nothing else but implement breakthroughs announced on arXiv, the day still wouldn't have enough hours.
>
> Sorry.

---

👤 **ikawrakow** replied the **2024-09-08** at **07:16:59**:<br>

With PR #42 we get this:

| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ------------: | ---------------: |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 16 | 1 | pp1500 | 12906.95 ± 61.04 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 16 | 1 | tg512 | 1563.62 ± 12.55 |

I.e., a 56% improvement for PP and a 26% improvement for TG since the original post from Aug 13!

I see that [PR-8151](https://github.com/ggerganov/llama.cpp/pull/8151), which provides dedicated quantization types for the TriLM ternary models in mainline `llama.cpp`, has been merged. Here is what we get for `TQ2_0`, which corresponds to our `IQ2_TN`:

| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ------------: | -------------------: |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 1 | pp1500 | 5187.34 ± 11.69 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 0 | pp1500 | 5281.54 ± 53.33 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 1 | tg500 | 1156.25 ± 18.14 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 0 | tg500 | 1041.27 ± 21.30 |

Our version is 2.44× faster for PP and 35% faster for TG.
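
For completeness, a sketch of how the two quantized files might be produced (filenames are my assumptions; each project's `llama-quantize` takes the quantization type name as the last argument):

```bash
# ik_llama.cpp: quantize the f16 GGUF to IQ2_TN
./llama-quantize trilm-99m-f16.gguf trilm-99m-iq2_tn.gguf iq2_tn

# mainline llama.cpp (after PR-8151): quantize to TQ2_0
./llama-quantize trilm-99m-f16.gguf trilm-99m-tq2_0.gguf tq2_0
```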