Information and the original CUDA implementation can be found in [PR 113](https://github.com/ikawrakow/ik_llama.cpp/pull/113). Additional implementations: Metal [PR 475](https://github.com/ikawrakow/ik_llama.cpp/pull/475), Neon [PR 471](https://github.com/ikawrakow/ik_llama.cpp/pull/471), CPU [PR 441](https://github.com/ikawrakow/ik_llama.cpp/pull/441)
##### IQK quants
Information can be found in [Discussion 8](https://github.com/ikawrakow/ik_llama.cpp/discussions/8).
#### Quantization performance and support improvements and fixes
* MMQ implementation for `IQ4_KS_R4` and `IQ5_KS_R4` [PR 493](https://github.com/ikawrakow/ik_llama.cpp/pull/493)
* CUDA implementation for `IQ1_S_R4` [PR 492](https://github.com/ikawrakow/ik_llama.cpp/pull/492), `IQ1_M_R4` [PR 494](https://github.com/ikawrakow/ik_llama.cpp/pull/494)
* Faster CPU prompt processing for Trellis quants and MoE models. [PR 488](https://github.com/ikawrakow/ik_llama.cpp/pull/488)
* Trellis quants: faster CPU prompt processing [PR 482](https://github.com/ikawrakow/ik_llama.cpp/pull/482).
* Minor (~2%) `iq2_ks` TG performance improvement on CUDA [PR 468](https://github.com/ikawrakow/ik_llama.cpp/pull/468)
* CUDA GEMM and GEMV for `IQ4_KS_R4` and `IQ5_KS_R4` [PR 462](https://github.com/ikawrakow/ik_llama.cpp/pull/462)
* CUDA implementation for `IQ2_K_R4`, `IQ3_K_R4`, `IQ4_K_R4`, `IQ5_K_R4` [PR 461](https://github.com/ikawrakow/ik_llama.cpp/pull/461)
* Faster `IQ3_KT` and `IQ4_KT` [PR 453](https://github.com/ikawrakow/ik_llama.cpp/pull/453)
* Legacy quants conversion schemes in `convert_hf_to_gguf.py` [PR 449](https://github.com/ikawrakow/ik_llama.cpp/pull/449), `Q6_0` in [PR 483](https://github.com/ikawrakow/ik_llama.cpp/pull/483)
* Zen4: Faster PP for `IQ2_KS, IQ4_KS, IQ5_KS` [PR 428](https://github.com/ikawrakow/ik_llama.cpp/pull/428)
* June 8 2025: Webui updated (legacy webui still available when `--path ./examples/server/public_legacy` is passed) [PR 481](https://github.com/ikawrakow/ik_llama.cpp/pull/481); see the server sketch after this list
* June 8 2025: RPC improvements [PR 480](https://github.com/ikawrakow/ik_llama.cpp/pull/480)
* June 7 2025: Add a server endpoint that lists all saved prompt caches [PR 502](https://github.com/ikawrakow/ik_llama.cpp/pull/502)
* June 6 2025: Make prompt cache saving and restoring MLA aware [PR 497](https://github.com/ikawrakow/ik_llama.cpp/pull/497)
* May 22 2025: Refactor `iqk_mul_mat.cpp` which speeds up compilation time significantly. [PR 435](https://github.com/ikawrakow/ik_llama.cpp/pull/435)
* May 17 2025: Option to enable or disable the CPU FA kernels [PR 429](https://github.com/ikawrakow/ik_llama.cpp/pull/429).
* May 12 2025: User can now control if/which operations with tensors held in RAM are offloaded to the GPU. See [PR 405](https://github.com/ikawrakow/ik_llama.cpp/pull/405)
* May 12 2025: Compatibility issues with mainline `llama.cpp` GGUFs for DeepSeek models with MLA enabled were resolved in [PR 394](https://github.com/ikawrakow/ik_llama.cpp/pull/394). The lower prompt processing performance resulting from using `llama.cpp`-style MLA GGUFs was recovered in [PR 409](https://github.com/ikawrakow/ik_llama.cpp/pull/409).
* May 9 2025: Support for LlaMA-3-Nemotron models added, see [PR 377](https://github.com/ikawrakow/ik_llama.cpp/pull/377)
* April 29 2025: Qwen3 support added, see [PR 355](https://github.com/ikawrakow/ik_llama.cpp/pull/355)
* April 26 2025: GLM-4 support added, see [PR 344](https://github.com/ikawrakow/ik_llama.cpp/pull/344)
* April 26 2025: Command-A support added, see [PR 341](https://github.com/ikawrakow/ik_llama.cpp/pull/341)
* April 22 2025: Support for the latest Microsoft Bitnet model added, see [PR 337](https://github.com/ikawrakow/ik_llama.cpp/pull/337)
* April 21 2025: ik_llama.cpp builds and runs successfully on Android (using termux), see [PR 336](https://github.com/ikawrakow/ik_llama.cpp/pull/336)
* April 13 2025: `IQ1_M` quantization improvements, see [PR 327](https://github.com/ikawrakow/ik_llama.cpp/pull/327)
* April 10 2025: LLaMA-4 support added, see [PR 321](https://github.com/ikawrakow/ik_llama.cpp/pull/321). In the PR there are also some custom quantization recipes for L4-Scout provided.
* April 7 2025: `IQ2_XS` quantization improvements, see [PR 312](https://github.com/ikawrakow/ik_llama.cpp/pull/312)
* April 1 2025: Quantization improvements for `Q2_K, Q4_K, Q5_K, Q4_1, Q5_1`, see [PR 302](https://github.com/ikawrakow/ik_llama.cpp/pull/302)
* March 28 2025: Quantization improvements for `Q4_0, Q5_0, Q6_0, Q3_K, Q6_K, IQ4_XS, IQ4_NL`, see [PR 295](https://github.com/ikawrakow/ik_llama.cpp/pull/295)
* March 22 2025: Gemma3 support added
* Feb 23 2025: `sweep-bench` - better performance benchmarking
* Jan 23 2025: DeepSeek-V3 support added
* March 1 2025: Smart Expert Reduction for faster DeepSeek inference [PR 239](https://github.com/ikawrakow/ik_llama.cpp/pull/239)
* Feb 25 2025: Tensor overrides for finer control over where model weights are stored (GPU or CPU) [PR 232](https://github.com/ikawrakow/ik_llama.cpp/pull/232); see the command-line sketch after this list
* Feb 19 2025: `Q8_KV` - new type for 8-bit KV-cache quantization [PR 208](https://github.com/ikawrakow/ik_llama.cpp/pull/208)
* March 7 2025: Custom quantization mixes using regular expressions [PR 244](https://github.com/ikawrakow/ik_llama.cpp/pull/244); see the command-line sketch after this list
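The tensor-override and custom-mix entries above are easiest to see on the command line. The following is a minimal sketch, not an authoritative reference: the flag spellings (`-ot`/`--override-tensor` from PR 232, `--custom-q` from PR 244) are my reading of the linked PRs, and the regexes and type mixes are invented for illustration.

```bash
# Tensor overrides (PR 232): keep the MoE expert tensors in CPU RAM while
# offloading the remaining layers to the GPU. The regex is matched against
# tensor names; this particular pattern is illustrative.
./llama-server -m ./models/model.gguf -ngl 99 -ot "\.ffn_.*_exps\.=CPU"

# Custom quantization mix (PR 244): choose quant types per tensor-name regex
# (hypothetical regex=type pairs; Q4_K_M is the base type for everything else).
./llama-quantize --custom-q "ffn_down=q6_K,attn=q5_K" \
    ./models/model-f16.gguf ./models/model-custom.gguf Q4_K_M
```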
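And for the June 8 webui entry, a server invocation sketch; the model path and port are placeholders, only the `--path` value is quoted from the entry itself:

```bash
# New webui (the default after PR 481)
./llama-server -m ./models/model.gguf --port 8080

# Legacy webui, selected with the --path option quoted above
./llama-server -m ./models/model.gguf --port 8080 --path ./examples/server/public_legacy
```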
### Performance improvements
* May 13 2025: Better CPU FA performance for DeepSeek-Lite. [PR 410](https://github.com/ikawrakow/ik_llama.cpp/pull/410)
* May 11 2025: Slightly faster flash attention for DeepSeek models on CUDA, along with extended compatibility to Turing or newer GPUs. [PR 408](https://github.com/ikawrakow/ik_llama.cpp/pull/408)
* May 7 2025: Faster TG for DeepSeek models with GPU or hybrid GPU/CPU inference. [PR 386](https://github.com/ikawrakow/ik_llama.cpp/pull/386). Caveat: Ampere or newer Nvidia GPU required
* May 4 2025: Significant token generation performance improvement on CUDA with Flash Attention for GQA models. For details and benchmarks see [PR 370](https://github.com/ikawrakow/ik_llama.cpp/pull/370)
* April 17 2025: Better CPU Flash Attention token generation performance. [PR 332](https://github.com/ikawrakow/ik_llama.cpp/pull/332)
* April 3 2025: Much faster MoE implementation on Metal. [PR 307](https://github.com/ikawrakow/ik_llama.cpp/pull/307)
* March 25 2025: Better MoE performance on CUDA [PR 283](https://github.com/ikawrakow/ik_llama.cpp/pull/283)
* March 23 2025: Better batched processing speed for DeepSeek models [PR 282](https://github.com/ikawrakow/ik_llama.cpp/pull/282)
* March 18 2025: Reduce compute buffer size [PR 237](https://github.com/ikawrakow/ik_llama.cpp/pull/237)
* March 10 2025: Better TG performance for MoE models on CUDA [PR 248](https://github.com/ikawrakow/ik_llama.cpp/pull/248)
* Feb 23 2025: Fused FFN ops for faster MoE inference [PR 229](https://github.com/ikawrakow/ik_llama.cpp/pull/229)
* Feb 20 2025: Fast GEMM/GEMV for `IQ1_S` [PR 212](https://github.com/ikawrakow/ik_llama.cpp/pull/212)
### Flash-MLA
* March 21 2025: 🚀 FlashMLA-3: fastest CPU-only inference for DeepSeek models [PR 273](https://github.com/ikawrakow/ik_llama.cpp/pull/273)
* March 17 2025: 🚀 FlashMLA-2 performance improvements [PR 253](https://github.com/ikawrakow/ik_llama.cpp/pull/253)
* March 12 2025: Allow `Q8_0` KV cache with FlashMLA-2 on CUDA [PR 265](https://github.com/ikawrakow/ik_llama.cpp/pull/265)
* March 9 2025: 🚀 FlashMLA on CUDA [PR 247](https://github.com/ikawrakow/ik_llama.cpp/pull/247)
* March 8 2025: 🚀 Faster FlashMLA CPU implementation [PR 243](https://github.com/ikawrakow/ik_llama.cpp/pull/243)
* March 3 2025: 🚀 Introducing FlashMLA - MLA with Flash Attention [PR 240](https://github.com/ikawrakow/ik_llama.cpp/pull/240)
* Feb 27 2025: MLA without transposed cache [PR 235](https://github.com/ikawrakow/ik_llama.cpp/pull/235)
* Feb 13 2025: Allow `Q8_0` quantized cache with MLA [PR 206](https://github.com/ikawrakow/ik_llama.cpp/pull/206); see the usage sketch after this list
* Feb 11 2025: 🚀 Flash Attention support for DeepSeek models [PR 200](https://github.com/ikawrakow/ik_llama.cpp/pull/200)
* Feb 9 2025: 🚀 MLA for DeepSeek models [PR 188](https://github.com/ikawrakow/ik_llama.cpp/pull/188)
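A combined usage sketch for the cache-related entries above: running a DeepSeek model with MLA, flash attention, and a `Q8_0`-quantized K cache. The flag spellings (`-mla`, `-fa`, `-ctk`) and the `-mla` level are my reading of these PRs, and the model path is a placeholder; consult the linked PRs for the exact options.

```bash
# FlashMLA-3 (-mla 3) with flash attention and Q8_0 K-cache quantization
./llama-server -m ./models/DeepSeek-V3.gguf -fa -mla 3 -ctk q8_0
```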
### Fixes
* Fix bug in MMVQ kernel [PR 446](https://github.com/ikawrakow/ik_llama.cpp/pull/446)
* Fix AVX2 implementation of `IQ4_K, IQ4_KS, IQ5_K, IQ6_K` [PR 427](https://github.com/ikawrakow/ik_llama.cpp/pull/427)
* Fix standard attention on the CPU [PR 421](https://github.com/ikawrakow/ik_llama.cpp/pull/421)
* Fix imatrix calculation for MLA models [PR 411](https://github.com/ikawrakow/ik_llama.cpp/pull/411)
* Fix new CUDA FA on Turing [PR 413](https://github.com/ikawrakow/ik_llama.cpp/pull/413)