
Releases: ggml-org/llama.cpp

b6124 (11 Aug 09:52, commit 002cb1b)
kleidiai: fix unsigned overflow bug (#15150)

* kleidiai: fix unsigned overflow bug

* address review comments
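
The release notes don't spell out the exact overflow, so as a generic illustration of this bug class: subtracting a larger unsigned value from a smaller one in C wraps around to a huge number instead of going negative. A minimal Python sketch emulating 32-bit unsigned arithmetic (the function names are illustrative, not taken from the kleidiai code):

```python
U32_MASK = 0xFFFFFFFF  # emulate C's uint32_t wraparound


def sub_u32(a: int, b: int) -> int:
    """C-style unsigned subtraction: wraps modulo 2**32."""
    return (a - b) & U32_MASK


def remaining_rows_buggy(total: int, processed: int) -> int:
    # If processed > total, the "remaining" count wraps to a huge
    # value instead of clamping to zero.
    return sub_u32(total, processed)


def remaining_rows_fixed(total: int, processed: int) -> int:
    # Guard before subtracting so the result can never wrap.
    return sub_u32(total, processed) if total >= processed else 0
```

For example, `remaining_rows_buggy(4, 6)` yields 4294967294 rather than 0; the guarded version clamps correctly.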

b6123 (09 Aug 18:42, commit 79c1160)
cuda: refactored ssm_scan and use CUB (#13291)

* cuda: refactored ssm_scan to use CUB

* fixed compilation error when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning

b6122 (09 Aug 12:14, commit 34c9d76)
CUDA: add attention sinks for tile and wmma (#15178)

* CUDA: add attention sinks for tile and wmma

* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
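
An attention "sink" is an extra per-head logit that joins the softmax normalization but contributes no value, letting a head dump probability mass instead of spreading it over tokens. The sketch below follows the common sink-logit formulation and is only an illustration of the math, not the CUDA kernel's code:

```python
import math


def softmax_with_sink(scores, sink):
    """Softmax over attention scores with one extra sink logit.

    The sink participates in the running max and the denominator,
    so the returned probabilities over `scores` sum to less than 1;
    the missing mass is absorbed by the sink, for which no value
    row is emitted.
    """
    m = max(max(scores), sink)          # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink - m)
    return [e / denom for e in exps]
```

With a very negative sink logit the result approaches an ordinary softmax; with a large sink logit almost all mass disappears into the sink.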

b6121 (08 Aug 22:36, commit e54d41b)
gguf-py : add Numpy MXFP4 de/quantization support (#15111)

* gguf-py : add MXFP4 de/quantization support

* ggml-quants : handle zero amax for MXFP4
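
MXFP4 stores blocks of 32 FP4 (E2M1) elements that share one power-of-two scale encoded as an E8M0 exponent byte. A pure-Python dequantization sketch of that layout, based on the OCP Microscaling format description rather than on gguf-py's actual implementation:

```python
# The 16 representable E2M1 (FP4) values: 1 sign bit, 2 exponent
# bits, 1 mantissa bit.
FP4_VALUES = [ 0.0,  0.5,  1.0,  1.5,  2.0,  3.0,  4.0,  6.0,
              -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]


def dequantize_mxfp4_block(scale_e8m0: int, nibbles: list) -> list:
    """Decode one MXFP4 block: 32 4-bit codes plus a shared scale.

    The scale byte is a bare exponent with bias 127, so the block
    scale is 2**(scale_e8m0 - 127).
    """
    assert len(nibbles) == 32
    scale = 2.0 ** (scale_e8m0 - 127)
    return [FP4_VALUES[n & 0xF] * scale for n in nibbles]
```

For example, with scale byte 128 (factor 2.0) the code `0b0111` (6.0) decodes to 12.0. The "zero amax" fix above concerns the quantization direction: when a block's maximum absolute value is zero, the scale computation must not divide by it.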

b6119 (08 Aug 12:54, commit cd6983d)
ggml : fix field name when creating a new ggml_backend (#14944)

b6118 (08 Aug 10:07, commit 6c7e9a5)
vendor: sync minja (#15161)

* vendor: sync minja

* Update minja.hpp

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>

b6117 (08 Aug 06:35, commit 1425f58)
CUDA: attention sinks for mma FlashAttention (#15157)

b6116 (08 Aug 05:06, commit aaa3d07)
opencl: support sink in `soft_max` (attn sinks) (#15152)

b6115 (07 Aug 21:40, commit 50aa938)
convert : support non-mxfp4 HF model (#15153)

* convert : support non-mxfp4 HF model

* rm redundant check

* disable debug check

b6114 (07 Aug 21:04, commit c4f5356)
vulkan: support fattn sinks (#15126)