Releases · ggml-org/llama.cpp
b6124
kleidiai: fix unsigned overflow bug (#15150)
* kleidiai: fix unsigned overflow bug
* address review comments
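For context, a minimal C++ sketch of this class of bug (illustrative only, not the actual KleidiAI code): unsigned subtraction wraps around instead of going negative.

```cpp
#include <cstddef>
#include <cstdio>

// Unsigned arithmetic wraps modulo 2^N instead of going negative: when
// b > a, (a - b) yields a value near SIZE_MAX, which then blows up any
// loop bound or buffer size computed from it.
size_t remaining_bad (size_t a, size_t b) { return a - b; }
size_t remaining_safe(size_t a, size_t b) { return a > b ? a - b : 0; }

int main() {
    std::printf("bad : %zu\n", remaining_bad (2, 3)); // 18446744073709551615 on 64-bit
    std::printf("safe: %zu\n", remaining_safe(2, 3)); // 0
}
```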
b6123
cuda: refactored ssm_scan and use CUB (#13291)
* cuda: refactored ssm_scan to use CUB
* fixed compilation error when not using CUB
* assign L to constant and use size_t instead of int
* deduplicated functions
* change min blocks per mp to 1
* Use cub load and store warp transpose
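A minimal CUDA sketch of the CUB warp-transpose load/store pattern the last two items refer to: each warp reads a contiguous tile with coalesced accesses, then exchanges it so every thread holds consecutive elements in registers. The kernel body and names are illustrative stand-ins, not the actual ssm_scan implementation.

```cuda
#include <cub/cub.cuh>

template <int THREADS, int ITEMS>
__global__ void scale_tiles(const float * in, float * out, float factor) {
    using Load  = cub::BlockLoad <float, THREADS, ITEMS, cub::BLOCK_LOAD_WARP_TRANSPOSE>;
    using Store = cub::BlockStore<float, THREADS, ITEMS, cub::BLOCK_STORE_WARP_TRANSPOSE>;
    __shared__ union {
        typename Load::TempStorage  load;
        typename Store::TempStorage store;
    } tmp;

    const int base = blockIdx.x * THREADS * ITEMS; // assumes full tiles for brevity
    float items[ITEMS];
    Load(tmp.load).Load(in + base, items); // coalesced per-warp reads, transposed into registers
    __syncthreads();                       // tmp is reused by the store below

    for (int i = 0; i < ITEMS; ++i) {
        items[i] *= factor;                // stand-in for the real per-element work
    }
    Store(tmp.store).Store(out + base, items);
}

// e.g. scale_tiles<256, 4><<<n / (256 * 4), 256>>>(d_in, d_out, 2.0f);
```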
b6122
CUDA: add attention sinks for tile and wmma (#15178)
* CUDA: add attention sinks for tile and wmma
* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
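An attention sink is a per-head logit that joins the softmax normalization without contributing a value row, so attention mass routed to it is simply absorbed. A minimal C++ sketch of the idea (illustrative only, not the CUDA kernel code):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Softmax where a "sink" logit joins the normalization but has no value
// row: the remaining weights are no longer forced to sum to 1.
std::vector<float> softmax_with_sink(const std::vector<float> & scores, float sink) {
    float m = sink;
    for (float s : scores) m = std::max(m, s);

    std::vector<float> w(scores.size());
    float denom = std::exp(sink - m);          // the sink's share of the partition sum
    for (size_t i = 0; i < scores.size(); ++i) {
        w[i] = std::exp(scores[i] - m);
        denom += w[i];
    }
    for (float & v : w) v /= denom;
    return w;
}
```

In fused attention kernels the sink is presumably folded into the running max and denominator the same way FlashAttention-style kernels track them, rather than materialized as above.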
b6121
gguf-py : add Numpy MXFP4 de/quantization support (#15111)
* gguf-py : add MXFP4 de/quantization support
* ggml-quants : handle zero amax for MXFP4
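MXFP4 stores blocks of 32 FP4 (E2M1) values under one shared E8M0 power-of-two scale. A minimal C++ dequantization sketch under those assumptions; the function name and the low-nibble-first packing order are illustrative, not the gguf-py API:

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// E2M1 (FP4) code points; the high bit of each 4-bit code is the sign.
static const std::array<float, 16> kFp4 = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// Dequantize one MXFP4 block: 32 FP4 codes packed into 16 bytes plus one
// shared E8M0 scale, decoded as 2^(e8m0 - 127). E8M0's NaN code (0xFF)
// is ignored here for brevity; nibble order is an assumption.
void dequant_mxfp4_block(const uint8_t packed[16], uint8_t e8m0, float out[32]) {
    const float scale = std::ldexp(1.0f, int(e8m0) - 127);
    for (int i = 0; i < 16; ++i) {
        out[2*i + 0] = kFp4[packed[i] & 0x0F] * scale;
        out[2*i + 1] = kFp4[packed[i] >> 4]   * scale;
    }
}
```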
b6119
ggml : fix field name when creating a new ggml_backend (#14944)
b6118
vendor: sync minja (#15161)
* vendor: sync minja
* Update minja.hpp
* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>
b6117
CUDA: attention sinks for mma FlashAttention (#15157)
b6116
opencl: support sink in `soft_max` (attn sinks) (#15152)
b6115
convert : support non-mxfp4 HF model (#15153)
* convert : support non-mxfp4 HF model
* rm redundant check
* disable debug check
b6114
vulkan: support fattn sinks (#15126)