Releases · ggml-org/llama.cpp
b6132
chat : hotfix gpt-oss jinja raising an exception (#15243)
b6131
server : allow specifying reasoning_format in HTTP request (#15238)
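The PR exposes the reasoning-format choice per request rather than only via the server-wide startup flag. A minimal usage sketch, assuming the OpenAI-compatible `/v1/chat/completions` endpoint on the default port and that the field takes the same values as the `--reasoning-format` CLI flag (e.g. `none`); the field name comes from the PR title, everything else here is an assumption:

```python
# Sketch only: per-request reasoning_format override against llama-server's
# OpenAI-compatible endpoint. The endpoint path, port, and the "none" value
# are assumptions (mirroring the --reasoning-format CLI flag), not from the PR.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "reasoning_format": "none",  # override the server-wide default for this request
}

req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```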
b6129
kv-cache : fix seq_rm with seq_id == -1 (#15226)
* cont : iterate over streams
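In `llama.h`, passing a negative sequence id to the sequence-removal call is documented to match every sequence, and negative `p0`/`p1` make the position range open-ended; the fix makes the kv-cache honor this for `seq_id == -1`. A toy Python model of that convention, with hypothetical names and no relation to the real cell bookkeeping:

```python
# Toy model of the convention: seq_id == -1 matches every sequence, and a
# negative p0/p1 makes that end of the [p0, p1) range open-ended. Hypothetical
# names; this mirrors the documented llama.h behaviour, not the kv-cache code.
def seq_rm(cells, seq_id, p0, p1):
    """cells: list of (seq_id, pos) pairs; returns the cells that survive."""
    lo = 0 if p0 < 0 else p0
    hi = float("inf") if p1 < 0 else p1
    return [(s, pos) for (s, pos) in cells
            if not ((seq_id < 0 or s == seq_id) and lo <= pos < hi)]

cells = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 5)]
print(seq_rm(cells, seq_id=1, p0=0, p1=-1))   # removes only sequence 1
print(seq_rm(cells, seq_id=-1, p0=0, p1=-1))  # removes everything -> []
```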
b6128
kv-cache : log (debug) all streams in find_slot (#15176)
This commit updates `llama_kv_cache_unified::find_slot` to log information for all streams when debug is enabled. The motivation for this change is that currently, if a non-unified kv-cache is used, only one stream is logged because the code uses `seq_to_stream[1]`.
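A small illustrative sketch of the behavioral change: debug logging now walks every stream instead of reading a single hard-coded entry. The names below are hypothetical, not llama.cpp internals:

```python
# Illustrative only: iterate over all kv-cache streams when printing debug info
# instead of indexing one of them (the old code effectively looked only at
# seq_to_stream[1]). The names streams/used/size are hypothetical.
def log_all_streams(streams, debug=True):
    if not debug:
        return
    for i, stream in enumerate(streams):
        print(f"stream {i}: used cells = {stream['used']} / {stream['size']}")

log_all_streams([{"used": 12, "size": 4096}, {"used": 7, "size": 4096}])
```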
b6124
kleidiai: fix unsigned overflow bug (#15150)
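For context on the bug class (not the KleidiAI code itself): unsigned arithmetic wraps around instead of going negative, which NumPy's fixed-width integers can demonstrate, since native Python ints never overflow:

```python
# Generic demonstration of the bug class, unrelated to the actual KleidiAI code:
# subtracting a larger unsigned value from a smaller one wraps modulo 2**32
# instead of producing a negative number.
import numpy as np

a = np.array([2], dtype=np.uint32)
b = np.array([5], dtype=np.uint32)
print(a - b)                   # [4294967293] -- 2 - 5 wrapped around
print(int(a[0]) - int(b[0]))   # -3 -- the value the arithmetic presumably intended
```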
b6123
cuda: refactored ssm_scan and use CUB (#13291)
* cuda: refactored ssm_scan to use CUB
* fixed compilation error when not using CUB
* assign L to constant and use size_t instead of int
* deduplicated functions
* change min blocks per mp to 1
* use CUB load and store warp transpose
* suppress clang warning
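The refactor concerns how the CUDA kernel loads and stores data (via CUB block primitives); the operation itself is a first-order linear recurrence scanned over time. A simplified NumPy reference of that recurrence, with the parameterization reduced relative to ggml's actual `SSM_SCAN` op:

```python
# Simplified NumPy reference for the recurrence an SSM scan evaluates:
#     h[t] = a[t] * h[t-1] + b[t] * x[t]
# This is a sequential stand-in for ggml's SSM_SCAN op with the real
# parameterization stripped down; the CUDA kernel above parallelizes the
# surrounding loads/stores with CUB, but the scanned recurrence has this shape.
import numpy as np

def ssm_scan_ref(a, b, x, h0=0.0):
    h = h0
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a[t] * h + b[t] * x[t]
        out[t] = h
    return out

rng = np.random.default_rng(0)
T = 8
print(ssm_scan_ref(rng.uniform(0.9, 1.0, T), rng.normal(size=T), rng.normal(size=T)))
```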
b6122
CUDA: add attention sinks for tile and wmma (#15178)
* Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma
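Attention sinks add a per-head logit that takes part in the softmax normalization without contributing a value vector, which is what the tile and wmma attention kernels now account for. A conceptual NumPy sketch of softmax-with-sink, not the CUDA code:

```python
# Conceptual sketch of an attention sink: a per-head logit that joins the
# softmax normalization but carries no value vector, so it only absorbs
# probability mass. This mirrors the idea behind the kernels; it is not the
# tile/wmma CUDA code.
import numpy as np

def softmax_with_sink(logits, sink_logit):
    m = np.maximum(logits.max(axis=-1, keepdims=True), sink_logit)  # stable max incl. sink
    e = np.exp(logits - m)
    denom = e.sum(axis=-1, keepdims=True) + np.exp(sink_logit - m)
    return e / denom  # rows sum to < 1; the remainder went to the sink

scores = np.array([[2.0, 1.0, 0.5]])
print(softmax_with_sink(scores, sink_logit=1.5))
```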
b6121
gguf-py : add Numpy MXFP4 de/quantization support (#15111)
* gguf-py : add MXFP4 de/quantization support
* ggml-quants : handle zero amax for MXFP4
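MXFP4 packs blocks of 32 FP4 (E2M1) values that share one power-of-two scale. A hedged NumPy sketch of block de/quantization using the OCP E2M1 code table; the exact GGUF nibble layout and scale encoding in gguf-py are not reproduced, and the all-zero-block fallback illustrates the kind of edge case the "zero amax" bullet refers to:

```python
# Hedged sketch of MXFP4: blocks of 32 FP4 (E2M1) codes sharing one power-of-two
# scale. The E2M1 table is the OCP MX definition; the exact nibble packing and
# scale encoding used by gguf-py are not reproduced here.
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                 -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0], dtype=np.float32)

def dequant_block(codes, scale_exp):
    """codes: 32 ints in [0, 15]; scale_exp: shared block exponent."""
    return E2M1[np.asarray(codes)] * np.float32(2.0) ** scale_exp

def quant_block(values):
    """Derive a shared exponent from amax, then round to the nearest E2M1 code.
    An all-zero block (amax == 0) falls back to exponent 0 -- the edge case the
    'zero amax' bullet above is about."""
    values = np.asarray(values, dtype=np.float32)
    amax = np.abs(values).max()
    scale_exp = 0 if amax == 0 else int(np.floor(np.log2(amax))) - 2  # map amax into the E2M1 range
    scaled = values / np.float32(2.0) ** scale_exp
    codes = np.abs(scaled[:, None] - E2M1[None, :]).argmin(axis=1)
    return codes, scale_exp

codes, e = quant_block(np.linspace(-1.0, 1.0, 32))
print(np.round(dequant_block(codes, e), 3))
```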
b6119
ggml : fix field name when creating a new ggml_backend (#14944)
b6118
vendor: sync minja (#15161)
* Update minja.hpp
* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>