Releases · ggml-org/llama.cpp
b5054
sync: minja (#12739)
* sync: minja https://github.com/google/minja/pull/57
* fix json include
b5053
kv-cache : simplify + fix warning for recurrent models (#12756) ggml-ci
b5052
ci: add Linux cross-compile build (#12428)
b5050
gguf-split : --merge now respects --dry-run option (#12681)
* gguf-split now respects dry-run option
* removing trailing space
b5049
sycl: allow ggml-sycl configuration and compilation using Visual Stud…
b5046
vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (…
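As a rough illustration of the hybrid approach named in the title, the sketch below polls vkGetFenceStatus for a short window and only then falls back to a blocking vkWaitForFences call. The helper name and the spin budget are assumptions for illustration, not llama.cpp's actual Vulkan backend code.

```cpp
// Hypothetical sketch of a hybrid fence wait: spin on vkGetFenceStatus for a short
// window (low latency when the GPU finishes quickly), then fall back to the blocking
// vkWaitForFences call. Only the two Vulkan API calls are real; the rest is illustrative.
#include <vulkan/vulkan.h>
#include <chrono>
#include <cstdint>

static VkResult hybrid_wait_for_fence(VkDevice device, VkFence fence) {
    using clock = std::chrono::steady_clock;
    // Assumed polling budget before giving up and blocking.
    const auto spin_deadline = clock::now() + std::chrono::microseconds(100);

    // Poll: vkGetFenceStatus returns VK_SUCCESS once the fence is signaled,
    // VK_NOT_READY while it is still pending.
    while (clock::now() < spin_deadline) {
        VkResult r = vkGetFenceStatus(device, fence);
        if (r != VK_NOT_READY) {
            return r; // signaled (VK_SUCCESS) or a device error
        }
    }

    // Fall back to a blocking wait with an effectively unbounded timeout.
    return vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
}
```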
b5045
vulkan: set cmake minimum and project name in vulkan-shaders (#12744)
b5043
CUDA: Prefer vector flash decoding kernel for Gemma models (#12738)
* Prefer vector flash decoding kernel for Gemma models. The vector flash decoding kernel was not being picked for models with head dimension 256, and Gemma models are in this category. Removing this limit improves end-to-end performance by up to 12% in generation-phase throughput for Gemma models.
* Update ggml/src/ggml-cuda/fattn.cu
Co-authored-by: Johannes Gäßler <[email protected]>
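A minimal sketch of the kernel-selection heuristic this entry describes, assuming a hypothetical pick_fattn_kernel helper and enum; the real dispatch logic lives in ggml/src/ggml-cuda/fattn.cu and differs in detail.

```cpp
// Hypothetical sketch: prefer the vector flash-attention decode kernel up to head
// dimension 256, so Gemma-style models (head_dim == 256) take the vector path during
// single-token decode. Names here are illustrative, not llama.cpp's internals.
#include <cstdio>

enum class fattn_kernel { vector, tile };

static fattn_kernel pick_fattn_kernel(int head_dim, int batch_tokens) {
    const bool is_decode = batch_tokens == 1;
    // Per the commit message, the vector path previously excluded head_dim == 256;
    // the change lifts that limit (assumed condition shown here).
    if (is_decode && head_dim <= 256) {
        return fattn_kernel::vector;
    }
    return fattn_kernel::tile;
}

int main() {
    // Gemma uses head dimension 256, so decode now selects the vector kernel.
    std::printf("head_dim=256 decode -> %s\n",
                pick_fattn_kernel(256, 1) == fattn_kernel::vector ? "vector" : "tile");
    return 0;
}
```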
b5041
vulkan: Fix missing cmake logic for dot product extension (#12721)
b5039
sync : minja (inclusionAI/Ling) and update tests (#12699)
Signed-off-by: Xiaodong Ye <[email protected]>