Releases · ngxson/llama.cpp
b4936
model : do not repack if a GPU device is present (#12498) ggml-ci
b4935
chore : cleanup llama_model_loader::TENSOR_ usage (#12492)
b4934
llama-tts : avoid crashes related to bad model file paths (#12482)
b4933
[SYCL] Fix build on Windows when ccache enabled (#9954) (#9976)
* Takes effect only on Windows and forces ccache's compiler type to icl.
Co-authored-by: Romain Biessy <[email protected]>
b4932
sycl: cleanup oneDNN related code (#12097)
b4930
llama : make Qwen2MoE QKV bias optional (#12477)
b4929
ggml : block interleaving support for Q4_K quantization for x86 AVX2 …
b4927
context : clear sets containing encoder output sequence ids before st…
b4926
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
- Find the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API and use that value to determine the optimal parallel_blocks value.
- Prefer the vector flash attention kernels over the MMA kernel for BS=1.
Fixes issue #12182
Co-authored-by: Johannes Gäßler <[email protected]>
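For illustration, here is a minimal CUDA sketch of the occupancy query this release note describes. The kernel, block size, and the way parallel_blocks is derived below are placeholder assumptions for demonstration only, not the actual llama.cpp flash-attention code:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a flash-decoding kernel; the real llama.cpp
// kernels and their launch parameters differ.
__global__ void flash_decode_vec_kernel(const float *q, const float *kv,
                                        float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = q[i] * kv[i];  // placeholder work
    }
}

int main() {
    const int block_size   = 128;  // threads per block (assumed)
    const int smem_per_blk = 0;    // dynamic shared memory per block (assumed)

    // Ask the runtime how many blocks of this kernel can be resident on one
    // SM at once, given its register and shared-memory footprint.
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, flash_decode_vec_kernel, block_size, smem_per_blk);

    int n_sm = 0;
    cudaDeviceGetAttribute(&n_sm, cudaDevAttrMultiProcessorCount, 0);

    // Scale the block count to the whole device, mirroring the idea of
    // deriving parallel_blocks from measured occupancy rather than a
    // hard-coded constant.
    int parallel_blocks = max_blocks_per_sm * n_sm;
    printf("blocks/SM = %d, SMs = %d, parallel_blocks = %d\n",
           max_blocks_per_sm, n_sm, parallel_blocks);
    return 0;
}
```

Deriving the block count this way keeps every SM busy for small-batch decoding without oversubscribing registers or shared memory.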
b4925
vulkan: optimize iq1 coopmat2 dequant functions (#12427)