Releases: ngxson/llama.cpp

b4936

21 Mar 15:02
af04481
model : do not repack if a GPU device is present (#12498)

ggml-ci

b4935

21 Mar 10:08
960e726
chore : cleanup llama_model_loader::TENSOR_ usage (#12492)

b4934

21 Mar 10:04
ea1518e
llama-tts : avoid crashes related to bad model file paths (#12482)

b4933

21 Mar 07:55
1aa87ee
[SYCL] Fix build on Windows when ccache is enabled (#9954) (#9976)

* [SYCL] Fix build on Windows when ccache is enabled (#9954)

* Takes effect only on Windows and forces the compiler to icl

---------

Co-authored-by: Romain Biessy <[email protected]>

b4932

21 Mar 03:07
9ffcc9e
sycl: cleanup oneDNN related code (#12097)

b4930

20 Mar 12:47
dbb3a47
llama : make Qwen2MoE QKV bias optional (#12477)

b4929

20 Mar 12:18
3d82dbc
ggml : block interleaving support for Q4_K quantization for x86 AVX2 …

b4927

19 Mar 21:03
568013d
context : clear sets containing encoder output sequence ids before st…

b4926

19 Mar 20:53
517b5dd
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)

- Find the number of active blocks per SM using the cudaOccupancyMaxActiveBlocksPerMultiprocessor API, and use this value to determine the optimal parallel_blocks value (see the sketch after this entry).
- Prefer the vector flash attention kernels over the MMA kernel for BS=1.

Fixes issue #12182
---------

Co-authored-by: Johannes Gäßler <[email protected]>
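As a rough illustration of the occupancy-based tuning described above, here is a minimal sketch, not llama.cpp's actual code: the kernel name flash_decode_kernel, the block size, the shared-memory figure, and the final parallel_blocks heuristic are all assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; the real flash-decoding kernels in
// llama.cpp are templated and considerably more involved.
__global__ void flash_decode_kernel(const float* q, const float* k,
                                    const float* v, float* out) {
    // ... attention math elided ...
}

int main() {
    int device = 0;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);

    const int block_size = 128;        // threads per block (assumption)
    const size_t dyn_smem = 16 * 1024; // dynamic shared memory per block (assumption)

    // Ask the runtime how many blocks of this kernel can be resident on one
    // SM, given the kernel's register and shared-memory footprint.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, flash_decode_kernel, block_size, dyn_smem);

    // Scale by the SM count to estimate how many blocks keep the whole GPU
    // busy, and use that to pick the parallel_blocks split. Illustrative
    // heuristic only; the real code also weighs the workload shape.
    int max_resident_blocks = blocks_per_sm * prop.multiProcessorCount;
    int parallel_blocks = max_resident_blocks;

    printf("blocks/SM: %d, SMs: %d, parallel_blocks: %d\n",
           blocks_per_sm, prop.multiProcessorCount, parallel_blocks);
    return 0;
}
```

The point of the API call is that register and shared-memory usage, not just thread count, limit how many blocks fit on an SM, so querying the runtime gives a better parallelism target than a fixed constant.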

b4925

19 Mar 19:46
a9b5928
vulkan: optimize iq1 coopmat2 dequant functions (#12427)