
Releases: ngxson/llama.cpp

b6334

31 Aug 14:39
9777032
llama : separate compute buffer reserve from fattn check (#15696)

Exposes ggml_backend_sched_split_graph() so the graph can be split without allocating compute buffers, and uses it to split the graph for the automatic Flash Attention check.

b6332

31 Aug 08:43
bbbf5ec
vulkan: handle large sizes for get_rows (#15686)

b6331

31 Aug 07:41
c37052a
vulkan: mul_mat_id coopmat2 optimizations (#15546)

* vulkan: mul_mat_id coopmat2 optimizations

Add a path for when the tile fits in BN/2, similar to what we have for mul_mat.

Only call fetch_scales/store_scales once per QUANT_K block, and once at the
beginning in case start_k is not aligned.

* Also add a path for BN/4, worth a couple more percent of performance

b6330

31 Aug 07:37
5c16b9c
vulkan : remove unused portability_enumeration_ext variable (#15679)

This commit removes the portability_enumeration_ext variable from the
ggml_vk_instance_portability_enumeration_ext_available function as it
is initialized to false but never modified, making it redundant.

b6329

31 Aug 07:27
b97c9ed
vulkan: Allow fallback to sysmem memory when vidmem is full (#15649)

* vulkan: Allow fallback to sysmem memory when vidmem is full

* vulkan: Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK
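Since this is an opt-in environment variable, it is set before launching the binary; a minimal sketch of the usage (the binary name and model path below are placeholders, not from the release notes):

```shell
# Allow the Vulkan backend to fall back to system memory (sysmem)
# when device-local video memory (vidmem) is exhausted:
export GGML_VK_ALLOW_SYSMEM_FALLBACK=1
echo "$GGML_VK_ALLOW_SYSMEM_FALLBACK"

# Then run as usual, e.g. (placeholders for illustration):
#   ./llama-cli -m ./models/model.gguf -p "Hello"
```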

b6328

31 Aug 07:04
94e82c7
vulkan: clamp matmul and FA results to the max finite value (#15652)

* vulkan: clamp matmul and FA results to the max finite value

* only clamp for fp16
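Clamping to the largest finite fp16 value (65504) keeps intermediate matmul and flash-attention results from overflowing to infinity when stored in half precision. A minimal NumPy sketch of the idea (illustrative only, not the actual shader code; `clamp_fp16` is a hypothetical helper):

```python
import numpy as np

# Largest finite value representable in IEEE 754 binary16: 65504.0
FP16_MAX = float(np.finfo(np.float16).max)

def clamp_fp16(x: np.ndarray) -> np.ndarray:
    # Clamp to the finite fp16 range *before* the cast, so values that
    # would overflow become +/-65504 instead of +/-inf.
    return np.clip(x, -FP16_MAX, FP16_MAX).astype(np.float16)

acc = np.array([1e9, -1e9, 3.5], dtype=np.float32)  # fp32 accumulator values
print(clamp_fp16(acc))  # all results are finite
```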

b6327

30 Aug 16:25
4d74393
ggml: update kleidiai to v1.13.0 (#15663)

b6325

30 Aug 15:04
e81b8e4
llama: use FA + max. GPU layers by default (#15434)

* llama: use max. GPU layers by default, auto -fa

* ggml-backend: abort instead of segfault

b6324

30 Aug 14:48
38ad381
CUDA: use FP32 arithmetic for conv2d (#15683)

b6323

30 Aug 09:34
696fccf
vulkan: Skip syncing for prealloc_y when it is reused (#15544)