Releases · ngxson/llama.cpp
b6334
llama : separate compute buffer reserve from fattn check (#15696)
Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers, and uses it to split the graph for the automatic Flash Attention check.
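A minimal sketch of how a caller might use the exposed ggml_backend_sched_split_graph() to split a graph for inspection without reserving compute buffers. The exact signature and the surrounding helper are assumptions modeled on the other ggml_backend_sched_* functions, not the actual llama.cpp code.

```c
// Hedged sketch only: split the graph across backends without allocating
// compute buffers, so a cheap capability check (e.g. for Flash Attention)
// can run before the real graph is reserved.
#include "ggml-backend.h"

static void check_fattn_support(ggml_backend_sched_t sched, struct ggml_cgraph * gf) {
    // Assumed signature: splits the graph, does NOT reserve compute buffers.
    ggml_backend_sched_split_graph(sched, gf);

    // ... walk the resulting splits to decide whether Flash Attention is
    // supported on the assigned backends, then reset the scheduler and
    // reserve compute buffers for the graph that will actually run ...
}
```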
b6332
vulkan: handle large sizes for get_rows (#15686)
b6331
vulkan: mul_mat_id coopmat2 optimizations (#15546)
* Add a path for when the tile fits in BN/2, similar to what we have for mul_mat. Only call fetch_scales/store_scales once per QUANT_K block, and once at the beginning in case start_k is not aligned.
* Also add a path for BN/4, worth a couple more percent.
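A rough C-style sketch of the scale-fetch hoisting described in the first bullet. The real change is in the Vulkan coopmat2 shader, so QUANT_K, fetch_scales() and the loop shape here are illustrative stand-ins, not the actual GLSL.

```c
// Illustrative only: call fetch_scales() once per QUANT_K block (plus once up
// front when start_k is not block-aligned) instead of on every k iteration.
#define QUANT_K 256                      // placeholder block size for illustration

static void fetch_scales(int k) { (void) k; /* load scales for the block containing k */ }

static void mul_mat_tile(int start_k, int end_k) {
    if (start_k % QUANT_K != 0) {
        fetch_scales(start_k);           // unaligned prefix: fetch once at the start
    }
    for (int k = start_k; k < end_k; ++k) {
        if (k % QUANT_K == 0) {
            fetch_scales(k);             // once per QUANT_K block
        }
        // ... multiply-accumulate for this k slice ...
    }
}
```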
b6330
vulkan : remove unused portability_enumeration_ext variable (#15679)
This commit removes the portability_enumeration_ext variable from the ggml_vk_instance_portability_enumeration_ext_available function, as it is initialized to false but never modified, making it redundant.
b6329
vulkan: Allow fallback to sysmem memory when vidmem is full (#15649)
* Allow falling back to system memory (sysmem) when video memory (vidmem) is full
* Add env var GGML_VK_ALLOW_SYSMEM_FALLBACK
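A hedged C sketch of the fallback pattern: try the device-local allocation first and fall back to host-visible memory when it fails. Only the GGML_VK_ALLOW_SYSMEM_FALLBACK name comes from the release note; the allocator helpers and the way the env var gates the fallback are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical allocators standing in for the Vulkan memory types.
static void * alloc_vidmem(size_t size) { (void) size; return NULL; } // device-local
static void * alloc_sysmem(size_t size) { return malloc(size); }      // host-visible

static bool sysmem_fallback_allowed(void) {
    // Assumption: the env var opts in to the fallback when set to a non-"0" value.
    const char * v = getenv("GGML_VK_ALLOW_SYSMEM_FALLBACK");
    return v != NULL && strcmp(v, "0") != 0;
}

static void * alloc_buffer(size_t size) {
    void * buf = alloc_vidmem(size);
    if (buf == NULL && sysmem_fallback_allowed()) {
        buf = alloc_sysmem(size); // slower, but avoids failing the allocation outright
    }
    return buf;
}
```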
b6328
vulkan: clamp matmul and FA results to the max finite value (#15652)
* Clamp matmul and Flash Attention (FA) results to the max finite value
* Only clamp for fp16
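A small C sketch of the clamping idea: values are clamped to the largest finite fp16 value (65504) so fp16 outputs cannot overflow to infinity. The actual clamp lives in the Vulkan matmul and Flash Attention shaders; this is only an illustration of the operation.

```c
#include <math.h>

// Illustrative only; the real clamp is applied in the Vulkan shaders, and
// only when results are stored as fp16.
#define FP16_MAX 65504.0f  // largest finite IEEE-754 half-precision value

static float clamp_to_fp16_finite(float x) {
    return fminf(fmaxf(x, -FP16_MAX), FP16_MAX);
}
```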
b6327
ggml: update kleidiai to v1.13.0 (#15663)
b6325
llama: use FA + max. GPU layers by default (#15434)
* llama: use the maximum number of GPU layers by default, and make Flash Attention (-fa) automatic
* ggml-backend: abort instead of segfaulting
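A hedged sketch of what the new defaults roughly correspond to in terms of the old explicit settings. The llama_model_default_params()/n_gpu_layers names come from the public llama.h API, but treat the exact fields and values as assumptions rather than the code this release touches.

```c
#include "llama.h"

int main(void) {
    // Previously, offloading everything required explicit opt-in like this;
    // with this release a comparable behaviour is the default.
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 999; // "max. GPU layers": offload as many layers as fit

    // Flash Attention is now chosen automatically (the old boolean -fa flag
    // becomes an "auto" mode), so no explicit context flag is shown here.
    (void) mparams;
    return 0;
}
```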
b6324
CUDA: use FP32 arithmetic for conv2d (#15683)
b6323
vulkan: Skip syncing for prealloc_y when it is reused (#15544)