Releases: ngxson/llama.cpp
b6278
vulkan: Remove splitting for mul_mat_id (#15568)
row_ids only needs to hold the BN rows for the current tile.
b6277
CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (#15451)
* CUDA: optimize get_int_from_table_16
* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs
* revise documentation
Co-authored-by: xix
Co-authored-by: Johannes Gäßler
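To illustrate the idea behind a byte-permute table lookup, here is a minimal host-side sketch in plain C++. It is not the code from the PR: `byte_perm` is a software emulation of the documented behaviour of CUDA's `__byte_perm(x, y, s)` (each selector nibble's low 3 bits pick one of the 8 bytes of `y:x`), and `lookup4` is a hypothetical helper that looks up four 4-bit indices in a 16-entry byte table with three permutes.

```cpp
#include <cstdint>

// Software emulation of CUDA's __byte_perm(x, y, s): the 8 bytes of y:x form
// a pool, and the low 3 bits of each of the 4 selector nibbles pick one byte.
static uint32_t byte_perm(uint32_t x, uint32_t y, uint32_t s) {
    const uint64_t pool = (uint64_t(y) << 32) | x;
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t sel = (s >> (4 * i)) & 0x7;
        const uint32_t b   = uint32_t(pool >> (8 * sel)) & 0xFF;
        r |= b << (8 * i);
    }
    return r;
}

// Hypothetical helper (not from the PR): look up four 4-bit indices, packed
// one per nibble in the low 16 bits of idx4, in a 16-entry byte table.
static uint32_t lookup4(const uint8_t table[16], uint32_t idx4) {
    // Pack the table into four 32-bit words, lowest entry in the lowest byte.
    uint32_t t[4];
    for (int k = 0; k < 4; ++k) {
        t[k] = uint32_t(table[4 * k])
             | uint32_t(table[4 * k + 1]) << 8
             | uint32_t(table[4 * k + 2]) << 16
             | uint32_t(table[4 * k + 3]) << 24;
    }
    // The low 3 bits of each index select within entries 0..7 and 8..15.
    const uint32_t lo = byte_perm(t[0], t[1], idx4);
    const uint32_t hi = byte_perm(t[2], t[3], idx4);
    // Bit 3 of each index decides whether byte i comes from lo or from hi.
    uint32_t sel = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t high = (idx4 >> (4 * i + 3)) & 1;
        sel |= (uint32_t(i) + 4 * high) << (4 * i);
    }
    return byte_perm(lo, hi, sel);
}
```

On a GPU the three permutes would map to hardware byte-permute instructions, which is presumably where the speedup over per-element indexing comes from. The loops here only stand in for that behaviour on the host.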
b6276
opencl: fix support ops condition for `rms_norm` (#15560)
b6275
vulkan: fix min subgroup 16 condition for mmid subgroup optimization …
b6269
batched-bench : fix unified KV cache handling + pp timing (#15562)
* batched-bench : fix unified KV cache handling + pp timing
* cont : run dummy token only with split KV cache
b6267
metal : add FA kernels for HS=40 (#15559) ggml-ci
b6265
CANN: ROPE cache sin/cos repeat (#15501)
Signed-off-by: noemotiovon
b6264
vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices…
b6262
vulkan: Support FA with any multiple of 8 head sizes (#15537)
The scalar FA shader already handled multiples of 8. The coopmat1 FA shader assumed 16x16x16 and the shared memory allocations need the HSK dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation requires multiples of 16 for N and K, and needs the matrix dimensions padded and loads clamped.
Store the FA pipelines in a map, indexed by the pipeline state.
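The two ideas in this note, padding a head size up to a multiple of 16 and caching pipelines in a map keyed by their state, can be sketched roughly as follows. This is only an illustration under assumptions: `round_up`, `FaPipelineState`, `Pipeline`, and `PipelineCache` are hypothetical names, not the identifiers used in the Vulkan backend, and the key fields are a guess at what a pipeline state might contain.

```cpp
#include <cstdint>
#include <map>
#include <tuple>

// Round n up to the next multiple of `multiple` (multiple must be > 0).
constexpr uint32_t round_up(uint32_t n, uint32_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

// Hypothetical pipeline state; the real key fields are not given in the note.
struct FaPipelineState {
    uint32_t hsk;      // K head size
    uint32_t hsv;      // V head size
    bool     f32_acc;  // accumulate in fp32
    bool operator<(const FaPipelineState & o) const {
        return std::tie(hsk, hsv, f32_acc) < std::tie(o.hsk, o.hsv, o.f32_acc);
    }
};

// Stand-in for a compiled pipeline; only records the padded dimension.
struct Pipeline {
    uint32_t padded_hsk;
};

// Creates a pipeline the first time a state is requested, then reuses it.
class PipelineCache {
public:
    const Pipeline & get(const FaPipelineState & s) {
        auto it = pipelines_.find(s);
        if (it == pipelines_.end()) {
            it = pipelines_.emplace(s, Pipeline{round_up(s.hsk, 16)}).first;
            ++created_;
        }
        return it->second;
    }
    int created() const { return created_; }
private:
    std::map<FaPipelineState, Pipeline> pipelines_;
    int created_ = 0;
};
```

With this sketch, a head size of 40 would be padded to 48, and repeated requests for the same state return the cached entry instead of building a new pipeline.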
b6261
vulkan: enable Conv2D for Apple after MoltenVK fixed the bug (#15526)