Releases · ggml-org/llama.cpp
b5038
fix MUSA compiler warning (#12704)
* fix MUSA compiler warning
* replace (void) with GGML_UNUSED
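For context, a minimal sketch of the pattern the second commit describes: replacing a bare `(void)` cast with the `GGML_UNUSED` macro to silence unused-parameter warnings. The macro definition and the callback below are illustrative assumptions, not the actual MUSA backend code.

```cpp
// Sketch: silence an unused-parameter warning with a named macro instead of a
// bare (void) cast. ggml provides GGML_UNUSED for this; the definition here is
// an illustrative assumption.
#include <cstdio>

#define GGML_UNUSED(x) (void)(x)

static void example_callback(void * user_data, int status) {
    GGML_UNUSED(user_data); // required by the callback signature, not used in this path
    std::printf("status = %d\n", status);
}

int main() {
    example_callback(nullptr, 0);
}
```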
b5037
CANN: Support operator SIN COS ARGMAX (#12709)
* [CANN] support sin cos argmax
* [CANN] codestyle adjustment
* [CANN] remove redundant code
Signed-off-by: noemotiovon <[email protected]>
Co-authored-by: noemotiovon <[email protected]>
b5036
Simplify and improve CUDA graphs through use of indirect copy pointer…
b5035
CANN: Fix failed test cases (#12708)
* CANN: fix memory waste in aclnn_tensor
* CANN: fix backend ops fail
* CANN: fix acl_tensor memory alloc
* CANN: format
* CANN: remove trailing whitespace
b5034
opencl: use `max_alloc_size` in backend ctx instead of querying again…
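A hedged sketch of the idea behind this change: query `CL_DEVICE_MAX_MEM_ALLOC_SIZE` once when the backend context is created and reuse the cached value, instead of calling `clGetDeviceInfo` again on every allocation. The struct and function names below are illustrative, not the real llama.cpp OpenCL backend API.

```cpp
// Sketch: cache the device's max allocation size in the backend context at
// init time, then consult the cached value on the hot path.
#include <CL/cl.h>
#include <cstddef>

struct backend_ctx {
    cl_device_id device;
    cl_ulong     max_alloc_size; // cached once at init
};

static bool backend_ctx_init(backend_ctx & ctx, cl_device_id device) {
    ctx.device = device;
    cl_int err = clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                                 sizeof(ctx.max_alloc_size), &ctx.max_alloc_size, nullptr);
    return err == CL_SUCCESS;
}

static bool can_allocate(const backend_ctx & ctx, size_t nbytes) {
    // no repeated clGetDeviceInfo call here - use the cached value
    return nbytes <= ctx.max_alloc_size;
}
```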
b5033
vulkan: Implement split_k for coopmat2 flash attention (#12627)
When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.
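As a rough illustration of what split_k buys here (not the coopmat2 shader itself): each split processes a slice of the KV cache and emits a partial result plus softmax statistics, and a final pass merges the partials. The one-dimensional toy below shows only the merge math; the names and shapes are assumptions.

```cpp
// Sketch: merge per-split partial attention results. Each split carries its
// local max logit m, the sum of exp(logit - m) l, and a partial weighted sum o.
#include <cmath>
#include <cstdio>
#include <vector>

struct partial {
    float m; // max logit seen in this split
    float l; // sum of exp(logit - m) within this split
    float o; // partial weighted sum of values (1-dim head for simplicity)
};

static float combine_splits(const std::vector<partial> & parts) {
    float m = -INFINITY;
    for (const auto & p : parts) m = std::max(m, p.m);

    float l = 0.0f, o = 0.0f;
    for (const auto & p : parts) {
        const float scale = std::exp(p.m - m); // rescale each split to the global max
        l += p.l * scale;
        o += p.o * scale;
    }
    return o / l;
}

int main() {
    // two splits of a toy attention row; more splits -> more workgroups in flight
    std::vector<partial> parts = { {0.5f, 1.8f, 0.9f}, {1.2f, 2.1f, 1.7f} };
    std::printf("combined output: %f\n", combine_splits(parts));
}
```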
b5032
cmake: remove caching from vulkan coopmat checks (#12719)
b5031
vulkan: Implement grouped query attention in the coopmat2 FA shader (…
b5030
Vulkan: Fix mmq int dot float cache size (#12722)
b5029
model : print tensor size during load (#12711)
* model : print tensor size during load
* cont : fix units MB -> MiB
Co-authored-by: Diego Devesa <[email protected]>
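A small illustration of the unit distinction fixed in the follow-up commit: MiB is 1024*1024 bytes, while MB is 10^6 bytes. The tensor size below is made up for the example and is not taken from the actual loader output.

```cpp
// Sketch: report a byte count in MiB (binary) versus MB (decimal).
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t nbytes = 262144000; // e.g. a 4096 x 32000 fp16 tensor (hypothetical)
    std::printf("size = %8.2f MiB (%8.2f MB)\n",
                nbytes / 1024.0 / 1024.0,   // 250.00 MiB
                nbytes / 1000.0 / 1000.0);  // 262.14 MB
}
```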