Releases: ngxson/llama.cpp
b4948
llama-vocab : add SuperBPE pre-tokenizer (#12532)
b4947
CUDA: Fix clang warnings (#12540) Signed-off-by: Xiaodong Ye
b4946
mmap : skip resource limit checks on AIX (#12541)
b4945
vulkan: fix mul_mat_vec failure in backend tests (#12529) The OOB calculation could be wrong if the last iteration was during one of the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple new backend tests that hit this failure on NVIDIA GPUs.
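The kind of failure described above, where a bounds calculation goes wrong because the final iteration lands inside an unrolled loop, can be illustrated with a small sketch. This is not the shader code from the PR: `dot_unrolled` and its `unroll` parameter are hypothetical names, and the code only shows the general pattern of limiting the unrolled part to a multiple of the unroll factor and finishing with a scalar tail.

```python
def dot_unrolled(a, b, unroll=4):
    """Dot product with an unrolled main loop and a scalar tail (illustrative only)."""
    n = len(a)
    assert len(b) == n
    total = 0.0

    # Largest multiple of `unroll` that does not exceed n. The unrolled body
    # must stop here; otherwise its last iteration would index past the end.
    main_end = (n // unroll) * unroll

    i = 0
    while i < main_end:
        # Stands in for `unroll` copies of the loop body written out by hand.
        for j in range(unroll):
            total += a[i + j] * b[i + j]
        i += unroll

    # Scalar tail: handles the 0..unroll-1 leftover elements one at a time.
    while i < n:
        total += a[i] * b[i]
        i += 1
    return total
```

If `main_end` were computed as `n` rounded up instead of down, the last unrolled iteration would read out of bounds whenever `n` is not a multiple of `unroll`.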
b4944
server : Add verbose output to OAI compatible chat endpoint. (#12246) Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
b4942
llama : gemma3 : use output tensor if it exists in model weight (#12506)
* also add to the llm_tensor_names
b4941
ggml : fix quantized cpy op (#12310)
* tests : add cpy tests for all types
* tests : add BF16 copy tests
* tests : fix loop for same-type copy
* tests : add option to permute the dst tensor
b4940
musa: refine compute capability (#12493)
* Address review comments
Signed-off-by: Xiaodong Ye
b4939
vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
* vulkan: Optimize mul_mat_vec p021 and nc shaders.
These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with a large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (tiny dispatches carry too much overhead). Using subgroupAdd in the p021 shader also helps, so it is used conditionally.
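The idea of reusing loads from the A matrix across a grouped-query-attention group can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not the Vulkan shader: `gqa_mat_vec`, `k_heads`, `q_heads`, and `group_size` are hypothetical names, A is assumed to be stored per KV head, and each group of `group_size` consecutive query heads is assumed to share one KV head.

```python
def gqa_mat_vec(k_heads, q_heads, group_size):
    """Multiply each KV head's matrix by every query vector in its group.

    k_heads[kv]  : list of rows (the shared A matrix for KV head `kv`)
    q_heads[hq]  : query vector for query head `hq`
    Returns (out, row_loads), where out[hq][r] is row r of A dotted with
    q_heads[hq], and row_loads counts how many times a row of A was fetched.
    """
    n_kv = len(k_heads)
    assert len(q_heads) == n_kv * group_size

    out = [[0.0] * len(k_heads[hq // group_size]) for hq in range(len(q_heads))]
    row_loads = 0

    for kv in range(n_kv):
        for r, row in enumerate(k_heads[kv]):
            row_loads += 1  # the row is fetched once ...
            for g in range(group_size):
                hq = kv * group_size + g
                # ... and reused for every query head in the group.
                out[hq][r] = sum(a * q for a, q in zip(row, q_heads[hq]))
    return out, row_loads
```

Processing one query head at a time would fetch each row `group_size` times, so the grouped loop cuts the number of A-matrix row loads by that factor in this sketch.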
b4938
Vulkan: RTE rounding for cpy to quant (#12480)
* remove trailing whitespace
* avoid duplicating pipeline_cpy_f32_quant
* fix copypasting issue
* remove duplicated code
Co-authored-by: Jeff Bolz
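Assuming RTE here means round-to-nearest-even, the sketch below shows how that mode differs from round-toward-zero when a float is narrowed to half precision. It is only an illustration of the two rounding modes: the release note does not say which conversion inside the copy-to-quant path is affected, and `f16_rte` and `f16_rtz` are hypothetical helper names. Python's `struct` format `'e'` packs IEEE 754 binary16 using round-half-to-even.

```python
import struct

def f16_rte(x):
    """Round x to the nearest float16 value, ties to even."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def f16_rtz(x):
    """Round x to float16 toward zero (finite, in-range inputs only)."""
    bits = struct.unpack('<H', struct.pack('<e', x))[0]
    y = struct.unpack('<e', struct.pack('<H', bits))[0]
    if abs(y) > abs(x):
        # Nearest-even rounded away from zero, so step back one ULP.
        # y is nonzero here, so the decrement never touches the sign bit.
        bits -= 1
        y = struct.unpack('<e', struct.pack('<H', bits))[0]
    return y

ulp = 2.0 ** -10            # spacing of float16 values in [1, 2)
x = 1.0 + 0.75 * ulp        # lies between 1.0 and 1.0 + ulp
print(f16_rte(x), f16_rtz(x))
```

For this `x`, nearest-even picks the upper neighbour `1.0 + ulp` while toward-zero picks `1.0`, which is the sort of one-ULP disagreement that can make two backends produce slightly different quantized values.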