Releases: ngxson/llama.cpp

b4948

24 Mar 11:51
00d5380
llama-vocab : add SuperBPE pre-tokenizer (#12532)

b4947

24 Mar 11:26
7ea7503
CUDA: Fix clang warnings (#12540)

Signed-off-by: Xiaodong Ye <[email protected]>

b4946

24 Mar 11:25
c54f6b7
mmap : skip resource limit checks on AIX (#12541)

b4945

24 Mar 07:51
9b169a4
vulkan: fix mul_mat_vec failure in backend tests (#12529)

The out-of-bounds (OOB) calculation could be wrong if the last iteration fell
within one of the unrolled loops. Adjust the unrolling counts to avoid this. Add
a couple of new backend tests that hit this failure on NVIDIA GPUs.
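
As a rough illustration of why unrolling and bounds handling interact, here is a minimal Python sketch of a dot product with an unrolled main loop and a separately handled tail. It is not the Vulkan shader code from this change; the function name `dot_unrolled` and the `unroll` parameter are hypothetical and chosen only for this example.

```python
def dot_unrolled(a, b, unroll=4):
    """Dot product with a manually unrolled main loop and a bounds-safe tail.

    Hypothetical illustration only; not the mul_mat_vec shader.
    """
    assert len(a) == len(b)
    n = len(a)
    total = 0.0
    i = 0

    # The main loop may only run while a full group of `unroll` elements is
    # in bounds. If this limit were computed incorrectly, the last unrolled
    # group could read past the end of the inputs.
    full = n - (n % unroll)
    while i < full:
        for j in range(unroll):
            total += a[i + j] * b[i + j]
        i += unroll

    # Tail loop: handles the remaining n % unroll elements one at a time.
    while i < n:
        total += a[i] * b[i]
        i += 1

    return total
```

The key point of the sketch is that the element count need not be a multiple of the unroll factor, so the loop limits have to account for the case where the final elements fall inside an unrolled group.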

b4944

23 Mar 19:13
77f9c6b
server : Add verbose output to OAI compatible chat endpoint. (#12246)

Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.

b4942

22 Mar 23:14
fbdfefe
llama : gemma3 : use output tensor if it exists in model weight (#12506)

* llama : gemma3 : use output tensor if it exists in model weight

* also add to the llm_tensor_names

b4941

22 Mar 15:10
ba932df
ggml : fix quantized cpy op (#12310)

* ggml : fix quantized cpy op

ggml-ci

* tests : add cpy tests for all types

ggml-ci

* tests : add BF16 copy tests

ggml-ci

* tests : fix loop for same-type copy

ggml-ci

* tests : add option to permute the dst tensor

ggml-ci

b4940

22 Mar 10:03
fac63a3
musa: refine compute capability (#12493)

* musa: refine compute capability

Signed-off-by: Xiaodong Ye <[email protected]>

* Address review comments

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>

b4939

22 Mar 09:30
eddfb43
vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)

* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, so it is used conditionally.
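
The load-reuse idea for grouped query attention can be sketched in plain Python. This is an assumption-laden illustration, not the shader itself: `scores_naive` and `scores_grouped` are hypothetical names, `k_rows` stands in for rows of the shared matrix, and `q_heads` stands in for the query heads of one group. Both functions count how many times a row is loaded.

```python
def scores_naive(k_rows, q_heads):
    """Each query head re-reads every row. Returns (scores, row_loads)."""
    loads = 0
    scores = []
    for q in q_heads:
        head_scores = []
        for row in k_rows:
            loads += 1  # the row is loaded again for every head
            head_scores.append(sum(x * y for x, y in zip(row, q)))
        scores.append(head_scores)
    return scores, loads


def scores_grouped(k_rows, q_heads):
    """Each row is loaded once and reused for all heads in the group."""
    loads = 0
    scores = [[] for _ in q_heads]
    for row in k_rows:
        loads += 1  # single load, shared by the whole group
        for h, q in enumerate(q_heads):
            scores[h].append(sum(x * y for x, y in zip(row, q)))
    return scores, loads
```

With a group of four query heads sharing three rows, the naive version performs twelve row loads and the grouped version three, while producing the same scores.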

b4938

21 Mar 20:23
4375415
Vulkan: RTE rounding for cpy to quant (#12480)

* Vulkan: RTE rounding for cpy to quant

Co-Authored-By: Jeff Bolz <[email protected]>

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

---------

Co-authored-by: Jeff Bolz <[email protected]>
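
For readers unfamiliar with the term, RTE is the usual abbreviation for round-to-nearest-even. The sketch below shows how it differs from truncation when narrowing a value to half precision. It assumes that the rounding in question applies to float conversions, which the commit title does not spell out. The helper names are hypothetical, and the code relies on Python's `struct` `'e'` format, which packs IEEE 754 binary16 values.

```python
import math
import struct


def to_f16_rte(x):
    """Round a Python float to the nearest half-precision value, ties to even."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_f16_truncate_unit(x):
    """Truncate toward zero to half precision, for x in [1, 2) only.

    In [1, 2) half-precision values are spaced 2**-10 apart, so truncation
    is a floor on that grid. Hypothetical helper for comparison only.
    """
    assert 1.0 <= x < 2.0
    return math.floor(x * 1024) / 1024


# Exactly halfway between two adjacent half-precision values:
halfway = 1 + 3 * 2**-11
rte = to_f16_rte(halfway)
trunc = to_f16_truncate_unit(halfway)
```

Here `halfway` lies exactly between 1 + 2**-10 and 1 + 2**-9. Round-to-nearest-even picks the neighbour with the even mantissa, while truncation always picks the lower one, so the two modes can produce different stored values for the same input.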