Releases · ngxson/llama.cpp

24 Mar 11:51

00d5380

b4948

llama-vocab : add SuperBPE pre-tokenizer (#12532)

24 Mar 11:26

github-actions

b4947

7ea7503

CUDA: Fix clang warnings (#12540)

Signed-off-by: Xiaodong Ye

24 Mar 11:25

github-actions

b4946

c54f6b7

mmap : skip resource limit checks on AIX (#12541)

24 Mar 07:51

github-actions

b4945

9b169a4

vulkan: fix mul_mat_vec failure in backend tests (#12529)

The out-of-bounds (OOB) calculation could be wrong if the last iteration fell
inside one of the unrolled loops. Adjust the unrolling counts to avoid this, and
add a couple of new backend tests that hit this failure on NVIDIA GPUs.
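
The general pattern behind this kind of fix can be sketched in Python. This is an illustration only, not the Vulkan shader code from the PR: the function name `dot_unrolled` and the `unroll` parameter are hypothetical. The idea is that the unrolled body may only run for iterations that are entirely in bounds, and any remainder is handled by a separate, bounds-checked tail.

```python
def dot_unrolled(a, b, unroll=4):
    """Dot product with an unrolled main loop and a bounds-checked tail.

    Hypothetical illustration; not the shader code from the PR.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("vectors must have the same length")

    # Number of unrolled iterations that are guaranteed to stay in bounds.
    full_iters = n // unroll

    acc = 0.0
    i = 0
    for _ in range(full_iters):
        # Unrolled body: every index i + k is known to be < n here,
        # so no per-element bounds check is needed.
        for k in range(unroll):
            acc += a[i + k] * b[i + k]
        i += unroll

    # Tail: fewer than `unroll` elements remain, so check each index.
    while i < n:
        acc += a[i] * b[i]
        i += 1
    return acc
```

If the unroll count does not divide the length and the tail is mishandled, the last unrolled iteration reads past the end of the data, which is the class of error the commit message describes.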

23 Mar 19:13

github-actions

b4944

77f9c6b

server : Add verbose output to OAI compatible chat endpoint. (#12246)

Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it consistent with server_task_result_cmpl_final::to_json_oaicompat_chat and with the other to_json methods.

22 Mar 23:14

github-actions

b4942

fbdfefe

llama : gemma3 : use output tensor if it exists in model weight (#12506)

* llama : gemma3 : use output tensor if it exists in model weight

* also add to the llm_tensor_names

22 Mar 15:10

github-actions

b4941

ba932df

ggml : fix quantized cpy op (#12310)

* ggml : fix quantized cpy op

* tests : add cpy tests for all types

* tests : add BF16 copy tests

* tests : fix loop for same-type copy

* tests : add option to permute the dst tensor

22 Mar 10:03

github-actions

b4940

fac63a3

musa: refine compute capability (#12493)

* musa: refine compute capability

* Address review comments

Signed-off-by: Xiaodong Ye

22 Mar 09:30

github-actions

b4939

eddfb43

vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)

* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, so it is used conditionally.
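
The load-reuse idea for grouped query attention can be sketched in Python. This is a minimal illustration under stated assumptions, not the p021 shader itself: the function name `gqa_scores` and the data layout (one list of key rows per KV head, one list of query vectors for the heads sharing it) are hypothetical. Each row of the A matrix is read once and then reused for every query head in the group, rather than being reloaded per head.

```python
def gqa_scores(k_rows, q_group):
    """Attention scores for one KV head shared by a group of query heads.

    Hypothetical layout, for illustration only:
      k_rows  -- list of key vectors, one per cached token (the "A" matrix)
      q_group -- list of query vectors, one per head in the group
    Returns scores[h][t] = dot(q_group[h], k_rows[t]).
    """
    scores = [[0.0] * len(k_rows) for _ in q_group]
    for t, k in enumerate(k_rows):
        # The key row `k` is loaded once per token...
        for h, q in enumerate(q_group):
            # ...and reused for every query head in the group.
            scores[h][t] = sum(ki * qi for ki, qi in zip(k, q))
    return scores
```

With a group of G query heads, this ordering reads each key row once instead of G times, which is the saving the commit message attributes to grouped query attention.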

21 Mar 20:23

github-actions

b4938

4375415

Vulkan: RTE rounding for cpy to quant (#12480)

* Vulkan: RTE rounding for cpy to quant

* remove trailing whitespace

* avoid duplicating pipeline_cpy_f32_quant

* fix copypasting issue

* remove duplicated code

Co-authored-by: Jeff Bolz
