Releases: Xarbirus/llama.cpp

b4513

20 Jan 10:34
ef6dada

cont : fix whitespaces (#11305)

b4493

16 Jan 18:26
9c8dcef

CUDA: backwards pass for misc. ops, add tests (#11257)

* CUDA: backwards pass for misc. ops, add tests

* remove restrict from pointers

b4393

27 Dec 00:20
d79d8f3

vulkan: multi-row k quants (#10846)

* multi-row k quant shaders

* better row selection

* more row choices

* readjust row selection

* rm_kq=2 by default (the multi-row idea is sketched below)
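
A minimal CPU-side sketch of the multi-row idea, assuming rm_kq = 2 rows per workgroup as in the commit above: each workgroup accumulates several output rows at once, so every loaded element of the input vector is reused across rows instead of being re-fetched. The actual change is in the Vulkan GLSL shaders; everything here except the name rm_kq is illustrative.

```cpp
#include <cstddef>
#include <vector>

// Rows computed together per "workgroup"; 2 by default per the commit.
constexpr std::size_t rm_kq = 2;

// Illustrative mat-vec y = A * x: each outer step handles rm_kq rows,
// so x[c] is loaded once and reused for every row in the group.
// (Tail rows when nrows % rm_kq != 0 are omitted for brevity.)
void mul_mat_vec_multirow(const std::vector<std::vector<float>> & A,
                          const std::vector<float> & x,
                          std::vector<float> & y) {
    const std::size_t nrows = A.size();
    for (std::size_t r0 = 0; r0 + rm_kq <= nrows; r0 += rm_kq) {
        float acc[rm_kq] = {};
        for (std::size_t c = 0; c < x.size(); ++c) {
            const float xc = x[c];           // load x once ...
            for (std::size_t r = 0; r < rm_kq; ++r) {
                acc[r] += A[r0 + r][c] * xc; // ... reuse it for every row
            }
        }
        for (std::size_t r = 0; r < rm_kq; ++r) {
            y[r0 + r] = acc[r];
        }
    }
}
```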

b4320

16 Dec 23:52
64ae065

vulkan: small mul_mat_vec optimizations (#10665)

* double the number of rows per workgroup

* Update ggml-vulkan.cpp

* Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats

* only increase the number of rows for AMD and subgroup size 64

* fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested

* use subgroup min and max to check for GCN (requires https://github.com/ggerganov/llama.cpp/pull/10721); the query is sketched below

* manual merge ggml-vulkan.cpp

* set min and max subgroup size in any case

* Also double the number of rows for Intel GPUs
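
A hedged sketch of what such a subgroup-based GCN check could look like on the host side, using the VK_EXT_subgroup_size_control properties query: GCN hardware exposes a single fixed wave size of 64, so the reported minimum and maximum subgroup sizes coincide at 64, whereas RDNA typically reports a 32..64 range. The function name is hypothetical; the real logic lives in ggml-vulkan.cpp.

```cpp
#include <vulkan/vulkan.h>

// Hypothetical helper: detect GCN-style hardware by its fixed wave size.
bool is_gcn_like(VkPhysicalDevice dev) {
    // Chain the subgroup-size-control properties into the properties2 query.
    VkPhysicalDeviceSubgroupSizeControlPropertiesEXT subgroup_props = {};
    subgroup_props.sType =
        VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_PROPERTIES_EXT;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &subgroup_props;

    vkGetPhysicalDeviceProperties2(dev, &props2);

    // GCN: one fixed subgroup (wave) size of 64, so min == max == 64.
    return subgroup_props.minSubgroupSize == 64 &&
           subgroup_props.maxSubgroupSize == 64;
}
```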

b4240

02 Dec 22:45
64ed209

server: Add "tokens per second" information in the backend (#10548)

* add cmake RVV support

* add timings (the tokens-per-second arithmetic is sketched below)

* remove space

* update readme

* fix

* fix code

* remove empty line

* add test

Co-authored-by: Xuan Son Nguyen <[email protected]>
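
The tokens-per-second arithmetic behind these timings is simple; a minimal sketch, with a struct whose field names mirror the server's timings JSON (prompt_n, prompt_ms, predicted_n, predicted_ms) but which is otherwise illustrative:

```cpp
#include <cstdio>

// Illustrative timings record; the server reports similar fields in JSON.
struct timings {
    int    prompt_n;     // tokens in the prompt
    double prompt_ms;    // time spent on prompt processing
    int    predicted_n;  // tokens generated
    double predicted_ms; // time spent generating
};

int main() {
    const timings t = {32, 250.0, 128, 3200.0};

    // tokens per second = tokens / (milliseconds / 1000)
    const double prompt_tps    = 1e3 * t.prompt_n    / t.prompt_ms;
    const double predicted_tps = 1e3 * t.predicted_n / t.predicted_ms;

    printf("prompt:    %.2f tokens/s\n", prompt_tps);    // 128.00 tokens/s
    printf("predicted: %.2f tokens/s\n", predicted_tps); // 40.00 tokens/s
    return 0;
}
```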

b4061

09 Nov 17:33
6423c65

metal : reorder write loop in mul mat kernel + style (#10231)

* metal : reorder write loop

* metal : int -> short, style

ggml-ci

b3969

23 Oct 20:35
190a37d

sync : ggml

b3917

14 Oct 14:22
a89f75e

server : handle "logprobs" field with false value (#9871)

Co-authored-by: Gimling <[email protected]>
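
A hedged sketch of the fix, assuming nlohmann::json (which llama.cpp vendors) and a hypothetical helper name: OpenAI-style clients may send "logprobs": false rather than omitting the field, so a boolean false must be treated like an absent field instead of failing to parse as a number.

```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Hypothetical helper: how many logprobs the request asks for.
int parse_n_logprobs(const json & body) {
    if (!body.contains("logprobs") || body["logprobs"].is_null()) {
        return 0; // field absent or null: no logprobs requested
    }
    const json & v = body["logprobs"];
    if (v.is_boolean()) {
        // "logprobs": false must behave like an absent field, not an error.
        return v.get<bool>() ? 1 : 0;
    }
    return v.get<int>(); // numeric value: top-N logprobs
}
```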

b3810

23 Sep 19:53
1d48e98

readme : add programmable prompt engine language CLI (#9599)

b3767

16 Sep 08:44
5c3d0f1

ggml : IQ4_NL sgemm + Q4_0 AVX optimization (#9422)

* squashed

re-add my iq4_nl sgemm PR https://github.com/ggerganov/llama.cpp/pull/8049

have ggml_vec_dot_q4_0 do two blocks per loop for AVX (the unrolling idea is sketched below)

try out f16c ggml_vec_dot_iq4_nl, but it's not really faster. As per https://github.com/ggerganov/llama.cpp/pull/8549 we can calculate several blocks at a time with no issue

* shuffle

* remove f16c iq4_nl as I can't make it faster than before
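
A hedged illustration of the two-blocks-per-loop idea, shown on plain floats rather than Q4_0's quantized blocks: two independent accumulators per iteration keep more FMAs in flight and hide instruction latency. The real ggml_vec_dot_q4_0 additionally dequantizes 4-bit blocks on the fly; this sketch only demonstrates the unrolling.

```cpp
#include <immintrin.h>
#include <cstddef>

// Dot product processing two 8-wide "blocks" per loop iteration,
// with one independent accumulator per block to avoid a serial
// dependency chain on the FMA result.
float dot_two_blocks_per_loop(const float * x, const float * y, std::size_t n) {
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();

    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) { // two blocks of 8 floats per iteration
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(x + i),
                               _mm256_loadu_ps(y + i),     acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(x + i + 8),
                               _mm256_loadu_ps(y + i + 8), acc1);
    }

    // Combine the accumulators and reduce horizontally.
    __m256 acc = _mm256_add_ps(acc0, acc1);
    __m128 lo  = _mm256_castps256_ps128(acc);
    __m128 hi  = _mm256_extractf128_ps(acc, 1);
    __m128 sum = _mm_add_ps(lo, hi);
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);

    float result = _mm_cvtss_f32(sum);
    for (; i < n; ++i) { // scalar tail
        result += x[i] * y[i];
    }
    return result;
}
```

Compile with -mavx2 -mfma (or equivalent); the split into acc0/acc1 is the same latency-hiding trick the commit applies to quantized blocks.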