
Releases: bssrdf/llama.cpp

b6445

11 Sep 01:40
00681df


CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance (#…

b6419

08 Sep 16:11
8802156


chat : Deepseek V3.1 reasoning and tool calling support (OpenAI Style…

b6356

02 Sep 14:00
9961d24


CANN: Resolve soft_max precision issue (#15730)

Previously, the slope tensor was set to fp16 to improve efficiency.
While this worked correctly in flash attention (FA), it caused precision issues in soft_max.
This change applies different data types for different operators
to balance both accuracy and performance.

b6351

02 Sep 01:52
5d804a4


ggml-backend: raise GGML_MAX_SPLIT_INPUTS (#15722)

b6286

26 Aug 13:08
b3964c1


metal : optimize FA vec for large sequences and BS <= 8 (#15566)

* metal : optimize FA vec for large heads and sequences

* metal : adjust small-batch mul mv kernels

ggml-ci

* batched-bench : fix total speed computation

ggml-ci

* cont : add comments

ggml-ci

b4681

10 Feb 14:00
d7b31a9


sync: minja (https://github.com/google/minja/commit/a72057e5190de2c61…

b4524

22 Jan 03:48
6171c9d


Add Jinja template support (#11016)

* Copy minja from https://github.com/google/minja/commit/58f0ca6dd74bcbfbd4e71229736640322b31c7f9

* Add --jinja and --chat-template-file flags

* Add missing <optional> include

* Avoid print in get_hf_chat_template.py

* No designated initializers yet

* Try and work around msvc++ non-macro max resolution quirk

* Update test_chat_completion.py

* Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template

* Refactor test-chat-template

* Test templates w/ minja

* Fix deprecation

* Add --jinja to llama-run

* Update common_chat_format_example to use minja template wrapper

* Test chat_template in e2e test

* Update utils.py

* Update test_chat_completion.py

* Update run.cpp

* Update arg.cpp

* Refactor common_chat_* functions to accept minja template + use_jinja option

* Attempt to fix linkage of LLAMA_CHATML_TEMPLATE

* Revert LLAMA_CHATML_TEMPLATE refactor

* Normalize newlines in test-chat-templates for windows tests

* Forward decl minja::chat_template to avoid eager json dep

* Flush stdout in chat template before potential crash

* Fix copy elision warning

* Rm unused optional include

* Add missing optional include to server.cpp

* Disable jinja test that has a cryptic windows failure

* minja: fix vigogne (https://github.com/google/minja/pull/22)

* Apply suggestions from code review

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

* Finish suggested renamings

* Move chat_templates inside server_context + remove mutex

* Update --chat-template-file w/ recent change to --chat-template

* Refactor chat template validation

* Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)

* Warn against missing eos / bos tokens when jinja template references them

* rename: common_chat_template[s]

* reinstate assert on chat_templates.template_default

* Update minja to https://github.com/google/minja/commit/b8437df626ac6cd0ce3b333b3c74ed1129c19f25

* Update minja to https://github.com/google/minja/pull/25

* Update minja from https://github.com/google/minja/pull/27

* rm unused optional header

---------

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

b2703

21 Apr 12:58
89b0bf0


llava : use logger in llava-cli (#6797)

This change removes printf() logging so that llava-cli's stdout carries only model output, making it shell-scriptable.

b2699

20 Apr 01:21
0e4802b


ci: add ubuntu latest release and fix missing build number (mac & ubu…

b2251

23 Feb 23:23
fd43d66


server : add KV cache quantization options (#5684)