Releases · ochafik/llama.cpp
b6710
b6250
test-opt: allow slight imprecision (#15503)
b6115
convert : support non-mxfp4 HF model (#15153)
* convert : support non-mxfp4 HF model
* rm redundant check
* disable debug check
b6104
opencl: add `swiglu_oai` and `add_id` (#15121)
* opencl: add `swiglu_oai`
* opencl: add `add_id`
* opencl: add missing `add_id.cl`
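For context on what the newly ported kernel computes, here is a minimal scalar sketch of `swiglu_oai`, assuming the gpt-oss formulation introduced in #15091 (sigmoid slope alpha ≈ 1.702, clamp limit 7.0, and a `+1` bias on the linear half). Function and parameter names are illustrative; the real op runs fused over whole tensors in each backend, not element by element.

```cpp
// Scalar sketch of the swiglu_oai activation (assumed from the gpt-oss
// reference: alpha ~= 1.702, clamp limit = 7.0). Names are illustrative.
#include <algorithm>
#include <cmath>
#include <cstdio>

float swiglu_oai(float x_glu, float x_lin, float alpha = 1.702f, float limit = 7.0f) {
    x_glu = std::min(x_glu, limit);            // gate half: clamped from above only
    x_lin = std::clamp(x_lin, -limit, limit);  // linear half: symmetric clamp
    const float sig = 1.0f / (1.0f + std::exp(-alpha * x_glu));
    return x_glu * sig * (x_lin + 1.0f);       // swish(gate) * (linear + 1)
}

int main() {
    std::printf("%f\n", swiglu_oai(0.5f, 2.0f));
}
```

The clamps and the `+1` bias are what distinguish this from plain swiglu, which is why the release adds it as a separate fused op rather than reusing the existing one.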
b6096
llama : add gpt-oss (#15091)
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
  * llama : add attn sinks
  * ggml : add attn sinks
  * cuda : add attn sinks
  * vulkan : add support for sinks in softmax, remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
  * Update ggml/src/ggml-cpu/ops.cpp
  * update CUDA impl
  * cont : metal impl
  * add vulkan impl
  * test-backend-ops : more test cases, clean up
  * llama : remove unfused impl
  * remove extra lines
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
  * ggml : use e8m0 conversion instead of powf
  * change kvalues_mxfp4 table to match e2m1 (#6)
  * metal : remove quantization for now (not used)
  * cuda : fix disabled CUDA graphs due to ffn moe bias
  * vulkan : add support for mxfp4
  * cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
  * add cuda impl
  * llama : add weight support check for add_id
  * perf opt
  * add vulkan impl
  * rename cuda files
  * add metal impl
  * allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
* cleanup
* sycl : fix supports_op for MXFP4
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
* fix hip build
* cuda : add mxfp4 dequantization support for cuBLAS
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

Co-authored-by: Xuan Son Nguyen <[email protected]>
Co-authored-by: slaren <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
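Since mxfp4 recurs throughout this release, a rough standalone sketch of the format may help: MXFP4 packs blocks of 32 e2m1 (4-bit) values sharing a single e8m0 scale, which is what the `kvalues_mxfp4` table and the e8m0 conversion above refer to. The value table follows e2m1; the nibble packing and block struct here are illustrative, not ggml's exact `block_mxfp4` layout.

```cpp
// Illustrative MXFP4 dequantization: 32 e2m1 nibbles share one e8m0 scale.
// Packing order is an assumption; only the e2m1 table and e8m0 scale follow
// the format itself.
#include <cmath>
#include <cstdint>
#include <cstdio>

// Signed e2m1 code points (the idea behind ggml's kvalues_mxfp4 table).
static const float kE2M1[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// e8m0 is a bare biased exponent: scale = 2^(e - 127), no sign or mantissa
// (the NaN encoding at e = 255 is ignored here).
static float e8m0_to_float(uint8_t e) {
    return std::ldexp(1.0f, (int) e - 127);
}

// Dequantize one block: 16 bytes of packed nibbles + 1 scale byte -> 32 floats.
static void dequant_mxfp4_block(const uint8_t packed[16], uint8_t scale, float out[32]) {
    const float d = e8m0_to_float(scale);
    for (int i = 0; i < 16; ++i) {
        out[2*i + 0] = d * kE2M1[packed[i] & 0x0F];  // low nibble
        out[2*i + 1] = d * kE2M1[packed[i] >>   4];  // high nibble
    }
}

int main() {
    uint8_t packed[16] = {0};
    packed[0] = 0x72;                       // codes 2 (= 1.0) and 7 (= 6.0)
    float out[32];
    dequant_mxfp4_block(packed, 127, out);  // scale = 2^0 = 1
    std::printf("%g %g\n", out[0], out[1]); // 1 6
}
```

The "e8m0 conversion instead of powf" item above reflects the same observation this sketch exploits: the scale is a pure power of two, so an exponent manipulation (here `ldexp`) replaces a general `powf` call.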
b6068
convert : fix Qwen3-Embedding pre-tokenizer hash (#15030)
b5546
parallel : increase the variability of the prompt lengths (#13927)
b5537
llama : add support for jina-reranker-v2 (#13900)
b5500
scripts : add option to compare commits in Debug (#13806)
* scripts : add option to compare commits in Debug
* cont : reuse existing CMAKE_OPTS
b5497
server: fix streaming crashes (#13786)
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content
* tweak order of <|python_tag|> vs <function= parsing for the functionary v3.1 format; still not ideal, but hopefully less prone to crashes
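To illustrate the "preludes on partial regex match" idea: when a streamed chunk ends with what could be the start of a tool-call marker, the safe move is to emit the preceding text as plain content and hold the ambiguous tail back until more tokens arrive. A hypothetical, simplified sketch follows, using plain prefix matching rather than the server's actual regex-based parsers; the marker strings are just the examples named above.

```cpp
// Sketch: withhold a streamed suffix that might be the start of a tool-call
// marker, emitting everything before it as ordinary content.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Length of the longest suffix of `text` that is a *proper* prefix of
// `marker` (a full marker occurrence would be handled by the parser itself).
static size_t partial_marker_len(const std::string & text, const std::string & marker) {
    size_t max_len = std::min(text.size(), marker.size() - 1);
    for (size_t len = max_len; len > 0; --len) {
        if (text.compare(text.size() - len, len, marker, 0, len) == 0) {
            return len;
        }
    }
    return 0;
}

int main() {
    const std::vector<std::string> markers = {"<|python_tag|>", "<function="};
    const std::string chunk = "The answer is 42. <fun";

    size_t held = 0;
    for (const auto & m : markers) {
        held = std::max(held, partial_marker_len(chunk, m));
    }
    // Emit the prelude now; buffer the possible marker start for later.
    std::printf("emit: '%s'\n", chunk.substr(0, chunk.size() - held).c_str());
    std::printf("hold: '%s'\n", chunk.substr(chunk.size() - held).c_str());
}
```

Here `"<fun"` is held back because it could grow into `<function=`; if the next tokens turn out not to complete any marker, the held text is flushed as content instead of crashing the parser.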