Releases · nicoboss/llama.cpp

05 Sep 11:38

a812838

b6390 Latest

gguf: gguf_writer refactor (#15691)

* gguf: split gguf writer into base and buf impl
* gguf: templated gguf write out
* gguf: file based writer (avoid writing everything to memory first!)
* examples(llama2c): fix log not being the same level and compiler nits
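
The split described above (a shared base, an in-memory implementation, and a file-backed implementation driven by one generic write-out routine) can be sketched roughly as follows. This is only an illustration of the idea in Python, not the C++ code from the PR: the class names `BaseWriter`, `BufWriter`, `FileWriter`, the function `write_records`, and the toy length-prefixed record layout are all hypothetical.

```python
import struct


class BaseWriter:
    """Hypothetical base: subclasses only decide where the bytes go."""

    def write(self, data: bytes) -> None:
        raise NotImplementedError

    def write_u64(self, value: int) -> None:
        # Little-endian unsigned 64-bit integer.
        self.write(struct.pack("<Q", value))

    def write_str(self, text: str) -> None:
        # Toy layout: 64-bit length prefix followed by UTF-8 bytes.
        raw = text.encode("utf-8")
        self.write_u64(len(raw))
        self.write(raw)


class BufWriter(BaseWriter):
    """Accumulates everything in memory."""

    def __init__(self) -> None:
        self.buf = bytearray()

    def write(self, data: bytes) -> None:
        self.buf += data


class FileWriter(BaseWriter):
    """Streams straight to an open binary file, so no full copy is held in memory."""

    def __init__(self, fileobj) -> None:
        self.fileobj = fileobj

    def write(self, data: bytes) -> None:
        self.fileobj.write(data)


def write_records(writer: BaseWriter, records) -> None:
    """Generic write-out: the same routine serves either writer."""
    for key, value in records:
        writer.write_str(key)
        writer.write_str(value)
```

Because `write_records` only depends on the base interface, both writers produce byte-identical output; the file-backed one simply avoids building the whole payload in memory first.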

Assets 15

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-09-05T11:38:57Z

llama-b6390-bin-macos-arm64.zip
sha256:5c7716441583546262cde59c2c56c72c3b6fee58a6ae9f69b8900f905f58dc4e
11.1 MB 2025-09-05T11:39:11Z

llama-b6390-bin-macos-x64.zip
sha256:130f18eec5c616e8a6914223730dca4e87f0f3db4ad516c721c3325332a611a1
28.6 MB 2025-09-05T11:39:12Z

llama-b6390-bin-ubuntu-vulkan-x64.zip
sha256:fbf9dd0c95b57d2bcd1323f9931f07e0780d83044d6e843c851dd1c412316adc
25.7 MB 2025-09-05T11:39:14Z

llama-b6390-bin-ubuntu-x64.zip
sha256:2db479ad1402aaa7a76b2c506b50a6d6d03a5bd1d568f5a7c1d0ddea9e6f7eae
13 MB 2025-09-05T11:39:16Z

llama-b6390-bin-win-cpu-arm64.zip
sha256:b5c6a9482752618428aa2ea9be6fd7401d858476efe32f6530af7de679606850
11.3 MB 2025-09-05T11:39:17Z

llama-b6390-bin-win-cpu-x64.zip
sha256:e6d18c8e9591614096f1b7a9683594b8ab37c3a40d220a81099904fc8167c6e8
14.2 MB 2025-09-05T11:39:18Z

llama-b6390-bin-win-cuda-12.4-x64.zip
sha256:da052f686c1ea7ea03c0f2e24ba5e0cf253c716d11e484f4974e15a6519277a5
138 MB 2025-09-05T11:39:20Z

llama-b6390-bin-win-hip-radeon-x64.zip
sha256:42ea5895815bc6a08f38ebfdb8bafac11b00e0cf9bc5e2201995abc1e8aff975
287 MB 2025-09-05T11:39:26Z

llama-b6390-bin-win-opencl-adreno-arm64.zip
sha256:52bd3338f869e77439c939c7086c9d881e353d4a4be5d2fe4f7b72e878a12092
11.7 MB 2025-09-05T11:39:39Z

Source code (zip)
2025-09-05T09:34:28Z

Source code (tar.gz)
2025-09-05T09:34:28Z
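
Each binary asset above is listed with a `sha256:` digest, so a download can be checked before it is unpacked. A minimal sketch using only the Python standard library follows; the helper names `sha256_of_file` and `matches_listed_digest` are hypothetical, and the file is assumed to have already been downloaded to a local path.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_digest(path, listed):
    """Compare a local file against a 'sha256:<hex>' string as shown in the asset list."""
    algo, _, expected = listed.partition(":")
    if algo != "sha256" or not expected:
        raise ValueError("expected a string of the form 'sha256:<hex digest>'")
    return sha256_of_file(path) == expected.strip().lower()
```

For example, `matches_listed_digest("llama-b6390-bin-ubuntu-x64.zip", "sha256:2db479ad1402aaa7a76b2c506b50a6d6d03a5bd1d568f5a7c1d0ddea9e6f7eae")` should return `True` for an intact download of that asset.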

24 Aug 02:46

github-actions

b6259

710dfc4

CUDA: fix half2 -> half conversion for HIP (#15529)

Assets 15

04 Aug 20:20

github-actions

b6087

4161343

cmake: Add GGML_BACKEND_DIR option (#15074)

* cmake: Add GGML_BACKEND_DIR option

This can be used by distributions to specify where to look for backends
when ggml is built with GGML_BACKEND_DL=ON.

* Fix phrasing

Assets 15

22 Jul 19:13

github-actions

b5964

acd6cb1

ggml : model card yaml tab->2xspace (#14819)
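
YAML does not allow tab characters for indentation, which is presumably why the model card output was switched to two spaces. The kind of conversion the title describes can be sketched as below; this is an assumption about the intent, not the code from the PR, and `tabs_to_two_spaces` is a hypothetical helper that only touches leading tabs.

```python
def tabs_to_two_spaces(text: str) -> str:
    """Replace each leading tab on every line with two spaces."""
    out = []
    for line in text.splitlines(keepends=True):
        stripped = line.lstrip("\t")
        n_tabs = len(line) - len(stripped)  # number of leading tabs removed
        out.append("  " * n_tabs + stripped)
    return "".join(out)
```

Tabs that appear after the first non-tab character are left alone, so values containing tabs are not altered.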

Assets 15

09 Jul 21:38

github-actions

b5943

1bf5cf0

Merge branch 'mradermacher' into master

Assets 15

09 Jul 21:03

github-actions

b5856

4a5686d

llama : support Jamba hybrid Transformer-Mamba models (#7531)

* wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba
(and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

* llama : use std::find for seq_nodes in llama_rs_cache

* llama : state checkpoints for recurrent models

* llama : correctly handle more edge cases for the rs cache

* llama : rename many llama_kv_cache_* functions

* llama : remove useless return value for some llama_cache_* functions

* llama : rethink recurrent state cell counts

* llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers
to have 0 KV heads.

* llama : gracefully fail when not finding hybrid slot

* llama : support Jamba

* llama : fix BERT inference without KV cache

* convert-hf : check for unprocessed Jamba experts

* convert-hf : support Mini-Jamba conversion

* llama : fix Jamba quantization sanity checks

* llama : sequence-length-aware batch splitting

* llama : use equal-sequence-length sub-batches for recurrent models

* ggml : simplify SSM-related operators

* llama : make recurrent state slot allocation contiguous

* llama : adapt internal uses of batches to llama_ubatch

* llama : fix batch split output count for embeddings

* llama : minimize swaps when reordering logits

This reduces overhead when running HellaSwag
on thousands of sequences with very small (100k-parameter) Mamba models.

* llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark
with small batch sizes, making it crash.

* llama : avoid copies for simple batch splits

* ggml : make ggml_ssm_scan not modify its source tensors

* llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1'
with a Mamba or Jamba model.

* llama : fix .base() compilation error on Windows

* llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL

* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it,
and this makes Mamba's conv step slightly faster.

* mamba : fix non-contiguous usage of ggml_silu

* llama : session saving and reloading for hybrid models

* convert_hf : fix Jamba conversion

* llama : fix mixed signedness comparison

* llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

* llama : begin renaming llama_past back to llama_kv_cache

* llama : remove implicit recurrent state rollbacks

* llama : partially apply clang-format style

* convert : fix jamba conv1d shape squeezing

* graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs.
This *should* make it easier to handle updating the inputs
when caching the graph (eventually).

* model : add Jamba to Mamba-specific hparams printing

* jamba : remove redundant nullptr initializations

* model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret

* model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret

* model : make falcon-h1 use shared mamba2 layer builder

* memory : avoid referring to KV in recurrent cache logs

* gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

---------

Co-authored-by: Sigbjørn Skjæret

Assets 15

20 Jun 09:20

github-actions

b5716

d27b3ca

ggml : fix repack work size for mul_mat_id (#14292)

ggml-ci

Assets 15

05 Jun 10:37

github-actions

b5594

d01d112

readme : add badge (#13938)

Assets 15

30 May 14:07

github-actions

b5541

07e4351

convert : allow partial update to the chkhsh pre-tokenizer list (#13847)

* convert : allow partial update to the chkhsh pre-tokenizer list

* code style

* update tokenizer out

* rm inp/out files for models not having gguf

* fixed hash for glm

* skip nomic-bert-moe test

* Update convert_hf_to_gguf_update.py

* fix minerva-7b hash

* rm redundant import

Assets 18

07 May 22:54

github-actions

b5307

814f795

docker : disable arm64 and intel images (#13356)

Assets 21
