Releases · struct/llama.cpp

09 Oct 16:21

56b4795

b6721 Latest

model-conversion : add support for SentenceTransformers (#16387)

* model-conversion : add support for SentenceTransformers

This commit adds support for models that use SentenceTransformer layers.

The motivation is that if a converted model includes any of the
numbered layers specified in the original model's repository, these
changes enable that model to be used and verified. Currently,
model-conversion only supports the base model output without any of
the additional transformation layers.

Usage:
Convert the model that also includes the SentenceTransformer layers:
```console
(venv) $ export EMBEDDING_MODEL_PATH=~/google/embeddinggemma-300M
(venv) $ make embedding-convert-model
```

Verify the produced embeddings from the converted model against the
original model embeddings:
```console
(venv) $ make embedding-verify-logits-st
```

The original model can be run using SentenceTransformer:
```console
(venv) $ make embedding-run-original-model-st
```

Run the converted model using "SentenceTransformer" layers, which
enable pooling and normalization:
```console
(venv) $ make embedding-run-converted-model-st
```

* add model-conversion example requirements

* add support for -st flag in embedding model conversion

This commit adds support for the -st flag in the embedding model
conversion script. This enables models to be converted with their
SentenceTransformers dense layers.
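
To give a rough idea of what the extra SentenceTransformer-style layers contribute on top of the base model output, the sketch below applies mean pooling, a dense projection, and L2 normalization to token embeddings in plain Python. The function names, the toy weight matrix, and the choice of mean pooling are assumptions made for illustration; the real layers and their order come from the original model's repository, not from this snippet.

```python
import math

def mean_pool(token_embeddings):
    """Average per-token vectors into one sentence vector (assumed pooling mode)."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) / n_tokens for i in range(dim)]

def dense(vec, weight, bias=None):
    """Apply y = W x (+ b); weight is a list of rows (hypothetical layout)."""
    out = [sum(w * x for w, x in zip(row, vec)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, useful when comparing embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy data: two tokens with two features each, and a 2 -> 3 projection.
tokens = [[1.0, 2.0], [3.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pooled = mean_pool(tokens)
projected = dense(pooled, weight)
embedding = l2_normalize(projected)
```

A verification step in this spirit would compute `cosine_similarity` between the embedding from the original model and the one from the converted model and check that it is close to 1.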

Assets 15

| Asset | SHA-256 | Size | Uploaded |
| --- | --- | --- | --- |
| cudart-llama-bin-win-cuda-12.4-x64.zip | 8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6 | 373 MB | 2025-10-09T16:21:08Z |
| llama-b6721-bin-macos-arm64.zip | 729dc7cb6bda9d01c5890975e08af51659b03fb0ba01ff1f6cf23c1ad2b8fc41 | 10.4 MB | 2025-10-09T16:21:19Z |
| llama-b6721-bin-macos-x64.zip | 54e0f6ffe22c94fbbf63ad4de486e9974f78629a756ea4c0e88f12e5606ba476 | 26.9 MB | 2025-10-09T16:21:20Z |
| llama-b6721-bin-ubuntu-vulkan-x64.zip | 89c90bd8988d8bd867b90336203f50a415ed5be3dcda9de83d8d97754f94bc73 | 25.5 MB | 2025-10-09T16:21:21Z |
| llama-b6721-bin-ubuntu-x64.zip | 1876b159729dda0b0307010d308a51eb892e18892ba0efdf6e0d30224eefa12c | 12.4 MB | 2025-10-09T16:21:23Z |
| llama-b6721-bin-win-cpu-arm64.zip | 22e8e4481cec8f9f81cd657ed0638fb7a0564d028077abd8d869f53bf82ae03d | 10.6 MB | 2025-10-09T16:21:24Z |
| llama-b6721-bin-win-cpu-x64.zip | 210c796aabc5b268a264e3e6ed5278cbed3eee5443e59b483eedd153ac2d59a5 | 13.6 MB | 2025-10-09T16:21:25Z |
| llama-b6721-bin-win-cuda-12.4-x64.zip | 54bd087b4bc6bd690f7b9cf7052fe3013d6295db8ef652aa9124e17fe1cf619b | 149 MB | 2025-10-09T16:21:26Z |
| llama-b6721-bin-win-hip-radeon-x64.zip | 6c2e4e15ba31d482a88971ab31cf37124e5f8bca6479493b332033f955e0c65b | 313 MB | 2025-10-09T16:21:31Z |
| llama-b6721-bin-win-opencl-adreno-arm64.zip | 59215e70863d6ac77ecf7a6fec889eefeceddf0248f88ed97402fef6dbfc6944 | 11 MB | 2025-10-09T16:21:40Z |
| Source code (zip) | | | 2025-10-09T12:35:22Z |
| Source code (tar.gz) | | | 2025-10-09T12:35:22Z |
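
Since each binary asset is published with a SHA-256 digest, a downloaded archive can be checked before use. The following is a minimal sketch using only Python's standard `hashlib`; the helper names and the local file path in the usage note are hypothetical and not part of the release tooling.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_digest(path, published):
    """Compare a file against a digest given with or without a 'sha256:' prefix."""
    expected = published.split(":", 1)[-1].strip().lower()
    return sha256_of_file(path) == expected
```

For example, `matches_published_digest("llama-b6721-bin-ubuntu-x64.zip", "sha256:1876b159729dda0b0307010d308a51eb892e18892ba0efdf6e0d30224eefa12c")` should return `True` for an intact download saved under that (assumed) local name.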

26 Sep 21:35

github-actions

b6601

624207e

devops: add s390x & ppc64le CI (#15925)

* devops: move s390x and ppc64le ci build

we have access to ubuntu-24.04-s390x and ppc64le images now

Signed-off-by: Aaron Teo <[email protected]>

* devops: disable ppc64le for now since they have compiler errors

* devops: stop warnings as errors

* devops: switch to non-macro flag

* devops: going the llama macro route

* devops: add big-endian gguf test models

* devops: disable ppc64le to test s390x, check test build

* devops: dup .gguf.inp files for big-endian tests

* devops: dup .gguf.out files for big-endian too

* devops: add python setup and endian byteswap

* devops: poor thing does not have s390x python3

* devops: add missing rust compiler for s390x

* devops: try rust actions runner

* Revert "devops: try rust actions runner"

This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c.

* devops: try a different path for rust

* devops: dump home directory and user info

* devops: install gguf-py only

* devops: missed relative path

* devops: remove big-endian files since local swapping is working

* devops: revert test-tokenizer-0 cmakelists

* Fix unicode flags conversion from and to uint16_t

Bitfields are allocated in different order on s390x

* Simplify byteswap command

* Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs

* Fix endianness detection in vocab loader

* Disable test-thread-safety on s390x

In this test a model is downloaded, then immediately loaded to check
whether more downloads are needed, and then used for the test.

There is no clean way to separate these steps and add byteswapping
between them, so just skip this test.

* Fix q8_0 test in test-quantize-fns

vec_signed uses an unexpected rounding mode.
Explicitly use a different rounding function.

* devops: add big-endian stories260K

* devops: add s390x test-eval-callback

* devops: fix test does not exist

* devops: fix model not found llama-eval-callback

* Fix q3_K dot product error in test-quantize-fns on s390x

Array q8bytes had only 4 elements allocated, but 8 elements were accessed.
This led to an out-of-bounds write, a later out-of-bounds read of the
overwritten values, and an incorrect result.

* devops: re-enable ppc64le for testing

* devops: activate test-thread-safety for s390x

* devops: disable ppc64le tests

For some reason it keeps failing the test-thread-safety tests, and I do
not have a machine that is able to replicate the tests.

* devops: LLAMA_FATAL_WARNINGS=ON

* Correct repository URL for s390x for test-thread-safety model

* Fix fs_get_cache_directory

Ensure it works even if both XDG_CACHE_HOME and HOME are unset.
This might happen in containers.

* Re-enable CI for ppc64le

* Fortify ggml_rope_impl

Only memcpy data from sections argument if it's non-NULL.

* Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way

* Update URL for big-endian model

* Update .github/workflows/build.yml

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update remaining mentions of BE models to ggml-org/models repo
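
Two ideas recur in the big-endian changes listed here: byteswapping fixed-width values when data was written on a host with the other byte order, and encoding flags with explicit masks so the resulting 16-bit value does not depend on how a compiler lays out bitfields. The sketch below illustrates both in Python with the standard `struct` module. The flag names and helper functions are hypothetical, chosen only for illustration, and do not mirror the actual `unicode_cpt_flags` definition.

```python
import struct
import sys

def byteswap_u16(value):
    """Swap the two bytes of a 16-bit unsigned integer."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)

def byteswap_u32_array(data):
    """Reverse the byte order of every 32-bit word in a bytes object."""
    if len(data) % 4 != 0:
        raise ValueError("length must be a multiple of 4 bytes")
    count = len(data) // 4
    words = struct.unpack(f"<{count}I", data)  # read as little-endian
    return struct.pack(f">{count}I", *words)   # write as big-endian

# Hypothetical flag bits. Explicit masks yield the same integer on any host,
# unlike C bitfields, whose allocation order is implementation-defined.
FLAG_NUMBER = 1 << 0
FLAG_LETTER = 1 << 1
FLAG_WHITESPACE = 1 << 2

def pack_flags(is_number, is_letter, is_whitespace):
    """Encode three booleans into one integer using fixed bit positions."""
    value = 0
    if is_number:
        value |= FLAG_NUMBER
    if is_letter:
        value |= FLAG_LETTER
    if is_whitespace:
        value |= FLAG_WHITESPACE
    return value

def unpack_flags(value):
    """Decode the integer back into (is_number, is_letter, is_whitespace)."""
    return (
        bool(value & FLAG_NUMBER),
        bool(value & FLAG_LETTER),
        bool(value & FLAG_WHITESPACE),
    )

# Host byte order, as a loader might check before deciding to swap.
host_is_big_endian = sys.byteorder == "big"
```

Swapping is its own inverse, so applying `byteswap_u32_array` twice returns the original bytes.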

---------

Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: Aleksei Nikiforov <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
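
The `fs_get_cache_directory` fix in this release is about resolving a cache location even when neither `XDG_CACHE_HOME` nor `HOME` is set, as can happen in containers. A minimal Python sketch of that kind of resolution order follows. The function name, the `llama.cpp` subdirectory, and in particular the fallback to the system temporary directory are assumptions for illustration; the actual C++ implementation may choose a different fallback.

```python
import os
import tempfile

def resolve_cache_directory(env=None, app_name="llama.cpp"):
    """Pick a cache directory from the environment, with a last-resort fallback."""
    env = os.environ if env is None else env
    xdg = env.get("XDG_CACHE_HOME")
    home = env.get("HOME")
    if xdg:
        base = xdg
    elif home:
        base = os.path.join(home, ".cache")
    else:
        # Assumed fallback: neither variable is set, so use the temp directory.
        base = tempfile.gettempdir()
    return os.path.join(base, app_name)
```

Passing the environment as a dictionary keeps the function easy to test, for example `resolve_cache_directory({})` exercises the case where both variables are missing.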

Assets 15

19 Sep 14:33

github-actions

b6520

4067f07

feat: Improve mobile UI for Settings Dialog (#16084)

* feat: Improve mobile UI for Settings Dialog

* chore: update webui build output

* fix: Linting errors

* chore: update webui build output

Assets 15

01 Sep 18:14

github-actions

b6345

4b20d8b

convert : remove redundant code (#15708)

Signed-off-by: Jie Fu <[email protected]>

Assets 15

24 Aug 20:25

github-actions

b6264

043fb27

vulkan: apply MUL_MAT_ID subgroup optimization to non-coopmat devices…

Assets 15

10 Aug 22:55

github-actions

b6123

79c1160

cuda: refactored ssm_scan and use CUB (#13291)

* cuda: refactored ssm_scan to use CUB

* fixed compilation error when not using CUB

* assign L to constant and use size_t instead of int

* deduplicated functions

* change min blocks per mp to 1

* Use cub load and store warp transpose

* suppress clang warning

Assets 15

05 Aug 18:25

github-actions

b6094

be42642

readme : update hot topics (#15097)

Assets 15

30 Jul 14:52

github-actions

b6032

92b8810

CUDA: skip masked KV slices for all FA kernels (#14924)

Assets 15

30 Jul 02:27

github-actions

b6028

61550f8

CANN: update ops docs (#14935)

* CANN:add ops docs

* CANN: update ops docs

Assets 15

26 Jul 12:47

github-actions

b5996

11dd5a4

CANN: Implement GLU ops (#14884)

Implement REGLU, GEGLU, SWIGLU ops according to #14158
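
For readers unfamiliar with the three ops named here, the sketch below shows the usual definition of the gated linear unit variants in plain Python: the input row is split in half, one half is passed through an activation (ReLU, GELU, or SiLU), and the result is multiplied elementwise by the other half. Which half is activated is an assumption in this sketch, as is the use of the exact erf-based GELU; the CANN kernels may differ in both details.

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    """Exact GELU via the error function (an approximation may be used in practice)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU, also known as swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def glu(row, activation):
    """Activate the first half of the row and gate it with the second half."""
    if len(row) % 2 != 0:
        raise ValueError("row length must be even")
    half = len(row) // 2
    values, gates = row[:half], row[half:]
    return [activation(v) * g for v, g in zip(values, gates)]

def reglu(row):
    return glu(row, relu)

def geglu(row):
    return glu(row, gelu)

def swiglu(row):
    return glu(row, silu)
```

Each function halves the width of its input, so a row of length 2n produces n outputs.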

Assets 15
