Releases: ServeurpersoCom/llama.cpp

b6710

08 Oct 07:33
74b8fc1
ggml webgpu: profiling, CI updates, reworking of command submission (…

b6708

07 Oct 13:21
df1b612
server : add `/v1/health` endpoint (#16461)

* server : add /v1/health endpoint

* cont : update readme

b6706

07 Oct 09:04
ef4c5b8
presets : fix pooling param for embedding models (#16455)

b6700

06 Oct 18:34
3df2244
llama : add --no-host to disable host buffers (#16310)

* implement --no-host to disable host buffer

* fix equal_mparams

* move no-host enumeration order together with other model params

---------

Co-authored-by: slaren <[email protected]>

b6699

06 Oct 18:23
c08002a
chat : Granite Docling stopping (#16438)

* fix: Fix duplicate fake image before token on first slice

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use double-newline before overview image

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Remove incorrect newline at the end of granite chat template gen prompt

There should not be one, even for the language models.

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

* tests: Remove bad newline from granite chat template test (legacy)

Branch: GraniteDoclingStopping

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>

b6697

06 Oct 14:33
a23b9bd
ggml : fix unaligned access in AMX code (#16315)

b6695

06 Oct 12:59
a80ff18
ggml-cpu : fix leftover handling in ggml_vec_scale_f32 for SVE (#16443)

This commit updates the leftover handling in ggml_vec_scale_f32.

The motivation for this is that the code currently incorrectly assumes
there would be fewer than ggml_f32_epr leftover elements. However,
since the main loop processes 2*ggml_f32_epr elements per iteration,
there can be up to (2*ggml_f32_epr - 1) leftover elements.

The original single-pass leftover code could only process ggml_f32_epr
elements, leaving some elements unscaled.

Example scenario with 256-bit SVE:
```
ggml_f32_epr  = 8 (elements per register)
ggml_f32_step = 16 (two registers per iteration)
n             = 25
np            = 16
leftovers     = 9 elements (16-24)

Original    : processes only elements 16-23, misses element 24
This commit : loop processes elements 16-23, then element 24
```

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/18070620247/job/51419855630
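
A rough scalar sketch of the two-pass tail handling described above (assumed
names for illustration; the real ggml code uses SVE intrinsics and predicated
loads/stores, not plain loops):

```
// Simplified scalar illustration of the leftover fix.
// `epr` stands in for ggml_f32_epr (elements per register); the main
// loop consumes 2*epr elements per iteration, so up to 2*epr - 1
// elements can remain afterwards.
void vec_scale_f32(int n, float * y, float s, int epr) {
    const int step = 2 * epr;            // two registers per iteration
    const int np   = (n / step) * step;  // elements covered by the main loop

    // main loop: 2*epr elements per iteration (vectorized in the real code)
    for (int i = 0; i < np; ++i) {
        y[i] *= s;
    }

    // leftover pass 1: full registers of epr elements
    // (before the fix, this was the only leftover pass)
    int i = np;
    for (; i + epr <= n; i += epr) {
        for (int j = 0; j < epr; ++j) {
            y[i + j] *= s;
        }
    }

    // leftover pass 2: final partial register of fewer than epr elements
    // (handled with a predicate such as svwhilelt in the real SVE code)
    for (; i < n; ++i) {
        y[i] *= s;
    }
}
```

With n = 25 and epr = 8 as in the scenario above, the first tail pass covers
elements 16-23 and the final pass picks up element 24.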

b6692

05 Oct 13:54
ca71fb9
model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)

* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add tiling support for idefics3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image, which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately and then concatenated (see the
sketch after this list).

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
coincidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>

* add test model

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>
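
As a hedged illustration of the double-newline issue noted in the templating
bullet above: with a toy tokenizer (purely hypothetical, not the llama.cpp
API) where "\n\n" is a single token, tokenizing text blocks separately and
concatenating the results cannot merge the newline pair across the block
boundary:

```
#include <cstdio>
#include <string>
#include <vector>

// Toy tokenizer for illustration only: "\n\n" is a single token (id 2),
// a lone "\n" is id 1, any other character is id 0.
static std::vector<int> tokenize(const std::string & text) {
    std::vector<int> ids;
    for (size_t i = 0; i < text.size(); ) {
        if (text.compare(i, 2, "\n\n") == 0) { ids.push_back(2); i += 2; }
        else if (text[i] == '\n')            { ids.push_back(1); i += 1; }
        else                                 { ids.push_back(0); i += 1; }
    }
    return ids;
}

int main() {
    // Tokenizing the full string merges the two newlines into one token...
    std::vector<int> whole = tokenize("slice\n\nglobal");

    // ...but tokenizing the blocks separately and concatenating yields two
    // consecutive '\n' tokens, since the merge cannot cross the boundary.
    std::vector<int> a = tokenize("slice\n");
    std::vector<int> b = tokenize("\nglobal");
    a.insert(a.end(), b.begin(), b.end());

    printf("whole: %zu tokens, concatenated: %zu tokens\n",
           whole.size(), a.size());  // concatenated has one token more
}
```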

b6691

05 Oct 04:30
3526657
ggml webgpu: actually add softmax, fix rms_norm offset (#16400)

* implement soft_max

* Fix soft_max data race

* Temporary fix, wait on each submit

b6690

04 Oct 23:50
86df2c9
vulkan: use a more appropriate amount of threads when generating shad…