Releases: ngxson/llama.cpp

b6235

21 Aug 15:34
96452a3
vulkan: Reuse conversion results in prealloc_y (#15410)

* vulkan: Reuse conversion results in prealloc_y

Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.

* don't use shared pointer for prealloc_y_last_pipeline_used
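
The caching works roughly as sketched below. This is a minimal illustration only, assuming hypothetical types and field names (apart from `prealloc_y` and `prealloc_y_last_pipeline_used`, which the commit message mentions); the actual state lives in the ggml Vulkan backend context.

```cpp
// Sketch only: hypothetical types/fields, not the actual ggml-vulkan code.
struct vk_pipeline_t;   // placeholder for a conversion pipeline handle
struct ggml_tensor_t;   // placeholder for a source tensor

struct vk_context_sketch {
    // prealloc_y holds src1 converted to the type the matmul shader expects.
    void *                prealloc_y                    = nullptr;
    // Plain pointers (not shared_ptr) recording what last filled prealloc_y.
    const vk_pipeline_t * prealloc_y_last_pipeline_used = nullptr;
    const ggml_tensor_t * prealloc_y_last_tensor_used   = nullptr;
};

static void fill_prealloc_y(vk_context_sketch & ctx,
                            const vk_pipeline_t * convert_pipeline,
                            const ggml_tensor_t * src1) {
    // Skip the conversion dispatch if the same pipeline already converted
    // the same tensor into prealloc_y.
    if (ctx.prealloc_y_last_pipeline_used == convert_pipeline &&
        ctx.prealloc_y_last_tensor_used   == src1) {
        return;
    }
    // ... dispatch the conversion shader into ctx.prealloc_y here ...
    ctx.prealloc_y_last_pipeline_used = convert_pipeline;
    ctx.prealloc_y_last_tensor_used   = src1;
}
```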

b6229

21 Aug 10:44
2758fa1
examples : add model conversion tool/example (#15455)

* examples : add model conversion tool/example

This commit adds an "example/tool" that is intended to help in the
process of converting models to GGUF. Currently it supports normal
causal models and embedding models. The readme contains instructions and
commands to guide you through the process.

The motivation for this is to have a structured and repeatable process for
model conversions, and hopefully with time to improve upon it to make the
process easier and more reliable. We have started to use this for new
model conversions internally and will continue doing so, improving it
as we go along. Perhaps with time this should be placed in a different
directory than the examples directory, but for now it seems like a good
place to keep it while we are still developing it.

* squash! examples : add model conversion tool/example

Remove dependency on scikit-learn in model conversion example.

* squash! examples : add model conversion tool/example

Update the transformers dependency to use a non-dev version, and import
`AutoModelForCausalLM` instead of `AutoModel` to ensure compatibility
with the latest version.

* squash! examples : add model conversion tool/example

Remove the logits requirements file from the "all" requirements file.

b6228

21 Aug 10:32
b108e42
ci : fix -Werror=return-type in clip.cpp so ci/run.sh can run without…

b6225

21 Aug 09:13
2f3dbff
common : fix incorrect print of non-ascii characters in the logging (…

b6224

21 Aug 05:53
945e1f1
ggml : fix condition of im2col on Metal backend (#15460)

b6221

21 Aug 03:56
8ad038c
musa: add GGML_UNUSED_VARS (#15446)

Signed-off-by: Xiaodong Ye <[email protected]>

b6220

20 Aug 23:57
5682a37
sched : copy only the used experts when offloading prompt processing …

b6217

20 Aug 16:00
7a6e91a
CUDA: replace GGML_CUDA_F16 with CUDA arch checks (#15433)

b6216

20 Aug 14:56
fec9519
vulkan: shorten pipeline name strings (#15431)

These detailed strings were causing increased build time on gcc.

b6215

20 Aug 12:48
657b8a7
chat: handle gpt-oss return/end token inconsistency (#15421)

This commit addresses an inconsistency during inference by adding a new
member to the `templates_params` struct to indicate whether the chat is
in inference mode. This allows the gpt-oss-specific function
`common_chat_params_init_gpt_oss` to check this flag and the
`add_generation_prompt` flag to determine if it should replace the
`<|return|>` token with the `<|end|>` token in the prompt.

The motivation for this change is to ensure that the formatted prompt of
past messages in `common_chat_format_single` matches the output of the
formatted new message. The issue is that the gpt-oss template returns
different end tags: `<|return|>` when `add_generation_prompt` is false,
and `<|end|>` when `add_generation_prompt` is true. This causes the
substring function to start at an incorrect position, resulting in
tokenization starting with 'tart|>' instead of '<|start|>'.

Resolves: https://github.com/ggml-org/llama.cpp/issues/15417
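
The check described above can be pictured roughly as below. This is a hedged sketch with hypothetical struct and helper names (only `templates_params`, `add_generation_prompt`, and the two end tokens come from the commit message); the real logic lives in `common_chat_params_init_gpt_oss`.

```cpp
#include <string>

// Hypothetical mirror of the flag added to templates_params.
struct templates_params_sketch {
    bool is_inference          = false;
    bool add_generation_prompt = false;
};

// Assumed behaviour: when formatting past messages for inference and no
// generation prompt is requested, rewrite the trailing <|return|> tag to
// <|end|> so old and new messages end with the same tag.
static std::string normalize_gpt_oss_end_tag(std::string prompt,
                                             const templates_params_sketch & params) {
    static const std::string ret_tag = "<|return|>";
    static const std::string end_tag = "<|end|>";
    if (params.is_inference && !params.add_generation_prompt) {
        const size_t pos = prompt.rfind(ret_tag);
        if (pos != std::string::npos) {
            prompt.replace(pos, ret_tag.size(), end_tag);
        }
    }
    return prompt;
}
```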