Releases: ngxson/llama.cpp
b6235
vulkan: Reuse conversion results in prealloc_y (#15410)

* vulkan: Reuse conversion results in prealloc_y

  Cache the pipeline and tensor that were most recently used to fill prealloc_y, and skip the conversion if the current pipeline/tensor match.

* don't use shared pointer for prealloc_y_last_pipeline_used
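For illustration, a minimal sketch of that caching idea follows; the types and the conversion helper are stand-ins, not the actual ggml Vulkan backend API.

```cpp
// Illustrative sketch only: remember which pipeline and source tensor last filled
// prealloc_y, and skip the conversion dispatch when both match. All types and the
// run_conversion() helper are stand-ins for the real ggml Vulkan backend code.
struct vk_pipeline { /* stand-in for a compiled conversion pipeline */ };
struct ggml_tensor { /* stand-in for a source tensor */ };

struct vk_context_sketch {
    // Raw pointer (not a shared_ptr, per the commit) plus the last source tensor.
    const vk_pipeline *prealloc_y_last_pipeline_used = nullptr;
    const ggml_tensor *prealloc_y_last_tensor_used   = nullptr;
};

static void run_conversion(const vk_pipeline *, const ggml_tensor *) {
    // stand-in for dispatching the conversion shader that writes prealloc_y
}

void fill_prealloc_y(vk_context_sketch &ctx, const vk_pipeline *pipeline, const ggml_tensor *src) {
    // Skip the conversion when the same pipeline and tensor filled prealloc_y last time.
    if (ctx.prealloc_y_last_pipeline_used == pipeline &&
        ctx.prealloc_y_last_tensor_used   == src) {
        return; // prealloc_y already holds the converted data
    }
    run_conversion(pipeline, src);
    ctx.prealloc_y_last_pipeline_used = pipeline;
    ctx.prealloc_y_last_tensor_used   = src;
}
```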
b6229
examples : add model conversion tool/example (#15455)

* examples : add model conversion tool/example

  This commit adds an "example/tool" that is intended to help in the process of converting models to GGUF. Currently it supports normal causal models and embedding models. The readme contains instructions and commands to guide through the process.

  The motivation is to have a structured and repeatable process for model conversions, and hopefully to improve it over time and make the process easier and more reliable. We have started to use this for new model conversions internally and will continue doing so, improving it as we go along. Perhaps with time this should be placed in a different directory than the examples directory, but for now it seems like a good place to keep it while we are still developing it.

* squash! examples : add model conversion tool/example

  Remove dependency on scikit-learn in model conversion example.

* squash! examples : add model conversion tool/example

  Update the transformers dependency to use a non-dev version. Also import `AutoModelForCausalLM` instead of `AutoModel` to ensure compatibility with the latest version.

* squash! examples : add model conversion tool/example

  Remove the logits requirements file from the all-requirements file.
b6228
ci : fix -Werror=return-type in clip.cpp so ci/run.sh can run without…
b6225
common : fix incorrect print of non-ascii characters in the logging (…
b6224
ggml : fix condition of im2col on Metal backend (#15460)
b6221
musa: add GGML_UNUSED_VARS (#15446)

Signed-off-by: Xiaodong Ye <[email protected]>
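As a rough, non-authoritative illustration of what a variadic "mark these variables as used" helper can look like in C++ (the actual GGML_UNUSED_VARS definition in ggml may differ):

```cpp
// Hypothetical sketch only; not the real GGML_UNUSED_VARS definition.
// A variadic helper that "uses" every argument, silencing unused-variable warnings.
template <typename... Args>
inline void unused_vars_sketch(Args &&...) {}

#define UNUSED_VARS_SKETCH(...) unused_vars_sketch(__VA_ARGS__)

void example(int rows, float scale) {
    // Both names are referenced once, so unused-variable/unused-parameter warnings
    // (and -Werror builds) stay quiet even when the code that needs them is #ifdef'd out.
    UNUSED_VARS_SKETCH(rows, scale);
}
```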
b6220
sched : copy only the used experts when offloading prompt processing …
b6217
CUDA: replace GGML_CUDA_F16 with CUDA arch checks (#15433)
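As a sketch of the pattern the title refers to (illustrative, not the actual llama.cpp CUDA sources): gate the FP16 path on `__CUDA_ARCH__`, which nvcc defines per target architecture during device compilation, instead of on a global build flag.

```cpp
// Illustrative pattern only: decide per compiled architecture whether a
// half-precision code path is available, rather than via a repo-wide
// GGML_CUDA_F16 option. __CUDA_ARCH__ is defined by nvcc only in device code.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
#define FAST_FP16_AVAILABLE_SKETCH 1 // Volta (sm_70) and newer
#else
#define FAST_FP16_AVAILABLE_SKETCH 0 // older GPUs and host-side compilation
#endif
```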
b6216
vulkan: shorten pipeline name strings (#15431)

These detailed strings were causing increased build time on gcc.
b6215
chat: handle gpt-oss return/end token inconsistency (#15421)

This commit addresses an inconsistency during inference by adding a new member to the `templates_params` struct to indicate whether the chat is in inference mode. This allows the gpt-oss specific function `common_chat_params_init_gpt_oss` to check this flag and the `add_generation_prompt` flag to determine if it should replace the `<|return|>` token with the `<|end|>` token in the prompt.

The motivation for this change is to ensure that the formatted prompt of past messages in `common_chat_format_single` matches the output of the formatted new message. The issue is that the gpt-oss template returns different end tags: `<|return|>` when `add_generation_prompt` is false, and `<|end|>` when it is true. This causes the substring function to start at an incorrect position, resulting in tokenization starting with 'tart|>' instead of '<|start|>'.

Resolves: https://github.com/ggml-org/llama.cpp/issues/15417
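A hedged sketch of the end-tag normalization described above; the function name and signature are illustrative, and the real logic in the gpt-oss chat handling differs in detail.

```cpp
// Illustrative only: rewrite the template's trailing <|return|> to <|end|> so that
// the formatted prefix of past messages lines up with the formatted prompt that
// includes the new message. Names are hypothetical, not the llama.cpp API.
#include <string>

std::string normalize_gpt_oss_end_tag(std::string prompt, bool is_inference, bool add_generation_prompt) {
    // The gpt-oss template ends with <|return|> when add_generation_prompt is false,
    // but with <|end|> when it is true; align the two during inference.
    if (is_inference && !add_generation_prompt) {
        const std::string ret = "<|return|>";
        const std::string end = "<|end|>";
        const size_t pos = prompt.rfind(ret);
        if (pos != std::string::npos) {
            prompt.replace(pos, ret.size(), end);
        }
    }
    return prompt;
}
```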