Releases: ServeurpersoCom/llama.cpp

b6743

12 Oct 14:36
c7be9fe
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521)

* fix/refactor OP argsort, pad

* fix count-equal op

* update SYCL OP list

* fix format issue

---------

Co-authored-by: Zhang Jianyu <[email protected]>

b6739

12 Oct 06:50
41aac5c
ggml : Fix FP16 ELU positive branch (#16519)

Co-authored-by: Aaron <[email protected]>

b6735

11 Oct 14:37
a3cb047
metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494)

b6732

11 Oct 11:58
97870e6
cuda : avoid initializing unused devices (#16510)

b6730

10 Oct 20:24
e60f01d
server : fix division by zero when reporting stats (#16501)
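The release title names only the failure mode; as an illustration of the kind of guard involved (hypothetical names, not the actual code from #16501), the idea is to check the elapsed-time denominator before reporting a rate such as tokens per second:

```cpp
#include <cstdint>

// Hypothetical sketch: return 0 instead of dividing by a zero elapsed time.
static double tokens_per_second(int64_t n_tokens, int64_t t_us) {
    return t_us > 0 ? 1e6 * (double) n_tokens / (double) t_us : 0.0;
}
```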

b6729

10 Oct 17:03
81086cd
vocab : mark EOT token for Granite models (#16499)

* vocab : mark EOT token for Granite models

* sampling : fallback to EOS when EOT is not found
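A minimal sketch of the fallback described above, assuming the llama_vocab_eot()/llama_vocab_eos() accessors from llama.h; pick_stop_token is a hypothetical helper, not code taken from the PR:

```cpp
#include "llama.h"

// Hypothetical helper: prefer the end-of-turn token, and fall back to
// end-of-sequence when the vocab does not define an EOT token.
static llama_token pick_stop_token(const struct llama_vocab * vocab) {
    llama_token tok = llama_vocab_eot(vocab);
    if (tok == LLAMA_TOKEN_NULL) {
        tok = llama_vocab_eos(vocab);
    }
    return tok;
}
```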

b6725

10 Oct 04:59
1faa13a
webui: updated the chat service to only include max_tokens in the req…

b6720

09 Oct 11:54
2c0d875
ci: add ARM64 Kleidiai build and test support (#16462)

b6715

08 Oct 20:51
12bbc3f
refactor: centralize CoT parsing in backend for streaming mode (#16394)

* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
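The incremental behaviour described above can be pictured with a small self-contained sketch. It is illustrative only and is not the try_parse_reasoning() implementation; the stream_parser type and its members are hypothetical. It splits a streamed response into reasoning_content and content while tolerating a <think> tag that arrives split across chunks:

```cpp
#include <algorithm>
#include <cstdio>
#include <string>

struct stream_parser {
    std::string reasoning_content; // text inside <think>...</think>
    std::string content;           // text outside the reasoning block
    std::string pending;           // unconsumed tail that may hold a partial tag
    bool in_think = false;

    void feed(const std::string & chunk) {
        pending += chunk;
        for (;;) {
            const std::string tag = in_think ? "</think>" : "<think>";
            const size_t pos = pending.find(tag);
            if (pos == std::string::npos) {
                // flush everything except a tail that could still become the tag
                size_t keep = 0;
                for (size_t n = std::min(pending.size(), tag.size() - 1); n > 0; --n) {
                    if (pending.compare(pending.size() - n, n, tag, 0, n) == 0) {
                        keep = n;
                        break;
                    }
                }
                emit(pending.substr(0, pending.size() - keep));
                pending.erase(0, pending.size() - keep);
                return;
            }
            emit(pending.substr(0, pos));
            pending.erase(0, pos + tag.size());
            in_think = !in_think; // toggle on every opening/closing tag
        }
    }

    void emit(const std::string & text) {
        (in_think ? reasoning_content : content) += text;
    }
};

int main() {
    stream_parser p;
    // tags may arrive split across SSE chunks
    for (const char * chunk : {"<thi", "nk>plan the answer</think>", "final answer"}) {
        p.feed(chunk);
    }
    std::printf("reasoning: %s\ncontent:   %s\n", p.reasoning_content.c_str(), p.content.c_str());
}
```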

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <[email protected]>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <[email protected]>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <[email protected]>

---------

Co-authored-by: Aleksander Grygier <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>

b6713

08 Oct 18:26
d2ee056
server : fix cancel pending task (#16467)

Co-authored-by: DevAI <[email protected]>