Releases: ServeurpersoCom/llama.cpp

b6743

12 Oct 14:36
c7be9fe
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521)

* fix/refactor OP argsort, pad

* fix count-equal op

* update SYCL OP list

* fix format issue

---------

Co-authored-by: Zhang Jianyu <[email protected]>

b6739

12 Oct 06:50
41aac5c
ggml : Fix FP16 ELU positive branch (#16519)

Co-authored-by: Aaron <[email protected]>

b6735

11 Oct 14:37
a3cb047
metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494)

b6732

11 Oct 11:58
97870e6
cuda : avoid initializing unused devices (#16510)

b6730

10 Oct 20:24
e60f01d
server : fix division by zero when reporting stats (#16501)
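The release title names only the failure mode; as an illustration of the kind of guard involved (hypothetical names, not the actual code from #16501), the idea is to check the elapsed-time denominator before reporting a rate such as tokens per second:

```cpp
#include <cstdint>

// Hypothetical sketch: return 0 instead of dividing by a zero elapsed time.
static double tokens_per_second(int64_t n_tokens, int64_t t_us) {
    return t_us > 0 ? 1e6 * (double) n_tokens / (double) t_us : 0.0;
}
```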

b6729

10 Oct 17:03
81086cd
vocab : mark EOT token for Granite models (#16499)

* vocab : mark EOT token for Granite models

* sampling : fallback to EOS when EOT is not found
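A minimal sketch of the fallback described above, assuming the llama_vocab_eot()/llama_vocab_eos() accessors from llama.h; pick_stop_token is a hypothetical helper, not code taken from the PR:

```cpp
#include "llama.h"

// Hypothetical helper: prefer the end-of-turn token, and fall back to
// end-of-sequence when the vocab does not define an EOT token.
static llama_token pick_stop_token(const struct llama_vocab * vocab) {
    llama_token tok = llama_vocab_eot(vocab);
    if (tok == LLAMA_TOKEN_NULL) {
        tok = llama_vocab_eos(vocab);
    }
    return tok;
}
```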

b6725

10 Oct 04:59
1faa13a
webui: updated the chat service to only include max_tokens in the req…

b6720

09 Oct 11:54
2c0d875
ci: add ARM64 Kleidiai build and test support (#16462)

b6715

08 Oct 20:51
12bbc3f
refactor: centralize CoT parsing in backend for streaming mode (#16394)

* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
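The incremental behaviour described above can be pictured with a small self-contained sketch. It is illustrative only and is not the try_parse_reasoning() implementation; the stream_parser type and its members are hypothetical. It splits a streamed response into reasoning_content and content while tolerating a <think> tag that arrives split across chunks:

```cpp
#include <algorithm>
#include <cstdio>
#include <string>

struct stream_parser {
    std::string reasoning_content; // text inside <think>...</think>
    std::string content;           // text outside the reasoning block
    std::string pending;           // unconsumed tail that may hold a partial tag
    bool in_think = false;

    void feed(const std::string & chunk) {
        pending += chunk;
        for (;;) {
            const std::string tag = in_think ? "</think>" : "<think>";
            const size_t pos = pending.find(tag);
            if (pos == std::string::npos) {
                // flush everything except a tail that could still become the tag
                size_t keep = 0;
                for (size_t n = std::min(pending.size(), tag.size() - 1); n > 0; --n) {
                    if (pending.compare(pending.size() - n, n, tag, 0, n) == 0) {
                        keep = n;
                        break;
                    }
                }
                emit(pending.substr(0, pending.size() - keep));
                pending.erase(0, pending.size() - keep);
                return;
            }
            emit(pending.substr(0, pos));
            pending.erase(0, pos + tag.size());
            in_think = !in_think; // toggle on every opening/closing tag
        }
    }

    void emit(const std::string & text) {
        (in_think ? reasoning_content : content) += text;
    }
};

int main() {
    stream_parser p;
    // tags may arrive split across SSE chunks
    for (const char * chunk : {"<thi", "nk>plan the answer</think>", "final answer"}) {
        p.feed(chunk);
    }
    std::printf("reasoning: %s\ncontent:   %s\n", p.reasoning_content.c_str(), p.content.c_str());
}
```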

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <[email protected]>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <[email protected]>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <[email protected]>

---------

Co-authored-by: Aleksander Grygier <[email protected]>
Co-authored-by: Xuan Son Nguyen <[email protected]>

b6713

08 Oct 18:26
d2ee056
server : fix cancel pending task (#16467)

Co-authored-by: DevAI <[email protected]>