Releases · ServeurpersoCom/llama.cpp
b6750
CANN: fix CPU memory leak in CANN backend (#16549)

This commit fixes a CPU-side memory leak in the CANN backend that occurred when intermediate aclTensorList objects were not released after operator execution. The leak happened during repeated invocations of CANN ops (e.g., FlashAttention), causing host memory usage to grow over time. Proper resource cleanup (aclDestroyTensorList and related release logic) has been added to ensure that all temporary tensors are correctly freed.
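As a minimal sketch of the pattern this fix addresses (assumed shapes only, not the actual llama.cpp code: the opaque handles, declarations, and the `launch_flash_attention` helper are stand-ins for the real CANN headers):

```cpp
#include <cstdint>
#include <vector>

struct aclTensor;       // opaque handles, normally provided by CANN headers
struct aclTensorList;

aclTensorList* aclCreateTensorList(aclTensor* const* tensors, uint64_t size);
int            aclDestroyTensorList(const aclTensorList* list);
void           launch_flash_attention(aclTensorList* inputs);  // hypothetical op launch

void forward_step(std::vector<aclTensor*>& inputs) {
    // The list wrapper is allocated in host (CPU) memory.
    aclTensorList* list = aclCreateTensorList(inputs.data(), inputs.size());

    launch_flash_attention(list);

    // The fix: destroy the wrapper once the op has been issued. Before the
    // patch this release was missing, so every invocation (e.g. each
    // FlashAttention step) leaked a little host memory.
    aclDestroyTensorList(list);
}
```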
b6749
fix: add remark plugin to render raw HTML as literal text (#16505)

* fix: add remark plugin to render raw HTML as literal text

  Implemented a missing MDAST stage to neutralize raw HTML, as major LLM WebUIs do, ensuring consistent and safe Markdown rendering.

  Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the Markdown AST into plain-text equivalents while preserving indentation and line breaks. This ensures consistent rendering and prevents unintended HTML execution, without altering valid Markdown structure.

  Kept 'remarkRehype' in the pipeline, since it performs the required conversion from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization.

  Refined the link-enhancement logic to skip unnecessary DOM rewrites, fixing a subtle bug where extra paragraphs were injected after the first line due to full innerHTML reconstruction, and ensuring links open in new tabs only when required.

  Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml -> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify

* fix: address review feedback from allozaur

* chore: update webui build output
b6746
CANN: Update several operators to support FP16 data format (#16251)

Many Ascend operators internally use FP16 precision for computation. If input data is in FP32, it must first be cast to FP16 before computation and then cast back to FP32 afterwards, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less work than FP32, leading to noticeable efficiency improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5B model shows correct accuracy and about a 10% performance gain in concurrent scenarios.

Co-authored-by: noemotiovon <[email protected]>
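A rough sketch of what such a dispatch change looks like, with hypothetical `rms_norm_f16_kernel` and `cast_kernel` declarations standing in for the Ascend primitives (assumed structure, not the actual backend code):

```cpp
#include <cstddef>

enum class dtype { F32, F16 };

// Hypothetical stand-ins for the Ascend compute and cast kernels.
void rms_norm_f16_kernel(const void* src, void* dst, size_t n);
void cast_kernel(const void* src, void* dst, size_t n, dtype from, dtype to);

void rms_norm(const void* src, void* dst, void* scratch16, size_t n, dtype t) {
    if (t == dtype::F16) {
        // New fast path: the operator accepts FP16 directly, no extra casts.
        rms_norm_f16_kernel(src, dst, n);
        return;
    }
    // FP32 path: two extra cast kernels per invocation.
    cast_kernel(src, scratch16, n, dtype::F32, dtype::F16);
    rms_norm_f16_kernel(scratch16, scratch16, n);
    cast_kernel(scratch16, dst, n, dtype::F16, dtype::F32);
}
```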
b6743
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521)

* fix/refactor OP argsort, pad
* fix count-equal op
* update SYCL OP list
* fix format issue

Co-authored-by: Zhang Jianyu <[email protected]>
b6739
ggml : Fix FP16 ELU positive branch (#16519) Co-authored-by: Aaron <[email protected]>
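For context, ELU with α = 1 is defined as x for x > 0 and exp(x) − 1 otherwise; the FP16 kernel must match this reference after conversion to and from half precision:

```cpp
#include <cmath>

// Reference ELU with alpha = 1: the positive branch is the identity;
// only the non-positive branch applies exp(x) - 1.
static inline float elu(float x) {
    return x > 0.0f ? x : std::expm1(x);
}
```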
b6735
metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494)
b6732
cuda : avoid initializing unused devices (#16510)
b6730
server : fix division by zero when reporting stats (#16501)
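The general shape of such a guard, assuming a throughput stat like tokens per second (an illustrative sketch, not the exact server code):

```cpp
// Report 0 rather than dividing by zero when a request finishes before the
// timer advances or processes no tokens.
static double tokens_per_second(int n_tokens, double elapsed_ms) {
    return elapsed_ms > 0.0 ? 1e3 * n_tokens / elapsed_ms : 0.0;
}
```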
b6729
vocab : mark EOT token for Granite models (#16499)

* vocab : mark EOT token for Granite models
* sampling : fallback to EOS when EOT is not found
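A minimal sketch of that fallback, with a hypothetical `Vocab` struct standing in for llama.cpp's vocabulary state:

```cpp
#include <cstdint>

// Hypothetical stand-in for the model's vocabulary metadata.
struct Vocab {
    int32_t token_eot;   // -1 when the model defines no end-of-turn token
    int32_t token_eos;
};

// Prefer EOT, but fall back to EOS when EOT is absent so chat-style
// generation can still terminate.
static int32_t end_of_turn_token(const Vocab& vocab) {
    return vocab.token_eot != -1 ? vocab.token_eot : vocab.token_eos;
}
```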
b6725
webui: updated the chat service to only include max_tokens in the req…