
Releases: ServeurpersoCom/llama.cpp

b6750

13 Oct 09:37
56fc38b
CANN: fix CPU memory leak in CANN backend (#16549)

This commit fixes a CPU-side memory leak issue in the CANN backend,
which occurred when intermediate aclTensorList objects were not properly
released after operator execution. The leak happened during repeated
invocations of CANN ops (e.g., FlashAttention), leading to increasing
host memory usage over time.

Proper resource cleanup (aclDestroyTensorList and related release logic)
has been added to ensure that all temporary tensors are correctly freed.
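
One common way to make this kind of cleanup hard to forget is to tie the destroy call to scope exit. The sketch below is not the code from the commit: `FakeTensorList`, `fake_create_tensor_list`, `fake_destroy_tensor_list`, and `run_op_once` are hypothetical stand-ins so the example compiles without the CANN headers. In the real backend the release call is `aclDestroyTensorList`.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for the opaque aclTensorList handle.
struct FakeTensorList {
    std::vector<int> tensors;
};

// Number of lists currently alive; used only to show that nothing leaks.
int g_live_lists = 0;

// Hypothetical stand-in for the call that creates a tensor list.
FakeTensorList * fake_create_tensor_list(int n) {
    ++g_live_lists;
    return new FakeTensorList{std::vector<int>(n, 0)};
}

// Hypothetical stand-in for aclDestroyTensorList.
void fake_destroy_tensor_list(FakeTensorList * list) {
    --g_live_lists;
    delete list;
}

// Deleter that routes unique_ptr cleanup through the destroy call.
struct TensorListDeleter {
    void operator()(FakeTensorList * list) const { fake_destroy_tensor_list(list); }
};
using TensorListPtr = std::unique_ptr<FakeTensorList, TensorListDeleter>;

// Simulates one operator invocation that needs a temporary tensor list.
int run_op_once(int n_inputs) {
    TensorListPtr inputs(fake_create_tensor_list(n_inputs));
    return static_cast<int>(inputs->tensors.size());
}   // `inputs` is released here, on every return path
```

With this pattern, calling `run_op_once` repeatedly leaves `g_live_lists` at zero, which is the behaviour the fix restores for repeated op invocations.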

b6749

13 Oct 09:22
1fb9504
fix: add remark plugin to render raw HTML as literal text (#16505)

* fix: add remark plugin to render raw HTML as literal text

Implemented a missing MDAST stage that neutralizes raw HTML, as major LLM
WebUIs do, ensuring consistent and safe Markdown rendering

Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the
Markdown AST into plain-text equivalents while preserving indentation and
line breaks. This ensures consistent rendering and prevents unintended HTML
execution, without altering valid Markdown structure

Kept 'remarkRehype' in the pipeline since it performs the required conversion
from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization

Refined the link-enhancement logic to skip unnecessary DOM rewrites,
fixing a subtle bug where extra paragraphs were injected after the first
line due to full innerHTML reconstruction, and ensuring links open in new
tabs only when required

Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml
-> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify

* fix: address review feedback from allozaur

* chore: update webui build output

b6746

13 Oct 08:02
f9bc66c
CANN: Update several operators to support FP16 data format (#16251)

Many Ascend operators use FP16 precision internally. If the input data
is FP32, it must be cast to FP16 before computation and cast back to
FP32 afterwards, which introduces unnecessary cast operations. FP16
computation is also significantly cheaper than FP32, so accepting FP16
inputs directly yields noticeable efficiency improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended
to support multiple data types. Validation on the Qwen2 0.5b model shows
correct accuracy and about 10% performance gain in concurrent scenarios.

Co-authored-by: noemotiovon

b6743

12 Oct 14:36
c7be9fe
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521)

* fix/refactor OP argsort, pad

* fix count-equal op

* update SYCL OP list

* fix format issue

---------

Co-authored-by: Zhang Jianyu

b6739

12 Oct 06:50
41aac5c
ggml : Fix FP16 ELU positive branch (#16519)

Co-authored-by: Aaron <[email protected]>
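
For reference, ELU is the identity for positive inputs and `exp(x) - 1` otherwise (taking alpha = 1, which is an assumption here). The release title says only that the positive branch was wrong in the FP16 path, so the float sketch below shows the definition both branches should match, not the patched kernel. `elu_ref` is a hypothetical name.

```cpp
#include <cmath>

// Reference ELU with alpha = 1 (assumed): x for x > 0, expm1(x) otherwise.
float elu_ref(float x) {
    return x > 0.0f ? x : std::expm1(x);
}
```

A positive input must come back unchanged, which is the property the fix concerns.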

b6735

11 Oct 14:37
a3cb047
metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494)

b6732

11 Oct 11:58
97870e6
cuda : avoid initializing unused devices (#16510)

b6730

10 Oct 20:24
e60f01d
server : fix division by zero when reporting stats (#16501)
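
The release note does not say which statistic was affected. As a generic illustration only, a rate computation can guard against an empty sample before dividing. `tokens_per_second` and its parameters are hypothetical names, not the server's API.

```cpp
// Hypothetical helper: throughput from a token count and elapsed milliseconds.
double tokens_per_second(long long n_tokens, double elapsed_ms) {
    if (n_tokens <= 0 || elapsed_ms <= 0.0) {
        return 0.0;  // nothing measured yet: report 0 instead of dividing by zero
    }
    return 1e3 * static_cast<double>(n_tokens) / elapsed_ms;
}
```

Returning 0 for an empty sample is one possible choice; a caller could equally omit the field from the report.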

b6729

10 Oct 17:03
81086cd
vocab : mark EOT token for Granite models (#16499)

* vocab : mark EOT token for Granite models

* sampling : fallback to EOS when EOT is not found
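
The fallback in the second bullet can be sketched as follows. `VocabSketch`, `TOKEN_NULL`, and `end_token` are hypothetical names, and using -1 as the "not defined" sentinel is an assumption of this sketch rather than something the release note states.

```cpp
#include <cstdint>

using token_id = std::int32_t;

// Assumed sentinel meaning "this special token is not defined".
constexpr token_id TOKEN_NULL = -1;

// Hypothetical, minimal view of a vocabulary's end-of-sequence tokens.
struct VocabSketch {
    token_id eos = TOKEN_NULL;  // end-of-sequence token
    token_id eot = TOKEN_NULL;  // end-of-turn token
};

// Prefer EOT; fall back to EOS when the vocabulary defines no EOT.
token_id end_token(const VocabSketch & vocab) {
    return vocab.eot != TOKEN_NULL ? vocab.eot : vocab.eos;
}
```

If neither token is defined, this sketch returns `TOKEN_NULL`, and the caller must handle that case.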

b6725

10 Oct 04:59
1faa13a
webui: updated the chat service to only include max_tokens in the req…