Releases · ServeurpersoCom/llama.cpp
b6750
CANN: fix CPU memory leak in CANN backend (#16549)

This commit fixes a CPU-side memory leak in the CANN backend that occurred when intermediate aclTensorList objects were not released after operator execution. The leak happened during repeated invocations of CANN ops (e.g., FlashAttention), causing host memory usage to grow over time. Proper resource cleanup (aclDestroyTensorList and related release logic) has been added to ensure that all temporary tensors are correctly freed.
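As a minimal sketch of the pattern this fix addresses (assumed shapes only, not the actual llama.cpp code: the opaque handles, declarations, and the `launch_flash_attention` helper are stand-ins for the real CANN headers):

```cpp
#include <cstdint>
#include <vector>

struct aclTensor;       // opaque handles, normally provided by CANN headers
struct aclTensorList;

aclTensorList* aclCreateTensorList(aclTensor* const* tensors, uint64_t size);
int            aclDestroyTensorList(const aclTensorList* list);
void           launch_flash_attention(aclTensorList* inputs);  // hypothetical op launch

void forward_step(std::vector<aclTensor*>& inputs) {
    // The list wrapper is allocated in host (CPU) memory.
    aclTensorList* list = aclCreateTensorList(inputs.data(), inputs.size());

    launch_flash_attention(list);

    // The fix: destroy the wrapper once the op has been issued. Before the
    // patch this release was missing, so every invocation (e.g. each
    // FlashAttention step) leaked a little host memory.
    aclDestroyTensorList(list);
}
```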
b6749
fix: add remark plugin to render raw HTML as literal text (#16505)

* fix: add remark plugin to render raw HTML as literal text

  Implemented a missing MDAST stage to neutralize raw HTML, as major LLM WebUIs do, ensuring consistent and safe Markdown rendering.

  Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the Markdown AST into plain-text equivalents while preserving indentation and line breaks. This ensures consistent rendering and prevents unintended HTML execution, without altering valid Markdown structure.

  Kept 'remarkRehype' in the pipeline, since it performs the required conversion from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization.

  Refined the link-enhancement logic to skip unnecessary DOM rewrites, fixing a subtle bug where extra paragraphs were injected after the first line due to full innerHTML reconstruction, and ensuring links open in new tabs only when required.

  Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml -> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify

* fix: address review feedback from allozaur

* chore: update webui build output
b6746
CANN: Update several operators to support FP16 data format (#16251)

Many Ascend operators internally use FP16 precision for computation. If input data is in FP32, it must first be cast to FP16 before computation and then cast back to FP32 afterwards, which introduces unnecessary cast operations. Moreover, FP16 computation requires significantly less work than FP32, leading to noticeable efficiency improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5B model shows correct accuracy and about a 10% performance gain in concurrent scenarios.

Co-authored-by: noemotiovon <[email protected]>
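A rough sketch of what such a dispatch change looks like, with hypothetical `rms_norm_f16_kernel` and `cast_kernel` declarations standing in for the Ascend primitives (assumed structure, not the actual backend code):

```cpp
#include <cstddef>

enum class dtype { F32, F16 };

// Hypothetical stand-ins for the Ascend compute and cast kernels.
void rms_norm_f16_kernel(const void* src, void* dst, size_t n);
void cast_kernel(const void* src, void* dst, size_t n, dtype from, dtype to);

void rms_norm(const void* src, void* dst, void* scratch16, size_t n, dtype t) {
    if (t == dtype::F16) {
        // New fast path: the operator accepts FP16 directly, no extra casts.
        rms_norm_f16_kernel(src, dst, n);
        return;
    }
    // FP32 path: two extra cast kernels per invocation.
    cast_kernel(src, scratch16, n, dtype::F32, dtype::F16);
    rms_norm_f16_kernel(scratch16, scratch16, n);
    cast_kernel(scratch16, dst, n, dtype::F16, dtype::F32);
}
```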
b6743
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (#16521)

* fix/refactor OP argsort, pad
* fix count-equal op
* update SYCL OP list
* fix format issue

Co-authored-by: Zhang Jianyu <[email protected]>
b6739
ggml : Fix FP16 ELU positive branch (#16519) Co-authored-by: Aaron <[email protected]>
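For context, ELU with α = 1 is defined as x for x > 0 and exp(x) − 1 otherwise; the FP16 kernel must match this reference after conversion to and from half precision:

```cpp
#include <cmath>

// Reference ELU with alpha = 1: the positive branch is the identity;
// only the non-positive branch applies exp(x) - 1.
static inline float elu(float x) {
    return x > 0.0f ? x : std::expm1(x);
}
```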
b6735
metal : fix mul-mm condition + fix mul-mv permuted kernels (#16494)
b6732
cuda : avoid initializing unused devices (#16510)
b6730
server : fix division by zero when reporting stats (#16501)
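The general shape of such a guard, assuming a throughput stat like tokens per second (an illustrative sketch, not the exact server code):

```cpp
// Report 0 rather than dividing by zero when a request finishes before the
// timer advances or processes no tokens.
static double tokens_per_second(int n_tokens, double elapsed_ms) {
    return elapsed_ms > 0.0 ? 1e3 * n_tokens / elapsed_ms : 0.0;
}
```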
b6729
vocab : mark EOT token for Granite models (#16499)

* vocab : mark EOT token for Granite models
* sampling : fallback to EOS when EOT is not found
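A minimal sketch of that fallback, with a hypothetical `Vocab` struct standing in for llama.cpp's vocabulary state:

```cpp
#include <cstdint>

// Hypothetical stand-in for the model's vocabulary metadata.
struct Vocab {
    int32_t token_eot;   // -1 when the model defines no end-of-turn token
    int32_t token_eos;
};

// Prefer EOT, but fall back to EOS when EOT is absent so chat-style
// generation can still terminate.
static int32_t end_of_turn_token(const Vocab& vocab) {
    return vocab.token_eot != -1 ? vocab.token_eot : vocab.token_eos;
}
```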
b6725
webui: updated the chat service to only include max_tokens in the req…