Releases: chaxu01/llama.cpp

b6298

27 Aug 11:10
1e74897
CANN: refactor mask handling and improve performance in FA (#15561)

* CANN(flash-attn): refactor mask handling and improve performance

1. Refactored the mask computation in Flash Attention, unifying the logic so prefill and decode no longer take separate paths.
2. Improved performance in non-ALiBi scenarios by eliminating one repeat operation.
3. Updated operator support checks to explicitly mark Flash Attention as unsupported on 310P devices and when the head dimension is not divisible by 16.
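The support-check rule in point 3 can be sketched as a small predicate. This is a hypothetical illustration, not the actual ggml-cann code; the struct and function names are assumptions.

```cpp
// Hypothetical device/op descriptor; the real ggml-cann structures differ.
struct fa_params {
    bool is_310p;   // running on an Ascend 310P device
    int  head_dim;  // per-head embedding dimension
};

// Sketch of "explicitly mark unsupported cases": Flash Attention is
// rejected on 310P devices and whenever the head dimension is not a
// multiple of 16, so the backend can fall back cleanly.
bool flash_attn_supported(const fa_params & p) {
    if (p.is_310p) {
        return false; // 310P lacks the required FA kernel
    }
    if (p.head_dim % 16 != 0) {
        return false; // kernel requires dim aligned to 16
    }
    return true;
}
```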

Signed-off-by: noemotiovon <[email protected]>

* [CANN]: address review feedback

Signed-off-by: noemotiovon <[email protected]>

* [CANN]: Optimize FA by switching the layout from BNSD to BSND

Signed-off-by: noemotiovon <[email protected]>
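The BNSD-to-BSND change above is a layout permutation. As a host-side sketch of the index mapping only (the real CANN code runs on device, and the dimension names B = batch, N = heads, S = sequence length, D = head dim are assumed from the commit title):

```cpp
#include <vector>
#include <cstddef>

// Copy a tensor stored in BNSD order into BSND order, i.e. swap the
// head and sequence axes in memory. Illustration only; not the CANN
// kernel itself.
std::vector<float> bnsd_to_bsnd(const std::vector<float> & src,
                                size_t B, size_t N, size_t S, size_t D) {
    std::vector<float> dst(src.size());
    for (size_t b = 0; b < B; ++b)
    for (size_t n = 0; n < N; ++n)
    for (size_t s = 0; s < S; ++s)
    for (size_t d = 0; d < D; ++d) {
        // src index in BNSD order, dst index in BSND order
        dst[((b * S + s) * N + n) * D + d] =
        src[((b * N + n) * S + s) * D + d];
    }
    return dst;
}
```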

---------

Signed-off-by: noemotiovon <[email protected]>

b5891

14 Jul 09:18
0d92267
llama : add jinja template for rwkv-world (#14665)

* llama : add jinja template for rwkv-world

Signed-off-by: Molly Sophia <[email protected]>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Signed-off-by: Molly Sophia <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>

b5695

18 Jun 09:55
9540255
llama-chat : fix multiple system message for gemma, orion (#14246)
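Templates like Gemma's have no dedicated system role, so handling multiple system messages typically means folding them into the first user turn. The sketch below illustrates that strategy under stated assumptions; the message struct, function name, and separators are hypothetical, not the actual #14246 code.

```cpp
#include <string>
#include <vector>

// Hypothetical message type; llama.cpp's chat code uses its own structures.
struct chat_msg { std::string role, content; };

// Concatenate consecutive system messages and prepend them to the next
// user message, so templates without a system role see one merged turn.
std::vector<chat_msg> merge_system_msgs(const std::vector<chat_msg> & in) {
    std::string sys;
    std::vector<chat_msg> out;
    for (const auto & m : in) {
        if (m.role == "system") {
            if (!sys.empty()) sys += "\n";
            sys += m.content;   // collect every pending system message
        } else if (!sys.empty() && m.role == "user") {
            out.push_back({"user", sys + "\n\n" + m.content});
            sys.clear();        // folded into this user turn
        } else {
            out.push_back(m);
        }
    }
    return out;
}
```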