
Releases: ggml-org/llama.cpp

b5150

18 Apr 07:57
2db9ba1
rpc : add RPC_CMD_HELLO (#12955)

Add RPC_CMD_HELLO for getting the version of the protocol implemented by
the server. The version follows the semantic versioning rules at https://semver.org

Hopefully this brings a better user experience when we make breaking
changes at the protocol level and avoids issues like #12465
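A client that receives the server's version could apply the semver compatibility rule roughly as follows. This is a hypothetical sketch, not the llama.cpp RPC code: the `ProtoVersion` type and the `is_compatible` helper are invented names. The sketch also assumes the client requires an exact major match and a server minor version at least as new as its own.

```python
from typing import NamedTuple


class ProtoVersion(NamedTuple):
    """Hypothetical (major, minor, patch) triple a server might report via HELLO."""
    major: int
    minor: int
    patch: int


def is_compatible(client: ProtoVersion, server: ProtoVersion) -> bool:
    """Return True if `server` can serve a client built against `client`.

    Semver rules as applied here (an assumption about client behaviour):
    - a different MAJOR means a breaking protocol change, so reject;
    - MINOR adds backward-compatible features, so the server must be >= client;
    - PATCH never affects compatibility.
    Semver treats 0.y.z as unstable; this sketch does not special-case it.
    """
    if server.major != client.major:
        return False
    return server.minor >= client.minor


if __name__ == "__main__":
    client = ProtoVersion(1, 1, 0)
    for server in (ProtoVersion(1, 1, 0), ProtoVersion(1, 3, 2),
                   ProtoVersion(1, 0, 9), ProtoVersion(2, 0, 0)):
        status = "ok" if is_compatible(client, server) else "incompatible"
        print(f"server {server.major}.{server.minor}.{server.patch}: {status}")
```

Rejecting on a major mismatch is the situation the HELLO command is meant to surface early, before incompatible commands are exchanged.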

b5149

17 Apr 16:10
2f74c35
graph : make FA compatible with MLA + add initial Metal kernels (#12953)

* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci

b5148

17 Apr 14:02
207c22e
ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970)

b5147

17 Apr 13:18
7a395f6
CANN: Add support for async operator submission (#12864)

Submit operators using asynchronous threads to improve performance.

Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.

Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
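Gating a feature on an environment variable that defaults to off can be sketched like this. Only the name `GGML_CANN_ASYNC_MODE` comes from the note above. The note does not say which values the CANN backend accepts, so the set of truthy spellings and the `async_mode_enabled` helper are assumptions.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings; the real backend may accept a different set.
_TRUTHY = {"1", "on", "yes", "true", "enable"}


def async_mode_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only if GGML_CANN_ASYNC_MODE is set to a truthy value.

    Unset or unrecognized values fall back to the documented default: disabled.
    """
    if env is None:
        env = os.environ
    value = env.get("GGML_CANN_ASYNC_MODE")
    if value is None:
        return False
    return value.strip().lower() in _TRUTHY
```

Passing an explicit mapping makes the default-off behaviour easy to check without touching the real process environment.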

b5146

17 Apr 09:39
971f245
llama : recognize IBM Granite 3.3 FIM tokens (#12988)

Granite's FIM tokens are very similar to Qwen's; the only difference is
that they use an underscore instead of a dash, for example <fim_middle>
instead of <fim-middle>.

Opening up tokenizer_config.json in ibm-granite/granite-3.3-8b-base
shows:

```
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    ...
    "<reponame>",
```
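The two vocabularies differ only in the separator, so one way to recognize both is to normalize the token text before looking it up. This is a hypothetical sketch, not the code from #12988: the `FimRole` enum and the `fim_role` helper are invented names. Only the token spellings come from the text above.

```python
from enum import Enum
from typing import Optional


class FimRole(Enum):
    PREFIX = "prefix"
    MIDDLE = "middle"
    SUFFIX = "suffix"
    PAD = "pad"


def fim_role(token_text: str) -> Optional[FimRole]:
    """Map '<fim_middle>' or '<fim-middle>' (etc.) to a FIM role, else None."""
    if not (token_text.startswith("<fim") and token_text.endswith(">")):
        return None
    # Treat '_' and '-' as the same separator so both spellings match.
    body = token_text[1:-1].replace("-", "_")  # e.g. 'fim_middle'
    if not body.startswith("fim_"):
        return None
    try:
        return FimRole(body[len("fim_"):])
    except ValueError:
        return None  # e.g. '<fim_foo>' is not a known FIM role
```

Tokens such as `<reponame>` fall outside the `<fim...>` pattern and are left to whatever handles other special tokens.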

b5145

16 Apr 22:08
12b1750
opencl: fix incorrect local_size index in profiling log (#12868)

b5144

16 Apr 19:24
015022b
vulkan: enable coopmat2 FA gqa and split_k optimizations more often (…

b5143

16 Apr 09:04
b43d89e
CANN: Add 310P operator support check (#12962)

b5142

15 Apr 20:25
80f19b4
opencl: split `ggml-opencl.cl` into multiple files and cleanup (#12886)

* opencl: refactor - split the kernel files

---------

Co-authored-by: Shangqing Gu

* opencl: split more kernels into separate files

* opencl: specify subgroup size instead of querying it

* opencl: refine Adreno cl compiler version parsing

* opencl: skip some kernels not used by Adreno on old compilers

* opencl: refine logic for selecting Adreno kernels

* opencl: refine Adreno cl compiler version

* opencl: cleanup preprocessor for kernels

* opencl: consider Adreno CL compiler on Windows

* opencl: add final newline for `mul_mv_f16_f16.cl`


b5141

15 Apr 12:55
f8f820c
metal : add FA-vec kernels for head size 96 (#12952)

ggml-ci