Releases · ggml-org/llama.cpp

18 Apr 07:57

2db9ba1

b5150

rpc : add RPC_CMD_HELLO (#12955)

Add RPC_CMD_HELLO for getting the version of the protocol implemented by
the server. The version follows the semantic versioning rules at https://semver.org

Hopefully this brings a better user experience when we make breaking
changes at the protocol level and avoids issues like #12465
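
As an illustration of how a client might act on the version returned by a hello exchange, here is a minimal sketch of a semantic-versioning compatibility check. The names `ProtoVersion`, `parse_version`, and `is_compatible` are hypothetical, and the policy (same major version, server minor at least the client's) is an assumption drawn from the general semver rules, not the check llama.cpp actually performs.

```python
from typing import NamedTuple


class ProtoVersion(NamedTuple):
    """Hypothetical container for a MAJOR.MINOR.PATCH protocol version."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> ProtoVersion:
    """Parse a plain 'MAJOR.MINOR.PATCH' string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in text.split("."))
    return ProtoVersion(major, minor, patch)


def is_compatible(client: ProtoVersion, server: ProtoVersion) -> bool:
    """Assumed policy: a differing major version is a breaking change, and the
    server must offer at least the minor version the client was built against.
    Patch versions are ignored because they carry only bug fixes under semver."""
    if client.major != server.major:
        return False
    return server.minor >= client.minor


if __name__ == "__main__":
    client = parse_version("1.2.0")
    for server_text in ("1.3.1", "2.0.0", "1.1.4"):
        ok = is_compatible(client, parse_version(server_text))
        print(f"client 1.2.0 vs server {server_text}: compatible={ok}")
```

A check like this lets a client fail early with a clear message instead of hitting an obscure error partway through a session.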

Assets 26

17 Apr 16:10

github-actions

b5149

2f74c35

graph : make FA compatible with MLA + add initial Metal kernels (#12953)

* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci

Assets 26

17 Apr 14:02

github-actions

b5148

207c22e

ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970)

Assets 26

17 Apr 13:18

github-actions

b5147

7a395f6

CANN: Add support for async operator submission (#12864)

Submit operators using asynchronous threads to improve performance.

Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.

Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
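
For readers unfamiliar with opt-in environment switches, the sketch below shows one way a default-off flag such as GGML_CANN_ASYNC_MODE could be interpreted. The helper `async_mode_enabled` and the set of accepted spellings are assumptions for illustration only; the values the CANN backend actually accepts are not stated here and should be taken from its own documentation.

```python
import os
from typing import Mapping

# Assumed spellings that turn the feature on; anything else leaves it off.
TRUTHY = {"1", "on", "true", "yes"}


def async_mode_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the variable is set to an assumed truthy value.

    An unset or empty variable means disabled, matching the
    'disabled by default' behaviour described above.
    """
    value = environ.get("GGML_CANN_ASYNC_MODE", "")
    return value.strip().lower() in TRUTHY


if __name__ == "__main__":
    print("async submission enabled:", async_mode_enabled())
```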

Assets 26

17 Apr 09:39

github-actions

b5146

971f245

llama : recognize IBM Granite 3.3 FIM tokens (#12988)

Granite's FIM tokens are very similar to Qwen's; they just use an
underscore instead of a dash. For example, <fim_middle> instead of
<fim-middle>.

Opening up tokenizer_config.json in ibm-granite/granite-3.3-8b-base
shows:

```
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    ...
    "<reponame>",
```
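
To show how these tokens are typically used, here is a minimal sketch that assembles a fill-in-the-middle prompt from the text before and after the cursor. The function `build_fim_prompt` is hypothetical, and the prefix-suffix-middle ordering is an assumption based on the common FIM convention; the model card for the Granite checkpoint is the authority on the exact template.

```python
# Token strings as listed in tokenizer_config.json above.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a prompt in the assumed prefix-suffix-middle order.

    The model is then expected to generate the text that belongs
    between `prefix` and `suffix`.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    before = "def add(a, b):\n    return "
    after = "\n\nprint(add(1, 2))\n"
    print(build_fim_prompt(before, after))
```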

Assets 26

16 Apr 22:08

github-actions

b5145

12b1750

opencl: fix incorrect local_size index in profiling log (#12868)

Assets 26

16 Apr 19:24

github-actions

b5144

015022b

vulkan: enable coopmat2 FA gqa and split_k optimizations more often (…

Assets 26

16 Apr 09:04

github-actions

b5143

b43d89e

CANN: Add 310P operator support check (#12962)

Assets 26

15 Apr 20:25

github-actions

b5142

80f19b4

opencl: split `ggml-opencl.cl` into multiple files and cleanup (#12886)

* opencl: refactor - split the kernel files

---------

Co-authored-by: Shangqing Gu <[email protected]>

* opencl: split more kernels into separate files

* opencl: specify subgroup size instead of querying it

* opencl: refine Adreno cl compiler version parsing

* opencl: skip some kernels not used by Adreno on old compilers

* opencl: refine logic for selecting Adreno kernels

* opencl: refine Adreno cl compiler version

* opencl: cleanup preprocessor for kernels

* opencl: consider Adreno CL compiler on Windows

* opencl: add final newline for `mul_mv_f16_f16.cl`

---------

Co-authored-by: Shangqing Gu <[email protected]>

Assets 26

15 Apr 12:55

github-actions

b5141

f8f820c

metal : add FA-vec kernels for head size 96 (#12952)

ggml-ci

Assets 26
