Releases: EAddario/llama.cpp

b5146

17 Apr 09:42
971f245

llama : recognize IBM Granite 3.3 FIM tokens (#12988)

Granite's FIM tokens are very similar to Qwen's; they just use an underscore
instead of a dash, e.g. `<fim_middle>` instead of `<fim-middle>`.

Opening `tokenizer_config.json` in `ibm-granite/granite-3.3-8b-base` shows:

```
    "<fim_prefix>",
    "<fim_middle>",
    "<fim_suffix>",
    "<fim_pad>",
    ...
    "<reponame>",
```
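
Recognition then reduces to matching these literal strings when the vocabulary is loaded. A minimal C++ sketch of the idea (hypothetical names, not the actual llama.cpp code), accepting both spellings:

```
#include <string>
#include <unordered_map>

// Hypothetical sketch: map special-token text to a FIM role, accepting both
// the underscore spelling used by Granite ("<fim_prefix>", ...) and the
// dash spelling ("<fim-prefix>", ...) recognized previously.
enum class fim_role { prefix, middle, suffix, pad, none };

static fim_role fim_role_for(const std::string & tok) {
    static const std::unordered_map<std::string, fim_role> table = {
        { "<fim_prefix>", fim_role::prefix }, { "<fim-prefix>", fim_role::prefix },
        { "<fim_middle>", fim_role::middle }, { "<fim-middle>", fim_role::middle },
        { "<fim_suffix>", fim_role::suffix }, { "<fim-suffix>", fim_role::suffix },
        { "<fim_pad>",    fim_role::pad    }, { "<fim-pad>",    fim_role::pad    },
    };
    const auto it = table.find(tok);
    return it == table.end() ? fim_role::none : it->second;
}
```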

b5142

16 Apr 07:43
80f19b4

opencl: split `ggml-opencl.cl` into multiple files and cleanup (#12886)

* opencl: refactor - split the kernel files

* opencl: split more kernels into separate files

* opencl: specify subgroup size instead of querying it (see the sketch after this list)

* opencl: refine Adreno cl compiler version parsing

* opencl: skip some kernels not used by Adreno on old compilers

* opencl: refine logic for selecting Adreno kernels

* opencl: refine Adreno cl compiler version

* opencl: cleanup preprocessor for kernels

* opencl: consider Adreno CL compiler on Windows

* opencl: add final newline for `mul_mv_f16_f16.cl`

---------

Co-authored-by: Shangqing Gu <[email protected]>
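
On the subgroup-size item above: rather than probing each kernel at run time, the size can be pinned at compile time when the target is known. A hedged sketch of the host-side query that such a declaration makes unnecessary (assumes OpenCL 2.1's `clGetKernelSubGroupInfo`; the Adreno attribute in the comment is an assumption, not a quote from the patch):

```
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

// Illustrative only: the per-kernel runtime probe that a compile-time
// declaration in the .cl source, e.g. (assumption for the Adreno path)
//   __attribute__((qcom_reqd_sub_group_size("full")))
// makes unnecessary.
static size_t query_max_subgroup_size(cl_kernel kernel, cl_device_id dev,
                                      size_t local_size) {
    size_t sg = 0;
    clGetKernelSubGroupInfo(kernel, dev,
                            CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE,
                            sizeof(local_size), &local_size,
                            sizeof(sg), &sg, NULL);
    return sg;
}
```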

b5141

15 Apr 15:10
f8f820c

metal : add FA-vec kernels for head size 96 (#12952)

ggml-ci

b5139

15 Apr 11:09
84778e9

CUDA/HIP: Share the same unified memory allocation logic. (#12934)

Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
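
A minimal sketch of what such an environment-variable gate can look like (illustrative, not the actual ggml code; assumes the standard CUDA runtime API, with the same logic compiling against `hipMallocManaged`/`hipMalloc` on HIP):

```
#include <cuda_runtime.h>
#include <cstdlib>

// Sketch: one binary serves integrated and dedicated GPUs by deciding at
// run time, via GGML_CUDA_ENABLE_UNIFIED_MEMORY, whether to hand out
// managed (unified) memory or plain device memory.
static cudaError_t alloc_device(void ** ptr, size_t size) {
    const char * env = std::getenv("GGML_CUDA_ENABLE_UNIFIED_MEMORY");
    if (env != nullptr && env[0] == '1') {
        return cudaMallocManaged(ptr, size); // pageable between host and device
    }
    return cudaMalloc(ptr, size);            // dedicated device memory
}
```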

b5137

15 Apr 08:02
daa4228

llama : DeepSeek V2/V3 MLA implementation (#12801)

* Merged using squash to remove all noise commit messages

* Force flash attention off for `LLM_ARCH_DEEPSEEK2` - embedding too large

* Removed 3 conts (2x RoPE and 1x RMS-norm); see the reshape/cont sketch after this list

* Changed to use `<cmath>` instead of `<math.h>`

* Reverted removal of the 3 conts

* Used `reshape` in `llm_graph_context::build_attn_mha()`

* Use `k_pe = ggml_reshape`

* Removed the 3 conts again

* Removed the 3D views of `wk_b` and `wv_b`, and just save them as 3D in GGUF

* Removed the MQA optimisation from `build_attn_mha()` as it no longer brings any gains

* Simplified `is_mla` branch in `llm_build_deepseek2()`

* Removed `build_attn_mla` and added `nullptr` to all `build_attn` calls

* Fixed call to `build_attn` in `llm_build_t5_enc`
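
Context for the `cont` items above: in ggml, `ggml_reshape` returns a view while `ggml_cont` materializes a contiguous copy, so every removed `cont` drops a full tensor copy from the graph. A standalone sketch against the public ggml API (shapes invented for illustration):

```
#include "ggml.h"

int main(void) {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // e.g. a [head_dim = 64, n_tokens = 8] activation
    struct ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 64, 8);

    // reshape: only a view, no data is moved
    struct ggml_tensor * v = ggml_reshape_3d(ctx, x, 16, 4, 8);

    // cont after a permute materializes a contiguous copy - the kind of
    // node the MLA rework removes where possible
    struct ggml_tensor * c = ggml_cont(ctx, ggml_permute(ctx, v, 0, 2, 1, 3));
    (void) c;

    ggml_free(ctx);
    return 0;
}
```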

b5133

14 Apr 22:10
d6d2c2a

Add performance print for gemma3 in example (#12929)

b5129

14 Apr 07:29

sync : ggml

ggml-ci

b5126

13 Apr 22:30
307bfa2

ggml: disable CUDA graphs for unsupported DUP and CONT node types (#1…

b5072

07 Apr 19:27
4ccea21

hellaswag: display estimated score confidence interval (#12797)
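
Presumably a binomial-proportion interval on the accuracy; a hedged sketch of the 95% normal-approximation (Wald) form, which may differ from what the PR actually computes:

```
#include <cmath>
#include <cstdio>

// Assumption: score = fraction of tasks answered correctly; the 95% Wald
// interval is p +/- 1.96 * sqrt(p * (1 - p) / n). Example numbers invented.
int main(void) {
    const double n = 400.0, correct = 312.0;
    const double p  = correct / n;
    const double se = std::sqrt(p * (1.0 - p) / n);
    std::printf("%.4f%% +/- %.4f%%\n", 100.0 * p, 100.0 * 1.96 * se);
    return 0;
}
```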