Releases: ggml-org/llama.cpp
b5150
rpc : add RPC_CMD_HELLO (#12955) Add RPC_CMD_HELLO for getting the version of the protocol implemented by the server. Follow the semantic versioning rules at https://semver.org. Hopefully this brings a better user experience when we make breaking changes at the protocol level and avoids issues like #12465.
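As an illustration of how a client might use a version reported through a hello command, here is a minimal sketch of a semantic-versioning compatibility check. It is not the llama.cpp implementation: the names `ProtoVersion`, `parse_version`, and `is_compatible` are hypothetical, and the rule "same major version means compatible" is a simplifying assumption based on the semver convention that a major bump signals a breaking change.

```python
from typing import NamedTuple


class ProtoVersion(NamedTuple):
    """Hypothetical container for a MAJOR.MINOR.PATCH protocol version."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> ProtoVersion:
    """Parse a 'MAJOR.MINOR.PATCH' string into a ProtoVersion."""
    major, minor, patch = (int(part) for part in text.split("."))
    return ProtoVersion(major, minor, patch)


def is_compatible(client: ProtoVersion, server: ProtoVersion) -> bool:
    """Assumed rule: versions are compatible when the major numbers match,
    since semver reserves major bumps for breaking changes."""
    return client.major == server.major


if __name__ == "__main__":
    client = parse_version("1.0.0")
    server = parse_version("2.0.0")
    if not is_compatible(client, server):
        print(f"protocol mismatch: client {client.major}.x, server {server.major}.x")
```

With a check like this, a client can fail early with a clear message instead of hitting an obscure error later in the session.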
b5149
graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models ggml-ci
* llama : minor naming updates ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes ggml-ci
b5148
ggml: Re-enable CUDA graphs in presence of CONT and DUP nodes (#12970)
b5147
CANN: Add support for async operator submission (#12864) Submit operators using asynchronous threads to improve performance. Use the environment variable GGML_CANN_ASYNC_MODE to control whether asynchronous submission is enabled. It is disabled by default. Testing shows a 10%–20% performance improvement in scenarios with small parameter sizes, especially in quantized models.
b5146
llama : recognize IBM Granite 3.3 FIM tokens (#12988) Granite's FIM tokens are very similar to Qwen's; the only difference is that they use an underscore instead of a dash, for example <fim_middle> instead of <fim-middle>. Opening up tokenizer_config.json in ibm-granite/granite-3.3-8b-base shows:
```
"<fim_prefix>",
"<fim_middle>",
"<fim_suffix>",
"<fim_pad>",
...
"<reponame>",
```
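To show how these underscore-style tokens are typically used, here is a minimal sketch that assembles a fill-in-the-middle prompt. The token strings come from the listing above; the prefix-suffix-middle ordering and the helper name `build_fim_prompt` are assumptions for illustration, not something the release note specifies, so check the model card before relying on the layout.

```python
# Token strings as listed in tokenizer_config.json for granite-3.3-8b-base.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt.

    Assumed layout (prefix, then suffix, then the middle marker): the model
    is expected to generate the missing middle after FIM_MIDDLE.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    prompt = build_fim_prompt("def add(a, b):\n    return ", "\n")
    print(prompt)
```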
b5145
opencl: fix incorrect local_size index in profiling log (#12868)
b5144
vulkan: enable coopmat2 FA gqa and split_k optimizations more often (…
b5143
CANN: Add 310P operator support check (#12962)
b5142
opencl: split `ggml-opencl.cl` into multiple files and cleanup (#12886)
* opencl: refactor - split the kernel files
* opencl: split more kernels into separate files
* opencl: specify subgroup size instead of querying it
* opencl: refine Adreno cl compiler version parsing
* opencl: skip some kernels not used by Adreno on old compilers
* opencl: refine logic for selecting Adreno kernels
* opencl: refine Adreno cl compiler version
* opencl: cleanup preprocessor for kernels
* opencl: consider Adreno CL compiler on Windows
* opencl: add final newline for `mul_mv_f16_f16.cl`
Co-authored-by: Shangqing Gu
b5141
metal : add FA-vec kernels for head size 96 (#12952) ggml-ci