sync : ggml #3478
Merged
Conversation
…518) The previous SVE implementation for `ggml_vec_dot_f16_unroll` contained a bug due to a copy-paste error. The wrong variable was used in an FMA instruction, leading to incorrect results. This commit corrects the variable usage and improves the clarity of the code by renaming variables to avoid confusion.
Co-authored-by: Aaron <[email protected]>
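To make the failure mode concrete, here is a minimal plain-C++ sketch of this class of bug (not the actual SVE intrinsics code): a dot product unrolled over two rows, where a copy-paste slip makes the second accumulator reuse the first row's operand in its multiply-add.

```cpp
// Plain C++ sketch of the failure mode, not the actual SVE intrinsics code:
// a dot product unrolled over two rows, where a copy-paste slip makes the
// second accumulator reuse the first row's operand in its multiply-add.
#include <cstddef>

void vec_dot_unroll2(size_t n, float * s, const float * x,
                     const float * y0, const float * y1) {
    float sum0 = 0.0f;
    float sum1 = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum0 += x[i] * y0[i];
        // buggy copy-paste variant: sum1 += x[i] * y0[i];  // wrong operand
        sum1 += x[i] * y1[i];                               // corrected operand
    }
    s[0] = sum0;
    s[1] = sum1;
}
```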
* fix/refactor OP argsort, pad
* fix count-equal op
* update SYCL OP list
* fix format issue
Co-authored-by: Zhang Jianyu <[email protected]>
* scaffold to support opt step adamw on metal (not written so far)
* add opt-step-adamw kernel for metal
* pass op->src[4] as a separate buffer to the pipeline
* add bounds check to opt-step-adamw kernel
* complete scaffold for GGML_OP_SUM
* naive GGML_OP_SUM kernel
* remove unwanted comment
* change OP_SUM capability gate
* Add has_simdgroup_reduction to both ops to pass CI
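For context, the per-parameter update such an opt-step-adamw kernel has to compute is the standard AdamW rule. The scalar C++ reference below only illustrates that math; the argument names and layout are chosen here and are not ggml's Metal interface.

```cpp
// Scalar reference of the standard AdamW update such a kernel computes per
// parameter. Argument names and layout are chosen for illustration only and
// are not ggml's Metal interface.
#include <cmath>
#include <cstddef>
#include <cstdint>

void adamw_step(size_t n, float * w, const float * g, float * m, float * v,
                float alpha, float beta1, float beta2, float eps, float wd,
                int64_t t /* 1-based iteration counter */) {
    const float bc1 = 1.0f - std::pow(beta1, (float) t);  // bias correction terms
    const float bc2 = 1.0f - std::pow(beta2, (float) t);
    for (size_t i = 0; i < n; ++i) {
        m[i] = beta1 * m[i] + (1.0f - beta1) * g[i];         // first moment
        v[i] = beta2 * v[i] + (1.0f - beta2) * g[i] * g[i];  // second moment
        const float mhat = m[i] / bc1;
        const float vhat = v[i] / bc2;
        // decoupled weight decay followed by the Adam step
        w[i] = w[i] * (1.0f - alpha * wd) - alpha * mhat / (std::sqrt(vhat) + eps);
    }
}
```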
Many Ascend operators internally use FP16 precision for computation. If input data is in FP32, it must first be cast to FP16 before computation and then cast back to FP32 afterwards, which introduces unnecessary cast operations. Moreover, FP16 computation involves significantly less work than FP32, leading to noticeable efficiency improvements. In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended to support multiple data types. Validation on the Qwen2 0.5b model shows correct accuracy and about a 10% performance gain in concurrent scenarios.
Co-authored-by: noemotiovon <[email protected]>
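The idea can be sketched as a type-dispatching wrapper. The minimal, self-contained C++ illustration below uses hypothetical `tensor`, `cast_to`, and `rms_norm_*` names as stand-ins for the real Ascend operators; it only shows how dispatching on the source type removes the cast round trip.

```cpp
// Minimal sketch of the idea, not the CANN backend code: `tensor`, `cast_to`
// and the rms_norm_* kernels are hypothetical stand-ins.
#include <vector>

enum class dtype { f32, f16 };

struct tensor {
    dtype type = dtype::f32;
    std::vector<float> data;   // storage simplified for illustration
};

static tensor cast_to(const tensor & src, dtype to) { tensor t = src; t.type = to; return t; }
static void   rms_norm_f16(const tensor & src, tensor & dst) { dst = src; }  // stub kernel
static void   rms_norm_f32(const tensor & src, tensor & dst) { dst = src; }  // stub kernel

// Old pattern: always compute in FP16, paying two casts for FP32 inputs.
void rms_norm_cast_path(const tensor & src, tensor & dst) {
    const tensor tmp = cast_to(src, dtype::f16);  // FP32 -> FP16
    rms_norm_f16(tmp, dst);
    dst = cast_to(dst, src.type);                 // FP16 -> FP32
}

// New pattern: dispatch on the source type and run natively, with no casts.
void rms_norm_native_path(const tensor & src, tensor & dst) {
    if (src.type == dtype::f16) {
        rms_norm_f16(src, dst);
    } else {
        rms_norm_f32(src, dst);
    }
}
```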
* metal: add support for opt_step_sgd
* add newline to pass EditorConfig check
This commit fixes a CPU-side memory leak issue in the CANN backend, which occurred when intermediate aclTensorList objects were not properly released after operator execution. The leak happened during repeated invocations of CANN ops (e.g., FlashAttention), leading to increasing host memory usage over time. Proper resource cleanup (aclDestroyTensorList and related release logic) has been added to ensure that all temporary tensors are correctly freed.
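A common way to guarantee the cleanup described here is an RAII guard around each temporary list. The sketch below shows just that pattern; the opaque `aclTensorList` type and the `aclDestroyTensorList` signature are assumptions, with only the function name taken from the fix description.

```cpp
// RAII guard sketch for the cleanup pattern described above. The opaque
// aclTensorList type and the aclDestroyTensorList signature are assumptions;
// only the function name comes from the fix description.
struct aclTensorList;                                              // opaque handle (assumed)
extern "C" int aclDestroyTensorList(const aclTensorList * list);   // assumed signature

class acl_tensor_list_guard {
  public:
    explicit acl_tensor_list_guard(aclTensorList * list) : list_(list) {}
    ~acl_tensor_list_guard() {
        if (list_ != nullptr) {
            aclDestroyTensorList(list_);  // free the host-side list once the op is done
        }
    }
    acl_tensor_list_guard(const acl_tensor_list_guard &)             = delete;
    acl_tensor_list_guard & operator=(const acl_tensor_list_guard &) = delete;
  private:
    aclTensorList * list_;
};

// Usage idea: wrap every intermediate list so it is released even on early returns.
//   aclTensorList * kv = build_kv_tensor_list(...);   // hypothetical helper
//   acl_tensor_list_guard guard(kv);
//   run_flash_attention(kv, ...);                      // hypothetical op call
```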
* ggml : fix build broken with -march=armv9-a on MacOS
* Add #pragma message
* Address review comment.
* Update ggml/src/ggml-cpu/ggml-cpu.c
Signed-off-by: Jie Fu <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
* metal : FA support F32 K and V and head size = 32
* graph : remove obsolete comment [no ci]
* remove legacy copy-op pointer indirection code
* further removal of copy-op indirection code
* renamed check_node_graph_compatibility_and_refresh_copy_ops function
* CUDA: kernel for larger batch sizes for MoE
* WIP
* WIP
* WIP
* WIP
* WIP
* WIP
* fixup
* tests
* Move mmq_ids_helper to mmid
* cleanup
* Remove redundant checks
* CUDA: use fastdiv + ggml_cuda_mad for mmvf
* use bf16 directly + fix formatting
* Add exception for HIP code
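The fastdiv trick replaces `n / d`, for a divisor fixed at launch time, with a precomputed multiply, add, and shift. The host-side C++ sketch below illustrates the technique (Granlund–Montgomery style magic numbers); it is an illustration of the idea, not a copy of ggml's CUDA helpers such as `ggml_cuda_mad`.

```cpp
// Host-side C++ sketch of the fastdiv idea: division by a divisor fixed at
// launch time becomes one multiply, one add and one shift. Illustration of
// the technique, not ggml's CUDA code.
#include <cassert>
#include <cstdint>
#include <cstdio>

struct fastdiv_t {
    uint32_t mp;  // low 32 bits of the magic multiplier
    uint32_t L;   // shift amount, L = ceil(log2(d))
};

static fastdiv_t fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        L++;
    }
    // full magic is 2^32 + mp = floor(2^(32+L) / d) + 1; only the low half is stored
    const uint32_t mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1);
    return { mp, L };
}

// computes n / d as ((n*mp >> 32) + n) >> L, valid for all 32-bit n
static uint32_t fastdiv_div(uint32_t n, fastdiv_t f) {
    const uint64_t hi = ((uint64_t) n * f.mp) >> 32;
    return (uint32_t) ((hi + n) >> f.L);
}

int main() {
    const uint32_t d = 7, n = 12345678;
    const fastdiv_t f = fastdiv_init(d);
    printf("%u / %u = %u (expected %u)\n", n, d, fastdiv_div(n, f), n / d);
}
```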
Enable CMP0147 so custom build steps (invoking vulkan-shader-gen) are run in parallel. Enable /MP so source files are compiled in parallel.
Signed-off-by: Stefan Savic <[email protected]>
Co-authored-by: Stefan Savic <[email protected]>
danbev approved these changes on Oct 15, 2025