
Conversation

@YushengZhao

Description

This PR adds support for the Gated Linear Attention (GLA) operator to the CANN backend of ggml. The operator is widely used in efficient attention mechanisms (e.g. RWKV and Linear Transformer variants): by introducing a gating signal and a state-accumulation mechanism, it significantly reduces computational complexity while preserving modeling capacity.
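
For reference, a common formulation of the gated linear attention recurrence (the notation is illustrative; the exact scaling and tensor layout used by ggml may differ):

$$
S_t = \operatorname{diag}(g_t)\,S_{t-1} + k_t v_t^{\top}, \qquad o_t = q_t S_t
$$

Here $S_t$ is the per-head state, $g_t$ the gate, and $k_t$, $v_t$, $q_t$ the key, value and query at step $t$.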

Summary of changes:

  • Register the GGML_OP_GATED_LINEAR_ATTN operation in ggml/src/ggml-cann/ggml-cann.cpp and bind it to the newly implemented ggml_cann_gated_linear_attn function (a dispatch sketch follows this list).
  • Implement the core logic of ggml_cann_gated_linear_attn in ggml/src/ggml-cann/aclnn_ops.cpp, using ACLNN operators (Repeat, Mul, Add, Mv, etc.) to perform the GLA forward computation.
  • Support batched multi-head GLA with input tensor layout (C, H, T, B), where C = H * D and T = B * L, following ggml's internal conventions.
  • Introduce the learnable gate g and the state s as additional inputs, supporting joint computation of the state update and the output.
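
A minimal sketch of the dispatch hook, assuming the switch-on-op structure the CANN backend already uses for other ops; the wrapper name is illustrative and this is not the verbatim diff from this PR:

    #include "ggml.h"
    #include "aclnn_ops.h"   // declares ggml_cann_gated_linear_attn (this PR)

    // Sketch: route GGML_OP_GATED_LINEAR_ATTN to the new kernel, mirroring how
    // the other ops are dispatched in ggml-cann.cpp.
    static bool ggml_cann_compute_forward_sketch(ggml_backend_cann_context & ctx,
                                                 ggml_tensor * dst) {
        switch (dst->op) {
            // ... existing ops ...
            case GGML_OP_GATED_LINEAR_ATTN:
                ggml_cann_gated_linear_attn(ctx, dst);   // implemented in aclnn_ops.cpp
                break;
            default:
                return false;
        }
        return true;
    }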

Testing

Test steps:

  1. Build the project (with the CANN backend enabled):

    cmake -B build -DGGML_CANN=ON -DCMAKE_BUILD_TYPE=Release
    cmake --build build --config Release -j
  2. Run the GLA-specific backend test (a corresponding test case must first be added to test-backend-ops.cpp; an illustrative sketch of such a test graph follows these steps):

    ./bin/test-backend-ops test -b CANN0 -o GATED_LINEAR_ATTN
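
For step 2, a sketch of the kind of test graph such a case would build, assuming the upstream ggml_gated_linear_attn signature (ctx, k, v, q, g, state, scale); the shapes and helper name are illustrative, not the exact case added to test-backend-ops.cpp:

    #include <cmath>
    #include "ggml.h"

    // Hypothetical helper: builds a GLA graph with the layout described above
    // (head size D, head count H, n_tokens T = B * L, n_seqs B).
    static ggml_tensor * build_gla_case(ggml_context * ctx,
                                        int64_t D, int64_t H, int64_t T, int64_t B) {
        ggml_tensor * q = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, D, H, T);
        ggml_tensor * k = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, D, H, T);
        ggml_tensor * v = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, D, H, T);
        ggml_tensor * g = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, D, H, T);
        ggml_tensor * s = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, D * D * H, B); // recurrent state

        return ggml_gated_linear_attn(ctx, k, v, q, g, s, 1.0f / sqrtf((float) D));
    }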

Test results:

(screenshot of the test-backend-ops output omitted)

Notes

slaren and others added 30 commits November 13, 2025 10:59
* metal: accelerated conv2d

* cont : cleanup

---------

Co-authored-by: bghira <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
…heck (ggml-org#17219)

* vulkan: remove shell call from vulkan-shaders-gen tool

* use string vector for command execution

* Fix condition

* use string, remove const_cast

* Fix dependency file quotation on Windows

---------

Co-authored-by: Jeff Bolz <[email protected]>
* Add ops needed for new hybrid models: SOFTPLUS, EXPM1, TRI, SOLVE_TRI, CUMSUM

* Update ggml/include/ggml.h

Co-authored-by: Georgi Gerganov <[email protected]>

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Code review

* Whitespace

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <[email protected]>

* This is actually sigmoid, duh.

* Add CONST, remove TRI_KEEP, other changes from review

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <[email protected]>

* Update ggml/src/ggml.c

Co-authored-by: Georgi Gerganov <[email protected]>

* Update ggml/src/ggml-cuda/unary.cu

Co-authored-by: Aman Gupta <[email protected]>

* Remove extra script

* Update ggml/src/ggml.c

Co-authored-by: Diego Devesa <[email protected]>

* Update tests/test-backend-ops.cpp

Co-authored-by: Diego Devesa <[email protected]>

* moving changes from laptop [no ci]

* pre-rebase

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update tests/test-backend-ops.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Refactor tests

* ggml : cleanup

* cont : fix ggml_fill srcs

* tests : add note

* ggml : add ggml_fill_inplace

* ggml : add asserts

* ggml : fix ggml_fill constant cast

* cont : ggml_tri minor

* Use TENSOR_LOCALS

* Fix regression from ggml-org#14596, regenerate

* Don't make commits at night...

---------

Co-authored-by: Georgi Gerganov <[email protected]>
Co-authored-by: Diego Devesa <[email protected]>
Co-authored-by: Aman Gupta <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ggml-cpu: handle 3d tensors in repack mul_mat

* Removed unnecessary branch, removed need for <algorithm>

* Fixed dst_ptr pointer in chunk + clang_format

* GGML_ASSERT to check wdata within bounds

* Accidental ggml.h inclusion

* Improved GGML_ASSERT on wdata boundaries

* Address performance regression in Qwen and llama.cpp due to chunking
* metal : refactor argsort

* cont : sort chunks

* cont : merge sorted buckets

* cont : cleanup
…nstruction (ggml-org#17048)

* fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (ggml-org#17047)

* Replace 'static' workaround, with keeping variable in scope for longer

* Create std::array directly and pass into llama_grammar_init_impl

* Add back the trigger pattern

* Missed array include
* Add AFMOE model support

* Update to vocab

* Add model sizing

* Undo Rope change for ARCEE model

* Address review comments

* Update modeling code is_sliding -> use_rope, replace hard-coded logic

* Fix AFMOE tokenizer

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Update AFMoE tokenizer class identification to be more unique

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
…gml-org#17158)

* vulkan: change graph_compute to be async and enable get_tensor_async

This allows some additional CPU/GPU overlap for large pp workloads. Also seems
to help a bit for token gen, maybe getting rid of a small bubble between
graph_compute and get_tensor.

Async set and copy functions seem to be very rarely used, so I didn't enable
them because I didn't have a good way to test them.

The async commands need to be ordered against each other, so put them all on
the compute queue. The non-async commands still use the transfer queue.

The fence for graph_compute/get_tensor_async is submitted and waited on in
ggml_vk_synchronize.
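
For illustration, the generic Vulkan submit-with-fence pattern described here (a sketch only, not the actual ggml-vulkan code; the handles are assumed to be created by the caller):

    #include <cstdint>
    #include <vulkan/vulkan.h>

    // Sketch: async work (graph_compute / get_tensor_async) is submitted on the
    // compute queue together with a fence; the fence is waited on later at the
    // synchronization point (ggml_vk_synchronize in the description above).
    static void submit_and_sync(VkDevice device, VkQueue compute_queue,
                                VkCommandBuffer cmd, VkFence fence) {
        VkSubmitInfo submit_info = {};
        submit_info.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submit_info.commandBufferCount = 1;
        submit_info.pCommandBuffers    = &cmd;

        vkQueueSubmit(compute_queue, 1, &submit_info, fence);

        // ... CPU work can overlap with the GPU between submit and wait ...

        vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
        vkResetFences(device, 1, &fence);
    }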

* fix thread safety errors

* teardown context cleanly

* Handle async read to non-pinned dst
…rg#17244)

* vulkan: Use ggml_vk_tensor_subbuffer in mul_mat_vec(id) paths

* set allow_misalign
* docs: update Vulkan ops

* vulkan: add NEG op

* vulkan: add ABS op

---------

Signed-off-by: Giuseppe Scrivano <[email protected]>
These both show up in gpt-oss. Also, cleanup the mul_mat_vec fusion code a bit.
…ersistence in chat UI (ggml-org#16618)

* webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI

- Purely visual and diagnostic change, no effect on model context, prompt
  construction, or inference behavior

- Captured assistant tool call payloads during streaming and non-streaming
  completions, and persisted them in chat state and storage for downstream use

- Exposed parsed tool call labels beneath the assistant's model info line
  with graceful fallback when parsing fails

- Added tool call badges beneath assistant responses that expose JSON tooltips
  and copy their payloads when clicked, matching the existing model badge styling

- Added a user-facing setting to toggle tool call visibility to the Developer
  settings section directly under the model selector option

* webui: remove scroll listener causing unnecessary layout updates (model selector)

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <[email protected]>

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <[email protected]>

* chore: npm run format & update webui build output

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <[email protected]>
…g#17278)

* fix: Better pointer events handling in chat processing info elements

* chore: update webui build output
…ide operator support (ggml-org#17213)

* SYCL: add generic unary op implementation for multiple ops (ABS/SGN/…); unify non-contiguous access

* SYCL: update documentation and sycl.csv to reflect new unary op support

* update ops.md after syncing SYCL.csv changes

* Fix SYCL.csv merge conflict

* Update ops.md after fixing SYCL.csv conflicts

* Fix SYCL.csv tail after merge conflict and regenerate ops.md

* Fix line endings and final newline in SYCL.csv

* Remove TOPK_MOE entries from SYCL.csv as requested

* Update ops.md after removing TOPK_MOE from SYCL.csv

* Regenerated SYCL.csv and synced ops.md with upstream

* Update ops.md using create_ops_docs.py
SmartestWashingMachine and others added 12 commits December 4, 2025 12:12
… one. (ggml-org#17749)

* conversion: use existing local chat_template.jinja file if mistral-format model has one.

* fix --mistral-format mistakenly assuming some <=v7 chat template names are file paths and reading them.

* Update convert_hf_to_gguf.py - change from exists() to is_file()

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
This commit skips the model validation check when the user specifies the
--help option.

The motivation for this is that currently an error is thrown before
--help can be processed. Validation is now skipped if params.usage is set,
allowing help to be displayed without requiring --model.

Resolves: ggml-org#17754
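
A hedged sketch of the described behavior (not the verbatim patch; model_path_missing() is a hypothetical stand-in for the real check, while params.usage is the flag named above):

    #include <stdexcept>
    #include "common.h"   // common_params, from llama.cpp's common library

    // Hypothetical placeholder for the real "--model was not provided" check.
    static bool model_path_missing(const common_params & /*params*/) { return false; }

    static void validate_args_sketch(const common_params & params) {
        if (params.usage) {
            return; // --help was requested; printing usage does not need a model
        }
        if (model_path_missing(params)) {
            throw std::invalid_argument("error: --model is required");
        }
    }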
This commit adds the missing code block end marker in simple-cmake-pkg
to correct the formatting.
* server: move msg diffs tracking to HTTP thread

* wip

* tool call tests ok

* minor : style

* cont : fix

* move states to server_response_reader

* add safe-guard

* fix

* fix 2

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* feat(wip): Port initial TRI impl from previous work

The kernel does not work and is not optimized, but the
code compiles and runs, so this will be the starting point
now that the core op has been merged.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Remove argument for constant val override

This was added in the original draft, but later removed. With this, the
kernel now passes tests.

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Move the ttype conditional to templating to avoid conditional in kernel

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Type fixes

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>

Co-authored-by: Georgi Gerganov <[email protected]>

* feat: Add softplus for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add EXPM1 for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>

* feat: Add FILL for metal

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Branchless version of tri using _ggml_vec_tri_cmp as a mask

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>

* fix: Remove unused arguments

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>

* refactor: Use select instead of branch for softplus non-vec

Branch: ggml-cumsum-tri

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* Add support for CUMSUM and TRI for CUDA.

* Minor optimizations.

* Correct warp_prefix_inclusive_sum in float2 variant to return float2

* Optimize TRI

* Whitespace

* Fix strides.

* Implement double loop

* Whitespace

* Fix HIP compilation bugs

* Optimizations + big case performance tests

* Implement using CUB with fallback to custom kernel

* Remove error message.

* Fixes from code review

* Comment out CPU-unsupported F16/BF16 cases to fix CI

* Fine, you win :P

* Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS

* Vary warp-size based on physical warp size

* Add GGML_UNUSED_VARS in tri as well

* Use constexpr and call prefix_inclusive with warp_size template param

* Update ggml/src/ggml-cuda/cumsum.cu

Co-authored-by: Johannes Gäßler <[email protected]>

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <[email protected]>

* Change to tid % warp_size

* Fix strides; hardcode mask; add ggml_lane_mask_t

* Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()

* Too hasty...

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* docs: Regen Metal.csv

Branch: UpdateOpsMd

Signed-off-by: Gabe Goodhart <[email protected]>

* docs: Regen BLAS.csv

Branch: UpdateOpsMd

Signed-off-by: Gabe Goodhart <[email protected]>

* docs: Update ops.md

Branch: UpdateOpsMd

Signed-off-by: Gabe Goodhart <[email protected]>

---------

Signed-off-by: Gabe Goodhart <[email protected]>
…g#17773)

* transform release binary root dir in tar to llama-bXXXX

* bsdtar supports -s instead of --transform
* enabled wmma instructions for most quantizations other than q2k

* fixed the last q2_k test case failure

* address comments: fix out of bound write for RDNA4, add comments after #endif

* clean up rebase: fix ne error in half2

* fix the EditorConfig CI
@noemotiovon
Owner

Thank you very much for your contribution!! The upstream community does not support this operator yet; could you contribute it directly to the upstream project? That way you can experience the full open-source workflow, and we will also review it in the community.

pwilkin and others added 14 commits December 5, 2025 12:00
* Add pwilkin to CODEOWNERS for chat files

* Reorder alphabetically
…rg#17786)

Add nosubs|optimize flags to std::regex constructors to prevent
catastrophic backtracking when processing prompts with repeated
identical characters (e.g., 'A' * 10000).

The nosubs flag disables subgroup capture, significantly reducing
memory usage and backtracking on uniform token sequences.
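
For illustration only (the actual pattern used in llama.cpp differs), this is how the two flags are combined on a std::regex:

    #include <regex>
    #include <string>

    int main() {
        // nosubs: sub-expressions are treated as non-capturing, so no match
        // groups are recorded; optimize: favor matching speed over construction.
        static const std::regex re("[A-Za-z]+",
                                   std::regex::nosubs | std::regex::optimize);

        // Degenerate input from the commit message: 10000 identical characters.
        std::string prompt(10000, 'A');
        return std::regex_search(prompt, re) ? 0 : 1;
    }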
* examples : add idle

* metal : attach residency sets to queue

* idle : add link

* idle : adjust intervals

* metal : add residency sets keep-alive heartbeat

* cont : adjust default keep-alive time
* rpc : fix alloc size logic

* rpc : bump version
* vulkan: set all memory allocations to high priority

* gate by env var
…rg#17764)

* Squashed commit of the following:

commit b3c6bf4
Author: Abhijit Ramesh <[email protected]>
Date:   Mon Dec 1 18:29:00 2025 -0800

    ggml webgpu: fix xielu parameter passing (noemotiovon#11)

    The XIELU operation was incorrectly using static_cast to convert
    float parameters to uint32_t, which converted numeric values instead
    of preserving IEEE 754 bit patterns. This caused incorrect values
    to be interpreted by the GPU shader.

    * Use reinterpret_cast to preserve float bit patterns when passing
      through uint32_t params buffer
    * Update WGSL shader parameter types from u32 to f32
    * Re-enable XIELU support (was disabled due to numerical issues)

    Fixes NMSE test failures for XIELU operation on WebGPU backend.
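
    A standalone illustration of the difference (not the ggml-webgpu code itself):

        #include <cstdint>
        #include <cstdio>
        #include <cstring>

        int main() {
            float alpha = 1.5f;

            // Numeric conversion: the value 1.5f becomes the integer 1; the
            // IEEE 754 encoding is lost.
            uint32_t numeric = static_cast<uint32_t>(alpha);

            // Bit-preserving copy: keeps the encoding 0x3FC00000, which a WGSL
            // shader can reinterpret back to an f32 on the GPU side.
            uint32_t bits;
            std::memcpy(&bits, &alpha, sizeof(bits));

            std::printf("numeric: %u  bits: 0x%08X\n", numeric, bits);
            return 0;
        }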

commit 5ca9b5e
Author: neha-ha <[email protected]>
Date:   Tue Nov 18 12:17:00 2025 -0800

    Refactored pipelines and workgroup calculations (noemotiovon#10)

    * refactored pipelines

    * refactored workgroup calculation

    * removed commented out block of prior maps

    * Clean up ceiling division pattern

    ---------

    Co-authored-by: Neha Abbas <[email protected]>
    Co-authored-by: Reese Levine <[email protected]>

Author: James Contini <[email protected]>
Date:   Wed Oct 29 23:13:06 2025 -0700

    formatted embed wgsl and ggml-webgpu.cpp

commit e1f6bae
Author: James Contini <[email protected]>
Date:   Wed Oct 29 23:08:37 2025 -0700

    implemented REPL_Template support and removed bug in unary operators kernel

commit 8c70b8f
Author: James Contini <[email protected]>
Date:   Wed Oct 15 16:14:20 2025 -0700

    responded and dealt with PR comments

commit f9282c6
Author: James Contini <[email protected]>
Date:   Sun Oct 12 13:41:41 2025 -0700

    removed unnecessary checking if node->src[1] exists for unary operators

commit 4cf28d7
Author: James Contini <[email protected]>
Date:   Sun Oct 12 13:32:45 2025 -0700

    All operators (including xielu) working

commit 74c6add
Author: James Contini <[email protected]>
Date:   Fri Oct 10 13:16:48 2025 -0700

    fixed autoconfig

commit 3627499
Author: James Contini <[email protected]>
Date:   Fri Oct 10 13:10:46 2025 -0700

    removed vestigial files

commit cb08583
Author: James Contini <[email protected]>
Date:   Fri Oct 10 12:59:32 2025 -0700

    abides by editor-config

commit 5360e28
Author: James Contini <[email protected]>
Date:   Fri Oct 10 12:45:57 2025 -0700

    rms_norm double declaration bug atoned

commit 7b09baa
Merge: 8a6ec84 74b8fc1
Author: James Contini <[email protected]>
Date:   Fri Oct 10 11:50:03 2025 -0700

    resolving merge conflicts

commit 8a6ec84
Author: James Contini <[email protected]>
Date:   Wed Oct 8 18:06:47 2025 -0700

    unary operators pass ggml tests

commit c3ae382
Author: James Contini <[email protected]>
Date:   Wed Oct 1 16:22:40 2025 -0700

    neg passes backend test

commit aa1c9b2
Author: James Contini <[email protected]>
Date:   Tue Sep 30 23:55:27 2025 -0700

    neg f16xf32xip builds and runs, haven't actually run a model that uses the neg kernel yet though

Co-authored-by: James Contini <[email protected]>
Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Abhijit Ramesh <[email protected]>

* Remove extra code and format

* Add ops documentation (finally)

* Update ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: James Contini <[email protected]>
Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Abhijit Ramesh <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* vulkan: Reduce temporary memory usage for TOP_K

- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"

For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.

* vulkan: fix top_k bug when there are ties in the input

I noticed by inspection a bug in the vulkan top_k shader where if the least
value in the top_k appears multiple times we could end up writing those extra
copies out rather than some larger values (if the larger values are on higher
numbered threads).

I rewrote the test verification to handle this case, where the final index set
is not necessarily the same.
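
A minimal standalone illustration of the verification idea (compare the top-k values rather than the exact index set; the numbers are made up):

    #include <algorithm>
    #include <cassert>
    #include <functional>
    #include <vector>

    int main() {
        // The least of the top-3 values (7) is tied three times, so several
        // index sets are equally valid answers.
        std::vector<float> src = { 9.0f, 7.0f, 7.0f, 7.0f, 8.0f };
        const size_t k = 3;

        // Reference top-k values: { 9, 8, 7 }.
        std::vector<float> ref = src;
        std::partial_sort(ref.begin(), ref.begin() + k, ref.end(), std::greater<float>());
        ref.resize(k);

        // Values gathered from whatever indices the backend returned; a buggy
        // shader that emits duplicate copies of the tied value ({ 9, 7, 7 })
        // would fail this value-based comparison.
        std::vector<float> out = { 7.0f, 9.0f, 8.0f };
        std::sort(out.begin(), out.end(), std::greater<float>());

        assert(out == ref);
        return 0;
    }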

* Update tests/test-backend-ops.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@YushengZhao force-pushed the feature/gatedlinearattn branch from 004f090 to a341f3c on December 6, 2025 at 04:11
@YushengZhao
Author

@noemotiovon The PR has been opened in the upstream community: ggml-org#17814
