
@ewykric ewykric commented Dec 3, 2025

Description:

Implements and optimizes the gated linear attention (GATED_LINEAR_ATTN) operator in the CANN backend. The main changes are:

  1. Added the required header (aclnnop/aclnn_mv.h) to support matrix-vector multiplication
  2. Implemented the complete ggml_cann_gated_linear_attn function, providing efficient gated linear attention computation on Huawei Ascend AI processors
  3. The implementation strictly follows the ggml API, ensuring seamless integration with the existing framework
    This enables the CANN backend to efficiently run models that use the gated linear attention mechanism, such as modern large language models like Llama-3.1-Nemotron.

Testing:

The implementation was fully verified with the official test framework:
[2501213363@cninfer04 llama.cpp]$ ./build/bin/test-backend-ops test -b CANN0 -o GATED_LINEAR_ATTN
Test environment:

  • Hardware platform: Ascend 310P3

  • Device memory: 44280 MB (43087 MB free)

  • Backend: CANN0
    Test results:

  • GATED_LINEAR_ATTN(type=f32,head_count=32,head_size=64,n_seq_tokens=1,n_seqs=1): OK

  • GATED_LINEAR_ATTN(type=f32,head_count=32,head_size=64,n_seq_tokens=32,n_seqs=1): OK

  • GATED_LINEAR_ATTN(type=f32,head_count=32,head_size=64,n_seq_tokens=32,n_seqs=4): OK

  • GATED_LINEAR_ATTN(type=f32,head_count=32,head_size=64,n_seq_tokens=128,n_seqs=4): OK

  • 11821/11821 tests passed

  • Backend CANN0: OK

  • 9/9 backends passed
    The tests cover combinations of batch sizes and sequence lengths, verifying the correctness and robustness of the implementation. All test cases passed, showing that the CANN backend's GATED_LINEAR_ATTN implementation matches the expected behavior.

Notes:

Core implementation details

  • Implements the complete three-step gated linear attention computation:
    1. Compute the k*v outer product
    2. Apply the gate and update the state matrix
    3. Compute the final output (a matrix-vector product of the transposed state matrix and the query vector)
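The three steps can be sketched for a single head and token in plain C++ (a minimal illustration of the recurrence described above; the names, row-major layout, and the fused gate/outer-product loop are assumptions for clarity, not the actual CANN kernel, which runs batched over sequences and heads through ACL operators):

```cpp
#include <cassert>
#include <vector>

// Single-head, single-token sketch:
//   1. outer = k_t * v_t^T             (D x D outer product)
//   2. S     = diag(g_t) * S + outer   (gated state update)
//   3. o_t   = scale * S^T * q_t       (matrix-vector product)
std::vector<float> gla_step(std::vector<float> &S,        // state, row-major S[i*D+j]
                            const std::vector<float> &q,
                            const std::vector<float> &k,
                            const std::vector<float> &v,
                            const std::vector<float> &g,  // per-row gate
                            float scale, int D) {
    // steps 1 and 2 fused: S[i][j] = g[i] * S[i][j] + k[i] * v[j]
    for (int i = 0; i < D; ++i)
        for (int j = 0; j < D; ++j)
            S[i * D + j] = g[i] * S[i * D + j] + k[i] * v[j];

    // step 3: o[j] = scale * sum_i S[i][j] * q[i]   (i.e. S^T * q)
    std::vector<float> o(D, 0.0f);
    for (int j = 0; j < D; ++j) {
        float acc = 0.0f;
        for (int i = 0; i < D; ++i)
            acc += S[i * D + j] * q[i];
        o[j] = scale * acc;
    }
    return o;
}
```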

Performance optimizations

  1. Pre-allocated buffers: avoids repeated memory allocation inside the loop, reducing allocation overhead
  2. Reusable tensors: creates fixed buffer tensors that are reused across iterations
  3. Pre-created parameter arrays: builds parameter arrays such as repeat patterns up front, avoiding repeated construction
  4. Prompt resource release: temporary tensors and resources are released immediately after use, improving memory usage
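The buffer-reuse pattern behind items 1-3 can be sketched generically (a hypothetical helper, not the actual implementation, which reuses CANN/ACL tensor handles rather than std::vector; gating is omitted for brevity):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the pre-allocation pattern: the outer-product buffer and
// the repeat-pattern parameter array are created once, outside the
// per-token loop, and reused every iteration instead of being
// reallocated per token.
void gla_accumulate(std::vector<float> &S, const float *k, const float *v,
                    int n_tokens, int D) {
    std::vector<float> outer(D * D);           // pre-allocated buffer (items 1-2)
    const std::vector<int64_t> repeats{D, 1};  // pre-created parameter array (item 3)
    (void) repeats;                            // consumed by the real operator calls

    for (int t = 0; t < n_tokens; ++t) {
        const float *kt = k + t * D;
        const float *vt = v + t * D;
        for (int i = 0; i < D; ++i)            // reuse `outer`, no per-token allocation
            for (int j = 0; j < D; ++j)
                outer[i * D + j] = kt[i] * vt[j];
        for (int i = 0; i < D * D; ++i)        // ungated state accumulation
            S[i] += outer[i];
    }
}
```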

Technical characteristics

  • Supports arbitrary batch size (B), sequence length (L), number of attention heads (H), and head dimension (D)
  • Correctly handles tensor memory layout and offset computation
  • Implements an efficient state-update mechanism that avoids unnecessary data copies
  • Uses matrix transposition and matrix-vector multiplication to improve computational efficiency

Compatibility

  • Fully compatible with the existing ggml GATED_LINEAR_ATTN operator interface
  • Follows the same input/output specification as the CUDA, SYCL, and other backends
  • Applies the same scaling-factor logic, ensuring consistent numerical results
    This implementation enables llama.cpp to run modern large language models that use gated linear attention efficiently on Huawei Ascend AI processors, extending the framework's hardware support.

@ewykric ewykric changed the title Feature/gatedlinearattn CANN: GATED_LINEAR_ATTN Dec 3, 2025
YushengZhao pushed a commit to YushengZhao/llama.cpp that referenced this pull request Dec 6, 2025
…rg#17764)

* Squashed commit of the following:

commit b3c6bf4
Author: Abhijit Ramesh <[email protected]>
Date:   Mon Dec 1 18:29:00 2025 -0800

    ggml webgpu: fix xielu parameter passing (noemotiovon#11)

    The XIELU operation was incorrectly using static_cast to convert
    float parameters to uint32_t, which converted numeric values instead
    of preserving IEEE 754 bit patterns. This caused incorrect values
    to be interpreted by the GPU shader.

    * Use reinterpret_cast to preserve float bit patterns when passing
      through uint32_t params buffer
    * Update WGSL shader parameter types from u32 to f32
    * Re-enable XIELU support (was disabled due to numerical issues)

    Fixes NMSE test failures for XIELU operation on WebGPU backend.
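    The parameter-passing bug that commit describes can be demonstrated in isolation (std::memcpy is used here as the strict-aliasing-safe way to do the bit-level copy the commit achieves with reinterpret_cast):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// static_cast converts the numeric value, so 1.5f becomes 1u and the
// fractional part is lost before the value ever reaches the shader.
uint32_t pack_f32_value_cast(float x) {
    return static_cast<uint32_t>(x);  // WRONG for bit-exact transport
}

// A bit-level copy preserves the IEEE 754 pattern, which the WGSL
// shader can then reinterpret back into an f32.
uint32_t pack_f32_bits(float x) {
    uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}
```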

commit 5ca9b5e
Author: neha-ha <[email protected]>
Date:   Tue Nov 18 12:17:00 2025 -0800

    Refactored pipelines and workgroup calculations (noemotiovon#10)

    * refactored pipelines

    * refactored workgroup calculation

    * removed commented out block of prior maps

    * Clean up ceiling division pattern

    ---------

    Co-authored-by: Neha Abbas <[email protected]>
    Co-authored-by: Reese Levine <[email protected]>
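    The ceiling-division pattern mentioned in that commit is the standard way to compute how many workgroups are needed to cover all elements (a generic sketch, not the actual dispatch code):

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division: the smallest num_groups such that
// num_groups * workgroup_size >= n_elements.
uint32_t ceil_div(uint32_t n_elements, uint32_t workgroup_size) {
    return (n_elements + workgroup_size - 1) / workgroup_size;
}
```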

Author: James Contini <[email protected]>
Date:   Wed Oct 29 23:13:06 2025 -0700

    formatted embed wgsl and ggml-webgpu.cpp

commit e1f6bae
Author: James Contini <[email protected]>
Date:   Wed Oct 29 23:08:37 2025 -0700

    implemented REPL_Template support and removed bug in unary operators kernel

commit 8c70b8f
Author: James Contini <[email protected]>
Date:   Wed Oct 15 16:14:20 2025 -0700

    responded and dealt with PR comments

commit f9282c6
Author: James Contini <[email protected]>
Date:   Sun Oct 12 13:41:41 2025 -0700

    removed unnecessary checking if node->src[1] exists for unary operators

commit 4cf28d7
Author: James Contini <[email protected]>
Date:   Sun Oct 12 13:32:45 2025 -0700

    All operators (including xielu) working

commit 74c6add
Author: James Contini <[email protected]>
Date:   Fri Oct 10 13:16:48 2025 -0700

    fixed autoconfig

commit 3627499
Author: James Contini <[email protected]>
Date:   Fri Oct 10 13:10:46 2025 -0700

    removed vestigial files

commit cb08583
Author: James Contini <[email protected]>
Date:   Fri Oct 10 12:59:32 2025 -0700

    abides by editor-config

commit 5360e28
Author: James Contini <[email protected]>
Date:   Fri Oct 10 12:45:57 2025 -0700

    rms_norm double declaration bug atoned

commit 7b09baa
Merge: 8a6ec84 74b8fc1
Author: James Contini <[email protected]>
Date:   Fri Oct 10 11:50:03 2025 -0700

    resolving merge conflicts

commit 8a6ec84
Author: James Contini <[email protected]>
Date:   Wed Oct 8 18:06:47 2025 -0700

    unary operators pass ggml tests

commit c3ae382
Author: James Contini <[email protected]>
Date:   Wed Oct 1 16:22:40 2025 -0700

    neg passes backend test

commit aa1c9b2
Author: James Contini <[email protected]>
Date:   Tue Sep 30 23:55:27 2025 -0700

    neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though

Co-authored-by: James Contini <[email protected]>
Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Abhijit Ramesh <[email protected]>

* Remove extra code and format

* Add ops documentation (finally)

* Update ggml/src/ggml-webgpu/wgsl-shaders/embed_wgsl.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: James Contini <[email protected]>
Co-authored-by: Neha Abbas <[email protected]>
Co-authored-by: Abhijit Ramesh <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>