
[WIP][Ops] Add AscendC Custom Op for Lightning Attention #7590

Draft
ChenxiQ wants to merge 7 commits into vllm-project:main from ChenxiQ:br_ascend_c_lightning_attention

Conversation

@ChenxiQ
Contributor

@ChenxiQ ChenxiQ commented Mar 24, 2026

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

ChenxiQ added 6 commits March 24, 2026 16:27
Signed-off-by: ChenxiQ <chenxi.qian.cq@outlook.com>
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description and write a clear commit message to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@ChenxiQ ChenxiQ changed the title from "Add AscendC Op for Lightning Attention" to "[WIP] Add AscendC Op for Lightning Attention" on Mar 24, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the vLLM-Ascend backend by introducing dedicated AscendC operators for Lightning Attention's decode and prefill phases. These new custom operations are crucial for optimizing performance on Ascend hardware, enabling more efficient processing of attention mechanisms in large language models. The integration includes the necessary operator definitions, tiling strategies, kernel implementations, and PyTorch bindings, ensuring seamless usability within the existing framework.

Highlights

  • New AscendC Operators: Introduced two new AscendC operators: lightning_attention_decode and lightning_attention_prefill, designed to optimize attention mechanisms on Ascend hardware.
  • Build System Integration: Integrated the new Lightning Attention operators into the build_aclnn.sh script, ensuring they are compiled and available for both ascend910b and ascend910_93 SOC versions.
  • PyTorch Bindings and Meta-Functions: Provided PyTorch bindings (npu_lightning_attention_decode, npu_lightning_attention_prefill) and corresponding meta-functions for shape and data type inference, enabling seamless use within PyTorch workflows.
  • Comprehensive Testing: Added end-to-end test cases for both lightning_attention_decode and lightning_attention_prefill, validating their functionality across various batch sizes, KV cache configurations, sequence lengths, and data types (FP16, FP32).


Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR adds new AscendC operators for Lightning Attention, including prefill and decode stages. The implementation covers kernel code, host-side logic, and PyTorch bindings. The changes are extensive and introduce significant new functionality. My review has identified several critical issues, including a functional bug in shape inference, undefined behavior in kernel code, and a broken test case. There are also several high-severity issues related to code clarity, style, and maintainability that should be addressed. For instance, there's a typo in the filename csrc/lightning_attention_decode/lightning_attention_docode_torch_adpt.h (docode should be decode), which should be corrected for consistency.

const gert::Shape* q_shape = context->GetInputShape(INDEX_IN_Q);
gert::Shape* attn_out_shape = context->GetOutputShape(INDEX_OUT_ATTN);
gert::Shape* kv_caches_shape = context->GetOutputShape(INDEX_OUT_KV_CACHES);
*attn_out_shape = *q_shape;

critical

The shape inference for attention_out is incorrect. The output shape should be 2D (batch, head_num * head_dim), but it's currently being set to the 4D shape of the query tensor. This will lead to runtime errors or incorrect behavior.

Suggested change
*attn_out_shape = *q_shape;
attn_out_shape->SetDimNum(2);
attn_out_shape->SetDim(0, q_shape->GetDim(0));
attn_out_shape->SetDim(1, q_shape->GetDim(1) * q_shape->GetDim(3));

auto helpTensor = kvCacheBuf_.Get<float>();

uint32_t mOffset;
uint32_t tmp = 0xFF800000; // -inf

critical

Using type punning via a pointer cast *((float *)&tmp) to represent negative infinity is undefined behavior in C++. This can lead to unpredictable results. A safer and more portable way is to use std::numeric_limits. Please replace this with a safe alternative, for example: const float neg_inf = -std::numeric_limits<float>::infinity(); (and include <limits>). This constant should then be used in Duplicate and SetValue calls.

Comment on lines +190 to +191
actual_seq_len = [np.random.randint(1, max_seq_len / block_size + 1) * block_size
for _ in range(batch_size)]

critical

This line uses np.random.randint, but the numpy library is not imported in this file, which will cause a NameError. Please add import numpy as np at the top of the file. This issue is also present in test_lightning_attention_prefill_with_kv_history.

@@ -0,0 +1,46 @@
/*

high

There appears to be a typo in this file's name: lightning_attention_docode_torch_adpt.h. It should likely be lightning_attention_decode_torch_adpt.h. Please rename the file for consistency and to avoid confusion.

* slotIds : required
* inputLayoutOptional : optional
* attentionOut : required
* kvCachesRef : required

high

The parameter kvCachesRef is documented twice (here and on line 26). Please remove this duplicate line to improve documentation clarity.


install(FILES ${CMAKE_CURRENT_SOURCE_DIR}/aclnn_lightning_attention_prefill.h
DESTINATION ${ACLNN_INC_INSTALL_DIR} OPTIONAL
) No newline at end of file

high

This file is missing a newline at the end. This can cause issues with some build tools and violates common coding standards. Please add a newline character at the end of the file. This issue is present in other new files in this PR as well.

{
uint32_t headStartIdx = 0;
uint32_t headEndIdx = 0;
uint32_t totalBlockCount = totalBlockCount_;

high

The local variable totalBlockCount shadows the class member totalBlockCount_, which can be confusing and error-prone. Please consider renaming the local variable to something like remainingBlockCount for better readability.

    uint32_t remainingBlockCount = totalBlockCount_;

TILING_DATA_FIELD_DEF_ARR(uint16_t, 256, blockCountPerBatch); // max batch size 256
TILING_DATA_FIELD_DEF_ARR(uint16_t, 256, tailBlockSize); // max batch size 256
TILING_DATA_FIELD_DEF_ARR(uint16_t, 50, headStart); // max aiv num: 50
TILING_DATA_FIELD_DEF_ARR(uint16_t, 50, headEnd);;

high

There's an extra semicolon at the end of this line. Please remove it to fix the syntax.

   TILING_DATA_FIELD_DEF_ARR(uint16_t, 50, headEnd);

Comment on lines +933 to +940
// lightning_attentioin_decode
ops.def(
"npu_lightning_attention_prefill(Tensor query, Tensor key, Tensor value, Tensor slope_rate, "
" int block_size, Tensor? kv_history=None, int[]? actual_seq_len=None) -> (Tensor, Tensor)"
);
ops.impl("npu_lightning_attention_prefill", torch::kPrivateUse1, &vllm_ascend::npu_lightning_attention_prefill);

// lightning_attention_prefill

high

The comments for lightning_attention_decode and lightning_attention_prefill are swapped. Additionally, lightning_attentioin_decode has a typo (attentioin). This is confusing. Please correct the comments and place them with their corresponding operator definitions.

    // lightning_attention_prefill
    ops.def(
        "npu_lightning_attention_prefill(Tensor query, Tensor key, Tensor value, Tensor slope_rate, "
        "                                int block_size, Tensor? kv_history=None, int[]? actual_seq_len=None) -> (Tensor, Tensor)"
    );
    ops.impl("npu_lightning_attention_prefill", torch::kPrivateUse1, &vllm_ascend::npu_lightning_attention_prefill);
    
    // lightning_attention_decode
    ops.def(

@gemini-code-assist
Contributor

Summary of Changes

This pull request introduces core optimizations for the vLLM-Ascend backend by implementing lightning_attention_decode and lightning_attention_prefill as custom AscendC operators. This integration aims to significantly enhance the performance of attention mechanisms on Ascend hardware, providing a more efficient foundation for large language models. The changes span from build system updates and PyTorch integration to detailed kernel implementations and dedicated testing, marking a significant step in hardware-accelerated attention processing.

Highlights

  • New AscendC Operators: Introduced two new AscendC custom operators, lightning_attention_decode and lightning_attention_prefill, specifically designed to optimize attention mechanisms on Ascend hardware.
  • Build System Integration: Integrated these new Lightning Attention operators into the build_aclnn.sh script, ensuring they are compiled and available for both ascend910b and ascend910_93 SOC versions.
  • PyTorch Bindings and Meta-Functions: Implemented PyTorch bindings (npu_lightning_attention_decode, npu_lightning_attention_prefill) and corresponding meta-functions for shape and data type inference, enabling seamless use within PyTorch workflows.
  • Comprehensive Testing: Added new end-to-end test cases for both lightning_attention_decode and lightning_attention_prefill, validating their functionality across various batch sizes, KV cache configurations, sequence lengths, and data types (FP16, FP32).


Changelog
  • csrc/build_aclnn.sh
    • Updated custom operator list to include new Lightning Attention operations for ascend910b and ascend910_93.
  • csrc/lightning_attention_decode/lightning_attention_docode_torch_adpt.h
    • Added PyTorch adapter for the Lightning Attention Decode operator (note: filename contains typo 'docode').
  • csrc/lightning_attention_decode/op_host/CMakeLists.txt
    • Added CMake build configuration for the Lightning Attention Decode host-side operator.
  • csrc/lightning_attention_decode/op_host/aclnn_lightning_attention.cpp
    • Implemented ACLNN host-side functions for Lightning Attention Decode.
  • csrc/lightning_attention_decode/op_host/aclnn_lightning_attention_decode.h
    • Declared ACLNN host-side functions for Lightning Attention Decode.
  • csrc/lightning_attention_decode/op_host/lightning_attention_decode_def.cpp
    • Defined the Lightning Attention Decode operator for CANN.
  • csrc/lightning_attention_decode/op_host/lightning_attention_decode_proto.cpp
    • Implemented shape and data type inference for Lightning Attention Decode.
  • csrc/lightning_attention_decode/op_host/lightning_attention_decode_tiling.cpp
    • Implemented tiling logic for Lightning Attention Decode.
  • csrc/lightning_attention_decode/op_host/lightning_attention_decode_tiling.h
    • Defined tiling data structures and the tiling class for Lightning Attention Decode.
  • csrc/lightning_attention_decode/op_kernel/lightning_attention_decode.cpp
    • Implemented the device-side kernel for lightning_attention_decode.
  • csrc/lightning_attention_decode/op_kernel/lightning_attention_decode.h
    • Defined the kernel operator class for lightning_attention_decode.
  • csrc/lightning_attention_prefill/lightning_attention_prefill_torch_adpt.h
    • Added PyTorch adapter for the Lightning Attention Prefill operator.
  • csrc/lightning_attention_prefill/op_host/CMakeLists.txt
    • Added CMake build configuration for the Lightning Attention Prefill host-side operator.
  • csrc/lightning_attention_prefill/op_host/aclnn_lightning_attention.cpp
    • Implemented ACLNN host-side functions for Lightning Attention Prefill.
  • csrc/lightning_attention_prefill/op_host/aclnn_lightning_attention_prefill.h
    • Declared ACLNN host-side functions for Lightning Attention Prefill.
  • csrc/lightning_attention_prefill/op_host/lightning_attention_prefill_def.cpp
    • Defined the Lightning Attention Prefill operator for CANN.
  • csrc/lightning_attention_prefill/op_host/lightning_attention_prefill_proto.cpp
    • Implemented shape and data type inference for Lightning Attention Prefill.
  • csrc/lightning_attention_prefill/op_host/lightning_attention_prefill_tiling.cpp
    • Implemented tiling logic for Lightning Attention Prefill.
  • csrc/lightning_attention_prefill/op_host/lightning_attention_prefill_tiling.h
    • Defined tiling data structures and the tiling class for Lightning Attention Prefill.
  • csrc/lightning_attention_prefill/op_kernel/lightning_attention_prefill.cpp
    • Implemented the device-side kernel for lightning_attention_prefill.
  • csrc/lightning_attention_prefill/op_kernel/lightning_attention_prefill.h
    • Defined the kernel operator class for lightning_attention_prefill.
  • csrc/torch_binding.cpp
    • Included new header files for Lightning Attention operators.
    • Registered npu_lightning_attention_prefill and npu_lightning_attention_decode with PyTorch.
  • csrc/torch_binding_meta.cpp
    • Added meta-functions for npu_lightning_attention_decode and npu_lightning_attention_prefill.
    • Registered these meta-functions with PyTorch.
  • tests/e2e/nightly/single_node/ops/singlecard_ops/test_lightning_attention_decode.py
    • Added end-to-end test cases for Lightning Attention Decode.
  • tests/e2e/nightly/single_node/ops/singlecard_ops/test_lightning_attention_prefill.py
    • Added end-to-end test cases for Lightning Attention Prefill.
Activity
  • 2026-03-24 08:36:09 UTC: github-actions[bot] provided general guidelines for pull request contributions, emphasizing single-purpose PRs, comprehensive testing, and clear commit messages.
  • 2026-03-24 08:36:25 UTC: gemini-code-assist[bot] provided an initial high-level summary of the pull request, highlighting the introduction of new AscendC operators for Lightning Attention, their integration into the build system, PyTorch bindings, and testing.
  • 2026-03-24 08:43:22 UTC: gemini-code-assist[bot] posted several critical and high-priority review comments, addressing issues such as incorrect shape inference in lightning_attention_decode_proto.cpp, unsafe type punning for negative infinity in kernel headers, missing numpy imports in test files, a typo in the filename lightning_attention_docode_torch_adpt.h, duplicate documentation in aclnn_lightning_attention_decode.h, missing newlines in several new files, a local variable shadowing a class member in lightning_attention_prefill_tiling.cpp, an extra semicolon in lightning_attention_prefill_tiling.h, and swapped/typo-ridden comments for operator registrations in torch_binding.cpp.

@ChenxiQ ChenxiQ force-pushed the br_ascend_c_lightning_attention branch 2 times, most recently from c755d48 to 52191c2 on March 24, 2026 09:10
Signed-off-by: ChenxiQ <chenxi.qian.cq@outlook.com>
@ChenxiQ ChenxiQ force-pushed the br_ascend_c_lightning_attention branch from 52191c2 to eabc966 on March 24, 2026 11:50
@ChenxiQ ChenxiQ changed the title from "[WIP] Add AscendC Op for Lightning Attention" to "[WIP][Ops] Add AscendC Custom Op for Lightning Attention" on Mar 24, 2026