[WIP][Ops] Add AscendC Custom Op for Lightning Attention #7590
ChenxiQ wants to merge 7 commits into vllm-project:main from
Conversation
Signed-off-by: ChenxiQ <chenxi.qian.cq@outlook.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the vLLM-Ascend backend by introducing dedicated AscendC operators for the decode and prefill phases of Lightning Attention. These custom operators are intended to make attention processing more efficient on Ascend hardware. The integration includes the operator definitions, tiling strategies, kernel implementations, and PyTorch bindings, so the new ops are usable within the existing framework.
Code Review
This PR adds new AscendC operators for Lightning Attention, including prefill and decode stages. The implementation covers kernel code, host-side logic, and PyTorch bindings. The changes are extensive and introduce significant new functionality. My review has identified several critical issues, including a functional bug in shape inference, undefined behavior in kernel code, and a broken test case. There are also several high-severity issues related to code clarity, style, and maintainability that should be addressed. For instance, there's a typo in the filename csrc/lightning_attention_decode/lightning_attention_docode_torch_adpt.h (docode should be decode), which should be corrected for consistency.
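For reviewers unfamiliar with this op family, the decode step can be sketched as a linear-attention recurrence in NumPy. This is a reference sketch only: the function name, the exact decay formulation, and the shapes are assumptions based on comparable lightning-attention implementations, not taken from this PR's kernel code.

```python
import numpy as np

def lightning_attention_decode_ref(q, k, v, kv_cache, slope):
    """Single-token linear-attention decode step (reference sketch).

    q, k, v:  (batch, heads, 1, head_dim)
    kv_cache: (batch, heads, head_dim, head_dim), carried across steps
    slope:    (heads,) per-head decay slopes
    """
    ratio = np.exp(-slope)[None, :, None, None]  # per-head decay factor
    # Decay the running KV state, then accumulate the new token's k^T v.
    kv_cache = ratio * kv_cache + np.einsum("bhnk,bhnv->bhkv", k, v)
    # The attention output is q projected through the KV state.
    out = np.einsum("bhnk,bhkv->bhnv", q, kv_cache)
    # Flatten to the 2D (batch, heads * head_dim) layout discussed below.
    return out.reshape(q.shape[0], -1), kv_cache
```

Note that the 2D flatten at the end matches the output layout the shape-inference review comment argues for.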
```cpp
const gert::Shape* q_shape = context->GetInputShape(INDEX_IN_Q);
gert::Shape* attn_out_shape = context->GetOutputShape(INDEX_OUT_ATTN);
gert::Shape* kv_caches_shape = context->GetOutputShape(INDEX_OUT_KV_CACHES);
*attn_out_shape = *q_shape;
```
There was a problem hiding this comment.
The shape inference for attention_out is incorrect. The output shape should be 2D (batch, head_num * head_dim), but it's currently being set to the 4D shape of the query tensor. This will lead to runtime errors or incorrect behavior.
Suggested change:
```diff
-*attn_out_shape = *q_shape;
+attn_out_shape->SetDimNum(2);
+attn_out_shape->SetDim(0, q_shape->GetDim(0));
+attn_out_shape->SetDim(1, q_shape->GetDim(1) * q_shape->GetDim(3));
```
```cpp
auto helpTensor = kvCacheBuf_.Get<float>();
uint32_t mOffset;
uint32_t tmp = 0xFF800000; // -inf
```
Using type punning via a pointer cast *((float *)&tmp) to represent negative infinity is undefined behavior in C++. This can lead to unpredictable results. A safer and more portable way is to use std::numeric_limits. Please replace this with a safe alternative, for example: const float neg_inf = -std::numeric_limits<float>::infinity(); (and include <limits>). This constant should then be used in Duplicate and SetValue calls.
```python
actual_seq_len = [np.random.randint(1, max_seq_len / block_size + 1) * block_size
                  for _ in range(batch_size)]
```
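The review overview flags a broken test case; if the intent here is block-aligned sequence lengths, integer division keeps `randint`'s upper bound an `int` (with `/` it becomes a float, which NumPy may reject or silently truncate). A sketch of the presumed intent, with illustrative values for the config-derived constants:

```python
import numpy as np

# Illustrative values; the real test presumably derives these from its config.
block_size = 16
max_seq_len = 256
batch_size = 4

# // keeps the bound integral and guarantees each sampled length is a
# positive multiple of block_size, no larger than max_seq_len.
actual_seq_len = [
    np.random.randint(1, max_seq_len // block_size + 1) * block_size
    for _ in range(batch_size)
]
```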
```cpp
/*
 * slotIds : required
 * inputLayoutOptional : optional
 * attentionOut : required
 * kvCachesRef : required
```
```cmake
install(FILES ${CMAKE_CURRENT_SOURCE_DIR}/aclnn_lightning_attention_prefill.h
        DESTINATION ${ACLNN_INC_INSTALL_DIR} OPTIONAL
)
```
No newline at end of file
```cpp
{
    uint32_t headStartIdx = 0;
    uint32_t headEndIdx = 0;
    uint32_t totalBlockCount = totalBlockCount_;
```
```cpp
TILING_DATA_FIELD_DEF_ARR(uint16_t, 256, blockCountPerBatch); // max batch size 256
TILING_DATA_FIELD_DEF_ARR(uint16_t, 256, tailBlockSize);      // max batch size 256
TILING_DATA_FIELD_DEF_ARR(uint16_t, 50, headStart);           // max aiv num: 50
TILING_DATA_FIELD_DEF_ARR(uint16_t, 50, headEnd);;
```
There is a stray duplicated semicolon after the headEnd field definition; one should be removed.
```cpp
// lightning_attentioin_decode
ops.def(
    "npu_lightning_attention_prefill(Tensor query, Tensor key, Tensor value, Tensor slope_rate, "
    " int block_size, Tensor? kv_history=None, int[]? actual_seq_len=None) -> (Tensor, Tensor)"
);
ops.impl("npu_lightning_attention_prefill", torch::kPrivateUse1, &vllm_ascend::npu_lightning_attention_prefill);

// lightning_attention_prefill
```
The comments for lightning_attention_decode and lightning_attention_prefill are swapped. Additionally, lightning_attentioin_decode has a typo (attentioin). This is confusing. Please correct the comments and place them with their corresponding operator definitions.
```cpp
// lightning_attention_prefill
ops.def(
    "npu_lightning_attention_prefill(Tensor query, Tensor key, Tensor value, Tensor slope_rate, "
    " int block_size, Tensor? kv_history=None, int[]? actual_seq_len=None) -> (Tensor, Tensor)"
);
ops.impl("npu_lightning_attention_prefill", torch::kPrivateUse1, &vllm_ascend::npu_lightning_attention_prefill);
// lightning_attention_decode
ops.def(
```
Activity
Force-pushed from c755d48 to 52191c2 (Compare)
Force-pushed from 52191c2 to eabc966 (Compare)
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?