[WIP][Ops] Add AscendC Custom Op for Lightning Attention #7590
ChenxiQ wants to merge 7 commits into vllm-project:main from
Conversation
Signed-off-by: ChenxiQ <chenxi.qian.cq@outlook.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the vLLM-Ascend backend by introducing dedicated AscendC operators for the decode and prefill phases of Lightning Attention. These custom operators are intended to make attention processing more efficient on Ascend hardware. The integration includes the operator definitions, tiling strategies, kernel implementations, and PyTorch bindings, so the new ops are usable within the existing framework.
Code Review
This PR adds new AscendC operators for Lightning Attention, including prefill and decode stages. The implementation covers kernel code, host-side logic, and PyTorch bindings. The changes are extensive and introduce significant new functionality. My review has identified several critical issues, including a functional bug in shape inference, undefined behavior in kernel code, and a broken test case. There are also several high-severity issues related to code clarity, style, and maintainability that should be addressed. For instance, there's a typo in the filename csrc/lightning_attention_decode/lightning_attention_docode_torch_adpt.h (docode should be decode), which should be corrected for consistency.
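For reviewers unfamiliar with this op family, the decode step can be sketched as a linear-attention recurrence in NumPy. This is a reference sketch only: the function name, the exact decay formulation, and the shapes are assumptions based on comparable lightning-attention implementations, not taken from this PR's kernel code.

```python
import numpy as np

def lightning_attention_decode_ref(q, k, v, kv_cache, slope):
    """Single-token linear-attention decode step (reference sketch).

    q, k, v:  (batch, heads, 1, head_dim)
    kv_cache: (batch, heads, head_dim, head_dim), carried across steps
    slope:    (heads,) per-head decay slopes
    """
    ratio = np.exp(-slope)[None, :, None, None]  # per-head decay factor
    # Decay the running KV state, then accumulate the new token's k^T v.
    kv_cache = ratio * kv_cache + np.einsum("bhnk,bhnv->bhkv", k, v)
    # The attention output is q projected through the KV state.
    out = np.einsum("bhnk,bhkv->bhnv", q, kv_cache)
    # Flatten to the 2D (batch, heads * head_dim) layout discussed below.
    return out.reshape(q.shape[0], -1), kv_cache
```

Note that the 2D flatten at the end matches the output layout the shape-inference review comment argues for.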
```cpp
const gert::Shape* q_shape = context->GetInputShape(INDEX_IN_Q);
gert::Shape* attn_out_shape = context->GetOutputShape(INDEX_OUT_ATTN);
gert::Shape* kv_caches_shape = context->GetOutputShape(INDEX_OUT_KV_CACHES);
*attn_out_shape = *q_shape;
```
There was a problem hiding this comment.
The shape inference for attention_out is incorrect. The output shape should be 2D (batch, head_num * head_dim), but it's currently being set to the 4D shape of the query tensor. This will lead to runtime errors or incorrect behavior.
Suggested change:
```diff
-*attn_out_shape = *q_shape;
+attn_out_shape->SetDimNum(2);
+attn_out_shape->SetDim(0, q_shape->GetDim(0));
+attn_out_shape->SetDim(1, q_shape->GetDim(1) * q_shape->GetDim(3));
```
```cpp
auto helpTensor = kvCacheBuf_.Get<float>();
uint32_t mOffset;
uint32_t tmp = 0xFF800000; // -inf
```
Using type punning via a pointer cast *((float *)&tmp) to represent negative infinity is undefined behavior in C++. This can lead to unpredictable results. A safer and more portable way is to use std::numeric_limits. Please replace this with a safe alternative, for example: const float neg_inf = -std::numeric_limits<float>::infinity(); (and include <limits>). This constant should then be used in Duplicate and SetValue calls.
```python
actual_seq_len = [np.random.randint(1, max_seq_len / block_size + 1) * block_size
                  for _ in range(batch_size)]
```
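The review overview flags a broken test case; if the intent here is block-aligned sequence lengths, integer division keeps `randint`'s upper bound an `int` (with `/` it becomes a float, which NumPy may reject or silently truncate). A sketch of the presumed intent, with illustrative values for the config-derived constants:

```python
import numpy as np

# Illustrative values; the real test presumably derives these from its config.
block_size = 16
max_seq_len = 256
batch_size = 4

# // keeps the bound integral and guarantees each sampled length is a
# positive multiple of block_size, no larger than max_seq_len.
actual_seq_len = [
    np.random.randint(1, max_seq_len // block_size + 1) * block_size
    for _ in range(batch_size)
]
```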
```cpp
/*
 * slotIds : required
 * inputLayoutOptional : optional
 * attentionOut : required
 * kvCachesRef : required
```
```cmake
install(FILES ${CMAKE_CURRENT_SOURCE_DIR}/aclnn_lightning_attention_prefill.h
        DESTINATION ${ACLNN_INC_INSTALL_DIR} OPTIONAL
)
```
No newline at end of file
```cpp
{
    uint32_t headStartIdx = 0;
    uint32_t headEndIdx = 0;
    uint32_t totalBlockCount = totalBlockCount_;
```
```cpp
TILING_DATA_FIELD_DEF_ARR(uint16_t, 256, blockCountPerBatch); // max batch size 256
TILING_DATA_FIELD_DEF_ARR(uint16_t, 256, tailBlockSize);      // max batch size 256
TILING_DATA_FIELD_DEF_ARR(uint16_t, 50, headStart);           // max aiv num: 50
TILING_DATA_FIELD_DEF_ARR(uint16_t, 50, headEnd);;
```
There is a stray duplicated semicolon after the headEnd field definition; one should be removed.
```cpp
// lightning_attentioin_decode
ops.def(
    "npu_lightning_attention_prefill(Tensor query, Tensor key, Tensor value, Tensor slope_rate, "
    " int block_size, Tensor? kv_history=None, int[]? actual_seq_len=None) -> (Tensor, Tensor)"
);
ops.impl("npu_lightning_attention_prefill", torch::kPrivateUse1, &vllm_ascend::npu_lightning_attention_prefill);

// lightning_attention_prefill
```
The comments for lightning_attention_decode and lightning_attention_prefill are swapped. Additionally, lightning_attentioin_decode has a typo (attentioin). This is confusing. Please correct the comments and place them with their corresponding operator definitions.
```cpp
// lightning_attention_prefill
ops.def(
    "npu_lightning_attention_prefill(Tensor query, Tensor key, Tensor value, Tensor slope_rate, "
    " int block_size, Tensor? kv_history=None, int[]? actual_seq_len=None) -> (Tensor, Tensor)"
);
ops.impl("npu_lightning_attention_prefill", torch::kPrivateUse1, &vllm_ascend::npu_lightning_attention_prefill);
// lightning_attention_decode
ops.def(
```
Activity
Force-pushed from c755d48 to 52191c2 (Compare)
Force-pushed from 52191c2 to eabc966 (Compare)
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?