[DS 3.2] Add ReshapeAndCacheByGroup Ascend ops #7382
ZT-AIA wants to merge 9 commits into vllm-project:main
Conversation
Signed-off-by: zengtian (A) <z00893411@china.huawei.com>
…to reshape_ops
Code Review
This pull request introduces a new Ascend operator, ReshapeAndCacheByGroup, to optimize caching operations. The changes include the operator's C++ implementation, kernel code, build scripts, and integration into the Python-level attention mechanism. A new unit test is also added to verify its functionality.
My review has identified a critical memory leak and a few high-severity issues related to maintainability and code clarity. Please address these points.
Additionally, per the repository's style guide, I have suggestions for the pull request title and summary to improve clarity and consistency.
Suggested PR Title:
[DS 3.2][Ops][Feature] Add ReshapeAndCacheByGroup Ascend op for optimization

Suggested PR Summary:
### What this PR does / why we need it?
This PR introduces a new Ascend operator, `ReshapeAndCacheByGroup`, to optimize the reshape, cache, and scatter operations. This is designed to leverage hardware features of Ascend NPUs for better performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A new unit test file has been added in `tests/ut/ops/test_reshape_and_cachebygroup` to verify the correctness of the new operator.

    // std::cout<<"device "<< idxGroups<<" "<< sizeof(allGroups[0])<<" "<< sizeof(uint32_t)<<" "<<device_size<<" "<<&allGroups[0]<<" "<<&allGroups[0].quotient<<std::endl;
    void* devAddr = NULL;
    aclrtMalloc(&devAddr, device_size, ACL_MEM_MALLOC_HUGE_FIRST);
There appears to be a memory leak. aclrtMalloc is called to allocate devAddr, but there is no corresponding call to aclrtFree to release this memory. Since this tiling logic is executed for each operation, this will lead to a gradual memory leak on the device. The allocated memory should be freed after it's no longer needed, likely after the kernel execution completes.
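One way to guarantee the release is an RAII guard that frees the buffer when its owning scope ends. The sketch below is minimal and self-contained: the allocator and deleter are injected as callables, because `aclrtMalloc`/`aclrtFree` exist only in the Ascend ACL runtime; the real guard would wrap those calls instead of `malloc`/`free`, and the guard name is illustrative, not from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Minimal RAII guard for a device buffer (sketch, not the actual ACL API).
// With ACL, AllocFn would wrap aclrtMalloc(&p, size, ACL_MEM_MALLOC_HUGE_FIRST)
// and FreeFn would wrap aclrtFree(p), invoked once the kernel has completed.
template <typename AllocFn, typename FreeFn>
class ScopedDeviceBuffer {
public:
    ScopedDeviceBuffer(AllocFn alloc, FreeFn release, std::size_t size)
        : release_(std::move(release)), ptr_(alloc(size)) {}
    ~ScopedDeviceBuffer() {
        if (ptr_ != nullptr) release_(ptr_);  // freed on every exit path
    }
    ScopedDeviceBuffer(const ScopedDeviceBuffer&) = delete;
    ScopedDeviceBuffer& operator=(const ScopedDeviceBuffer&) = delete;
    void* get() const { return ptr_; }
private:
    FreeFn release_;
    void* ptr_;
};
```

The free must still happen only after the kernel that consumes the buffer has finished, for example after a stream synchronize, so the guard's scope has to outlive the kernel launch.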
    #ifndef ADD_RMS_NORM_BIAS_TORCH_ADPT_H
    #define ADD_RMS_NORM_BIAS_TORCH_ADPT_H
The filename add_rms_norm_bias_torch_adpt.h and the header guard ADD_RMS_NORM_BIAS_TORCH_ADPT_H do not match the content of the file, which implements the adapter for reshape_and_cache_by_group. This is misleading and can cause maintenance issues. Please rename the file to reshape_and_cache_by_group_torch_adpt.h and update the header guard accordingly.
    -#ifndef ADD_RMS_NORM_BIAS_TORCH_ADPT_H
    -#define ADD_RMS_NORM_BIAS_TORCH_ADPT_H
    +#ifndef RESHAPE_AND_CACHE_BY_GROUP_TORCH_ADPT_H
    +#define RESHAPE_AND_CACHE_BY_GROUP_TORCH_ADPT_H
    #include "register/tilingdata_base.h"
    #include "tiling/tiling_base.h"
    // #include "op_log.h"
    #include "error_log.h"
The file error_log.h is included here. However, another error_log.h file with similar content is also added in csrc/reshape_and_cache_by_group/tiling_base/. Having duplicated utility files increases maintenance overhead. It would be better to consolidate them into a single, shared header file in a common utility directory.
vllm_ascend/attention/sfa_v1.py
Outdated
    )
    k_nope = k_nope.view(k_nope.shape[0], 1, -1)[: attn_metadata.num_actual_tokens]
    k_pe = k_pe.view(k_pe.shape[0], 1, -1)[: attn_metadata.num_actual_tokens]
    zt_block_size=128
The block size is hardcoded to 128, and this magic number is used here and again on line 1190. Magic numbers should be avoided: define it as a named constant, or retrieve it from the model or attention configuration, to improve maintainability and clarity.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: ZT-AIA <1028681969@qq.com>
    add_ops_compile_options(
        OP_NAME ReshapeAndCacheByGroup
        OPTIONS -o0
        -g
    void ReshapeAndCacheByGroupCommonTiling::PrintTilingData()
    {
        // OP_LOGD(context_->GetNodeName(), "Start WriteCacheByGroupListTilingData priting");
Use a proper debug log here, or remove the PrintTilingData function entirely.
    // OP_CHECK_NULL_WITH_CONTEXT(context_, kShape);
    auto dim_num=kShape->GetStorageShape().GetDimNum();
    if (dim_num<2||dim_num>7){
        printf("[ERROR] ReshapeAndCacheByGroup Intput first params dim < 2 || dim_num>7");
Replace this printf with the custom op log macro; printf writes to host stdout.
    const gert::RuntimeAttrs *attrs = context_->GetAttrs();
    auto slotMapping = attrs->GetListInt(0);
    uint32_t slotMappingLen = slotMapping->GetSize();
    auto slotMappingData=slotMapping->GetData();
Add nullptr checks for attrs, slotMapping, and slotMappingData before dereferencing them.
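The defensive pattern could look like the sketch below. The GE runtime types (`gert::RuntimeAttrs` and its list-int attribute) are Ascend-specific, so a hypothetical stand-in type is used to keep the example self-contained; the real code would check the actual pointers in the same order before use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the GE list-int attribute; the real type comes
// from the Ascend GE runtime headers.
struct ListIntAttr {
    std::vector<int64_t> values;
    std::size_t GetSize() const { return values.size(); }
    const int64_t* GetData() const {
        return values.empty() ? nullptr : values.data();
    }
};

// Copy the slot mapping out of the attribute, failing cleanly instead of
// dereferencing a null pointer at any level.
bool ReadSlotMapping(const ListIntAttr* slotMapping,
                     std::vector<int64_t>& out) {
    if (slotMapping == nullptr) {
        return false;  // attrs / GetListInt(0) returned null
    }
    const int64_t* data = slotMapping->GetData();
    if (data == nullptr || slotMapping->GetSize() == 0) {
        return false;  // attribute present but empty
    }
    out.assign(data, data + slotMapping->GetSize());
    return true;
}
```

In the tiling function each failed check would map to a tiling failure status rather than a bool, but the ordering of the checks is the point of the sketch.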
    //slotMapping=[7,8,9,50,51,52,53,54,55,56,57,58,59,30,31,32,33,34,35,36,37,38,39,60,61,62]
    auto kcacheShape = context_->GetInputShape(DIM_1);
    #ifdef ZTDEBUG
    std::cout<<"luanxu: "<<j<< slotMappingData[j]<<std::endl;
    #endif
    j++;
Using a binary search here would improve performance.
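Assuming the linear scan with `j++` is looking up a boundary value in the slot mapping, and the scanned range is sorted, `std::lower_bound` performs the same lookup in O(log n). A minimal sketch, with an illustrative function name not taken from the PR:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the first slot >= target in a sorted range: O(log n) via
// std::lower_bound instead of an O(n) linear scan.
std::size_t FirstSlotAtLeast(const std::vector<int64_t>& slots,
                             int64_t target) {
    auto it = std::lower_bound(slots.begin(), slots.end(), target);
    return static_cast<std::size_t>(it - slots.begin());
}
```

Note that binary search requires the scanned range to be sorted; the example slot mapping in this diff is sorted only within consecutive runs, so the lookup would have to be applied per run or after sorting.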
    uint32_t idxGroups = 0;
    while (idxSlotmap < slotMappingLen) {
Move the entire SlotMapping-compress logic into the kernel, and make the SlotMapping input a tensor. This achieves:
- Multi-kernel acceleration
- No need to copy SlotMapping to the host
- Removal of potentially large tiling data (eliminating copy and initialization overhead)
- No need to allocate device memory in tiling; there is no proper point at which to free it anyway
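For reference, the compression the tiling currently performs, collapsing runs of consecutive slot indices into (start, length) groups, can be sketched as follows; the same loop could run inside the kernel once slotMapping arrives as a device tensor. The names here are illustrative, not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct SlotGroup {
    int64_t start;   // first slot index of the run
    int64_t length;  // number of consecutive slots in the run
};

// Collapse runs of consecutive slot indices into (start, length) groups,
// e.g. [7,8,9,50,51] -> {7,3},{50,2}.
std::vector<SlotGroup> CompressSlotMapping(const std::vector<int64_t>& slots) {
    std::vector<SlotGroup> groups;
    for (int64_t s : slots) {
        if (!groups.empty() &&
            groups.back().start + groups.back().length == s) {
            ++groups.back().length;    // extend the current run
        } else {
            groups.push_back({s, 1});  // start a new run
        }
    }
    return groups;
}
```

On the example slot mapping quoted in the diff, [7,8,9,50..59,30..39,60..62], this yields four groups, which is the kind of compact tiling data the review suggests computing on device instead.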
    .DataType({ge::DT_INT8, ge::DT_FLOAT16, ge::DT_BF16})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
    .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
    this->Attr("slotMapping").AttrType(OPTIONAL).ListInt({});
What this PR does / why we need it?
This PR optimizes the reshape-and-cache and scatter-update operators based on the hardware features of Ascend NPUs.
Does this PR introduce any user-facing change?
No
How was this patch tested?