[DS 3.2] Add ReshapeAndCacheByGroup Ascend ops #7382
ZT-AIA wants to merge 9 commits into vllm-project:main
Conversation
Signed-off-by: zengtian (A) <z00893411@china.huawei.com>
…to reshape_ops
Code Review
This pull request introduces a new Ascend operator, ReshapeAndCacheByGroup, to optimize caching operations. The changes include the operator's C++ implementation, kernel code, build scripts, and integration into the Python-level attention mechanism. A new unit test is also added to verify its functionality.
My review has identified a critical memory leak and a few high-severity issues related to maintainability and code clarity. Please address these points.
Additionally, per the repository's style guide, I have suggestions for the pull request title and summary to improve clarity and consistency.
Suggested PR Title:
[DS 3.2][Ops][Feature] Add ReshapeAndCacheByGroup Ascend op for optimization

Suggested PR Summary:
### What this PR does / why we need it?
This PR introduces a new Ascend operator, `ReshapeAndCacheByGroup`, to optimize the reshape, cache, and scatter operations. This is designed to leverage hardware features of Ascend NPUs for better performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A new unit test file has been added in `tests/ut/ops/test_reshape_and_cachebygroup` to verify the correctness of the new operator.

    // std::cout<<"device "<< idxGroups<<" "<< sizeof(allGroups[0])<<" "<< sizeof(uint32_t)<<" "<<device_size<<" "<<&allGroups[0]<<" "<<&allGroups[0].quotient<<std::endl;
    void* devAddr = NULL;
    aclrtMalloc(&devAddr, device_size, ACL_MEM_MALLOC_HUGE_FIRST);
There appears to be a memory leak. aclrtMalloc is called to allocate devAddr, but there is no corresponding call to aclrtFree to release this memory. Since this tiling logic is executed for each operation, this will lead to a gradual memory leak on the device. The allocated memory should be freed after it's no longer needed, likely after the kernel execution completes.
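One way to guarantee the release is an RAII guard that frees the buffer when its owning scope ends. The sketch below is minimal and self-contained: the allocator and deleter are injected as callables, because `aclrtMalloc`/`aclrtFree` exist only in the Ascend ACL runtime; the real guard would wrap those calls instead of `malloc`/`free`, and the guard name is illustrative, not from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>

// Minimal RAII guard for a device buffer (sketch, not the actual ACL API).
// With ACL, AllocFn would wrap aclrtMalloc(&p, size, ACL_MEM_MALLOC_HUGE_FIRST)
// and FreeFn would wrap aclrtFree(p), invoked once the kernel has completed.
template <typename AllocFn, typename FreeFn>
class ScopedDeviceBuffer {
public:
    ScopedDeviceBuffer(AllocFn alloc, FreeFn release, std::size_t size)
        : release_(std::move(release)), ptr_(alloc(size)) {}
    ~ScopedDeviceBuffer() {
        if (ptr_ != nullptr) release_(ptr_);  // freed on every exit path
    }
    ScopedDeviceBuffer(const ScopedDeviceBuffer&) = delete;
    ScopedDeviceBuffer& operator=(const ScopedDeviceBuffer&) = delete;
    void* get() const { return ptr_; }
private:
    FreeFn release_;
    void* ptr_;
};
```

The free must still happen only after the kernel that consumes the buffer has finished, for example after a stream synchronize, so the guard's scope has to outlive the kernel launch.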
    #ifndef ADD_RMS_NORM_BIAS_TORCH_ADPT_H
    #define ADD_RMS_NORM_BIAS_TORCH_ADPT_H
The filename add_rms_norm_bias_torch_adpt.h and the header guard ADD_RMS_NORM_BIAS_TORCH_ADPT_H do not match the content of the file, which implements the adapter for reshape_and_cache_by_group. This is misleading and can cause maintenance issues. Please rename the file to reshape_and_cache_by_group_torch_adpt.h and update the header guard accordingly.
    -#ifndef ADD_RMS_NORM_BIAS_TORCH_ADPT_H
    -#define ADD_RMS_NORM_BIAS_TORCH_ADPT_H
    +#ifndef RESHAPE_AND_CACHE_BY_GROUP_TORCH_ADPT_H
    +#define RESHAPE_AND_CACHE_BY_GROUP_TORCH_ADPT_H
    #include "register/tilingdata_base.h"
    #include "tiling/tiling_base.h"
    // #include "op_log.h"
    #include "error_log.h"
The file error_log.h is included here. However, another error_log.h file with similar content is also added in csrc/reshape_and_cache_by_group/tiling_base/. Having duplicated utility files increases maintenance overhead. It would be better to consolidate them into a single, shared header file in a common utility directory.
vllm_ascend/attention/sfa_v1.py
Outdated
    )
    k_nope = k_nope.view(k_nope.shape[0], 1, -1)[: attn_metadata.num_actual_tokens]
    k_pe = k_pe.view(k_pe.shape[0], 1, -1)[: attn_metadata.num_actual_tokens]
    zt_block_size=128
The block size is hardcoded to 128, and this magic number is used here and again on line 1190. Magic numbers should be avoided: define it as a named constant, or retrieve it from the model or attention configuration, to improve maintainability and clarity.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: ZT-AIA <1028681969@qq.com>
    add_ops_compile_options(
        OP_NAME ReshapeAndCacheByGroup
        OPTIONS -o0
        -g
    void ReshapeAndCacheByGroupCommonTiling::PrintTilingData()
    {
        // OP_LOGD(context_->GetNodeName(), "Start WriteCacheByGroupListTilingData priting");
Use a proper debug log here, or remove the PrintTilingData function entirely.
    // OP_CHECK_NULL_WITH_CONTEXT(context_, kShape);
    auto dim_num=kShape->GetStorageShape().GetDimNum();
    if (dim_num<2||dim_num>7){
        printf("[ERROR] ReshapeAndCacheByGroup Intput first params dim < 2 || dim_num>7");
Replace this printf with the custom op log macro; printf writes to host stdout.
    const gert::RuntimeAttrs *attrs = context_->GetAttrs();
    auto slotMapping = attrs->GetListInt(0);
    uint32_t slotMappingLen = slotMapping->GetSize();
    auto slotMappingData=slotMapping->GetData();
Add nullptr checks for attrs, slotMapping, and slotMappingData before dereferencing them.
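The defensive pattern could look like the sketch below. The GE runtime types (`gert::RuntimeAttrs` and its list-int attribute) are Ascend-specific, so a hypothetical stand-in type is used to keep the example self-contained; the real code would check the actual pointers in the same order before use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the GE list-int attribute; the real type comes
// from the Ascend GE runtime headers.
struct ListIntAttr {
    std::vector<int64_t> values;
    std::size_t GetSize() const { return values.size(); }
    const int64_t* GetData() const {
        return values.empty() ? nullptr : values.data();
    }
};

// Copy the slot mapping out of the attribute, failing cleanly instead of
// dereferencing a null pointer at any level.
bool ReadSlotMapping(const ListIntAttr* slotMapping,
                     std::vector<int64_t>& out) {
    if (slotMapping == nullptr) {
        return false;  // attrs / GetListInt(0) returned null
    }
    const int64_t* data = slotMapping->GetData();
    if (data == nullptr || slotMapping->GetSize() == 0) {
        return false;  // attribute present but empty
    }
    out.assign(data, data + slotMapping->GetSize());
    return true;
}
```

In the tiling function each failed check would map to a tiling failure status rather than a bool, but the ordering of the checks is the point of the sketch.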
    //slotMapping=[7,8,9,50,51,52,53,54,55,56,57,58,59,30,31,32,33,34,35,36,37,38,39,60,61,62]
    auto kcacheShape = context_->GetInputShape(DIM_1);
    #ifdef ZTDEBUG
    std::cout<<"luanxu: "<<j<< slotMappingData[j]<<std::endl;
    #endif
    j++;
Using a binary search here would improve performance.
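Assuming the linear scan with `j++` is looking up a boundary value in the slot mapping, and the scanned range is sorted, `std::lower_bound` performs the same lookup in O(log n). A minimal sketch, with an illustrative function name not taken from the PR:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Index of the first slot >= target in a sorted range: O(log n) via
// std::lower_bound instead of an O(n) linear scan.
std::size_t FirstSlotAtLeast(const std::vector<int64_t>& slots,
                             int64_t target) {
    auto it = std::lower_bound(slots.begin(), slots.end(), target);
    return static_cast<std::size_t>(it - slots.begin());
}
```

Note that binary search requires the scanned range to be sorted; the example slot mapping in this diff is sorted only within consecutive runs, so the lookup would have to be applied per run or after sorting.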
    uint32_t idxGroups = 0;
    while (idxSlotmap < slotMappingLen) {
Move the entire SlotMapping-compress logic into the kernel, and make the SlotMapping input a tensor. This achieves:
- Multi-kernel acceleration
- No need to copy SlotMapping to the host
- Removal of potentially large tiling data (eliminating copy and initialization overhead)
- No need to allocate device memory in tiling; there is no proper point at which to free it anyway
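For reference, the compression the tiling currently performs, collapsing runs of consecutive slot indices into (start, length) groups, can be sketched as follows; the same loop could run inside the kernel once slotMapping arrives as a device tensor. The names here are illustrative, not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct SlotGroup {
    int64_t start;   // first slot index of the run
    int64_t length;  // number of consecutive slots in the run
};

// Collapse runs of consecutive slot indices into (start, length) groups,
// e.g. [7,8,9,50,51] -> {7,3},{50,2}.
std::vector<SlotGroup> CompressSlotMapping(const std::vector<int64_t>& slots) {
    std::vector<SlotGroup> groups;
    for (int64_t s : slots) {
        if (!groups.empty() &&
            groups.back().start + groups.back().length == s) {
            ++groups.back().length;    // extend the current run
        } else {
            groups.push_back({s, 1});  // start a new run
        }
    }
    return groups;
}
```

On the example slot mapping quoted in the diff, [7,8,9,50..59,30..39,60..62], this yields four groups, which is the kind of compact tiling data the review suggests computing on device instead.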
    .DataType({ge::DT_INT8, ge::DT_FLOAT16, ge::DT_BF16})
    .Format({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND})
    .UnknownShapeFormat({ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND});
    this->Attr("slotMapping").AttrType(OPTIONAL).ListInt({});
What this PR does / why we need it?
This PR optimizes the reshape-and-cache and scatter-update operators based on the hardware features of Ascend NPUs.
Does this PR introduce any user-facing change?
No
How was this patch tested?