A5 support reshape and cache in CP situation #7636
lenghuixing0330 wants to merge 4 commits into vllm-project:main
Conversation
Signed-off-by: lenghuixing0330 <2531948770@qq.com>
Summary of Changes (Gemini Code Assist): This pull request enhances A5 device support for KV cache reshape-and-cache operations within the context parallel attention module. It streamlines interaction with device-specific operations by routing them through a dedicated operator abstraction.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.
Code Review
This pull request refactors the reshape_and_cache operation by introducing a DeviceOperator abstraction. It moves the direct calls to torch_npu._npu_reshape_and_cache into this new operator, improving modularity and maintainability. Additionally, the reshape_and_cache method within the A5DeviceAdaptor is updated to ensure that the key, value, and slot_mapping tensors are contiguous before being passed to torch_npu.npu_scatter_pa_kv_cache, which can enhance performance and prevent potential issues with non-contiguous data. The argument name slot_indices is also updated to slot_mapping for consistency. There are no review comments to address.
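The DeviceOperator indirection described in the review can be sketched roughly as follows. All class and function names here are illustrative, not the actual vllm-ascend identifiers; on real hardware the two paths would call torch_npu._npu_reshape_and_cache and torch_npu.npu_scatter_pa_kv_cache, which are stubbed out below with returned tags:

```python
class DeviceOperator:
    """Hypothetical base interface for device-specific KV-cache ops."""

    def reshape_and_cache(self, key, value, slot_mapping):
        raise NotImplementedError


class DefaultDeviceOperator(DeviceOperator):
    def reshape_and_cache(self, key, value, slot_mapping):
        # Real code would call torch_npu._npu_reshape_and_cache(...) here.
        return ("npu_reshape_and_cache", key, value, slot_mapping)


class A5DeviceOperator(DeviceOperator):
    def reshape_and_cache(self, key, value, slot_mapping):
        # A5 path: the PR ensures inputs are contiguous before the real code
        # calls torch_npu.npu_scatter_pa_kv_cache(...).
        return ("npu_scatter_pa_kv_cache", key, value, slot_mapping)


def get_device_operator(soc_version: str) -> DeviceOperator:
    """Pick the operator for the current device (illustrative dispatch)."""
    return A5DeviceOperator() if soc_version == "A5" else DefaultDeviceOperator()
```

Callers then invoke op.reshape_and_cache(...) without branching on the device type themselves, which is the modularity benefit the review refers to.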
What this PR does / why we need it?
The A5 reshape-and-cache operators require that their inputs be contiguous. In some scenarios, such as sequence parallelism, certain operations produce non-contiguous tensors, e.g. slicing with a stride:

slot_mapping = attn_metadata.slot_mapping[: num_decode_tokens * self.pcp_size : self.pcp_size]

Here slot_mapping is a non-contiguous view and must be made contiguous before being passed to the operator.
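To illustrate why such a strided slice is non-contiguous, here is a minimal sketch using NumPy, which shares PyTorch's strided-view semantics; the pcp_size and num_decode_tokens values are made up for the example:

```python
import numpy as np

# Toy stand-ins for the real attention metadata (values are illustrative).
pcp_size = 2
num_decode_tokens = 4
slot_mapping_full = np.arange(16)

# Strided slice like the one in the PR description: every pcp_size-th slot.
slot_mapping = slot_mapping_full[: num_decode_tokens * pcp_size : pcp_size]
print(slot_mapping.flags["C_CONTIGUOUS"])  # → False: the view skips elements

# Kernels that require contiguous inputs need a materialized copy
# (the PyTorch equivalent is slot_mapping.contiguous()).
slot_mapping_contig = np.ascontiguousarray(slot_mapping)
print(slot_mapping_contig.flags["C_CONTIGUOUS"])  # → True
```

The copy is a no-op-sized cost here, but it is what allows the A5 kernel's contiguity requirement to be satisfied without changing the slicing logic in the attention module.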
Does this PR introduce any user-facing change?
How was this patch tested?