Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces several bug fixes aimed at enhancing the stability and correctness of the attention and speculative decoding code paths.
Code Review
What this PR does / why we need it?
This pull request introduces several internal adjustments related to attention mechanisms and speculative decoding. It optimizes the _align_to_graph_bucket_tokens function in sfa_cp.py to skip unnecessary alignment when pcp_size is 1. Additionally, in eagle_proposer.py, it adds a null check before accessing attn_metadata.decode.cp_seq_len to prevent a potential AttributeError. However, a critical issue was identified: ori_seq_len is explicitly moved to the CPU, which will likely cause a device mismatch error when this tensor is subsequently used in operations with GPU tensors inside self.runner.pcp_manager._get_cp_local_seq_lens. This needs to be corrected so that ori_seq_len remains on the GPU.
Does this PR introduce any user-facing change?
No, this PR focuses on internal optimizations and bug fixes that do not introduce any user-facing changes.
How was this patch tested?
Existing unit and integration tests should cover these changes. Additional tests might be needed to specifically verify the fix for the AttributeError and to ensure the device mismatch issue is resolved.
| ) | ||
| num_accept_tokens = query_lens_d.to(self.device) - num_reject_tokens | ||
| ori_seq_len = attn_metadata_i.seq_lens[:batch_size].clone() | ||
| ori_seq_len = attn_metadata_i.seq_lens[:batch_size].clone().to(device="cpu") |
Moving ori_seq_len to the CPU with .to(device="cpu") will likely cause a device mismatch error. This tensor is used in self.runner.pcp_manager._get_cp_local_seq_lens, which performs operations involving tensors created on the default GPU device (e.g., via torch.arange). Performing an operation between a CPU tensor and a GPU tensor raises a runtime error, so ori_seq_len should remain on the GPU.
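A minimal sketch of the device-matching pattern the review is asking for. The helper below is hypothetical (the real `_get_cp_local_seq_lens` is not shown in this thread, and the per-rank splitting math here is illustrative only); the point is that any `torch.arange` inside such a helper should be created with `device=ori_seq_len.device`, which only works if `ori_seq_len` was never forced onto the CPU.

```python
import torch

def get_cp_local_seq_lens(ori_seq_len: torch.Tensor, cp_size: int = 2) -> torch.Tensor:
    # Hypothetical stand-in for pcp_manager._get_cp_local_seq_lens: the
    # arange is placed on the SAME device as ori_seq_len, so mixing a CPU
    # ori_seq_len with a GPU arange (or vice versa) cannot happen here.
    positions = torch.arange(cp_size, device=ori_seq_len.device)
    # Illustrative split of each sequence length across cp_size ranks.
    return (ori_seq_len.unsqueeze(1) + positions) // cp_size

seq_lens = torch.tensor([8, 12, 16])          # on the runner's default device in practice
ori_seq_len = seq_lens.clone()                # clone() keeps the original device; no .to("cpu")
local = get_cp_local_seq_lens(ori_seq_len)    # shape (batch, cp_size), same device throughout
```

Keeping the clone on its original device (rather than calling `.to(device="cpu")`) is what the suggested change below restores.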
Suggested change:

```diff
-ori_seq_len = attn_metadata_i.seq_lens[:batch_size].clone().to(device="cpu")
+ori_seq_len = attn_metadata_i.seq_lens[:batch_size].clone()
```
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
| ) | ||
| num_accept_tokens = query_lens_d.to(self.device) - num_reject_tokens | ||
| ori_seq_len = attn_metadata_i.seq_lens[:batch_size].clone() | ||
| ori_seq_len = attn_metadata_i.seq_lens[:batch_size].clone().to(device="cpu") |
Please refactor the MLA attention metadata and SFA metadata instead: add a seq_lens_cpu field to them to avoid the frequent host/device synchronization.
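A sketch of the refactor the reviewer suggests, under stated assumptions: the metadata class name and fields here are hypothetical (the real MLA/SFA metadata classes are not shown in this thread). The idea is to perform the device-to-host copy once when the metadata is built and cache it as `seq_lens_cpu`, so per-step code reads the cached CPU tensor instead of repeatedly calling `.to("cpu")` and forcing a synchronization each time.

```python
from dataclasses import dataclass
from typing import Optional

import torch

@dataclass
class SFAAttentionMetadata:
    """Hypothetical metadata sketch: seq_lens lives on the device,
    seq_lens_cpu is a one-time cached host copy."""
    seq_lens: torch.Tensor
    seq_lens_cpu: Optional[torch.Tensor] = None

    def __post_init__(self):
        if self.seq_lens_cpu is None:
            # Single device-to-host copy at metadata-build time, instead of
            # one synchronizing copy per decoding step.
            self.seq_lens_cpu = self.seq_lens.to("cpu")

meta = SFAAttentionMetadata(seq_lens=torch.tensor([8, 12, 16]))
# Host-side consumers read the cached copy; device-side code keeps seq_lens.
batch_cpu_lens = meta.seq_lens_cpu[:2]
```

With this shape, call sites like the one in this diff could use `attn_metadata_i.seq_lens_cpu[:batch_size]` for host-side bookkeeping without touching the device tensor.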
What this PR does / why we need it?
Fixed the issue where DCP overlaps with MTP in the ds3.2 scenario.
Does this PR introduce any user-facing change?
No
How was this patch tested?