[bugfix]change log2phy map to npu #3282
base: main
Conversation
[BugFix] Modify eplb feature guide.
Signed-off-by: offline0806 <[email protected]>
Code Review
This pull request addresses a bug causing EPLB failures by ensuring the `log2phy` map is on the correct NPU device. The fix is applied consistently across three files where this logic is present. While the fix is correct, I've identified a critical potential bug in the called function `determine_default_log2phy_map` and a high-severity maintainability issue due to code duplication. Please see my detailed comments.
  self.log2phy = determine_default_log2phy_map(
      self.global_num_experts, self.ep_size, self.ep_rank,
-     self.global_redundant_expert_num)
+     self.global_redundant_expert_num).npu()
This change correctly moves the `log2phy` tensor to the NPU device, fixing a device mismatch bug.

However, there are two related points to consider:

- Potential Bug in `determine_default_log2phy_map`: The called function `determine_default_log2phy_map` in `vllm_ascend/eplb/core/eplb_utils.py` appears to have a bug. On line 122, it uses `rank_id` inside a loop that iterates over ranks with variable `r` (`for r in range(world_size):`). The condition should likely be `r < global_redundant_expert_num` instead of `rank_id < global_redundant_expert_num`. Because `expert_map_all` is constructed for all ranks within this loop, using `rank_id` will result in an incorrect map for all ranks `r != rank_id`. This will cause `generate_log2phy_map` to compute an incorrect `log2phy_map_all`, ultimately providing a faulty map for the current rank. This is a critical issue that should be investigated and fixed (an illustrative sketch follows after this comment).
- Code Duplication: This same fix is required in three files (`vllm_ascend/ops/common_fused_moe.py`, `vllm_ascend/ops/fused_moe.py`, and `vllm_ascend/torchair/ops/torchair_fused_moe.py`) due to duplicated initialization logic. This code duplication is a maintainability risk, as demonstrated by this bug appearing in multiple places. I recommend refactoring this logic into a shared helper function or a base class method in a follow-up PR to improve maintainability (a sketch of such a helper also follows below).
Given the critical nature of the potential bug, I recommend addressing it.
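To make the first point concrete, here is a small self-contained illustration. It is not the real `determine_default_log2phy_map`; the names (`world_size`, `rank_id`, `global_redundant_expert_num`) are just stand-ins for the identifiers mentioned above, and the map-building logic is simplified to a single flag per rank.

```python
# Illustrative sketch only -- not the actual code in eplb_utils.py.
# It shows why gating on the fixed rank_id instead of the loop variable r
# repeats the current rank's decision for every row of a per-rank map.
import torch

world_size = 4
global_redundant_expert_num = 2
rank_id = 3  # the rank executing the function

def build_per_rank_flags(use_loop_var: bool) -> torch.Tensor:
    rows = []
    for r in range(world_size):
        idx = r if use_loop_var else rank_id
        # Flag whether rank r should hold a redundant expert slot.
        rows.append(1 if idx < global_redundant_expert_num else 0)
    return torch.tensor(rows)

print(build_per_rank_flags(use_loop_var=False))  # tensor([0, 0, 0, 0]): rank 3's answer repeated for every rank
print(build_per_rank_flags(use_loop_var=True))   # tensor([1, 1, 0, 0]): each rank's own answer
```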
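For the duplication point, one possible shape for the shared helper is sketched below. The function name and its module location are purely illustrative, not existing vllm-ascend APIs; it only wraps the call that is currently repeated in the three files.

```python
# Hypothetical shared helper -- name and placement are suggestions only.
import torch

from vllm_ascend.eplb.core.eplb_utils import determine_default_log2phy_map


def init_log2phy_map_on_npu(global_num_experts: int, ep_size: int, ep_rank: int,
                            global_redundant_expert_num: int) -> torch.Tensor:
    """Build the default log2phy map and move it to the NPU in one place."""
    log2phy = determine_default_log2phy_map(global_num_experts, ep_size, ep_rank,
                                            global_redundant_expert_num)
    return log2phy.npu()
```

Each of the three MoE classes could then call `self.log2phy = init_log2phy_map_on_npu(self.global_num_experts, self.ep_size, self.ep_rank, self.global_redundant_expert_num)`, so any future device handling change only has to be made once.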
  self.log2phy = determine_default_log2phy_map(
      self.global_num_experts, self.ep_size, self.ep_rank,
-     self.global_redundant_expert_num)
+     self.global_redundant_expert_num).npu()
  self.log2phy = determine_default_log2phy_map(
      self.global_num_experts, self.ep_size, self.ep_rank,
-     self.global_redundant_expert_num)
+     self.global_redundant_expert_num).npu()
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
What this PR does / why we need it?
Resolves an EPLB failure caused by the log2phy map changing device type when using MTP rotary position encoding.
Does this PR introduce any user-facing change?
How was this patch tested?
vllm-project/vllm@releases/v0.11.0