2 changes: 1 addition & 1 deletion vllm_ascend/ops/common_fused_moe.py
@@ -173,7 +173,7 @@ def __init__(self, *args, **kwargs):
             self.global_redundant_expert_num)
         self.log2phy = determine_default_log2phy_map(
             self.global_num_experts, self.ep_size, self.ep_rank,
-            self.global_redundant_expert_num)
+            self.global_redundant_expert_num).npu()
Comment on lines 174 to +176
Contributor
Severity: critical

This change correctly moves the log2phy tensor to the NPU device, fixing a device mismatch bug.
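For context, here is a generic illustration (hypothetical variable names, not taken from the MoE code paths) of the kind of device mismatch that leaving the map on the CPU can trigger; it assumes torch_npu is installed so that .npu() is available, which the changed call sites already require:

```python
import torch

# Routing ids produced on the NPU while the logical-to-physical map is still
# on the CPU: indexing across devices raises a RuntimeError.
topk_ids = torch.randint(0, 16, (8,)).npu()
log2phy_cpu = torch.arange(16)
# log2phy_cpu[topk_ids]               # would fail: index tensor on another device
log2phy_npu = log2phy_cpu.npu()       # the transfer this PR adds for self.log2phy
physical_ids = log2phy_npu[topk_ids]  # works: both tensors live on the NPU
```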

However, there are two related points to consider:

  1. Potential Bug in determine_default_log2phy_map: The called function determine_default_log2phy_map in vllm_ascend/eplb/core/eplb_utils.py appears to have a bug. On line 122 it uses rank_id inside a loop that iterates over ranks with the variable r (for r in range(world_size):); the condition should likely be r < global_redundant_expert_num instead of rank_id < global_redundant_expert_num. Because expert_map_all is constructed for all ranks within this loop, using rank_id produces an incorrect map for every rank r != rank_id, which in turn causes generate_log2phy_map to compute an incorrect log2phy_map_all and ultimately hands a faulty map to the current rank. This is a critical issue that should be investigated and fixed (see the schematic sketch after this list).

  2. Code Duplication: The same fix is required in three files (vllm_ascend/ops/common_fused_moe.py, vllm_ascend/ops/fused_moe.py, and vllm_ascend/torchair/ops/torchair_fused_moe.py) because the initialization logic is duplicated. That duplication is a maintainability risk, as the fact that this bug appeared in multiple places demonstrates. I recommend refactoring the logic into a shared helper function or a base class method in a follow-up PR; a rough sketch of such a helper appears at the end of this comment.
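To make point 1 concrete, here is a minimal, schematic sketch; it is not the actual eplb_utils code (only the names r, rank_id, world_size, and global_redundant_expert_num come from the description above), and it only shows why per-rank branching inside the loop should key off the loop variable r rather than the caller's rank_id:

```python
import torch


def expert_map_all_sketch(world_size: int, global_redundant_expert_num: int,
                          rank_id: int) -> torch.Tensor:
    """Build one entry per rank, mirroring the `for r in range(world_size)` loop."""
    # rank_id is accepted only to mirror the real signature; the fixed logic
    # never needs it inside the loop.
    rows = []
    for r in range(world_size):
        # Each entry describes rank r, so the redundant-expert check must use
        # the loop variable r. The suspected bug checks rank_id instead, which
        # stamps the calling rank's result onto every entry and makes all
        # entries with r != rank_id wrong.
        holds_redundant_expert = r < global_redundant_expert_num
        rows.append(int(holds_redundant_expert))
    return torch.tensor(rows)
```

With the rank_id-based condition every entry would be identical regardless of r, which is exactly the failure mode described in point 1.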

Given the critical nature of the potential bug, I recommend addressing it.
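For point 2, a shared helper along the following lines could centralize the duplicated initialization. The name build_default_log2phy_map_npu is hypothetical (no such helper exists today), and the sketch assumes torch_npu is available so that .npu() works, as the three call sites already require:

```python
import torch

from vllm_ascend.eplb.core.eplb_utils import determine_default_log2phy_map


def build_default_log2phy_map_npu(global_num_experts: int, ep_size: int,
                                  ep_rank: int,
                                  global_redundant_expert_num: int) -> torch.Tensor:
    """Compute the default log2phy map and move it to the NPU in one place,
    so no caller can forget the device transfer."""
    log2phy = determine_default_log2phy_map(global_num_experts, ep_size,
                                            ep_rank,
                                            global_redundant_expert_num)
    return log2phy.npu()
```

Each of the three __init__ methods would then assign self.log2phy = build_default_log2phy_map_npu(...) instead of repeating the call-and-transfer pattern.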

         local_num_experts = (torch.sum(
             self.expert_map != -1) if self.expert_map is not None else
                              self.global_num_experts)
2 changes: 1 addition & 1 deletion vllm_ascend/ops/fused_moe.py
@@ -264,7 +264,7 @@ def __init__(
             self.global_redundant_expert_num)
         self.log2phy = determine_default_log2phy_map(
             self.global_num_experts, self.ep_size, self.ep_rank,
-            self.global_redundant_expert_num)
+            self.global_redundant_expert_num).npu()
Comment on lines 265 to +267
Contributor
Severity: high

This is the same change as in vllm_ascend/ops/common_fused_moe.py. Please see my comment there regarding a potential critical bug in determine_default_log2phy_map and the code duplication issue.

         local_num_experts = (torch.sum(self.expert_map != -1)
                              if self.expert_map is not None else num_experts)
         if self.dynamic_eplb:
2 changes: 1 addition & 1 deletion vllm_ascend/torchair/ops/torchair_fused_moe.py
@@ -1046,7 +1046,7 @@ def __init__(
             self.global_redundant_expert_num)
         self.log2phy = determine_default_log2phy_map(
             self.global_num_experts, self.ep_size, self.ep_rank,
-            self.global_redundant_expert_num)
+            self.global_redundant_expert_num).npu()
Comment on lines 1047 to +1049
Contributor
Severity: high

This is the same change as in vllm_ascend/ops/common_fused_moe.py. Please see my comment there regarding a potential critical bug in determine_default_log2phy_map and the code duplication issue.

         local_num_experts = (torch.sum(self.expert_map != -1)
                              if self.expert_map is not None else num_experts)
         if self.dynamic_eplb: