@offline893 (Contributor) commented on Sep 30, 2025

What this PR does / why we need it?

Resolves an EPLB failure caused by the log2phy map ending up on the wrong device type when MTP rotary position encoding is used.

Does this PR introduce any user-facing change?

How was this patch tested?

vllm-project/vllm@releases/v0.11.0


@gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug causing EPLB failures by ensuring the log2phy map is on the correct NPU device. The fix is applied consistently across three files where this logic is present. While the fix is correct, I've identified a critical potential bug in the called function determine_default_log2phy_map and a high-severity maintainability issue due to code duplication. Please see my detailed comments.

Comment on lines 174 to +176
  self.log2phy = determine_default_log2phy_map(
      self.global_num_experts, self.ep_size, self.ep_rank,
-     self.global_redundant_expert_num)
+     self.global_redundant_expert_num).npu()

critical

This change correctly moves the log2phy tensor to the NPU device, fixing a device mismatch bug.

However, there are two related points to consider:

  1. Potential Bug in determine_default_log2phy_map: The called function determine_default_log2phy_map in vllm_ascend/eplb/core/eplb_utils.py appears to have a bug. On line 122 it uses rank_id inside a loop that iterates over ranks with the variable r (for r in range(world_size):). The condition should likely be r < global_redundant_expert_num rather than rank_id < global_redundant_expert_num. Because expert_map_all is constructed for all ranks within this loop, using rank_id produces an incorrect map for every rank r != rank_id, which causes generate_log2phy_map to compute an incorrect log2phy_map_all and ultimately provides a faulty map for the current rank (see the sketch after this comment). This is a critical issue that should be investigated and fixed.

  2. Code Duplication: This same fix is required in three files (vllm_ascend/ops/common_fused_moe.py, vllm_ascend/ops/fused_moe.py, and vllm_ascend/torchair/ops/torchair_fused_moe.py) because the initialization logic is duplicated. This duplication is a maintainability risk, as demonstrated by this bug appearing in multiple places. I recommend refactoring the logic into a shared helper function or a base class method in a follow-up PR to improve maintainability; a sketch of one possible helper follows the last of the duplicated-change comments below.

Given the critical nature of the potential bug, I recommend addressing it.
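
To make the off-by-rank concern concrete, here is a minimal, self-contained sketch. It assumes (as the description above suggests, but does not confirm) that the rank_id < global_redundant_expert_num check decides whether a given rank hosts one of the redundant experts; the function name and structure below are illustrative, not the actual eplb_utils.py code.

```python
# Hypothetical sketch -- NOT the real determine_default_log2phy_map.
# It only shows why checking rank_id inside a loop over all ranks is suspect,
# assuming the check decides whether a rank hosts one of the redundant experts.
def experts_per_rank(global_expert_num, world_size, rank_id,
                     global_redundant_expert_num, use_loop_var):
    local_num_experts = global_expert_num // world_size
    counts = []
    for r in range(world_size):
        check = r if use_loop_var else rank_id  # using rank_id is the suspected bug
        if check < global_redundant_expert_num:
            counts.append(local_num_experts + 1)  # this rank holds a redundant expert
        else:
            counts.append(local_num_experts)
    return counts

# 64 logical experts, 4 ranks, 2 redundant experts, map built from rank 3:
print(experts_per_rank(64, 4, 3, 2, use_loop_var=False))  # [16, 16, 16, 16] -- redundant experts lost
print(experts_per_rank(64, 4, 3, 2, use_loop_var=True))   # [17, 17, 16, 16] -- expected layout
```

With rank_id in the condition, the result also depends on which rank builds the map (rank 0 or 1 would instead see [17, 17, 17, 17]), which matches the description above of expert_map_all being wrong for all ranks r != rank_id.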

Comment on lines 265 to +267
  self.log2phy = determine_default_log2phy_map(
      self.global_num_experts, self.ep_size, self.ep_rank,
-     self.global_redundant_expert_num)
+     self.global_redundant_expert_num).npu()

high

This is the same change as in vllm_ascend/ops/common_fused_moe.py. Please see my comment there regarding a potential critical bug in determine_default_log2phy_map and the code duplication issue.

Comment on lines 1047 to +1049
  self.log2phy = determine_default_log2phy_map(
      self.global_num_experts, self.ep_size, self.ep_rank,
-     self.global_redundant_expert_num)
+     self.global_redundant_expert_num).npu()

high

This is the same change as in vllm_ascend/ops/common_fused_moe.py. Please see my comment there regarding a potential critical bug in determine_default_log2phy_map and the code duplication issue.
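
As a follow-up to the duplication point above, the repeated initialization could be collected into one place. The helper below is only a sketch: its name and location are hypothetical, and it assumes all three call sites pass the same four arguments shown in the diffs above.

```python
import torch

from vllm_ascend.eplb.core.eplb_utils import determine_default_log2phy_map


def build_default_log2phy_map(global_num_experts: int, ep_size: int,
                              ep_rank: int,
                              global_redundant_expert_num: int) -> torch.Tensor:
    # Hypothetical shared helper: keeps the .npu() device placement in one
    # spot instead of repeating it in common_fused_moe.py, fused_moe.py and
    # torchair_fused_moe.py.
    return determine_default_log2phy_map(global_num_experts, ep_size, ep_rank,
                                         global_redundant_expert_num).npu()
```

Each call site would then reduce to self.log2phy = build_default_log2phy_map(self.global_num_experts, self.ep_size, self.ep_rank, self.global_redundant_expert_num).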


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
