Revert "[Bugfix][eager][oom] fix rank0 load imbalance by no padding when multi dp" #7637

coder-fny wants to merge 1 commit into vllm-project:main from
Conversation
This reverts commit 9976e68.

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fully reverts a previous change (#7297) that aimed to address a rank0 load imbalance and Out-Of-Memory (OOM) issue in eager mode by modifying data parallel padding behavior. The revert restores the original padding strategy, where batches are padded to the maximum number of tokens across data parallel ranks, and simplifies the MoE communication type selection process.
Code Review
The pull request reverts the data parallel (DP) padding change by removing the allow_dp_padding parameter and simplifying token synchronization in _sync_batch_across_dp. It also modifies the ascend_forward_context to use num_tokens directly when selecting the MoE communication method. However, a critical issue was identified: using num_tokens directly for select_moe_comm_method can lead to desynchronization across DP ranks if num_tokens is not consistent, potentially causing a hang. The original max_num_tokens logic should be restored to ensure a synchronized token count.
```diff
-max_num_tokens = int(num_tokens_across_dp.max().item()) if num_tokens_across_dp is not None else num_tokens
-moe_comm_type = select_moe_comm_method(max_num_tokens, vllm_config, is_draft_model)
+moe_comm_type = select_moe_comm_method(num_tokens, vllm_config, is_draft_model)
```
This change removes the logic for determining the maximum number of tokens across data parallel (DP) ranks. It now relies on the num_tokens argument to be consistent across all DP ranks.
However, num_tokens may not be consistent. Specifically, in NPUModelRunner._sync_batch_across_dp, the all_reduce operation is skipped if _skip_all_reduce_across_dp_group() returns true (e.g., for non-MoE models or certain MoE configurations). In this case, num_tokens passed to this function will be the local token count for each rank, which can be different.
This will cause select_moe_comm_method to be called with different num_tokens values on different DP ranks, potentially leading to desynchronization and a hang if they choose different communication methods. This is a critical issue.
While the previous logic was also affected by the issue in _skip_all_reduce_across_dp_group, it correctly showed the intent of using a synchronized maximum token count. This change removes that safeguard.
```suggestion
max_num_tokens = int(num_tokens_across_dp.max().item()) if num_tokens_across_dp is not None else num_tokens
moe_comm_type = select_moe_comm_method(max_num_tokens, vllm_config, is_draft_model)
```
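To make the reviewer's concern concrete, here is a minimal standalone sketch (not vLLM code; the threshold value and helper name are hypothetical stand-ins for select_moe_comm_method's internal logic) showing how feeding each rank its local token count can make DP ranks pick different collectives, while the all-reduced maximum keeps them in agreement:

```python
# Hypothetical threshold between two MoE communication methods; the real
# selection logic in select_moe_comm_method is more involved.
MC2_TOKEN_LIMIT = 512

def select_comm_method(num_tokens: int) -> str:
    """Stand-in for select_moe_comm_method: picks a collective by batch size."""
    return "mc2" if num_tokens <= MC2_TOKEN_LIMIT else "all_gather"

# Per-rank local token counts in one DP step (rank 0 nearly idle, rank 1 busy).
local_tokens = [16, 1024]

# Using each rank's local count, the ranks disagree on the collective, which
# would cause mismatched communication ops and a hang in a real run.
per_rank_local = [select_comm_method(t) for t in local_tokens]
print(per_rank_local)   # ranks choose different methods

# Using the synchronized maximum (what num_tokens_across_dp.max() provides),
# every rank makes the same choice.
max_tokens = max(local_tokens)
per_rank_synced = [select_comm_method(max_tokens) for _ in local_tokens]
print(per_rank_synced)  # all ranks agree
```

This is why the suggestion above restores max_num_tokens rather than passing the possibly rank-local num_tokens straight through.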
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to the Contributing and Testing guides.
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Reverts #7297