
Revert "[Bugfix][eager][oom] fix rank0 load imbalance by no padding when multi dp"#7637

Open
coder-fny wants to merge 1 commit into vllm-project:main from coder-fny:revert-7297-fix-rank0-load-imbalance

Conversation


@coder-fny (Contributor) commented Mar 25, 2026

Reverts #7297

@gemini-code-assist commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fully reverts a previous change (#7297) that aimed to address a rank0 load imbalance and Out-Of-Memory (OOM) issues in eager mode by modifying data parallel padding behavior. The revert restores the original padding strategy, where batches are padded to the maximum number of tokens across data parallel ranks, and simplifies the MoE communication type selection process.

Highlights

  • Revert of Padding Logic: The core change is the reversion of modifications related to data parallel padding. The system now re-enables padding to the maximum number of tokens across data parallel ranks, undoing a previous attempt to avoid padding for load balancing.
  • MoE Communication Type Selection: The logic for selecting the Mixture-of-Experts (MoE) communication type has been reverted to use the local num_tokens directly, rather than the maximum number of tokens across data parallel ranks.
  • Removal of Eager Mode Enforcement Parameter: The force_eager parameter has been removed from the model's forward pass execution, simplifying the call signature and removing explicit eager mode enforcement at that point.
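The restored padding strategy described above can be illustrated with a hedged sketch. All names and the pure-Python framing here are hypothetical, for illustration only; the actual vLLM Ascend implementation operates on tensors and uses collective ops to gather per-rank token counts.

```python
# Hypothetical sketch of padding batches to the maximum token count
# across data parallel (DP) ranks, so collective ops see uniform shapes.
# Not the actual vLLM Ascend code.

def pad_tokens_across_dp(local_tokens, num_tokens_across_dp, pad_id=0):
    """Pad one rank's token list up to the max count across all DP ranks."""
    max_num_tokens = max(num_tokens_across_dp)
    padding = [pad_id] * (max_num_tokens - len(local_tokens))
    return local_tokens + padding

# Three DP ranks with uneven batches; in the real system the per-rank
# counts would be gathered beforehand (e.g. via an all-gather).
batches = [[1, 2, 3, 4, 5], [6, 7], [8, 9, 10]]
counts = [len(b) for b in batches]  # [5, 2, 3]
padded = [pad_tokens_across_dp(b, counts) for b in batches]
assert all(len(p) == 5 for p in padded)  # every rank now matches the max
```

The trade-off the original PR #7297 tried to address is visible here: rank 0's larger batch forces every other rank to pad up to its size, which costs memory on the smaller ranks but keeps all ranks shape-consistent.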




@gemini-code-assist (bot) left a comment


Code Review

The pull request refactors data parallel (DP) padding logic by removing the allow_dp_padding parameter and simplifying token synchronization in _sync_batch_across_dp. It also modifies the ascend_forward_context to directly use num_tokens for selecting the MoE communication method. However, a critical issue was identified where using num_tokens directly for select_moe_comm_method could lead to desynchronization across DP ranks if num_tokens is not consistent, potentially causing a hang. The original logic for max_num_tokens should be restored to ensure a synchronized token count.

The reverted change in context:

```diff
-max_num_tokens = int(num_tokens_across_dp.max().item()) if num_tokens_across_dp is not None else num_tokens
-moe_comm_type = select_moe_comm_method(max_num_tokens, vllm_config, is_draft_model)
+moe_comm_type = select_moe_comm_method(num_tokens, vllm_config, is_draft_model)
```
@gemini-code-assist left a review comment (critical):

This change removes the logic for determining the maximum number of tokens across data parallel (DP) ranks. It now relies on the num_tokens argument to be consistent across all DP ranks.

However, num_tokens may not be consistent. Specifically, in NPUModelRunner._sync_batch_across_dp, the all_reduce operation is skipped if _skip_all_reduce_across_dp_group() returns true (e.g., for non-MoE models or certain MoE configurations). In this case, num_tokens passed to this function will be the local token count for each rank, which can be different.

This will cause select_moe_comm_method to be called with different num_tokens values on different DP ranks, potentially leading to desynchronization and a hang if they choose different communication methods. This is a critical issue.

While the previous logic was also affected by the issue in _skip_all_reduce_across_dp_group, it correctly showed the intent of using a synchronized maximum token count. This change removes that safeguard.

Suggested change:

```diff
-moe_comm_type = select_moe_comm_method(num_tokens, vllm_config, is_draft_model)
+max_num_tokens = int(num_tokens_across_dp.max().item()) if num_tokens_across_dp is not None else num_tokens
+moe_comm_type = select_moe_comm_method(max_num_tokens, vllm_config, is_draft_model)
```
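The hazard the reviewer describes can be sketched in isolation. This is a hedged illustration: the real select_moe_comm_method is more involved, and the threshold-based chooser and its value below are made up purely to demonstrate how rank-local token counts can produce divergent choices.

```python
# Hypothetical sketch of the desynchronization hazard. The threshold and
# chooser logic are illustrative only, not the real select_moe_comm_method.

ALLGATHER_THRESHOLD = 256  # made-up value for illustration

def select_moe_comm_method(num_tokens):
    # Pretend small batches use allgather and large ones use alltoall.
    return "allgather" if num_tokens <= ALLGATHER_THRESHOLD else "alltoall"

# Per-rank local token counts when the all_reduce sync was skipped:
num_tokens_across_dp = [100, 300]  # rank 0, rank 1

# Using each rank's local count, the ranks disagree on the comm method,
# which in a real collective would lead to a hang.
local_choices = {select_moe_comm_method(n) for n in num_tokens_across_dp}
assert local_choices == {"allgather", "alltoall"}  # divergent

# Using the synchronized maximum, every rank makes the same choice.
max_num_tokens = max(num_tokens_across_dp)
sync_choices = {select_moe_comm_method(max_num_tokens)
                for _ in num_tokens_across_dp}
assert sync_choices == {"alltoall"}  # consistent across ranks
```

This is why the suggested change restores max_num_tokens: even if the local counts differ, feeding the synchronized maximum into the selector guarantees all DP ranks pick the same communication method.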

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Fill in the PR description and write a clear commit message to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.
