
Conversation


@Levi-JQ commented Sep 28, 2025

What this PR does / why we need it?

Supports the generalized FlashComm2 optimization, which reduces communication overhead, decreases RmsNorm computation, and saves one AllGather step by replacing the AllReduce operations in the Attention/MLP modules with a pre-AlltoAll and a post-AllGather operation. The feature is enabled during the Prefill phase and delivers broad performance improvements, especially in long-sequence scenarios with large tensor parallelism (TP). Benchmark tests show a 10%-20% speedup under the TP16DP1 configuration.
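For intuition, the sketch below (not the code in this PR) illustrates the communication-pattern change using plain torch.distributed collectives; the function names, shapes, and placement of the token-sharded work are illustrative assumptions, and it assumes the token count is divisible by the TP world size.

# Illustrative sketch only: assumed shapes/names, not the PR implementation.
import torch
import torch.distributed as dist

def allreduce_style(partial_out: torch.Tensor, group) -> torch.Tensor:
    # Baseline: every TP rank holds a partial sum for ALL tokens and AllReduces it,
    # so any following per-token op (e.g. RmsNorm) runs on the full token dimension.
    dist.all_reduce(partial_out, group=group)
    return partial_out

def alltoall_then_allgather_style(partial_out: torch.Tensor, group) -> torch.Tensor:
    # FlashComm2-style decomposition: pre-AlltoAll so each rank collects the partial
    # sums for only its own token shard, do the per-token work on 1/world of the
    # tokens, then AllGather once at the end to rebuild the full tensor.
    world = dist.get_world_size(group)
    shards = list(partial_out.chunk(world, dim=0))      # split along the token dim
    recv = [torch.empty_like(s) for s in shards]
    dist.all_to_all(recv, shards, group=group)          # exchange token shards
    local = torch.stack(recv).sum(dim=0)                # reduce my token shard
    # ... per-token ops such as RmsNorm would run here on the smaller tensor ...
    gathered = [torch.empty_like(local) for _ in range(world)]
    dist.all_gather(gathered, local, group=group)       # rebuild the full token dim
    return torch.cat(gathered, dim=0)

The point of the decomposition is that the per-token work between the two collectives runs on only 1/TP of the tokens, and the full tensor is rebuilt exactly once with the final AllGather.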

Does this PR introduce any user-facing change?

How was this patch tested?


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces the FlashComm2 optimization for tensor parallelism on Ascend NPUs, aiming to improve performance by optimizing communication patterns. The changes span configuration, parallel state management, and operator implementations. My review has identified a few issues: a critical bug in the parallel group initialization that can lead to a crash, a related potential resource leak in the group destruction logic, and incorrect formatting of error messages in the configuration validation. These issues should be addressed to ensure correctness and robustness.

_FLASHCOMM2_OTP = None
_FLASHCOMM2_ODP = get_tp_group()

if flashcomm2_otp_size > 1:

critical

The process group creation for FlashComm2 is guarded by if flashcomm2_otp_size > 1:. This causes _FLASHCOMM2_OTP to be None when flashcomm2_oproj_tensor_parallel_size is 1. However, Flashcomm2OProjRowParallelOp is still used in this case, and it attempts to access methods on the _FLASHCOMM2_OTP group, which will lead to a crash. The logic within this if block appears to correctly handle the size == 1 case by creating groups of size 1. The conditional guard should be removed, and its content unindented, to fix this critical bug.
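For illustration only, here is a self-contained sketch of the bug pattern and the suggested restructuring using plain torch.distributed groups; the PR itself uses vLLM's own parallel-state helpers, so the names and helpers below are assumptions, not the actual code (it also assumes dist.init_process_group has already been called).

import torch.distributed as dist

def init_otp_group_buggy(world_size: int, otp_size: int):
    # Bug pattern: the group is only created when otp_size > 1, so any caller
    # that always expects a group object crashes on None when otp_size == 1.
    otp_group = None
    if otp_size > 1:
        for start in range(0, world_size, otp_size):
            ranks = list(range(start, start + otp_size))
            group = dist.new_group(ranks)       # must be called on every rank
            if dist.get_rank() in ranks:
                otp_group = group
    return otp_group  # None when otp_size == 1

def init_otp_group_fixed(world_size: int, otp_size: int):
    # Suggested fix: drop the guard and create the group unconditionally; a group
    # of size 1 is still a valid group, and collectives over it are cheap no-ops.
    otp_group = None
    for start in range(0, world_size, otp_size):
        ranks = list(range(start, start + otp_size))
        group = dist.new_group(ranks)
        if dist.get_rank() in ranks:
            otp_group = group
    return otp_group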

Comment on lines 107 to 115
raise AssertionError(
    "flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size}) cannot exceed global tensor parallel size ({global_tp_size})"
)
if global_tp_size % self.flashcomm2_oproj_tensor_parallel_size != 0:
    raise AssertionError(
        "Global tensor parallel size ({global_tp_size}) must be divisible by flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size})"
    )

high

The error message strings are not f-strings, so the variables inside the curly braces will not be interpolated. This will result in confusing and unhelpful error messages for users.

Suggested change

Before:
raise AssertionError(
    "flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size}) cannot exceed global tensor parallel size ({global_tp_size})"
)
if global_tp_size % self.flashcomm2_oproj_tensor_parallel_size != 0:
    raise AssertionError(
        "Global tensor parallel size ({global_tp_size}) must be divisible by flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size})"
    )

After:
raise AssertionError(
    f"flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size}) cannot exceed global tensor parallel size ({global_tp_size})"
)
if global_tp_size % self.flashcomm2_oproj_tensor_parallel_size != 0:
    raise AssertionError(
        f"Global tensor parallel size ({global_tp_size}) must be divisible by flashcomm2_oproj_tensor_parallel_size ({self.flashcomm2_oproj_tensor_parallel_size})"
    )

_OTP = None

global _FLASHCOMM2_OTP
if _FLASHCOMM2_OTP and get_ascend_config().flashcomm2_oproj_tensor_parallel_size != 1:

high

The condition get_ascend_config().flashcomm2_oproj_tensor_parallel_size != 1 will prevent the _FLASHCOMM2_OTP group from being destroyed when its size is 1. If the initialization logic is fixed to create a group for size 1 (as suggested in another comment), this will cause a resource leak. The group should be destroyed if it was created, regardless of its size.

Suggested change

Before:
if _FLASHCOMM2_OTP and get_ascend_config().flashcomm2_oproj_tensor_parallel_size != 1:

After:
if _FLASHCOMM2_OTP:

@Levi-JQ force-pushed the official-fc2 branch 4 times, most recently from 8b9a5a2 to 5b6c013 on September 30, 2025 02:34
This pull request has conflicts, please resolve those before we can evaluate the pull request.
