Conversation

@wuxun-zhang commented Aug 14, 2025

This PR adds data parallel support for the V1 Gaudi plugin:

  • add DP-aware padding (a padding sketch follows below)
  • use all_gather and reduce_scatter
  • add a data parallel example
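
A minimal sketch of what DP-aware padding can look like (the helper name `pad_for_dp` and the use of `torch.distributed.all_reduce` are illustrative assumptions, not the plugin's actual code): every DP rank pads its local batch up to the maximum token count across ranks, so all ranks execute the same collectives with matching shapes.

```python
# Hypothetical sketch of DP-aware padding; names are illustrative,
# not the plugin's actual API.
import torch
import torch.distributed as dist


def pad_for_dp(input_ids: torch.Tensor, dp_group) -> torch.Tensor:
    """Pad the local batch so every DP rank sees the same token count.

    A rank with fewer tokens would otherwise skip collective calls that
    the other ranks are waiting on, causing a hang.
    """
    local_tokens = torch.tensor([input_ids.shape[0]], device=input_ids.device)
    # Find the largest batch across all data-parallel ranks.
    dist.all_reduce(local_tokens, op=dist.ReduceOp.MAX, group=dp_group)
    pad_len = int(local_tokens.item()) - input_ids.shape[0]
    if pad_len > 0:
        # Pad with a dummy token id (0); padded positions are masked out later.
        padding = torch.zeros(pad_len, dtype=input_ids.dtype,
                              device=input_ids.device)
        input_ids = torch.cat([input_ids, padding])
    return input_ids
```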

@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from d4a4c41 to 5ad7ff8 on August 20, 2025 09:00
@wuxun-zhang marked this pull request as ready for review on August 20, 2025 09:06
@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from a355755 to e9bc231 on August 20, 2025 15:11
@adobrzyn (Collaborator)

Please resolve conflicts.

@adobrzyn (Collaborator)

/run-gaudi-tests

@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from 7fdd7dd to c056e11 on August 24, 2025 15:12
@wuxun-zhang (Author)

@adobrzyn Removed unused code and rebased. Please review again.

@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from c056e11 to 86c8e41 on August 25, 2025 07:37
@wuxun-zhang (Author)

/run-gaudi-tests

@sys-hab-pt-service (Collaborator)

Only codeowners can request to run Gaudi tests. Contact list: kzawora-intel, xuechendi, mswiniarsk, adobrzyn

@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from eae75cd to 1142665 on August 27, 2025 02:06
@wuxun-zhang (Author)

@adobrzyn @xuechendi @mswiniarsk @kzawora-intel Please help review this. Thanks.

@adobrzyn (Collaborator)

/run-gaudi-tests

@wuxun-zhang (Author)

It seems upstream vLLM changes break the Gaudi plugin:

FAILED vllm-gaudi/tests/unit_tests/worker/test_hpu_model_runner.py::test_init_kv_cache_without_kv_sharing - AttributeError: 'ModelConfig' object has no attribute 'is_multimodal_raw_input_supported'. Did you mean: 'is_multimodal_raw_input_only_model'?

@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from 7ae55fd to afe1a72 on September 3, 2025 08:26
@adobrzyn (Collaborator) commented Sep 3, 2025

/run-gaudi-tests

@adobrzyn requested a review from Copilot on September 3, 2025 09:02
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR adds data parallel support for the V1 Gaudi plugin by implementing DP-aware padding mechanisms and collective operations.

  • Implements DP-aware padding for prefill and decode batches to ensure consistent tensor shapes across data parallel ranks
  • Adds collective communication operations (all_gather, reduce_scatter) for expert parallelism support (a dispatch/combine sketch follows below)
  • Includes a comprehensive data parallel example script with multi-node support
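
For context, a rough sketch of the dispatch/combine pattern described above (function names and shapes are assumptions for illustration, not the real hpu_communicator.py interface): dispatch all-gathers the hidden states so every rank's experts see the full batch, and combine reduce-scatters the expert outputs back to their owning ranks.

```python
# Illustrative dispatch/combine using all_gather and reduce_scatter;
# not the actual hpu_communicator.py implementation.
import torch
import torch.distributed as dist


def dispatch(hidden_states: torch.Tensor, group) -> torch.Tensor:
    """Gather hidden states from all ranks so experts see the full batch.

    Assumes DP-aware padding already made all per-rank shapes identical.
    """
    world_size = dist.get_world_size(group)
    gathered = [torch.empty_like(hidden_states) for _ in range(world_size)]
    dist.all_gather(gathered, hidden_states, group=group)
    return torch.cat(gathered, dim=0)


def combine(expert_output: torch.Tensor, group) -> torch.Tensor:
    """Reduce-scatter expert outputs so each rank keeps only its own slice."""
    world_size = dist.get_world_size(group)
    local_rows = expert_output.shape[0] // world_size
    output = torch.empty((local_rows, *expert_output.shape[1:]),
                         dtype=expert_output.dtype,
                         device=expert_output.device)
    dist.reduce_scatter_tensor(output, expert_output, group=group)
    return output
```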

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • vllm_gaudi/v1/worker/hpu_worker.py — updates distributed initialization to handle data parallel configuration and adds dummy batch execution
  • vllm_gaudi/v1/worker/hpu_model_runner.py — implements DP-aware padding logic and dummy batch creation for consistent execution across ranks
  • vllm_gaudi/platform.py — adds simple compile backend configuration
  • vllm_gaudi/distributed/device_communicators/hpu_communicator.py — implements dispatch/combine methods for expert parallelism with collective operations
  • tests/full_tests/ci_tests.sh — adds a CI test for data parallel functionality
  • examples/data_parallel.py — provides a complete example demonstrating data parallel usage



def generate_random_token_ids(repeat=1) -> list[int]:
    """
    For testing different seuquence length in data parallel scenario

Copilot AI commented Sep 3, 2025

Fix typo: 'seuquence' should be 'sequence'.

Suggested change:
- For testing different seuquence length in data parallel scenario
+ For testing different sequence length in data parallel scenario
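
For context only, a hypothetical completion of the truncated helper above (the body is a guess at a typical implementation; the PR's actual code may differ): it returns token-id lists of varying length so DP ranks receive different-sized batches.

```python
import random


def generate_random_token_ids(repeat=1) -> list[int]:
    """
    For testing different sequence length in data parallel scenario
    (hypothetical body; the real implementation in the PR may differ).
    """
    # Vary the length so each DP rank ends up with a different batch size.
    length = random.randint(16, 512) * repeat
    return [random.randint(0, 31999) for _ in range(length)]
```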

@@ -0,0 +1,254 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator

Is this a copy-paste of the upstream example examples/offline_inference/data_parallel.py? Why are we copying it into our repo? We can use it from upstream.

Author

It's intended for the CI test, so I copied it directly into the plugin repo.
If copying isn't expected, any suggestion on how to test this in CI?

wuxun-zhang and others added 2 commits September 3, 2025 16:48

- enable profile run

Signed-off-by: Wuxun Zhang <[email protected]>
@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from afe1a72 to ea44413 on September 3, 2025 14:16
@wuxun-zhang (Author) commented Sep 3, 2025

where False = <function isclose at 0x71d02ac55a70>(0.9, 0.84375, rtol=0.06)
2025-09-03T09:17:18Z tensorflow test_lm_eval_correctness.py:196: AssertionError
FAILED test_lm_eval_correctness.py::test_lm_eval_correctness - assert False

CI failed on the above case, but I re-tested in k8s and cannot reproduce it there; the measured value is higher.
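
For reference, the failing assertion can be reproduced in isolation. Assuming the test uses numpy.isclose (the rtol keyword in the log suggests so), the measured 0.84375 falls just outside the 6% relative tolerance of the expected 0.9:

```python
import numpy as np

expected, measured = 0.9, 0.84375
# numpy.isclose: |expected - measured| <= atol + rtol * |measured|
# 0.05625 <= 1e-8 + 0.06 * 0.84375 ≈ 0.0506  ->  False
print(np.isclose(expected, measured, rtol=0.06))  # False
```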
