Conversation

@wuxun-zhang commented Aug 14, 2025

This PR adds data parallel support for the V1 Gaudi plugin:

  • add DP-aware padding (a padding sketch follows below)
  • use all_gather and reduce_scatter
  • add a data parallel example
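
A minimal sketch of what DP-aware padding can look like (the helper name `pad_for_dp` and the use of `torch.distributed.all_reduce` are illustrative assumptions, not the plugin's actual code): every DP rank pads its local batch up to the maximum token count across ranks, so all ranks execute the same collectives with matching shapes.

```python
# Hypothetical sketch of DP-aware padding; names are illustrative,
# not the plugin's actual API.
import torch
import torch.distributed as dist


def pad_for_dp(input_ids: torch.Tensor, dp_group) -> torch.Tensor:
    """Pad the local batch so every DP rank sees the same token count.

    A rank with fewer tokens would otherwise skip collective calls that
    the other ranks are waiting on, causing a hang.
    """
    local_tokens = torch.tensor([input_ids.shape[0]], device=input_ids.device)
    # Find the largest batch across all data-parallel ranks.
    dist.all_reduce(local_tokens, op=dist.ReduceOp.MAX, group=dp_group)
    pad_len = int(local_tokens.item()) - input_ids.shape[0]
    if pad_len > 0:
        # Pad with a dummy token id (0); padded positions are masked out later.
        padding = torch.zeros(pad_len, dtype=input_ids.dtype,
                              device=input_ids.device)
        input_ids = torch.cat([input_ids, padding])
    return input_ids
```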

@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from d4a4c41 to 5ad7ff8 on August 20, 2025 09:00
@wuxun-zhang marked this pull request as ready for review on August 20, 2025 09:06
@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from a355755 to e9bc231 on August 20, 2025 15:11
@adobrzyn (Collaborator)

Please resolve conflicts.

@adobrzyn (Collaborator)

/run-gaudi-tests

@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from 7fdd7dd to c056e11 on August 24, 2025 15:12
@wuxun-zhang (Author)

@adobrzyn Removed unused code and rebased. Please review again.

@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from c056e11 to 86c8e41 on August 25, 2025 07:37
@wuxun-zhang (Author)

/run-gaudi-tests

@sys-hab-pt-service (Collaborator)

Only codeowners can request to run Gaudi tests. Contact list: kzawora-intel, xuechendi, mswiniarsk, adobrzyn

@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from eae75cd to 1142665 on August 27, 2025 02:06
@wuxun-zhang (Author)

@adobrzyn @xuechendi @mswiniarsk @kzawora-intel Please help review this. Thanks.

@adobrzyn (Collaborator)

/run-gaudi-tests

@wuxun-zhang (Author)

It seems upstream vLLM changes break the Gaudi plugin:

FAILED vllm-gaudi/tests/unit_tests/worker/test_hpu_model_runner.py::test_init_kv_cache_without_kv_sharing - AttributeError: 'ModelConfig' object has no attribute 'is_multimodal_raw_input_supported'. Did you mean: 'is_multimodal_raw_input_only_model'?

@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from 7ae55fd to afe1a72 on September 3, 2025 08:26
@adobrzyn (Collaborator) commented Sep 3, 2025

/run-gaudi-tests

@adobrzyn requested a review from Copilot on September 3, 2025 09:02
@Copilot (Copilot AI) left a comment

Pull Request Overview

This PR adds data parallel support for the V1 Gaudi plugin by implementing DP-aware padding mechanisms and collective operations.

  • Implements DP-aware padding for prefill and decode batches to ensure consistent tensor shapes across data parallel ranks
  • Adds collective communication operations (all_gather, reduce_scatter) for expert parallelism support (a dispatch/combine sketch follows below)
  • Includes a comprehensive data parallel example script with multi-node support
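
For context, a rough sketch of the dispatch/combine pattern described above (function names and shapes are assumptions for illustration, not the real hpu_communicator.py interface): dispatch all-gathers the hidden states so every rank's experts see the full batch, and combine reduce-scatters the expert outputs back to their owning ranks.

```python
# Illustrative dispatch/combine using all_gather and reduce_scatter;
# not the actual hpu_communicator.py implementation.
import torch
import torch.distributed as dist


def dispatch(hidden_states: torch.Tensor, group) -> torch.Tensor:
    """Gather hidden states from all ranks so experts see the full batch.

    Assumes DP-aware padding already made all per-rank shapes identical.
    """
    world_size = dist.get_world_size(group)
    gathered = [torch.empty_like(hidden_states) for _ in range(world_size)]
    dist.all_gather(gathered, hidden_states, group=group)
    return torch.cat(gathered, dim=0)


def combine(expert_output: torch.Tensor, group) -> torch.Tensor:
    """Reduce-scatter expert outputs so each rank keeps only its own slice."""
    world_size = dist.get_world_size(group)
    local_rows = expert_output.shape[0] // world_size
    output = torch.empty((local_rows, *expert_output.shape[1:]),
                         dtype=expert_output.dtype,
                         device=expert_output.device)
    dist.reduce_scatter_tensor(output, expert_output, group=group)
    return output
```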

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • vllm_gaudi/v1/worker/hpu_worker.py — updates distributed initialization to handle data parallel configuration and adds dummy batch execution
  • vllm_gaudi/v1/worker/hpu_model_runner.py — implements DP-aware padding logic and dummy batch creation for consistent execution across ranks
  • vllm_gaudi/platform.py — adds simple compile backend configuration
  • vllm_gaudi/distributed/device_communicators/hpu_communicator.py — implements dispatch/combine methods for expert parallelism with collective operations
  • tests/full_tests/ci_tests.sh — adds a CI test for data parallel functionality
  • examples/data_parallel.py — provides a complete example demonstrating data parallel usage



def generate_random_token_ids(repeat=1) -> list[int]:
    """
    For testing different seuquence length in data parallel scenario

Copilot AI commented Sep 3, 2025

Fix typo: 'seuquence' should be 'sequence'.

Suggested change:
- For testing different seuquence length in data parallel scenario
+ For testing different sequence length in data parallel scenario
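
For context only, a hypothetical completion of the truncated helper above (the body is a guess at a typical implementation; the PR's actual code may differ): it returns token-id lists of varying length so DP ranks receive different-sized batches.

```python
import random


def generate_random_token_ids(repeat=1) -> list[int]:
    """
    For testing different sequence length in data parallel scenario
    (hypothetical body; the real implementation in the PR may differ).
    """
    # Vary the length so each DP rank ends up with a different batch size.
    length = random.randint(16, 512) * repeat
    return [random.randint(0, 31999) for _ in range(length)]
```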

@@ -0,0 +1,254 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator

Is this a copy-paste of the upstream example examples/offline_inference/data_parallel.py? Why are we copying it into our repo? We can use it from upstream.

Author

It's intended for the CI test, so I copied it directly into the plugin repo.
If copying isn't expected, any suggestion on how to test this in CI?

wuxun-zhang and others added 2 commits September 3, 2025 16:48

- enable profile run

Signed-off-by: Wuxun Zhang <[email protected]>
@wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from afe1a72 to ea44413 on September 3, 2025 14:16
@wuxun-zhang (Author) commented Sep 3, 2025

where False = <function isclose at 0x71d02ac55a70>(0.9, 0.84375, rtol=0.06)
2025-09-03T09:17:18Z tensorflow test_lm_eval_correctness.py:196: AssertionError
FAILED test_lm_eval_correctness.py::test_lm_eval_correctness - assert False

CI failed on the above case, but I re-tested in k8s and cannot reproduce it there; the measured value is higher.
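
For reference, the failing assertion can be reproduced in isolation. Assuming the test uses numpy.isclose (the rtol keyword in the log suggests so), the measured 0.84375 falls just outside the 6% relative tolerance of the expected 0.9:

```python
import numpy as np

expected, measured = 0.9, 0.84375
# numpy.isclose: |expected - measured| <= atol + rtol * |measured|
# 0.05625 <= 1e-8 + 0.06 * 0.84375 ≈ 0.0506  ->  False
print(np.isclose(expected, measured, rtol=0.06))  # False
```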
