Add data parallel support #80
base: main
Conversation
Force-pushed d4a4c41 to 5ad7ff8
Force-pushed a355755 to e9bc231
Please resolve conflicts.
/run-gaudi-tests
Force-pushed 7fdd7dd to c056e11
@adobrzyn Removed unused code and rebased. Please review again.
Force-pushed c056e11 to 86c8e41
/run-gaudi-tests
Only codeowners can request to run Gaudi tests. Contact list: kzawora-intel, xuechendi, mswiniarsk, adobrzyn
Force-pushed eae75cd to 1142665
@adobrzyn @xuechendi @mswiniarsk @kzawora-intel Please help review this. Thanks.
/run-gaudi-tests
It seems upstream vLLM changes break the Gaudi plugin.
Force-pushed 7ae55fd to afe1a72
/run-gaudi-tests
Pull Request Overview
This PR adds data parallel support for the V1 Gaudi plugin by implementing DP-aware padding mechanisms and collective operations.
- Implements DP-aware padding for prefill and decode batches to ensure consistent tensor shapes across data parallel ranks
- Adds collective communication operations (all_gather, reduce_scatter) for expert parallelism support
- Includes a comprehensive data parallel example script with multi-node support
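The DP-aware padding idea in the first bullet can be sketched as follows. This is a minimal single-process illustration in plain Python: the `dp_rank_sizes` argument stands in for the result of an all_gather of batch sizes across data-parallel ranks, and `dp_aware_pad` is a hypothetical name, not the plugin's actual API.

```python
def dp_aware_pad(local_token_ids: list[int], dp_rank_sizes: list[int],
                 pad_token_id: int = 0) -> list[int]:
    """Pad this rank's batch up to the max token count across DP ranks.

    All ranks must launch kernels with identical tensor shapes, so each
    rank pads its local batch to the largest size any rank reports.
    """
    target = max(dp_rank_sizes)              # shape all ranks agree on
    padding = target - len(local_token_ids)  # dummy slots for this rank
    return local_token_ids + [pad_token_id] * padding
```

A rank that already holds the largest batch pads by zero elements; every other rank appends dummy tokens that are later ignored when gathering outputs.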
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_worker.py | Updates distributed initialization to handle data parallel configuration and adds dummy batch execution |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Implements DP-aware padding logic and dummy batch creation for consistent execution across ranks |
| vllm_gaudi/platform.py | Adds simple compile backend configuration |
| vllm_gaudi/distributed/device_communicators/hpu_communicator.py | Implements dispatch/combine methods for expert parallelism with collective operations |
| tests/full_tests/ci_tests.sh | Adds CI test for data parallel functionality |
| examples/data_parallel.py | Provides complete example demonstrating data parallel usage |
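The dispatch/combine path in hpu_communicator.py builds on all_gather and reduce_scatter semantics. A single-process simulation of those two collectives, using Python lists in place of HPU tensors (the helper names `naive_all_gather` and `naive_reduce_scatter` are illustrative, not the communicator's API), shows the data movement:

```python
def naive_all_gather(shards: list[list[int]]) -> list[list[int]]:
    """Every rank contributes its shard; every rank receives the
    concatenation of all shards (one output entry per rank)."""
    gathered = [x for shard in shards for x in shard]
    return [list(gathered) for _ in shards]

def naive_reduce_scatter(per_rank_inputs: list[list[int]],
                         world_size: int) -> list[list[int]]:
    """Each rank holds a full-length buffer; the buffers are summed
    elementwise, then each rank keeps only its own chunk."""
    summed = [sum(vals) for vals in zip(*per_rank_inputs)]
    chunk = len(summed) // world_size
    return [summed[r * chunk:(r + 1) * chunk] for r in range(world_size)]
```

In expert parallelism, dispatch typically gathers tokens so each rank can route them to its local experts, and combine reduce-scatters the expert outputs back so each rank ends up with results for only its own tokens.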
def generate_random_token_ids(repeat=1) -> list[int]:
    """
    For testing different seuquence length in data parallel scenario
Fix typo: 'seuquence' should be 'sequence'.
- For testing different seuquence length in data parallel scenario
+ For testing different sequence length in data parallel scenario
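A runnable completion of the helper quoted above might look like the sketch below. The candidate lengths, vocabulary bound, and list-of-lists return type are assumptions for illustration; the PR's actual implementation may differ.

```python
import random

def generate_random_token_ids(repeat: int = 1) -> list[list[int]]:
    """For testing different sequence lengths in a data parallel scenario.

    Produces prompts of varying lengths so different DP ranks end up
    with differently sized batches, exercising the DP-aware padding.
    """
    candidate_lens = [2, 512, 62, 38]  # assumed mix of short/long prompts
    prompts = []
    for length in candidate_lens * repeat:
        prompts.append([random.randint(0, 1000) for _ in range(length)])
    return prompts
```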
@@ -0,0 +1,254 @@
# SPDX-License-Identifier: Apache-2.0
Is this a copy-paste of the upstream example file examples/offline_inference/data_parallel.py? Why are we copying it to our repo? We can use it from upstream.
It's intended for the CI test, so I copied it directly into the plugin repo.
If copying isn't expected, any suggestions on how to test this in CI?
- enable profile run
Signed-off-by: Wuxun Zhang <[email protected]>
Force-pushed afe1a72 to ea44413
CI failed on the above case, but I re-tested in k8s and cannot reproduce it due to
This adds data parallel support for the V1 Gaudi plugin.