[Bugfix] Allow to enable HSDP alone#1567

Open
gcanlin wants to merge 6 commits intovllm-project:mainfrom
gcanlin:hsdp-bugfix
Conversation

@gcanlin
Contributor

@gcanlin gcanlin commented Feb 28, 2026

Purpose

Standalone HSDP is the case where every other parallelism dimension is 1 but fully_shard_degree > 1. In that case, use fully_shard_degree as dit_parallel_size so that orthogonal rank generation works correctly for the HSDP workers.
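
The rule above can be sketched roughly as follows. This is an illustration of the description, not the code in this PR: the function name `resolve_dit_parallel_size` and its signature are hypothetical, and only `is_standalone_hsdp`, `dit_parallel_size`, and `fully_shard_degree` are names taken from the PR.

```python
def resolve_dit_parallel_size(dit_parallel_size: int, fully_shard_degree: int) -> int:
    """Hypothetical helper: choose the group size used for rank generation."""
    # Standalone HSDP: all other parallel dimensions multiply out to 1,
    # but sharding across more than one worker was requested.
    is_standalone_hsdp = dit_parallel_size == 1 and fully_shard_degree > 1
    if is_standalone_hsdp:
        # Let the shard degree stand in for the otherwise trivial size.
        return fully_shard_degree
    # Any other configuration keeps the size it already had.
    return dit_parallel_size
```

With the values from the test plan (no other parallelism, shard size 4), this sketch yields a size of 4, which lines up with `world_size=4` in the test result.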

Test Plan

vllm serve Wan-AI/Wan2.2-TI2V-5B-Diffusers --omni --port 8091 --use-hsdp --hsdp-shard-size 4

Test Result

[Stage-0] INFO 02-28 08:29:42 [diffusers_loader.py:301] Loading weights took 2.36 seconds
[Stage-0] INFO 02-28 08:29:42 [hsdp.py:128] HSDP Inference: replicate_size=1, shard_size=4, world_size=4, rank=0, fs_world_size=4, fs_rank=0
[Stage-0] INFO 02-28 08:29:42 [diffusers_loader.py:301] Loading weights took 2.17 seconds
[Stage-0] INFO 02-28 08:29:42 [hsdp.py:128] HSDP Inference: replicate_size=1, shard_size=4, world_size=4, rank=2, fs_world_size=4, fs_rank=2
[Stage-0] INFO 02-28 08:29:42 [diffusers_loader.py:301] Loading weights took 2.32 seconds
[Stage-0] INFO 02-28 08:29:42 [hsdp.py:128] HSDP Inference: replicate_size=1, shard_size=4, world_size=4, rank=1, fs_world_size=4, fs_rank=1
[Stage-0] INFO 02-28 08:29:42 [diffusers_loader.py:301] Loading weights took 2.20 seconds
[Stage-0] INFO 02-28 08:29:42 [hsdp.py:128] HSDP Inference: replicate_size=1, shard_size=4, world_size=4, rank=3, fs_world_size=4, fs_rank=3
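
One way to read the log lines above: with `replicate_size=1` and `shard_size=4`, every worker's `fs_rank` equals its global `rank`. The sketch below reproduces that relationship under an assumed row-major (replicate, shard) layout with the shard dimension innermost. The helper `hsdp_coords` is hypothetical and is not taken from `hsdp.py`; the real layout may differ.

```python
def hsdp_coords(rank: int, replicate_size: int, shard_size: int) -> tuple[int, int]:
    """Hypothetical helper: map a global rank to (replicate_rank, fs_rank).

    Assumes ranks are laid out row-major over a (replicate, shard) grid,
    so consecutive ranks belong to the same shard group.
    """
    world_size = replicate_size * shard_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world_size {world_size}")
    # The quotient picks the replica, the remainder picks the shard slot.
    replicate_rank, fs_rank = divmod(rank, shard_size)
    return replicate_rank, fs_rank


# Values from the test result: replicate_size=1, shard_size=4, world_size=4.
coords = [hsdp_coords(rank, replicate_size=1, shard_size=4) for rank in range(4)]
```

Under this assumption each of the four ranks lands in replica 0 with `fs_rank` equal to `rank`, as in the logged output.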

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74fbf6fa6f

@gcanlin gcanlin changed the title [Bugfix] Allow to enable HSDP alone [WIP][Bugfix] Allow to enable HSDP alone Feb 28, 2026
@gcanlin gcanlin marked this pull request as draft February 28, 2026 08:37
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
@gcanlin gcanlin changed the title [WIP][Bugfix] Allow to enable HSDP alone [Bugfix] Allow to enable HSDP alone Feb 28, 2026
@gcanlin gcanlin marked this pull request as ready for review February 28, 2026 16:59
is_standalone_hsdp = dit_parallel_size == 1 and fully_shard_degree > 1

# For standalone HSDP: use (fully_shard_degree * hsdp_replicate_size) as dit_parallel_size
# This ensures orthogonal rank generation works correctly for all HSDP workers
Collaborator

Missing regression test for standalone HSDP mode. Consider adding a unit test that verifies is_standalone_hsdp detection and dit_parallel_size calculation with various parallel configs (standalone HSDP, HSDP+SP, HSDP+TP) to prevent future regressions.

Contributor Author

Thanks for the reminder! Done now.
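
For illustration, the cases the reviewer lists could be checked with a small table-driven test like the one below. This is a sketch, not the test added in this PR: `ParallelCase`, `compute`, and `run_cases` are hypothetical names, the `sp`/`tp` fields stand in for whatever dimensions the real config exposes, and treating `dit_parallel_size` as the product of the non-HSDP degrees is an assumption. The standalone branch follows the code comment quoted above, `fully_shard_degree * hsdp_replicate_size`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelCase:
    # Hypothetical test-case record; field names are illustrative only.
    name: str
    sp: int                   # sequence-parallel degree
    tp: int                   # tensor-parallel degree
    fully_shard_degree: int
    hsdp_replicate_size: int
    expect_standalone: bool
    expect_size: int


def compute(case: ParallelCase) -> tuple[bool, int]:
    # Assumption: the non-HSDP degrees multiply into dit_parallel_size.
    dit_parallel_size = case.sp * case.tp
    is_standalone_hsdp = dit_parallel_size == 1 and case.fully_shard_degree > 1
    if is_standalone_hsdp:
        dit_parallel_size = case.fully_shard_degree * case.hsdp_replicate_size
    return is_standalone_hsdp, dit_parallel_size


CASES = [
    ParallelCase("standalone HSDP", 1, 1, 4, 1, True, 4),
    ParallelCase("standalone HSDP with replicas", 1, 1, 2, 2, True, 4),
    ParallelCase("HSDP+SP", 2, 1, 2, 1, False, 2),
    ParallelCase("HSDP+TP", 1, 2, 2, 1, False, 2),
    ParallelCase("no HSDP", 1, 1, 1, 1, False, 1),
]


def run_cases() -> int:
    """Check every case and return how many were verified."""
    for case in CASES:
        assert compute(case) == (case.expect_standalone, case.expect_size), case.name
    return len(CASES)
```

Keeping the expectations in a table makes it cheap to add a new combination whenever another parallel dimension is introduced.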

gcanlin added 2 commits March 2, 2026 09:23
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2 participants