[Feature]: Support for Fully Sharded Data Parallel (FSDP) in vLLM-Omni #1326

@FrosterHan

Description

🚀 The feature, motivation and pitch

Feature:
Add native support for Fully Sharded Data Parallel (FSDP) training and inference to vLLM-Omni, enabling efficient execution of large models across multiple GPUs/NPUs with memory-optimized parameter sharding, gradient checkpointing, and communication overlap.
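
For reference, this is roughly the upstream PyTorch API the feature could build on. The snippet below is only a sketch: the process-group setup (e.g. via torchrun), the CUDA device handling, and the bf16 choice are assumptions, and NPU backends would supply their device differently.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy


def wrap_with_fsdp(model: torch.nn.Module) -> FSDP:
    # Assumes torch.distributed is already initialized (e.g. via torchrun)
    # with one process per device; NPU backends would pass their own device id.
    assert dist.is_initialized()
    return FSDP(
        model,
        # FULL_SHARD shards parameters, gradients, and optimizer state
        # across all ranks (ZeRO-3 style), which is what keeps a 14B model
        # within per-device memory limits.
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        # Run compute and gradient reduction in bf16 to cut memory further.
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
        ),
        device_id=torch.cuda.current_device(),
    )
```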

Motivation:
I am currently working on inference with the wan2.2-14B model, whose size exceeds the memory capacity of a single GPU/NPU. While vLLM-Omni currently supports Tensor Parallelism (TP), our experiments show that TP does not yield meaningful performance gains for our workload and model architecture. Instead, FSDP has become the industry standard for large-scale video generation models and is explicitly recommended in the official wan2.2 paper for efficient multi-GPU/NPU scaling.

The absence of FSDP support forces users to:
1. Use less efficient parallelism strategies,
2. Manually implement external sharding solutions, or
3. Switch to other frameworks that support FSDP natively.

This feature would align vLLM-Omni with modern large-model training practices and meet the needs of users working with models where TP is suboptimal.

Pitch:
Implementing FSDP would allow vLLM-Omni users to:

  • Train and run inference on models larger than single-GPU/NPU memory
  • Achieve better performance for models where TP overhead outweighs its benefits
  • Adopt best practices already established in video generation and other large-model domains
  • Maintain compatibility with PyTorch's FSDP ecosystem

Alternatives

FSDP provides the optimal balance of memory efficiency (sharding parameters, gradients, and optimizer states) and computational performance (communication overlap, flexible sharding strategies) for models like wan2.2-14B where TP shows diminishing returns.
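
To make the "communication overlap" and "flexible sharding strategies" point concrete, here is a rough sketch of how that is typically configured with PyTorch FSDP today. `WanTransformerBlock` is a placeholder name for the model's transformer/DiT block, not an actual wan2.2 or vLLM-Omni class, and the exact knobs vLLM-Omni would expose are an open design question.

```python
import functools

import torch
from torch.distributed.fsdp import BackwardPrefetch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class WanTransformerBlock(torch.nn.Module):
    """Placeholder for the real transformer/DiT block class."""


def shard_per_block(model: torch.nn.Module) -> FSDP:
    # Wrap each transformer block as its own FSDP unit so only one block's
    # parameters need to be gathered (unsharded) at a time.
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={WanTransformerBlock},
    )
    return FSDP(
        model,
        auto_wrap_policy=auto_wrap_policy,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        # Prefetch the next unit's parameters during the current unit's
        # backward pass so all-gathers overlap with compute.
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
        limit_all_gathers=True,
    )
```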

Additional context

1. Major video generation frameworks (e.g., those for diffusion models and autoregressive video models) have adopted FSDP as their primary multi-GPU strategy.
2. The wan2.2 paper specifically mentions: "For multi-GPU training, we recommend FSDP over TP for better memory efficiency and scaling performance."
3. Other LLM frameworks are adding FSDP support (or already have) as users work with increasingly large models.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels

enhancement (New feature or request), good first issue (Good for newcomers), high priority (high priority issue, needs to be done asap)
