🚀 The feature, motivation and pitch
Feature:
Add native support for Fully Sharded Data Parallel (FSDP) training/inference to vLLM-Omni, enabling efficient large-model execution across multiple GPUs/NPUs with memory-optimized parameter sharding, gradient checkpointing, and communication overlap.
Motivation:
I am currently working on inference with the wan2.2-14B model, whose size exceeds the memory capacity of a single GPU/NPU. While vLLM-Omni currently supports Tensor Parallelism (TP), our experiments show that TP does not yield meaningful performance gains for our workload and model architecture. Instead, FSDP has become the industry standard for large-scale video generation models and is explicitly recommended in the official wan2.2 paper for efficient multi-GPU/NPU scaling.
The absence of FSDP support forces users to do one of the following:
1. Use less efficient parallelism strategies
2. Manually implement external sharding solutions
3. Switch to other frameworks that natively support FSDP
This feature would align vLLM-Omni with modern large-model training practices and meet the needs of users working with models where TP is suboptimal.
Pitch:
Implementing FSDP would allow vLLM-Omni users to:
- Train and run inference on models larger than single-GPU/NPU memory
- Achieve better performance for models where TP overhead outweighs its benefits
- Adopt best practices already established in video generation and other large-model domains
- Maintain compatibility with PyTorch's FSDP ecosystem
Alternatives
FSDP provides the optimal balance of memory efficiency (sharding parameters, gradients, and optimizer states) and computational performance (communication overlap, flexible sharding strategies) for models like wan2.2-14B where TP shows diminishing returns.
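To make the request concrete, below is a minimal sketch of what sharded inference with PyTorch's built-in FSDP API looks like today, outside vLLM-Omni. `load_wan22_transformer`, `latents`, `timestep`, and `text_embeddings` are hypothetical placeholders; the point is the `FullyShardedDataParallel` wrapping and sharding configuration that native support would ideally handle internally.

```python
import functools

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Launched with torchrun; "nccl" for GPUs, "hccl" for Ascend NPUs.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())

# Hypothetical loader for the wan2.2-14B diffusion transformer; stands in for
# whatever model-construction path vLLM-Omni uses internally.
model = load_wan22_transformer()

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,   # shard params, grads, optimizer state
    auto_wrap_policy=functools.partial(
        size_based_auto_wrap_policy, min_num_params=int(1e8)
    ),                                                # roughly one shard unit per large block
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
    use_orig_params=True,
)

# Each rank keeps only ~1/world_size of the weights; FSDP all-gathers each
# wrapped block just before its forward pass and frees it afterwards,
# overlapping communication with compute.
with torch.no_grad():
    out = fsdp_model(latents, timestep, text_embeddings)  # placeholder inputs
```

Native support would ideally hide this wrapping behind vLLM-Omni's existing parallelism configuration rather than requiring users to wrap models themselves.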
Additional context
1. Major video generation frameworks (e.g., those for diffusion models and autoregressive video models) have adopted FSDP as their primary multi-GPU strategy
2. The wan2.2 paper specifically mentions: "For multi-GPU training, we recommend FSDP over TP for better memory efficiency and scaling performance"
3. Other LLM frameworks are adding FSDP support (or already have) as users work with increasingly larger models
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.