Closed
Labels
AutoDeploy &lt;NV&gt;, AutoDeploy Backend, feature request (New feature or request. This includes new model, dtype, functionality support)
Description
🚀 The feature, motivation and pitch
Currently, the MLA layer is only simple-sharded: the attention computation is redundantly replicated across all GPUs. We need to properly support head parallelism for better performance. This involves updating not only linear nodes, but also the shape, split, and view nodes, so that sharded shapes propagate correctly through the subgraph's dataflow.
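To illustrate the shape-propagation part of the request, here is a minimal sketch (the helper name `shard_view_shape` and the head-axis convention are assumptions for illustration, not the actual AutoDeploy API): once a projection is head-sharded across `tp_size` ranks, every downstream view/reshape that materializes the head dimension must have its target shape rewritten, or the subgraph becomes shape-inconsistent.

```python
# Hypothetical helper, not the real AutoDeploy transform: rewrite the
# target shape of a view/reshape node after head-sharding, dividing the
# head dimension by the tensor-parallel world size.

def shard_view_shape(shape, head_axis, tp_size):
    """Return the per-rank view shape for a head-sharded tensor.

    shape:     full (unsharded) target shape of the view node
    head_axis: index of the num_heads dimension in that shape
    tp_size:   number of tensor-parallel ranks the heads are split over
    """
    num_heads = shape[head_axis]
    assert num_heads % tp_size == 0, "heads must divide evenly across ranks"
    per_rank = list(shape)
    per_rank[head_axis] = num_heads // tp_size
    return tuple(per_rank)

# Example: a [batch, seq, num_heads, head_dim] view after a sharded q-projection.
full_shape = (2, 128, 16, 64)  # 16 heads on a single GPU
per_rank_shape = shard_view_shape(full_shape, head_axis=2, tp_size=4)
# per_rank_shape == (2, 128, 4, 64): each of the 4 ranks sees 4 heads
```

A real graph transform would apply this rewrite to every affected shape/split/view node along the dataflow from the sharded linear node, rather than to a single shape in isolation.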
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
Status: Done