[Feature]: Deepseek's MLA head parallelism support #9985

@greg-kwasniewski1

Description

🚀 The feature, motivation and pitch

Currently, the MLA layer is only simple-sharded: the attention computation is redundantly replicated across all GPUs. We need to properly support head parallelism for better performance. This involves updating not only the linear nodes but also the shape, split, and view nodes, so that sharded shapes are propagated through the subgraph's dataflow; a sketch of the idea is shown below.
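The following is a minimal, hypothetical sketch of the kind of transform involved, using torch.fx on a toy projection module rather than the real MLA subgraph or the actual AutoDeploy pass: the linear node is column-sharded across ranks, and the downstream `view` node that bakes in the global head count is rewritten to use the local head count. `ToyProjection` and `shard_heads` are illustrative names, not existing AutoDeploy APIs.

```python
# Illustrative sketch only: shard a head-parallel projection and propagate the
# sharded head count through downstream view nodes with torch.fx.
import torch
import torch.nn as nn
import torch.fx as fx


class ToyProjection(nn.Module):
    """Toy stand-in for an MLA-style projection: Linear followed by a view into heads."""

    def __init__(self, hidden: int = 64, num_heads: int = 8, head_dim: int = 16):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.proj = nn.Linear(hidden, num_heads * head_dim, bias=False)

    def forward(self, x):
        # x: [batch, seq, hidden] -> [batch, seq, num_heads, head_dim]
        y = self.proj(x)
        return y.view(x.size(0), x.size(1), self.num_heads, self.head_dim)


def shard_heads(model: ToyProjection, rank: int, world_size: int) -> fx.GraphModule:
    """Column-shard the projection across ranks and rewrite downstream view
    nodes so they use the local (sharded) head count."""
    assert model.num_heads % world_size == 0
    local_heads = model.num_heads // world_size
    gm = fx.symbolic_trace(model)

    # 1) Update the linear node: keep only this rank's slice of output rows.
    rows = local_heads * model.head_dim
    local_proj = nn.Linear(model.proj.in_features, rows, bias=False)
    local_proj.weight = nn.Parameter(
        model.proj.weight.detach()[rank * rows:(rank + 1) * rows, :].clone()
    )
    setattr(gm, "proj", local_proj)

    # 2) Propagate the sharded shape: view nodes that baked in the global
    #    head count must now use the local head count.
    for node in gm.graph.nodes:
        if node.op == "call_method" and node.target == "view":
            node.args = tuple(
                local_heads if a == model.num_heads else a for a in node.args
            )
    gm.recompile()
    return gm


if __name__ == "__main__":
    torch.manual_seed(0)
    m = ToyProjection()
    x = torch.randn(2, 4, 64)
    ref = m(x)  # [2, 4, 8, 16]: what every GPU computes today (replicated)
    parts = [shard_heads(m, r, world_size=2)(x) for r in range(2)]  # [2, 4, 4, 16] each
    # Concatenating the per-rank head slices recovers the replicated result.
    assert torch.allclose(torch.cat(parts, dim=2), ref, atol=1e-6)
```

Each rank would then run attention only over its local heads; the concatenation in the test stands in for the all-gather (or a row-sharded output projection) that would combine per-rank results in a real tensor-parallel setup.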

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • AutoDeploy: <NV> AutoDeploy Backend
  • feature request: New feature or request. This includes new model, dtype, functionality support

Projects

Status

Done
