-Training large-scale models for video generation presents two major challenges: (1) The extremely long context length of video tokens, which reaches up to 4 million during training, results in prohibitive computational and memory overhead. (2) The combination of block-causal attention and Packing-and-Padding (PnP) introduces highly complex attention mask patterns.
+Training large-scale models for video generation presents two major challenges: (1) The extremely long context length of video tokens, which reaches up to 4 million during training, results in prohibitive computational and memory overhead. (2) The combination of block-causal attention and Patch-and-Pack (PnP) introduces highly complex attention mask patterns.

To address these challenges, we propose [MagiAttention](https://github.com/SandAI-org/MagiAttention), which aims to support a wide variety of attention mask types with **kernel-level flexibility**, while achieving **linear scalability** with respect to context-parallel (CP) size across a broad range of scenarios. This makes it particularly suitable for training tasks with <u><em>ultra-long, heterogeneous mask</em></u> patterns, such as video generation for [Magi-1](https://github.com/SandAI-org/MAGI-1).

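To make the "highly complex attention mask patterns" concrete, here is a minimal PyTorch sketch that materializes a block-causal mask over several variable-length sequences packed into one context, as in the PnP setting above. It is purely illustrative: the function name and arguments are ours, not part of the MagiAttention API, and a real fused kernel would avoid materializing such a dense mask at all.

```python
import torch

def packed_block_causal_mask(seq_lens, block_size):
    """Boolean attention mask for variable-length sequences packed into one
    context, where each sequence uses block-causal attention: a query may
    attend to every key in its own block and in earlier blocks of the same
    sequence, but never across sequence boundaries.

    Illustrative sketch only -- not the MagiAttention API.
    """
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    offset = 0
    for length in seq_lens:
        # Block index of each position within this sequence.
        blocks = torch.arange(length) // block_size
        # Query in block i sees keys in blocks <= i (block-causal),
        # restricted to this sequence's span (packing boundary).
        sub = blocks.unsqueeze(1) >= blocks.unsqueeze(0)
        mask[offset:offset + length, offset:offset + length] = sub
        offset += length
    return mask

# Three packed sequences of different lengths with block size 4: the result
# is block-wise lower-triangular inside each diagonal span and zero elsewhere,
# which is neither a plain causal mask nor a simple varlen mask.
print(packed_block_causal_mask([6, 10, 3], block_size=4))
```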
@@ -102,7 +102,7 @@ Training large-scale autoregressive diffusion models like \magi for video genera
- The extremely long context length of video tokens, which reaches up to 4 million during training, results in prohibitive computational and memory overhead. Context-Parallelism (CP) is designed to handle such long-context workloads, but existing state-of-the-art CP methods<d-cite key="jacobs2023deepspeed,liu2023ringattentionblockwisetransformers,fang2024uspunifiedsequenceparallelism,gu2024loongtrainefficienttraininglongsequence,chen2024longvilascalinglongcontextvisual"></d-cite> face scalability limitations due to size constraints or the high communication overhead inherent in inefficient ring-style point-to-point (P2P) patterns (see the cost sketch after this list). While recent efforts<d-cite key="wang2024datacentricheterogeneityadaptivesequenceparallelism,zhang2024dcp,ge2025bytescaleefficientscalingllm"></d-cite> dynamically adjust CP sizes to avoid unnecessary sharding and redundant communication for shorter sequences, they still incur extra memory overhead for NCCL buffers and involve complex scheduling to balance loads and synchronize across different subsets of ranks.
-- The combination of block-causal attention and Packing-and-Padding (PnP) introduces highly complex attention mask patterns with variable sequence lengths, which cannot be efficiently handled by existing attention implementations.
+- The combination of block-causal attention and Patch-and-Pack (PnP)<d-cite key="dehghani2023patchnpacknavit"></d-cite> introduces highly complex attention mask patterns with variable sequence lengths, which cannot be efficiently handled by existing attention implementations.

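As a rough illustration of the communication concern raised in the first bullet above, the sketch below estimates how many ring-style P2P rounds and how much KV traffic each rank incurs for a given sequence length and CP size. The cost model, names, and default tensor shapes are our own simplification for illustration, not MagiAttention's scheduler or any library API.

```python
from dataclasses import dataclass

@dataclass
class RingCPCost:
    """Per-rank cost estimate for one attention layer under a naive
    ring-style P2P context-parallel (CP) schedule. Simplified model,
    for illustration only."""
    tokens_per_rank: int
    p2p_rounds: int      # each rank forwards KV shards cp_size - 1 times
    kv_bytes_sent: int   # total KV bytes each rank puts on the wire

def ring_cp_cost(seq_len: int, cp_size: int, num_heads: int = 32,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> RingCPCost:
    tokens_per_rank = (seq_len + cp_size - 1) // cp_size
    # One KV shard = keys + values for this rank's tokens.
    kv_shard_bytes = 2 * tokens_per_rank * num_heads * head_dim * bytes_per_elem
    rounds = cp_size - 1
    return RingCPCost(tokens_per_rank, rounds, rounds * kv_shard_bytes)

# A 4M-token video sample keeps every rank busy, but a short 8K-token sample
# sharded over the same CP group still pays cp_size - 1 P2P rounds for tiny
# shards -- the redundant communication that dynamic-CP methods try to avoid
# at the price of extra buffers and scheduling complexity.
for seq_len in (4_000_000, 8_192):
    for cp in (8, 32):
        c = ring_cp_cost(seq_len, cp)
        print(f"seq={seq_len:>9,} cp={cp:>2}: {c.tokens_per_rank:>8,} tok/rank, "
              f"{c.p2p_rounds:>2} rounds, {c.kv_bytes_sent/2**20:,.1f} MiB KV sent")
```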
To address the aforementioned challenges, we propose MagiAttention, which aims to support a wide variety of attention mask types (\emph{i.e.} kernel flexibility) while achieving linear scalability with respect to context-parallel (CP) size across a broad range of scenarios. Achieving this goal depends on meeting the following fundamental conditions:
@@ -389,29 +389,31 @@ comming soon ...
## Future Work
-comming soon ...
+For now, please check [RoadMap](https://github.com/SandAI-org/MagiAttention?tab=readme-ov-file#roadmap-%EF%B8%8F).
## FAQ
coming soon ...
+
## Acknowledgement
We are grateful to the contributors listed below for their valuable contributions during the early stages of MagiAttention.
-
| Member | Affiliations | Email | GitHub Account |
assets/bibliography/magiattn.bib: +10 lines changed (10 additions & 0 deletions)
@@ -230,4 +230,14 @@ @article{xu2024chatqa
author={Xu, Peng and Ping, Wei and Wu, Xianchao and Liu, Zihan and Shoeybi, Mohammad and Catanzaro, Bryan},
journal={arXiv preprint arXiv:2407.14482},
year={2024}
}
+
+@misc{dehghani2023patchnpacknavit,
+  title={Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution},
+  author={Mostafa Dehghani and Basil Mustafa and Josip Djolonga and Jonathan Heek and Matthias Minderer and Mathilde Caron and Andreas Steiner and Joan Puigcerver and Robert Geirhos and Ibrahim Alabdulmohsin and Avital Oliver and Piotr Padlewski and Alexey Gritsenko and Mario Lučić and Neil Houlsby},