Commit 9ca5472

fix star history (#95)
1 parent: 2670969

1 file changed: README.md (24 additions & 22 deletions)
````diff
@@ -6,10 +6,10 @@
 <p align="center"><a href="https://github.com/NUS-HPC-AI-Lab/OpenDiT">[Homepage]</a> | <a href="https://discord.gg/6UzVWm9a">[Discord]</a> | <a href="./figure/wechat.png">[WeChat]</a> | <a href="https://twitter.com/YangYou1991/status/1762447718105170185">[Twitter]</a> | <a href="https://zhuanlan.zhihu.com/p/684457582">[Zhihu]</a> | <a href="https://mp.weixin.qq.com/s/IBb9vlo8hfYKrj9ztxkhjg">[Media]</a></p>
 </p>
 
-### Latest News 🔥
+### Latest News 🔥
 
-* [2024/03/01] Support DiT-based Latte for text-to-video generation.
-* [2024/02/27] Officially release OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference.
+- [2024/03/01] Support DiT-based Latte for text-to-video generation.
+- [2024/02/27] Officially release OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference.
 
 # About
 
@@ -18,18 +18,18 @@ OpenDiT is an open-source project that provides a high-performance implementatio
 OpenDiT boasts high performance thanks to the following techniques:
 
 1. Up to 80% speedup and 50% memory reduction on GPU
-   * Kernel optimization including FlashAttention, Fused AdaLN, and Fused layernorm kernel.
-   * Hybrid parallelism methods including ZeRO, Gemini, and DDP. Also, sharding the ema model further reduces the memory cost.
+   - Kernel optimization including FlashAttention, Fused AdaLN, and Fused layernorm kernel.
+   - Hybrid parallelism methods including ZeRO, Gemini, and DDP. Also, sharding the ema model further reduces the memory cost.
 2. FastSeq: A novel sequence parallelism method
-   * Specially designed for DiT-like workloads where the activation size is large but the parameter size is small.
-   * Up to 48% communication savings for intra-node sequence parallelism.
-   * Break the memory limitation of a single GPU and reduce the overall training and inference time.
+   - Specially designed for DiT-like workloads where the activation size is large but the parameter size is small.
+   - Up to 48% communication savings for intra-node sequence parallelism.
+   - Break the memory limitation of a single GPU and reduce the overall training and inference time.
 3. Ease of use
-   * Huge performance gains with only a few line changes
-   * Users do not need to know the implementation of distributed training.
+   - Huge performance gains with only a few line changes
+   - Users do not need to know the implementation of distributed training.
 4. Complete pipeline of text-to-image and text-to-video generation
-   * Researchers and engineers can easily use and adapt our pipeline to real-world applications without modifying the parallel part.
-   * Verify the accuracy of OpenDiT with text-to-image training on ImageNet and release the checkpoint.
+   - Researchers and engineers can easily use and adapt our pipeline to real-world applications without modifying the parallel part.
+   - Verify the accuracy of OpenDiT with text-to-image training on ImageNet and release the checkpoint.
 
 <p align="center">
 <img width="600px" alt="end2end" src="./figure/end2end.png">
@@ -43,9 +43,9 @@ More features are coming soon!
 
 Prerequisites:
 
-- Python >= 3.10
-- PyTorch >= 1.13 (We recommend using a version > 2.0)
-- CUDA >= 11.6
+- Python >= 3.10
+- PyTorch >= 1.13 (We recommend using a version > 2.0)
+- CUDA >= 11.6
 
 We strongly recommend using Anaconda to create a new environment (Python >= 3.10) to run our examples:
 
@@ -87,7 +87,6 @@ git checkout 741bdf50825a97664db08574981962d66436d16a
 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"
 ```
 
-
 ## Usage
 
 ### Image
@@ -105,6 +104,7 @@ torchrun --standalone --nproc_per_node=2 train.py \
 ```
 
 We disable all speedup methods by default. Here are details of some key arguments for training:
+
 - `--nproc_per_node`: The number of GPUs to use on the current node.
 - `--plugin`: The booster plugin used by ColossalAI, `zero2` and `ddp` are supported. The default value is `zero2`. We recommend enabling `zero2`.
 - `--mixed_precision`: The data type for mixed precision training. The default value is `bf16`.
@@ -116,7 +116,6 @@ We disable all speedup methods by default. Here are details of some key argument
 - `--load`: Load a previously saved checkpoint directory and continue training.
 - `--num_classes`: Label class number. Should be 10 for CIFAR10 and 1000 for ImageNet. Only used for label-to-image generation.
 
-
 For more details on the configuration of the training process, please visit our code.
 
 <b>Multi-Node Training.</b>
@@ -149,11 +148,14 @@ python sample.py \
     --num_classes 10 \
     --ckpt ckpt_path
 ```
+
 Here are details of some additional key arguments for inference:
+
 - `--ckpt`: The weight of the ema model `ema.pt`. To check your training progress, it can also be our saved base model `epochXX-global_stepXX/model`, which will produce better results than the ema model in the early training stage.
 - `--num_classes`: Label class number. Should be 10 for CIFAR10, and 1000 for ImageNet (including official and our checkpoint).
 
 ### Video
+
 <b>Training.</b> We currently support `VDiT` and `Latte` for video generation. VDiT adopts the DiT structure and uses video as input data. Latte further uses more efficient spatial & temporal blocks based on VDiT (not exactly aligned with the original [Latte](https://github.com/Vchitect/Latte)).
 
 Our video training pipeline is a faithful implementation, and we encourage you to explore your own strategies using OpenDiT. You can train the video DiT model by executing the following command:
@@ -203,8 +205,9 @@ Inference tips: 1) EMA model requires quite long time to converge and produce me
 ![fastseq_overview](./figure/fastseq_overview.png)
 
 In the realm of visual generation models, such as DiT, sequence parallelism is indispensable for effective long-sequence training and low-latency inference. Two key features can summarize the distinctive nature of these tasks:
-* The model's parameter count is smaller than that of LLMs, but the sequence can be very long, making communication a bottleneck.
-* As the model size is relatively small, it only needs sequence parallelism within a node.
+
+- The model's parameter count is smaller than that of LLMs, but the sequence can be very long, making communication a bottleneck.
+- As the model size is relatively small, it only needs sequence parallelism within a node.
 
 However, existing methods like DeepSpeed-Ulysses and Megatron-LM Sequence Parallelism face limitations when applied to such tasks. They either introduce excessive sequence communication or lack efficiency in handling small-scale sequence parallelism.
 
@@ -214,7 +217,6 @@ Here are the results of our experiments, more results will be coming soon:
 
 ![fastseq_exp](./figure/fastseq_exp.png)
 
-
 ## DiT Reproduction Result
 
 We have trained DiT using the original method with OpenDiT to verify our accuracy. We have trained the model from scratch on ImageNet for 80k steps on 8xA100. Here are some results generated by our trained DiT:
@@ -237,7 +239,6 @@ torchrun --standalone --nproc_per_node=8 train.py \
     --num_classes 1000
 ```
 
-
 ## Acknowledgement
 
 We extend our gratitude to [Zangwei Zheng](https://zhengzangw.github.io/) for providing valuable insights into algorithms and aiding in the development of the video pipeline. Additionally, we acknowledge [Shenggan Cheng](https://shenggan.github.io/) for his guidance on code optimization and parallelism. Our appreciation also goes to [Fuzhao Xue](https://xuefuzhao.github.io/), [Shizun Wang](https://littlepure2333.github.io/home/), [Yuchao Gu](https://ycgu.site/), [Shenggui Li](https://franklee.xyz/), and [Haofan Wang](https://haofanwang.github.io/) for their invaluable advice and contributions.
@@ -249,6 +250,7 @@ This codebase borrows from [Meta's DiT](https://github.com/facebookresearch/DiT)
 If you encounter problems using OpenDiT or have a feature request, feel free to create an issue! We also welcome pull requests from the community.
 
 ## Citation
+
 ```
 @misc{zhao2024opendit,
     author = {Xuanlei Zhao, Zhongkai Zhao, Ziming Liu, Haotian Zhou, Qianli Ma, and Yang You},
@@ -262,4 +264,4 @@ If you encounter problems using OpenDiT or have a feature request, feel free to
 
 ## Star History
 
-[![Star History Chart](https://api.star-history.com/repos=NUS-HPC-AI-Lab/OpenDiT&type=Date)](https://star-history.com/#NUS-HPC-AI-Lab/OpenDiT&Date)
+[![Star History Chart](https://api.star-history.com/svg?repos=NUS-HPC-AI-Lab/OpenDiT&type=Date)](https://star-history.com/#NUS-HPC-AI-Lab/OpenDiT&Date)
````
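The headline fix sits in the final hunk: the old badge URL dropped both the `svg` path segment and the `?` that starts the query string, so the chart image presumably failed to render; the commit restores them. The working pattern, generalized with a hypothetical `owner/repo` slug (the placeholder is ours; the actual diff uses `NUS-HPC-AI-Lab/OpenDiT`):

```markdown
[![Star History Chart](https://api.star-history.com/svg?repos=owner/repo&type=Date)](https://star-history.com/#owner/repo&Date)
```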
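As a side note, the image-training hunks above document `train.py` flags (`--nproc_per_node`, `--plugin`, `--mixed_precision`, `--num_classes`). A sketch of how they compose into a launch command, using only values that appear in the diff; the combination itself is illustrative, not part of this commit:

```bash
# Sketch only: flags are those documented in the README diff above;
# the values (2 GPUs, zero2 plugin, bf16, 10 classes for CIFAR10)
# are assumptions for illustration.
torchrun --standalone --nproc_per_node=2 train.py \
    --plugin zero2 \
    --mixed_precision bf16 \
    --num_classes 10
```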
