Commit ddd3dcd

Merge pull request #469 from THUDM/CogVideoX_dev
CogVideoX1.5-SAT
2 parents 075fad4 + f1f539a

19 files changed: +1379 −717 lines

README.md

Lines changed: 41 additions & 33 deletions
@@ -22,7 +22,10 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac
 
 ## Project Updates
 
-- 🔥🔥 **News**: ```2024/10/13```: A more cost-effective fine-tuning framework for `CogVideoX-5B` that works with a single
+- 🔥🔥 News: ```2024/11/08```: We have released the CogVideoX1.5 model. CogVideoX1.5 is an upgraded version of the open-source model CogVideoX.
+  The CogVideoX1.5-5B series supports 10-second videos with higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution.
+  The SAT code has already been updated, while the diffusers version is still under adaptation. Download the SAT version code [here](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT).
+- 🔥 **News**: ```2024/10/13```: A more cost-effective fine-tuning framework for `CogVideoX-5B` that works with a single
   4090 GPU, [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory), has been released. It supports
   fine-tuning with multiple resolutions. Feel free to use it!
 - 🔥 **News**: ```2024/10/10```: We have updated our technical report. Please
@@ -68,7 +71,6 @@ Jump to a specific section:
 - [Tools](#tools)
 - [Introduction to CogVideo(ICLR'23) Model](#cogvideoiclr23)
 - [Citations](#Citation)
-- [Open Source Project Plan](#Open-Source-Project-Plan)
 - [Model License](#Model-License)
 
 ## Quick Start
@@ -172,67 +174,71 @@ models we currently offer, along with their foundational information.
 <th style="text-align: center;">CogVideoX-2B</th>
 <th style="text-align: center;">CogVideoX-5B</th>
 <th style="text-align: center;">CogVideoX-5B-I2V</th>
+<th style="text-align: center;">CogVideoX1.5-5B</th>
+<th style="text-align: center;">CogVideoX1.5-5B-I2V</th>
 </tr>
 <tr>
-<td style="text-align: center;">Model Description</td>
-<td style="text-align: center;">Entry-level model, balancing compatibility. Low cost for running and secondary development.</td>
-<td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
-<td style="text-align: center;">CogVideoX-5B image-to-video version.</td>
+<td style="text-align: center;">Release Date</td>
+<th style="text-align: center;">August 6, 2024</th>
+<th style="text-align: center;">August 27, 2024</th>
+<th style="text-align: center;">September 19, 2024</th>
+<th style="text-align: center;">November 8, 2024</th>
+<th style="text-align: center;">November 8, 2024</th>
+</tr>
+<tr>
+<td style="text-align: center;">Video Resolution</td>
+<td colspan="3" style="text-align: center;">720 * 480</td>
+<td colspan="1" style="text-align: center;">1360 * 768</td>
+<td colspan="1" style="text-align: center;">256 <= W <=1360<br>256 <= H <=768<br> W,H % 16 == 0</td>
 </tr>
 <tr>
 <td style="text-align: center;">Inference Precision</td>
 <td style="text-align: center;"><b>FP16*(recommended)</b>, BF16, FP32, FP8*, INT8, not supported: INT4</td>
-<td colspan="2" style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8*, INT8, not supported: INT4</td>
+<td colspan="2" style="text-align: center;"><b>BF16(recommended)</b>, FP16, FP32, FP8*, INT8, not supported: INT4</td>
+<td colspan="2" style="text-align: center;"><b>BF16</b></td>
 </tr>
 <tr>
-<td style="text-align: center;">Single GPU Memory Usage<br></td>
-<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: from 4GB* </b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td>
-<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: from 5GB* </b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td>
+<td style="text-align: center;">Single GPU Memory Usage</td>
+<td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB<br><b>diffusers FP16: from 4GB*</b><br><b>diffusers INT8(torchao): from 3.6GB*</b></td>
+<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB<br><b>diffusers BF16 : from 5GB*</b><br><b>diffusers INT8(torchao): from 4.4GB*</b></td>
+<td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB<br></td>
 </tr>
 <tr>
-<td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
+<td style="text-align: center;">Multi-GPU Memory Usage</td>
 <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
 <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+<td colspan="2" style="text-align: center;"><b>Not supported</b><br></td>
 </tr>
 <tr>
 <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
 <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
 <td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
-</tr>
-<tr>
-<td style="text-align: center;">Fine-tuning Precision</td>
-<td style="text-align: center;"><b>FP16</b></td>
-<td colspan="2" style="text-align: center;"><b>BF16</b></td>
-</tr>
-<tr>
-<td style="text-align: center;">Fine-tuning Memory Usage</td>
-<td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
-<td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
-<td style="text-align: center;">78 GB (bs=1, LORA)<br> 75GB (bs=1, SFT, 16GPU)<br></td>
+<td colspan="2" style="text-align: center;">Single A100: ~1000 seconds (5-second video)<br>Single H100: ~550 seconds (5-second video)</td>
 </tr>
 <tr>
 <td style="text-align: center;">Prompt Language</td>
-<td colspan="3" style="text-align: center;">English*</td>
+<td colspan="5" style="text-align: center;">English*</td>
 </tr>
 <tr>
-<td style="text-align: center;">Maximum Prompt Length</td>
+<td style="text-align: center;">Prompt Token Limit</td>
 <td colspan="3" style="text-align: center;">226 Tokens</td>
+<td colspan="2" style="text-align: center;">224 Tokens</td>
 </tr>
 <tr>
 <td style="text-align: center;">Video Length</td>
-<td colspan="3" style="text-align: center;">6 Seconds</td>
+<td colspan="3" style="text-align: center;">6 seconds</td>
+<td colspan="2" style="text-align: center;">5 or 10 seconds</td>
 </tr>
 <tr>
 <td style="text-align: center;">Frame Rate</td>
-<td colspan="3" style="text-align: center;">8 Frames / Second</td>
+<td colspan="3" style="text-align: center;">8 frames / second</td>
+<td colspan="2" style="text-align: center;">16 frames / second</td>
 </tr>
 <tr>
-<td style="text-align: center;">Video Resolution</td>
-<td colspan="3" style="text-align: center;">720 x 480, no support for other resolutions (including fine-tuning)</td>
-</tr>
-<tr>
-<td style="text-align: center;">Position Encoding</td>
+<td style="text-align: center;">Positional Encoding</td>
+<td style="text-align: center;">3d_sincos_pos_embed</td>
 <td style="text-align: center;">3d_sincos_pos_embed</td>
+<td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
 <td style="text-align: center;">3d_sincos_pos_embed</td>
 <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
 </tr>
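The Video Length and Frame Rate rows above imply the nominal frame counts each model family generates. A minimal sketch of that arithmetic (the helper name is hypothetical, and it ignores any extra conditioning-frame conventions a particular checkpoint may use):

```python
def total_frames(fps: int, seconds: int) -> int:
    """Nominal frame count for a clip: frames = fps * seconds.

    Per the table: CogVideoX renders 6 s at 8 fps, while
    CogVideoX1.5 renders 5 or 10 s at 16 fps.
    """
    return fps * seconds

# CogVideoX: 6 s at 8 fps -> 48 frames
# CogVideoX1.5: 10 s at 16 fps -> 160 frames, 5 s -> 80 frames
```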
@@ -241,10 +247,12 @@ models we currently offer, along with their foundational information.
 <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
 <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
 <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
+<td colspan="2" style="text-align: center;"> Coming Soon </td>
 </tr>
 <tr>
 <td style="text-align: center;">Download Link (SAT)</td>
-<td colspan="3" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
+<td colspan="3" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
+<td colspan="2" style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5b-SAT">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🟣 WiseModel</a></td>
 </tr>
 </table>
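The Video Resolution row added for CogVideoX1.5-5B-I2V states the constraint 256 <= W <= 1360, 256 <= H <= 768, with W and H divisible by 16. That rule can be sketched as a quick validity check (the function name is hypothetical, not part of the repository's API):

```python
def is_valid_i2v_resolution(width: int, height: int) -> bool:
    """Check a (width, height) pair against the CogVideoX1.5-5B-I2V
    constraint from the table: 256 <= W <= 1360, 256 <= H <= 768,
    and both dimensions divisible by 16."""
    return (
        256 <= width <= 1360
        and 256 <= height <= 768
        and width % 16 == 0
        and height % 16 == 0
    )
```

Note that both 1360 x 768 (the fixed CogVideoX1.5-5B T2V resolution) and 720 x 480 (the older CogVideoX resolution) satisfy this check, since all four dimensions are multiples of 16 and within range.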

@@ -422,7 +430,7 @@ hands-on practice on text-to-video generation. *The original input is in Chinese
 
 We welcome your contributions! You can click [here](resources/contribute.md) for more information.
 
-## License Agreement
+## Model-License
 
 The code in this repository is released under the [Apache 2.0 License](LICENSE).
