@@ -22,7 +22,10 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac
 
 ## Project Updates
 
-- 🔥🔥 **News**: ```2024/10/13```: A more cost-effective fine-tuning framework for `CogVideoX-5B` that works with a single
+- 🔥🔥 **News**: ```2024/11/08```: We have released CogVideoX1.5, an upgraded version of the open-source CogVideoX model.
+  The CogVideoX1.5-5B series supports 10-second videos at higher resolution, and CogVideoX1.5-5B-I2V supports video generation at any resolution.
+  The SAT code has already been updated; the diffusers version is still being adapted. Download the SAT version code [here](https://huggingface.co/THUDM/CogVideoX1.5-5B-SAT).
+- 🔥 **News**: ```2024/10/13```: A more cost-effective fine-tuning framework for `CogVideoX-5B` that works with a single
   4090 GPU, [cogvideox-factory](https://github.com/a-r-r-o-w/cogvideox-factory), has been released. It supports
   fine-tuning with multiple resolutions. Feel free to use it!
 - 🔥 **News**: ```2024/10/10```: We have updated our technical report. Please
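The CogVideoX1.5 entry above links to the SAT-format weights on Hugging Face. As a convenience, the snippet below is a minimal, hedged sketch of fetching that repository with `huggingface_hub`; the `local_dir` path is an arbitrary placeholder, not a location the project mandates.

```python
# Minimal sketch: download the CogVideoX1.5-5B-SAT weights referenced in the
# news item above. Requires `pip install huggingface_hub`; the target folder
# name below is an arbitrary choice for this example.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="THUDM/CogVideoX1.5-5B-SAT",    # repo id taken from the link in the news item
    local_dir="ckpts/CogVideoX1.5-5B-SAT",  # placeholder download location
)
```

For actual SAT inference and fine-tuning steps, follow the instructions in the `sat` directory of this repository.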
@@ -68,7 +71,6 @@ Jump to a specific section:
 - [Tools](#tools)
 - [Introduction to CogVideo(ICLR'23) Model](#cogvideoiclr23)
 - [Citations](#Citation)
-- [Open Source Project Plan](#Open-Source-Project-Plan)
 - [Model License](#Model-License)
 
 ## Quick Start
@@ -172,67 +174,71 @@ models we currently offer, along with their foundational information.
     <th style="text-align: center;">CogVideoX-2B</th>
     <th style="text-align: center;">CogVideoX-5B</th>
     <th style="text-align: center;">CogVideoX-5B-I2V</th>
+    <th style="text-align: center;">CogVideoX1.5-5B</th>
+    <th style="text-align: center;">CogVideoX1.5-5B-I2V</th>
   </tr>
   <tr>
-    <td style="text-align: center;">Model Description</td>
-    <td style="text-align: center;">Entry-level model, balancing compatibility. Low cost for running and secondary development.</td>
-    <td style="text-align: center;">Larger model with higher video generation quality and better visual effects.</td>
-    <td style="text-align: center;">CogVideoX-5B image-to-video version.</td>
+    <td style="text-align: center;">Release Date</td>
+    <td style="text-align: center;">August 6, 2024</td>
+    <td style="text-align: center;">August 27, 2024</td>
+    <td style="text-align: center;">September 19, 2024</td>
+    <td style="text-align: center;">November 8, 2024</td>
+    <td style="text-align: center;">November 8, 2024</td>
+  </tr>
+  <tr>
+    <td style="text-align: center;">Video Resolution</td>
+    <td colspan="3" style="text-align: center;">720 * 480</td>
+    <td colspan="1" style="text-align: center;">1360 * 768</td>
+    <td colspan="1" style="text-align: center;">256 <= W <= 1360<br>256 <= H <= 768<br>W, H % 16 == 0</td>
   </tr>
   <tr>
     <td style="text-align: center;">Inference Precision</td>
     <td style="text-align: center;"><b>FP16*(recommended)</b>, BF16, FP32, FP8*, INT8, not supported: INT4</td>
-    <td colspan="2" style="text-align: center;"><b>BF16 (recommended)</b>, FP16, FP32, FP8*, INT8, not supported: INT4</td>
+    <td colspan="2" style="text-align: center;"><b>BF16(recommended)</b>, FP16, FP32, FP8*, INT8, not supported: INT4</td>
+    <td colspan="2" style="text-align: center;"><b>BF16</b></td>
   </tr>
   <tr>
-    <td style="text-align: center;">Single GPU Memory Usage<br></td>
-    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB <br><b>diffusers FP16: from 4GB* </b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td>
-    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB <br><b>diffusers BF16: from 5GB* </b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td>
+    <td style="text-align: center;">Single GPU Memory Usage</td>
+    <td style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> FP16: 18GB<br><b>diffusers FP16: from 4GB*</b><br><b>diffusers INT8 (torchao): from 3.6GB*</b></td>
+    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 26GB<br><b>diffusers BF16: from 5GB*</b><br><b>diffusers INT8 (torchao): from 4.4GB*</b></td>
+    <td colspan="2" style="text-align: center;"><a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a> BF16: 66GB</td>
   </tr>
   <tr>
-    <td style="text-align: center;">Multi-GPU Inference Memory Usage</td>
+    <td style="text-align: center;">Multi-GPU Memory Usage</td>
     <td style="text-align: center;"><b>FP16: 10GB* using diffusers</b><br></td>
     <td colspan="2" style="text-align: center;"><b>BF16: 15GB* using diffusers</b><br></td>
+    <td colspan="2" style="text-align: center;"><b>Not supported</b><br></td>
   </tr>
   <tr>
     <td style="text-align: center;">Inference Speed<br>(Step = 50, FP/BF16)</td>
     <td style="text-align: center;">Single A100: ~90 seconds<br>Single H100: ~45 seconds</td>
     <td colspan="2" style="text-align: center;">Single A100: ~180 seconds<br>Single H100: ~90 seconds</td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">Fine-tuning Precision</td>
-    <td style="text-align: center;"><b>FP16</b></td>
-    <td colspan="2" style="text-align: center;"><b>BF16</b></td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">Fine-tuning Memory Usage</td>
-    <td style="text-align: center;">47 GB (bs=1, LORA)<br> 61 GB (bs=2, LORA)<br> 62GB (bs=1, SFT)</td>
-    <td style="text-align: center;">63 GB (bs=1, LORA)<br> 80 GB (bs=2, LORA)<br> 75GB (bs=1, SFT)<br></td>
-    <td style="text-align: center;">78 GB (bs=1, LORA)<br> 75GB (bs=1, SFT, 16GPU)<br></td>
+    <td colspan="2" style="text-align: center;">Single A100: ~1000 seconds (5-second video)<br>Single H100: ~550 seconds (5-second video)</td>
   </tr>
   <tr>
     <td style="text-align: center;">Prompt Language</td>
-    <td colspan="3" style="text-align: center;">English*</td>
+    <td colspan="5" style="text-align: center;">English*</td>
   </tr>
   <tr>
-    <td style="text-align: center;">Maximum Prompt Length</td>
+    <td style="text-align: center;">Prompt Token Limit</td>
     <td colspan="3" style="text-align: center;">226 Tokens</td>
+    <td colspan="2" style="text-align: center;">224 Tokens</td>
   </tr>
   <tr>
     <td style="text-align: center;">Video Length</td>
-    <td colspan="3" style="text-align: center;">6 Seconds</td>
+    <td colspan="3" style="text-align: center;">6 seconds</td>
+    <td colspan="2" style="text-align: center;">5 or 10 seconds</td>
   </tr>
   <tr>
     <td style="text-align: center;">Frame Rate</td>
-    <td colspan="3" style="text-align: center;">8 Frames / Second</td>
+    <td colspan="3" style="text-align: center;">8 frames / second</td>
+    <td colspan="2" style="text-align: center;">16 frames / second</td>
   </tr>
   <tr>
-    <td style="text-align: center;">Video Resolution</td>
-    <td colspan="3" style="text-align: center;">720 x 480, no support for other resolutions (including fine-tuning)</td>
-  </tr>
-  <tr>
-    <td style="text-align: center;">Position Encoding</td>
+    <td style="text-align: center;">Positional Encoding</td>
+    <td style="text-align: center;">3d_sincos_pos_embed</td>
     <td style="text-align: center;">3d_sincos_pos_embed</td>
+    <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
     <td style="text-align: center;">3d_sincos_pos_embed</td>
     <td style="text-align: center;">3d_rope_pos_embed + learnable_pos_embed</td>
   </tr>
@@ -241,10 +247,12 @@ models we currently offer, along with their foundational information.
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-2b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-2b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-2b">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b">🟣 WiseModel</a></td>
     <td style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX-5b-I2V">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX-5b-I2V">🟣 WiseModel</a></td>
+    <td colspan="2" style="text-align: center;">Coming Soon</td>
   </tr>
   <tr>
     <td style="text-align: center;">Download Link (SAT)</td>
-    <td colspan="3" style="text-align: center;"><a href="./sat/README.md">SAT</a></td>
+    <td colspan="3" style="text-align: center;"><a href="./sat/README_zh.md">SAT</a></td>
+    <td colspan="2" style="text-align: center;"><a href="https://huggingface.co/THUDM/CogVideoX1.5-5b-SAT">🤗 HuggingFace</a><br><a href="https://modelscope.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🤖 ModelScope</a><br><a href="https://wisemodel.cn/models/ZhipuAI/CogVideoX1.5-5b-SAT">🟣 WiseModel</a></td>
   </tr>
 </table>
 
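For context on the `diffusers` memory figures in the table above (for example, "diffusers BF16: from 5GB*" for CogVideoX-5B), the following is a minimal text-to-video sketch using the `CogVideoXPipeline` from `diffusers`. The prompt, seed, and output filename are placeholders, and the actual memory footprint depends on your environment; CogVideoX-5B is used here because, as noted in the news above, diffusers support for the CogVideoX1.5 models is still being adapted.

```python
# Minimal sketch of memory-efficient CogVideoX-5B inference with diffusers.
# The offload/slicing/tiling calls are the optimizations behind the
# "from 5GB*" figure in the table; exact usage varies by environment.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

pipe.enable_sequential_cpu_offload()  # keep only the active sub-module on the GPU
pipe.vae.enable_slicing()             # decode the latent batch slice by slice
pipe.vae.enable_tiling()              # decode each frame in tiles to bound VAE memory

video = pipe(
    prompt="A panda playing a guitar in a bamboo forest",  # placeholder prompt
    num_inference_steps=50,  # matches the "Step = 50" row of the table
    num_frames=49,           # roughly 6 seconds at 8 frames / second
    guidance_scale=6.0,
    generator=torch.Generator(device="cuda").manual_seed(42),  # placeholder seed
).frames[0]

export_to_video(video, "output.mp4", fps=8)  # placeholder output path
```

The table also states that CogVideoX1.5-5B-I2V accepts any resolution with 256 <= W <= 1360, 256 <= H <= 768, and W, H divisible by 16. A hypothetical helper that snaps an arbitrary input size to those bounds, for illustration only and not part of the repository:

```python
def snap_resolution(width: int, height: int) -> tuple[int, int]:
    """Clamp (width, height) to the CogVideoX1.5-5B-I2V range in the table
    and round down to multiples of 16. Illustrative helper, not project code."""
    def snap(value: int, low: int, high: int) -> int:
        value = max(low, min(high, value))
        return value - (value % 16)
    return snap(width, 256, 1360), snap(height, 256, 768)

print(snap_resolution(1920, 1080))  # -> (1360, 768)
```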
@@ -422,7 +430,7 @@ hands-on practice on text-to-video generation. *The original input is in Chinese
 
 We welcome your contributions! You can click [here](resources/contribute.md) for more information.
 
-## License Agreement
+## Model-License
 
 The code in this repository is released under the [Apache 2.0 License](LICENSE).
 