
Commit e6ee283

Merge branch 'CogVideoX_dev' of github.com:THUDM/CogVideo into CogVideoX_dev
2 parents: e169e7b + ff87660

File tree

6 files changed: +114 −27 lines


README.md

Lines changed: 12 additions & 0 deletions
@@ -18,10 +18,20 @@ Experience the CogVideoX-5B model online at <a href="https://huggingface.co/spac
 </p>
 <p align="center">
 📍 Visit <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">QingYing</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
+
+We have publicly shared the Feishu <a href="https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh">technical documentation</a> on CogVideoX fine-tuning scenarios, aiming to further increase the flexibility of distribution. All examples in the public documentation can be fully replicated.
+
+CogVideoX fine-tuning is divided into SFT and LoRA fine-tuning. Based on our publicly available data processing scripts, you can more easily align specific styles in vertical scenarios. We provide guidance for ablation experiments on character image (IP) and scene style, further reducing the difficulty of replicating fine-tuning tasks.
+
+We look forward to creative explorations and contributions.
 </p>
 
 ## Project Updates
 
+- 🔥🔥 **News**: ```2024/10/10```: We have updated our technical report, including more training details and demos.
+
+- 🔥🔥 **News**: ```2024/10/09```: We have publicly released the [technical documentation](https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh) for CogVideoX fine-tuning on Feishu, further increasing distribution flexibility. All examples in the public documentation can be fully reproduced.
+
 - 🔥🔥 **News**: ```2024/9/25```: CogVideoX web demo is available on Replicate. Try the text-to-video model **CogVideoX-5B** here [![Replicate](https://replicate.com/chenxwh/cogvideox-t2v/badge)](https://replicate.com/chenxwh/cogvideox-t2v) and image-to-video model **CogVideoX-5B-I2V** here [![Replicate](https://replicate.com/chenxwh/cogvideox-i2v/badge)](https://replicate.com/chenxwh/cogvideox-i2v).
 - 🔥🔥 **News**: ```2024/9/19```: We have open-sourced the CogVideoX series image-to-video model **CogVideoX-5B-I2V**.
 This model can take an image as a background input and generate a video combined with prompt words, offering greater
@@ -294,6 +304,8 @@ works have already been adapted for CogVideoX, and we invite everyone to use the
 Space image provided by community members.
 + [Interior Design Fine-Tuning Model](https://huggingface.co/collections/bertjiazheng/koolcogvideox-66e4762f53287b7f39f8f3ba):
 is a fine-tuned model based on CogVideoX, specifically designed for interior design.
++ [xDiT](https://github.com/xdit-project/xDiT): xDiT is a scalable inference engine for Diffusion Transformers (DiTs)
+on multiple GPU clusters. xDiT supports real-time image and video generation services.
 
 ## Project Structure
 
README_ja.md

Lines changed: 9 additions & 1 deletion
@@ -17,11 +17,18 @@
 👋 Join <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/dCGfUsagrD" target="_blank">Discord</a>
 </p>
 <p align="center">
-📍 Visit <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">QingYing</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models
+📍 Visit <a href="https://chatglm.cn/video?lang=en?fr=osm_cogvideo">QingYing</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
+Optimizing the generation model is a very important direction for further energizing the ecosystem community around CogVideoX video generation. We have published CogVideoX fine-tuning scenarios in a <a href="https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh">technical document</a> on Feishu; to further increase the flexibility of distribution, all published examples can be fully reproduced.
+
+CogVideoX fine-tuning is divided into SFT and LoRA fine-tuning. With the publicly available data processing scripts, you can more easily achieve style alignment in specific domains. We also provide guidance for ablation experiments on character image (IP) and scene style, further lowering the difficulty of reproducing fine-tuning tasks. We look forward to more creative explorations joining in.
 </p>
 
 ## Updates and News
 
+- 🔥🔥 **News**: ```2024/10/10```: We have updated our technical report, adding more detailed training information and demos.
+
+- 🔥🔥 **News**: ```2024/10/09```: We have published a CogVideoX fine-tuning guide in the [technical documentation](https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh) on Feishu. To further increase distribution flexibility, all examples in the public documentation can be fully reproduced.
+
 - 🔥🔥 **News**: ```2024/9/19```: We have open-sourced the CogVideoX series image-to-video model **CogVideoX-5B-I2V**.
 This model can take an image as a background input and generate a video together with prompt words, offering greater controllability. With this, the CogVideoX series now supports three tasks: text-to-video, video continuation, and image-to-video.
 Try it [online](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
@@ -271,6 +278,7 @@ pipe.vae.enable_tiling()
 + [AutoDL Image](https://www.codewithgpu.com/i/THUDM/CogVideo/CogVideoX-5b-demo): a one-click deployment Huggingface
 Space image provided by community members.
 + [Interior Design Fine-Tuning Model](https://huggingface.co/collections/bertjiazheng/koolcogvideox-66e4762f53287b7f39f8f3ba): a fine-tuned model built on CogVideoX, designed specifically for interior design.
++ [xDiT](https://github.com/xdit-project/xDiT): xDiT is an engine for parallel inference of DiTs across multiple GPU clusters. xDiT supports real-time image and video generation services.
 
 ## Project Structure
 
README_zh.md

Lines changed: 10 additions & 0 deletions
@@ -19,10 +19,18 @@
 </p>
 <p align="center">
 📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">QingYing</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
+
+We have published CogVideoX fine-tuning guidance in a <a href="https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh">technical document</a> on Feishu to further increase distribution flexibility; all examples in the public documentation can be fully reproduced.
+
+CogVideoX fine-tuning is divided into SFT and LoRA fine-tuning. With our publicly available data processing scripts, you can more conveniently achieve style alignment in vertical scenarios. We provide guidance for ablation experiments on character image (IP) and scene style, further reducing the difficulty of reproducing fine-tuning tasks.
+We look forward to more creative explorations joining in.
 </p>
 
 ## Project Updates
 
+- 🔥🔥 **News**: ```2024/10/10```: We have updated our technical report with more training details and demos.
+
+- 🔥🔥 **News**: ```2024/10/09```: We have published CogVideoX fine-tuning guidance in the [technical documentation](https://zhipu-ai.feishu.cn/wiki/DHCjw1TrJiTyeukfc9RceoSRnCh) on Feishu to further increase distribution flexibility; all examples in the public documentation can be fully reproduced.
 - 🔥🔥 **News**: ```2024/9/19```: We have open-sourced the CogVideoX series image-to-video model **CogVideoX-5B-I2V**.
 This model can take an image as a background input and generate a video together with a prompt, offering stronger controllability.
 The CogVideoX series now supports three tasks: text-to-video, video continuation, and image-to-video. Try it [online](https://huggingface.co/spaces/THUDM/CogVideoX-5B-Space).
@@ -256,6 +264,8 @@ pipe.vae.enable_tiling()
 + [AutoDL Image](https://www.codewithgpu.com/i/THUDM/CogVideo/CogVideoX-5b-demo): a one-click deployment Huggingface
 Space image provided by community members.
 + [Interior Design Fine-Tuning Model](https://huggingface.co/collections/bertjiazheng/koolcogvideox-66e4762f53287b7f39f8f3ba): a fine-tuned model based on CogVideoX, designed specifically for interior design.
++ [xDiT](https://github.com/xdit-project/xDiT): xDiT is an engine for parallel inference of DiTs across multiple GPU clusters. xDiT supports real-time image and video generation services.
+
 
 ## Full Project Code Structure
 
finetune/train_cogvideox_lora.py

Lines changed: 2 additions & 2 deletions
@@ -39,7 +39,7 @@
 from diffusers.pipelines.cogvideo.pipeline_cogvideox import get_resize_crop_region_for_grid
 from diffusers.training_utils import (
     cast_training_params,
-    clear_objs_and_retain_memory,
+    free_memory,
 )
 from diffusers.utils import check_min_version, convert_unet_state_dict_to_peft, export_to_video, is_wandb_available
 from diffusers.utils.hub_utils import load_or_create_model_card, populate_model_card
@@ -725,7 +725,7 @@ def log_validation(
             }
         )
 
-    clear_objs_and_retain_memory([pipe])
+    free_memory()
 
     return videos

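The change above swaps diffusers' removed `clear_objs_and_retain_memory` helper for `free_memory`, which takes no arguments. A minimal sketch of the post-validation cleanup pattern, assuming a diffusers version that exports `free_memory` from `diffusers.training_utils` (the `pipe` name is only illustrative):

```python
# Sketch only: how the new cleanup call is typically used after validation.
import torch
from diffusers import CogVideoXPipeline
from diffusers.training_utils import free_memory

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# ... run validation generations with pipe ...

# The removed helper accepted the objects to drop; free_memory() takes none,
# so drop references first, then let it run garbage collection and empty the
# accelerator cache.
del pipe
free_memory()
```
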
inference/gradio_composite_demo/app.py

Lines changed: 7 additions & 5 deletions
@@ -37,13 +37,15 @@
 
 device = "cuda" if torch.cuda.is_available() else "cpu"
 
+MODEL = "THUDM/CogVideoX-5b"
+
 hf_hub_download(repo_id="ai-forever/Real-ESRGAN", filename="RealESRGAN_x4.pth", local_dir="model_real_esran")
 snapshot_download(repo_id="AlexWortega/RIFE", local_dir="model_rife")
 
-pipe = CogVideoXPipeline.from_pretrained("/share/official_pretrains/hf_home/CogVideoX-5b", torch_dtype=torch.bfloat16).to(device)
+pipe = CogVideoXPipeline.from_pretrained(MODEL, torch_dtype=torch.bfloat16).to(device)
 pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")
 pipe_video = CogVideoXVideoToVideoPipeline.from_pretrained(
-    "/share/official_pretrains/hf_home/CogVideoX-5b",
+    MODEL,
     transformer=pipe.transformer,
     vae=pipe.vae,
     scheduler=pipe.scheduler,
@@ -53,9 +55,9 @@
 ).to(device)
 
 pipe_image = CogVideoXImageToVideoPipeline.from_pretrained(
-    "/share/official_pretrains/hf_home/CogVideoX-5b-I2V",
+    MODEL,
     transformer=CogVideoXTransformer3DModel.from_pretrained(
-        "/share/official_pretrains/hf_home/CogVideoX-5b-I2V", subfolder="transformer", torch_dtype=torch.bfloat16
+        MODEL, subfolder="transformer", torch_dtype=torch.bfloat16
     ),
     vae=pipe.vae,
     scheduler=pipe.scheduler,
@@ -315,7 +317,7 @@ def delete_old_files():
     "></a>
     </div>
     <div style="text-align: center; font-size: 15px; font-weight: bold; color: red; margin-bottom: 20px;">
-    ⚠️ This demo is for academic research and experiential use only.
+    ⚠️ This demo is for academic research and experimental use only.
     </div>
     """)
     with gr.Row():

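In app.py the hard-coded local checkpoint paths are replaced with a single Hub id, `MODEL`, and the video- and image-to-video pipelines reuse the transformer, VAE, and scheduler already loaded by the text-to-video pipeline. A minimal sketch of that load-once/share pattern, assuming the same diffusers pipelines imported by the demo:

```python
# Sketch of loading the weights once and sharing them across pipelines,
# following the MODEL constant introduced in the diff.
import torch
from diffusers import CogVideoXDPMScheduler, CogVideoXPipeline, CogVideoXVideoToVideoPipeline

MODEL = "THUDM/CogVideoX-5b"
device = "cuda" if torch.cuda.is_available() else "cpu"

# The text-to-video pipeline owns the weights.
pipe = CogVideoXPipeline.from_pretrained(MODEL, torch_dtype=torch.bfloat16).to(device)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config, timestep_spacing="trailing")

# The video-to-video pipeline reuses the already-instantiated modules instead of
# loading a second copy from the Hub.
pipe_video = CogVideoXVideoToVideoPipeline.from_pretrained(
    MODEL,
    transformer=pipe.transformer,
    vae=pipe.vae,
    scheduler=pipe.scheduler,
    torch_dtype=torch.bfloat16,
).to(device)
```
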
inference/gradio_composite_demo/rife_model.py

Lines changed: 74 additions & 19 deletions
@@ -8,8 +8,9 @@
 import logging
 import skvideo.io
 from rife.RIFE_HDv3 import Model
-
+from huggingface_hub import hf_hub_download, snapshot_download
 logger = logging.getLogger(__name__)
+
 device = "cuda" if torch.cuda.is_available() else "cpu"
 
 
@@ -18,8 +19,8 @@ def pad_image(img, scale):
     tmp = max(32, int(32 / scale))
     ph = ((h - 1) // tmp + 1) * tmp
     pw = ((w - 1) // tmp + 1) * tmp
-    padding = (0, 0, pw - w, ph - h)
-    return F.pad(img, padding)
+    padding = (0, pw - w, 0, ph - h)
+    return F.pad(img, padding), padding
 
 
 def make_inference(model, I0, I1, upscale_amount, n):
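The `pad_image` fix matters because `torch.nn.functional.pad` reads its padding tuple from the last dimension backwards: for an `[n, c, h, w]` tensor the tuple is `(left, right, top, bottom)`, i.e. width first, then height. Returning the tuple lets the caller crop interpolated frames back to their original size, which the hunk below does via `padding[1]` (width) and `padding[3]` (height). A small illustration with a made-up tensor (not part of the repo code):

```python
# Illustration of the padding order and the crop-back using the returned tuple.
import torch
import torch.nn.functional as F

frame = torch.rand(1, 3, 30, 45)  # [n, c, h, w], neither side a multiple of 32

tmp = 32
ph = ((frame.shape[2] - 1) // tmp + 1) * tmp  # 32
pw = ((frame.shape[3] - 1) // tmp + 1) * tmp  # 64

# F.pad pads the last dim first: (w_left, w_right, h_top, h_bottom).
padding = (0, pw - frame.shape[3], 0, ph - frame.shape[2])
padded = F.pad(frame, padding)
assert padded.shape[-2:] == (ph, pw)

# Crop back with the right/bottom pad amounts. The diff branches on
# padding[3] / padding[1] because a ":-0" slice would select nothing
# when one side needed no padding.
unpadded = padded[:, :, : padded.shape[2] - padding[3], : padded.shape[3] - padding[1]]
assert unpadded.shape == frame.shape
```
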
@@ -36,30 +37,56 @@ def make_inference(model, I0, I1, upscale_amount, n):
 
 @torch.inference_mode()
 def ssim_interpolation_rife(model, samples, exp=1, upscale_amount=1, output_device="cpu"):
-
+    print(f"samples dtype:{samples.dtype}")
+    print(f"samples shape:{samples.shape}")
     output = []
+    pbar = utils.ProgressBar(samples.shape[0], desc="RIFE inference")
     # [f, c, h, w]
     for b in range(samples.shape[0]):
         frame = samples[b : b + 1]
         _, _, h, w = frame.shape
+
         I0 = samples[b : b + 1]
         I1 = samples[b + 1 : b + 2] if b + 2 < samples.shape[0] else samples[-1:]
-        I1 = pad_image(I1, upscale_amount)
+
+        I0, padding = pad_image(I0, upscale_amount)
+        I0 = I0.to(torch.float)
+        I1, _ = pad_image(I1, upscale_amount)
+        I1 = I1.to(torch.float)
+
         # [c, h, w]
         I0_small = F.interpolate(I0, (32, 32), mode="bilinear", align_corners=False)
         I1_small = F.interpolate(I1, (32, 32), mode="bilinear", align_corners=False)
 
         ssim = ssim_matlab(I0_small[:, :3], I1_small[:, :3])
 
         if ssim > 0.996:
-            I1 = I0
-            I1 = pad_image(I1, upscale_amount)
+            I1 = samples[b : b + 1]
+            # print(f'upscale_amount:{upscale_amount}')
+            # print(f'ssim:{upscale_amount}')
+            # print(f'I0 shape:{I0.shape}')
+            # print(f'I1 shape:{I1.shape}')
+            I1, padding = pad_image(I1, upscale_amount)
+            # print(f'I0 shape:{I0.shape}')
+            # print(f'I1 shape:{I1.shape}')
             I1 = make_inference(model, I0, I1, upscale_amount, 1)
-
-            I1_small = F.interpolate(I1[0], (32, 32), mode="bilinear", align_corners=False)
-            ssim = ssim_matlab(I0_small[:, :3], I1_small[:, :3])
-            frame = I1[0]
+
+            # print(f'I0 shape:{I0.shape}')
+            # print(f'I1[0] shape:{I1[0].shape}')
             I1 = I1[0]
+
+            # print(f'I1[0] unpadded shape:{I1.shape}')
+            I1_small = F.interpolate(I1, (32, 32), mode="bilinear", align_corners=False)
+            ssim = ssim_matlab(I0_small[:, :3], I1_small[:, :3])
+            if padding[3] > 0 and padding[1] > 0:
+
+                frame = I1[:, :, :-padding[3], :-padding[1]]
+            elif padding[3] > 0:
+                frame = I1[:, :, :-padding[3], :]
+            elif padding[1] > 0:
+                frame = I1[:, :, :, :-padding[1]]
+            else:
+                frame = I1
 
         tmp_output = []
         if ssim < 0.2:
@@ -69,10 +96,17 @@ def ssim_interpolation_rife(model, samples, exp=1, upscale_amount=1, output_devi
         else:
             tmp_output = make_inference(model, I0, I1, upscale_amount, 2**exp - 1) if exp else []
 
-        frame = pad_image(frame, upscale_amount)
-        tmp_output = [frame] + tmp_output
-        for i, frame in enumerate(tmp_output):
-            output.append(frame.to(output_device))
+        frame, _ = pad_image(frame, upscale_amount)
+        # print(f'frame shape:{frame.shape}')
+
+        frame = F.interpolate(frame, size=(h, w))
+        output.append(frame.to(output_device))
+        for i, tmp_frame in enumerate(tmp_output):
+
+            # tmp_frame, _ = pad_image(tmp_frame, upscale_amount)
+            tmp_frame = F.interpolate(tmp_frame, size=(h, w))
+            output.append(tmp_frame.to(output_device))
+        pbar.update(1)
     return output
 
 
@@ -94,14 +128,26 @@ def frame_generator(video_capture):
 
 
 def rife_inference_with_path(model, video_path):
+    # Open the video file
     video_capture = cv2.VideoCapture(video_path)
-    tot_frame = video_capture.get(cv2.CAP_PROP_FRAME_COUNT)
+    fps = video_capture.get(cv2.CAP_PROP_FPS)  # Get the frames per second
+    tot_frame = int(video_capture.get(cv2.CAP_PROP_FRAME_COUNT))  # Total frames in the video
     pt_frame_data = []
     pt_frame = skvideo.io.vreader(video_path)
-    for frame in pt_frame:
+    # Cyclic reading of the video frames
+    while video_capture.isOpened():
+        ret, frame = video_capture.read()
+
+        if not ret:
+            break
+
+        # BGR to RGB
+        frame_rgb = frame[..., ::-1]
+        frame_rgb = frame_rgb.copy()
+        tensor = torch.from_numpy(frame_rgb).float().to("cpu", non_blocking=True).float() / 255.0
         pt_frame_data.append(
-            torch.from_numpy(np.transpose(frame, (2, 0, 1))).to("cpu", non_blocking=True).float() / 255.0
-        )
+            tensor.permute(2, 0, 1)
+        )  # to [c, h, w]
 
     pt_frame = torch.from_numpy(np.stack(pt_frame_data))
     pt_frame = pt_frame.to(device)
@@ -122,8 +168,17 @@ def rife_inference_with_latents(model, latents):
     for i in range(latents.size(0)):
         # [f, c, w, h]
         latent = latents[i]
+
         frames = ssim_interpolation_rife(model, latent)
         pt_image = torch.stack([frames[i].squeeze(0) for i in range(len(frames))])  # (to [f, c, w, h])
         rife_results.append(pt_image)
 
     return torch.stack(rife_results)
+
+
+# if __name__ == "__main__":
+#     snapshot_download(repo_id="AlexWortega/RIFE", local_dir="model_rife")
+#     model = load_rife_model("model_rife")
+
+#     video_path = rife_inference_with_path(model, "/mnt/ceph/develop/jiawei/CogVideo/output/20241003_130720.mp4")
+#     print(video_path)

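The rewritten `rife_inference_with_path` reads frames with OpenCV instead of `skvideo.io.vreader`: `cv2.VideoCapture` yields BGR `uint8` arrays, so each frame is flipped to RGB, copied to make it contiguous, scaled to `[0, 1]`, and permuted to `[c, h, w]`. A self-contained sketch of that loop; the function name and the `input.mp4` path are hypothetical and only for illustration:

```python
# Standalone sketch of the OpenCV frame-reading loop used in this commit.
import cv2
import torch


def read_video_frames(video_path="input.mp4"):
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    frames = []

    while capture.isOpened():
        ret, frame = capture.read()  # HxWx3 BGR uint8 array, or ret=False at EOF
        if not ret:
            break
        frame_rgb = frame[..., ::-1].copy()        # BGR -> RGB, contiguous copy
        tensor = torch.from_numpy(frame_rgb).float() / 255.0
        frames.append(tensor.permute(2, 0, 1))     # [h, w, c] -> [c, h, w]

    capture.release()
    return torch.stack(frames), fps  # [f, c, h, w] plus the source frame rate
```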