
Commit 6dc9bdb

final changes
1 parent 33056c5 commit 6dc9bdb

2 files changed: +7 -19 lines changed

docs/source/en/api/pipelines/cogvideox.md

Lines changed: 5 additions & 18 deletions
@@ -48,9 +48,11 @@ from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline
 from diffusers.utils import export_to_video,load_image
 pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b").to("cuda") # or "THUDM/CogVideoX-2b"
 ```
+
 If you are using the image-to-video pipeline, load it as follows:
+
 ```python
-pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V").to("cuda") # Image-to-Video pipeline
+pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V").to("cuda")
 ```

 Then change the memory layout of the pipelines `transformer` component to `torch.channels_last`:
@@ -59,7 +61,7 @@ Then change the memory layout of the pipelines `transformer` component to `torch
 pipe.transformer.to(memory_format=torch.channels_last)
 ```

-compile the components and run inference:
+Compile the components and run inference:

 ```python
 pipe.transformer = torch.compile(pipeline.transformer, mode="max-autotune", fullgraph=True)
@@ -69,22 +71,7 @@ prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wood
 video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
 ```

-if you are using the image-to-video pipeline, you can use the following code to generate a video from an image:
-
-```python
-image = load_image("image_of_panda.jpg")
-prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
-video = pipe(prompt=prompt, image=image, guidance_scale=6, num_inference_steps=50).frames[0]
-```
-
-To save the video, use the following code:
-
-```python
-export_to_video(video, "panda_video.mp4")
-```
-
-
-The [benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are:
+The [T2V benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are:

 ```
 Without torch.compile(): Average inference time: 96.89 seconds.
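
For reference, the text-to-video flow documented in this file reads end to end roughly as below (a minimal sketch assembled from the doc's own snippets, not part of the commit). Note that the doc passes `pipeline.transformer` to `torch.compile` although the object is created as `pipe`; the sketch uses `pipe` throughout.

```python
# Minimal sketch of the documented CogVideoX text-to-video + torch.compile flow
# (assembled from the doc's snippets; not part of this commit).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b").to("cuda")  # or "THUDM/CogVideoX-2b"

# Switch the transformer to channels_last and compile it, as the doc describes.
pipe.transformer.to(memory_format=torch.channels_last)
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest."
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "panda_video.mp4")
```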

scripts/convert_cogvideox_to_diffusers.py

Lines changed: 2 additions & 1 deletion
@@ -241,9 +241,10 @@ def get_args():
     if args.vae_ckpt_path is not None:
         vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, dtype)

-    text_encoder_id = "/share/official_pretrains/hf_home//t5-v1_1-xxl"
+    text_encoder_id = "google/t5-v1_1-xxl"
     tokenizer = T5Tokenizer.from_pretrained(text_encoder_id, model_max_length=TOKENIZER_MAX_LENGTH)
     text_encoder = T5EncoderModel.from_pretrained(text_encoder_id, cache_dir=args.text_encoder_cache_dir)
+
     # Apparently, the conversion does not work anymore without this :shrug:
     for param in text_encoder.parameters():
         param.data = param.data.contiguous()
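
As context for the `.contiguous()` loop above (an illustrative aside, not part of the commit): parameters that are non-contiguous views can trip up serialization, e.g. safetensors refuses to save non-contiguous tensors, which is presumably why the conversion script forces a contiguous copy of every text-encoder parameter.

```python
import torch

# Illustrative only: a transposed weight is a non-contiguous view of its storage.
w = torch.randn(4, 8).t()
print(w.is_contiguous())  # False

# Forcing a contiguous copy, as the conversion script does for each parameter.
w = w.contiguous()
print(w.is_contiguous())  # True
```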
