
Conversation

@zRzRzRzRzRzRzR
Contributor

What does this PR do?

This PR is a draft for the new generation of CogVideoX; it is not yet fully implemented.
Main work:

  1. The new PatchEmbedding adds t_patch.
  2. Resolution handling needs to stay compatible with previous versions; the new version uses a higher resolution, but it is still a fixed value. The Rotary Embedding will likely need changes.
  3. Image-to-video supports more resolutions.
    The SAT version is very slow and its memory usage is quite high; this may need optimization in future PRs.

Currently achieved

  1. The new PatchEmbedding adds t_patch.

Who can help with this?

@a-r-r-o-w

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

"patch_bias": False,
"sample_height": 768 // vae_scale_factor_spatial,
"sample_width": 1360 // vae_scale_factor_spatial,
"sample_frames": 81, # TODO: Need Test with 161 for 10 seconds
Contributor

Suggested change
- "sample_frames": 81, # TODO: Need Test with 161 for 10 seconds
+ "sample_frames": 81,

This is just to determine the default number of frames for sampling, so we do not need to modify it here (which would affect the config.json of the converted transformer model). Users can still specify 161 frames in the pipeline call and generation will work normally without any changes here.
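To illustrate, a minimal sketch of overriding the frame count at call time (the checkpoint id and prompt are illustrative, not part of this PR):

```python
import torch
from diffusers import CogVideoXPipeline

# Illustrative checkpoint id; the defaults (e.g. 81 frames) come from the converted
# transformer config, but num_frames can still be overridden per call.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX1.5-5B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

video = pipe(
    prompt="A panda playing guitar in a bamboo forest",
    num_frames=161,  # ~10 seconds instead of the config default of 81
    height=768,
    width=1360,
).frames[0]
```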

 if args.vae_ckpt_path is not None:
-    vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, dtype)
+    # Keep VAE in float32 for better quality
+    vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, torch.float32)
Contributor

@zRzRzRzRzRzRzR This is a bit of a breaking change. The SAT VAE is in fp32 but the diffusers format VAE is in bf16/fp16. This can lead to poorer quality, so it is best to just keep the VAE in fp32 and let users decide what configuration to use. I will open a PR to the other model weight CogVideoX repositories with the updated VAE weights soon.

cc @yiyixuxu @DN6 The VAE quality doesn't take too much of a hit, but it is best to have the default in FP32 and update all existing checkpoints. Apologies that this slipped through earlier, but I definitely notice very minor differences in quality (at least in training, cc @sayakpaul). The transformer modeling weights don't use variants because there are no FP32 weights, as training is done in BF16.
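For illustration, a minimal sketch of how users could pick the VAE precision themselves once the checkpoints store FP32 VAE weights (checkpoint id assumed from the existing CogVideoX repositories):

```python
import torch
from diffusers import AutoencoderKLCogVideoX

# The converted VAE is stored in FP32; users can still downcast at load time
# if they prefer lower memory over the small quality gain.
vae_fp32 = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.float32
)
vae_bf16 = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16
)
```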

scheduler=scheduler,
)

if args.fp16:
Contributor

Due to the explanation above, we shouldn't typecast all weights in the pipeline. The VAE is best in FP32, the text encoder could be saved in FP32 but works well at lower precisions as well, and the transformer is in either BF16, or FP16 for CogVideoX-2B text-to-video.

Contributor Author

Understood, this is the right thing to do.

-    height: int = 480,
-    width: int = 720,
-    num_frames: int = 49,
+    height: Optional[int] = None,
Contributor

This is not a breaking change since we can determine these exact values from the transformer config parameters.


base_size_width = self.transformer.config.sample_width // p
base_size_height = self.transformer.config.sample_height // p
base_num_frames = (num_frames + p_t - 1) // p_t
Contributor

This is not a breaking change either, because they are mathematically equivalent for 1.0 models, which do not use temporal patch embedding; the ceil div is required for the new 1.5 CogVideoX models.
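A quick sketch of why the ceil division is equivalent for 1.0 models (patch_size_t effectively 1) but needed for 1.5 (patch_size_t = 2); the function name is just for illustration:

```python
def base_num_frames(num_latent_frames: int, p_t: int) -> int:
    # Ceil division: number of temporal patches covering the latent frames
    return (num_latent_frames + p_t - 1) // p_t

assert base_num_frames(13, 1) == 13  # CogVideoX 1.0: unchanged behaviour
assert base_num_frames(13, 2) == 7   # CogVideoX 1.5: rounds up to whole temporal patches
```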

`tuple`. When returning a tuple, the first element is a list with the generated images.
"""

if num_frames > 49:
Contributor

We need to remove the frame restriction since the newer models can generate higher frame counts. 49 frames will still be the default for 1.0 models because we determine that from the transformer config params.
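A rough sketch of how such defaults can be resolved from the transformer config rather than hard-coded (helper name and config layout are illustrative, not the actual pipeline code):

```python
def resolve_defaults(transformer_config: dict, vae_scale_factor_spatial: int,
                     height=None, width=None, num_frames=None):
    # Fall back to the sizes stored in the transformer config
    # (480x720x49 for 1.0 checkpoints, 768x1360x81 for 1.5).
    height = height or transformer_config["sample_height"] * vae_scale_factor_spatial
    width = width or transformer_config["sample_width"] * vae_scale_factor_spatial
    num_frames = num_frames or transformer_config["sample_frames"]
    return height, width, num_frames

# 1.0-style config: sample sizes are stored in latent space (VAE spatial scale factor 8)
print(resolve_defaults({"sample_height": 60, "sample_width": 90, "sample_frames": 49}, 8))
# -> (480, 720, 49)
```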

@kijai

kijai commented Nov 11, 2024

Hey,

I've been testing this draft within my ComfyUI wrapper nodes, and I'd like clarity on a couple of things as I'm unsure how this should perform at this stage, mostly to know if I have made a mistake somewhere:

  • With the text-to-video model, the input hidden_states are always padded, which seems to cause noise/corruption in the first latent/first 4 frames of the resulting video. Is this unavoidable?

  • With both models, many resolutions don't really work at all, especially smaller ones; for example, the old default 720x480 is a blurry mess with T2V and just colorful blocks with I2V. The new default resolution is fine, and I've also gotten really good results with square ones (640p, 768p, 1024p). I've made sure it's always divisible by 16.

@a-r-r-o-w
Contributor

a-r-r-o-w commented Nov 11, 2024

Hi @kijai. Could you try experimenting with this branch: zRzRzRzRzRzRzR#1?

It removes the padding of hidden_states which seems to cause corruption. Instead, you must specify frame values whose latent size is divisible by patch_size_t (such as 85 or 165). I believe there might still be some modifications remaining on how to handle frames based on my conversation with Yuxuan.

Regarding resolutions, the T2V model works best only at specific resolutions (the recommendation is to always use 1360x768). The I2V model can generate at multiple resolutions. Also, both models generate best at 85 and 165 frames (as per my PR above, but this should actually be 81 and 161 to be consistent with the original implementation. We'll try to figure out the best way to support this today)
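A small sketch of the constraint described above, assuming the usual temporal compression factor of 4 and patch_size_t = 2 for the 1.5 models:

```python
def latent_frame_count(num_frames: int, temporal_compression: int = 4) -> int:
    # 1 leading frame plus groups of `temporal_compression` frames
    return (num_frames - 1) // temporal_compression + 1

for f in (49, 81, 85, 161, 165):
    lf = latent_frame_count(f)
    print(f, lf, "divisible by patch_size_t=2" if lf % 2 == 0 else "not divisible")
```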

@kijai

kijai commented Nov 11, 2024

Hi @kijai. Could you try experimenting with this branch: zRzRzRzRzRzRzR#1?

It removes the padding of hidden_states which seems to cause corruption. Instead, you must specify frame values whose latent size is divisible by patch_size_t (such as 85 or 165). I believe there might still be some modifications remaining on how to handle frames based on my conversation with Yuxuan.

Regarding resolutions, the T2V model works best only at specific resolutions (the recommendation is to always use 1360x768). The I2V model can generate at multiple resolutions. Also, both models generate best at 85 and 165 frames (as per my PR above, but this should actually be 81 and 161 to be consistent with the original implementation. We'll try to figure out the best way to support this today)

I did try without the padding already, but then you get the first 4 frames without movement. I'll check if that branch has something for that, thanks.

To illustrate what I mean with the I2V and 720x480 (for example):

CogVideoX-I2V_00001.10.mp4

Also, forgive me for not knowing anything about this, and admittedly I don't actually understand what I was doing, but I tried flipping the aspect for the rotary pos embed crop coords (which probably results in it not being applied at all or something); that seemingly cleared it up for this resolution, but ruined it for the default one:

cogvideo1_5_test.mp4

@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Nov 11, 2024

Hi @kijai. Could you try experimenting with this branch: zRzRzRzRzRzRzR#1?

It removes the padding of hidden_states which seems to cause corruption. Instead, you must specify frame values whose latent size is divisible by patch_size_t (such as 85 or 165). I believe there might still be some modifications remaining on how to handle frames based on my conversation with Yuxuan.

Regarding resolutions, the T2V model works best only at specific resolutions (the recommendation is to always use 1360x768). The I2V model can generate at multiple resolutions. Also, both models generate best at 85 and 165 frames (as per my PR above, but this should actually be 81 and 161 to be consistent with the original implementation. We'll try to figure out the best way to support this today)

In this version, T2V works very well; I believe the main issue lies in I2V. For T2V, we indeed trained with only one resolution, 1360x768, so it’s quite normal that the original 720x480 resolution doesn’t work properly.

@zRzRzRzRzRzRzR
Contributor Author

Hi @kijai. Could you try experimenting with this branch: zRzRzRzRzRzRzR#1?
It removes the padding of hidden_states which seems to cause corruption. Instead, you must specify frame values whose latent size is divisible by patch_size_t (such as 85 or 165). I believe there might still be some modifications remaining on how to handle frames based on my conversation with Yuxuan.
Regarding resolutions, the T2V model works best only at specific resolutions (the recommendation is to always use 1360x768). The I2V model can generate at multiple resolutions. Also, both models generate best at 85 and 165 frames (as per my PR above, but this should actually be 81 and 161 to be consistent with the original implementation. We'll try to figure out the best way to support this today)

I did try without the padding already, but then you get the first 4 frames without movement. I'll check if that branch has something for that, thanks.

To illustrate what I mean with the I2V and 720x480 (for example):
CogVideoX-I2V_00001.10.mp4

Also, forgive me for not knowing anything about this, and admittedly I don't actually understand what I was doing, but I tried flipping the aspect for the rotary pos embed crop coords (which probably results in it not being applied at all or something); that seemingly cleared it up for this resolution, but ruined it for the default one:
cogvideo1_5_test.mp4

What I want to say is that it should actually be 21, not 22, because the first block of each latent is useless. Therefore, we need to try to remove the useless latents. This method should effectively output a 16 fps video of 5 seconds, with the input num_frames being 81.
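The arithmetic behind this (temporal compression of 4 assumed): 81 input frames compress to 21 latent frames, and 81 frames at 16 fps is roughly 5 seconds of video.

```python
num_frames = 81
latent_frames = (num_frames - 1) // 4 + 1  # -> 21; the first latent is the one discussed above
duration_s = num_frames / 16               # -> ~5 seconds at 16 fps
print(latent_frames, duration_s)
```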

@zRzRzRzRzRzRzR
Contributor Author

Now, all changes are in the cogvideox1.1-5b branch.

@a-r-r-o-w
Contributor

Also, forgive me for not knowing anything about this, and admittedly I don't actually understand what I was doing, but I tried flipping the aspect for the rotary pos embed crop coords (which probably results in it not being applied at all or something); that seemingly cleared it up for this resolution, but ruined it for the default one:

@kijai It could be possible that there is something wrong with our implementation then. Could you provide the patch/changes you made to do the aspect ratio flip?

@kijai

kijai commented Nov 11, 2024

Also, forgive me for not knowing anything about this, and admittedly I don't actually understand what I was doing, but I tried flipping the aspect for the rotary pos embed crop coords (which probably results in it not being applied at all or something); that seemingly cleared it up for this resolution, but ruined it for the default one:

@kijai It could be possible that there is something wrong with our implementation then. Could you provide the patch/changes you made to do the aspect ratio flip?

I really don't know if this means anything; I simply swapped base_size_width and base_size_height here in an effort to understand what it even does, but it's not a real solution, since it then ruins the other resolutions:

(grid_height, grid_width), base_size_width, base_size_height

Also, I did just now replicate this to double-check that it indeed somehow fixes at least the video I showed above.

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu November 14, 2024 23:38
@a-r-r-o-w
Contributor

@yiyixuxu Re-requesting a review because of the changes made to 3D rope embeds

@kijai

kijai commented Nov 15, 2024

@kijai Would you be able to try it out now? Please make sure to update the config with the correct sample height and width as done in the latest commit! It looks to me like the issue you mentioned is fixed (but I have only done limited testing with 768 x 1360, 1360 x 768, 768 x 1152, 1152 x 768).

Another possible bug might be the max_sequence_length. I'm unsure if this should be 226 or 224 for CogVideoX 1.5. @zRzRzRzRzRzRzR Could you clarify which is correct?

I have been on vacation for the past few days, so I could not really invest time to carefully check each change that was made, and I missed a few commits made to the original repo. Apologies for the delay here, but once @zRzRzRzRzRzRzR and you confirm that it is working as intended, we can proceed with merging. Thanks for all the help testing!

I don't have time to test more currently as it's 3am here, but a quick test of both examples that failed before shows they indeed work now:

CogVideoX-1-5-I2V_test_768_1360.mp4
CogVideoX-1-5-I2V_test_720_480.mp4

Thank you for your hard work!

@zRzRzRzRzRzRzR
Contributor Author

I have replied to some of the issues mentioned by @a-r-r-o-w on Slack. I am currently reproducing the issue mentioned in the latest issue report.

@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Nov 15, 2024

The content of the activity for 768 x 1360 is incorrect, but did both @kijai and @a-r-r-o-w succeed?
In my work, @kijai made a broad assumption about an error previously mentioned.

@kijai

kijai commented Nov 15, 2024

The content of the activity for 768 x 1360 is incorrect, but did both @kijai and @a-r-r-o-w succeed? In my work, @kijai made a broad assumption about an error previously mentioned.

The latest fix from @a-r-r-o-w is working perfectly for me; every resolution I've tried has worked to some extent, even really small ones.

@a-r-r-o-w
Contributor

@zRzRzRzRzRzRzR Let me know if you think this is good to merge now, and we can go ahead :)

@a-r-r-o-w
Contributor

Discussed in a private DM with @zRzRzRzRzRzRzR: we have verified that the model works as intended for all resolutions supported during training. It was just a config bug in the official checkpoints, which has now been corrected.

@a-r-r-o-w a-r-r-o-w changed the title New CogVideoX Improve(Draft) CogVideoX 1.5 Nov 18, 2024
@a-r-r-o-w a-r-r-o-w merged commit 3b28306 into huggingface:main Nov 18, 2024
15 checks passed
@YanzuoLu

The current settings for the T2V transformer model seem to be problematic.
Any ideas on this issue?

@a-r-r-o-w
Contributor

Just fixed in #9963. A config attribute was updated on the transformer checkpoint to make the T2V behaviour consistent with the SAT codebase, but the specific change previously only existed in I2V for diffusers; it should now work the same as before.

sayakpaul pushed a commit that referenced this pull request Dec 23, 2024
* CogVideoX1_1PatchEmbed test

* 1360 * 768

* refactor

* make style

* update docs

* add modeling tests for cogvideox 1.5

* update

* make fix-copies

* add ofs embed(for convert)

* add ofs embed(for convert)

* more resolution for cogvideox1.5-5b-i2v

* use even number of latent frames only

* update pipeline implementations

* make style

* set patch_size_t as None by default

* #skip frames 0

* refactor

* make style

* update docs

* fix ofs_embed

* update docs

* invert_scale_latents

* update

* fix

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <[email protected]>

* Update src/diffusers/models/transformers/cogvideox_transformer_3d.py

* update conversion script

* remove copied from

* fix test

* Update docs/source/en/api/pipelines/cogvideox.md

* Update docs/source/en/api/pipelines/cogvideox.md

* Update docs/source/en/api/pipelines/cogvideox.md

* Update docs/source/en/api/pipelines/cogvideox.md

---------

Co-authored-by: Aryan <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
@zRzRzRzRzRzRzR zRzRzRzRzRzRzR deleted the cogvideox1.1-5b branch January 14, 2025 06:47
@cyberluke

Hi, do you know how many people use HuggingFace and run the original code and original model? You introduce these big breaking changes to them and they can spend weeks debugging it.

This should be communicated with the original author, and he should put this information in his README.md!!!

It would be great to supply a complete migration guide, not only two paragraphs here in the docs where people need to read between the lines to understand the consequences.

So you also recommend this for finetuning:
The original repository uses a lora_alpha of 1. We found this not suitable in many runs, possibly due to difference in modeling backends and training settings. Our recommendation is to set to the lora_alpha to either rank or rank // 2.
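For reference, a minimal sketch of that recommendation using peft's LoraConfig; the rank and target modules are illustrative, not from this PR:

```python
from peft import LoraConfig

rank = 64
lora_config = LoraConfig(
    r=rank,
    lora_alpha=rank,  # or rank // 2, instead of the original repository's lora_alpha of 1
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
```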

Now, if I want to download some LoRAs from the community that were finetuned on the original model, will they work?

Next, did you know that the I2V model is not only for image-to-video but also for text-to-video, because you can supply both image and text, or only one of them? Therefore you can interactively decide for each scene whether to supply one or the other, or both.

The original bfloat16 model fit nicely in my 4090's 24GB VRAM; now I will have to look at it again, though fp16 is definitely better. Would you recommend the Accelerate with DeepSpeed combo (Accelerate supports DeepSpeed in its config, I guess)? Before, I used offloading and VAE slicing. But ultimately I would like to test-drive this model on TensorRT, because that should be much more powerful than CUDA only.

I'm just a newbie; I'd be happy for any tips on how to optimize speed.
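For what it's worth, a minimal sketch of the common memory-saving switches mentioned above (CPU offloading plus VAE slicing/tiling); the checkpoint id is assumed:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # offload submodules to CPU between forward passes
pipe.vae.enable_slicing()        # decode the latent batch slice by slice
pipe.vae.enable_tiling()         # decode spatially in tiles to cap peak VRAM
```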

I still also don't understand which checkpoints I should update and where to get them. THANK YOU

@cyberluke

Also, why do you mention inference=false on HuggingFace if everyone was using the I2V model for inference?
