
Conversation

@zRzRzRzRzRzRzR
Contributor

What does this PR do?

This PR is a draft for the new generation of CogVideoX; it is not yet fully implemented.
Main work:

  1. The new PatchEmbedding adds t_patch.
  2. Resolution handling needs to stay compatible with previous versions; the new version uses a higher resolution, but it is still a fixed value. The Rotary Embedding will likely need changes.
  3. Image-to-video supports more resolutions.
    The SAT version is very slow and its memory usage is quite high; this may need optimization in future PRs.

Currently achieved

  1. The new PatchEmbedding adds t_patch.

Who can help with this?

@a-r-r-o-w

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

"patch_bias": False,
"sample_height": 768 // vae_scale_factor_spatial,
"sample_width": 1360 // vae_scale_factor_spatial,
"sample_frames": 81, # TODO: Need Test with 161 for 10 seconds
Contributor

Suggested change
- "sample_frames": 81, # TODO: Need Test with 161 for 10 seconds
+ "sample_frames": 81,

This is just to determine the default number of frames for sampling, so we do not need to modify it here (which would affect the config.json of the converted transformer model). Users can still specify 161 frames in the pipeline call and generation will work normally without any changes here.
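To illustrate, a minimal sketch of overriding the frame count at call time (the checkpoint id and prompt are illustrative, not part of this PR):

```python
import torch
from diffusers import CogVideoXPipeline

# Illustrative checkpoint id; the defaults (e.g. 81 frames) come from the converted
# transformer config, but num_frames can still be overridden per call.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX1.5-5B", torch_dtype=torch.bfloat16)
pipe.to("cuda")

video = pipe(
    prompt="A panda playing guitar in a bamboo forest",
    num_frames=161,  # ~10 seconds instead of the config default of 81
    height=768,
    width=1360,
).frames[0]
```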

 if args.vae_ckpt_path is not None:
-    vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, dtype)
+    # Keep VAE in float32 for better quality
+    vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, torch.float32)
Contributor

@zRzRzRzRzRzRzR This is a bit of a breaking change. The SAT VAE is in fp32 but the diffusers format VAE is in bf16/fp16. This can lead to poorer quality, so it is best to just keep the VAE in fp32 and let users decide what configuration to use. I will open a PR to the other model weight CogVideoX repositories with the updated VAE weights soon.

cc @yiyixuxu @DN6 The VAE quality doesn't take too much of a hit, but it is best to have the default in FP32 and update all existing checkpoints. Apologies that this slipped through earlier, but I definitely notice very minor differences in quality (at least in training, cc @sayakpaul). The transformer modeling weights don't use variants because there are no FP32 weights, as training is done in BF16.
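For illustration, a minimal sketch of how users could pick the VAE precision themselves once the checkpoints store FP32 VAE weights (checkpoint id assumed from the existing CogVideoX repositories):

```python
import torch
from diffusers import AutoencoderKLCogVideoX

# The converted VAE is stored in FP32; users can still downcast at load time
# if they prefer lower memory over the small quality gain.
vae_fp32 = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.float32
)
vae_bf16 = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16
)
```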

scheduler=scheduler,
)

if args.fp16:
Contributor

Due to the explanation above, we shouldn't typecast all weights in the pipeline. The VAE is best in FP32, the text encoder could be saved in FP32 but works well at lower precisions as well, and the transformer is in either BF16, or FP16 for CogVideoX-2B text-to-video.

Contributor Author

Understood, this is the right thing to do.

-    height: int = 480,
-    width: int = 720,
-    num_frames: int = 49,
+    height: Optional[int] = None,
Contributor

This is not a breaking change since we can determine these exact values from the transformer config parameters.


base_size_width = self.transformer.config.sample_width // p
base_size_height = self.transformer.config.sample_height // p
base_num_frames = (num_frames + p_t - 1) // p_t
Contributor

This is not a breaking change either, because they are mathematically equivalent for 1.0 models, which do not use temporal patch embedding; the ceil div is required for the new 1.5 CogVideoX models.
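A quick sketch of why the ceil division is equivalent for 1.0 models (patch_size_t effectively 1) but needed for 1.5 (patch_size_t = 2); the function name is just for illustration:

```python
def base_num_frames(num_latent_frames: int, p_t: int) -> int:
    # Ceil division: number of temporal patches covering the latent frames
    return (num_latent_frames + p_t - 1) // p_t

assert base_num_frames(13, 1) == 13  # CogVideoX 1.0: unchanged behaviour
assert base_num_frames(13, 2) == 7   # CogVideoX 1.5: rounds up to whole temporal patches
```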

`tuple`. When returning a tuple, the first element is a list with the generated images.
"""

if num_frames > 49:
Contributor

We need to remove the frame restriction since the newer models can generate higher frame counts. 49 frames will still be the default for 1.0 models because we determine that from the transformer config params.
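A rough sketch of how such defaults can be resolved from the transformer config rather than hard-coded (helper name and config layout are illustrative, not the actual pipeline code):

```python
def resolve_defaults(transformer_config: dict, vae_scale_factor_spatial: int,
                     height=None, width=None, num_frames=None):
    # Fall back to the sizes stored in the transformer config
    # (480x720x49 for 1.0 checkpoints, 768x1360x81 for 1.5).
    height = height or transformer_config["sample_height"] * vae_scale_factor_spatial
    width = width or transformer_config["sample_width"] * vae_scale_factor_spatial
    num_frames = num_frames or transformer_config["sample_frames"]
    return height, width, num_frames

# 1.0-style config: sample sizes are stored in latent space (VAE spatial scale factor 8)
print(resolve_defaults({"sample_height": 60, "sample_width": 90, "sample_frames": 49}, 8))
# -> (480, 720, 49)
```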

@kijai

kijai commented Nov 11, 2024

Hey,

I've been testing this draft within my ComfyUI wrapper nodes, and I'd like clarity on a couple of things as I'm unsure how this should perform at this stage, mostly to know if I have made a mistake somewhere:

  • With the text-to-video model, the input hidden_states are always padded, which seems to cause noise/corruption in the first latent/first 4 frames of the resulting video. Is this unavoidable?

  • With both models, many resolutions don't really work at all, especially smaller ones; for example, the old default 720x480 is a blurry mess with T2V and just colorful blocks with I2V. The new default resolution is fine, and I've also gotten really good results with square ones (640p, 768p, 1024p). I've made sure it's always divisible by 16.

@a-r-r-o-w
Contributor

a-r-r-o-w commented Nov 11, 2024

Hi @kijai. Could you try experimenting with this branch: zRzRzRzRzRzRzR#1?

It removes the padding of hidden_states which seems to cause corruption. Instead, you must specify frame values whose latent size is divisible by patch_size_t (such as 85 or 165). I believe there might still be some modifications remaining on how to handle frames based on my conversation with Yuxuan.

Regarding resolutions, the T2V model works best only at specific resolutions (the recommendation is to always use 1360x768). The I2V model can generate at multiple resolutions. Also, both models generate best at 85 and 165 frames (as per my PR above, but this should actually be 81 and 161 to be consistent with the original implementation. We'll try to figure out the best way to support this today)
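A small sketch of the constraint described above, assuming the usual temporal compression factor of 4 and patch_size_t = 2 for the 1.5 models:

```python
def latent_frame_count(num_frames: int, temporal_compression: int = 4) -> int:
    # 1 leading frame plus groups of `temporal_compression` frames
    return (num_frames - 1) // temporal_compression + 1

for f in (49, 81, 85, 161, 165):
    lf = latent_frame_count(f)
    print(f, lf, "divisible by patch_size_t=2" if lf % 2 == 0 else "not divisible")
```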

@kijai

kijai commented Nov 11, 2024

Hi @kijai. Could you try experimenting with this branch: zRzRzRzRzRzRzR#1?

It removes the padding of hidden_states which seems to cause corruption. Instead, you must specify frame values whose latent size is divisible by patch_size_t (such as 85 or 165). I believe there might still be some modifications remaining on how to handle frames based on my conversation with Yuxuan.

Regarding resolutions, the T2V model works best only at specific resolutions (the recommendation is to always use 1360x768). The I2V model can generate at multiple resolutions. Also, both models generate best at 85 and 165 frames (as per my PR above, but this should actually be 81 and 161 to be consistent with the original implementation. We'll try to figure out the best way to support this today)

I did try without the padding already, but then you get the first 4 frames without movement. I'll check if that branch has something for that, thanks.

To illustrate what I mean with the I2V and 720x480 (for example):

CogVideoX-I2V_00001.10.mp4

Also, forgive me for not knowing anything about this, and admittedly I don't actually understand what I was doing, but I tried flipping the aspect for the rotary pos embed crop coords (which probably results in it not being applied at all or something); that seemingly cleared it up for this resolution, but ruined it for the default one:

cogvideo1_5_test.mp4

@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Nov 11, 2024

Hi @kijai. Could you try experimenting with this branch: zRzRzRzRzRzRzR#1?

It removes the padding of hidden_states which seems to cause corruption. Instead, you must specify frame values whose latent size is divisible by patch_size_t (such as 85 or 165). I believe there might still be some modifications remaining on how to handle frames based on my conversation with Yuxuan.

Regarding resolutions, the T2V model works best only at specific resolutions (the recommendation is to always use 1360x768). The I2V model can generate at multiple resolutions. Also, both models generate best at 85 and 165 frames (as per my PR above, but this should actually be 81 and 161 to be consistent with the original implementation. We'll try to figure out the best way to support this today)

In this version, T2V works very well; I believe the main issue lies in I2V. For T2V, we indeed trained with only one resolution, 1360x768, so it’s quite normal that the original 720x480 resolution doesn’t work properly.

@zRzRzRzRzRzRzR
Contributor Author

Hi @kijai. Could you try experimenting with this branch: zRzRzRzRzRzRzR#1?
It removes the padding of hidden_states which seems to cause corruption. Instead, you must specify frame values whose latent size is divisible by patch_size_t (such as 85 or 165). I believe there might still be some modifications remaining on how to handle frames based on my conversation with Yuxuan.
Regarding resolutions, the T2V model works best only at specific resolutions (the recommendation is to always use 1360x768). The I2V model can generate at multiple resolutions. Also, both models generate best at 85 and 165 frames (as per my PR above, but this should actually be 81 and 161 to be consistent with the original implementation. We'll try to figure out the best way to support this today)

I did try without the padding already, but then you get the first 4 frames without movement. I'll check if that branch has something for that, thanks.

To illustrate what I mean with the I2V and 720x480 (for example):
CogVideoX-I2V_00001.10.mp4

Also, forgive me for not knowing anything about this, and admittedly I don't actually understand what I was doing, but I tried flipping the aspect for the rotary pos embed crop coords (which probably results in it not being applied at all or something); that seemingly cleared it up for this resolution, but ruined it for the default one:
cogvideo1_5_test.mp4

What I want to say is that it should actually be 21, not 22, because the first block of each latent is useless. Therefore, we need to try to remove the useless latents. This method should effectively output a 16 fps video of 5 seconds, with the input num_frames being 81.
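The arithmetic behind this (temporal compression of 4 assumed): 81 input frames compress to 21 latent frames, and 81 frames at 16 fps is roughly 5 seconds of video.

```python
num_frames = 81
latent_frames = (num_frames - 1) // 4 + 1  # -> 21; the first latent is the one discussed above
duration_s = num_frames / 16               # -> ~5 seconds at 16 fps
print(latent_frames, duration_s)
```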

@zRzRzRzRzRzRzR
Contributor Author

Now, all changes are in the cogvideox1.1-5b branch.

@a-r-r-o-w
Contributor

Also, forgive me for not knowing anything about this, and admittedly I don't actually understand what I was doing, but I tried flipping the aspect for the rotary pos embed crop coords (which probably results in it not being applied at all or something); that seemingly cleared it up for this resolution, but ruined it for the default one:

@kijai It could be possible that there is something wrong with our implementation then. Could you provide the patch/changes you made to do the aspect ratio flip?

@kijai

kijai commented Nov 11, 2024

Also, forgive me for not knowing anything about this, and admittedly I don't actually understand what I was doing, but I tried flipping the aspect for the rotary pos embed crop coords (which probably results in it not being applied at all or something); that seemingly cleared it up for this resolution, but ruined it for the default one:

@kijai It could be possible that there is something wrong with our implementation then. Could you provide the patch/changes you made to do the aspect ratio flip?

I really don't know if this means anything; I simply swapped base_size_width and base_size_height here in an effort to understand what it even does, but it's not a real solution, since it then ruins the other resolutions:

(grid_height, grid_width), base_size_width, base_size_height

Also, I did just now replicate this to double-check that it indeed somehow fixes at least the video I showed above.

@a-r-r-o-w a-r-r-o-w requested a review from yiyixuxu November 14, 2024 23:38
@a-r-r-o-w
Contributor

@yiyixuxu Re-requesting a review because of the changes made to 3D rope embeds

@kijai

kijai commented Nov 15, 2024

@kijai Would you be able to try it out now? Please make sure to update the config with the correct sample height and width as done in the latest commit! It looks to me like the issue you mentioned is fixed (but I have only done limited testing with 768 x 1360, 1360 x 768, 768 x 1152, 1152 x 768).

Another possible bug might be the max_sequence_length. I'm unsure if this should be 226 or 224 for CogVideoX 1.5. @zRzRzRzRzRzRzR Could you clarify which is correct?

I have been on vacation for the past few days, so I could not really invest time to carefully check each change that was made, and I missed a few commits made to the original repo. Apologies for the delay here, but once @zRzRzRzRzRzRzR and you confirm that it is working as intended, we can proceed with merging. Thanks for all the help testing!

I don't have time to test more currently as it's 3am here, but a quick test of both examples that failed before shows they indeed work now:

CogVideoX-1-5-I2V_test_768_1360.mp4
CogVideoX-1-5-I2V_test_720_480.mp4

Thank you for your hard work!

@zRzRzRzRzRzRzR
Contributor Author

I have replied to some of the issues mentioned by @a-r-r-o-w on Slack. I am currently reproducing the issue mentioned in the latest issue report.

@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Nov 15, 2024

The content of the activity for 768 x 1360 is incorrect, but did both @kijai and @a-r-r-o-w succeed?
In my work, @kijai made a broad assumption about an error previously mentioned.

@kijai

kijai commented Nov 15, 2024

The content of the activity for 768 x 1360 is incorrect, but did both @kijai and @a-r-r-o-w succeed? In my work, @kijai made a broad assumption about an error previously mentioned.

The latest fix from @a-r-r-o-w is working perfectly for me; every resolution I've tried has worked to some extent, even really small ones.

@a-r-r-o-w
Contributor

@zRzRzRzRzRzRzR Let me know if you think this is good to merge now, and we can go ahead :)

@a-r-r-o-w
Contributor

Discussed in a private DM with @zRzRzRzRzRzRzR: we have verified that the model works as intended for all resolutions supported during training. It was just a config bug in the official checkpoints, which has now been corrected.

@a-r-r-o-w a-r-r-o-w changed the title New CogVideoX Improve(Draft) CogVideoX 1.5 Nov 18, 2024
@a-r-r-o-w a-r-r-o-w merged commit 3b28306 into huggingface:main Nov 18, 2024
15 checks passed
@YanzuoLu

The current settings for the T2V transformer model seem to be problematic.
Any ideas on this issue?

@a-r-r-o-w
Contributor

Just fixed in #9963. A config attribute was updated on the transformer checkpoint to make the T2V behaviour consistent with the SAT codebase, but the specific change previously only existed in I2V for diffusers; it should now work the same as before.

sayakpaul pushed a commit that referenced this pull request Dec 23, 2024
* CogVideoX1_1PatchEmbed test

* 1360 * 768

* refactor

* make style

* update docs

* add modeling tests for cogvideox 1.5

* update

* make fix-copies

* add ofs embed(for convert)

* add ofs embed(for convert)

* more resolution for cogvideox1.5-5b-i2v

* use even number of latent frames only

* update pipeline implementations

* make style

* set patch_size_t as None by default

* #skip frames 0

* refactor

* make style

* update docs

* fix ofs_embed

* update docs

* invert_scale_latents

* update

* fix

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <[email protected]>

* Update docs/source/en/api/pipelines/cogvideox.md

Co-authored-by: Steven Liu <[email protected]>

* Update src/diffusers/models/transformers/cogvideox_transformer_3d.py

* update conversion script

* remove copied from

* fix test

* Update docs/source/en/api/pipelines/cogvideox.md

* Update docs/source/en/api/pipelines/cogvideox.md

* Update docs/source/en/api/pipelines/cogvideox.md

* Update docs/source/en/api/pipelines/cogvideox.md

---------

Co-authored-by: Aryan <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
@zRzRzRzRzRzRzR zRzRzRzRzRzRzR deleted the cogvideox1.1-5b branch January 14, 2025 06:47
@cyberluke

Hi, do you know how many people use HuggingFace and run the original code and original model? You introduce these big breaking changes to them and they can spend weeks debugging it.

This should be communicated with the original author, and he should put this information in his README.md!!!

It would be great to supply a complete migration guide, not only two paragraphs here in the docs where people need to read between the lines to understand the consequences.

So you also recommend this for finetuning:
The original repository uses a lora_alpha of 1. We found this not suitable in many runs, possibly due to difference in modeling backends and training settings. Our recommendation is to set to the lora_alpha to either rank or rank // 2.
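For reference, a minimal sketch of that recommendation using peft's LoraConfig; the rank and target modules are illustrative, not from this PR:

```python
from peft import LoraConfig

rank = 64
lora_config = LoraConfig(
    r=rank,
    lora_alpha=rank,  # or rank // 2, instead of the original repository's lora_alpha of 1
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
```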

Now, if I want to download some LoRAs from the community that were finetuned on the original model, will they work?

Next, did you know that the I2V model is not only for image-to-video but also for text-to-video, because you can supply both image and text, or only one of them? Therefore you can interactively decide for each scene whether to supply one or the other, or both.

The original bfloat16 model fit nicely in my 4090's 24GB VRAM; now I will have to look at it again, though fp16 is definitely better. Would you recommend the Accelerate with DeepSpeed combo (Accelerate supports DeepSpeed in its config, I guess)? Before, I used offloading and VAE slicing. But ultimately I would like to test-drive this model on TensorRT, because that should be much more powerful than CUDA only.

I'm just a newbie; I'd be happy for any tips on how to optimize speed.
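For what it's worth, a minimal sketch of the common memory-saving switches mentioned above (CPU offloading plus VAE slicing/tiling); the checkpoint id is assumed:

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # offload submodules to CPU between forward passes
pipe.vae.enable_slicing()        # decode the latent batch slice by slice
pipe.vae.enable_tiling()         # decode spatially in tiles to cap peak VRAM
```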

I still also don't understand which checkpoints I should update and where to get them. THANK YOU

@cyberluke

Also, why do you mention inference=false on HuggingFace if everyone was using the I2V model for inference?
