CogVideoX 1.5 #9877
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
| "patch_bias": False, | ||
| "sample_height": 768 // vae_scale_factor_spatial, | ||
| "sample_width": 1360 // vae_scale_factor_spatial, | ||
| "sample_frames": 81, # TODO: Need Test with 161 for 10 seconds |
| "sample_frames": 81, # TODO: Need Test with 161 for 10 seconds | |
| "sample_frames": 81, |
This only determines the default number of frames for sampling, so we do not need to modify it here (doing so would affect the config.json of the converted transformer model). Users can still pass 161 frames in the pipeline call and generation will work without any changes here.
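For illustration, a minimal user-side sketch of requesting 161 frames at call time (the checkpoint id and the surrounding arguments are assumptions for this example, not part of this PR):

```python
import torch
from diffusers import CogVideoXPipeline

# Checkpoint id assumed for illustration; substitute the actual CogVideoX 1.5 repository.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX1.5-5B", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# `sample_frames` in the transformer config only sets the default;
# 161 frames (roughly 10 seconds at 16 fps) can still be requested explicitly.
video = pipe(
    prompt="A panda playing a guitar in a bamboo forest",
    num_frames=161,
    num_inference_steps=50,
).frames[0]
```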
```diff
 if args.vae_ckpt_path is not None:
-    vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, dtype)
+    # Keep VAE in float32 for better quality
+    vae = convert_vae(args.vae_ckpt_path, args.scaling_factor, torch.float32)
```
@zRzRzRzRzRzRzR This is a bit of a breaking change. The SAT VAE is in fp32, but the diffusers-format VAE is in bf16/fp16. This can lead to poorer quality, so it is best to keep the VAE in fp32 and let users decide what configuration to use. I will open a PR to the other CogVideoX model weight repositories with the updated VAE weights soon.
cc @yiyixuxu @DN6 The VAE quality doesn't take too much of a hit, but it is best to have the default in FP32 and update all existing checkpoints. Apologies that this slipped through earlier, but I definitely notice very minor differences in quality (at least in training, cc @sayakpaul). The transformer weights don't use variants because there are no FP32 weights, as training is done in BF16.
```python
        scheduler=scheduler,
    )

    if args.fp16:
```
Due to the explanation above, we shouldn't typecast all weights in the pipeline. The VAE is best kept in FP32, the text encoder could be saved in FP32 but works well at lower precisions too, and the transformer is in BF16 (or FP16 for CogVideoX-2B text-to-video).
Understood, this is the right thing to do.
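As a user-facing illustration of the recommendation above, a minimal sketch of composing the pipeline with mixed precisions (the checkpoint id is an assumption; only the dtype split is the point):

```python
import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXPipeline

repo_id = "THUDM/CogVideoX1.5-5B"  # assumed checkpoint id, for illustration only

# Load the VAE separately and keep it in FP32 for better reconstruction quality,
# while the transformer and text encoder stay in BF16 (matching how the transformer was trained).
vae = AutoencoderKLCogVideoX.from_pretrained(repo_id, subfolder="vae", torch_dtype=torch.float32)
pipe = CogVideoXPipeline.from_pretrained(repo_id, vae=vae, torch_dtype=torch.bfloat16)
```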
```diff
-    height: int = 480,
-    width: int = 720,
-    num_frames: int = 49,
+    height: Optional[int] = None,
```
This is not a breaking change, since we can determine these exact values from the transformer config parameters.
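Roughly, the idea is the following (a sketch with illustrative names, not a verbatim copy of the pipeline code):

```python
def resolve_defaults(transformer_config, vae_scale_factor_spatial, height=None, width=None, num_frames=None):
    # Fall back to the values stored in the transformer config, so CogVideoX 1.0
    # checkpoints keep their old 480x720x49 defaults and CogVideoX 1.5 checkpoints
    # automatically get 768x1360x81, without any breaking change in the call signature.
    height = height or transformer_config.sample_height * vae_scale_factor_spatial
    width = width or transformer_config.sample_width * vae_scale_factor_spatial
    num_frames = num_frames or transformer_config.sample_frames
    return height, width, num_frames
```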
```python
    base_size_width = self.transformer.config.sample_width // p
    base_size_height = self.transformer.config.sample_height // p
    base_num_frames = (num_frames + p_t - 1) // p_t
```
This is not a breaking change either, because it is mathematically equivalent for the 1.0 models, which do not use temporal patch embedding; the ceil division is required for the new CogVideoX 1.5 models.
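A quick sketch of why this holds (p_t is the temporal patch size; the values are illustrative):

```python
def base_num_frames(num_latent_frames: int, p_t: int) -> int:
    # Ceil division over the temporal patch size.
    return (num_latent_frames + p_t - 1) // p_t

# CogVideoX 1.0: no temporal patch embedding, which behaves like p_t = 1, so nothing changes.
assert base_num_frames(13, 1) == 13
# CogVideoX 1.5: p_t = 2, so an odd latent frame count must round up instead of truncating.
assert base_num_frames(21, 2) == 11
```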
```python
        `tuple`. When returning a tuple, the first element is a list with the generated images.
    """

    if num_frames > 49:
```
We need to remove the frame restriction since the newer models can generate at higher frame counts. 49 frames will still be the default for 1.0 models because we determine that from the transformer config params.
Hey, I've been testing this draft within my ComfyUI wrapper nodes, and I'd like clarity on a couple of things, as I'm unsure how this should perform at this stage, mostly to check whether I have made a mistake somewhere:
Hi @kijai. Could you try experimenting with this branch: zRzRzRzRzRzRzR#1? It removes the padding of hidden_states, which seems to cause corruption. Instead, you must specify frame values whose latent size is divisible by patch_size_t (such as 85 or 165). I believe there may still be some modifications remaining on how to handle frames, based on my conversation with Yuxuan. Regarding resolutions, the T2V model works best only at specific resolutions (the recommendation is to always use 1360x768), while the I2V model can generate at multiple resolutions. Also, both models generate best at 85 and 165 frames (as per my PR above, but this should actually be 81 and 161 to be consistent with the original implementation; we'll try to figure out the best way to support this today).
I did try without the padding already, but then you get the first 4 frames without movement; I'll check if that branch has something for that, thanks. To illustrate what I mean with I2V at 720x480 (for example): CogVideoX-I2V_00001.10.mp4
Also forgive me for not knowing anything about this, and admittedly without really understanding what I was doing, I tried flipping the aspect for the rotary pos embed crop coords (which probably results in it not being applied at all or something). That seemingly cleared it up for this resolution, but ruined it for the default one: cogvideo1_5_test.mp4
In this version, T2V works very well; I believe the main issue lies in I2V. For T2V, we indeed trained with only one resolution, 1360x768, so it’s quite normal that the original 720x480 resolution doesn’t work properly.
> I did try without the padding already, but then you get the first 4 frames without movement.

What I want to say is that it should actually be 21, not 22, because the first block of each latent is useless. Therefore, we need to try to remove the useless latents. This method should effectively output a 16 fps video of 5 seconds, with the input num_frames being 81.
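For context, a rough sketch of the frame arithmetic being discussed here, assuming the CogVideoX VAE's 4x temporal compression (the helper name and the exact handling of the extra latent are illustrative):

```python
def num_latent_frames(num_frames: int, temporal_compression: int = 4) -> int:
    # The first frame is encoded on its own; the remaining frames are compressed 4x in time.
    return (num_frames - 1) // temporal_compression + 1

# 81 frames (~5 s at 16 fps) -> 21 latent frames, which is not divisible by patch_size_t = 2,
# while 85 -> 22 and 165 -> 42 are, hence the 85/165 suggestion above. Supporting 81/161
# instead means either padding or dropping the extra leading latent frame, as discussed here.
print(num_latent_frames(81), num_latent_frames(85), num_latent_frames(161), num_latent_frames(165))
# 21 22 41 42
```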
Now, all changes are in the cogvideox1.1-5b branch.
@kijai It's possible that there is something wrong with our implementation then. Could you provide the patch/changes you made to do the aspect ratio flip?
I really don't know if this means anything; I simply swapped base_size_width and base_size_height here in an effort to understand what it even does, but it's not a real solution, as it then ruins the other resolutions:
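The snippet itself isn't captured here; based on the lines quoted earlier in this thread, the described swap would look roughly like this (a hypothetical illustration, not the actual patch and not a real fix):

```python
# Hypothetical illustration of the swap described above, not the actual patch:
# original
base_size_width = self.transformer.config.sample_width // p
base_size_height = self.transformer.config.sample_height // p
# swapped out of curiosity; this appeared to help 720x480 but broke 1360x768
base_size_width = self.transformer.config.sample_height // p
base_size_height = self.transformer.config.sample_width // p
```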
Also, I did just now replicate this to double-check that it does indeed somehow fix at least that video I showed above.
@yiyixuxu Re-requesting a review because of the changes made to the 3D RoPE embeds.
I don't have time to test more currently as it's 3am here, but quickly testing both examples that failed before, they indeed work now: CogVideoX-1-5-I2V_test_768_1360.mp4 CogVideoX-1-5-I2V_test_720_480.mp4
Thank you for your hard work!
I have replied to some of the issues mentioned by @a-r-r-o-w on Slack. I am currently reproducing the issue mentioned in the latest issue report.
The generated content for 768 x 1360 is incorrect on my end, but did both @kijai and @a-r-r-o-w succeed with it?
The latest fix from @a-r-r-o-w is working perfectly for me; every resolution I've tried has worked to some extent, even really small ones.
@zRzRzRzRzRzRzR Let me know if you think this is good to merge now, and we can go ahead :)
Discussed in a private DM with @zRzRzRzRzRzRzR: we have verified that the model works as intended for all supported resolutions during training. It was just a config bug in the official checkpoints, which has now been corrected.
The current settings for the T2V transformer model seem to be problematic.
Just fixed in #9963. A config attribute was updated on the transformer checkpoint to make the T2V behaviour consistent with the SAT codebase; the specific change previously existed only for I2V in diffusers, but it should now work the same as before.
* CogVideoX1_1PatchEmbed test
* 1360 * 768
* refactor
* make style
* update docs
* add modeling tests for cogvideox 1.5
* update
* make fix-copies
* add ofs embed(for convert)
* add ofs embed(for convert)
* more resolution for cogvideox1.5-5b-i2v
* use even number of latent frames only
* update pipeline implementations
* make style
* set patch_size_t as None by default
* #skip frames 0
* refactor
* make style
* update docs
* fix ofs_embed
* update docs
* invert_scale_latents
* update
* fix
* Update docs/source/en/api/pipelines/cogvideox.md Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/api/pipelines/cogvideox.md Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/api/pipelines/cogvideox.md Co-authored-by: Steven Liu <[email protected]>
* Update docs/source/en/api/pipelines/cogvideox.md Co-authored-by: Steven Liu <[email protected]>
* Update src/diffusers/models/transformers/cogvideox_transformer_3d.py
* update conversion script
* remove copied from
* fix test
* Update docs/source/en/api/pipelines/cogvideox.md
* Update docs/source/en/api/pipelines/cogvideox.md
* Update docs/source/en/api/pipelines/cogvideox.md
* Update docs/source/en/api/pipelines/cogvideox.md

---------

Co-authored-by: Aryan <[email protected]>
Co-authored-by: Steven Liu <[email protected]>
Hi, do you know how many people use Hugging Face and run the original code and the original model? You introduce big breaking changes like this and they can spend weeks debugging it. This should be communicated with the original author, and he should put this information in his README.md! It would be great to supply a complete migration guide, not only two paragraphs here in the docs where people need to read between the lines to understand the consequences.

So you also recommend this for finetuning: now, if I download some LoRAs from the community that were finetuned on the original model, will they work?

Next, did you know that the I2V model is not only for image-to-video, but also for text-to-video, since you can supply both an image and text, or only one of them? Therefore you can interactively decide for each scene whether to supply one, the other, or both.

The original bfloat16 model fit nicely in my 4090's 24 GB VRAM; now I will have to look at it again, and fp16 is definitely better. Would you recommend the Accelerate plus DeepSpeed combo (Accelerate supports DeepSpeed in its config, I guess)? Before, I used offloading and VAE slicing. Ultimately I would like to test-drive this model on TensorRT, because that should be much more powerful than CUDA only. I'm just a newbie, so I will be happy for any tips on how to optimize speed. I also still don't understand which checkpoints I should update and where to get them. THANK YOU
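For reference, the offloading and VAE slicing being referred to typically look like this in diffusers (a sketch; the checkpoint id is assumed for illustration):

```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

# Assumed checkpoint id, for illustration only.
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V", torch_dtype=torch.bfloat16
)

# Common memory optimizations for the 5B models on consumer GPUs:
pipe.enable_model_cpu_offload()  # keep submodules on the GPU only while they run
pipe.vae.enable_slicing()        # decode the latent batch one sample at a time
pipe.vae.enable_tiling()         # decode large latents in spatial tiles
```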
Also, why do you mark inference=false on Hugging Face if everyone was using the I2V model for inference?
What does this PR do?
This PR is a draft for the new generation of CogVideoX, which has not yet been fully implemented.
Main work:
The SAT version is very slow, and the memory usage is quite high; this may need optimization in future PRs.
Currently achieved
Who can help with this?
@a-r-r-o-w