[refactor] CogVideoX VAE #9903


Closed

a-r-r-o-w wants to merge 5 commits into main from refactor-cogvideox-vae

Contributor

a-r-r-o-w commented Nov 11, 2024

What does this PR do?

Refactors the CogVideoX VAE to make the implementation similar to Mochi VAE for consistency. Will follow-up on Allegro in a separate PR

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yiyixuxu @sayakpaul

a-r-r-o-w added 2 commits

November 11, 2024 03:50


          refactor

01d6a1b


          make style

af00830

a-r-r-o-w requested review from sayakpaul and yiyixuxu

November 11, 2024 02:53

HuggingFaceDocBuilderDev commented Nov 11, 2024

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


          fight tests

2f1f43c

sayakpaul reviewed


Member

sayakpaul left a comment

This is promising! I left some comments, many of which are nits. LMK if they make sense.

src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py

Comment on lines +1125 to +1126

    
                      self.tile_sample_min_height = 256

                      self.tile_sample_min_width = 256

Member

sayakpaul Nov 11, 2024

Can we make sure this is 100% not backwards-breaking?

Contributor Author

a-r-r-o-w Nov 12, 2024

I think it is a safe change and the outputs look as good as the original generated videos to me. Maybe @zRzRzRzRzRzRzR can help confirm if 256 is a good resolution for individual tiles that are decoded.

src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py

Comment on lines -1121 to +1129

    
                    - self.tile_overlap_factor_height = 1 / 6

                    - self.tile_overlap_factor_width = 1 / 5

                    + self.tile_sample_stride_height = 192

                    + self.tile_sample_stride_width = 192

Member

sayakpaul Nov 11, 2024

Why are we completely getting rid of the overlap_factors? It is still used in enable_tiling() no? Oh because we're deprecating them. Okay.

Contributor Author

a-r-r-o-w Nov 12, 2024 (edited)

The overlap factors are more complex to understand and work with IMO, and using them to calculate strides is prone to rounding errors (for example, if you wanted the stride to be 192, you would have to figure out the correct overlap factor fraction, and even then you could end up with something like 191 or 193). I feel like this makes things simpler and easier to understand, but it is mostly a personal preference and done for consistency across our VAE implementations (currently only with Mochi, but I do plan to refactor Allegro the same way too). I know that the CogVideoX ComfyUI wrapper allows users to modify these settings, and some overlap values can lead to artifacts when decoding, so this is an attempt to make those kinds of errors less likely.

src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py Outdated

    
                          deprecate(

                              "tile_overlap_factor",

                              "1.0.0",

                              "The parameters `tile_overlap_factor_height` and `tile_overlap_factor_width` are deprecated. Please use `tile_sample_stride_height` and `tile_sample_stride_width` instead.",

Member

sayakpaul Nov 11, 2024

Suggested change

      
                          - "The parameters `tile_overlap_factor_height` and `tile_overlap_factor_width` are deprecated. Please use `tile_sample_stride_height` and `tile_sample_stride_width` instead.",
          
                          + "The parameters `tile_overlap_factor_height` and `tile_overlap_factor_width` are deprecated and will be ignored. Please use `tile_sample_stride_height` and `tile_sample_stride_width` instead. For now, we will use these flags automatically without breaking the existing behaviour.",

src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py Outdated

Comment on lines 1168 to 1169

    
                          tile_sample_stride_height = int((1 - tile_overlap_factor_height) * self.tile_sample_min_height) // 8 * 8

                          tile_sample_stride_width = int((1 - tile_overlap_factor_width) * self.tile_sample_min_width) // 8 * 8

Member

sayakpaul Nov 11, 2024

Should 8 be stored in a descriptively named variable for better readability?

Contributor Author

a-r-r-o-w Nov 12, 2024

Ah yeah. This should be self.spatial_compression_ratio
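For readers following the thread, here is a minimal sketch of the conversion quoted above with the literal 8 replaced by a named ratio. The function name `stride_from_overlap` and the module-level constant are hypothetical stand-ins for illustration; only the arithmetic is taken from the two lines under review.

```python
# Assumption: this constant stands in for `self.spatial_compression_ratio`.
SPATIAL_COMPRESSION_RATIO = 8


def stride_from_overlap(tile_size, overlap_factor, ratio=SPATIAL_COMPRESSION_RATIO):
    """Convert a deprecated overlap factor into a tile stride (hypothetical helper)."""
    # Fraction of the tile that is NOT shared with the next tile, truncated to an int.
    raw_stride = int((1 - overlap_factor) * tile_size)
    # Round down to a multiple of the compression ratio.
    return raw_stride // ratio * ratio


# With 256-pixel tiles, the old default factors do not land on 192.
print(stride_from_overlap(256, 1 / 6))  # 208
print(stride_from_overlap(256, 1 / 5))  # 200
# A factor of exactly 0.25 is needed to reach a stride of 192.
print(stride_from_overlap(256, 0.25))   # 192
```

This also illustrates the point made earlier in the review: the stride a given factor produces is not obvious at a glance, whereas setting `tile_sample_stride_height = 192` directly is.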

src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py

Comment on lines +1100 to +1103

    
                      # When decoding temporally long video latents, the memory requirement is very high. By decoding latent frames

                      # at a fixed frame batch size (based on `self.num_latent_frames_batch_sizes`), the memory requirement can be lowered.

                      self.use_framewise_encoding = True

                      self.use_framewise_decoding = True

Member

sayakpaul Nov 11, 2024

If we're fixing them to True, does it even make sense to have these flags then? If we're mutating later, then maybe add a note about it here to make it clear?

Contributor Author

a-r-r-o-w Nov 12, 2024

Remember we discussed allowing one-shot decoding and applying quantization to the activations? Keeping framewise processing as the default, with a configurable opt-out, would help with that, because until now we did not support one-shot decoding.
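A toy sketch of what a default-on, opt-out framewise flag could look like. Everything here is hypothetical (`decode_video`, `toy_decode`, the batch size of 2) and assumes a decoder that treats frames independently; a real video VAE may need to carry state between chunks, which this sketch does not model.

```python
def decode_video(latent_frames, decode_fn, frame_batch_size=2, use_framewise_decoding=True):
    """Decode all frames at once, or in fixed-size chunks to bound peak memory."""
    if not use_framewise_decoding:
        # One-shot path: a single call sees every latent frame.
        return decode_fn(latent_frames)

    decoded = []
    # Framewise path: each call sees at most `frame_batch_size` frames.
    for start in range(0, len(latent_frames), frame_batch_size):
        chunk = latent_frames[start:start + frame_batch_size]
        decoded.extend(decode_fn(chunk))
    return decoded


batch_sizes_seen = []


def toy_decode(frames):
    # Record how many frames each call had to hold at once.
    batch_sizes_seen.append(len(frames))
    return [frame * 2 for frame in frames]


latents = [1, 2, 3, 4, 5]
framewise = decode_video(latents, toy_decode)
one_shot = decode_video(latents, toy_decode, use_framewise_decoding=False)
print(framewise == one_shot)  # True
print(batch_sizes_seen)       # [2, 2, 1, 5]
```

The recorded batch sizes show the trade-off: the framewise path never holds more than two frames per call, while the one-shot path holds all five.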

src/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py

    
                      # The tiles have an overlap to avoid seams between tiles.

                      rows = []

                    - for i in range(0, height, overlap_height):

                    + for i in range(0, height, self.tile_sample_stride_height):

Member

sayakpaul Nov 11, 2024

Feel free to disregard this comment entirely, but I think it could make sense to add some comments before the for loops, as each of them includes a lot of indexing operations that can be hard to read.
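As an illustration of the kind of loop comments being asked for, here is a simplified sketch that only enumerates tile boundaries. It is not the PR's code: `tile_boxes` is a hypothetical helper, the 480x720 frame size is an arbitrary example, and the defaults mirror the 256 tile size and 192 stride discussed above.

```python
def tile_boxes(height, width, tile_h=256, tile_w=256, stride_h=192, stride_w=192):
    """Return (top, bottom, left, right) pixel boxes for overlapping tiles."""
    boxes = []
    # Outer loop: move down the frame one stride at a time; each step starts a new row of tiles.
    for top in range(0, height, stride_h):
        # Inner loop: move across the current row one stride at a time.
        for left in range(0, width, stride_w):
            # Each tile spans a full tile size from its origin, clamped at the frame border.
            # Because the tile size exceeds the stride, neighbouring tiles overlap.
            bottom = min(top + tile_h, height)
            right = min(left + tile_w, width)
            boxes.append((top, bottom, left, right))
    return boxes


boxes = tile_boxes(480, 720)
print(len(boxes))  # 12
print(boxes[0])    # (0, 256, 0, 256)
print(boxes[-1])   # (384, 480, 576, 720)
```

With these values, interior neighbours share 256 - 192 = 64 pixels, which is the region a blending step would smooth over to avoid seams.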


          apply suggestions from review

b49ae8c

a-r-r-o-w requested a review from DN6

November 19, 2024 21:42


          Merge branch 'main' into refactor-cogvideox-vae

da9e4ba

Contributor Author

a-r-r-o-w commented Nov 22, 2024

Gentle ping @DN6

a-r-r-o-w mentioned this pull request

Using StableDiffusionControlNetImg2ImgPipeline Enable_vae_tiling(), seemingly fixed the patch is 512 x 512, where should I set the relevant parameters #9983

Closed

a-r-r-o-w marked this pull request as draft

November 27, 2024 01:02

Contributor Author

a-r-r-o-w commented Nov 27, 2024

Converting to draft for a bit. I had something in mind that I wanted to try. Will open for review again soon

Contributor

github-actions bot commented Dec 21, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label

a-r-r-o-w closed this


Labels

stale