-
Couldn't load subscription status.
- Fork 6.5k
[refactor] CogVideoX VAE #9903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[refactor] CogVideoX VAE #9903
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is promising! I left some comments, many of which are nits. LMK if they make sense.
| self.tile_sample_min_height = 256 | ||
| self.tile_sample_min_width = 256 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make sure this is 100% not backwards-breaking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a safe change and the outputs look as good as the original generated videos to me. Maybe @zRzRzRzRzRzRzR can help confirm if 256 is a good resolution for individual tiles that are decoded.
| self.tile_overlap_factor_height = 1 / 6 | ||
| self.tile_overlap_factor_width = 1 / 5 | ||
| self.tile_sample_stride_height = 192 | ||
| self.tile_sample_stride_width = 192 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we completely getting rid of the overlap_factors? It is still used in enable_tiling() no? Oh because we're deprecating them. Okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overlap factors are more complex to understand and work with IMO, and their usage to calculate strides are prone to rounding errors (for example, if you wanted the stride to be 192, you would have to try to figure out the correct overlap factor fraction, and even then you could end up with something like 191 or 193). I feel like this makes things easier to understand and simple, but it is mostly a personal preference and done for consistency reasons across our VAE implementations (currently only with Mochi but I do plan to refactor Allegro the same way too). I know that the CogVideoX ComfyUI wrapper allows users to modify these settings, and some overlap values can lead to artifacts when decoding, so this is an attempt to make those kinds of errors appear lesser.
| deprecate( | ||
| "tile_overlap_factor", | ||
| "1.0.0", | ||
| "The parameters `tile_overlap_factor_height` and `tile_overlap_factor_width` are deprecated. Please use `tile_sample_stride_height` and `tile_sample_stride_width` instead.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| "The parameters `tile_overlap_factor_height` and `tile_overlap_factor_width` are deprecated. Please use `tile_sample_stride_height` and `tile_sample_stride_width` instead.", | |
| "The parameters `tile_overlap_factor_height` and `tile_overlap_factor_width` are deprecated and will be ignored. Please use `tile_sample_stride_height` and `tile_sample_stride_width` instead. For now, we will use these flags automatically without breaking the existing behaviour.", |
| tile_sample_stride_height = int((1 - tile_overlap_factor_height) * self.tile_sample_min_height) // 8 * 8 | ||
| tile_sample_stride_width = int((1 - tile_overlap_factor_width) * self.tile_sample_min_width) // 8 * 8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should 8 be kept into a sensible variable for better readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah. This should be self.spatial_compression_ratio
| # When decoding temporally long video latents, the memory requirement is very high. By decoding latent frames | ||
| # at a fixed frame batch size (based on `self.num_latent_frames_batch_sizes`), the memory requirement can be lowered. | ||
| self.use_framewise_encoding = True | ||
| self.use_framewise_decoding = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're fixing them to True, does it even make sense to have these flags then? If we're mutating later, then maybe add a note about it here to make it clear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember we discussed about allowing one shot decoding and applying quantization on the activations? Allowing framewise stuff to default, but configurably opt-out would be helpful for that because until now we did not support oneshot decoding
| # The tiles have an overlap to avoid seams between tiles. | ||
| rows = [] | ||
| for i in range(0, height, overlap_height): | ||
| for i in range(0, height, self.tile_sample_stride_height): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to disregard this comment entirely but I think it could make sense to add some comments before the for loops as each one of them includes a lot of indexing operations and it can be difficult for readability.
|
Gentle ping @DN6 |
|
Converting to draft for a bit. I had something in mind that I wanted to try. Will open for review again soon |
|
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
What does this PR do?
Refactors the CogVideoX VAE to make the implementation similar to Mochi VAE for consistency. Will follow-up on Allegro in a separate PR
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@yiyixuxu @sayakpaul