Skip to content

Conversation

@a-r-r-o-w
Copy link
Contributor

What does this PR do?

Refactors the CogVideoX VAE to make the implementation similar to Mochi VAE for consistency. Will follow-up on Allegro in a separate PR

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yiyixuxu @sayakpaul

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Member

@sayakpaul sayakpaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is promising! I left some comments, many of which are nits. LMK if they make sense.

Comment on lines +1125 to +1126
self.tile_sample_min_height = 256
self.tile_sample_min_width = 256
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make sure this is 100% not backwards-breaking?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a safe change and the outputs look as good as the original generated videos to me. Maybe @zRzRzRzRzRzRzR can help confirm if 256 is a good resolution for individual tiles that are decoded.

Comment on lines -1121 to +1129
self.tile_overlap_factor_height = 1 / 6
self.tile_overlap_factor_width = 1 / 5
self.tile_sample_stride_height = 192
self.tile_sample_stride_width = 192
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we completely getting rid of the overlap_factors? It is still used in enable_tiling() no? Oh because we're deprecating them. Okay.

Copy link
Contributor Author

@a-r-r-o-w a-r-r-o-w Nov 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overlap factors are more complex to understand and work with IMO, and their usage to calculate strides are prone to rounding errors (for example, if you wanted the stride to be 192, you would have to try to figure out the correct overlap factor fraction, and even then you could end up with something like 191 or 193). I feel like this makes things easier to understand and simple, but it is mostly a personal preference and done for consistency reasons across our VAE implementations (currently only with Mochi but I do plan to refactor Allegro the same way too). I know that the CogVideoX ComfyUI wrapper allows users to modify these settings, and some overlap values can lead to artifacts when decoding, so this is an attempt to make those kinds of errors appear lesser.

deprecate(
"tile_overlap_factor",
"1.0.0",
"The parameters `tile_overlap_factor_height` and `tile_overlap_factor_width` are deprecated. Please use `tile_sample_stride_height` and `tile_sample_stride_width` instead.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The parameters `tile_overlap_factor_height` and `tile_overlap_factor_width` are deprecated. Please use `tile_sample_stride_height` and `tile_sample_stride_width` instead.",
"The parameters `tile_overlap_factor_height` and `tile_overlap_factor_width` are deprecated and will be ignored. Please use `tile_sample_stride_height` and `tile_sample_stride_width` instead. For now, we will use these flags automatically without breaking the existing behaviour.",

Comment on lines 1168 to 1169
tile_sample_stride_height = int((1 - tile_overlap_factor_height) * self.tile_sample_min_height) // 8 * 8
tile_sample_stride_width = int((1 - tile_overlap_factor_width) * self.tile_sample_min_width) // 8 * 8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should 8 be kept into a sensible variable for better readability?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah. This should be self.spatial_compression_ratio

Comment on lines +1100 to +1103
# When decoding temporally long video latents, the memory requirement is very high. By decoding latent frames
# at a fixed frame batch size (based on `self.num_latent_frames_batch_sizes`), the memory requirement can be lowered.
self.use_framewise_encoding = True
self.use_framewise_decoding = True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're fixing them to True, does it even make sense to have these flags then? If we're mutating later, then maybe add a note about it here to make it clear?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember we discussed about allowing one shot decoding and applying quantization on the activations? Allowing framewise stuff to default, but configurably opt-out would be helpful for that because until now we did not support oneshot decoding

# The tiles have an overlap to avoid seams between tiles.
rows = []
for i in range(0, height, overlap_height):
for i in range(0, height, self.tile_sample_stride_height):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to disregard this comment entirely but I think it could make sense to add some comments before the for loops as each one of them includes a lot of indexing operations and it can be difficult for readability.

@a-r-r-o-w a-r-r-o-w requested a review from DN6 November 19, 2024 21:42
@a-r-r-o-w
Copy link
Contributor Author

Gentle ping @DN6

@a-r-r-o-w
Copy link
Contributor Author

Converting to draft for a bit. I had something in mind that I wanted to try. Will open for review again soon

@github-actions
Copy link
Contributor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Dec 21, 2024
@a-r-r-o-w a-r-r-o-w closed this Mar 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stale Issues that haven't received updates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants