Enhance-A-Video #10815
Conversation
scores = mean_scores.mean() * (num_frames + weight)
scores = scores.clamp(min=1)
return scores
@yangluo7 @oahzxl scores here is always 1 with many different inputs that I tried. I've copied this part from the original implementation: https://github.com/NUS-HPC-AI-Lab/Enhance-A-Video/blob/088d9e047b1738a45a253fd7cbe37fdf8526fb97/enhance_a_video/enhance.py
Am I doing something incorrect here or elsewhere? Thanks for your time!
@a-r-r-o-w I noticed you set the enhance weight to 1 in "config = EnhanceAVideoConfig(weight=1.0, num_frames_callback=lambda: latent_num_frames, _attention_type=1)". It may be too small to affect the final output. In our experiments, the weight is at least 5 for LTX-Video with the setting "width=768, height=512, num_frames=121." Thanks a lot!
Thank you! I see it taking effect now :)
The enhance weight is the only parameter introduced by our proposed method. It is affected by several factors, including num_frames and the prompt, so it needs to be tuned accordingly. We sincerely thank you for incorporating our method into diffusers, which makes our work more accessible to the community :)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
num_off_diag = num_frames * num_frames - num_frames
mean_scores = attn_wo_diag.sum(dim=(1, 2)) / num_off_diag

scores = mean_scores.mean() * (num_frames + weight)
Also, should the mean be taken across all dimensions? I think it might be incorrect since each batch of data should have a different score due to different conditioning. Since we concatenate both unconditional and conditional branches and run batched inference, I believe this should be mean_scores.mean(list(range(1, mean_scores.ndim))). This will give us a tensor of shape (B,), which will also be compatible for multiplication and seems more correct to me
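To make the two variants concrete, here is a minimal sketch; the shape of mean_scores is a placeholder and not taken from the PR:

import torch

# Placeholder shape for mean_scores; the real tensor comes from the framewise attention map.
batch_size = 2  # unconditional + conditional branches concatenated
mean_scores = torch.rand(batch_size, 8)

# Current implementation: a single scalar score shared by every batch entry
scalar_score = mean_scores.mean()

# Proposed alternative: one score per batch entry, shape (B,), so the
# unconditional and conditional branches each get their own score
per_batch_score = mean_scores.mean(dim=list(range(1, mean_scores.ndim)))
assert per_batch_score.shape == (batch_size,)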
Thanks for your advice. We have tried this implementation before by calculating the mean score for each branch separately but found no obvious difference in the final output. As a result, we chose a more concise implementation by calculating the mean score together.
I tested this. The first thing to note is that the model isn't released from memory when inference has finished. I used this prompt, captioned by ChatGPT, imitating the style of the prompt above:
I used the code above, but changed the weight value.
No enhance: girl_output.mp4
Weight 2: girl_enhance2_output.mp4
Weight 3: girl_enhance3_output.mp4
Weight 5: girl_enhance5_output.mp4
Weight 7: girl_enhance7_output.mp4
Weight 8.5: girl_enhance8.5_output.mp4
Weight 10: girl_enhance10_output.mp4
Weight 15: girl_enhance15_output.mp4
They all have deformities, so it is kind of hard to conclude anything (is something between 7 and 10 best? But the colors start to fry at 15?), but maybe that is because LTX Video is wonky from the start (personally, I find it very hard to get LTX to produce anything without severe deformities when not using the default prompt). So it is clear with the current implementation that the enhance function is doing something, but it is not super clear to me whether it is enough to save poor input material. Would it clean up the deformities if run twice?
Thanks for the testing. Firstly, Enhance-A-Video improves the generated video's quality based on the foundation model's existing attention weights and makes moderate adjustments in the residual connection, so the final video quality still relies on the generative quality of the foundation model itself. If the original generated video quality is quite low, it is hard to generate an ideal video without improving the pre-training phase. Secondly, we can see that the video quality is improved with weights between 8 and 10, which demonstrates the effectiveness of Enhance-A-Video.
design looks good to me
I think we should start to add tests to enforce certain styles now that we want to expand and encourage the usage of hooks
| """ | ||
|
|
||
| weight: float = 1.0 | ||
| num_frames_callback: Callable[[], int] = None |
why does this need to be a function?
So... there's no easy way to determine this. Some models use dim=1 as the frame dimension, whereas others use dim=2 (considering a 5D tensor as the input going into the transformer). Some models don't do this at all; for example, LTX Video already flattens the FHW dimensions before the transformer forward.
The information about the number of latent frames is only available in the model transformer. Even then, it is sometimes modified by a patch embedding layer -- we don't know for sure, in the general case, how to determine the number of frames being used for inference.
In the Attention blocks where we attach hooks, the tensors have shape [B, S, D], so we don't have access to this info either.
The only source for accurately getting this information is the user :( I'm open to suggestions and holding on to the PR for longer if we can figure out a better way to do this.
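For context, a rough sketch of how the callback ends up being supplied by the caller; the config fields come from this PR, but the import path, pipeline variables, temporal compression factor, and apply_enhance_a_video signature are assumptions for illustration:

from diffusers.hooks import EnhanceAVideoConfig, apply_enhance_a_video  # import path assumed

num_frames = 121
temporal_compression = 8  # assumed VAE temporal compression factor for LTX-Video
latent_num_frames = (num_frames - 1) // temporal_compression + 1  # 16 latent frames

config = EnhanceAVideoConfig(
    weight=5.0,
    # The transformer only ever sees flattened [B, S, D] tokens, so the number of
    # latent frames has to be provided by the caller through this callback.
    num_frames_callback=lambda: latent_num_frames,
)
apply_enhance_a_video(pipe.transformer, config)  # `pipe` is an assumed, already-loaded pipeline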
oh that's fine,
I was just wondering why it is a function, not a constant
def new_forward(self, module, *args, **kwargs):
    # Here, query and key have two shapes (considering the general diffusers-style model implementation):
    # 1. [batch_size, attention_heads, latents_sequence_length, head_dim]
can we start to automatically test and enforce this (make sure it's the case for all new models we implement)?
OmniGen almost did not follow this, and it was not always easy to spot such things
Sounds good. I can add some tests soon to enforce this on any model that is added
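One possible shape for such a check, sketched here with a hypothetical get_dummy_model_and_inputs helper (not an existing diffusers utility); it simply asserts that every Attention module receives [batch_size, sequence_length, embedding_dim] inputs:

import torch

def test_attention_inputs_are_batch_seq_dim():
    # get_dummy_model_and_inputs is hypothetical; each model test class would provide its own.
    model, inputs = get_dummy_model_and_inputs()
    seen_shapes = []

    def record_shape(module, args, kwargs, output):
        hidden_states = args[0] if args else kwargs["hidden_states"]
        seen_shapes.append(tuple(hidden_states.shape))

    handles = [
        m.register_forward_hook(record_shape, with_kwargs=True)
        for m in model.modules()
        if m.__class__.__name__ == "Attention"
    ]
    with torch.no_grad():
        model(**inputs)
    for handle in handles:
        handle.remove()

    # Every attention block should see [batch_size, sequence_length, embedding_dim]
    assert seen_shapes and all(len(shape) == 3 for shape in seen_shapes)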
def reshape_for_framewise_attention(tensor: torch.Tensor) -> torch.Tensor:
    # This code assumes tensor is [B, N, S, C]. This should be true for most diffusers-style implementations.
    # [B, N, S, C] -> [B, N, F, S, C] -> [B, S, N, F, C] -> [B * S, N, F, C]
same here
can we start to enforce this?
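For reference, a standalone sketch of the reshape the quoted comment describes; the actual helper in the PR takes only the tensor, so passing num_frames explicitly here is an assumption for illustration:

import torch

def reshape_for_framewise_attention_sketch(tensor: torch.Tensor, num_frames: int) -> torch.Tensor:
    # [B, N, S, C] -> [B, N, F, S // F, C] -> [B, S // F, N, F, C] -> [B * (S // F), N, F, C]
    batch_size, num_heads, seq_len, channels = tensor.shape
    spatial = seq_len // num_frames  # tokens per frame
    tensor = tensor.reshape(batch_size, num_heads, num_frames, spatial, channels)
    tensor = tensor.permute(0, 3, 1, 2, 4)
    return tensor.flatten(0, 1)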
weight: Union[float, Dict[str, float]] = 1.0
num_frames_callback: Callable[[], int] = None
_attention_type: _AttentionType = _AttentionType.SELF
seems like only this is a config
weight and num_frames are more like runtime arguments, no? currently how do we update these for each generation?
We currently only support dynamically updating these values if the user first removes all hooks by calling remove_enhance_a_video and then calls apply_enhance_a_video again. It's, uh, not really ideal, but it is a lightweight operation so we can get away with it.
Alternatively, to update dynamically, do you think we should do this:
- when the user calls apply_enhance_a_video, we return them some kind of handle object that has knowledge about the hooks
- they can call set_weight and set_frames methods on that handle (sketched below)
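To make that alternative concrete, a hypothetical sketch of the handle; nothing here exists in the PR, and the attribute names on the hook objects are made up:

class EnhanceAVideoHandle:
    # Hypothetical object returned by apply_enhance_a_video, purely illustrative.
    def __init__(self, hooks):
        self._hooks = hooks  # assumed: one hook object per patched attention block

    def set_weight(self, weight: float) -> None:
        for hook in self._hooks:
            hook.weight = weight  # assumed attribute on the hook

    def set_frames(self, num_frames: int) -> None:
        for hook in self._hooks:
            hook.num_frames_callback = lambda: num_frames  # assumed attribute on the hook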
Also, perhaps we don't need the _attention_type argument. I can define a simple dictionary in the _common.py file that categorizes each attention processor into the three groups -- I think this is good info to have for some other methods that we could integrate soon
remove_enhance_a_video is fine, I think!
we can wait for a couple more use cases to decide how to support this
(I don't like the set_weight and set_frames methods because they are specific to each config name; I think we need something more generic)
@tin2tin Our LTX implementation is missing a few of the latest features implemented in the original repository to improve generation quality. This will be improved soon, once I find some time to work on it. I added support for a weight factor per block (you can specify a dictionary mapping regex patterns to weight values), so you can play around a bit and see which layers are best suited for applying the method -- from my testing, applying it to blocks 5-20 seems to work best and does not modify predictions too much.
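A hedged example of the per-block weighting described above; the regex and the "transformer_blocks.N" module-path naming are assumptions based on typical diffusers models, not something verified against the PR:

# Apply a stronger weight only on blocks 5-20; blocks not matched by the pattern
# are assumed to be left untouched (regex-over-module-path semantics assumed).
config = EnhanceAVideoConfig(
    weight={r"transformer_blocks\.([5-9]|1[0-9]|20)\.": 7.5},
    num_frames_callback=lambda: latent_num_frames,
)
apply_enhance_a_video(pipe.transformer, config)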
@yangluo7 Would you be able to give a final review as well for a correctness check? The implementation will not change much after my latest commit, apart from docs/tests, so it is more or less finalized. Thanks!
# 1. [batch_size, attention_heads, latents_sequence_length, head_dim]
# 2. [batch_size, attention_heads, latents_sequence_length + encoder_sequence_length, head_dim]
# 3. [batch_size, attention_heads, encoder_sequence_length + latents_sequence_length, head_dim]
Suggested change:
- # 1. [batch_size, attention_heads, latents_sequence_length, head_dim]
- # 2. [batch_size, attention_heads, latents_sequence_length + encoder_sequence_length, head_dim]
- # 3. [batch_size, attention_heads, encoder_sequence_length + latents_sequence_length, head_dim]
+ # 1. [batch_size, latents_sequence_length, embedding_dim]
+ # 2. [batch_size, latents_sequence_length + encoder_sequence_length, embedding_dim]
+ # 3. [batch_size, encoder_sequence_length + latents_sequence_length, embedding_dim]
return module


def new_forward(self, module, *args, **kwargs):
    # Here, query and key have two shapes (considering the general diffusers-style model implementation):
Suggested change:
- # Here, query and key have two shapes (considering the general diffusers-style model implementation):
+ # Here, hidden_states could have three shapes (considering the general diffusers-style model implementation):
hook_registry.register_hook(hook, _ENHANCE_A_VIDEO)


def remove_enhance_a_video(module: torch.nn.Module) -> None:
do we have a method to call when we want to remove all the model hooks on a model?
Not yet :(
We can only remove hooks one at a time:
diffusers/src/diffusers/hooks/hooks.py, line 179 in f8b54cf:
    def remove_hook(self, name: str, recurse: bool = True) -> None:
I can add a method that allows removing all hooks if you'd like, LMK
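If it helps the discussion, a rough sketch of what that could look like on the registry; it assumes the registry keeps a name-to-hook mapping in self.hooks, which is an assumption about the internals rather than something confirmed here:

# Hypothetical companion to the existing remove_hook(name, recurse=True) method.
def remove_all_hooks(self, recurse: bool = True) -> None:
    # Copy the keys first because remove_hook mutates the registry while we iterate.
    for name in list(self.hooks.keys()):
        self.remove_hook(name, recurse=recurse)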
Hmm, might be good to add methods such as enable_hook, disable_hook, disable_all_hooks, etc. to ModelMixin.
@DN6 Sounds good. I think we should add those first, so will hold off merging here and open a PR for that first
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Adds support for Enhance-A-Video.
Paper: https://huggingface.co/papers/2502.07508
Project: https://oahzxl.github.io/Enhance_A_Video
Code: https://github.com/NUS-HPC-AI-Lab/Enhance-A-Video
The PR needs some rework in terms of user-facing API design. I'll need some reviews to gather thoughts on how best to implement this and make available with most, if not all, diffusers or diffusers-like video model implementations.
Currently, I've only tested with LTX Video.
The intended effect of Enhance-A-Video does not seem to be applied yet, as outputs with and without it are the same. I did some quick debugging, and it seems like the enhance scores are always 1. This leads to no effect on the hidden_states * scores returned from the attention block. Will need to investigate with the authors if I'm doing something wrong.
cc @yangluo7 @oahzxl @kaiwang960112
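For intuition, a back-of-the-envelope illustration of the quoted score formula with invented attention values; it only shows how a small weight can leave the score clamped at 1, and does not reproduce any real run:

import torch

# Toy frame-to-frame attention map for F latent frames: each row sums to 1,
# with 0.2 of the mass on the diagonal. All numbers are invented.
num_frames = 16
attn = torch.full((num_frames, num_frames), 0.8 / (num_frames - 1))
attn.fill_diagonal_(0.2)

# Same arithmetic as the snippet quoted at the top of the thread
attn_wo_diag = attn.clone()
attn_wo_diag.fill_diagonal_(0.0)
num_off_diag = num_frames * num_frames - num_frames
mean_scores = attn_wo_diag.sum() / num_off_diag  # ~0.053

for weight in (1.0, 5.0, 10.0):
    scores = (mean_scores * (num_frames + weight)).clamp(min=1)
    print(weight, round(scores.item(), 3))
# weight=1.0  -> 1.0   (clamped, so hidden_states * scores changes nothing)
# weight=5.0  -> 1.12
# weight=10.0 -> 1.387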