Conversation

@a-r-r-o-w (Contributor) commented Feb 18, 2024

What does this PR do?

This PR adds support for the Stable Video Diffusion version of MotionCtrl as a community pipeline. This is the continuation of #6844 to keep the changes clean. This version of MotionCtrl only supports camera control. For more details, you can check out the linked issue below.

Fixes #6688.

Colab: https://colab.research.google.com/drive/17xIdW-xWk4hCAIkGq0OfiJYUqwWSPSAz?usp=sharing
Paper: https://arxiv.org/abs/2312.03641
Project site: https://wzhouxiff.github.io/projects/MotionCtrl/
Authors: @wzhouxiff, @jiangyzy, @xinntao, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan

Update: MotionCtrl was just featured on Two Minute Papers. What a time to be alive!


Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6 @sayakpaul

@a-r-r-o-w (Contributor, Author) commented Feb 18, 2024

Results (manually downscaled):

I've also taken the liberty of adding linear interpolation between the camera projection embeddings and the original hidden states in SVD. Apart from that, I believe the implementation is faithful to the original. From the results, MotionCtrl with SVD currently allows panning, zooming, and other complex camera motions. With the lerp, we can get some more object motion (see results below; motionctrl_scale=0 is essentially plain SVD because the camera projection embeddings are not used at all, the only remaining difference being the attn2 layer weights of TemporalBasicTransformerBlock, which were also trained and so differ from original SVD). If you want a stronger panning/zooming effect, set motionctrl_scale higher using pipe.unet.set_motionctrl_scale(0.8). Increasing camera_speed allows for faster panning/zooming. A usage sketch follows the result grid below.

[Result grid: motionctrl_scale = 0, 0.2, 0.4 (top row); 0.6, 0.8, 1.0 (bottom row)]
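For reference, here's a minimal usage sketch. It assumes diffusers is installed from this branch; the class and checkpoint names come from this PR, but the exact call signature is illustrative rather than verified:

import torch
from diffusers.utils import load_image
# These imports require a clone of the repo (see the discussion below).
from examples.community.pipeline_stable_video_motionctrl_diffusion import (
    StableVideoMotionCtrlDiffusionPipeline,
    UNetSpatioTemporalConditionMotionCtrlModel,
)

unet = UNetSpatioTemporalConditionMotionCtrlModel.from_pretrained(
    "a-r-r-o-w/motionctrl-svd", subfolder="unet", torch_dtype=torch.float16
)
pipe = StableVideoMotionCtrlDiffusionPipeline.from_pretrained(
    "a-r-r-o-w/motionctrl-svd", unet=unet, torch_dtype=torch.float16
).to("cuda")

# 0.0 is essentially plain SVD; higher values give stronger camera control.
pipe.unet.set_motionctrl_scale(0.8)

image = load_image("input.png")
camera_pose = torch.zeros(14, 3, 4)  # placeholder; supply a real (num_frames, 3, 4) trajectory
frames = pipe(image, camera_pose=camera_pose, camera_speed=1.0).frames[0]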

@a-r-r-o-w (Contributor, Author)

@sayakpaul I thought about moving this to research_projects, but the amount of code duplication was insane (2000+ lines), so I feel it is better here with the hacky UNet. I'd like to know your thoughts on the use_legacy flag (which enables/disables the quant_conv layer, because some of the newer checkpoints like dragnuwa/motionctrl do not use it), and whether the from examples.community.pipeline_stable_video_motionctrl_diffusion import UNetSpatioTemporalConditionMotionCtrlModel part is okay (you can only use this UNet if you clone the repo; it isn't available if you just pip install diffusers). A sketch of what the flag gates is below.
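For context, a sketch of what use_legacy would gate inside the autoencoder's __init__ (illustrative only; the real change lives in the autoencoder file on this branch, and the layer shapes follow AutoencoderKL):

# Only create the quant convs for checkpoints that actually ship those weights.
if use_legacy:
    self.quant_conv = nn.Conv2d(2 * latent_channels, 2 * latent_channels, 1)
    self.post_quant_conv = nn.Conv2d(latent_channels, latent_channels, 1)
else:
    self.quant_conv = None
    self.post_quant_conv = None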

@a-r-r-o-w (Contributor, Author) commented Feb 18, 2024 via email

cross_attention_dim: Optional[int] = None,
):
super().__init__()
self.time_mix_inner_dim = time_mix_inner_dim
@a-r-r-o-w (Contributor, Author) commented on the lines above:

@DN6 I think it would really help if every module stored its init parameters as attributes. It helps with customizing models, and I've hit roadblocks getting the correct dimensions when modifying different layers for experimentation. Here's an example to demonstrate the use case:

import torch.nn as nn
from diffusers.models.resnet import ResnetBlock2D

for _, module in self.named_modules():
    if isinstance(module, ResnetBlock2D):
        # The stored init parameters make it easy to size a new layer correctly.
        new_layer = nn.Linear(module.in_channels, module.out_channels)
        module.add_module("new_layer", new_layer)

        # Bind a custom forward (defined elsewhere) to this module instance.
        new_forward = custom_resnetblock2d_forward.__get__(module, module.__class__)
        setattr(module, "forward", new_forward)

Many modelling blocks already do this, as is the case with ResnetBlock2D, but many, such as TemporalBasicTransformerBlock, don't. It would help if this were consistent across all modelling components. WDYT?

@a-r-r-o-w (Contributor, Author)

@DN6 @sayakpaul Ready for another review :)

@a-r-r-o-w (Contributor, Author) commented Mar 26, 2024

Could you verify that you're running from this branch? That is, that you installed diffusers with pip install git+https://github.com/a-r-r-o-w/diffusers@re-motionctrl or similar. This is required because there are changes to the autoencoder file here.

Your error hints at the quant_conv layers still being instantiated (their weights are being expected) despite the config saying not to use them. I haven't been able to reproduce this with my branch, but that error does come up if you're using, say, the main/PyPI version.

@jhj7905 commented Mar 26, 2024

@a-r-r-o-w
Thank you for the reply. Installing diffusers with pip install git+https://github.com/a-r-r-o-w/diffusers@re-motionctrl solved the problem.
Now I'm implementing the training code on top of your code.

@a-r-r-o-w (Contributor, Author)

Awesome, glad to know that worked!

Regarding training the SVD version: since a few projection layers are the only addition for the camera motion module, I went ahead and repurposed the Stable Diffusion training script last weekend. However, when actually trying to train, 24/32 GB GPUs were not enough (out-of-memory errors), and I lack access to better compute for testing at the moment, which has put it on hold for me. It would be awesome if you're able to create it :) The idea you mentioned in our email thread is very cool, with lots of potential applications; I hope it's a success!

@jhj7905 commented Mar 26, 2024

@a-r-r-o-w Oh, cool!
How about sharing your training code?
I checked the SVD training code from this repo (https://github.com/pixeli99/SVD_Xtend/tree/main) and implemented it.

@a-r-r-o-w a-r-r-o-w requested a review from DN6 March 26, 2024 17:12
@a-r-r-o-w (Contributor, Author)

This is what I used too. Only minor changes are needed: copying the UNet modifications from here and freezing the remaining params (a rough sketch is below). The problem is that I run into out-of-memory errors and can't verify the correctness of the script. I will put it in a PR some time in the near future when I am able to test on an A100.
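Roughly, the freezing step I have in mind looks like this (a sketch under assumptions: "cc_projection" is a hypothetical name for the camera projection layers, while unfreezing attn2 follows the earlier note that those weights were trained too):

import torch

# unet: the modified MotionCtrl UNet from this PR.
unet.requires_grad_(False)
trainable_params = []
for name, param in unet.named_parameters():
    if "cc_projection" in name or "attn2" in name:
        param.requires_grad = True
        trainable_params.append(param)

optimizer = torch.optim.AdamW(trainable_params, lr=1e-5)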

@jhj7905 commented Mar 28, 2024

@a-r-r-o-w
Hello, I have implemented the training code based on this repo (https://github.com/pixeli99/SVD_Xtend/tree/main), and I have a couple of questions.
First, there is do_classifier_free_guidance in the StableVideoMotionCtrlDiffusionPipeline class. Is it necessary in the training code?
Second, is camera_pose = camera_pose.repeat(2, 1, 1) right?

@a-r-r-o-w (Contributor, Author) commented Mar 28, 2024

@jhj7905

First, there is do_classifier_free_guidance in the StableVideoMotionCtrlDiffusionPipeline class. Is it necessary in the training code?

You can use SVD without classifier-free guidance by setting max_guidance_scale to a value <= 1. You are correct, it should not be necessary in training (or inference).

Second, is camera_pose = camera_pose.repeat(2, 1, 1) right?

Camera pose is a tensor of shape (num_frames, 3, 4). We repeat it because it has to be applied to both the unconditional and conditional latents, which is always the case when max_guidance_scale > 1. But I do see a mistake on my side here: the repeat should only happen when do_classifier_free_guidance is True, because otherwise there is no unconditional latent. I will fix this soon; see the sketch below. Is this causing problems in training when classifier-free guidance is enabled?
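A sketch of the intended fix (assuming the SVD convention where classifier-free guidance is derived from max_guidance_scale):

do_classifier_free_guidance = max_guidance_scale > 1

# camera_pose has shape (num_frames, 3, 4); only duplicate it when an
# unconditional branch actually exists.
if do_classifier_free_guidance:
    camera_pose = camera_pose.repeat(2, 1, 1)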

@jhj7905 commented Apr 3, 2024

@a-r-r-o-w @DN6
Hello, I finished implementing the training code, but the results were quite weird, so I started debugging.
While debugging, I first found that the output video differs when running the original repo (https://github.com/TencentARC/MotionCtrl/tree/svd, https://huggingface.co/TencentARC/MotionCtrl/tree/main) versus this implementation (https://github.com/a-r-r-o-w/diffusers/tree/re-motionctrl, https://huggingface.co/a-r-r-o-w/motionctrl-svd/tree/main) with the same image.

@a-r-r-o-w (Contributor, Author) commented Apr 3, 2024

Can you share an example comparing their output vs. what we have here? I'm on a bit of a vacation and am not carrying my personal laptop, but I can try debugging the difference in the implementation code-wise.

Have you made sure the same seed is used? It's also possible that the order of operations that depend on the random generator is different; a quick check is sketched below.
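Something along these lines should make the comparison deterministic on the diffusers side (the generator argument is standard in diffusers pipelines; pipe, image, and camera_pose are from the earlier usage sketch):

import torch

# Use the same seed in both implementations so the initial latent noise
# (and any other sampled values) matches.
generator = torch.Generator(device="cuda").manual_seed(42)
frames = pipe(image, camera_pose=camera_pose, generator=generator).frames[0]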

@jhj7905 commented Apr 4, 2024

@jhj7905 commented Apr 8, 2024

@a-r-r-o-w Did you solve it?
I am still debugging it.

@a-r-r-o-w (Contributor, Author)

@a-r-r-o-w Did you solve it? I am still debugging it.

Hi. I'm on a bit of a vacation and am not carrying my personal laptop to test things out. Apologies for the delay... If you're able to find the mistake, please feel free to fork my branch and add changes. I should be freer in 2-3 days to figure out the problems.

@jhj7905 commented Apr 8, 2024

@a-r-r-o-w
Oh, I see. Thank you for the reply, and have a good vacation!!
Okay, if I find the mistake, I'll fork the branch!

@T0L0ve commented Apr 10, 2024

@jhj7905
When I run the inference code I get an AttributeError: 'TemporalBasicTransformerBlock' object has no attribute 'time_mix_inner_dim'. Do you know how to solve it?

@a-r-r-o-w (Contributor, Author)

@jhj7905 When I run the inference code I get an AttributeError: 'TemporalBasicTransformerBlock' object has no attribute 'time_mix_inner_dim'. Do you know how to solve it?

Did you install diffusers from my branch? I'm guessing that could be the issue. Try:

pip install git+https://github.com/a-r-r-o-w/diffusers@re-motionctrl

@jhj7905 commented Apr 11, 2024

@jhj7905 When I run the inference code I get an AttributeError: 'TemporalBasicTransformerBlock' object has no attribute 'time_mix_inner_dim'. Do you know how to solve it?

You can solve it by using pip install git+https://github.com/a-r-r-o-w/diffusers@re-motionctrl

@a-r-r-o-w (Contributor, Author) commented Apr 23, 2024

@DN6 @asomoza @jhj7905 Requesting a review. I think this is ready to merge now.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Sep 14, 2024
@yiyixuxu yiyixuxu removed the stale Issues that haven't received updates label Sep 17, 2024
@yiyixuxu (Collaborator)

@a-r-r-o-w do we still want this?

@a-r-r-o-w (Contributor, Author)

Nah, I think it's okay to close it. There are better video generation models now, and this one only got 230 downloads all time and never caught on. I believe Dhruv wanted it initially (there's a community issue open), but I think it's okay to close now.

@a-r-r-o-w a-r-r-o-w closed this Sep 17, 2024
@a-r-r-o-w a-r-r-o-w mentioned this pull request Sep 17, 2024
@2hiTee commented Apr 21, 2025

Nah, I think it's okay to close it. There are better video generation models now, and this one only got 230 downloads all time and never caught on. I believe Dhruv wanted it initially (there's a community issue open), but I think it's okay to close now.

Hi, thanks for your work on integrating MotionCtrl into diffusers. But when I run pip install git+https://github.com/a-r-r-o-w/diffusers@re-motionctrl, I get errors:

[screenshot of the installation error]

Could you give me some advice? Thanks again!
