Commit 3be7c96

[docs] Stable video diffusion (#6472)
svd
1 parent 3c79dd9 commit 3be7c96

1 file changed (+33, -44 lines) in docs/source/en/using-diffusers

docs/source/en/using-diffusers/svd.md

Lines changed: 33 additions & 44 deletions
@@ -14,23 +14,19 @@ specific language governing permissions and limitations under the License.
1414

1515
[[open-in-colab]]
1616

17-
[Stable Video Diffusion](https://static1.squarespace.com/static/6213c340453c3f502425776e/t/655ce779b9d47d342a93c890/1700587395994/stable_video_diffusion.pdf) is a powerful image-to-video generation model that can generate high resolution (576x1024) 2-4 second videos conditioned on the input image.
17+
[Stable Video Diffusion (SVD)](https://huggingface.co/papers/2311.15127) is a powerful image-to-video generation model that can generate 2-4 second high resolution (576x1024) videos conditioned on an input image.
1818

19-
This guide will show you how to use SVD to short generate videos from images.
19+
This guide will show you how to use SVD to generate short videos from images.
2020

2121
Before you begin, make sure you have the following libraries installed:
2222

2323
```py
2424
!pip install -q -U diffusers transformers accelerate
2525
```
2626

27-
## Image to Video Generation
27+
There are two variants of this model, [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt). The SVD checkpoint is trained to generate 14 frames and the SVD-XT checkpoint is further finetuned to generate 25 frames.
2828

29-
The are two variants of SVD. [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid)
30-
and [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt). The svd checkpoint is trained to generate 14 frames and the svd-xt checkpoint is further
31-
finetuned to generate 25 frames.
32-
33-
We will use the `svd-xt` checkpoint for this guide.
29+
You'll use the SVD-XT checkpoint for this guide.
3430

3531
```python
3632
import torch
@@ -53,60 +49,54 @@ frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
5349
export_to_video(frames, "generated.mp4", fps=7)
5450
```
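As a rough check on the "2-4 second" figure, the sketch below divides each checkpoint's frame count by the export frame rate. The frame counts (14 and 25) and `fps=7` come from this guide; the helper name `clip_duration_seconds` is hypothetical and is not part of diffusers.

```python
# Frame counts are taken from the checkpoint descriptions in this guide;
# 7 fps is the value passed to export_to_video above.
FRAMES_PER_CHECKPOINT = {"SVD": 14, "SVD-XT": 25}


def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Return the playback length of a clip with `num_frames` frames at `fps`."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


for name, num_frames in FRAMES_PER_CHECKPOINT.items():
    duration = clip_duration_seconds(num_frames, 7)
    print(f"{name}: {num_frames} frames at 7 fps = {duration:.2f} s")
```

At 7 fps this gives about 2 seconds for SVD and about 3.6 seconds for SVD-XT. Exporting at a different `fps` changes the playback length but not the number of generated frames.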
5551

56-
| **Source Image** | **Video** |
57-
|:------------:|:-----:|
58-
| ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png) | ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif) |
59-
60-
61-
<Tip>
62-
Since generating videos is more memory intensive we can use the `decode_chunk_size` argument to control how many frames are decoded at once. This will reduce the memory usage. It's recommended to tweak this value based on your GPU memory.
63-
Setting `decode_chunk_size=1` will decode one frame at a time and will use the least amount of memory but the video might have some flickering.
64-
65-
Additionally, we also use [model cpu offloading](../../optimization/memory#model-offloading) to reduce the memory usage.
66-
</Tip>
67-
52+
<div class="flex gap-4">
53+
<div>
54+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"/>
55+
<figcaption class="mt-2 text-center text-sm text-gray-500">"source image of a rocket"</figcaption>
56+
</div>
57+
<div>
58+
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif"/>
59+
<figcaption class="mt-2 text-center text-sm text-gray-500">"generated video from source image"</figcaption>
60+
</div>
61+
</div>
6862

69-
### Torch.compile
63+
## torch.compile
7064

71-
You can achieve a 20-25% speed-up at the expense of slightly increased memory by compiling the UNet as follows:
65+
You can gain a 20-25% speedup at the expense of slightly increased memory by [compiling](../optimization/torch2.0#torchcompile) the UNet.
7266

7367
```diff
7468
- pipe.enable_model_cpu_offload()
7569
+ pipe.to("cuda")
7670
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
7771
```
7872

79-
### Low-memory
73+
## Reduce memory usage
8074

81-
Video generation is very memory intensive as we have to essentially generate `num_frames` all at once. The mechanism is very comparable to text-to-image generation with a high batch size. To reduce the memory requirement you have multiple options. The following options trade inference speed against lower memory requirement:
82-
- enable model offloading: Each component of the pipeline is offloaded to CPU once it's not needed anymore.
83-
- enable feed-forward chunking: The feed-forward layer runs in a loop instead of running with a single huge feed-forward batch size
84-
- reduce `decode_chunk_size`: This means that the VAE decodes frames in chunks instead of decoding them all together. **Note that**, in addition to leading to a small slowdown, this method also slightly leads to video quality deterioration.
75+
Video generation is very memory intensive because you're essentially generating `num_frames` all at once, similar to text-to-image generation with a high batch size. To reduce the memory requirement, there are multiple options that trade off inference speed for lower memory usage:
8576

86-
You can enable them as follows:
77+
- enable model offloading: each component of the pipeline is offloaded to the CPU once it's not needed anymore.
78+
- enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward with a huge batch size.
79+
- reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all together. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least amount of memory, but the video might have some flickering. We recommend adjusting this value based on your GPU memory.
8780

8881
```diff
89-
-pipe.enable_model_cpu_offload()
90-
-frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
91-
+pipe.enable_model_cpu_offload()
92-
+pipe.unet.enable_forward_chunking()
93-
+frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
82+
- pipe.enable_model_cpu_offload()
83+
- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
84+
+ pipe.enable_model_cpu_offload()
85+
+ pipe.unet.enable_forward_chunking()
86+
+ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
9487
```
9588

89+
Using all these tricks together should lower the memory requirement to less than 8GB VRAM.
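To make the `decode_chunk_size` trade-off concrete, here is a minimal sketch of how a chunk size splits 25 frames into decode passes. It only illustrates the batching idea in plain Python; `decode_chunks` is a hypothetical helper, not the pipeline's actual implementation.

```python
from typing import List


def decode_chunks(num_frames: int, decode_chunk_size: int) -> List[range]:
    """Split frame indices into consecutive chunks of at most `decode_chunk_size`."""
    if num_frames <= 0 or decode_chunk_size <= 0:
        raise ValueError("num_frames and decode_chunk_size must be positive")
    return [
        range(start, min(start + decode_chunk_size, num_frames))
        for start in range(0, num_frames, decode_chunk_size)
    ]


for size in (8, 2, 1):
    chunks = decode_chunks(25, size)
    largest = max(len(chunk) for chunk in chunks)
    print(f"decode_chunk_size={size}: {len(chunks)} passes, up to {largest} frames per pass")
```

Smaller chunks mean fewer frames are held in the decoder at once, at the cost of more passes: 25 frames need 4 passes at a chunk size of 8 and 13 passes at a chunk size of 2.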
9690

97-
Including all these tricks should lower the memory requirement to less than 8GB VRAM.
98-
99-
### Micro-conditioning
91+
## Micro-conditioning
10092

101-
Along with conditioning image Stable Diffusion Video also allows providing micro-conditioning that allows more control over the generated video.
102-
It accepts the following arguments:
93+
Stable Video Diffusion also accepts micro-conditioning, in addition to the conditioning image, which allows more control over the generated video:
10394

104-
- `fps`: The frames per second of the generated video.
105-
- `motion_bucket_id`: The motion bucket id to use for the generated video. This can be used to control the motion of the generated video. Increasing the motion bucket id will increase the motion of the generated video.
106-
- `noise_aug_strength`: The amount of noise added to the conditioning image. The higher the values the less the video will resemble the conditioning image. Increasing this value will also increase the motion of the generated video.
107-
108-
Here is an example of using micro-conditioning to generate a video with more motion.
95+
- `fps`: the frames per second of the generated video.
96+
- `motion_bucket_id`: the motion bucket id to use for the generated video. This can be used to control the motion of the generated video. Increasing the motion bucket id increases the motion of the generated video.
97+
- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video.
10998

99+
For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters:
110100

111101
```python
112102
import torch
@@ -129,4 +119,3 @@ export_to_video(frames, "generated.mp4", fps=7)
129119
```
130120

131121
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket_with_conditions.gif)
132-
