
Commit 6b227a0

Update diffusion forcing in README.md
1 parent 4dedfce commit 6b227a0


README.md

Lines changed: 26 additions & 5 deletions
@@ -174,10 +174,11 @@ After downloading, set the model path in your generation commands:
 
 #### Single GPU Inference
 
-- **Diffusion Forcing**
+- **Diffusion Forcing for Long Video Generation**
 
-The <a href="https://arxiv.org/abs/2407.01392">**Diffusion Forcing**</a> version model allows us to generate Infinite-Length videos. This model supports both **text-to-video (T2V)** and **image-to-video (I2V)** tasks, and it can perform inference in both synchronous and asynchronous modes.
+The <a href="https://arxiv.org/abs/2407.01392">**Diffusion Forcing**</a> version model allows us to generate infinite-length videos. It supports both **text-to-video (T2V)** and **image-to-video (I2V)** tasks, and it can perform inference in both synchronous and asynchronous modes. Here we demonstrate two example scripts for long video generation. If you want to adjust the inference parameters (e.g., video duration or inference mode), read the Note below first.
 
+Synchronous generation for a 10s video:
 ```shell
 model_id=Skywork/SkyReels-V2-DF-14B-540P
 # synchronous inference
@@ -192,10 +193,30 @@ python3 generate_video_df.py \
 --addnoise_condition 20 \
 --offload
 ```
+
+Asynchronous generation for a 30s video:
+```shell
+model_id=Skywork/SkyReels-V2-DF-14B-540P
+# asynchronous inference
+python3 generate_video_df.py \
+--model_id ${model_id} \
+--resolution 540P \
+--ar_step 5 \
+--causal_block_size 5 \
+--base_num_frames 97 \
+--num_frames 737 \
+--overlap_history 17 \
+--prompt "A graceful white swan with a curved neck and delicate feathers swimming in a serene lake at dawn, its reflection perfectly mirrored in the still water as mist rises from the surface, with the swan occasionally dipping its head into the water to feed." \
+--addnoise_condition 20 \
+--offload
+```
+
 > **Note**:
-> - If you want to run the **image-to-video (I2V)** task, add `--image ${image_path}` to your command and it is also better to use **text-to-video (T2V)** prompt including the description of the first-frame image.
-> - You can use `--ar_step 5` to enable asynchronous inference. When asynchronous inference, `--causal_block_size 5` is recommanded.
-> - To reduce peak VRAM, lower the `--base_num_frames` for the same generative length `--num_frames`. This may slightly reduce video quality.
+> - If you want to run the **image-to-video (I2V)** task, add `--image ${image_path}` to your command, and it is best to use a **text-to-video (T2V)** style prompt that includes a description of the first-frame image (see the sketch below).
+> - For long video generation, you can simply adjust `--num_frames`, e.g., `--num_frames 257` for a 10s video, `--num_frames 377` for 15s, `--num_frames 737` for 30s, and `--num_frames 1457` for 60s. These values are not strictly aligned with the logical frame count for a given duration, but they are aligned with the training parameters, so they may perform better.
+> - You can use `--ar_step 5` to enable asynchronous inference; in that case `--causal_block_size 5` is recommended, while it should not be set for synchronous generation. Remember that the base frame latent count (e.g., (97-1)//4+1=25 for base_num_frames=97) must be divisible by causal_block_size (see the check below). Asynchronous inference takes more steps to diffuse the whole sequence, so it is slower than synchronous mode.
+> - To reduce peak VRAM, simply lower `--base_num_frames`, e.g., to 77 or 57, while keeping the same generative length `--num_frames`. This may slightly reduce video quality.
+> - `--addnoise_condition` helps smooth long video generation by adding some noise to the clean condition. Too much noise can also cause inconsistency; 20 is a recommended value, and you may try larger values, but it is best not to exceed 50.
 > - Generating a 540P video using the 1.3B model requires approximately 14.7GB peak VRAM, while the same resolution video using the 14B model demands around 51.2GB peak VRAM.
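To make the I2V note concrete, here is a minimal sketch of an I2V invocation, assuming the same flag set as the examples in this diff; the image path and prompt are illustrative placeholders, not part of the commit:

```shell
model_id=Skywork/SkyReels-V2-DF-14B-540P
# image-to-video: pass a first-frame image (placeholder path) and describe it in the prompt
python3 generate_video_df.py \
  --model_id ${model_id} \
  --resolution 540P \
  --base_num_frames 97 \
  --num_frames 257 \
  --overlap_history 17 \
  --image ./first_frame.jpg \
  --prompt "A graceful white swan swimming in a serene lake at dawn, mist rising from the still water." \
  --addnoise_condition 20 \
  --offload
```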
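And a quick standalone arithmetic check of the latent-count constraint from the asynchronous-inference note; the 24 fps + overlap relation in the last lines is a pattern inferred from the preset values, not something the commit states:

```shell
# the base frame latent count must be divisible by causal_block_size
base_num_frames=97
causal_block_size=5
latent_frames=$(( (base_num_frames - 1) / 4 + 1 ))  # (97-1)//4+1 = 25
if (( latent_frames % causal_block_size == 0 )); then
  echo "OK: ${latent_frames} latent frames, divisible by ${causal_block_size}"
else
  echo "adjust base_num_frames: ${latent_frames} is not divisible by ${causal_block_size}"
fi
# the duration presets appear to follow duration * 24 fps + 17 overlap frames:
echo $(( 10 * 24 + 17 ))  # 257
echo $(( 30 * 24 + 17 ))  # 737
```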
201222
 - **Text To Video & Image To Video**
