Commit e82d7b6

committed
docs: enhance parameter examples and formatting in skyreels_v2.md
1 parent 46f4f22 commit e82d7b6

1 file changed

+41
-23
lines changed

docs/source/en/api/pipelines/skyreels_v2.md

Lines changed: 41 additions & 23 deletions
Original file line number | Diff line number | Diff line change
@@ -44,17 +44,24 @@ The following SkyReels-V2 models are supported in Diffusers:
4444
4545
### A _Visual_ Demonstration
4646

47-
```text
48-
An example with these parameters:
49-
base_num_frames=97, num_frames=97, num_inference_steps=30, ar_step=5, causal_block_size=5
47+
The example below has the following parameters:
48+
49+
- `base_num_frames=97`
50+
- `num_frames=97`
51+
- `num_inference_steps=30`
52+
- `ar_step=5`
53+
- `causal_block_size=5`
54+
55+
With `vae_scale_factor_temporal=4`, expect `5` blocks of `5` frames each as calculated by:
5056

51-
vae_scale_factor_temporal -> 4
52-
num_latent_frames: (97-1)//vae_scale_factor_temporal+1 = 25 frames -> 5 blocks of 5 frames each
57+
`num_latent_frames: (97-1)//vae_scale_factor_temporal+1 = 25 frames -> 5 blocks of 5 frames each`
5358

54-
base_num_latent_frames = (97-1)//vae_scale_factor_temporal+1 = 25 → blocks = 25//5 = 5 blocks
55-
This 5 blocks means the maximum context length of the model is 25 frames in the latent space.
59+
And the maximum context length in the latent space is calculated with `base_num_latent_frames`:
60+
61+
`base_num_latent_frames = (97-1)//vae_scale_factor_temporal+1 = 25 -> 25//5 = 5 blocks`
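
The arithmetic above can be checked with a short Python sketch. The helper name `latent_layout` is hypothetical and not part of Diffusers; it only mirrors the two formulas quoted in this section, with `vae_scale_factor_temporal=4` as stated.

```python
def latent_layout(num_frames, causal_block_size, vae_scale_factor_temporal=4):
    """Return (num_latent_frames, num_blocks) for a clip of num_frames frames.

    Hypothetical helper for illustration only; it mirrors the formulas
    quoted in the documentation text.
    """
    # (num_frames - 1) // vae_scale_factor_temporal + 1
    num_latent_frames = (num_frames - 1) // vae_scale_factor_temporal + 1
    # Each block groups causal_block_size latent frames.
    num_blocks = num_latent_frames // causal_block_size
    return num_latent_frames, num_blocks


print(latent_layout(num_frames=97, causal_block_size=5))  # (25, 5)
```

With `num_frames=97` and `causal_block_size=5` this gives 25 latent frames in 5 blocks, matching the figures above.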
5662

5763
Asynchronous Processing Timeline:
64+
```text
5865
┌─────────────────────────────────────────────────────────────────┐
5966
│ Steps: 1 6 11 16 21 26 31 36 41 46 50 │
6067
│ Block 1: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
@@ -63,11 +70,13 @@ Asynchronous Processing Timeline:
6370
│ Block 4: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
6471
│ Block 5: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
6572
└─────────────────────────────────────────────────────────────────┘
73+
```
6674

67-
For Long Videos (num_frames > base_num_frames):
68-
base_num_frames acts as the "sliding window size" for processing long videos.
75+
For Long Videos (`num_frames` > `base_num_frames`):
76+
`base_num_frames` acts as the "sliding window size" for processing long videos.
6977

70-
Example: 257-frame video with base_num_frames=97, overlap_history=17
78+
Example: 257-frame video with `base_num_frames=97`, `overlap_history=17`
79+
```text
7180
┌──── Iteration 1 (frames 1-97) ────┐
7281
│ Processing window: 97 frames │ → 5 blocks,
7382
│ Generates: frames 1-97 │ async processing
@@ -82,34 +91,40 @@ Example: 257-frame video with base_num_frames=97, overlap_history=17
8291
│ Overlap: 17 frames (161-177) from prev │ → 5 blocks,
8392
│ Generates: frames 178-257 │ async processing
8493
└──────────────────────────────────────────┘
94+
```
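
The sliding-window layout in the diagram can be sketched in a few lines of Python. This is a simplified illustration in pixel-frame units; the function name `sliding_windows` is hypothetical, and the sketch assumes each later window starts `overlap_history` frames before the end of the previous one. The actual pipeline may do this bookkeeping in latent frames.

```python
def sliding_windows(num_frames, base_num_frames, overlap_history):
    """Return 1-indexed, inclusive (first_frame, last_frame) windows.

    Hypothetical helper: assumes every window after the first reuses the
    last overlap_history frames of the previous window as context.
    """
    if overlap_history >= base_num_frames:
        raise ValueError("overlap_history must be smaller than base_num_frames")
    windows = [(1, min(base_num_frames, num_frames))]
    while windows[-1][1] < num_frames:
        # Start overlap_history frames before the previous window ends.
        start = windows[-1][1] - overlap_history + 1
        end = min(start + base_num_frames - 1, num_frames)
        windows.append((start, end))
    return windows


for first, last in sliding_windows(257, base_num_frames=97, overlap_history=17):
    print(f"frames {first}-{last}")
```

For the 257-frame example, the last window is frames 161-257: it overlaps frames 161-177 and generates frames 178-257, as in the diagram.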
8595

86-
Each iteration independently runs the asynchronous processing with its own 5 blocks.
87-
base_num_frames controls:
96+
Each iteration independently runs the asynchronous processing with its own `5` blocks.
97+
`base_num_frames` controls:
8898
1. Memory usage (larger window = more VRAM)
8999
2. Model context length (must match training constraints)
90-
3. Number of blocks per iteration (base_num_latent_frames // causal_block_size)
100+
3. Number of blocks per iteration (`base_num_latent_frames // causal_block_size`)
91101

92-
Each block takes 30 steps to complete denoising.
93-
Block N starts at step: 1 + (N-1) x ar_step
94-
Total steps: 30 + (5-1) x 5 = 50 steps
102+
Each block takes `30` steps to complete denoising.
103+
Block N starts at step: `1 + (N-1) x ar_step`
104+
Total steps: `30 + (5-1) x 5 = 50` steps
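
The two formulas above are easy to verify in Python. The function name `async_schedule` is hypothetical; the sketch simply evaluates `1 + (N-1) x ar_step` for each block and the total step count.

```python
def async_schedule(num_inference_steps, num_blocks, ar_step):
    """Return (start_steps, total_steps) for the staggered schedule.

    Hypothetical helper for illustration: start_steps[n] is the 1-indexed
    step at which block n + 1 begins denoising.
    """
    start_steps = [1 + n * ar_step for n in range(num_blocks)]
    # The last block starts (num_blocks - 1) * ar_step steps late and then
    # needs num_inference_steps steps of its own.
    total_steps = num_inference_steps + (num_blocks - 1) * ar_step
    return start_steps, total_steps


starts, total = async_schedule(num_inference_steps=30, num_blocks=5, ar_step=5)
print(starts)  # [1, 6, 11, 16, 21]
print(total)   # 50
```

Setting `ar_step=0` in the same formula makes every block start at step 1 and gives 30 total steps.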
95105

96106

97-
Synchronous mode (ar_step=0) would process all blocks/frames simultaneously:
107+
Synchronous mode (`ar_step=0`) would process all blocks/frames simultaneously:
108+
```text
98109
┌──────────────────────────────────────────────┐
99110
│ Steps: 1 ... 30 │
100111
│ All blocks: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
101112
└──────────────────────────────────────────────┘
102-
Total steps: 30 steps
113+
```
114+
Total steps: `30` steps
103115

104116

105117
An example of how the step matrix is constructed for asynchronous processing:
106-
Given the parameters: (num_inference_steps=30, flow_shift=8, num_frames=97, ar_step=5, causal_block_size=5)
118+
Given the parameters: (`num_inference_steps=30, flow_shift=8, num_frames=97, ar_step=5, causal_block_size=5`)
119+
```
107120
- num_latent_frames = (97 frames - 1) // (4 temporal downsampling) + 1 = 25
108121
- step_template = [999, 995, 991, 986, 980, 975, 969, 963, 956, 948,
109122
941, 932, 922, 912, 901, 888, 874, 859, 841, 822,
110123
799, 773, 743, 708, 666, 615, 551, 470, 363, 216]
124+
```
111125

112-
The algorithm creates a 50x25 step_matrix where:
126+
The algorithm creates a `50x25` `step_matrix` where:
127+
```
113128
- Row 1: [999×5, 999×5, 999×5, 999×5, 999×5]
114129
- Row 2: [995×5, 999×5, 999×5, 999×5, 999×5]
115130
- Row 3: [991×5, 999×5, 999×5, 999×5, 999×5]
@@ -123,16 +138,19 @@ The algorithm creates a 50x25 step_matrix where:
123138
- Row 42: [ 0×5, 0×5, 0×5, 551×5, 773×5]
124139
- ...
125140
- Row 50: [ 0×5, 0×5, 0×5, 0×5, 216×5]
141+
```
126142

127-
Detailed Row 6 Analysis:
143+
Detailed Row `6` Analysis:
144+
```
128145
- step_matrix[5]: [ 975×5, 999×5, 999×5, 999×5, 999×5]
129146
- step_index[5]: [ 6×5, 1×5, 0×5, 0×5, 0×5]
130147
- step_update_mask[5]: [True×5, True×5, False×5, False×5, False×5]
131148
- valid_interval[5]: (0, 25)
149+
```
132150

133-
Key Pattern: Block i lags behind Block i-1 by exactly ar_step=5 timesteps, creating the
151+
Key Pattern: Block `i` lags behind Block `i-1` by exactly `ar_step=5` timesteps, creating the
134152
staggered "diffusion forcing" effect where later blocks condition on cleaner earlier blocks.
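
The rows listed above can be reproduced with a simplified pure-Python sketch. It is not the pipeline's implementation: the function name `build_step_matrix` is hypothetical, it keeps one column per block rather than per latent frame, and it assumes the step template is padded with `999` ("not started") in front and `0` ("finished") at the end.

```python
STEP_TEMPLATE = [
    999, 995, 991, 986, 980, 975, 969, 963, 956, 948,
    941, 932, 922, 912, 901, 888, 874, 859, 841, 822,
    799, 773, 743, 708, 666, 615, 551, 470, 363, 216,
]


def build_step_matrix(step_template, num_blocks, ar_step):
    """Return (step_matrix, step_index, update_mask), one column per block."""
    num_steps = len(step_template)
    # Assumed padding: index 0 = not started (999), index num_steps + 1 = done (0).
    padded = [999] + list(step_template) + [0]
    total_rows = num_steps + (num_blocks - 1) * ar_step
    step_matrix, step_index, update_mask = [], [], []
    prev = [0] * num_blocks
    for row in range(1, total_rows + 1):
        # Block b lags block 0 by b * ar_step rows.
        idx = [min(max(row - b * ar_step, 0), num_steps + 1) for b in range(num_blocks)]
        step_index.append(idx)
        step_matrix.append([padded[i] for i in idx])
        # A block is updated only if it advanced and is not already finished.
        update_mask.append([i != p and i != num_steps + 1 for i, p in zip(idx, prev)])
        prev = idx
    return step_matrix, step_index, update_mask


matrix, index, mask = build_step_matrix(STEP_TEMPLATE, num_blocks=5, ar_step=5)
print(len(matrix))  # 50
print(matrix[5])    # [975, 999, 999, 999, 999]
print(index[5])     # [6, 1, 0, 0, 0]
print(mask[5])      # [True, True, False, False, False]
```

Repeating each column `causal_block_size=5` times gives the `50x25` layout described above.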
135-
```
153+
136154

137155
### Text-to-Video Generation
138156
