
Commit 92dbf97

Fixes formatting in SkyReels-V2 documentation
Wraps the visual demonstration section in a Markdown code block. This change corrects the rendering of ASCII diagrams and examples, improving the overall readability of the document.
1 parent 4d72277 commit 92dbf97

File tree

1 file changed: +89 -87 lines changed

docs/source/en/api/pipelines/skyreels_v2.md

Lines changed: 89 additions & 87 deletions
@@ -44,93 +44,95 @@ The following SkyReels-V2 models are supported in Diffusers:
 
 ### A _Visual_ Demonstration
 
-An example with these parameters:
-base_num_frames=97, num_frames=97, num_inference_steps=30, ar_step=5, causal_block_size=5
-
-vae_scale_factor_temporal -> 4
-num_latent_frames: (97-1)//vae_scale_factor_temporal+1 = 25 frames -> 5 blocks of 5 frames each
-
-base_num_latent_frames = (97-1)//vae_scale_factor_temporal+1 = 25 → blocks = 25//5 = 5 blocks
-These 5 blocks mean the maximum context length of the model is 25 frames in the latent space.
-
-Asynchronous Processing Timeline:
-┌─────────────────────────────────────────────────────────────────┐
-│ Steps: 1 6 11 16 21 26 31 36 41 46 50 │
-│ Block 1: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
-│ Block 2: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
-│ Block 3: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
-│ Block 4: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
-│ Block 5: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
-└─────────────────────────────────────────────────────────────────┘
-
-For Long Videos (num_frames > base_num_frames):
-base_num_frames acts as the "sliding window size" for processing long videos.
-
-Example: 257-frame video with base_num_frames=97, overlap_history=17
-┌──── Iteration 1 (frames 1-97) ────┐
-│ Processing window: 97 frames │ → 5 blocks, async processing
-│ Generates: frames 1-97 │
-└───────────────────────────────────┘
-┌────── Iteration 2 (frames 81-177) ──────┐
-│ Processing window: 97 frames │
-│ Overlap: 17 frames (81-97) from prev │ → 5 blocks, async processing
-│ Generates: frames 98-177 │
-└─────────────────────────────────────────┘
-┌────── Iteration 3 (frames 161-257) ──────┐
-│ Processing window: 97 frames │
-│ Overlap: 17 frames (161-177) from prev │ → 5 blocks, async processing
-│ Generates: frames 178-257 │
-└──────────────────────────────────────────┘
-
-Each iteration independently runs the asynchronous processing with its own 5 blocks.
-base_num_frames controls:
-1. Memory usage (larger window = more VRAM)
-2. Model context length (must match training constraints)
-3. Number of blocks per iteration (base_num_latent_frames // causal_block_size)
-
-Each block takes 30 steps to complete denoising.
-Block N starts at step: 1 + (N-1) x ar_step
-Total steps: 30 + (5-1) x 5 = 50 steps
-
-
-Synchronous mode (ar_step=0) would process all blocks/frames simultaneously:
-┌──────────────────────────────────────────────┐
-│ Steps: 1 ... 30 │
-│ All blocks: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
-└──────────────────────────────────────────────┘
-Total steps: 30 steps
-
-
-An example of how the step matrix is constructed for asynchronous processing:
-Given the parameters: (num_inference_steps=30, flow_shift=8, num_frames=97, ar_step=5, causal_block_size=5)
-- num_latent_frames = (97 frames - 1) // (4 temporal downsampling) + 1 = 25
-- step_template = [999, 995, 991, 986, 980, 975, 969, 963, 956, 948,
-941, 932, 922, 912, 901, 888, 874, 859, 841, 822,
-799, 773, 743, 708, 666, 615, 551, 470, 363, 216]
-
-The algorithm creates a 50x25 step_matrix where:
-- Row 1: [999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
-- Row 2: [995, 995, 995, 995, 995, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
-- Row 3: [991, 991, 991, 991, 991, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
-- ...
-- Row 7: [969, 969, 969, 969, 969, 995, 995, 995, 995, 995, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
-- ...
-- Row 21: [799, 799, 799, 799, 799, 888, 888, 888, 888, 888, 941, 941, 941, 941, 941, 975, 975, 975, 975, 975, 999, 999, 999, 999, 999]
-- ...
-- Row 35: [ 0, 0, 0, 0, 0, 216, 216, 216, 216, 216, 666, 666, 666, 666, 666, 822, 822, 822, 822, 822, 901, 901, 901, 901, 901]
-- ...
-- Row 42: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 551, 551, 551, 551, 551, 773, 773, 773, 773, 773]
-- ...
-- Row 50: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 216, 216, 216, 216, 216]
-
-Detailed Row 6 Analysis:
-- step_matrix[5]: [ 975, 975, 975, 975, 975, 999, 999, 999, 999, 999, 999, ..., 999]
-- step_index[5]: [ 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 0, ..., 0]
-- step_update_mask[5]: [True,True,True,True,True,True,True,True,True,True,False, ...,False]
-- valid_interval[5]: (0, 25)
-
-Key Pattern: Block i lags behind Block i-1 by exactly ar_step=5 timesteps, creating the
-staggered "diffusion forcing" effect where later blocks condition on cleaner earlier blocks.
+```
+An example with these parameters:
+base_num_frames=97, num_frames=97, num_inference_steps=30, ar_step=5, causal_block_size=5
+
+vae_scale_factor_temporal -> 4
+num_latent_frames: (97-1)//vae_scale_factor_temporal+1 = 25 frames -> 5 blocks of 5 frames each
+
+base_num_latent_frames = (97-1)//vae_scale_factor_temporal+1 = 25 → blocks = 25//5 = 5 blocks
+These 5 blocks mean the maximum context length of the model is 25 frames in the latent space.
+
+Asynchronous Processing Timeline:
+┌─────────────────────────────────────────────────────────────────┐
+│ Steps: 1 6 11 16 21 26 31 36 41 46 50 │
+│ Block 1: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
+│ Block 2: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
+│ Block 3: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
+│ Block 4: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
+│ Block 5: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
+└─────────────────────────────────────────────────────────────────┘
+
+For Long Videos (num_frames > base_num_frames):
+base_num_frames acts as the "sliding window size" for processing long videos.
+
+Example: 257-frame video with base_num_frames=97, overlap_history=17
+┌──── Iteration 1 (frames 1-97) ────┐
+│ Processing window: 97 frames │ → 5 blocks,
+│ Generates: frames 1-97 │ async processing
+└───────────────────────────────────┘
+┌────── Iteration 2 (frames 81-177) ──────┐
+│ Processing window: 97 frames │
+│ Overlap: 17 frames (81-97) from prev │ → 5 blocks,
+│ Generates: frames 98-177 │ async processing
+└─────────────────────────────────────────┘
+┌────── Iteration 3 (frames 161-257) ──────┐
+│ Processing window: 97 frames │
+│ Overlap: 17 frames (161-177) from prev │ → 5 blocks,
+│ Generates: frames 178-257 │ async processing
+└──────────────────────────────────────────┘
+
+Each iteration independently runs the asynchronous processing with its own 5 blocks.
+base_num_frames controls:
+1. Memory usage (larger window = more VRAM)
+2. Model context length (must match training constraints)
+3. Number of blocks per iteration (base_num_latent_frames // causal_block_size)
+
+Each block takes 30 steps to complete denoising.
+Block N starts at step: 1 + (N-1) x ar_step
+Total steps: 30 + (5-1) x 5 = 50 steps
+
+
+Synchronous mode (ar_step=0) would process all blocks/frames simultaneously:
+┌──────────────────────────────────────────────┐
+│ Steps: 1 ... 30 │
+│ All blocks: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
+└──────────────────────────────────────────────┘
+Total steps: 30 steps
+
+
+An example of how the step matrix is constructed for asynchronous processing:
+Given the parameters: (num_inference_steps=30, flow_shift=8, num_frames=97, ar_step=5, causal_block_size=5)
+- num_latent_frames = (97 frames - 1) // (4 temporal downsampling) + 1 = 25
+- step_template = [999, 995, 991, 986, 980, 975, 969, 963, 956, 948,
+941, 932, 922, 912, 901, 888, 874, 859, 841, 822,
+799, 773, 743, 708, 666, 615, 551, 470, 363, 216]
+
+The algorithm creates a 50x25 step_matrix where:
+- Row 1: [999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
+- Row 2: [995, 995, 995, 995, 995, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
+- Row 3: [991, 991, 991, 991, 991, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
+- ...
+- Row 7: [969, 969, 969, 969, 969, 995, 995, 995, 995, 995, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
+- ...
+- Row 21: [799, 799, 799, 799, 799, 888, 888, 888, 888, 888, 941, 941, 941, 941, 941, 975, 975, 975, 975, 975, 999, 999, 999, 999, 999]
+- ...
+- Row 35: [ 0, 0, 0, 0, 0, 216, 216, 216, 216, 216, 666, 666, 666, 666, 666, 822, 822, 822, 822, 822, 901, 901, 901, 901, 901]
+- ...
+- Row 42: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 551, 551, 551, 551, 551, 773, 773, 773, 773, 773]
+- ...
+- Row 50: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 216, 216, 216, 216, 216]
+
+Detailed Row 6 Analysis:
+- step_matrix[5]: [ 975, 975, 975, 975, 975, 999, 999, 999, 999, 999, 999, ..., 999]
+- step_index[5]: [ 6, 6, 6, 6, 6, 1, 1, 1, 1, 1, 0, ..., 0]
+- step_update_mask[5]: [True,True,True,True,True,True,True,True,True,True,False, ...,False]
+- valid_interval[5]: (0, 25)
+
+Key Pattern: Block i lags behind Block i-1 by exactly ar_step=5 timesteps, creating the
+staggered "diffusion forcing" effect where later blocks condition on cleaner earlier blocks.
+```
 
 ### Text-to-Video Generation
 
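
As a quick sanity check on the demonstration above, the frame-to-block arithmetic can be reproduced with a short sketch. It assumes only the temporal VAE downsampling factor of 4 quoted in the diff; `latent_blocks` is an illustrative helper, not part of the Diffusers API.

```python
# Minimal sketch of the latent-frame / block bookkeeping described in the diff.
# `latent_blocks` is a hypothetical helper for illustration only.
def latent_blocks(num_frames: int, causal_block_size: int, vae_scale_factor_temporal: int = 4):
    num_latent_frames = (num_frames - 1) // vae_scale_factor_temporal + 1
    num_blocks = num_latent_frames // causal_block_size
    return num_latent_frames, num_blocks

print(latent_blocks(97, 5))  # (25, 5): 25 latent frames split into 5 causal blocks of 5
```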
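
The staggered timeline follows from the formula quoted above: block N starts at step `1 + (N-1) x ar_step`, and the run finishes after `num_inference_steps + (num_blocks - 1) x ar_step` steps. A minimal sketch under the same parameters; `block_schedule` is a hypothetical helper.

```python
# Sketch of the asynchronous block schedule from the demonstration.
# With ar_step=0 the start offsets collapse to step 1 and the total becomes
# num_inference_steps, i.e. the synchronous mode shown in the diff.
def block_schedule(num_blocks: int, num_inference_steps: int, ar_step: int):
    starts = [1 + (n - 1) * ar_step for n in range(1, num_blocks + 1)]
    total_steps = num_inference_steps + (num_blocks - 1) * ar_step
    return starts, total_steps

print(block_schedule(5, 30, 5))  # ([1, 6, 11, 16, 21], 50) -> asynchronous
print(block_schedule(5, 30, 0))  # ([1, 1, 1, 1, 1], 30)    -> synchronous
```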

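
The long-video windows (frames 1-97, 81-177, 161-257) can likewise be derived from `base_num_frames` and `overlap_history` alone. The sketch below uses 1-based frame indices to match the diagram and is illustrative only, not the pipeline's internal logic.

```python
# Sketch of the sliding-window planning for long videos: each new window re-uses
# `overlap_history` frames from the previous one and generates the remainder.
def sliding_windows(num_frames: int, base_num_frames: int, overlap_history: int):
    windows, start = [], 1
    while True:
        end = min(start + base_num_frames - 1, num_frames)
        windows.append((start, end))
        if end >= num_frames:
            break
        start = end - overlap_history + 1
    return windows

print(sliding_windows(257, 97, 17))  # [(1, 97), (81, 177), (161, 257)]
```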
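
Finally, the 50x25 step matrix pattern, in which block i lags block i-1 by `ar_step` timesteps, can be reproduced with a simplified construction. The rows it prints match the ones quoted in the diff, but `build_step_matrix` is a hypothetical sketch, not the exact SkyReels-V2 scheduler code.

```python
import numpy as np

# Simplified sketch of the staggered "diffusion forcing" step matrix.
# Block b starts ar_step rows after block b-1; before it starts it stays at the
# first (noisiest) timestep, and after its 30 steps it is marked fully denoised (0).
def build_step_matrix(step_template, num_blocks, causal_block_size, ar_step):
    num_steps = len(step_template)                       # 30 denoising steps
    total_rows = num_steps + (num_blocks - 1) * ar_step  # 30 + 4*5 = 50 rows
    matrix = np.zeros((total_rows, num_blocks * causal_block_size), dtype=int)
    for row in range(total_rows):
        for b in range(num_blocks):
            local = row - b * ar_step                    # this block's progress at `row`
            if local < 0:
                value = step_template[0]                 # not started yet: still at 999
            elif local < num_steps:
                value = step_template[local]
            else:
                value = 0                                # finished denoising
            matrix[row, b * causal_block_size:(b + 1) * causal_block_size] = value
    return matrix

template = [999, 995, 991, 986, 980, 975, 969, 963, 956, 948,
            941, 932, 922, 912, 901, 888, 874, 859, 841, 822,
            799, 773, 743, 708, 666, 615, 551, 470, 363, 216]
m = build_step_matrix(template, num_blocks=5, causal_block_size=5, ar_step=5)
print(m.shape)  # (50, 25)
print(m[5])     # row 6:  [975 x5, 999 x20]
print(m[20])    # row 21: [799 x5, 888 x5, 941 x5, 975 x5, 999 x5]
print(m[34])    # row 35: [0 x5, 216 x5, 666 x5, 822 x5, 901 x5]
```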