The following SkyReels-V2 models are supported in Diffusers:

### A _Visual_ Demonstration

```
An example with these parameters:
base_num_frames=97, num_frames=97, num_inference_steps=30, ar_step=5, causal_block_size=5

vae_scale_factor_temporal -> 4
num_latent_frames: (97-1)//vae_scale_factor_temporal+1 = 25 frames -> 5 blocks of 5 frames each

base_num_latent_frames = (97-1)//vae_scale_factor_temporal+1 = 25 → blocks = 25//5 = 5 blocks
These 5 blocks mean the model's maximum context length is 25 frames in the latent space.

Asynchronous Processing Timeline:
┌───────────────────────────────────────────────────────────────┐
│ Steps:   1    6    11   16   21   26   31   36   41   46  50  │
│ Block 1: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]                     │
│ Block 2:      [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]                │
│ Block 3:           [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]           │
│ Block 4:                [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■]      │
│ Block 5:                     [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
└───────────────────────────────────────────────────────────────┘

For Long Videos (num_frames > base_num_frames):
base_num_frames acts as the "sliding window size" for processing long videos.

Example: 257-frame video with base_num_frames=97, overlap_history=17
┌──── Iteration 1 (frames 1-97) ────┐
│ Processing window: 97 frames      │ → 5 blocks,
│ Generates: frames 1-97            │   async processing
└───────────────────────────────────┘
┌────── Iteration 2 (frames 81-177) ──────┐
│ Processing window: 97 frames            │
│ Overlap: 17 frames (81-97) from prev    │ → 5 blocks,
│ Generates: frames 98-177                │   async processing
└─────────────────────────────────────────┘
┌────── Iteration 3 (frames 161-257) ──────┐
│ Processing window: 97 frames             │
│ Overlap: 17 frames (161-177) from prev   │ → 5 blocks,
│ Generates: frames 178-257                │   async processing
└──────────────────────────────────────────┘

Each iteration independently runs the asynchronous processing with its own 5 blocks.
base_num_frames controls:
1. Memory usage (larger window = more VRAM)
2. Model context length (must match training constraints)
3. Number of blocks per iteration (base_num_latent_frames // causal_block_size)

Each block takes 30 steps to complete denoising.
Block N starts at step: 1 + (N-1) x ar_step
Total steps: 30 + (5-1) x 5 = 50 steps


Synchronous mode (ar_step=0) would process all blocks/frames simultaneously:
┌──────────────────────────────────────────────┐
│ Steps:      1             ...             30 │
│ All blocks: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
└──────────────────────────────────────────────┘
Total steps: 30 steps


An example of how the step matrix is constructed for asynchronous processing:
Given the parameters: (num_inference_steps=30, flow_shift=8, num_frames=97, ar_step=5, causal_block_size=5)
- num_latent_frames = (97 frames - 1) // (4 temporal downsampling) + 1 = 25
- step_template = [999, 995, 991, 986, 980, 975, 969, 963, 956, 948,
                   941, 932, 922, 912, 901, 888, 874, 859, 841, 822,
                   799, 773, 743, 708, 666, 615, 551, 470, 363, 216]

The algorithm creates a 50x25 step_matrix where:
- Row 1:  [999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
- Row 2:  [995, 995, 995, 995, 995, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
- Row 3:  [991, 991, 991, 991, 991, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
- ...
- Row 7:  [969, 969, 969, 969, 969, 995, 995, 995, 995, 995, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999, 999]
- ...
- Row 21: [799, 799, 799, 799, 799, 888, 888, 888, 888, 888, 941, 941, 941, 941, 941, 975, 975, 975, 975, 975, 999, 999, 999, 999, 999]
- ...
- Row 35: [  0,   0,   0,   0,   0, 216, 216, 216, 216, 216, 666, 666, 666, 666, 666, 822, 822, 822, 822, 822, 901, 901, 901, 901, 901]
- ...
- Row 42: [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 551, 551, 551, 551, 551, 773, 773, 773, 773, 773]
- ...
- Row 50: [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 216, 216, 216, 216, 216]

Detailed Row 6 Analysis:
- step_matrix[5]:       [ 975,  975,  975,  975,  975,  999,  999,  999,  999,  999,   999, ...,   999]
- step_index[5]:        [   6,    6,    6,    6,    6,    1,    1,    1,    1,    1,     0, ...,     0]
- step_update_mask[5]:  [True, True, True, True, True, True, True, True, True, True, False, ..., False]
- valid_interval[5]:    (0, 25)

Key Pattern: Block i lags behind Block i-1 by exactly ar_step=5 timesteps, creating the
staggered "diffusion forcing" effect where later blocks condition on cleaner earlier blocks.
```
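The staggered schedule above can be reproduced with a few lines of Python. The helper below is a simplified sketch, not the actual Diffusers implementation: it assumes each block's schedule is its predecessor's delayed by `ar_step` rows, that frames in blocks that have not started yet stay at the maximum timestep (999), and that fully denoised blocks are set to 0. The name `build_step_matrix` is hypothetical.

```python
# Hypothetical sketch of diffusion-forcing step-matrix construction.
# Assumption: block b's denoising schedule is block b-1's, delayed by ar_step rows.

def build_step_matrix(step_template, num_blocks, causal_block_size, ar_step, max_timestep=999):
    num_inference_steps = len(step_template)
    total_steps = num_inference_steps + (num_blocks - 1) * ar_step
    matrix = []
    for row in range(total_steps):
        frame_timesteps = []
        for block in range(num_blocks):
            local_step = row - block * ar_step  # this block's position in its own schedule
            if local_step < 0:
                t = max_timestep                # block not started yet: still pure noise
            elif local_step >= num_inference_steps:
                t = 0                           # block fully denoised
            else:
                t = step_template[local_step]
            # each block covers causal_block_size latent frames
            frame_timesteps.extend([t] * causal_block_size)
        matrix.append(frame_timesteps)
    return matrix
```

With the walkthrough's parameters (5 blocks of 5 latent frames, `ar_step=5`, the 30-entry `step_template` above) this produces a 50×25 matrix whose rows match the ones listed, and with `ar_step=0` it collapses to the synchronous schedule: only 30 rows, each uniform across all 25 frames.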

### Text-to-Video Generation
