### A _Visual_ Demonstration

The example below has the following parameters:

- `base_num_frames=97`
- `num_frames=97`
- `num_inference_steps=30`
- `ar_step=5`
- `causal_block_size=5`

With `vae_scale_factor_temporal=4`, expect `5` blocks of `5` frames each, as calculated by:

`num_latent_frames: (97-1)//vae_scale_factor_temporal+1 = 25 frames -> 5 blocks of 5 frames each`

And the maximum context length in the latent space is calculated with `base_num_latent_frames`:

`base_num_latent_frames = (97-1)//vae_scale_factor_temporal+1 = 25 -> 25//5 = 5 blocks`

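These quantities can be reproduced with a few lines of plain Python arithmetic (a standalone sanity check, not pipeline code):

```python
# Sanity check of the block arithmetic above (plain Python, not pipeline code).
vae_scale_factor_temporal = 4
num_frames = 97        # equal to base_num_frames in this example
causal_block_size = 5

num_latent_frames = (num_frames - 1) // vae_scale_factor_temporal + 1
num_blocks = num_latent_frames // causal_block_size

print(num_latent_frames, num_blocks)  # 25 5
```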
Asynchronous Processing Timeline:
```text
┌─────────────────────────────────────────────────────────────────┐
│ Steps: 1 6 11 16 21 26 31 36 41 46 50 │
│ Block 1: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
│ Block 2: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
│ Block 3: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
│ Block 4: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
│ Block 5: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
└─────────────────────────────────────────────────────────────────┘
```
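The staggering in this timeline comes from `ar_step`: each block begins denoising `ar_step` steps after the previous one and then runs for `num_inference_steps` steps. A small illustrative sketch of the per-block step ranges (not pipeline code):

```python
# Per-block step ranges implied by the timeline above (illustration only).
num_inference_steps = 30
ar_step = 5
num_blocks = 5

for n in range(1, num_blocks + 1):
    start = 1 + (n - 1) * ar_step          # first step at which block n is denoised
    end = start + num_inference_steps - 1  # last step for block n
    print(f"Block {n}: steps {start}-{end}")

# Block 1: steps 1-30
# Block 2: steps 6-35
# Block 3: steps 11-40
# Block 4: steps 16-45
# Block 5: steps 21-50
```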
For Long Videos (`num_frames` > `base_num_frames`):
`base_num_frames` acts as the "sliding window size" for processing long videos.

Example: 257-frame video with `base_num_frames=97`, `overlap_history=17`
```text
┌──── Iteration 1 (frames 1-97) ────┐
│ Processing window: 97 frames │ → 5 blocks,
│ Generates: frames 1-97 │ async processing
└───────────────────────────────────┘
┌──── Iteration 2 (frames 81-177) ─────────┐
│ Overlap: 17 frames (81-97) from prev │ → 5 blocks,
│ Generates: frames 98-177 │ async processing
└──────────────────────────────────────────┘
┌──── Iteration 3 (frames 161-257) ────────┐
│ Overlap: 17 frames (161-177) from prev │ → 5 blocks,
│ Generates: frames 178-257 │ async processing
└──────────────────────────────────────────┘
```
Each iteration independently runs the asynchronous processing with its own `5` blocks.
`base_num_frames` controls:
1. Memory usage (larger window = more VRAM)
2. Model context length (must match training constraints)
3. Number of blocks per iteration (`base_num_latent_frames // causal_block_size`)

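The frame bookkeeping for the sliding window above can be sketched in a few lines. This is illustrative only (it assumes the window lengths line up exactly, as in this example), not the pipeline's actual loop:

```python
# Sliding-window frame ranges for the 257-frame example (illustrative sketch).
num_frames = 257
base_num_frames = 97
overlap_history = 17

stride = base_num_frames - overlap_history  # 80 new frames per iteration after the first
start, iteration = 0, 1
while start + base_num_frames <= num_frames:
    end = start + base_num_frames
    first_new = start + 1 if iteration == 1 else start + overlap_history + 1
    print(f"Iteration {iteration}: window {start + 1}-{end}, generates {first_new}-{end}")
    start += stride
    iteration += 1

# Iteration 1: window 1-97, generates 1-97
# Iteration 2: window 81-177, generates 98-177
# Iteration 3: window 161-257, generates 178-257
```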
Each block takes `30` steps to complete denoising.
Block `N` starts at step: `1 + (N-1) x ar_step`
Total steps: `30 + (5-1) x 5 = 50` steps

Synchronous mode (`ar_step=0`) would process all blocks/frames simultaneously:
```text
┌──────────────────────────────────────────────┐
│ Steps: 1 ... 30 │
│ All blocks: [■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■] │
└──────────────────────────────────────────────┘
```
Total steps: `30` steps

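Both totals follow from the same relation, `num_inference_steps + (num_blocks - 1) * ar_step`; a quick check:

```python
# Total number of denoising rows for asynchronous vs. synchronous processing.
def total_steps(num_inference_steps, num_blocks, ar_step):
    return num_inference_steps + (num_blocks - 1) * ar_step

print(total_steps(30, num_blocks=5, ar_step=5))  # 50 (asynchronous)
print(total_steps(30, num_blocks=5, ar_step=0))  # 30 (synchronous)
```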
An example of how the step matrix is constructed for asynchronous processing:
Given the parameters (`num_inference_steps=30, flow_shift=8, num_frames=97, ar_step=5, causal_block_size=5`):
```
- num_latent_frames = (97 frames - 1) // (4 temporal downsampling) + 1 = 25
- step_template = [999, 995, 991, 986, 980, 975, 969, 963, 956, 948,
                   941, 932, 922, 912, 901, 888, 874, 859, 841, 822,
                   799, 773, 743, 708, 666, 615, 551, 470, 363, 216]
```
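The `step_template` values come from the scheduler configured with `flow_shift=8`. As a rough illustration only (the sigma grid below is an assumption, not the scheduler's exact code), a flow-shifted schedule of this shape can be reproduced like so:

```python
import numpy as np

# Approximate reconstruction of a flow-shifted timestep schedule (illustration only;
# the real values come from the pipeline's scheduler).
num_inference_steps = 30
flow_shift = 8
num_train_timesteps = 1000

# Assumed sigma grid: linearly spaced from just below 1.0 down toward 0.
sigmas = np.linspace(1 - 1 / num_train_timesteps, 0, num_inference_steps + 1)[:-1]
shifted = flow_shift * sigmas / (1 + (flow_shift - 1) * sigmas)  # standard flow-shift formula
step_template = (shifted * num_train_timesteps).astype(int)

print(step_template[:5])  # ~[999 995 991 986 980]
print(step_template[-1])  # ~216
```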
The algorithm creates a `50x25` `step_matrix` where:
```
- Row 1:  [999×5, 999×5, 999×5, 999×5, 999×5]
- Row 2:  [995×5, 999×5, 999×5, 999×5, 999×5]
- Row 3:  [991×5, 999×5, 999×5, 999×5, 999×5]
- ...
- Row 42: [  0×5,   0×5,   0×5, 551×5, 773×5]
- ...
- Row 50: [  0×5,   0×5,   0×5,   0×5, 216×5]
```
Detailed Row `6` Analysis:
```
- step_matrix[5]:      [975×5,  999×5,  999×5,  999×5,  999×5]
- step_index[5]:       [  6×5,    1×5,    0×5,    0×5,    0×5]
- step_update_mask[5]: [True×5, True×5, False×5, False×5, False×5]
- valid_interval[5]:   (0, 25)
```
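Putting this together, the staggered schedule described above (each block advances one denoising step per row once it is active, trailing the previous block by `ar_step`) can be reproduced with a short sketch. This is an illustration of the idea, not the pipeline's exact implementation; `valid_interval` is omitted since it is simply `(0, 25)` here (the whole window fits in the model context):

```python
import torch

def build_step_matrix(step_template, num_latent_frames, ar_step, causal_block_size):
    """Illustrative sketch of the staggered diffusion-forcing schedule (not the pipeline's exact code)."""
    num_blocks = num_latent_frames // causal_block_size
    num_steps = len(step_template)
    # Pad the template: index 0 = "not started yet" (stays at the noisiest timestep),
    # index num_steps + 1 = "fully denoised" (timestep 0).
    padded = torch.cat([step_template[:1], step_template, torch.tensor([0])])

    step_matrix, step_index, update_mask = [], [], []
    row = torch.zeros(num_blocks, dtype=torch.long)
    while not torch.all(row >= num_steps):
        new_row = torch.zeros_like(row)
        for b in range(num_blocks):
            if b == 0 or row[b - 1] >= num_steps:      # first block, or previous block already finished
                new_row[b] = row[b] + 1                # advance by one denoising step
            else:
                new_row[b] = new_row[b - 1] - ar_step  # trail the previous block by ar_step
        new_row = new_row.clamp(0, num_steps + 1)
        update_mask.append((new_row != row) & (new_row != num_steps + 1))
        step_index.append(new_row)
        step_matrix.append(padded[new_row])
        row = new_row

    # Expand per-block values to per-latent-frame values (causal_block_size frames per block).
    _expand = lambda rows: torch.stack(rows).repeat_interleave(causal_block_size, dim=1)
    return _expand(step_matrix), _expand(step_index), _expand(update_mask)


step_template = torch.tensor([999, 995, 991, 986, 980, 975, 969, 963, 956, 948,
                              941, 932, 922, 912, 901, 888, 874, 859, 841, 822,
                              799, 773, 743, 708, 666, 615, 551, 470, 363, 216])
step_matrix, step_index, step_update_mask = build_step_matrix(
    step_template, num_latent_frames=25, ar_step=5, causal_block_size=5
)
print(step_matrix.shape)          # torch.Size([50, 25])
print(step_matrix[5, ::5])        # tensor([975, 999, 999, 999, 999])  -> Row 6, one value per block
print(step_index[5, ::5])         # tensor([6, 1, 0, 0, 0])
print(step_update_mask[5, ::5])   # tensor([ True,  True, False, False, False])
```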
Key Pattern: Block `i` lags behind Block `i-1` by exactly `ar_step=5` timesteps, creating the
staggered "diffusion forcing" effect where later blocks condition on cleaner earlier blocks.

### Text-to-Video Generation