
Commit 2cc7a22

Add Wan2.2 S2V draft (#388)

* Add Wan2.2 S2V draft
* Update docs
* Update audio links

1 parent d2f3f00 commit 2cc7a22

File tree

4 files changed: +246 −0 lines

docs.json

Lines changed: 2 additions & 0 deletions

@@ -157,6 +157,7 @@
      "group": "Wan Video",
      "pages": [
        "tutorials/video/wan/wan2_2",
+       "tutorials/video/wan/wan2-2-s2v",
        "tutorials/video/wan/wan2-2-fun-inp",
        "tutorials/video/wan/wan2-2-fun-control",
        "tutorials/video/wan/wan2-2-fun-camera",
@@ -713,6 +714,7 @@
      "group": "万相视频",
      "pages": [
        "zh-CN/tutorials/video/wan/wan2_2",
+       "zh-CN/tutorials/video/wan/wan2-2-s2v",
        "zh-CN/tutorials/video/wan/wan2-2-fun-inp",
        "zh-CN/tutorials/video/wan/wan2-2-fun-control",
        "zh-CN/tutorials/video/wan/wan2-2-fun-camera",
948 KB binary file (image, not shown)

tutorials/video/wan/wan2-2-s2v.mdx

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
---
title: Wan2.2-S2V Audio-Driven Video Generation ComfyUI Native Workflow Example
description: This is a native workflow example for Wan2.2-S2V audio-driven video generation in ComfyUI.
sidebarTitle: "Wan2.2 S2V"
---

import UpdateReminder from '/snippets/tutorials/update-reminder.mdx'

We're excited to announce that Wan2.2-S2V, the advanced audio-driven video generation model, is now natively supported in ComfyUI! This powerful AI model can transform static images and audio inputs into dynamic video content, supporting dialogue, singing, performance, and various creative content needs.

**Model Highlights**
- **Audio-Driven Video Generation**: Transforms static images and audio into synchronized videos
- **Cinematic-Grade Quality**: Generates film-quality videos with natural expressions and movements
- **Minute-Level Generation**: Supports long-form video creation
- **Multi-Format Support**: Works with full-body and half-body characters
- **Enhanced Motion Control**: Generates actions and environments from text instructions

Wan2.2 S2V code: [GitHub](https://github.com/aigc-apps/VideoX-Fun)
Wan2.2 S2V model: [Hugging Face](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B)

## Wan2.2 S2V ComfyUI Native Workflow

<UpdateReminder/>

### 1. Download Workflow File

Download the following workflow file and drag it into ComfyUI to load the workflow.

<video
  controls
  className="w-full aspect-video"
  src="https://raw.githubusercontent.com/Comfy-Org/example_workflows/refs/heads/main/video/wan/wan2.2_s2v/wan2.2-s2v.mp4"
></video>

<a className="prose" target='_blank' href="https://raw.githubusercontent.com/Comfy-Org/workflow_templates/refs/heads/main/templates/video_wan2_2_14B_s2v.json" style={{ display: 'inline-block', backgroundColor: '#0078D6', color: '#ffffff', padding: '10px 20px', borderRadius: '8px', borderColor: "transparent", textDecoration: 'none', fontWeight: 'bold'}}>
  <p className="prose" style={{ margin: 0, fontSize: "0.8rem" }}>Download JSON Workflow</p>
</a>

Download the following image and audio as input:

![input](https://raw.githubusercontent.com/Comfy-Org/example_workflows/refs/heads/main/video/wan/wan2.2_s2v/input.jpg)

<a className="prose" target='_blank' href="https://raw.githubusercontent.com/Comfy-Org/example_workflows/refs/heads/main/video/wan/wan2.2_s2v/input_audio.MP3" style={{ display: 'inline-block', backgroundColor: '#0078D6', color: '#ffffff', padding: '10px 20px', borderRadius: '8px', borderColor: "transparent", textDecoration: 'none', fontWeight: 'bold'}}>
  <p className="prose" style={{ margin: 0, fontSize: "0.8rem" }}>Download Input Audio</p>
</a>
### 2. Model Links

You can find all the models in [our repo](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged).

**diffusion_models**
- [wan2.2_s2v_14B_fp8_scaled.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_s2v_14B_fp8_scaled.safetensors)
- [wan2.2_s2v_14B_bf16.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_s2v_14B_bf16.safetensors)

**audio_encoders**
- [wav2vec2_large_english_fp16.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/audio_encoders/wav2vec2_large_english_fp16.safetensors)

**vae**
- [wan_2.1_vae.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors)

**text_encoders**
- [umt5_xxl_fp8_e4m3fn_scaled.safetensors](https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors)

Place the files in the following folders:

```
ComfyUI/
├───📂 models/
│   ├───📂 diffusion_models/
│   │   ├─── wan2.2_s2v_14B_fp8_scaled.safetensors
│   │   └─── wan2.2_s2v_14B_bf16.safetensors
│   ├───📂 text_encoders/
│   │   └─── umt5_xxl_fp8_e4m3fn_scaled.safetensors
│   ├───📂 audio_encoders/   # Create this folder if it doesn't exist
│   │   └─── wav2vec2_large_english_fp16.safetensors
│   └───📂 vae/
│       └─── wan_2.1_vae.safetensors
```
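
If you'd rather script the downloads than fetch each file by hand, the sketch below is one way to do it with the `huggingface_hub` Python package — an illustration of ours, not something the workflow requires. It mirrors the `split_files/` layout of the repo into `models/`; run it from your ComfyUI root after `pip install huggingface_hub`.

```python
# Sketch: download the repackaged files and mirror the split_files/ layout
# into ComfyUI's models/ folder. Assumes this runs from the ComfyUI root.
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download

FILES = {
    "Comfy-Org/Wan_2.2_ComfyUI_Repackaged": [
        "split_files/diffusion_models/wan2.2_s2v_14B_fp8_scaled.safetensors",
        "split_files/audio_encoders/wav2vec2_large_english_fp16.safetensors",
        "split_files/vae/wan_2.1_vae.safetensors",
    ],
    "Comfy-Org/Wan_2.1_ComfyUI_repackaged": [
        "split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    ],
}

for repo_id, filenames in FILES.items():
    for name in filenames:
        cached = hf_hub_download(repo_id=repo_id, filename=name)
        # split_files/<subfolder>/<file> maps onto models/<subfolder>/<file>
        dest = Path("models") / Path(name).relative_to("split_files")
        dest.parent.mkdir(parents=True, exist_ok=True)  # creates audio_encoders/ too
        shutil.copy(cached, dest)
        print(f"{name} -> {dest}")
```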

### 3. Workflow Instructions

![Workflow Instructions](/images/tutorial/video/wan/wan_2.2_14b_s2v.jpg)

#### 3.1 About Lightning LoRA

The Lightning LoRA used in this template greatly reduces generation time, but it was not trained specifically for Wan2.2 S2V, so expect some loss of motion and quality; see the notes under step 5 below.

#### 3.2 About fp8_scaled and bf16 Models

You can find both models [here](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/tree/main/split_files/diffusion_models):

- [wan2.2_s2v_14B_fp8_scaled.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_s2v_14B_fp8_scaled.safetensors)
- [wan2.2_s2v_14B_bf16.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_s2v_14B_bf16.safetensors)

This template uses `wan2.2_s2v_14B_fp8_scaled.safetensors`, which requires less VRAM, but you can try `wan2.2_s2v_14B_bf16.safetensors` to reduce quality degradation.
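
Since the choice is purely a VRAM/quality tradeoff, a tiny heuristic sketch for picking a checkpoint — the helper and its 40 GB cutoff are illustrative assumptions of ours, not an official rule:

```python
# Hypothetical helper: pick a checkpoint by available VRAM.
# The 40 GB threshold is an illustrative assumption, not an official figure.
import torch

def pick_s2v_checkpoint() -> str:
    if not torch.cuda.is_available():
        return "wan2.2_s2v_14B_fp8_scaled.safetensors"  # safest default
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return ("wan2.2_s2v_14B_bf16.safetensors" if vram_gb >= 40
            else "wan2.2_s2v_14B_fp8_scaled.safetensors")

print(pick_s2v_checkpoint())
```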
#### 3.3 Step-by-Step Operation Instructions

**Step 1: Load Models**
1. **Load Diffusion Model**: Load `wan2.2_s2v_14B_fp8_scaled.safetensors` or `wan2.2_s2v_14B_bf16.safetensors`
   - The provided workflow uses `wan2.2_s2v_14B_fp8_scaled.safetensors`, which requires less VRAM
   - You can try `wan2.2_s2v_14B_bf16.safetensors` to reduce quality degradation
2. **Load CLIP**: Load `umt5_xxl_fp8_e4m3fn_scaled.safetensors`
3. **Load VAE**: Load `wan_2.1_vae.safetensors`
4. **AudioEncoderLoader**: Load `wav2vec2_large_english_fp16.safetensors`
5. **LoraLoaderModelOnly**: Load `wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise.safetensors` (Lightning LoRA)
   - We tested all Wan2.2 Lightning LoRAs. Since none of them were trained specifically for Wan2.2 S2V, many key values don't match, but we include this one because it significantly reduces generation time. We will continue to optimize this template
   - Using it causes a significant loss of motion and quality
   - If you find the output quality too poor, try the original 20-step workflow
6. **LoadAudio**: Upload the provided audio file or your own audio
7. **Load Image**: Upload a reference image
8. **Batch sizes**: Set according to the number of Video S2V Extend subgraph nodes you add
   - Each Video S2V Extend subgraph adds 77 frames to the final output
   - For example, if you added 2 Video S2V Extend subgraphs, the batch size should be 3, i.e. the total number of sampling passes
   - **Chunk Length**: Keep the default value of 77
9. **Sampler Settings**: Choose settings based on whether you use the Lightning LoRA
   - With the 4-step Lightning LoRA: steps: 4, cfg: 1.0
   - Without the 4-step Lightning LoRA: steps: 20, cfg: 6.0
10. **Size Settings**: Set the output video dimensions
11. **Video S2V Extend**: Video extension subgraph nodes. The default number of frames per sampling pass is 77, and this is a 16 fps model, so each extension generates 77 / 16 = 4.8125 seconds of video
    - Match the number of extension subgraph nodes to the input audio length: for a 14 s input, the total frames needed are 14 × 16 = 224; each extension covers 77 frames, so 224 / 77 ≈ 2.9, rounded up to 3 extension subgraph nodes (see the sketch after this list)
12. Use Ctrl-Enter or click the Run button to execute the workflow
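
The chunk and batch arithmetic in steps 8 and 11 is easy to slip on, so here is a small sketch of the same calculation. The helper function is ours, purely for illustration; the 77-frame chunk size, the 16 fps rate, and the 4/20-step sampler settings come from the workflow above.

```python
# Sketch of the arithmetic in steps 8 and 11 (the helper itself is not
# part of the workflow; the constants are taken from the template).
import math

FPS = 16            # Wan2.2 S2V is a 16 fps model
CHUNK_FRAMES = 77   # default frames per sampling pass ("Chunk Length")

def s2v_plan(audio_seconds: float, use_lightning_lora: bool = True) -> dict:
    total_frames = math.ceil(audio_seconds * FPS)
    # Step 11: one extend node per 77-frame chunk of audio, rounded up.
    extend_nodes = math.ceil(total_frames / CHUNK_FRAMES)
    # Step 8: batch size = extend nodes + 1 (e.g. 2 extends -> batch size 3).
    batch_size = extend_nodes + 1
    steps, cfg = (4, 1.0) if use_lightning_lora else (20, 6.0)
    return {"total_frames": total_frames,
            "video_s2v_extend_nodes": extend_nodes,
            "batch_size": batch_size,
            "steps": steps, "cfg": cfg}

print(s2v_plan(14))  # 14 s of audio -> 224 frames -> 3 extend nodes
```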

zh-CN/tutorials/video/wan/wan2-2-s2v.mdx

Lines changed: 121 additions & 0 deletions

@@ -0,0 +1,121 @@
---
title: Wan2.2-S2V Audio-Driven Video Generation ComfyUI Native Workflow Example
description: This is a ComfyUI-based native workflow example for Wan2.2-S2V audio-driven video generation.
sidebarTitle: "Wan2.2 S2V"
---

import UpdateReminder from '/snippets/tutorials/update-reminder.mdx'

We're excited to announce that Wan2.2-S2V, the advanced audio-driven video generation model, is now natively supported in ComfyUI! This powerful AI model can turn static images and audio input into dynamic video content, supporting dialogue, singing, performance, and many other creative needs.

**Model Highlights**
- **Audio-Driven Video Generation**: Turns static images and audio into synchronized videos
- **Cinematic-Grade Quality**: Generates high-quality videos with natural expressions and movements
- **Minute-Level Generation**: Supports long-form video creation
- **Multi-Format Support**: Works with full-body and half-body characters
- **Enhanced Motion Control**: Generates actions and environments from text instructions

Wan2.2 S2V code repository: [GitHub](https://github.com/aigc-apps/VideoX-Fun)
Wan2.2 S2V model repository: [Hugging Face](https://huggingface.co/Wan-AI/Wan2.2-S2V-14B)

## Wan2.2 S2V ComfyUI Native Workflow

<UpdateReminder/>

### 1. Download Workflow File

Download the following workflow file and drag it into ComfyUI to load the workflow.

<video
  controls
  className="w-full aspect-video"
  src="https://raw.githubusercontent.com/Comfy-Org/example_workflows/refs/heads/main/video/wan/wan2.2_s2v/wan2.2-s2v.mp4"
></video>

<a className="prose" target='_blank' href="https://raw.githubusercontent.com/Comfy-Org/workflow_templates/refs/heads/main/templates/video_wan2_2_14B_s2v.json" style={{ display: 'inline-block', backgroundColor: '#0078D6', color: '#ffffff', padding: '10px 20px', borderRadius: '8px', borderColor: "transparent", textDecoration: 'none', fontWeight: 'bold'}}>
  <p className="prose" style={{ margin: 0, fontSize: "0.8rem" }}>Download JSON Workflow</p>
</a>

Download the following image and audio as input:

![input](https://raw.githubusercontent.com/Comfy-Org/example_workflows/refs/heads/main/video/wan/wan2.2_s2v/input.jpg)

<a className="prose" target='_blank' href="https://raw.githubusercontent.com/Comfy-Org/example_workflows/refs/heads/main/video/wan/wan2.2_s2v/input_audio.MP3" style={{ display: 'inline-block', backgroundColor: '#0078D6', color: '#ffffff', padding: '10px 20px', borderRadius: '8px', borderColor: "transparent", textDecoration: 'none', fontWeight: 'bold'}}>
  <p className="prose" style={{ margin: 0, fontSize: "0.8rem" }}>Download Input Audio</p>
</a>
### 2. Model Links

You can find all the models in [our repository](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged).

**diffusion_models**
- [wan2.2_s2v_14B_fp8_scaled.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_s2v_14B_fp8_scaled.safetensors)
- [wan2.2_s2v_14B_bf16.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_s2v_14B_bf16.safetensors)

**audio_encoders**
- [wav2vec2_large_english_fp16.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/audio_encoders/wav2vec2_large_english_fp16.safetensors)

**vae**
- [wan_2.1_vae.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors)

**text_encoders**
- [umt5_xxl_fp8_e4m3fn_scaled.safetensors](https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors)

Place the files in the following folders:

```
ComfyUI/
├───📂 models/
│   ├───📂 diffusion_models/
│   │   ├─── wan2.2_s2v_14B_fp8_scaled.safetensors
│   │   └─── wan2.2_s2v_14B_bf16.safetensors
│   ├───📂 text_encoders/
│   │   └─── umt5_xxl_fp8_e4m3fn_scaled.safetensors
│   ├───📂 audio_encoders/   # Create this folder manually if it doesn't exist
│   │   └─── wav2vec2_large_english_fp16.safetensors
│   └───📂 vae/
│       └─── wan_2.1_vae.safetensors
```

### 3. Workflow Instructions

![Workflow Instructions](/images/tutorial/video/wan/wan_2.2_14b_s2v.jpg)

#### 3.1 About Lightning LoRA

The Lightning LoRA used in this template greatly reduces generation time, but it was not trained specifically for Wan2.2 S2V, so expect some loss of motion and quality; see the notes under step 5 below.

#### 3.2 About the fp8_scaled and bf16 Models

You can find both models [here](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/tree/main/split_files/diffusion_models):

- [wan2.2_s2v_14B_fp8_scaled.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_s2v_14B_fp8_scaled.safetensors)
- [wan2.2_s2v_14B_bf16.safetensors](https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/diffusion_models/wan2.2_s2v_14B_bf16.safetensors)

This template uses `wan2.2_s2v_14B_fp8_scaled.safetensors`, which requires less VRAM, but you can try `wan2.2_s2v_14B_bf16.safetensors` to reduce quality degradation.

#### 3.3 Step-by-Step Operation Instructions

**Step 1: Load Models**
1. **Load Diffusion Model**: Load `wan2.2_s2v_14B_fp8_scaled.safetensors` or `wan2.2_s2v_14B_bf16.safetensors`
   - The provided workflow uses `wan2.2_s2v_14B_fp8_scaled.safetensors`, which requires less VRAM
   - You can try `wan2.2_s2v_14B_bf16.safetensors` to reduce quality degradation
2. **Load CLIP**: Load `umt5_xxl_fp8_e4m3fn_scaled.safetensors`
3. **Load VAE**: Load `wan_2.1_vae.safetensors`
4. **AudioEncoderLoader**: Load `wav2vec2_large_english_fp16.safetensors`
5. **LoraLoaderModelOnly**: Load `wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise.safetensors` (Lightning LoRA)
   - We tested all Wan2.2 Lightning LoRAs. Since none of them were trained specifically for Wan2.2 S2V, many key values don't match, but we include this one because it greatly reduces generation time; we will continue to optimize this template
   - Using it causes a significant loss of motion and quality
   - If you find the output quality too poor, try the original 20-step workflow
6. **LoadAudio**: Upload the audio file we provide, or your own audio
7. **Load Image**: Upload a reference image
8. **Batch sizes**: Set according to the number of Video S2V Extend subgraph nodes you add
   - Each Video S2V Extend subgraph adds 77 frames to the final output
   - For example, if you added 2 Video S2V Extend subgraphs, the batch size should be 3, i.e. the total number of sampling passes
   - **Chunk Length**: Keep the default value of 77
9. **Sampler Settings**: Choose settings based on whether you use the Lightning LoRA
   - With the 4-step Lightning LoRA: steps: 4, cfg: 1.0
   - Without the 4-step Lightning LoRA: steps: 20, cfg: 6.0
10. **Size Settings**: Set the output video dimensions
11. **Video S2V Extend**: Video extension subgraph nodes. The default number of frames per sampling pass is 77, and this is a 16 fps model, so each extension generates 77 / 16 = 4.8125 seconds of video
    - Match the number of extension subgraph nodes to the input audio length: for a 14 s input, the total frames needed are 14 × 16 = 224; each extension covers 77 frames, so 224 / 77 ≈ 2.9, rounded up to 3 extension subgraph nodes
12. Use Ctrl-Enter or click the Run button to run the workflow
