Generate high-fidelity, synchronized foley audio for any video directly within ComfyUI, powered by Tencent's HunyuanVideo-Foley model.
This custom node set provides a modular and offline-capable workflow for AI sound effect generation.
Note: The sampler node has been updated! It now accepts `IMAGE` frames and `fps` directly from video loader nodes for a more integrated and efficient workflow, and it now exposes a negative prompt input.
- High-Fidelity Audio: Generates 48kHz stereo audio using the advanced DAC VAE.
- Video-to-Audio Synchronization: Leverages the Synchformer model to ensure audio events are timed with visual actions.
- Text-Guided Control: Use positive and negative text prompts, powered by the CLAP model, to creatively direct the type of sound you want to generate.
- Modular: The workflow is broken into logical `Loader`, `Sampler`, and `VAE Decode` nodes, mirroring the standard Stable Diffusion workflow.
- Integrated: Accepts video frames directly from popular loader nodes like `VHS_LoadVideoPath`, avoiding redundant file operations.
- VRAM Management: Caches models in VRAM for fast, repeated generations. Includes an optional "Low VRAM" mode to unload models after use, ideal for memory-constrained systems.
- Offline Capable: No automatic model downloads. Once you've downloaded the models, the node works entirely offline.
- Open ComfyUI Manager.
- Click on `Install Custom Nodes`.
- Search for `ComfyUI-HunyuanVideo_Foley` and click `Install`.
- Restart ComfyUI.
- Follow the Download Models instructions below.
- Navigate to your ComfyUI `custom_nodes` directory:
  ```bash
  cd ComfyUI/custom_nodes/
  ```
- Clone this repository:
  ```bash
  git clone https://github.com/BobRandomNumber/ComfyUI-HunyuanVideo_Foley.git
  ```
- Install the required dependencies:
  ```bash
  cd ComfyUI-HunyuanVideo_Foley/
  pip install -r requirements.txt
  ```
- Restart ComfyUI.
This node requires you to download the model files manually and organize them in a specific folder structure. This ensures the node works offline and gives you full control.
- Navigate to `ComfyUI/models/`.
- Create a new folder named `hunyuan_foley`.
- Download the following and place them inside your `hunyuan_foley` folder:
  - Hunyuan-Foley Base Models from Tencent/HunyuanVideo-Foley on Hugging Face:
    - `hunyuanvideo_foley.pth`
    - `synchformer_state_dict.pth`
    - `vae_128d_48k.pth`
  - SigLIP Vision Model from google/siglip2-base-patch16-512 on Hugging Face:
    - Create a new folder named `siglip2`.
    - Download `model.safetensors`, `config.json`, and `preprocessor_config.json` and place them inside the `siglip2` folder.
  - CLAP Text Model from laion/larger_clap_general on Hugging Face:
    - Create a new folder named `clap`.
    - Download `model.safetensors`, `config.json`, `merges.txt`, and `vocab.json` and place them inside the `clap` folder.
- Your final folder structure should look exactly like this:
```
ComfyUI/
└── models/
    └── hunyuan_foley/   <-- You will see this folder selected in the Loader node
        ├── hunyuanvideo_foley.pth
        ├── synchformer_state_dict.pth
        ├── vae_128d_48k.pth
        │
        ├── siglip2/     <-- Subfolder for SigLIP2
        │   ├── model.safetensors
        │   ├── config.json
        │   └── preprocessor_config.json
        │
        └── clap/        <-- Subfolder for CLAP
            ├── model.safetensors
            ├── config.json
            ├── merges.txt
            └── vocab.json
```
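If you want to verify the layout before launching ComfyUI, a minimal check script like the one below will report any missing files. It assumes the default `ComfyUI/models` location; adjust `MODELS_DIR` to match your install.

```python
from pathlib import Path

# Assumption: ComfyUI lives in the current directory; adjust as needed.
MODELS_DIR = Path("ComfyUI/models/hunyuan_foley")

EXPECTED = [
    "hunyuanvideo_foley.pth",
    "synchformer_state_dict.pth",
    "vae_128d_48k.pth",
    "siglip2/model.safetensors",
    "siglip2/config.json",
    "siglip2/preprocessor_config.json",
    "clap/model.safetensors",
    "clap/config.json",
    "clap/merges.txt",
    "clap/vocab.json",
]

missing = [name for name in EXPECTED if not (MODELS_DIR / name).is_file()]
print("All model files found." if not missing else f"Missing: {missing}")
```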
The workflow is designed to be modular and familiar to ComfyUI users.
The `Hunyuan-Foley model loader` loads the main diffusion model and all necessary conditioning models (SigLIP2, CLAP, Synchformer) into VRAM. These models are cached for fast subsequent generations.
- Inputs:
  - `model_path_name`: The model folder you created.
- Outputs:
  - `FOLEY_MODEL`: The loaded models, ready to be passed to the sampler.
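Conceptually, the caching behaves like the sketch below. This is an illustration of the idea only, not the node's actual internals (the real loader also constructs model objects around these checkpoints):

```python
import torch

# Illustrative module-level cache keyed by model folder. Keeping the
# entry alive keeps the weights resident between queue runs.
_MODEL_CACHE: dict[str, dict] = {}

def load_foley_weights(model_dir: str, device: str = "cuda") -> dict:
    if model_dir in _MODEL_CACHE:
        return _MODEL_CACHE[model_dir]  # cache hit: skip the slow disk load
    weights = {
        "foley": torch.load(f"{model_dir}/hunyuanvideo_foley.pth", map_location=device),
        "synchformer": torch.load(f"{model_dir}/synchformer_state_dict.pth", map_location=device),
    }
    _MODEL_CACHE[model_dir] = weights
    return weights
```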
The `Hunyuan-Foley VAE loader` loads the specialized DAC audio VAE used for decoding the final sound. Keeping it separate saves VRAM during the sampling process.
- Inputs:
  - `vae_name`: A dropdown to select the `vae_128d_48k.pth` file. It searches your `hunyuan_foley` model folder.
- Outputs:
  - `VAE`: The loaded DAC VAE model.
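The dropdown is populated by a simple directory scan, roughly like this hypothetical helper (the actual node may filter the list differently):

```python
from pathlib import Path

def list_vae_files(models_dir: str = "ComfyUI/models/hunyuan_foley") -> list[str]:
    # Hypothetical helper: every .pth in the folder becomes a dropdown
    # choice, so vae_128d_48k.pth appears once it is in place.
    return sorted(p.name for p in Path(models_dir).glob("*.pth"))
```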
The `Hunyuan-Foley Sampler` is the core node where the audio generation happens. It takes video frames, prompts, and sampling parameters and generates a latent representation of the audio.
- Inputs:
  - `foley_model`: The model from the `Hunyuan-Foley model loader`.
  - `video_frames` (IMAGE): A batch of video frames, typically from a video loader node like `VHS_LoadVideoPath`.
  - `fps` (FLOAT): The framerate of the original video. This is crucial for correct timing. You can get this from a node like `VHS_VideoInfoSource`.
  - `prompt`: Your text description of the desired sound.
  - `negative_prompt`: A text description of sounds to avoid (e.g., "noisy, harsh, muffled").
  - `guidance_scale`, `steps`, `seed`: Standard diffusion sampling parameters.
- Outputs:
  - `LATENT`: A latent tensor representing the generated audio. This is passed to the VAE Decode node.
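The `fps` value matters because the duration of the generated audio is derived from the frame count and framerate, so a wrong value stretches or compresses every sound event. The exact latent-length math is internal to the node, but the basic relationship is:

```python
# 120 frames at 24 fps cover 5 seconds; the same frames declared as 30 fps
# would cover only 4 seconds, shifting every audio event out of sync.
num_frames = 120   # size of the IMAGE batch from the video loader
fps = 24.0         # from VHS_VideoInfoSource
duration_s = num_frames / fps
print(f"audio duration: {duration_s:.2f} s")  # -> audio duration: 5.00 s
```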
The `Hunyuan-Foley VAE Decode` node takes the latent tensor from the sampler and converts it into the final audio waveform. It also contains the VRAM management toggle.
- Inputs:
  - `samples`: The `LATENT` output from the `Hunyuan-Foley Sampler`.
  - `vae`: The `VAE` output from the `Hunyuan-Foley VAE loader`.
  - `unload_models_after_use` (Boolean Toggle):
    - `False` (default): Keeps the main models in VRAM for fast subsequent generations.
    - `True` (Low VRAM Mode): Frees VRAM by moving the main models to system RAM after generation completes. The next generation will be slower, as it requires a full reload.
- Outputs:
  - `AUDIO`: The final audio waveform, which can be connected to `Save Audio`, `Preview Audio`, or a `Video Combine` node.
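The Low VRAM toggle amounts to the standard PyTorch offloading pattern, sketched here for illustration (not the node's exact code):

```python
import torch

def unload_to_system_ram(models: dict) -> None:
    # Move each cached model from VRAM to system RAM, then release the
    # now-unused CUDA allocations back to the driver.
    for name, model in models.items():
        if hasattr(model, "to"):
            models[name] = model.to("cpu")
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```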
- VRAM Requirement: For the best performance (keeping models cached), a GPU with approximately 10-12GB of VRAM is recommended.
- Initial Load: The first time you run a workflow, the `Hunyuan-Foley model loader` will take a moment to load all models from disk into VRAM. Subsequent runs in the same session will be faster, as long as the models are not unloaded.
- Connecting The Sampler (Recommended Workflow):
  - Use a `VHS_LoadVideoPath` node to load your video. This outputs the frames (`IMAGE`) and video information (`VHS_VIDEOINFO`).
  - Connect the `IMAGE` output from `VHS_LoadVideoPath` directly to the `video_frames` input on the `Hunyuan-Foley Sampler`.
  - Add a `VHS_VideoInfoSource` node.
  - Connect the `VHS_VIDEOINFO` output from the loader to the `VHS_VideoInfoSource` node.
  - Connect the `fps` output from `VHS_VideoInfoSource` to the `fps` input on the `Hunyuan-Foley Sampler`. (If your loader does not expose an fps value, see the sketch after this list.)
- Low VRAM Mode: If you are running low on VRAM or only need to generate a single audio track, set the `unload_models_after_use` toggle on the `Hunyuan-Foley VAE Decode` node to `True`. This significantly reduces the idle VRAM footprint after the workflow completes.
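If your video loader does not expose an `fps` output, you can read it from the source file yourself, for example with OpenCV (a hypothetical helper; assumes `opencv-python` is installed):

```python
import cv2

def video_fps(path: str) -> float:
    # Read the frame rate reported by the container and feed it into
    # the sampler's fps input.
    cap = cv2.VideoCapture(path)
    try:
        return cap.get(cv2.CAP_PROP_FPS)
    finally:
        cap.release()
```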
- Tencent Hunyuan: For creating and open-sourcing the original HunyuanVideo-Foley model.
- Google Research: For the SigLIP model.
- LAION: For the CLAP model.
- Descript: For the descript-audio-codec (DAC VAE).
- v-iashin: For the Synchformer model.