# ComfyUI-HunyuanVideo_Foley

Generate high-fidelity, synchronized foley audio for any video directly within ComfyUI, powered by Tencent's HunyuanVideo-Foley model. This custom node set provides a modular, offline-capable workflow for AI sound-effect generation.
**Note:** Support for the lighter-weight `hunyuanvideo_foley_xl.pth` model has been added.
## Features

- **High-Fidelity Audio**: Generates 48 kHz stereo audio using the advanced DAC VAE.
- **Video-to-Audio Synchronization**: Leverages the Synchformer model to ensure audio events are timed with visual actions.
- **Text-Guided Control**: Use positive and negative text prompts, powered by the CLAP model, to creatively direct the type of sound you want to generate.
- **Flexible Model Choice**: Includes support for the original high-quality model and a smaller, faster XL variant.
- **Modular**: The workflow is broken into logical Loader, Sampler, and VAE Decode nodes, mirroring the standard Stable Diffusion workflow.
- **Integrated**: Accepts video frames directly from popular loader nodes like `VHS_LoadVideoPath`, avoiding redundant file operations.
- **VRAM Management**: Caches models in VRAM for fast, repeated generations. Includes an optional "Low VRAM" mode to unload models after use, ideal for memory-constrained systems (see the sketch after this list).
- **Offline Capable**: No automatic model downloads. Once you've downloaded the models, the node works entirely offline.
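The caching behaviour follows the usual ComfyUI pattern: load once, keep models resident, and optionally push them to system RAM when done. A generic sketch of that cache-and-offload pattern (illustrative only, not this repository's actual code):

```python
import torch

_MODEL_CACHE = {}  # survives across queue runs, so repeat generations skip disk loads

def get_models(key, load_fn, device="cuda"):
    """Load once, then serve from the VRAM cache on every later call."""
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = load_fn().to(device)
    return _MODEL_CACHE[key]

def offload_models():
    """'Low VRAM' mode: move cached weights to system RAM so the next run
    reloads from RAM rather than disk, then release the GPU memory they held."""
    for model in _MODEL_CACHE.values():
        model.to("cpu")
    torch.cuda.empty_cache()
```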
## Installation

### ComfyUI Manager (Recommended)

- Open ComfyUI Manager.
- Click on `Install Custom Nodes`.
- Search for `ComfyUI-HunyuanVideo_Foley` and click `Install`.
- Restart ComfyUI.
- Follow the Download Models instructions below.
### Manual Installation

- Navigate to your ComfyUI `custom_nodes` directory:

  ```
  cd ComfyUI/custom_nodes/
  ```

- Clone this repository:

  ```
  git clone https://github.com/BobRandomNumber/ComfyUI-HunyuanVideo_Foley.git
  ```

- Install the required dependencies:

  ```
  cd ComfyUI-HunyuanVideo_Foley/
  pip install -r requirements.txt
  ```

- Restart ComfyUI.
## Download Models

This node requires you to download the model files manually and organize them in a specific folder structure. This ensures the node works offline and gives you full control.

- Navigate to `ComfyUI/models/`.
- Create a new folder named `hunyuan_foley`.
- Download the following and place them inside your `hunyuan_foley` folder:
  - **Hunyuan-Foley base models** from `Tencent/HunyuanVideo-Foley` on Hugging Face:
    - `hunyuanvideo_foley.pth`
    - `hunyuanvideo_foley_xl.pth` (optional: a smaller, faster alternative model)
    - `synchformer_state_dict.pth`
    - `vae_128d_48k.pth`
  - **SigLIP2 vision model** from `google/siglip2-base-patch16-512` on Hugging Face:
    - Create a new folder named `siglip2`.
    - Download `model.safetensors`, `config.json`, and `preprocessor_config.json` and place them inside the `siglip2` folder.
  - **CLAP text model** from `laion/larger_clap_general` on Hugging Face:
    - Create a new folder named `clap`.
    - Download `model.safetensors`, `config.json`, `merges.txt`, and `vocab.json` and place them inside the `clap` folder.
- Your final folder structure should look exactly like this:
```
ComfyUI/
└── models/
    └── hunyuan_foley/        <-- You will see this folder selected in the Loader node
        ├── hunyuanvideo_foley.pth
        ├── hunyuanvideo_foley_xl.pth
        ├── synchformer_state_dict.pth
        ├── vae_128d_48k.pth
        │
        ├── siglip2/          <-- Subfolder for SigLIP2
        │   ├── model.safetensors
        │   ├── config.json
        │   └── preprocessor_config.json
        │
        └── clap/             <-- Subfolder for CLAP
            ├── model.safetensors
            ├── config.json
            ├── merges.txt
            └── vocab.json
```
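If you prefer to script the downloads, here is a minimal sketch using `huggingface_hub` (repo IDs and filenames are taken from the list above; adjust `MODELS_DIR` to your installation):

```python
from pathlib import Path
from huggingface_hub import hf_hub_download  # pip install huggingface_hub

MODELS_DIR = Path("ComfyUI/models/hunyuan_foley")  # adjust to your ComfyUI install

DOWNLOADS = [
    # (Hugging Face repo id as named above, target folder, filenames)
    ("Tencent/HunyuanVideo-Foley", MODELS_DIR, [
        "hunyuanvideo_foley.pth",
        "hunyuanvideo_foley_xl.pth",   # optional XL variant
        "synchformer_state_dict.pth",
        "vae_128d_48k.pth",
    ]),
    ("google/siglip2-base-patch16-512", MODELS_DIR / "siglip2", [
        "model.safetensors", "config.json", "preprocessor_config.json",
    ]),
    ("laion/larger_clap_general", MODELS_DIR / "clap", [
        "model.safetensors", "config.json", "merges.txt", "vocab.json",
    ]),
]

for repo_id, target, filenames in DOWNLOADS:
    target.mkdir(parents=True, exist_ok=True)
    for name in filenames:
        hf_hub_download(repo_id=repo_id, filename=name, local_dir=target)
```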
## The Nodes

The workflow is designed to be modular and familiar to ComfyUI users.
### Hunyuan-Foley Model Loader

This node loads the main diffusion model and all necessary conditioning models (SigLIP2, CLAP, Synchformer) into VRAM. These models are cached for fast subsequent generations.

- Inputs:
  - `model_path_name`: The model folder you created.
  - `foley_checkpoint_name`: A dropdown to select which main model checkpoint (`.pth` file) to use. Allows switching between the base and XL models.
- Outputs:
  - `FOLEY_MODEL`: The loaded models, ready to be passed to the sampler.
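Conceptually, the checkpoint dropdown just reflects the `.pth` files found in the selected folder; a rough sketch of that behaviour (an assumption about the implementation, not the node's actual code):

```python
from pathlib import Path

def list_foley_checkpoints(models_dir: str) -> list[str]:
    # Top-level .pth files inside the hunyuan_foley folder; the actual node
    # may additionally filter out the Synchformer and VAE weights, which
    # have dedicated loaders.
    return sorted(p.name for p in Path(models_dir).glob("*.pth"))
```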
### Hunyuan-Foley VAE Loader

This node loads the specialized DAC audio VAE used for decoding the final sound. Keeping it separate saves VRAM during the sampling process.

- Inputs:
  - `vae_name`: A dropdown to select the `vae_128d_48k.pth` file. It will search your `hunyuan_foley` model folder.
- Outputs:
  - `VAE`: The loaded DAC VAE model.
### Hunyuan-Foley Sampler

This is the core node where the audio generation happens. It takes video frames, prompts, and sampling parameters to generate a latent representation of the audio.

- Inputs:
  - `foley_model`: The model from the `Hunyuan-Foley Model Loader`.
  - `video_frames` (IMAGE): A batch of video frames, typically from a video loader node like `VHS_LoadVideoPath`.
  - `fps` (FLOAT): The framerate of the original video. This is crucial for correct timing. You can get this from a node like `VHS_VideoInfoSource`.
  - `prompt`: Your text description of the desired sound.
  - `negative_prompt`: A text description of sounds to avoid (e.g., "noisy, harsh, muffled").
  - `guidance_scale`, `steps`, `seed`: Standard diffusion sampling parameters.
- Outputs:
  - `LATENT`: A latent tensor representing the generated audio. This is passed to the VAE Decode node.
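Since the sampler times the audio to the frames it receives, the target duration follows directly from the frame count and `fps` (illustrative arithmetic, assuming the generated audio spans the full clip):

```python
num_frames = 240              # length of the IMAGE batch from the video loader
fps = 24.0                    # from VHS_VideoInfoSource; must match the source video
duration_s = num_frames / fps # 10.0 s of 48 kHz stereo audio after decoding
```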
### Hunyuan-Foley VAE Decode

This node takes the latent tensor from the sampler and converts it into a final audio waveform. It also contains the VRAM management toggle.

- Inputs:
  - `samples`: The `LATENT` output from the `Hunyuan-Foley Sampler`.
  - `vae`: The `VAE` output from the `Hunyuan-Foley VAE Loader`.
  - `unload_models_after_use` (Boolean toggle):
    - `False` (default): Keeps the main models in VRAM for fast subsequent generations.
    - `True` (Low VRAM mode): Frees VRAM by moving the main models to system RAM after generation is complete. The next generation will be slower as it requires a full reload.
- Outputs:
  - `AUDIO`: The final audio waveform, which can be connected to `Save Audio`, `Preview Audio`, or a `Video Combine` node.
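For downstream scripting, ComfyUI's `AUDIO` type is conventionally a dictionary holding a waveform tensor and its sample rate; a minimal sketch of saving one outside the graph, assuming that convention and `torchaudio`:

```python
import torchaudio

def save_foley(audio: dict, path: str = "foley.flac") -> None:
    # Assumed ComfyUI AUDIO convention: {"waveform": [batch, channels, samples], "sample_rate": int}
    waveform = audio["waveform"][0].cpu()  # first batch item -> [channels, samples]
    torchaudio.save(path, waveform, audio["sample_rate"])
```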
## Notes & Tips

- **VRAM Requirement**: For the best performance (keeping models cached), a GPU with approximately 10-12 GB of VRAM is recommended.
- **Initial Load**: The first time you run a workflow, the `Hunyuan-Foley Model Loader` will take a moment to load all models from disk into VRAM. Subsequent runs in the same session will be faster as long as models are not unloaded.
- **XL Model Advantage**: The `hunyuanvideo_foley_xl.pth` model is smaller than the original. It may offer quicker loading and inference, making it a great choice for users prioritizing speed or working with more limited VRAM.
- **Connecting the Sampler (Recommended Workflow)** (see the sketch after this list):
  - Use a `VHS_LoadVideoPath` node to load your video. This will output the frames (`IMAGE`) and video information (`VHS_VIDEOINFO`).
  - Connect the `IMAGE` output from `VHS_LoadVideoPath` directly to the `video_frames` input on the `Hunyuan-Foley Sampler`.
  - Add a `VHS_VideoInfoSource` node.
  - Connect the `VHS_VIDEOINFO` output from the loader to the `VHS_VideoInfoSource` node.
  - Connect the `fps` output from `VHS_VideoInfoSource` to the `fps` input on the `Hunyuan-Foley Sampler`.
- **Low VRAM Mode**: If you are running low on VRAM or only need to generate a single audio track, set the `unload_models_after_use` toggle on the `Hunyuan-Foley VAE Decode` node to `True`. This will significantly reduce the idle VRAM footprint after the workflow completes.
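As a sketch of that wiring in ComfyUI's API (JSON prompt) format, expressed as a Python dict. The `HunyuanFoley*` class names below are hypothetical placeholders (only the nodes' display names are documented above), and output slot indices are illustrative; check the actual identifiers in your installation:

```python
# Hypothetical API-format graph for the recommended workflow.
# "HunyuanFoley*" class names are placeholders, not confirmed identifiers;
# VHS_LoadVideoPath / VHS_VideoInfoSource come from VideoHelperSuite.
graph = {
    "1": {"class_type": "VHS_LoadVideoPath",
          "inputs": {"video": "input/clip.mp4"}},
    "2": {"class_type": "VHS_VideoInfoSource",           # exposes fps, etc.
          "inputs": {"video_info": ["1", 3]}},           # VHS_VIDEOINFO slot (illustrative index)
    "3": {"class_type": "HunyuanFoleyModelLoader",       # hypothetical class name
          "inputs": {"model_path_name": "hunyuan_foley",
                     "foley_checkpoint_name": "hunyuanvideo_foley.pth"}},
    "4": {"class_type": "HunyuanFoleyVAELoader",         # hypothetical class name
          "inputs": {"vae_name": "vae_128d_48k.pth"}},
    "5": {"class_type": "HunyuanFoleySampler",           # hypothetical class name
          "inputs": {"foley_model": ["3", 0],
                     "video_frames": ["1", 0],           # IMAGE output of the video loader
                     "fps": ["2", 0],                    # fps output of VideoInfoSource
                     "prompt": "footsteps on gravel",
                     "negative_prompt": "noisy, harsh, muffled",
                     "guidance_scale": 4.5,              # example values only
                     "steps": 50,
                     "seed": 0}},
    "6": {"class_type": "HunyuanFoleyVAEDecode",         # hypothetical class name
          "inputs": {"samples": ["5", 0],
                     "vae": ["4", 0],
                     "unload_models_after_use": False}}, # True = Low VRAM mode
}
```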
## Acknowledgements

- **Tencent Hunyuan**: For creating and open-sourcing the original HunyuanVideo-Foley model.
- **Google Research**: For the SigLIP model.
- **LAION**: For the CLAP model.
- **Descript**: For the `descript-audio-codec` (DAC VAE).
- **v-iashin**: For the Synchformer model.
