ComfyUI custom nodes for LongCat-AudioDiT — Diffusion-based Zero-Shot Text-to-Speech
LongCat-AudioDiT is a diffusion-based text-to-speech model by Meituan that generates high-quality speech audio using a DiT (Diffusion Transformer) architecture with an ODE Euler solver. It supports zero-shot voice cloning from reference audio without any fine-tuning.
This ComfyUI wrapper provides native node-based integration with:
- Zero-shot TTS from text input
- Voice cloning from reference audio (3-15 seconds recommended)
- Multi-speaker conversation synthesis with multiple cloned voices
- 24kHz output at broadcast quality
⚠️ Audio Generation Duration: The model supports generating up to 60 seconds of audio, but longer outputs may have issues with word repetition or dropped words. For best results, keep generated audio in the 15-30 second range. (Note: This refers to the output audio duration, not the input reference audio.)
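A practical way to stay in that 15-30 second sweet spot is to split long scripts into sentence-aligned chunks and synthesize each chunk separately. A minimal sketch (the character budget is an illustrative assumption, not a parameter of these nodes):

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text into sentence-aligned chunks below max_chars.

    ~300 characters is a rough proxy for 15-30 seconds of speech;
    tune it for your speaking rate.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Feed each chunk to the TTS node in turn and concatenate the resulting audio clips.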
- Zero-Shot Voice Cloning — Clone any voice from a short reference audio clip
- Multi-Speaker TTS — Generate conversations with multiple cloned voices using `[speaker_N]:` tags
- Diffusion-Based Generation — DiT transformer with ODE Euler solver for high-quality audio
- FP8 / FP16 / BF16 / FP32 Support — FP8 models are auto-dequantized to BF16; FP16 works for TTS only (voice cloning requires BF16)
- Native ComfyUI Integration — AUDIO noodle inputs, progress bars, interruption support
- Smart Auto-Download — Model weights auto-downloaded from HuggingFace on first use
- Smart Caching — Optional model caching with CPU offload between runs
- Optimized Attention — Support for SDPA, SageAttention backends
- Auto-Install Dependencies — Missing packages are installed automatically on startup
- GPU: NVIDIA GPU with 8GB+ VRAM for bf16 model, 16GB+ VRAM for fp32 model
- CPU: Supported but slow
- MPS: Experimental
- Python: 3.10+
- CUDA: 11.8+ (for GPU inference)
| Model | VRAM | Description |
|---|---|---|
| LongCat-AudioDiT-1B | ~6-8GB | 1B params (FP32) — smallest model, lower VRAM |
| LongCat-AudioDiT-3.5B-bf16 | ~10-14GB | BF16 quantized (recommended) — best balance of quality and VRAM |
| LongCat-AudioDiT-3.5B-fp8 | ~8-12GB | FP8 quantized — smallest download, dequantized to BF16 at load time |
| LongCat-AudioDiT-3.5B | ~20GB | FP32 original — highest quality, requires more VRAM |
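The VRAM figures follow roughly from parameter count times bytes per parameter, plus overhead for activations, the text encoder, and the audio decoder. A back-of-the-envelope check (the overhead split is a rough guess, not a measured breakdown):

```python
def weight_gb(params_billions: float, bytes_per_param: int) -> float:
    """Approximate weight memory in GB (using 1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_gb(3.5, 2))  # 3.5B params in BF16 (2 bytes/param) -> 7.0 GB of weights
print(weight_gb(3.5, 4))  # 3.5B params in FP32 (4 bytes/param) -> 14.0 GB of weights
print(weight_gb(1.0, 4))  # 1B params in FP32 -> 4.0 GB of weights
```

The table's higher numbers account for everything beyond the raw transformer weights.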
Models are auto-downloaded from HuggingFace on first use:
- meituan-longcat/LongCat-AudioDiT-1B — 1B params model
- meituan-longcat/LongCat-AudioDiT-3.5B — original FP32 model
- drbaph/LongCat-AudioDiT-3.5B-bf16 — BF16 quantized
- drbaph/LongCat-AudioDiT-3.5B-fp8 — FP8 quantized
Click to expand installation methods
- Open ComfyUI Manager
- Search for "LongCat-AudioDiT"
- Click Install
- Restart ComfyUI
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git
cd ComfyUI-LongCat-AudioDIT-TTS
pip install -r requirements.txt
```

| Node | Description |
|---|---|
| LongCat AudioDiT TTS | Text-to-speech synthesis |
| LongCat AudioDiT Voice Clone TTS | Voice cloning from reference audio |
| LongCat AudioDiT Multi-Speaker TTS | Multi-speaker conversation synthesis |
1. Download Model
   - Models are auto-downloaded from HuggingFace on first use
   - Or manually place in `ComfyUI/models/audiodit/`
2. Text-to-Speech
   - Add `LongCat AudioDiT TTS` node
   - Enter your text
   - Run!
3. Voice Cloning
   - Add `LongCat AudioDiT Voice Clone TTS` node
   - Connect reference audio (3-15 seconds recommended)
   - Optionally provide transcript of the reference audio
   - Enter text to synthesize in the cloned voice
   - Run!
4. Multi-Speaker
   - Add `LongCat AudioDiT Multi-Speaker TTS` node
   - Set number of speakers
   - Connect reference audio for each speaker
   - Use `[speaker_1]:`, `[speaker_2]:` tags in text
   - Run!
Basic text-to-speech synthesis.
Inputs:
- `model_path`: Model selection (auto-downloads on first use)
- `text`: Text to synthesize
- `steps`: Number of ODE Euler steps (4-64, default 16)
- `guidance_strength`: CFG/APG guidance strength (0-10, default 4.0)
- `guidance_method`: Guidance method (`cfg` or `apg`)
- `device`: Compute device (`auto`, `cuda`, `cpu`, `mps`)
- `dtype`: Model precision (`auto`, `bf16`, `fp16`, `fp32`)
- `attention`: Attention implementation (`auto`, `sdpa`, `sage_attention`, `flash_attention`)
- `seed`: Random seed (0 = random)
- `keep_model_loaded`: Keep model in memory between runs
Outputs:
- `audio`: Generated speech (AUDIO)
Voice cloning from reference audio.
Inputs:
- All inputs from TTS, plus:
- `prompt_audio`: Reference audio to clone (AUDIO)
- `prompt_text`: Transcript of reference audio (improves quality)
Outputs:
- `audio`: Generated speech in cloned voice (AUDIO)
Multi-speaker conversation synthesis with dynamic speaker inputs (ComfyUI v3) or fixed slots (v2 fallback).
Inputs:
- All inputs from Voice Clone TTS, plus:
- `num_speakers`: Number of speakers (2-10)
- `speaker_N_audio`: Reference audio for each speaker
- `speaker_N_ref_text`: Transcript for each speaker's reference audio
- `pause_after_speaker`: Seconds of silence between speakers (default 0.4)
Text Format:
```
[speaker_1]: Hello, I'm speaker one.
[speaker_2]: And I'm speaker two!
```
Outputs:
- `audio`: Generated multi-speaker conversation (AUDIO)
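Before synthesis, a script in this format has to be split into per-speaker segments. A minimal sketch of how such tags could be parsed (illustrative only, not the node's actual implementation):

```python
import re

# Matches "[speaker_N]:" and captures the speaker index and the line's text.
TAG = re.compile(r"\[speaker_(\d+)\]:\s*(.*)")

def parse_script(script: str) -> list[tuple[int, str]]:
    """Return (speaker_index, text) pairs in script order."""
    segments = []
    for line in script.strip().splitlines():
        match = TAG.match(line.strip())
        if match:
            segments.append((int(match.group(1)), match.group(2).strip()))
    return segments
```

Each segment is then synthesized in the corresponding speaker's cloned voice, with `pause_after_speaker` seconds of silence inserted between segments.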
| Parameter | Description | Recommended |
|---|---|---|
| steps | ODE Euler solver steps | 16 (balanced), 32 (higher quality) |
| guidance_strength | Guidance scale | 4.0 (balanced), higher = more guidance |
| guidance_method | Guidance algorithm | cfg (TTS), apg (voice clone) |
| dtype | Model precision | auto or bf16 (recommended) |
| attention | Attention backend | auto (default), sage_attention (fastest) |
| keep_model_loaded | Cache model between runs | True for repeated use |
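To make `steps` and `guidance_strength` concrete: each Euler step moves the latent along the model's predicted velocity, and classifier-free guidance mixes conditional and unconditional predictions. A toy numpy sketch of the scheme (the velocity function is a stand-in for the real DiT forward pass):

```python
import numpy as np

def euler_cfg_sample(velocity_fn, x, steps=16, guidance=4.0):
    """Toy Euler ODE integration with classifier-free guidance.

    velocity_fn(x, t, conditional) stands in for the DiT forward pass.
    """
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        v_cond = velocity_fn(x, t, conditional=True)
        v_uncond = velocity_fn(x, t, conditional=False)
        # CFG: push the prediction away from the unconditional branch.
        v = v_uncond + guidance * (v_cond - v_uncond)
        x = x + dt * v  # one Euler step
        t += dt
    return x
```

More `steps` means a smaller `dt` and a finer integration (hence the quality gain at 32); a higher `guidance_strength` weights the text condition more heavily.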
Note on precision: FP16 runs the transformer in float16 while keeping the text encoder in BF16 (UMT5 layer_norm overflows in fp16) and accumulating ODE steps in fp32. FP8 models are dequantized to BF16 during loading. For best results, use `auto` or `bf16`.
⚠️ FP16 does NOT support voice cloning: Voice Clone TTS and Multi-Speaker TTS nodes automatically upgrade FP16 to BF16. This is required because the latent conditioning path (encoding the reference audio) causes numerical overflow in FP16, resulting in NaN values that cascade through the ODE solver and produce silent output. Basic TTS works with FP16 because it uses zero latents for conditioning. If you select FP16 for voice cloning, you'll see a warning and the node will automatically use BF16 instead.
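This overflow failure mode is easy to reproduce with numpy's float16, whose largest finite value is 65504: anything beyond that becomes `inf`, and arithmetic on `inf` (such as `inf - inf`) yields NaN, which is exactly how NaNs cascade through the solver:

```python
import numpy as np

x = np.float16(70000.0)     # exceeds float16 max (65504) -> inf
print(x)                    # inf
print(np.isinf(x))          # True
print(x - x)                # inf - inf -> nan
print(np.float32(70000.0))  # the same value is fine in fp32
```

BF16 avoids this because it keeps float32's 8-bit exponent (same dynamic range) while giving up mantissa precision instead.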
```
ComfyUI/
├── models/
│   └── audiodit/
│       ├── LongCat-AudioDiT-3.5B/       # FP32 model (auto-downloaded)
│       ├── LongCat-AudioDiT-3.5B-bf16/  # BF16 model (auto-downloaded)
│       └── LongCat-AudioDiT-3.5B-fp8/   # FP8 model (auto-downloaded)
└── custom_nodes/
    └── ComfyUI-LongCat-AudioDIT-TTS/
        ├── __init__.py                  # Node registration + auto-install
        ├── pyproject.toml
        ├── requirements.txt
        ├── nodes/
        │   ├── tts_node.py              # TTS node
        │   ├── voice_clone_node.py      # Voice clone node
        │   ├── multi_speaker_node.py    # Multi-speaker node
        │   ├── loader.py                # Model loading + attention patching
        │   └── model_cache.py           # Caching + offload logic
        └── audiodit/
            ├── __init__.py              # AutoConfig/AutoModel registration
            ├── modeling_audiodit.py     # AudioDiT model implementation
            ├── configuration_audiodit.py # Config class
            ├── fp8_linear.py            # FP8 layer (unused, kept for reference)
            └── utils.py
```
Click to expand troubleshooting guide
Manually download from HuggingFace:
```bash
pip install -U huggingface_hub

# 1B model (smallest)
huggingface-cli download meituan-longcat/LongCat-AudioDiT-1B --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-1B

# BF16 model (recommended)
huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-bf16 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-bf16

# FP32 model
huggingface-cli download meituan-longcat/LongCat-AudioDiT-3.5B --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B

# FP8 model
huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-fp8 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-fp8
```

All required packages are auto-installed on first startup. If the node fails to load, restart ComfyUI once — the installer runs before nodes register. If it still fails after a restart, install manually:
```bash
pip install -r requirements.txt
```

FP8 models are dequantized to BF16 during loading — the 598 FP8 tensors are converted using per-tensor scales from `fp8_scales.json`. FP16 works correctly for basic TTS with fp32 ODE accumulation.
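Per-tensor dequantization is just a multiply by the stored scale. A simplified sketch with float32 numpy arrays standing in for the FP8 storage format (real FP8 tensors are stored as `float8_e4m3`; the helper names here are illustrative, not this repo's API):

```python
import numpy as np

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate high-precision tensor: w ~= q * scale."""
    return q.astype(np.float32) * scale

def quantize(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Per-tensor quantization: pick a scale so values fit the FP8 range."""
    fp8_max = 448.0  # largest finite value of float8_e4m3
    scale = float(np.abs(w).max()) / fp8_max
    return w / scale, scale  # the actual cast to FP8 storage is omitted here

w = np.array([0.5, -2.0, 1.25], dtype=np.float32)
q, scale = quantize(w)
w_back = dequantize(q, scale)
```

In the real model the rounding to 8-bit storage loses precision, which is why the BF16 repack is preferred when download size is not a concern.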
Important: FP16 is not supported for Voice Clone TTS and Multi-Speaker TTS. These nodes automatically upgrade FP16 to BF16 because the latent conditioning from reference audio causes numerical overflow in FP16, leading to NaN propagation and silent output. You will see a warning message if you select FP16 for these nodes.
- Use the bf16 model (~7GB VRAM)
- Set `keep_model_loaded=False`
- Enable `offload_to_cpu` — model moves to CPU between runs
Clicking "Free model and node cache" in ComfyUI always fully unloads the model regardless of keep_model_loaded. That setting only controls auto-offload behavior between generation runs.
- Ensure `dtype` is set to `auto` or `bf16` for best quality
- The VAE audio decoder is always kept in BF16 on CUDA for audio quality
- FP16 and FP8 are supported but `bf16` or `auto` gives the most consistent results
If your reference audio is too loud (clipping) or has inconsistent volume, the voice clone quality will suffer. Normalize your reference audio to -3dB to -6dB peak before using it:
```python
# Quick normalization script
import numpy as np
import librosa
import soundfile as sf

audio, sr = librosa.load("reference.wav", sr=24000, mono=True)
peak = np.abs(audio).max()  # absolute peak, so negative excursions count too
target_peak = 0.5  # roughly -6 dB full scale
audio = audio * (target_peak / peak)
sf.write("reference_normalized.wav", audio, sr)
```

- 1B Model: meituan-longcat/LongCat-AudioDiT-1B
- Original Model (FP32): meituan-longcat/LongCat-AudioDiT-3.5B
- BF16 Model: drbaph/LongCat-AudioDiT-3.5B-bf16
- FP8 Model: drbaph/LongCat-AudioDiT-3.5B-fp8
- Original Repository: meituan-longcat/LongCat-AudioDiT
MIT License. See LICENSE for details.