ComfyUI custom nodes for LongCat-AudioDiT — Diffusion-based Zero-Shot Text-to-Speech
LongCat-AudioDiT is a diffusion-based text-to-speech model by Meituan that generates high-quality speech audio using a DiT (Diffusion Transformer) architecture with an ODE Euler solver. It supports zero-shot voice cloning from reference audio without any fine-tuning.
This ComfyUI wrapper provides native node-based integration with:
- Zero-shot TTS from text input
- Voice cloning from reference audio (3-15 seconds recommended)
- Multi-speaker conversation synthesis with multiple cloned voices
- 24kHz output at broadcast quality
⚠️ Audio Generation Duration: The model supports generating up to 60 seconds of audio, but longer outputs may have issues with word repetition or dropped words. For best results, keep generated audio in the 15-30 second range. (Note: This refers to the output audio duration, not the input reference audio.)
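A practical way to stay in that 15-30 second sweet spot is to split long scripts into sentence-aligned chunks and synthesize each chunk separately. A minimal sketch (the character budget is an illustrative assumption, not a parameter of these nodes):

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text into sentence-aligned chunks below max_chars.

    ~300 characters is a rough proxy for 15-30 seconds of speech;
    tune it for your speaking rate.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Feed each chunk to the TTS node in turn and concatenate the resulting audio clips.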
- Zero-Shot Voice Cloning — Clone any voice from a short reference audio clip
- Multi-Speaker TTS — Generate conversations with multiple cloned voices using `[speaker_N]:` tags
- Diffusion-Based Generation — DiT transformer with ODE Euler solver for high-quality audio
- FP8 / FP16 / BF16 / FP32 Support — FP8 models are auto-dequantized to BF16; FP16 works for TTS only (voice cloning requires BF16)
- Native ComfyUI Integration — AUDIO noodle inputs, progress bars, interruption support
- Smart Auto-Download — Model weights auto-downloaded from HuggingFace on first use
- Smart Caching — Optional model caching with CPU offload between runs
- Optimized Attention — Support for SDPA, SageAttention backends
- Auto-Install Dependencies — Missing packages are installed automatically on startup
- GPU: NVIDIA GPU with 8GB+ VRAM for bf16 model, 16GB+ VRAM for fp32 model
- CPU: Supported but slow
- MPS: Experimental
- Python: 3.10+
- CUDA: 11.8+ (for GPU inference)
| Model | VRAM | Description |
|---|---|---|
| LongCat-AudioDiT-1B | ~6-8GB | 1B params (FP32) — smallest model, lower VRAM |
| LongCat-AudioDiT-3.5B-bf16 | ~10-14GB | BF16 quantized (recommended) — best balance of quality and VRAM |
| LongCat-AudioDiT-3.5B-fp8 | ~8-12GB | FP8 quantized — smallest download, dequantized to BF16 at load time |
| LongCat-AudioDiT-3.5B | ~20GB | FP32 original — highest quality, requires more VRAM |
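The VRAM figures follow roughly from parameter count times bytes per parameter, plus overhead for activations, the text encoder, and the audio decoder. A back-of-the-envelope check (the overhead split is a rough guess, not a measured breakdown):

```python
def weight_gb(params_billions: float, bytes_per_param: int) -> float:
    """Approximate weight memory in GB (using 1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_gb(3.5, 2))  # 3.5B params in BF16 (2 bytes/param) -> 7.0 GB of weights
print(weight_gb(3.5, 4))  # 3.5B params in FP32 (4 bytes/param) -> 14.0 GB of weights
print(weight_gb(1.0, 4))  # 1B params in FP32 -> 4.0 GB of weights
```

The table's higher numbers account for everything beyond the raw transformer weights.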
Models are auto-downloaded from HuggingFace on first use:
- meituan-longcat/LongCat-AudioDiT-1B — 1B params model
- meituan-longcat/LongCat-AudioDiT-3.5B — original FP32 model
- drbaph/LongCat-AudioDiT-3.5B-bf16 — BF16 quantized
- drbaph/LongCat-AudioDiT-3.5B-fp8 — FP8 quantized
Click to expand installation methods
- Open ComfyUI Manager
- Search for "LongCat-AudioDiT"
- Click Install
- Restart ComfyUI
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/saganaki22/ComfyUI-LongCat-AudioDIT-TTS.git
cd ComfyUI-LongCat-AudioDIT-TTS
pip install -r requirements.txt
```

| Node | Description |
|---|---|
| LongCat AudioDiT TTS | Text-to-speech synthesis |
| LongCat AudioDiT Voice Clone TTS | Voice cloning from reference audio |
| LongCat AudioDiT Multi-Speaker TTS | Multi-speaker conversation synthesis |
1. Download Model
   - Models are auto-downloaded from HuggingFace on first use
   - Or manually place in `ComfyUI/models/audiodit/`
2. Text-to-Speech
   - Add `LongCat AudioDiT TTS` node
   - Enter your text
   - Run!
3. Voice Cloning
   - Add `LongCat AudioDiT Voice Clone TTS` node
   - Connect reference audio (3-15 seconds recommended)
   - Optionally provide transcript of the reference audio
   - Enter text to synthesize in the cloned voice
   - Run!
4. Multi-Speaker
   - Add `LongCat AudioDiT Multi-Speaker TTS` node
   - Set number of speakers
   - Connect reference audio for each speaker
   - Use `[speaker_1]:`, `[speaker_2]:` tags in text
   - Run!
Basic text-to-speech synthesis.
Inputs:
- `model_path`: Model selection (auto-downloads on first use)
- `text`: Text to synthesize
- `steps`: Number of ODE Euler steps (4-64, default 16)
- `guidance_strength`: CFG/APG guidance strength (0-10, default 4.0)
- `guidance_method`: Guidance method (`cfg` or `apg`)
- `device`: Compute device (`auto`, `cuda`, `cpu`, `mps`)
- `dtype`: Model precision (`auto`, `bf16`, `fp16`, `fp32`)
- `attention`: Attention implementation (`auto`, `sdpa`, `sage_attention`, `flash_attention`)
- `seed`: Random seed (0 = random)
- `keep_model_loaded`: Keep model in memory between runs
Outputs:
- `audio`: Generated speech (AUDIO)
Voice cloning from reference audio.
Inputs:
- All inputs from TTS, plus:
- `prompt_audio`: Reference audio to clone (AUDIO)
- `prompt_text`: Transcript of reference audio (improves quality)
Outputs:
- `audio`: Generated speech in cloned voice (AUDIO)
Multi-speaker conversation synthesis with dynamic speaker inputs (ComfyUI v3) or fixed slots (v2 fallback).
Inputs:
- All inputs from Voice Clone TTS, plus:
- `num_speakers`: Number of speakers (2-10)
- `speaker_N_audio`: Reference audio for each speaker
- `speaker_N_ref_text`: Transcript for each speaker's reference audio
- `pause_after_speaker`: Seconds of silence between speakers (default 0.4)
Text Format:
```
[speaker_1]: Hello, I'm speaker one.
[speaker_2]: And I'm speaker two!
```
Outputs:
- `audio`: Generated multi-speaker conversation (AUDIO)
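Before synthesis, a script in this format has to be split into per-speaker segments. A minimal sketch of how such tags could be parsed (illustrative only, not the node's actual implementation):

```python
import re

# Matches "[speaker_N]:" and captures the speaker index and the line's text.
TAG = re.compile(r"\[speaker_(\d+)\]:\s*(.*)")

def parse_script(script: str) -> list[tuple[int, str]]:
    """Return (speaker_index, text) pairs in script order."""
    segments = []
    for line in script.strip().splitlines():
        match = TAG.match(line.strip())
        if match:
            segments.append((int(match.group(1)), match.group(2).strip()))
    return segments
```

Each segment is then synthesized in the corresponding speaker's cloned voice, with `pause_after_speaker` seconds of silence inserted between segments.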
| Parameter | Description | Recommended |
|---|---|---|
| steps | ODE Euler solver steps | 16 (balanced), 32 (higher quality) |
| guidance_strength | Guidance scale | 4.0 (balanced), higher = more guidance |
| guidance_method | Guidance algorithm | cfg (TTS), apg (voice clone) |
| dtype | Model precision | auto or bf16 (recommended) |
| attention | Attention backend | auto (default), sage_attention (fastest) |
| keep_model_loaded | Cache model between runs | True for repeated use |
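To make `steps` and `guidance_strength` concrete: each Euler step moves the latent along the model's predicted velocity, and classifier-free guidance mixes conditional and unconditional predictions. A toy numpy sketch of the scheme (the velocity function is a stand-in for the real DiT forward pass):

```python
import numpy as np

def euler_cfg_sample(velocity_fn, x, steps=16, guidance=4.0):
    """Toy Euler ODE integration with classifier-free guidance.

    velocity_fn(x, t, conditional) stands in for the DiT forward pass.
    """
    dt = 1.0 / steps
    t = 0.0
    for _ in range(steps):
        v_cond = velocity_fn(x, t, conditional=True)
        v_uncond = velocity_fn(x, t, conditional=False)
        # CFG: push the prediction away from the unconditional branch.
        v = v_uncond + guidance * (v_cond - v_uncond)
        x = x + dt * v  # one Euler step
        t += dt
    return x
```

More `steps` means a smaller `dt` and a finer integration (hence the quality gain at 32); a higher `guidance_strength` weights the text condition more heavily.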
Note on precision: FP16 runs the transformer in float16 while keeping the text encoder in BF16 (UMT5 layer_norm overflows in fp16) and accumulating ODE steps in fp32. FP8 models are dequantized to BF16 during loading. For best results, use `auto` or `bf16`.
⚠️ FP16 does NOT support voice cloning: Voice Clone TTS and Multi-Speaker TTS nodes automatically upgrade FP16 to BF16. This is required because the latent conditioning path (encoding the reference audio) causes numerical overflow in FP16, resulting in NaN values that cascade through the ODE solver and produce silent output. Basic TTS works with FP16 because it uses zero latents for conditioning. If you select FP16 for voice cloning, you'll see a warning and the node will automatically use BF16 instead.
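This overflow failure mode is easy to reproduce with numpy's float16, whose largest finite value is 65504: anything beyond that becomes `inf`, and arithmetic on `inf` (such as `inf - inf`) yields NaN, which is exactly how NaNs cascade through the solver:

```python
import numpy as np

x = np.float16(70000.0)     # exceeds float16 max (65504) -> inf
print(x)                    # inf
print(np.isinf(x))          # True
print(x - x)                # inf - inf -> nan
print(np.float32(70000.0))  # the same value is fine in fp32
```

BF16 avoids this because it keeps float32's 8-bit exponent (same dynamic range) while giving up mantissa precision instead.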
```
ComfyUI/
├── models/
│   └── audiodit/
│       ├── LongCat-AudioDiT-3.5B/       # FP32 model (auto-downloaded)
│       ├── LongCat-AudioDiT-3.5B-bf16/  # BF16 model (auto-downloaded)
│       └── LongCat-AudioDiT-3.5B-fp8/   # FP8 model (auto-downloaded)
└── custom_nodes/
    └── ComfyUI-LongCat-AudioDIT-TTS/
        ├── __init__.py                  # Node registration + auto-install
        ├── pyproject.toml
        ├── requirements.txt
        ├── nodes/
        │   ├── tts_node.py              # TTS node
        │   ├── voice_clone_node.py      # Voice clone node
        │   ├── multi_speaker_node.py    # Multi-speaker node
        │   ├── loader.py                # Model loading + attention patching
        │   └── model_cache.py           # Caching + offload logic
        └── audiodit/
            ├── __init__.py              # AutoConfig/AutoModel registration
            ├── modeling_audiodit.py     # AudioDiT model implementation
            ├── configuration_audiodit.py # Config class
            ├── fp8_linear.py            # FP8 layer (unused, kept for reference)
            └── utils.py
```
Click to expand troubleshooting guide
Manually download from HuggingFace:
```bash
pip install -U huggingface_hub

# 1B model (smallest)
huggingface-cli download meituan-longcat/LongCat-AudioDiT-1B --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-1B

# BF16 model (recommended)
huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-bf16 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-bf16

# FP32 model
huggingface-cli download meituan-longcat/LongCat-AudioDiT-3.5B --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B

# FP8 model
huggingface-cli download drbaph/LongCat-AudioDiT-3.5B-fp8 --local-dir ComfyUI/models/audiodit/LongCat-AudioDiT-3.5B-fp8
```

All required packages are auto-installed on first startup. If the node fails to load, restart ComfyUI once — the installer runs before nodes register. If it still fails after a restart, install manually:
```bash
pip install -r requirements.txt
```

FP8 models are dequantized to BF16 during loading — the 598 FP8 tensors are converted using per-tensor scales from `fp8_scales.json`. FP16 works correctly for basic TTS with fp32 ODE accumulation.
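Per-tensor dequantization is just a multiply by the stored scale. A simplified sketch with float32 numpy arrays standing in for the FP8 storage format (real FP8 tensors are stored as `float8_e4m3`; the helper names here are illustrative, not this repo's API):

```python
import numpy as np

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate high-precision tensor: w ~= q * scale."""
    return q.astype(np.float32) * scale

def quantize(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Per-tensor quantization: pick a scale so values fit the FP8 range."""
    fp8_max = 448.0  # largest finite value of float8_e4m3
    scale = float(np.abs(w).max()) / fp8_max
    return w / scale, scale  # the actual cast to FP8 storage is omitted here

w = np.array([0.5, -2.0, 1.25], dtype=np.float32)
q, scale = quantize(w)
w_back = dequantize(q, scale)
```

In the real model the rounding to 8-bit storage loses precision, which is why the BF16 repack is preferred when download size is not a concern.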
Important: FP16 is not supported for Voice Clone TTS and Multi-Speaker TTS. These nodes automatically upgrade FP16 to BF16 because the latent conditioning from reference audio causes numerical overflow in FP16, leading to NaN propagation and silent output. You will see a warning message if you select FP16 for these nodes.
- Use the bf16 model (~7GB VRAM)
- Set `keep_model_loaded=False`
- Enable `offload_to_cpu` — model moves to CPU between runs
Clicking "Free model and node cache" in ComfyUI always fully unloads the model regardless of keep_model_loaded. That setting only controls auto-offload behavior between generation runs.
- Ensure `dtype` is set to `auto` or `bf16` for best quality
- The VAE audio decoder is always kept in BF16 on CUDA for audio quality
- FP16 and FP8 are supported but `bf16` or `auto` gives the most consistent results
If your reference audio is too loud (clipping) or has inconsistent volume, the voice clone quality will suffer. Normalize your reference audio to -3dB to -6dB peak before using it:
```python
# Quick normalization script
import numpy as np
import librosa
import soundfile as sf

audio, sr = librosa.load("reference.wav", sr=24000, mono=True)
peak = np.abs(audio).max()  # absolute peak, so negative excursions count too
target_peak = 0.5  # roughly -6 dB full scale
audio = audio * (target_peak / peak)
sf.write("reference_normalized.wav", audio, sr)
```

- 1B Model: meituan-longcat/LongCat-AudioDiT-1B
- Original Model (FP32): meituan-longcat/LongCat-AudioDiT-3.5B
- BF16 Model: drbaph/LongCat-AudioDiT-3.5B-bf16
- FP8 Model: drbaph/LongCat-AudioDiT-3.5B-fp8
- Original Repository: meituan-longcat/LongCat-AudioDiT
MIT License. See LICENSE for details.