
Qwen3-TTS ComfyUI Quick Start Guide

5-Minute Setup

1. Installation

cd ComfyUI/custom_nodes/
git clone https://github.com/yourusername/ComfyUI-Qwen3-TTS
cd ComfyUI-Qwen3-TTS
pip install -r requirements.txt

2. Optional: Install Flash Attention (2x faster)

pip install flash-attn --no-build-isolation

3. Restart ComfyUI

Your First Workflow

Simple TTS Generation (30 seconds)

  1. Add these nodes:

    • Qwen3 TTS Model Loader
    • Qwen3 TTS Custom Voice
    • Preview Audio (built-in)
    • Save Audio (built-in)
  2. Connect them:

    Model Loader → Custom Voice → Preview Audio → Save Audio
    
  3. Configure:

    • Model Loader: Select Qwen3-TTS-12Hz-1.7B-CustomVoice
    • Custom Voice:
      • Text: "Hello! This is my first AI voice generation."
      • Speaker: "Vivian"
      • Language: "English"
  4. Queue Prompt!

Common Workflows

Workflow 1: Multilingual Narration

Use Case: Generate narration in multiple languages with consistent voice

Nodes:

Model Loader (CustomVoice) 
  → Custom Voice #1 (English, Ryan) → Preview
  → Custom Voice #2 (Chinese, Dylan) → Preview
  → Custom Voice #3 (Japanese, Ono_Anna) → Preview

Tips:

  • Use the same model instance for all generations (faster)
  • Each speaker has a native language that gives the best quality
  • Any speaker can be used with any language

Workflow 2: Voice Cloning from File

Use Case: Clone your voice or any voice from an audio file

Nodes:

Model Loader (Base)
  → Load Audio
  → Voice Clone (File)
  → Preview Audio

Setup:

  1. Model Loader: Qwen3-TTS-12Hz-1.7B-Base
  2. Load Audio: Choose your reference audio (3+ seconds)
  3. Voice Clone:
    • Text: What you want to say
    • ref_audio_path: Path to your audio file
    • ref_text: Exact transcript of the audio
    • Language: Target language

Example:

  • ref_audio_path: /path/to/my_voice.wav
  • ref_text: "This is my natural speaking voice."
  • text: "Now I can say anything in this voice!"
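The node inputs above can be sanity-checked before you queue the prompt. Below is a minimal sketch; the `CloneRequest` class and its `validate` helper are illustrative, not part of the extension, but the field names mirror the Voice Clone (File) inputs listed above:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class CloneRequest:
    """Illustrative container mirroring the Voice Clone (File) node inputs."""
    text: str
    ref_audio_path: str
    ref_text: str
    language: str = "English"

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the request looks sane."""
        problems = []
        if not self.text.strip():
            problems.append("text is empty")
        if not self.ref_text.strip():
            # The most common cloning mistake: a missing transcript.
            problems.append("ref_text is empty -- cloning needs the exact transcript")
        if Path(self.ref_audio_path).suffix.lower() not in {".wav", ".mp3", ".flac"}:
            problems.append(f"unexpected audio extension: {self.ref_audio_path}")
        return problems

req = CloneRequest(
    text="Now I can say anything in this voice!",
    ref_audio_path="/path/to/my_voice.wav",
    ref_text="This is my natural speaking voice.",
)
print(req.validate())  # []
```

Catching an empty `ref_text` here is worth it: forgetting the transcript is listed under Common Mistakes below.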

Workflow 3: Custom Character Voice Design

Use Case: Create unique character voices for storytelling/games

Nodes:

Model Loader (VoiceDesign)
  → Voice Design
  → Preview Audio

Character Examples:

Grumpy Old Wizard:

Text: "Back in my day, we didn't have fancy magic wands!"
Description: "Elderly male voice, gruff and gravelly, annoyed tone with slight wheeze"
Language: English

Cheerful Shopkeeper:

Text: "欢迎光临!今天有特价哦!" ("Welcome! We have a special offer today!")
Description: "中年男性,热情洋溢,略带地方口音,语速稍快" ("Middle-aged male, warm and enthusiastic, slight regional accent, slightly fast pace")
Language: Chinese

Mysterious Villain:

Text: "You fools, you've walked right into my trap."
Description: "Deep male voice, smooth and sinister, speaking slowly with dramatic pauses"
Language: English
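Detailed descriptions like these follow a consistent pattern: age and gender, timbre, tone, then extra quirks. A small sketch of that pattern (the `voice_description` helper is purely illustrative, not a node in this extension):

```python
def voice_description(age: str, gender: str, timbre: str, tone: str, extras: str = "") -> str:
    """Illustrative helper: assemble a detailed Voice Design description.
    More specific detail generally yields a better-matched voice."""
    parts = [f"{age} {gender} voice", timbre, tone]
    if extras:
        parts.append(extras)
    return ", ".join(parts)

print(voice_description("Elderly", "male", "gruff and gravelly",
                        "annoyed tone", "slight wheeze"))
# Elderly male voice, gruff and gravelly, annoyed tone, slight wheeze
```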

Workflow 4: Batch Voice Cloning (Efficient)

Use Case: Generate multiple lines with the same cloned voice efficiently

Nodes:

Model Loader (Base)
  → Load Audio
  → Create Clone Prompt
    → Clone with Prompt #1 (Line 1)
    → Clone with Prompt #2 (Line 2)
    → Clone with Prompt #3 (Line 3)

Why This Is Better:

  • Creates the voice embedding ONCE
  • Reuses it for multiple generations
  • 3-5x faster than cloning separately
  • Perfect for audiobooks, podcasts, tutorials

Example Text List:

Welcome to this tutorial.
Today we'll learn about AI voice synthesis.
Let's get started!
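The extract-once, reuse-many pattern behind this workflow can be sketched outside ComfyUI. `build_clone_prompt` and `synthesize` below are hypothetical stand-ins for the Create Clone Prompt and Clone with Prompt nodes; the point is that the expensive embedding step runs only once:

```python
def build_clone_prompt(ref_audio_path: str, ref_text: str) -> dict:
    """Stand-in for 'Create Clone Prompt': the expensive voice-embedding step."""
    # A real implementation would run the reference audio through the model here.
    return {"ref_audio": ref_audio_path, "ref_text": ref_text, "embedding": "..."}

def synthesize(prompt: dict, text: str) -> str:
    """Stand-in for 'Clone with Prompt': cheap per-line generation."""
    return f"<audio for {text!r} in voice {prompt['ref_audio']}>"

lines = [
    "Welcome to this tutorial.",
    "Today we'll learn about AI voice synthesis.",
    "Let's get started!",
]

prompt = build_clone_prompt("/path/to/my_voice.wav",
                            "This is my natural speaking voice.")  # built once
clips = [synthesize(prompt, line) for line in lines]  # reused for every line
print(len(clips))  # 3
```

For an audiobook with hundreds of lines, the savings from skipping repeated embedding extraction compound accordingly.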

Workflow 5: Emotion Control

Use Case: Same voice, different emotions

Nodes:

Model Loader (CustomVoice)
  → Custom Voice #1 → Preview (Happy)
  → Custom Voice #2 → Preview (Sad)
  → Custom Voice #3 → Preview (Angry)

Set up all three with the same text: "I can't believe what just happened!"

Different Instructions:

  • Happy: "Very excited and joyful"
  • Sad: "Disappointed and melancholic"
  • Angry: "Furious and intense"

Pro Tips

Speed Optimization

  1. Model Caching: Load model once, use multiple times
  2. Flash Attention: Always enable if available
  3. bfloat16: Best speed/quality balance
  4. Batch Generation: Use for multiple similar generations
  5. Clone Prompts: Reuse for same voice
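The model-caching tip above amounts to memoizing the loader on its parameters. A minimal sketch (the `load_model` function is a hypothetical stand-in for the Model Loader node, which handles caching for you):

```python
from functools import lru_cache

@lru_cache(maxsize=2)
def load_model(name: str, dtype: str = "bfloat16"):
    """Hypothetical loader: cached so repeated queue runs reuse one instance."""
    print(f"loading {name} ({dtype})")  # printed only on a cache miss
    return object()  # placeholder for the loaded model

a = load_model("Qwen3-TTS-12Hz-1.7B-CustomVoice")
b = load_model("Qwen3-TTS-12Hz-1.7B-CustomVoice")
print(a is b)  # True: the second call hits the cache
```

The same keying idea explains why changing the dtype or model name triggers a fresh load.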

Quality Optimization

  1. Clean Audio: For cloning, use clear audio with minimal background noise
  2. Accurate Transcripts: Especially important for cloning
  3. Native Languages: Use speaker's native language when possible
  4. Specific Descriptions: More detail = better voice design
  5. Punctuation: Affects rhythm and pauses

Common Mistakes to Avoid

❌ Using the VoiceDesign model with Custom Voice nodes (won't work)
❌ Forgetting ref_text when cloning
❌ Using noisy reference audio
❌ Expecting real-time generation on CPU (use a GPU!)
❌ Not caching models (reloads on every run)

Language-Specific Tips

Chinese

  • Use Dylan/Eric for regional dialects
  • Punctuation matters: 。!?
  • Works great with poetry/formal text

English

  • Ryan = energetic, Aiden = casual
  • Great with complex sentences
  • Handles slang well

Japanese

  • Ono_Anna optimized for Japanese phonetics
  • Handles kanji/kana naturally
  • Good for anime-style voices

Korean

  • Sohee provides native pronunciation
  • Handles honorifics correctly

Troubleshooting Quick Fixes

Problem: "qwen-tts not installed"

pip install qwen-tts

Problem: "Out of memory"

  • Use 0.6B model instead of 1.7B
  • Lower dtype to float16
  • Close other programs
  • Reduce batch size

Problem: "Model download slow"

# China users:
export HF_ENDPOINT=https://hf-mirror.com

Problem: "Voice quality poor"

  • Check reference audio quality (cloning)
  • Verify transcript accuracy
  • Try bfloat16 instead of float16
  • Use 1.7B instead of 0.6B

Problem: "Flash Attention error"

  • Disable in Model Loader settings
  • Or install: pip install flash-attn --no-build-isolation

Next Steps

  1. Check out README.md for full documentation
  2. Experiment with different voices and languages
  3. Try voice design with creative descriptions
  4. Share your workflows with the community!

Need Help?

  • GitHub Issues: Report bugs and request features
  • Discord: Join the ComfyUI community
  • Documentation: Qwen3-TTS Official Repo

Happy voice generating! 🎙️