A comprehensive web-based interface for Higgs Audio V2, featuring advanced text-to-speech capabilities with voice cloning, multi-speaker support, and background music generation.
This repository has been modified for Pinokio installer compatibility. The model paths have been changed from:
- bosonai/higgs-audio-v2-generation-3B-base → models/higgs-audio-v2-generation-3B-base
- bosonai/higgs-audio-v2-tokenizer → models/higgs-audio-v2-tokenizer
If you're not using Pinokio, you'll need to change these paths back to the original HuggingFace hub paths in gradio_interface.py lines 31-32.
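One way to avoid editing the paths by hand is to fall back to the hub ID whenever the local Pinokio directory is absent. The following is a minimal sketch under that assumption, not code taken from gradio_interface.py; `resolve_model_path` and the `local_root` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Hub IDs as listed in the path mapping above.
HUB_MODEL = "bosonai/higgs-audio-v2-generation-3B-base"
HUB_TOKENIZER = "bosonai/higgs-audio-v2-tokenizer"


def resolve_model_path(hub_id: str, local_root: str = "models") -> str:
    """Return the local Pinokio path if it exists, else the hub ID.

    Hypothetical helper: the local directory name is assumed to be the
    last component of the hub ID, matching the mapping above.
    """
    local_dir = Path(local_root) / hub_id.split("/")[-1]
    return str(local_dir) if local_dir.is_dir() else hub_id


MODEL_PATH = resolve_model_path(HUB_MODEL)
AUDIO_TOKENIZER_PATH = resolve_model_path(HUB_TOKENIZER)
```

With this approach the same file would work under Pinokio (where `models/` exists) and in a manual installation (where it does not).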
- Expressive Speech Generation: Convert text to natural, expressive speech
- Voice Cloning: Clone voices using reference audio samples
- Multi-Speaker Dialogues: Generate conversations with different speakers
- Background Music Generation: Add music to speech using special tags
- Advanced Scene Control: Customize audio environment and recording conditions
- Template System: Pre-configured templates for different TTS modes
- Extended Temperature Range: 0.0-1.5 for fine-tuned creativity control
- RAS (Repetition Avoidance Sampling): Prevents repetitive output
- Custom Stop Strings: Control generation termination
- Advanced Sampling Parameters: Top-p, top-k, and token settings
- Real-time Audio Playback: Listen to generated speech directly in the browser
- Voice Preset Library: Pre-loaded voice samples for quick cloning
- Template-based Examples: Smart voice, voice cloning, multi-speaker, BGM, and more
- Custom Reference Audio: Upload your own audio for voice cloning
- Voice Sample Preview: Listen to voice presets before selection
- Enhanced Theme: Professional UI with custom styling
- Easy-to-use Interface: Clean, intuitive web interface built with Gradio
Higgs Audio V2 is a 3.6B parameter audio foundation model that:
- Was trained on 10M+ hours of diverse audio data
- Achieves 75.7% win rate over GPT-4o-mini-TTS on emotional speech
- Supports multilingual multi-speaker dialogues
- Can generate speech with background music and prosody adaptation
- Requires no fine-tuning for high-quality results
- Python 3.8 or higher
- CUDA-compatible GPU (recommended) or CPU
- At least 8GB RAM (16GB+ recommended for GPU)
- Clone or download this repository:
    git clone <your-repo-url>
    cd "Higgs Audio V2"
- Create and activate virtual environment:
    # On Windows:
    python -m venv higgs_audio_env
    higgs_audio_env\Scripts\activate
    # On macOS/Linux:
    python3 -m venv higgs_audio_env
    source higgs_audio_env/bin/activate
- Install PyTorch with CUDA support (recommended):
    # For CUDA 12.8 (latest):
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
    # For CUDA 12.6:
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
    # For CUDA 11.8:
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    # CPU-only (if no GPU):
    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
- Clone and install the Higgs Audio package:
    # Clone the official repository
    git clone https://github.com/boson-ai/higgs-audio.git temp_higgs
    cd temp_higgs
    # Install dependencies and the package
    pip install -r requirements.txt
    pip install -e .
    cd ..
- Install interface dependencies:
    pip install -r requirements.txt
- Authenticate with HuggingFace (Required):
    # Install huggingface-hub if not already installed
    pip install huggingface-hub
    # Login to HuggingFace to access the models
    hf auth login
  When prompted, enter your HuggingFace Access Token. You can create one at: https://huggingface.co/settings/tokens
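- Clean up (optional):
    # Remove the temporary clone directory
    rmdir /s temp_higgs # Windows
    # rm -rf temp_higgs # macOS/Linux
  Note: the package above was installed in editable mode (pip install -e .), which imports the code directly from temp_higgs. Deleting the directory will break the import unless you reinstall without -e (pip install .) first.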
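After the installation steps above, it can help to confirm that the expected packages are importable from the active environment before launching anything. The sketch below only looks the modules up without importing them. The module list is an assumption based on the packages this README mentions, not a list taken from requirements.txt, and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# Assumed top-level module names; adjust to match your requirements.txt.
REQUIRED = ["torch", "torchaudio", "gradio", "boson_multimodal", "huggingface_hub"]


def missing_packages(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All expected packages were found.")
```

If `boson_multimodal` is reported missing, the editable install from the Higgs Audio repository is the first thing to re-check.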
- Activate the virtual environment:
    # On Windows:
    higgs_audio_env\Scripts\activate
    # On macOS/Linux:
    source higgs_audio_env/bin/activate
- Ensure HuggingFace authentication (if not done during installation):
    hf auth login
- Start the Gradio interface:
    python gradio_interface.py
- Open your browser and navigate to http://localhost:7860
- Initialize the model by clicking the "Initialize Model" button (first-time setup may take a few minutes to download models)
- Generate speech:
  - Enter your text in the "Text to speak" field
  - Optionally customize the scene description
  - Adjust advanced settings if needed
  - Click "Generate Speech"
  - Listen to the generated audio in the output player
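If the browser shows nothing at http://localhost:7860, a quick check is whether anything is listening on that port yet. The sketch below is a generic TCP probe, not part of the interface; `is_port_open` is a hypothetical helper and the default port is taken from the URL above.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("The interface is reachable on port 7860.")
    else:
        print("Nothing is listening on port 7860 yet.")
```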
- Text Input: Enter the text you want to convert to speech
- Scene Description: Describe the recording environment (e.g., "quiet room", "outdoor park")
- Generate Speech Button: Process the text and create audio
- Temperature (0.1-1.0): Controls creativity/randomness in generation
- Top-p (0.1-1.0): Nucleus sampling parameter for token selection
- Top-k (1-100): Limits token selection to top k choices
- Max Tokens (256-2048): Maximum length of generated audio sequence
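To make the settings above concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p are commonly combined when choosing the next token. It illustrates the general technique only and is not the sampler used inside Higgs Audio; the function name and the default values are assumptions.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Pick one token index from raw logits using temperature, top-k and top-p."""
    if temperature <= 0:
        # Treat a zero temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most likely tokens.
    probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # random.choices renormalizes the surviving weights before sampling.
    return rng.choices([i for _, i in kept], weights=[p for p, _ in kept], k=1)[0]
```

Lower temperature sharpens the distribution, while smaller top-k or top-p values shrink the candidate set, which is why all three tend to make output more consistent.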
Pre-loaded examples include:
- Factual narration
- Presentation speech
- Storytelling
- Casual conversation
- CPU: Multi-core processor (Intel i5/AMD Ryzen 5 or better)
- RAM: 8GB
- Storage: 10GB free space
- OS: Windows 10/11, macOS 10.15+, or Linux
- GPU: NVIDIA RTX 3070 or better with 8GB+ VRAM
- RAM: 16GB or more
- Storage: SSD with 20GB+ free space
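The storage requirement can be checked before the first model download. The sketch below uses only the standard library; the 10 GB threshold comes from the minimum requirements above, and the helper names are hypothetical.

```python
import shutil


def free_space_gb(path: str = ".") -> float:
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path: str = ".", required_gb: float = 10.0) -> bool:
    """Return True if at least `required_gb` GB are free at `path`."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"Free space: {free_space_gb():.1f} GB")
    print("OK" if has_enough_space() else "Less than 10 GB free")
```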
"Error loading model: argument of type 'HiggsAudioConfig' is not iterable":
- This occurs when using an incomplete installation of boson_multimodal
- Solution: Follow the updated installation steps above to clone and install from the official repository
- Make sure to activate the virtual environment before running the interface
Model fails to load:
- Ensure you have sufficient RAM/VRAM (minimum 8GB RAM, 16GB+ recommended)
- Check that CUDA is properly installed if using GPU
- Verify internet connection for model download (first run downloads ~6GB)
- Verify HuggingFace authentication: Run hf auth login and enter your access token
- Make sure you're running in the activated virtual environment
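To check whether a HuggingFace token is already available without running the login command again, a small script can look in the usual places. This sketch assumes the default layout: the `HF_TOKEN` environment variable, or a `token` file under `HF_HOME` (by default `~/.cache/huggingface`). If your setup stores the token elsewhere, the check will report nothing; `find_hf_token` is a hypothetical helper.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return "env", "file", or None depending on where a token is found.

    Assumes the default layout: the HF_TOKEN environment variable, or a
    token file under HF_HOME (default ~/.cache/huggingface).
    """
    env = os.environ if env is None else env
    if env.get("HF_TOKEN"):
        return "env"
    home = Path(home) if home is not None else Path.home()
    hf_home = Path(env.get("HF_HOME", home / ".cache" / "huggingface"))
    token_file = hf_home / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return "file"
    return None


if __name__ == "__main__":
    source = find_hf_token()
    print(f"Token found via: {source}" if source else "No token found; run hf auth login")
```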
"ModuleNotFoundError: No module named 'boson_multimodal.serve'":
- This indicates the package wasn't installed correctly or virtual environment isn't activated
- Solution:
  - Make sure virtual environment is activated: higgs_audio_env\Scripts\activate (Windows) or source higgs_audio_env/bin/activate (macOS/Linux)
  - Reinstall using the proper method described in step 4 above (Clone and install the Higgs Audio package)
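A quick way to confirm that the interpreter is actually running inside the virtual environment is to compare `sys.prefix` with `sys.base_prefix`, which differ inside a venv. This is a generic standard-library check, not something the interface does itself; `in_virtualenv` is a hypothetical helper.

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv/virtualenv environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print(f"Interpreter: {sys.executable}")
    if in_virtualenv():
        print("Virtual environment active")
    else:
        print("No virtual environment detected")
```

If the printed interpreter path does not point into higgs_audio_env, packages installed there will not be visible.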
Audio generation is slow:
- Reduce max_tokens setting (try 512 instead of 1024)
- Use CPU if GPU memory is insufficient
- Close other applications to free up resources
- First generation is always slower due to model loading
Poor audio quality:
- Adjust temperature (lower values like 0.1-0.3 for more consistent output)
- Modify scene description for better context
- Try different top-p/top-k values
- Ensure your input text ends with proper punctuation
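The punctuation tip can be automated before text is sent to the model. The helper below is a hypothetical pre-processing sketch and is not part of the interface: it appends a period when the text does not already end with a sentence-ending mark.

```python
def ensure_terminal_punctuation(text: str, default: str = ".") -> str:
    """Strip trailing whitespace and append `default` if no end mark is present."""
    stripped = text.rstrip()
    if not stripped:
        return stripped
    # Look past closing quotes or brackets when checking the last character.
    core = stripped.rstrip("\"')]")
    if core and core[-1] in ".!?":
        return stripped
    return stripped + default
```

For example, `ensure_terminal_punctuation("Hello world")` returns `"Hello world."`, while text that already ends in `!` or `?` is left unchanged.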
- First run: Model download and initialization may take 5-10 minutes (downloads ~6GB of model files)
- GPU usage: Monitor VRAM usage; the model requires ~8GB VRAM for optimal performance
- CPU fallback: The interface automatically falls back to CPU if CUDA is unavailable (much slower)
- Virtual environment: Always run the interface within the activated virtual environment
- Memory management: Close other applications if experiencing memory issues
- Generation speed: First generation is slower due to model loading; subsequent generations are faster
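To see which device will likely be used and how much VRAM is available, the following sketch queries PyTorch directly. It is an illustration of the CPU fallback idea described above, not the selection logic in gradio_interface.py; `pick_device` and `total_vram_gb` are hypothetical helpers, and both degrade gracefully when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch sees a CUDA device, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def total_vram_gb(index: int = 0):
    """Total memory of CUDA device `index` in GB, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(index).total_memory / 1e9


if __name__ == "__main__":
    print(f"Device: {pick_device()}")
    vram = total_vram_gb()
    if vram is not None:
        print(f"Total VRAM: {vram:.1f} GB")
```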
After installation, your project directory should look like this:
Higgs Audio V2/
├── gradio_interface.py # Main Gradio web interface
├── requirements.txt # Interface dependencies
├── README.md # This file
├── .gitignore # Git ignore rules
└── higgs_audio_env/ # Virtual environment (created during setup)
    ├── Scripts/ # Windows activation scripts
    ├── bin/ # macOS/Linux activation scripts
    └── Lib/ # Installed packages
The interface can be customized by modifying gradio_interface.py:
# Change default model paths
MODEL_PATH = "bosonai/higgs-audio-v2-generation-3B-base"
AUDIO_TOKENIZER_PATH = "bosonai/higgs-audio-v2-tokenizer"
# Modify server settings
demo.launch(
    share=True, # Enable public sharing
    server_name="0.0.0.0", # Allow external connections
    server_port=7860 # Change port number
)
This project uses the Higgs Audio V2 model. Please refer to the original repository for licensing information.
- Higgs Audio V2: Developed by Boson AI
- Original Repository: https://github.com/boson-ai/higgs-audio
- Model Page: https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base
For issues related to:
- Higgs Audio model: Visit the official repository
- This Gradio interface: Open an issue in this repository
- General Gradio questions: Check the Gradio documentation