Higgs is a high-quality neural audio codec that uses Residual Vector Quantization (RVQ) with 8 codebooks. This repository provides tools for training LLM-based TTS models on top of the Higgs codec.
```bash
conda create -n higgs_tts python=3.10
conda activate higgs_tts
pip install torch torchaudio transformers datasets accelerate
pip install librosa numpy huggingface-hub pyyaml pandas soundfile pydub
pip install descript-audio-codec vector-quantize-pytorch pyloudnorm
pip install wandb  # For training metrics
```

The Higgs audio tokenizer will be automatically downloaded from HuggingFace on first use:
```python
from audio_processing.higgs_audio_tokenizer import load_higgs_audio_tokenizer

tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer", device="cuda")
```

Or manually download:

```python
# The model will be cached to ~/.cache/huggingface/hub/
# No manual download needed!
```

Higgs TTS supports CSV or JSONL datasets with audio and text.
Create a CSV file with these columns:

| Column | Required | Description | Example |
|---|---|---|---|
| `audio_path` | ✅ Yes | Path to audio file | `/data/audio/sample1.wav` |
| `text` | ✅ Yes | Text transcription | Hello, how are you? |
| `speaker_name` | ✅ Yes | Speaker identifier | `speaker_1` |
| `file_id` | Optional | Unique identifier | `audio_001` |
Example CSV (`my_dataset.csv`):

```csv
audio_path,text,speaker_name
/data/audio/file1.wav,Hello everyone,female_speaker
/data/audio/file2.wav,The weather is nice today,male_speaker
/data/audio/file3.wav,Thank you very much,female_speaker
```

Create a JSONL file (one JSON object per line):
Example JSONL (`my_dataset.jsonl`):

```jsonl
{"audio_filepath": "/data/audio/file1.wav", "text": "Hello everyone", "speaker": "female_speaker"}
{"audio_filepath": "/data/audio/file2.wav", "text": "The weather is nice today", "speaker": "male_speaker"}
{"audio_filepath": "/data/audio/file3.wav", "text": "Thank you very much", "speaker": "female_speaker"}
```

Required fields:

- `audio_filepath` or `audio_path`
- `text` or `transcription`
- `speaker`
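Because either key variant is accepted, a dataset loader typically normalizes each row to canonical keys before processing. A minimal sketch of that normalization (the helper name `normalize_row` is illustrative, not part of this repo):

```python
import json

# Accepted key variants, in priority order
AUDIO_KEYS = ("audio_filepath", "audio_path")
TEXT_KEYS = ("text", "transcription")

def normalize_row(row: dict) -> dict:
    """Map a CSV/JSONL row onto canonical keys."""
    def pick(keys):
        for k in keys:
            if row.get(k):
                return row[k]
        raise KeyError(f"row is missing one of {keys}: {row}")

    return {
        "audio_path": pick(AUDIO_KEYS),
        "text": pick(TEXT_KEYS),
        "speaker": row.get("speaker") or row.get("speaker_name"),
    }

line = '{"audio_filepath": "/data/audio/file1.wav", "text": "Hello everyone", "speaker": "female_speaker"}'
print(normalize_row(json.loads(line)))
```

The same helper works for CSV rows (e.g. from `csv.DictReader`), since those also arrive as dicts.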
Audio file requirements:

- Format: WAV, MP3, FLAC (any format supported by `torchaudio`)
- Sample rate: any (auto-resampled to 16 kHz for encoding)
- Channels: mono or stereo (auto-converted to mono)
- Duration: less than 30 seconds per file recommended
- Quality: clean speech without excessive background noise
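A quick pre-flight check for the duration recommendation can be done with the standard-library `wave` module (WAV files only; the 30-second threshold mirrors the recommendation above, and the helper names are illustrative):

```python
import wave

MAX_SECONDS = 30.0  # recommended per-file limit

def wav_duration(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

def check_files(paths):
    """Yield (path, duration_seconds, within_limit) for each WAV file."""
    for p in paths:
        d = wav_duration(p)
        yield p, d, d <= MAX_SECONDS
```

For MP3/FLAC inputs you would load through `torchaudio` instead, as the pipeline itself does.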
Create `datasets_config.yaml`:

```yaml
datasets:
  - dataset_path: "/data/english_dataset.csv"
    speaker_name: "speaker_1"
    num_samples: 1000   # Optional: limit samples
    normalize: false    # Optional: normalize audio
    language: "English" # Optional: add language prefix
  - dataset_path: "/data/spanish_dataset.jsonl"
    speaker_name: "speaker_2"
    num_samples: 500
    language: "Spanish"
```

Parameters:

- `dataset_path`: Path to CSV or JSONL file
- `speaker_name`: Speaker identifier (overrides the JSONL `speaker` field if set)
- `num_samples`: (Optional) Limit the number of samples to process
- `normalize`: (Optional) Normalize audio volume (default: `false`)
- `language`: (Optional) Add a language prefix to the prompt
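A small validator for the parsed config can catch typos before the (slow) encoding run starts. This sketch assumes the YAML has already been loaded into a dict (e.g. with `yaml.safe_load`); the function name is illustrative:

```python
REQUIRED = {"dataset_path", "speaker_name"}
OPTIONAL = {"num_samples", "normalize", "language"}

def validate_datasets_config(config: dict) -> list:
    """Return the list of dataset entries, raising on missing or unknown keys."""
    entries = config.get("datasets")
    if not entries:
        raise ValueError("config must contain a non-empty 'datasets' list")
    for i, entry in enumerate(entries):
        missing = REQUIRED - entry.keys()
        unknown = entry.keys() - REQUIRED - OPTIONAL
        if missing:
            raise ValueError(f"datasets[{i}] missing keys: {sorted(missing)}")
        if unknown:
            raise ValueError(f"datasets[{i}] unknown keys: {sorted(unknown)}")
    return entries
```

Failing fast here is cheap insurance: a misspelled `num_sample` would otherwise be silently ignored.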
```bash
python create_dataset_higgs.py \
    --config datasets_config.yaml \
    --output_dir /workspace/higgs_dataset \
    --higgs_model bosonai/higgs-audio-v2-tokenizer \
    --device cuda
```

Parameters:

- `--config`: Path to the datasets config YAML
- `--output_dir`: Where to save the processed dataset
- `--higgs_model`: Higgs model path or HuggingFace repo
- `--device`: Device (`cuda` or `cpu`)
Output:

- Processed dataset saved to `output_dir`
- Metadata file with codec info and speaker statistics
- Ready for LLM training!
Before training, verify your dataset quality:
```bash
# Reconstruct a few samples from the dataset
python reconstruct_from_dataset.py \
    --dataset /workspace/higgs_dataset \
    --higgs_model bosonai/higgs-audio-v2-tokenizer \
    --output_dir reconstructed_samples \
    --num_samples 5 \
    --device cuda
```

This will:

- Extract 5 random samples from the dataset
- Decode them back to audio
- Save them to `reconstructed_samples/` for quality checking
```yaml
# Dataset
TTS_dataset: "/workspace/higgs_dataset"

# Model
model_name: "meta-llama/Llama-3.2-3B" # Base model to start from
# model_name: "./checkpoints_higgs/checkpoint-5000" # Or continue from a checkpoint

# Training Args
epochs: 3
batch_size: 1 # Adjust based on GPU memory (1-4 typical)
number_processes: 4 # Number of GPUs
pad_token: 128263
save_steps: 1000
save_total_limit: 2 # Keep only the last 2 checkpoints
learning_rate: 5.0e-5
lr_scheduler_type: "constant"

# Higgs Token Configuration
add_custom_tokens: true # true for first training, false for fine-tuning
num_codebooks: 8 # Higgs uses 8 codebooks
codebook_size: 1024 # Each codebook has 1024 codes
base_llama_tokenizer: "meta-llama/Llama-3.2-3B"

# Naming and paths
save_folder: "checkpoints_higgs_v1"
project_name: "higgs-tts"
run_name: "train-higgs-v1"
resume_from_checkpoint: false
```

| Parameter | Description | Example |
|---|---|---|
| `model_name` | Base model or checkpoint path | `"meta-llama/Llama-3.2-3B"` |
| `TTS_dataset` | Path to processed dataset | `"/workspace/higgs_dataset"` |
| Parameter | Description | Recommended |
|---|---|---|
| `batch_size` | Per-device batch size | 1-4 (adjust for GPU) |
| `number_processes` | Number of GPUs | 4 for multi-GPU |
| `epochs` | Training epochs | 3-5 for pretraining |
| `learning_rate` | Learning rate | 5.0e-5 from scratch, 1.0e-5 for fine-tuning |
| `save_steps` | Save checkpoint every N steps | 1000-5000 |
| Parameter | Description | Values |
|---|---|---|
| `add_custom_tokens` | Add Higgs tokens to model | `true` for first training, `false` for fine-tuning |
| `num_codebooks` | Number of codebooks | 8 (fixed for Higgs) |
| `codebook_size` | Codes per codebook | 1024 (fixed for Higgs) |
Single-GPU training:

```bash
cd finetune_llm
python train.py
```

Multi-GPU training:

1. Update the accelerate config (`accelerate_config.yaml`):

   ```yaml
   num_processes: 8 # Set to your number of GPUs
   ```

2. Launch training:

   ```bash
   cd finetune_llm
   bash train_multi_gpu.sh
   ```

Or manually:

```bash
cd finetune_llm
accelerate launch --config_file accelerate_config.yaml train.py
```

Training will create:

- `checkpoints_higgs_v1/` - Standard checkpoint
- `checkpoints_higgs_v1/fp16/` - FP16 model (for vLLM)
- `checkpoints_higgs_v1/bf16/` - BF16 model
- `checkpoints_higgs_v1/gguf/` - GGUF-ready config
```bash
python inference_transformers.py \
    --model ./checkpoints_higgs_v1 \
    --higgs_tokenizer bosonai/higgs-audio-v2-tokenizer \
    --text "Hello, this is a test of Higgs TTS." \
    --output output.wav \
    --voice Speaker \
    --temperature 0.4 \
    --top_p 0.9 \
    --top_k 40 \
    --max_tokens 1024 \
    --device cuda
```

Parameters:

- `--model`: Path to trained model checkpoint
- `--higgs_tokenizer`: Higgs audio tokenizer (usually keep the default)
- `--text`: Text to synthesize
- `--output`: Output audio file path
- `--voice`: Voice identifier (must match training data)
- `--temperature`: Sampling temperature (0.3-0.6 recommended)
- `--top_p`: Top-p sampling (0.9 recommended)
- `--top_k`: Top-k sampling (40 recommended)
- `--max_tokens`: Max tokens to generate (the codec produces ~400 tokens per second of audio)
- `--device`: Device (`cuda` or `cpu`)
Create a script for batch processing:

```bash
#!/bin/bash
MODEL="./checkpoints_higgs_v1"

cat texts.txt | while read line; do
    filename=$(echo "$line" | md5sum | cut -d' ' -f1)
    python inference_transformers.py \
        --model "$MODEL" \
        --text "$line" \
        --output "outputs/${filename}.wav" \
        --temperature 0.4
done
```

Higgs adds 8,202 custom tokens to the LLM:
| Token Range | Count | Purpose |
|---|---|---|
| 128257-128266 | 10 | Special tokens (START_OF_SPEECH, END_OF_SPEECH, etc.) |
| 128266-129289 | 1,024 | Codebook 0 |
| 129290-130313 | 1,024 | Codebook 1 |
| 130314-131337 | 1,024 | Codebook 2 |
| 131338-132361 | 1,024 | Codebook 3 |
| 132362-133385 | 1,024 | Codebook 4 |
| 133386-134409 | 1,024 | Codebook 5 |
| 134410-135433 | 1,024 | Codebook 6 |
| 135434-136457 | 1,024 | Codebook 7 |
Total new tokens: 10 + (8 × 1,024) = 8,202
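Following the codebook rows in the table above, the flat LLM vocabulary ID for code `c` of codebook `k` is `base + k * 1024 + c`, where the base 128266 is the first ID of codebook 0. A sketch of the mapping and its inverse (function names are illustrative):

```python
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1024
AUDIO_TOKEN_BASE = 128266  # first ID of codebook 0 (see table above)

def code_to_token_id(codebook: int, code: int) -> int:
    """Map (codebook, code) to its LLM vocabulary ID."""
    assert 0 <= codebook < NUM_CODEBOOKS and 0 <= code < CODEBOOK_SIZE
    return AUDIO_TOKEN_BASE + codebook * CODEBOOK_SIZE + code

def token_id_to_code(token_id: int) -> tuple:
    """Inverse: recover (codebook, code) from a vocabulary ID."""
    offset = token_id - AUDIO_TOKEN_BASE
    assert 0 <= offset < NUM_CODEBOOKS * CODEBOOK_SIZE
    return offset // CODEBOOK_SIZE, offset % CODEBOOK_SIZE
```

As a sanity check against the table: codebook 1 starts at 128266 + 1024 = 129290, and codebook 7 ends at 128266 + 7×1024 + 1023 = 136457.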
Training sequences follow this structure:

```
[START_OF_HUMAN] +
[BOS] + "Speaker: text" + [END_OF_TEXT] +
[END_OF_HUMAN] +
[START_OF_AI] +
[START_OF_SPEECH] +
[cb0_t0, cb1_t0, ..., cb7_t0, cb0_t1, cb1_t1, ..., cb7_t1, ...] +
[END_OF_SPEECH] +
[END_OF_AI]
```
Where audio tokens are interleaved by frame:
- Frame 0: [cb0, cb1, cb2, cb3, cb4, cb5, cb6, cb7]
- Frame 1: [cb0, cb1, cb2, cb3, cb4, cb5, cb6, cb7]
- ...
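The frame interleaving above can be sketched as a pure-Python round trip over a codes matrix of shape [num_codebooks, num_frames] (a list of lists here, for clarity; in the pipeline this would be a tensor):

```python
NUM_CODEBOOKS = 8

def interleave(codes):
    """Flatten [num_codebooks, num_frames] codes frame by frame:
    frame 0's 8 codes first, then frame 1's, and so on."""
    num_frames = len(codes[0])
    return [codes[cb][t] for t in range(num_frames) for cb in range(NUM_CODEBOOKS)]

def deinterleave(flat):
    """Inverse: rebuild the [num_codebooks, num_frames] layout."""
    assert len(flat) % NUM_CODEBOOKS == 0
    num_frames = len(flat) // NUM_CODEBOOKS
    return [[flat[t * NUM_CODEBOOKS + cb] for t in range(num_frames)]
            for cb in range(NUM_CODEBOOKS)]
```

Interleaving by frame keeps all 8 residual codes of a given 20 ms frame adjacent in the LLM's input, so the model predicts one complete frame before moving to the next.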
| Metric | Value |
|---|---|
| Sample Rate | 16,000 Hz |
| Frame Rate | 50 fps (20ms per frame) |
| Codebooks | 8 (RVQ) |
| Codebook Size | 1,024 codes each |
| Bitrate | ~1.6 kbps |
| Compression | ~150-200x |
| Tokens per Second | 400 (8 codebooks × 50 fps) |
| Codec | Codebooks | Total Tokens | Sample Rate | Frame Rate | Tokens/sec |
|---|---|---|---|---|---|
| Higgs | 8 | 8,192 | 16 kHz | 50 Hz | 400 |
| LongCat | 1+3 | 32,492 | 24 kHz | 50 Hz | 200 |
| NeuCodec | 1 (FSQ) | 65,536 | 16 kHz | 50 Hz | 50 |
| SNAC | 3 hierarchical | 12,288 | 24 kHz | 50 Hz | 150 |
Encode and decode a single file:

```bash
python higgs-encoder.py input.wav codes.npz
python higgs-decoder.py codes.npz output.wav
```

Reconstruct audio from a token ID sequence:

```bash
# Create tokens.txt with token IDs (one per line or comma-separated)
python reconstruct_single_sequence.py \
    --input_ids tokens.txt \
    --output reconstructed.wav \
    --higgs_model bosonai/higgs-audio-v2-tokenizer
```

Python API usage:

```python
import sys
sys.path.insert(0, '/path/to/higgs')

import torch
import torchaudio

from audio_processing.higgs_audio_tokenizer import load_higgs_audio_tokenizer

# Load tokenizer
tokenizer = load_higgs_audio_tokenizer(
    "bosonai/higgs-audio-v2-tokenizer",
    device="cuda",
)

# Encode audio
codes = tokenizer.encode("input.wav", sr=16000, loudness_normalize=False)
print(f"Codes shape: {codes.shape}")  # [8, num_frames]

# Decode to audio
audio = tokenizer.decode(codes.unsqueeze(0))
print(f"Audio shape: {audio.shape}")  # [1, 1, num_samples]

# Save
torchaudio.save("output.wav", audio[0], tokenizer.sampling_rate)
```