Higgs is a high-quality neural audio codec that uses Residual Vector Quantization (RVQ) with 8 codebooks. This repository provides tools for training LLM-based TTS models on top of the Higgs codec.
```bash
conda create -n higgs_tts python=3.10
conda activate higgs_tts
pip install torch torchaudio transformers datasets accelerate
pip install librosa numpy huggingface-hub pyyaml pandas soundfile pydub
pip install descript-audio-codec vector-quantize-pytorch pyloudnorm
pip install wandb  # For training metrics
```

The Higgs audio tokenizer will be automatically downloaded from HuggingFace on first use:
```python
from audio_processing.higgs_audio_tokenizer import load_higgs_audio_tokenizer

tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer", device="cuda")
```

Or manually download:

```python
# The model will be cached to ~/.cache/huggingface/hub/
# No manual download needed!
```

Higgs TTS supports CSV or JSONL datasets with audio and text.
Create a CSV file with these columns:

| Column | Required | Description | Example |
|---|---|---|---|
| `audio_path` | ✅ Yes | Path to audio file | `/data/audio/sample1.wav` |
| `text` | ✅ Yes | Text transcription | Hello, how are you? |
| `speaker_name` | ✅ Yes | Speaker identifier | `speaker_1` |
| `file_id` | Optional | Unique identifier | `audio_001` |
Example CSV (`my_dataset.csv`):

```csv
audio_path,text,speaker_name
/data/audio/file1.wav,Hello everyone,female_speaker
/data/audio/file2.wav,The weather is nice today,male_speaker
/data/audio/file3.wav,Thank you very much,female_speaker
```

Create a JSONL file (one JSON object per line):
Example JSONL (`my_dataset.jsonl`):

```jsonl
{"audio_filepath": "/data/audio/file1.wav", "text": "Hello everyone", "speaker": "female_speaker"}
{"audio_filepath": "/data/audio/file2.wav", "text": "The weather is nice today", "speaker": "male_speaker"}
{"audio_filepath": "/data/audio/file3.wav", "text": "Thank you very much", "speaker": "female_speaker"}
```

Required fields:

- `audio_filepath` or `audio_path`
- `text` or `transcription`
- `speaker`
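Because either key variant is accepted, a dataset loader typically normalizes each row to canonical keys before processing. A minimal sketch of that normalization (the helper name `normalize_row` is illustrative, not part of this repo):

```python
import json

# Accepted key variants, in priority order
AUDIO_KEYS = ("audio_filepath", "audio_path")
TEXT_KEYS = ("text", "transcription")

def normalize_row(row: dict) -> dict:
    """Map a CSV/JSONL row onto canonical keys."""
    def pick(keys):
        for k in keys:
            if row.get(k):
                return row[k]
        raise KeyError(f"row is missing one of {keys}: {row}")

    return {
        "audio_path": pick(AUDIO_KEYS),
        "text": pick(TEXT_KEYS),
        "speaker": row.get("speaker") or row.get("speaker_name"),
    }

line = '{"audio_filepath": "/data/audio/file1.wav", "text": "Hello everyone", "speaker": "female_speaker"}'
print(normalize_row(json.loads(line)))
```

The same helper works for CSV rows (e.g. from `csv.DictReader`), since those also arrive as dicts.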
Audio file requirements:

- Format: WAV, MP3, FLAC (any format supported by `torchaudio`)
- Sample rate: any (auto-resampled to 16 kHz for encoding)
- Channels: mono or stereo (auto-converted to mono)
- Duration: less than 30 seconds per file recommended
- Quality: clean speech without excessive background noise
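A quick pre-flight check for the duration recommendation can be done with the standard-library `wave` module (WAV files only; the 30-second threshold mirrors the recommendation above, and the helper names are illustrative):

```python
import wave

MAX_SECONDS = 30.0  # recommended per-file limit

def wav_duration(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as f:
        return f.getnframes() / f.getframerate()

def check_files(paths):
    """Yield (path, duration_seconds, within_limit) for each WAV file."""
    for p in paths:
        d = wav_duration(p)
        yield p, d, d <= MAX_SECONDS
```

For MP3/FLAC inputs you would load through `torchaudio` instead, as the pipeline itself does.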
Create `datasets_config.yaml`:

```yaml
datasets:
  - dataset_path: "/data/english_dataset.csv"
    speaker_name: "speaker_1"
    num_samples: 1000   # Optional: limit samples
    normalize: false    # Optional: normalize audio
    language: "English" # Optional: add language prefix
  - dataset_path: "/data/spanish_dataset.jsonl"
    speaker_name: "speaker_2"
    num_samples: 500
    language: "Spanish"
```

Parameters:

- `dataset_path`: Path to CSV or JSONL file
- `speaker_name`: Speaker identifier (overrides the JSONL `speaker` field if set)
- `num_samples`: (Optional) Limit the number of samples to process
- `normalize`: (Optional) Normalize audio volume (default: `false`)
- `language`: (Optional) Add a language prefix to the prompt
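A small validator for the parsed config can catch typos before the (slow) encoding run starts. This sketch assumes the YAML has already been loaded into a dict (e.g. with `yaml.safe_load`); the function name is illustrative:

```python
REQUIRED = {"dataset_path", "speaker_name"}
OPTIONAL = {"num_samples", "normalize", "language"}

def validate_datasets_config(config: dict) -> list:
    """Return the list of dataset entries, raising on missing or unknown keys."""
    entries = config.get("datasets")
    if not entries:
        raise ValueError("config must contain a non-empty 'datasets' list")
    for i, entry in enumerate(entries):
        missing = REQUIRED - entry.keys()
        unknown = entry.keys() - REQUIRED - OPTIONAL
        if missing:
            raise ValueError(f"datasets[{i}] missing keys: {sorted(missing)}")
        if unknown:
            raise ValueError(f"datasets[{i}] unknown keys: {sorted(unknown)}")
    return entries
```

Failing fast here is cheap insurance: a misspelled `num_sample` would otherwise be silently ignored.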
```bash
python create_dataset_higgs.py \
    --config datasets_config.yaml \
    --output_dir /workspace/higgs_dataset \
    --higgs_model bosonai/higgs-audio-v2-tokenizer \
    --device cuda
```

Parameters:

- `--config`: Path to the datasets config YAML
- `--output_dir`: Where to save the processed dataset
- `--higgs_model`: Higgs model path or HuggingFace repo
- `--device`: Device (`cuda` or `cpu`)
Output:

- Processed dataset saved to `output_dir`
- Metadata file with codec info and speaker statistics
- Ready for LLM training!
Before training, verify your dataset quality:
```bash
# Reconstruct a few samples from the dataset
python reconstruct_from_dataset.py \
    --dataset /workspace/higgs_dataset \
    --higgs_model bosonai/higgs-audio-v2-tokenizer \
    --output_dir reconstructed_samples \
    --num_samples 5 \
    --device cuda
```

This will:

- Extract 5 random samples from the dataset
- Decode them back to audio
- Save them to `reconstructed_samples/` for quality checking
```yaml
# Dataset
TTS_dataset: "/workspace/higgs_dataset"

# Model
model_name: "meta-llama/Llama-3.2-3B" # Base model to start from
# model_name: "./checkpoints_higgs/checkpoint-5000" # Or continue from a checkpoint

# Training Args
epochs: 3
batch_size: 1 # Adjust based on GPU memory (1-4 typical)
number_processes: 4 # Number of GPUs
pad_token: 128263
save_steps: 1000
save_total_limit: 2 # Keep only the last 2 checkpoints
learning_rate: 5.0e-5
lr_scheduler_type: "constant"

# Higgs Token Configuration
add_custom_tokens: true # true for first training, false for fine-tuning
num_codebooks: 8 # Higgs uses 8 codebooks
codebook_size: 1024 # Each codebook has 1024 codes
base_llama_tokenizer: "meta-llama/Llama-3.2-3B"

# Naming and paths
save_folder: "checkpoints_higgs_v1"
project_name: "higgs-tts"
run_name: "train-higgs-v1"
resume_from_checkpoint: false
```

| Parameter | Description | Example |
|---|---|---|
| `model_name` | Base model or checkpoint path | `"meta-llama/Llama-3.2-3B"` |
| `TTS_dataset` | Path to processed dataset | `"/workspace/higgs_dataset"` |
| Parameter | Description | Recommended |
|---|---|---|
| `batch_size` | Per-device batch size | 1-4 (adjust for GPU) |
| `number_processes` | Number of GPUs | 4 for multi-GPU |
| `epochs` | Training epochs | 3-5 for pretraining |
| `learning_rate` | Learning rate | 5.0e-5 from scratch, 1.0e-5 for fine-tuning |
| `save_steps` | Save checkpoint every N steps | 1000-5000 |
| Parameter | Description | Values |
|---|---|---|
| `add_custom_tokens` | Add Higgs tokens to model | `true` for first training, `false` for fine-tuning |
| `num_codebooks` | Number of codebooks | 8 (fixed for Higgs) |
| `codebook_size` | Codes per codebook | 1024 (fixed for Higgs) |
Single-GPU training:

```bash
cd finetune_llm
python train.py
```

Multi-GPU training:

1. Update the accelerate config (`accelerate_config.yaml`):

   ```yaml
   num_processes: 8 # Set to your number of GPUs
   ```

2. Launch training:

   ```bash
   cd finetune_llm
   bash train_multi_gpu.sh
   ```

Or manually:

```bash
cd finetune_llm
accelerate launch --config_file accelerate_config.yaml train.py
```

Training will create:

- `checkpoints_higgs_v1/` - Standard checkpoint
- `checkpoints_higgs_v1/fp16/` - FP16 model (for vLLM)
- `checkpoints_higgs_v1/bf16/` - BF16 model
- `checkpoints_higgs_v1/gguf/` - GGUF-ready config
```bash
python inference_transformers.py \
    --model ./checkpoints_higgs_v1 \
    --higgs_tokenizer bosonai/higgs-audio-v2-tokenizer \
    --text "Hello, this is a test of Higgs TTS." \
    --output output.wav \
    --voice Speaker \
    --temperature 0.4 \
    --top_p 0.9 \
    --top_k 40 \
    --max_tokens 1024 \
    --device cuda
```

Parameters:

- `--model`: Path to trained model checkpoint
- `--higgs_tokenizer`: Higgs audio tokenizer (usually keep the default)
- `--text`: Text to synthesize
- `--output`: Output audio file path
- `--voice`: Voice identifier (must match training data)
- `--temperature`: Sampling temperature (0.3-0.6 recommended)
- `--top_p`: Top-p sampling (0.9 recommended)
- `--top_k`: Top-k sampling (40 recommended)
- `--max_tokens`: Max tokens to generate (the codec produces ~400 tokens per second of audio)
- `--device`: Device (`cuda` or `cpu`)
Create a script for batch processing:

```bash
#!/bin/bash
MODEL="./checkpoints_higgs_v1"

cat texts.txt | while read line; do
    filename=$(echo "$line" | md5sum | cut -d' ' -f1)
    python inference_transformers.py \
        --model "$MODEL" \
        --text "$line" \
        --output "outputs/${filename}.wav" \
        --temperature 0.4
done
```

Higgs adds 8,202 custom tokens to the LLM:
| Token Range | Count | Purpose |
|---|---|---|
| 128257-128266 | 10 | Special tokens (START_OF_SPEECH, END_OF_SPEECH, etc.) |
| 128266-129289 | 1,024 | Codebook 0 |
| 129290-130313 | 1,024 | Codebook 1 |
| 130314-131337 | 1,024 | Codebook 2 |
| 131338-132361 | 1,024 | Codebook 3 |
| 132362-133385 | 1,024 | Codebook 4 |
| 133386-134409 | 1,024 | Codebook 5 |
| 134410-135433 | 1,024 | Codebook 6 |
| 135434-136457 | 1,024 | Codebook 7 |
Total new tokens: 10 + (8 × 1,024) = 8,202
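Following the codebook rows in the table above, the flat LLM vocabulary ID for code `c` of codebook `k` is `base + k * 1024 + c`, where the base 128266 is the first ID of codebook 0. A sketch of the mapping and its inverse (function names are illustrative):

```python
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 1024
AUDIO_TOKEN_BASE = 128266  # first ID of codebook 0 (see table above)

def code_to_token_id(codebook: int, code: int) -> int:
    """Map (codebook, code) to its LLM vocabulary ID."""
    assert 0 <= codebook < NUM_CODEBOOKS and 0 <= code < CODEBOOK_SIZE
    return AUDIO_TOKEN_BASE + codebook * CODEBOOK_SIZE + code

def token_id_to_code(token_id: int) -> tuple:
    """Inverse: recover (codebook, code) from a vocabulary ID."""
    offset = token_id - AUDIO_TOKEN_BASE
    assert 0 <= offset < NUM_CODEBOOKS * CODEBOOK_SIZE
    return offset // CODEBOOK_SIZE, offset % CODEBOOK_SIZE
```

As a sanity check against the table: codebook 1 starts at 128266 + 1024 = 129290, and codebook 7 ends at 128266 + 7×1024 + 1023 = 136457.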
Training sequences follow this structure:

```
[START_OF_HUMAN] +
[BOS] + "Speaker: text" + [END_OF_TEXT] +
[END_OF_HUMAN] +
[START_OF_AI] +
[START_OF_SPEECH] +
[cb0_t0, cb1_t0, ..., cb7_t0, cb0_t1, cb1_t1, ..., cb7_t1, ...] +
[END_OF_SPEECH] +
[END_OF_AI]
```
Where audio tokens are interleaved by frame:
- Frame 0: [cb0, cb1, cb2, cb3, cb4, cb5, cb6, cb7]
- Frame 1: [cb0, cb1, cb2, cb3, cb4, cb5, cb6, cb7]
- ...
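The frame interleaving above can be sketched as a pure-Python round trip over a codes matrix of shape [num_codebooks, num_frames] (a list of lists here, for clarity; in the pipeline this would be a tensor):

```python
NUM_CODEBOOKS = 8

def interleave(codes):
    """Flatten [num_codebooks, num_frames] codes frame by frame:
    frame 0's 8 codes first, then frame 1's, and so on."""
    num_frames = len(codes[0])
    return [codes[cb][t] for t in range(num_frames) for cb in range(NUM_CODEBOOKS)]

def deinterleave(flat):
    """Inverse: rebuild the [num_codebooks, num_frames] layout."""
    assert len(flat) % NUM_CODEBOOKS == 0
    num_frames = len(flat) // NUM_CODEBOOKS
    return [[flat[t * NUM_CODEBOOKS + cb] for t in range(num_frames)]
            for cb in range(NUM_CODEBOOKS)]
```

Interleaving by frame keeps all 8 residual codes of a given 20 ms frame adjacent in the LLM's input, so the model predicts one complete frame before moving to the next.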
| Metric | Value |
|---|---|
| Sample Rate | 16,000 Hz |
| Frame Rate | 50 fps (20ms per frame) |
| Codebooks | 8 (RVQ) |
| Codebook Size | 1,024 codes each |
| Bitrate | ~1.6 kbps |
| Compression | ~150-200x |
| Tokens per Second | 400 (8 codebooks × 50 fps) |
| Codec | Codebooks | Total Tokens | Sample Rate | Frame Rate | Tokens/sec |
|---|---|---|---|---|---|
| Higgs | 8 | 8,192 | 16 kHz | 50 Hz | 400 |
| LongCat | 1+3 | 32,492 | 24 kHz | 50 Hz | 200 |
| NeuCodec | 1 (FSQ) | 65,536 | 16 kHz | 50 Hz | 50 |
| SNAC | 3 hierarchical | 12,288 | 24 kHz | 50 Hz | 150 |
Encode and decode a single file:

```bash
python higgs-encoder.py input.wav codes.npz
python higgs-decoder.py codes.npz output.wav
```

Reconstruct audio from a token ID sequence:

```bash
# Create tokens.txt with token IDs (one per line or comma-separated)
python reconstruct_single_sequence.py \
    --input_ids tokens.txt \
    --output reconstructed.wav \
    --higgs_model bosonai/higgs-audio-v2-tokenizer
```

Python API usage:

```python
import sys
sys.path.insert(0, '/path/to/higgs')

import torch
import torchaudio

from audio_processing.higgs_audio_tokenizer import load_higgs_audio_tokenizer

# Load tokenizer
tokenizer = load_higgs_audio_tokenizer(
    "bosonai/higgs-audio-v2-tokenizer",
    device="cuda",
)

# Encode audio
codes = tokenizer.encode("input.wav", sr=16000, loudness_normalize=False)
print(f"Codes shape: {codes.shape}")  # [8, num_frames]

# Decode to audio
audio = tokenizer.decode(codes.unsqueeze(0))
print(f"Audio shape: {audio.shape}")  # [1, 1, num_samples]

# Save
torchaudio.save("output.wav", audio[0], tokenizer.sampling_rate)
```