Higgs Audio TTS Training

Higgs is a high-quality neural audio codec with 8 codebooks based on Residual Vector Quantization (RVQ). This repository provides tools for training LLM-based TTS models on top of the Higgs codec.

πŸ› οΈ Installation

1. Create Environment

conda create -n higgs_tts python=3.10
conda activate higgs_tts

2. Install Dependencies

pip install torch torchaudio transformers datasets accelerate
pip install librosa numpy huggingface-hub pyyaml pandas soundfile pydub
pip install descript-audio-codec vector-quantize-pytorch pyloudnorm
pip install wandb  # For training metrics

πŸ“¦ Model Download

Higgs Audio Tokenizer

The Higgs audio tokenizer will be automatically downloaded from HuggingFace on first use:

from audio_processing.higgs_audio_tokenizer import load_higgs_audio_tokenizer

tokenizer = load_higgs_audio_tokenizer("bosonai/higgs-audio-v2-tokenizer", device="cuda")

Or download explicitly with huggingface_hub (either way, files are cached to ~/.cache/huggingface/hub/):

from huggingface_hub import snapshot_download

snapshot_download(repo_id="bosonai/higgs-audio-v2-tokenizer")

πŸ“Š Dataset Preparation

Input Format Options

Higgs TTS supports CSV or JSONL datasets with audio and text.

Option 1: CSV Format

Create a CSV file with these columns:

Column        Required   Description          Example
audio_path    ✅ Yes     Path to audio file   /data/audio/sample1.wav
text          ✅ Yes     Text transcription   Hello, how are you?
speaker_name  ✅ Yes     Speaker identifier   speaker_1
file_id       Optional   Unique identifier    audio_001

Example CSV (my_dataset.csv):

audio_path,text,speaker_name
/data/audio/file1.wav,Hello everyone,female_speaker
/data/audio/file2.wav,The weather is nice today,male_speaker
/data/audio/file3.wav,Thank you very much,female_speaker

Option 2: JSONL Format

Create a JSONL file (one JSON object per line):

Example JSONL (my_dataset.jsonl):

{"audio_filepath": "/data/audio/file1.wav", "text": "Hello everyone", "speaker": "female_speaker"}
{"audio_filepath": "/data/audio/file2.wav", "text": "The weather is nice today", "speaker": "male_speaker"}
{"audio_filepath": "/data/audio/file3.wav", "text": "Thank you very much", "speaker": "female_speaker"}

Required fields:

  • audio_filepath or audio_path
  • text or transcription
  • speaker
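
To catch manifest problems before encoding, you can run a quick validation pass over the JSONL (a minimal sketch using only the standard library; the field aliases follow the list above):

import json
import os

REQUIRED = [("audio_filepath", "audio_path"), ("text", "transcription"), ("speaker",)]

with open("my_dataset.jsonl") as f:
    for i, line in enumerate(f, 1):
        row = json.loads(line)
        # Each requirement is satisfied by any one of its aliases
        for aliases in REQUIRED:
            if not any(a in row for a in aliases):
                print(f"line {i}: missing {' or '.join(aliases)}")
        # Flag rows whose audio file does not exist on disk
        path = row.get("audio_filepath") or row.get("audio_path")
        if path and not os.path.exists(path):
            print(f"line {i}: audio file not found: {path}")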

Audio Requirements

  • Format: WAV, MP3, FLAC (any format supported by torchaudio)
  • Sample Rate: Any (auto-resampled to 16kHz for encoding)
  • Channels: Mono or stereo (auto-converted to mono)
  • Duration: Less than 30 seconds per file recommended
  • Quality: Clean speech without excessive background noise
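
The dataset pipeline applies the resampling and mono conversion automatically; for reference, that normalization amounts to roughly the following (a sketch with torchaudio, not the repo's exact code):

import torch
import torchaudio

def load_mono_16k(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)      # [channels, samples]
    if wav.size(0) > 1:                  # stereo -> mono by averaging channels
        wav = wav.mean(dim=0, keepdim=True)
    if sr != 16000:                      # resample to the codec's 16 kHz rate
        wav = torchaudio.functional.resample(wav, sr, 16000)
    return wav                           # [1, samples] at 16 kHz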

🎯 Dataset Creation

Step 1: Create Configuration File

Create datasets_config.yaml:

datasets:
  - dataset_path: "/data/english_dataset.csv"
    speaker_name: "speaker_1"
    num_samples: 1000  # Optional: limit samples
    normalize: false   # Optional: normalize audio
    language: "English"  # Optional: add language prefix
    
  - dataset_path: "/data/spanish_dataset.jsonl"
    speaker_name: "speaker_2"
    num_samples: 500
    language: "Spanish"

Parameters:

  • dataset_path: Path to CSV or JSONL file
  • speaker_name: Speaker identifier (will override JSONL speaker field if set)
  • num_samples: (Optional) Limit number of samples to process
  • normalize: (Optional) Normalize audio volume (default: false)
  • language: (Optional) Add language prefix to prompt

Step 2: Run Dataset Creation

python create_dataset_higgs.py \
    --config datasets_config.yaml \
    --output_dir /workspace/higgs_dataset \
    --higgs_model bosonai/higgs-audio-v2-tokenizer \
    --device cuda

Parameters:

  • --config: Path to datasets config YAML
  • --output_dir: Where to save processed dataset
  • --higgs_model: Higgs model path or HuggingFace repo
  • --device: Device (cuda or cpu)

Output:

  • Processed dataset saved to output_dir
  • Metadata file with codec info and speaker statistics
  • Ready for LLM training!
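
If create_dataset_higgs.py writes the processed dataset in Hugging Face datasets format (an assumption suggested by the datasets dependency, not confirmed here), it can be inspected before training like this:

from datasets import load_from_disk

# Assumption: output_dir contains a dataset saved via save_to_disk
ds = load_from_disk("/workspace/higgs_dataset")
print(ds)  # prints columns and row counts (a Dataset or DatasetDict)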

πŸ” Dataset Verification

Before training, verify your dataset quality:

# Reconstruct a few samples from the dataset
python reconstruct_from_dataset.py \
    --dataset /workspace/higgs_dataset \
    --higgs_model bosonai/higgs-audio-v2-tokenizer \
    --output_dir reconstructed_samples \
    --num_samples 5 \
    --device cuda

This will:

  • Extract 5 random samples from the dataset
  • Decode them back to audio
  • Save to reconstructed_samples/ for quality checking

βš™οΈ Training Configuration

Configuration File: finetune_llm/config.yaml

# Dataset
TTS_dataset: "/workspace/higgs_dataset"

# Model
model_name: "meta-llama/Llama-3.2-3B"  # Base model to start from
# model_name: "./checkpoints_higgs/checkpoint-5000"  # Or continue from checkpoint

# Training Args
epochs: 3
batch_size: 1  # Adjust based on GPU memory (1-4 typical)
number_processes: 4  # Number of GPUs
pad_token: 128263
save_steps: 1000
save_total_limit: 2  # Keep only last 2 checkpoints
learning_rate: 5.0e-5
lr_scheduler_type: "constant"

# Higgs Token Configuration
add_custom_tokens: true  # true for first training, false for fine-tuning
num_codebooks: 8  # Higgs uses 8 codebooks
codebook_size: 1024  # Each codebook has 1024 codes
base_llama_tokenizer: "meta-llama/Llama-3.2-3B"

# Naming and paths
save_folder: "checkpoints_higgs_v1"
project_name: "higgs-tts"
run_name: "train-higgs-v1"
resume_from_checkpoint: false

Key Configuration Parameters

Model Configuration

Parameter    Description                    Example
model_name   Base model or checkpoint path  "meta-llama/Llama-3.2-3B"
TTS_dataset  Path to processed dataset      "/workspace/higgs_dataset"

Training Parameters

Parameter         Description                      Recommended
batch_size        Per-device batch size            1-4 (adjust for GPU memory)
number_processes  Number of GPUs                   4 for multi-GPU
epochs            Training epochs                  3-5 for pretraining
learning_rate     Learning rate                    5.0e-5 from scratch, 1.0e-5 for fine-tuning
save_steps        Save a checkpoint every N steps  1000-5000

Higgs Token Settings

Parameter          Description                Values
add_custom_tokens  Add Higgs tokens to model  true for first training, false for fine-tuning
num_codebooks      Number of codebooks        8 (fixed for Higgs)
codebook_size      Codes per codebook         1024 (fixed for Higgs)

πŸš€ Training

Single GPU

cd finetune_llm
python train.py

Multi-GPU (8 GPUs)

Using Accelerate

  1. Update accelerate config (accelerate_config.yaml):

    num_processes: 8  # Set to your number of GPUs
  2. Launch training:

    cd finetune_llm
    bash train_multi_gpu.sh

Or manually:

cd finetune_llm
accelerate launch --config_file accelerate_config.yaml train.py

Training Outputs

Training will create:

  • checkpoints_higgs_v1/ - Standard checkpoint
  • checkpoints_higgs_v1/fp16/ - FP16 model (for vLLM)
  • checkpoints_higgs_v1/bf16/ - BF16 model
  • checkpoints_higgs_v1/gguf/ - GGUF-ready config

🎀 Inference

Using a Trained Model

python inference_transformers.py \
    --model ./checkpoints_higgs_v1 \
    --higgs_tokenizer bosonai/higgs-audio-v2-tokenizer \
    --text "Hello, this is a test of Higgs TTS." \
    --output output.wav \
    --voice Speaker \
    --temperature 0.4 \
    --top_p 0.9 \
    --top_k 40 \
    --max_tokens 1024 \
    --device cuda

Parameters:

  • --model: Path to trained model checkpoint
  • --higgs_tokenizer: Higgs audio tokenizer (usually keep default)
  • --text: Text to synthesize
  • --output: Output audio file path
  • --voice: Voice identifier (must match training data)
  • --temperature: Sampling temperature (0.3-0.6 recommended)
  • --top_p: Top-p sampling (0.9 recommended)
  • --top_k: Top-k sampling (40 recommended)
  • --max_tokens: Max tokens to generate (at 8 tokens per frame and 50 fps, 400 tokens ≈ 1 second of audio)
  • --device: Device (cuda or cpu)
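
Since each second of audio costs 8 tokens per frame × 50 fps = 400 tokens (see Codec Specifications below), --max_tokens can be sized from the target duration. A small illustrative helper, not part of the repo:

# 8 codebooks x 50 frames/s = 400 audio tokens per second of speech
def max_tokens_for(seconds: float, margin: int = 64) -> int:
    """Audio-token budget for a target duration, plus headroom for special tokens."""
    return int(seconds * 8 * 50) + margin

print(max_tokens_for(5.0))  # 2064 tokens for ~5 seconds of audio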

Batch Inference

Create a script for batch processing:

#!/bin/bash
MODEL="./checkpoints_higgs_v1"
mkdir -p outputs

# Hash each line to get a stable, filesystem-safe output filename
while read -r line; do
    filename=$(echo "$line" | md5sum | cut -d' ' -f1)
    python inference_transformers.py \
        --model "$MODEL" \
        --text "$line" \
        --output "outputs/${filename}.wav" \
        --temperature 0.4
done < texts.txt

πŸ“ˆ Token Configuration

Higgs Token Layout

Higgs adds 8,202 custom tokens to the LLM:

Token Range    Count  Purpose
128256-128265  10     Special tokens (START_OF_SPEECH, END_OF_SPEECH, etc.)
128266-129289  1,024  Codebook 0
129290-130313  1,024  Codebook 1
130314-131337  1,024  Codebook 2
131338-132361  1,024  Codebook 3
132362-133385  1,024  Codebook 4
133386-134409  1,024  Codebook 5
134410-135433  1,024  Codebook 6
135434-136457  1,024  Codebook 7

Total new tokens: 10 + (8 Γ— 1,024) = 8,202
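
The mapping from a (codebook, code) pair to an LLM token ID is then a base offset plus codebook * 1024 + code. A minimal sketch (offsets taken from the table above; the helper name is illustrative, not repo API):

# Offsets from the token-layout table above
AUDIO_BASE = 128266      # first Codebook 0 token
CODEBOOK_SIZE = 1024
NUM_CODEBOOKS = 8

def audio_token_id(codebook: int, code: int) -> int:
    """Map a (codebook, code) pair to its LLM token ID."""
    assert 0 <= codebook < NUM_CODEBOOKS and 0 <= code < CODEBOOK_SIZE
    return AUDIO_BASE + codebook * CODEBOOK_SIZE + code

# Last token of Codebook 7: 128266 + 7*1024 + 1023 = 136457, matching the table
assert audio_token_id(7, 1023) == 136457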

Sequence Format

Training sequences follow this structure:

[START_OF_HUMAN] + 
  [BOS] + "Speaker: text" + [END_OF_TEXT] + 
[END_OF_HUMAN] + 
[START_OF_AI] + 
  [START_OF_SPEECH] + 
    [cb0_t0, cb1_t0, ..., cb7_t0, cb0_t1, cb1_t1, ..., cb7_t1, ...] + 
  [END_OF_SPEECH] + 
[END_OF_AI]

Where audio tokens are interleaved by frame:

  • Frame 0: [cb0, cb1, cb2, cb3, cb4, cb5, cb6, cb7]
  • Frame 1: [cb0, cb1, cb2, cb3, cb4, cb5, cb6, cb7]
  • ...
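
A small numpy sketch of this frame-major interleaving, converting the [8, num_frames] code matrix returned by the tokenizer into a flat token sequence and back (AUDIO_BASE comes from the token-layout table above; the function names are illustrative):

import numpy as np

AUDIO_BASE = 128266  # Codebook 0 offset from the token-layout table
CODEBOOK_SIZE = 1024

def interleave(codes: np.ndarray) -> np.ndarray:
    """[8, T] raw codes -> flat [T*8] token IDs in frame-major order."""
    offsets = AUDIO_BASE + np.arange(codes.shape[0])[:, None] * CODEBOOK_SIZE
    return (codes + offsets).T.reshape(-1)  # transpose makes each frame contiguous

def deinterleave(tokens: np.ndarray, num_cb: int = 8) -> np.ndarray:
    """Flat [T*8] token IDs -> [8, T] raw codes."""
    codes = tokens.reshape(-1, num_cb).T
    return codes - (AUDIO_BASE + np.arange(num_cb)[:, None] * CODEBOOK_SIZE)

codes = np.random.randint(0, CODEBOOK_SIZE, size=(8, 5))
assert np.array_equal(deinterleave(interleave(codes)), codes)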

πŸ“Š Codec Specifications

Metric             Value
Sample Rate        16,000 Hz
Frame Rate         50 fps (20 ms per frame)
Codebooks          8 (RVQ)
Codebook Size      1,024 codes each
Bitrate            ~4 kbps (400 tokens/s × 10 bits per token)
Compression        ~64x vs. 16-bit mono PCM at 16 kHz (256 kbps)
Tokens per Second  400 (8 codebooks × 50 fps)

Comparison with Other Codecs

Codec     Codebooks       Total Tokens  Sample Rate  Frame Rate  Tokens/sec
Higgs     8               8,192         16 kHz       50 Hz       400
LongCat   1+3             32,492        24 kHz       50 Hz       200
NeuCodec  1 (FSQ)         65,536        16 kHz       50 Hz       50
SNAC      3 hierarchical  12,288        24 kHz       50 Hz       150

πŸ”§ Advanced Usage

Testing Individual Components

1. Encode Audio to Codes

python higgs-encoder.py input.wav codes.npz

2. Decode Codes to Audio

python higgs-decoder.py codes.npz output.wav

3. Reconstruct Single Sequence

# Create tokens.txt with token IDs (one per line or comma-separated)
python reconstruct_single_sequence.py \
    --input_ids tokens.txt \
    --output reconstructed.wav \
    --higgs_model bosonai/higgs-audio-v2-tokenizer

Python API

import sys
sys.path.insert(0, '/path/to/higgs')

from audio_processing.higgs_audio_tokenizer import load_higgs_audio_tokenizer
import torch
import numpy as np

# Load tokenizer
tokenizer = load_higgs_audio_tokenizer(
    "bosonai/higgs-audio-v2-tokenizer",
    device="cuda"
)

# Encode audio
codes = tokenizer.encode("input.wav", sr=16000, loudness_normalize=False)
print(f"Codes shape: {codes.shape}")  # [8, num_frames]

# Decode to audio
audio = tokenizer.decode(codes.unsqueeze(0))
print(f"Audio shape: {audio.shape}")  # [1, 1, num_samples]

# Save
import torchaudio
torchaudio.save("output.wav", audio[0], tokenizer.sampling_rate)
