Phonemize is a modern, multilingual grapheme-to-phoneme (G2P) conversion library powered by state-of-the-art Transformer models. This library delivers production-grade accuracy, lightning-fast inference, and seamless integration into text-to-speech (TTS) pipelines, speech synthesis engines, and other voice-related applications. Whether you're building real-time TTS systems or offline speech processing pipelines, Phonemize provides the precision and performance you need.
- Installation – Get started in seconds
- Quickstart – Basic usage example
- Training Guide – Train custom models
- Advanced Inference – Detailed predictions & confidence scores
- Configuration – Tune model parameters
- Troubleshooting – Common issues & solutions
- API Reference – Complete API documentation
- Pre-trained Models – Download trained models
- Performance Tips – Speed optimization
- Easy-to-use API: Intuitive Python interface for both training and inference tasks. Get started with just a few lines of code.
- Multilingual Support: Train and deploy a single unified model across multiple languages without architectural changes.
- High Performance: Achieve millisecond-level inference latency with state-of-the-art Transformer-based predictions.
- Custom Training: Train your own specialized models with minimal setup using simple YAML configuration files.
- Optimized for TTS: Purpose-built for real-time streaming, offline batch processing, and production text-to-speech systems.
- TorchScript Export: Export trained models to TorchScript for hardware-accelerated inference and deployment.
For using pre-trained models without training capabilities:
pip install phonemize
To enable model training and all dependencies:
pip install 'phonemize[train]'
Requirements: Python 3.8 or later, PyTorch 1.9+
Load a pre-trained model and convert text to phonemes in just a few lines:
import torch
from phonemize import Phonemizer
# Import custom classes for safe model loading
from phonemize.preprocessing.text import Preprocessor, LanguageTokenizer, SequenceTokenizer
# Register classes for secure deserialization
torch.serialization.add_safe_globals([Preprocessor, LanguageTokenizer, SequenceTokenizer])
# Load pre-trained checkpoint
phonemizer = Phonemizer.from_checkpoint("phonemize_m1.pt")
# Convert text to phonemes (supports multiple languages)
result = phonemizer("Arcosoph's quixotic insights empower enthusiasts.", lang="en_us")
# Display results
print(result)
Output:
The model converts graphemes (written text) to phonemes (speech sounds) using the ARPAbet notation:
[AA][R][K][AH][S][AO][F][S] [K][W][IH][K][S][AA][T][IH][K] [IH][N][S][AY][T][S] [IH][M][P][AW][ER] [EH][N][TH][UW][Z][IY][AE][S][T][S].
Each [XX] represents a single phonetic unit. This output can be directly fed into TTS vocoders or speech synthesis systems.
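Downstream components usually need the individual symbols rather than the bracketed string. A minimal sketch for splitting the output format shown above, using only the standard library (the helper name is ours, not part of the library's API):

```python
import re

def split_arpabet(output):
    """Extract phoneme symbols from a bracketed string like '[AA][R][K]...'."""
    return re.findall(r"\[([A-Z]+)\]", output)

tokens = split_arpabet("[AA][R][K][AH][S]")
print(tokens)  # ['AA', 'R', 'K', 'AH', 'S']
```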
Phonemize supports two distinct Transformer-based architectures, each with different strengths:
| Architecture | Type | Speed | Quality | Use Case |
|---|---|---|---|---|
| Forward Transformer | Encoder-Decoder (seq2seq) | Fast ⚡ | Excellent 95%+ | Real-time TTS, low-latency inference |
| Autoregressive Transformer | Decoder-only (causal) | Slower | Slightly Higher 96%+ | Research, domain-specific fine-tuning |
Quick Decision Guide:
- Use Forward for production TTS systems
- Use Autoregressive for research or when highest accuracy is needed
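The choice maps to the model type field in the training config (values taken from the example configs in the Configuration section):

```yaml
model:
  type: 'transformer'            # Forward (encoder-decoder): fast, production TTS
  # type: 'autoreg_transformer'  # Autoregressive: slower, slightly higher accuracy
```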
Create custom G2P models tailored to your specific language or domain. The training pipeline handles preprocessing, model creation, and distributed training.
- Prepare Data: Collect grapheme-phoneme pairs in (language, grapheme, phoneme) format
- Configure: Define model architecture, hyperparameters, and preprocessing in a YAML config file
- Preprocess: Tokenize and prepare datasets with vocabulary and data splits
- Train: Launch training with optional multi-GPU support and checkpoint resumption
from phonemize.preprocess import preprocess
from phonemize.train import train
# Prepare multilingual training data (language_code, grapheme, phoneme)
# Use IPA (International Phonetic Alphabet) for phonemes
train_data = [
("en_us", "young", "jʌŋ"),
("de", "benützten", "bənʏt͡stn̩"),
("fr", "bonjour", "bɔ̃ʒuʁ")
] * 1000
# Define validation set for monitoring model performance
val_data = [
("en_us", "young", "jʌŋ"),
("de", "benützten", "bənʏt͡stn̩"),
("fr", "bonjour", "bɔ̃ʒuʁ")
] * 100
# Load configuration (defines model architecture, training parameters, etc.)
config_file = "configs/forward.yaml"
# Step 1: Preprocess and tokenize data
preprocess(
config_file=config_file,
train_data=train_data,
val_data=val_data,
deduplicate_train_data=False # Keep duplicates for balanced multilingual training
)
# Step 2: Train model with distributed support (rank=0 for single GPU)
train(rank=0, num_gpus=1, config_file=config_file)
If training is interrupted, resume from the last checkpoint:
from phonemize.train import train
# Resume training from checkpoint
train(
rank=0,
num_gpus=1,
config_file="configs/forward.yaml",
checkpoint_file="checkpoints/latest_model.pt" # Path to checkpoint
)
For faster training on multiple GPUs, use DDP configuration:
from phonemize.train import train
# Train on 4 GPUs with automatic distribution
train(rank=0, num_gpus=4, config_file="configs/forward.yaml")
Update your config file with DDP settings:
training:
ddp_backend: 'nccl' # Use 'gloo' on CPU-only systems
ddp_host: 'localhost' # Hostname for multi-node training
ddp_port: 12355 # Port for DDP communication
Output: Model checkpoints are saved to the checkpoint_dir specified in the config. Monitor training with TensorBoard.
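In practice, training data rarely lives in Python literals. A hedged sketch for loading (language, grapheme, phoneme) tuples from a tab-separated file — the file layout here is an assumption for illustration, not a format the library prescribes:

```python
import csv
import io

def load_g2p_tsv(fileobj):
    """Read (language, grapheme, phoneme) rows from a tab-separated source."""
    reader = csv.reader(fileobj, delimiter="\t")
    return [(lang, grapheme, phoneme) for lang, grapheme, phoneme in reader]

# Example: three rows in the same shape as the train_data list above
sample = io.StringIO("en_us\tyoung\tjʌŋ\nde\tbenützten\tbənʏt͡stn̩\nfr\tbonjour\tbɔ̃ʒuʁ\n")
train_data = load_g2p_tsv(sample)
print(train_data[0])  # ('en_us', 'young', 'jʌŋ')
```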
Perform phonemization with optional parameters for punctuation handling and acronym expansion:
from phonemize import Phonemizer
# Load custom trained model
phonemizer = Phonemizer.from_checkpoint("checkpoints/best_model.pt")
# Single text inference with default settings
phonemes = phonemizer("Phonemizing text is simple!", lang="en_us")
print(phonemes) # Output: [F][O][N][EH][M][IH][Z][IH][NG] ...
# Batch inference for better performance
texts = ["Hello world", "How are you?", "Testing Phonemize"]
results = phonemizer(texts, lang="en_us", batch_size=32)
# Custom punctuation handling
custom_punct = ".,!?;"
result = phonemizer(
"Hello, world!",
lang="en_us",
punctuation=custom_punct,
expand_acronyms=True, # Expands "DIY" -> "D-I-Y"
batch_size=8
)
Inference Parameters:
- text (str | List[str]): Input text or list of texts to phonemize
- lang (str): Target language code (e.g., 'en_us', 'de', 'fr') - must match training languages
- punctuation (str): Characters to split on [default: '().,:?!/–']
- expand_acronyms (bool): Auto-expand acronyms like "U.S.A" [default: True]
- batch_size (int): Inference batch size for GPU efficiency [default: 8]
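To illustrate what the punctuation parameter does conceptually, here is a stand-in splitter — an illustration of the splitting behavior, not the library's internal implementation:

```python
import re

def split_on_punctuation(text, punctuation="().,:?!/–"):
    """Split text into word chunks, keeping punctuation characters as separate items."""
    pattern = "([" + re.escape(punctuation) + r"\s])"
    return [chunk for chunk in re.split(pattern, text) if chunk.strip()]

print(split_on_punctuation("Hello, world!"))  # ['Hello', ',', 'world', '!']
```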
Access comprehensive predictions including confidence metrics and token probabilities:
# Get detailed predictions
result = phonemizer.phonemise_list(
["Phonemizing text is simple!"],
lang="en_us"
)
# Access detailed per-word predictions
for word, pred in result.predictions.items():
print(f"Word: '{word}'")
print(f" Phonemes: {pred.phonemes}")
print(f" Tokens: {pred.phoneme_tokens}")
print(f" Confidence: {pred.confidence:.3f}")
print(f" Token Probs: {[f'{p:.2f}' for p in pred.token_probs]}")
# Access full result structure
print(f"Original texts: {result.text}")
print(f"Split text: {result.split_text}") # Words separated by punctuation
print(f"Phoneme output: {result.phonemes}") # Concatenated phoneme strings
print(f"Split phonemes: {result.split_phonemes}") # Phonemes per word
Result Object Properties:
- text: Original input texts (List[str])
- phonemes: Final phoneme strings (List[str])
- split_text: Texts split by punctuation (List[List[str]])
- split_phonemes: Corresponding phonemes per split (List[List[str]])
- predictions: Dict mapping words → Prediction objects with confidence scores
Prediction Object Properties:
- word: Original word
- phonemes: Predicted phoneme string
- phoneme_tokens: Individual phoneme tokens (with markers)
- confidence: Overall confidence score (0.0-1.0)
- token_probs: Per-token probabilities
Use Cases: Confidence scores identify uncertain predictions for manual review, retraining, or fallback to dictionary lookup.
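A sketch of the dictionary-fallback pattern described above. The prediction objects are simulated here with a small dataclass; with real results you would iterate result.predictions as in the loop above:

```python
from dataclasses import dataclass

@dataclass
class Prediction:  # Stand-in for the library's prediction object
    phonemes: str
    confidence: float

# Simulated model output and a hand-built fallback lexicon
predictions = {
    "young": Prediction("jʌŋ", 0.97),
    "zyzzyva": Prediction("zɪzɪvə", 0.41),  # low confidence: fall back
}
lexicon = {"zyzzyva": "ˈzɪzɪvə"}

def resolve(word, pred, threshold=0.8):
    """Use the model's phonemes unless confidence is low and a lexicon entry exists."""
    if pred.confidence < threshold and word in lexicon:
        return lexicon[word]
    return pred.phonemes

print(resolve("young", predictions["young"]))      # jʌŋ
print(resolve("zyzzyva", predictions["zyzzyva"]))  # ˈzɪzɪvə
```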
All training parameters are defined in YAML configuration files. Phonemize includes example configs for different model types.
Use configs/forward_config.yaml for fast, high-quality models:
model:
type: 'transformer' # Fast seq2seq model
d_model: 512 # Model dimension
d_fft: 1024 # FFT dimension in feed-forward layer
layers: 6 # Number of transformer layers
dropout: 0.1
heads: 4 # Attention heads
preprocessing:
languages: ['en_us', 'de', 'fr'] # Add your languages here!
text_symbols: 'abcdefghijklmnopqrstuvwxyz' # Supported graphemes
phoneme_symbols: ['a', 'e', 'i', 'ə', 'ŋ', ...] # Supported phonemes
char_repeats: 3 # Max phonemes per grapheme
lowercase: true
n_val: 5000 # Validation split size
training:
learning_rate: 0.0001
warmup_steps: 10000
batch_size: 32
epochs: 500
generate_steps: 10000 # Eval interval
validate_steps: 10000
checkpoint_steps: 100000 # Save interval
Use configs/autoreg_config.yaml for slightly higher accuracy at the cost of speed:
model:
type: 'autoreg_transformer' # Slower but potentially higher accuracy
d_model: 512
layers: 4 # Usually fewer layers for autoreg
heads: 4
preprocessing:
char_repeats: 1 # MUST be 1 for autoregressive models
For Your Languages:
preprocessing:
languages: ['en_us', 'de', 'fr', 'es'] # Add all target languages
text_symbols: 'abcdefghijklmnopqrstuvwxyzäöüàâé...' # Include all chars in data
For More Accuracy (slower):
model:
d_model: 768 # Increase model size
layers: 8 # More transformer layers
dropout: 0.2
training:
learning_rate: 0.00005 # Lower LR for stability
warmup_steps: 20000 # Longer warmup
For Faster Training:
model:
d_model: 256 # Smaller model
layers: 3 # Fewer layers
heads: 2
training:
batch_size: 64 # Larger batches
generate_steps: 5000 # Less frequent evaluation
# ❌ Error: Language not supported
phonemizer("Hello", lang="pt") # Portuguese not in training languages
# ✅ Solution: Use a language from model's training set
# Check available languages in the config file
phonemizer("Hello", lang="en_us") # Use supported language
# Increase model capacity
model:
d_model: 768
layers: 8
# Improve learning rate schedule
training:
learning_rate: 0.0001
warmup_steps: 20000
scheduler_plateau_patience: 15
# ✅ Reduce batch size
train(rank=0, num_gpus=1, config_file="configs/forward.yaml")
# Then update config:
# training:
# batch_size: 16 # Reduced from 32
# ✅ For inference: reduce batch_size
result = phonemizer("text", lang="en_us", batch_size=4) # Was 8
preprocessing:
# Include ALL characters from your training data
text_symbols: 'abcdefghijklmnopqrstuvwxyzäöü...' # Add special chars!
# Include ALL phonemes in your dataset
phoneme_symbols: ['a', 'b', 'c', ..., 'ə', 'ŋ']
import torch
from phonemize import Phonemizer
# Load trained checkpoint
phonemizer = Phonemizer.from_checkpoint("checkpoints/best_model.pt")
# Convert to TorchScript (JIT compilation)
scripted_model = torch.jit.script(phonemizer.predictor.model)
phonemizer.predictor.model = scripted_model
# Save for production deployment
torch.jit.save(scripted_model, "phonemizer_scripted.pt")
# Run optimized inference
result = phonemizer("Running the optimized TorchScript model!", lang="en_us")
print(result)
Benefits:
- ✅ No Python dependencies required at inference time
- ✅ Significantly lower latency (30-50% speedup)
- ✅ Portable across platforms
- ✅ Memory efficient
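To verify the speedup on your own checkpoint, time the model before and after scripting. The measurement pattern is shown with a dummy callable standing in for the phonemizer:

```python
import time

def time_callable(fn, arg, repeats=100):
    """Average wall-clock latency of fn(arg) over several runs, in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats * 1000.0

dummy_model = lambda text: text.upper()  # stand-in for phonemizer(text, lang=...)
latency_ms = time_callable(dummy_model, "hello world")
print(f"{latency_ms:.4f} ms per call")
```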
from phonemize import Phonemizer
phonemizer = Phonemizer.from_checkpoint("phonemize_m1.pt")
# ✅ Batch Processing (5-10x faster than single items)
texts = ["word1", "word2", "word3"] * 100
results = phonemizer(texts, lang="en_us", batch_size=32)
# ❌ Avoid: Processing one word at a time
# for word in texts:
# result = phonemizer(word, lang="en_us") # Slow!
# ✅ Adjust batch size based on GPU memory
# Small GPU: batch_size=4-8
# Medium GPU (6GB): batch_size=16-32
# Large GPU (24GB+): batch_size=64-128
# ✅ Use TorchScript for maximum speed
import torch
scripted = torch.jit.script(phonemizer.predictor.model)
phonemizer.predictor.model = scripted
# ✅ For faster training iterations
training:
batch_size: 64 # Larger batches = faster epoch
generate_steps: 5000 # Less frequent validation
validate_steps: 10000
# ✅ Use Multi-GPU training for large datasets
# In training code: train(rank=0, num_gpus=4, config_file=...)
# ✅ For production models: increase compute
model:
d_model: 768 # Larger model = better accuracy
layers: 8
heads: 8
| Model | Inference Speed | Throughput | Accuracy |
|---|---|---|---|
| Forward Transformer (GPU) | ~1-2 ms/word | 500-1000 words/sec | 98.5%+ |
| Autoregressive (GPU) | ~5-10 ms/word | 100-200 words/sec | 99%+ |
| TorchScript (GPU) | ~0.5-1 ms/word | 1000-2000 words/sec | Same as original |
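The latency and throughput columns are consistent with each other: throughput is roughly the reciprocal of per-word latency. A quick sanity check:

```python
def throughput_words_per_sec(ms_per_word):
    """Convert per-word latency in milliseconds to words per second."""
    return 1000.0 / ms_per_word

print(throughput_words_per_sec(2.0))  # 500.0  (Forward, slow end)
print(throughput_words_per_sec(1.0))  # 1000.0 (Forward, fast end)
print(throughput_words_per_sec(0.5))  # 2000.0 (TorchScript, fast end)
```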
from phonemize import Phonemizer
# Load from checkpoint
phonemizer = Phonemizer.from_checkpoint(checkpoint_path)
# Phonemize single text
result = phonemizer(
text="Hello world",
lang="en_us",
punctuation="().,:?!/–", # Custom punctuation
expand_acronyms=True, # Expand acronyms
batch_size=8 # Batch size for inference
) # Returns: str
# Phonemize batch of texts
results = phonemizer(
text=["Text 1", "Text 2"],
lang="en_us"
) # Returns: List[str]
# Get detailed predictions
result_obj = phonemizer.phonemise_list(
texts=["Hello"],
lang="en_us",
punctuation="().,:?!/–",
expand_acronyms=True,
batch_size=8
) # Returns: PhonemizerResult
from phonemize.preprocess import preprocess
preprocess(
config_file="configs/forward.yaml", # Config path
train_data=[("en_us", "word", "wɜrd"), ...], # Training data
val_data=[("en_us", "word", "wɜrd"), ...], # Validation data (optional)
deduplicate_train_data=False # Keep duplicates
)
from phonemize.train import train
train(
rank=0, # GPU rank (0 for single GPU)
num_gpus=1, # Number of GPUs
config_file="configs/forward.yaml", # Config path
checkpoint_file=None # Resume from checkpoint (optional)
)
High-quality pre-trained models ready for immediate use:
| Model | Language | Dataset | Accuracy | Arch | Version | Size |
|---|---|---|---|---|---|---|
| phonemize_m1 | 🇺🇸 English (US) | CMUDict | 98.5%+ | Forward | 0.1.0 | ~60MB |
Tested on: CMU Dictionary, diverse English vocabulary including technical terms, names, and phonetic anomalies.
Models are optimized for the phonemize library and saved with PyTorch 1.9+.
- Support for additional languages (Mandarin, Japanese, Arabic)
- Real-time streaming inference API
- ONNX model export for maximum portability
- REST API server for easy deployment
- Fine-tuning APIs for domain-specific models
Phonemize is inspired by DeepPhonemizer and has been completely refactored for modern Python, better performance, and improved usability. Special thanks to the open-source speech processing community.
- License: MIT License - Free for commercial and personal use
- Python Support: 3.8, 3.9, 3.10, 3.11, 3.12, 3.13
- Repository: github.com/arcosoph/phonemize
Questions or Issues? Join our Discord community or open an issue on GitHub.
