victorbarreiro/cantarela

Cantarela: Local TTS Audiobook System

A self-hosted text-to-speech system for converting documents (EPUB, PDF, MOBI, TXT) into audio files. Designed to make TTS research accessible, particularly for minority and under-resourced languages.

Author: Víctor Barreiro, Computer Engineer (victorbarreiro.github.io)


Project Purpose

This project addresses three objectives:

1. Primary Goal: Accessibility Tool for Research-Based TTS

Provide a functional system to make TTS research outputs usable for end users, with focus on languages typically underserved by commercial solutions.

Motivation: Research institutions produce high-quality TTS models for minority languages (Galician, Catalan, Basque, etc.), but these models often remain academic outputs without practical deployment tools. This system bridges that gap by providing a ready-to-use platform where research models can be integrated and made available to speakers of these languages.

Technical implementation:

  • Model-agnostic architecture: supports both Piper (ONNX) and Coqui (PyTorch) engines
  • Filesystem-based model discovery: add new models without code changes
  • Semantic HTML and ARIA-compliant interface for screen reader compatibility
  • Simplified workflow: upload → select voice → download audio
  • Support for long-form content (books, academic papers)
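
The model-agnostic discovery above can be sketched as a filesystem scan at startup. This is a minimal illustration, not the project's actual code: the VoiceModel dataclass is invented here, and the directory layout mirrors the backend/piper_models and backend/models_data/coqui folders described later in this README.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class VoiceModel:
    name: str
    engine: str  # "piper" or "coqui"
    path: Path

def discover_models(root: Path) -> list[VoiceModel]:
    """Scan known engine directories and register any models found,
    so new voices become available without code changes."""
    models = []
    # Piper voices ship as ONNX files with a JSON config alongside.
    piper_root = root / "piper_models"
    if piper_root.is_dir():
        for onnx in piper_root.glob("*.onnx"):
            if (onnx.parent / (onnx.name + ".json")).exists():
                models.append(VoiceModel(onnx.stem, "piper", onnx))
    # Coqui voices are assumed here to live one directory per voice.
    coqui_root = root / "models_data" / "coqui"
    if coqui_root.is_dir():
        for entry in coqui_root.iterdir():
            if entry.is_dir():
                models.append(VoiceModel(entry.name, "coqui", entry))
    return models
```

Because discovery reads the filesystem at startup, dropping a new voice into the right folder is all that is needed to expose it.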

2. Full-Stack Engineering Practice

Use as a hands-on project to understand the complete development cycle of a production-ready system.

Learning objectives:

  • Document format parsing (EPUB, PDF, MOBI)
  • Asynchronous task processing and job queuing
  • Resource management under hardware constraints (VRAM, CPU)
  • Containerization and deployment
  • Multiple interface implementation (Web UI + Telegram Bot)
  • Mathematical notation handling in TTS context

3. AI-Assisted Development Evaluation

Evaluate the integration of LLM-based tools (Claude, Gemini) in the development workflow.

Areas of assessment:

  • Code generation quality and correctness
  • Architectural decision support
  • Debugging and refactoring assistance
  • Documentation generation
  • Trade-offs between development speed and code ownership

Technical Stack

Backend:

  • Python 3.10, FastAPI
  • PyTorch, ONNX Runtime (both configured for GPU inference)
  • Threading/asyncio for concurrent operations

TTS Engines:

  • Piper (ONNX): Fast synthesis (GPU-accelerated in this implementation)
  • Coqui TTS (PyTorch): High-quality synthesis (GPU-accelerated in this implementation)

Frontend:

  • HTML5/JavaScript (Vanilla), TailwindCSS
  • Telegram Bot (python-telegram-bot)

Infrastructure:

  • Docker + Docker Compose
  • NVIDIA GPU (required for this implementation)
  • NVIDIA Container Toolkit

Note: Both TTS engines support CPU inference, but this implementation is configured for GPU acceleration. Code modifications would be needed to run on CPU-only systems.


System Architecture

┌─────────────────┐     ┌──────────────────┐
│   Web UI        │     │  Telegram Bot    │
│  (Port 8000)    │     │  (Optional)      │
└────────┬────────┘     └────────┬─────────┘
         │                       │
         └───────────┬───────────┘
                     │
         ┌───────────▼────────────┐
         │   FastAPI Backend      │
         │   - Job Queue          │
         │   - File Management    │
         │   - Semaphore Control  │
         └───────────┬────────────┘
                     │
         ┌───────────▼────────────┐
         │    TTS Engine          │
         │  ┌──────────────────┐  │
         │  │ Piper (GPU)      │  │
         │  │ Coqui (GPU)      │  │
         │  └──────────────────┘  │
         └────────────────────────┘

Key Components:

  • Semaphore-based queue: Prevents concurrent TTS operations (VRAM management)
  • Background workers: Long-running book conversions don't block API
  • Model auto-discovery: Filesystem-based detection of available voices
  • Dynamic model loading: Load/unload PyTorch models to conserve GPU memory
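
The semaphore-based queue and background workers can be sketched as follows. Function and field names here are illustrative, not taken from the project's main.py:

```python
import threading

# One permit: only one TTS inference may hold the GPU at a time.
tts_semaphore = threading.Semaphore(1)

def run_job(job_id, jobs, synthesize):
    """Background-worker body: wait for the GPU permit, then run the
    blocking synthesis call and record progress in the shared job dict."""
    jobs[job_id] = {"status": "queued"}
    with tts_semaphore:
        jobs[job_id]["status"] = "running"
        try:
            result = synthesize()
            jobs[job_id].update(status="done", result=result)
        except Exception as exc:
            jobs[job_id].update(status="error", error=str(exc))

def submit(job_id, jobs, synthesize):
    """Run the job in a daemon thread so API request handling never blocks."""
    worker = threading.Thread(target=run_job, args=(job_id, jobs, synthesize), daemon=True)
    worker.start()
    return worker
```

Clients then poll the job dictionary (via the /status endpoint) instead of waiting on the request.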

Features

Document Processing:

  • Multi-format support: EPUB, PDF, MOBI, TXT
  • Automatic chapter detection and segmentation
  • Mathematical notation conversion (LaTeX → spoken form)
  • Language-specific text normalization (ES, GL, EN)
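
As a rough illustration of chapter segmentation on plain text (the real parser works from EPUB/PDF document structure; the heading regex here is only a stand-in):

```python
import re

# Matches heading-like lines such as "Chapter 1" at the start of a line.
CHAPTER_RE = re.compile(r"^(chapter\s+\d+.*)$", re.IGNORECASE | re.MULTILINE)

def split_chapters(text: str) -> list[tuple[str, str]]:
    """Split plain text into (title, body) pairs at chapter headings.
    Falls back to a single segment when no headings are found."""
    parts = CHAPTER_RE.split(text)
    if len(parts) == 1:
        return [("Full text", text.strip())]
    # parts = [preamble, title1, body1, title2, body2, ...]
    return [(parts[i].strip(), parts[i + 1].strip())
            for i in range(1, len(parts), 2)]
```

Each (title, body) pair then becomes one synthesis job and one output audio file.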

Audio Generation:

  • Multiple quality presets (maximum, high, balanced, fast, fastest)
  • MP3 and WAV output formats
  • Per-chapter audio files + complete ZIP archive
  • Progress tracking with time estimation

Interfaces:

  • Web UI with real-time job monitoring
  • Telegram bot with password authentication
  • RESTful API endpoints

Supported Languages:

Pre-installed models:

  • English (US): en_US-lessac-medium (Piper)
  • Spanish (ES): es_ES-sharvard-medium (Piper)

Additional models (require manual installation):

  • Galician (GL): gl_ES-celtia (Proxecto Nós / Coqui)
  • Galician (GL): gl_ES-brais (Proxecto Nós / Coqui)
  • Galician (GL): gl_ES-icia (Proxecto Nós / Coqui)

Screenshots

Book Converter Interface

Upload documents and configure output settings (format, quality, playback speed).

Conversion Progress

Real-time progress tracking with chapter detection and time estimation.

Conversion Complete

Browse and play individual chapters or download the complete audiobook as ZIP.

Mobile Responsive

Full functionality on mobile devices with optimized layout.


Installation

Prerequisites

  • Docker Engine (>= 20.10)
  • Docker Compose (>= 1.29)
  • NVIDIA GPU (required for this implementation)
  • NVIDIA Drivers (>= 470.x)
  • NVIDIA Container Toolkit

Note: The system is configured for GPU acceleration. To run on CPU-only hardware, modify tts_engine.py (remove --cuda flag for Piper, set gpu=False for Coqui) and use onnxruntime instead of onnxruntime-gpu in requirements.txt.
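
The toggle described in the note can be sketched like this. The --cuda flag comes from Piper's CLI and the gpu keyword from Coqui's TTS constructor; the helper functions themselves are illustrative, not the actual tts_engine.py:

```python
USE_GPU = False  # flip to True on hosts with an NVIDIA GPU + CUDA runtime

def piper_command(model_path: str, out_wav: str) -> list[str]:
    """Build a Piper CLI invocation, adding --cuda only when GPU is enabled."""
    cmd = ["piper", "--model", model_path, "--output_file", out_wav]
    if USE_GPU:
        cmd.append("--cuda")  # requires onnxruntime-gpu in requirements.txt
    return cmd

def coqui_kwargs() -> dict:
    """Keyword arguments passed through to Coqui's TTS(...) constructor."""
    return {"gpu": USE_GPU}
```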

Quick Start

  1. Clone repository:

    git clone https://github.com/your-username/cantarela.git
    cd cantarela
  2. TTS Models:

    Pre-installed (automatic during Docker build):

    • English (US): en_US-lessac-medium (Piper)
    • Spanish (ES): es_ES-sharvard-medium (Piper)

    Add additional models (optional):

    • Place Coqui models in backend/models_data/coqui/
    • Place additional Piper models in backend/piper_models/

    The system auto-detects all available models on startup.

  3. Launch:

    docker-compose up --build
  4. Configure Authentication:

    cp .env.example .env

    Edit .env and configure passwords:

    # Enable/disable authentication (true/false)
    AUTH_ENABLED=true
    
    # Admin password (for future admin dashboard)
    ADMIN_PASSWORD=cantarela2026admin
    
    # User password (for regular access)
    USER_PASSWORD=cantarela2026
    
    # JWT secret (change this to a random string)
    JWT_SECRET=your-secret-key-change-this-in-production
    
    # Token expiration in hours (default: 48 hours)
    TOKEN_EXPIRATION_HOURS=48
    

    IMPORTANT: Change the default passwords before deploying to production!

  5. Access:

    • Web UI: http://localhost:8000 (login page will appear if auth is enabled)
    • API docs: http://localhost:8000/docs

Optional: Telegram Bot

  1. Create bot via @BotFather

  2. Edit .env file and add:

    TELEGRAM_BOT_TOKEN=your_token_here
    

    Note: The Telegram bot will use USER_PASSWORD from the authentication settings above. Users must enter the same password to access the bot.

  3. Restart: docker-compose restart

See TELEGRAM_BOT_SETUP.md for details.


Usage

Web Interface

  1. Navigate to http://localhost:8000

  2. Quick Generate Tab:

    • Paste text (< 4000 characters)
    • Select model, quality, format
    • Download audio file
  3. Book Converter Tab:

    • Upload document
    • Select voice model
    • Monitor progress
    • Download chapters individually or as ZIP

Telegram Bot

/start    - Show commands
/generate - Convert text to speech
/convert  - Upload book for conversion
/models   - List available voices
/status   - Check job progress
/cancel   - Cancel running job

First-time users must authenticate with the configured password.


Authentication

The system includes a token-based authentication system designed for alpha testing and controlled deployments:

Features:

  • Password-based access control with two levels:
    • Admin: Future dashboard access (password: configurable via ADMIN_PASSWORD)
    • User: Regular system access (password: configurable via USER_PASSWORD)
  • JWT tokens with configurable expiration (default: 48 hours)
  • Activity logging tracks user behavior metadata (no content stored)
  • Job isolation: Users can only access their own jobs (admins can access all)
  • Easy enable/disable: Set AUTH_ENABLED=false in .env to disable entirely

To disable authentication:

# In .env file
AUTH_ENABLED=false

Activity Logging: The system logs user activity metadata to /logs/activity.log:

  • Login events (IP, OS, browser, timestamp)
  • Generation requests (text length, model, settings)
  • Book uploads (filename, size, format)
  • Conversion completions (chapters, duration)

Note: This is a basic authentication system suitable for PoC/alpha deployments. For production use with untrusted users, consider implementing:

  • HTTPS/TLS encryption
  • Password hashing with salts
  • Rate limiting
  • More sophisticated user management

API Endpoints

Authentication Endpoints:

Method  Endpoint       Description
POST    /auth/login    Authenticate and receive JWT token
POST    /auth/logout   Logout (clears auth cookie)
GET     /auth/verify   Check authentication status
GET     /auth/config   Get auth configuration (public)

TTS Endpoints (require authentication when enabled):

Method  Endpoint            Description
GET     /models             List available TTS models
POST    /generate           Quick text-to-speech (synchronous)
POST    /convert            Upload book for conversion (async)
GET     /status/{job_id}    Get conversion job status
POST    /cancel/{job_id}    Cancel running job
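
A minimal Python client for the login-then-generate flow might look like this. The JSON field names ("password", "token", "text", "model") are assumptions; check the interactive docs at /docs for the actual request schemas.

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def auth_header(token: str) -> dict:
    """Bearer-token header for authenticated TTS endpoints."""
    return {"Authorization": f"Bearer {token}"}

def login(password: str) -> str:
    """POST /auth/login and return the JWT."""
    req = urllib.request.Request(
        f"{BASE}/auth/login",
        data=json.dumps({"password": password}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]

def generate(token: str, text: str, model: str) -> bytes:
    """POST /generate with a bearer token and return the audio bytes."""
    req = urllib.request.Request(
        f"{BASE}/generate",
        data=json.dumps({"text": text, "model": model}).encode(),
        headers={"Content-Type": "application/json", **auth_header(token)},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```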

Admin Endpoints (require admin password):

Method Endpoint Description
GET /admin/dashboard Admin dashboard (placeholder)

Project Structure

cantarela/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI application
│   │   ├── tts_engine.py        # TTS wrapper (Piper/Coqui)
│   │   ├── epub_parser.py       # Document parsing
│   │   ├── telegram_bot.py      # Telegram bot integration
│   │   ├── auth_utils.py        # JWT authentication utilities
│   │   ├── auth_middleware.py   # Authentication middleware
│   │   └── activity_logger.py   # Activity logging system
│   ├── models_data/             # Coqui models (not in repo)
│   ├── piper_models/            # Piper models (not in repo)
│   └── Dockerfile
├── frontend/
│   ├── index.html               # Main web app
│   ├── login.html               # Login page
│   └── js/app.js                # Frontend JavaScript
├── output/                      # Generated audio files
├── logs/                        # Activity logs
├── docker-compose.yml
├── .env.example
└── README.md

Implementation Notes

Resource Management

Challenge: Running PyTorch TTS models on consumer GPUs with limited VRAM.

Solution:

  • Single TTS semaphore: Only one model inference at a time
  • Explicit model unloading: torch.cuda.empty_cache() after each job
  • Model caching: Reuse loaded model if same as previous request
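
A minimal sketch of the load/reuse/unload pattern, with illustrative names (the real engine wraps actual Piper/Coqui loaders):

```python
import threading

_model_lock = threading.Lock()
_cache = {"name": None, "model": None}

def get_model(name, loader):
    """Return the cached model when the same voice is requested again;
    otherwise drop the old model and release cached VRAM before loading."""
    with _model_lock:
        if _cache["name"] == name:
            return _cache["model"]
        if _cache["model"] is not None:
            _cache["model"] = None          # drop the Python reference first
            try:
                import torch
                torch.cuda.empty_cache()    # then release PyTorch's cached VRAM
            except ImportError:
                pass                        # CPU-only sketch environments
        _cache["name"], _cache["model"] = name, loader(name)
        return _cache["model"]
```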

Mathematical Notation

Challenge: TTS engines cannot pronounce LaTeX or mathematical symbols.

Solution: Preprocessing pipeline with language-aware substitution dictionaries:

  • Greek letters: α → "alpha", β → "beta"
  • Operators: ∫ → "integral", ∑ → "sum"
  • LaTeX: \frac{a}{b} → "a divided by b", x^2 → "x squared"

Implemented for English, Spanish, and Galician.
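
A reduced sketch of such a substitution pipeline. The tables here are tiny illustrative subsets, and the spoken templates ("divided by", "squared") are hard-coded to English; the real dictionaries are larger and per-language:

```python
import re

MATH_SUBS = {
    "en": {"α": "alpha", "β": "beta", "∫": "integral", "∑": "sum"},
    "es": {"α": "alfa", "β": "beta", "∫": "integral", "∑": "sumatorio"},
}

FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")   # \frac{a}{b}
SQUARED = re.compile(r"(\w+)\^2")                      # x^2

def speakable(text: str, lang: str = "en") -> str:
    """Rewrite LaTeX fragments and math symbols as pronounceable words."""
    text = FRAC.sub(r"\1 divided by \2", text)
    text = SQUARED.sub(r"\1 squared", text)
    for symbol, word in MATH_SUBS.get(lang, {}).items():
        text = text.replace(symbol, f" {word} ")
    return " ".join(text.split())  # collapse the padding whitespace
```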

Concurrency Model

  • Web requests: Async (FastAPI/Uvicorn)
  • TTS generation: Synchronous (thread pool executor)
  • Job processing: Background tasks with thread-safe job dictionary
  • Model operations: Protected by threading.Lock()
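
The async-web / sync-TTS split above can be sketched with run_in_executor (illustrative, not the project's code):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Blocking TTS calls run in a thread pool so the async event loop
# (FastAPI/Uvicorn) stays responsive while audio is being generated.
executor = ThreadPoolExecutor(max_workers=2)

def synthesize(text: str) -> str:
    """Stand-in for the blocking Piper/Coqui engine call."""
    return f"audio for: {text}"

async def handle_request(text: str) -> str:
    """Async handler: offload the blocking call and await its result."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, synthesize, text)
```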

Known Limitations

  • Single TTS operation at a time (hardware constraint)
  • Telegram file size limits: 20MB upload, 50MB download
  • In-memory job storage (lost on restart)
  • Basic password authentication (suitable for PoC, not production-grade security)
  • Limited to languages with available TTS models

Attribution

Galician TTS Models:

  • gl_ES-celtia, gl_ES-brais, gl_ES-icia: Proxecto Nós


License

Licensed under the Apache License, Version 2.0.
