A self-hosted text-to-speech system for converting documents (EPUB, PDF, MOBI, TXT) into audio files. Designed to make TTS research accessible, particularly for minority and under-resourced languages.
Author: Víctor Barreiro, Computer Engineer (victorbarreiro.github.io)
This project addresses three objectives:
Provide a functional system to make TTS research outputs usable for end users, with focus on languages typically underserved by commercial solutions.
Motivation: Research institutions produce high-quality TTS models for minority languages (Galician, Catalan, Basque, etc.), but these models often remain academic outputs without practical deployment tools. This system bridges that gap by providing a ready-to-use platform where research models can be integrated and made available to speakers of these languages.
Technical implementation:
- Model-agnostic architecture: supports both Piper (ONNX) and Coqui (PyTorch) engines
- Filesystem-based model discovery: add new models without code changes
- Semantic HTML and ARIA-compliant interface for screen reader compatibility
- Simplified workflow: upload → select voice → download audio
- Support for long-form content (books, academic papers)
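The filesystem-based discovery mentioned above can be sketched as follows. This is a minimal illustration, assuming Piper voices ship as `.onnx` files with a JSON config alongside and Coqui voices live in one subdirectory per model; the actual layout in `tts_engine.py` may differ.

```python
from pathlib import Path

def discover_models(piper_dir: str, coqui_dir: str) -> dict:
    """Scan model directories and return available voices per engine."""
    models = {"piper": [], "coqui": []}

    # Piper voices: an ONNX file plus a matching "<name>.onnx.json" config.
    piper_path = Path(piper_dir)
    if piper_path.is_dir():
        for onnx in sorted(piper_path.glob("*.onnx")):
            if (onnx.parent / (onnx.name + ".json")).exists():
                models["piper"].append(onnx.stem)

    # Coqui voices: assumed to be one subdirectory per model.
    coqui_path = Path(coqui_dir)
    if coqui_path.is_dir():
        for entry in sorted(coqui_path.iterdir()):
            if entry.is_dir():
                models["coqui"].append(entry.name)

    return models
```

Because discovery is a pure directory scan, dropping a new voice into the models folder makes it available on the next startup with no code changes.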
Use as a hands-on project to understand the complete development cycle of a production-ready system.
Learning objectives:
- Document format parsing (EPUB, PDF, MOBI)
- Asynchronous task processing and job queuing
- Resource management under hardware constraints (VRAM, CPU)
- Containerization and deployment
- Multiple interface implementation (Web UI + Telegram Bot)
- Mathematical notation handling in TTS context
Evaluate the integration of LLM-based tools (Claude, Gemini) in the development workflow.
Areas of assessment:
- Code generation quality and correctness
- Architectural decision support
- Debugging and refactoring assistance
- Documentation generation
- Trade-offs between development speed and code ownership
Backend:
- Python 3.10, FastAPI
- PyTorch, ONNX Runtime (both configured for GPU inference)
- Threading/asyncio for concurrent operations
TTS Engines:
- Piper (ONNX): Fast synthesis (GPU-accelerated in this implementation)
- Coqui TTS (PyTorch): High-quality synthesis (GPU-accelerated in this implementation)
Frontend:
- HTML5/JavaScript (Vanilla), TailwindCSS
- Telegram Bot (python-telegram-bot)
Infrastructure:
- Docker + Docker Compose
- NVIDIA GPU (required for this implementation)
- NVIDIA Container Toolkit
Note: Both TTS engines support CPU inference, but this implementation is configured for GPU acceleration. Code modifications would be needed to run on CPU-only systems.
┌─────────────────┐ ┌──────────────────┐
│ Web UI │ │ Telegram Bot │
│ (Port 8000) │ │ (Optional) │
└────────┬────────┘ └────────┬─────────┘
│ │
└───────────┬───────────┘
│
┌───────────▼────────────┐
│ FastAPI Backend │
│ - Job Queue │
│ - File Management │
│ - Semaphore Control │
└───────────┬────────────┘
│
┌───────────▼────────────┐
│ TTS Engine │
│ ┌──────────────────┐ │
│ │ Piper (GPU) │ │
│ │ Coqui (GPU) │ │
│ └──────────────────┘ │
└────────────────────────┘
Key Components:
- Semaphore-based queue: Prevents concurrent TTS operations (VRAM management)
- Background workers: Long-running book conversions don't block API
- Model auto-discovery: Filesystem-based detection of available voices
- Dynamic model loading: Load/unload PyTorch models to conserve GPU memory
Document Processing:
- Multi-format support: EPUB, PDF, MOBI, TXT
- Automatic chapter detection and segmentation
- Mathematical notation conversion (LaTeX → spoken form)
- Language-specific text normalization (ES, GL, EN)
Audio Generation:
- Multiple quality presets (maximum, high, balanced, fast, fastest)
- MP3 and WAV output formats
- Per-chapter audio files + complete ZIP archive
- Progress tracking with time estimation
Interfaces:
- Web UI with real-time job monitoring
- Telegram bot with password authentication
- RESTful API endpoints
Supported Languages:
Pre-installed models:
- English (US): `en_US-lessac-medium` (Piper)
- Spanish (ES): `es_ES-sharvard-medium` (Piper)
Additional models (require manual installation):
- Galician (GL): `gl_ES-celtia` (Proxecto Nós / Coqui)
- Galician (GL): `gl_ES-brais` (Proxecto Nós / Coqui)
- Galician (GL): `gl_ES-icia` (Proxecto Nós / Coqui)
Upload documents and configure output settings (format, quality, playback speed).
Real-time progress tracking with chapter detection and time estimation.
Browse and play individual chapters or download the complete audiobook as ZIP.
Full functionality on mobile devices with optimized layout.
- Docker Engine (>= 20.10)
- Docker Compose (>= 1.29)
- NVIDIA GPU (required for this implementation)
- NVIDIA Drivers (>= 470.x)
- NVIDIA Container Toolkit
Note: The system is configured for GPU acceleration. To run on CPU-only hardware, modify tts_engine.py (remove --cuda flag for Piper, set gpu=False for Coqui) and use onnxruntime instead of onnxruntime-gpu in requirements.txt.
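The GPU/CPU switch described in the note could be centralized behind an environment flag instead of code edits. This is a sketch only; `USE_GPU` is a hypothetical variable, not something the shipped code reads.

```python
import os

def onnx_providers() -> list[str]:
    """Choose ONNX Runtime execution providers from a USE_GPU env flag."""
    if os.environ.get("USE_GPU", "true").lower() == "true":
        # Listing CPU second lets onnxruntime fall back if CUDA is missing.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]
```

The provider list would then be passed when constructing the ONNX inference session, and an equivalent boolean could drive the `gpu=` argument for Coqui.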
- Clone repository:

      git clone https://github.com/your-username/cantarela.git
      cd cantarela
- TTS Models:

  Pre-installed (automatic during Docker build):
  - English (US): `en_US-lessac-medium` (Piper)
  - Spanish (ES): `es_ES-sharvard-medium` (Piper)

  Add additional models (optional):
  - Place Coqui models in `backend/models_data/coqui/`
  - Place additional Piper models in `backend/piper_models/`
The system auto-detects all available models on startup.
- Launch:

      docker-compose up --build
- Configure Authentication:

      cp .env.example .env

  Edit `.env` and configure passwords:

      # Enable/disable authentication (true/false)
      AUTH_ENABLED=true
      # Admin password (for future admin dashboard)
      ADMIN_PASSWORD=cantarela2026admin
      # User password (for regular access)
      USER_PASSWORD=cantarela2026
      # JWT secret (change this to a random string)
      JWT_SECRET=your-secret-key-change-this-in-production
      # Token expiration in hours (default: 48 hours)
      TOKEN_EXPIRATION_HOURS=48

  IMPORTANT: Change the default passwords before deploying to production!
- Access:
  - Web UI: `http://localhost:8000` (login page will appear if auth is enabled)
  - API docs: `http://localhost:8000/docs`
- Create bot via @BotFather
- Edit `.env` and add:

      TELEGRAM_BOT_TOKEN=your_token_here

  Note: The Telegram bot uses `USER_PASSWORD` from the authentication settings above. Users must enter the same password to access the bot.
- Restart:

      docker-compose restart
See TELEGRAM_BOT_SETUP.md for details.
- Navigate to `http://localhost:8000`
- Quick Generate Tab:
  - Paste text (< 4000 characters)
  - Select model, quality, format
  - Download audio file
- Book Converter Tab:
  - Upload document
  - Select voice model
  - Monitor progress
  - Download chapters individually or as ZIP
/start - Show commands
/generate - Convert text to speech
/convert - Upload book for conversion
/models - List available voices
/status - Check job progress
/cancel - Cancel running job
First-time users must authenticate with the configured password.
The system includes a token-based authentication system designed for alpha testing and controlled deployments:
Features:
- Password-based access control with two levels:
  - Admin: Future dashboard access (password configurable via `ADMIN_PASSWORD`)
  - User: Regular system access (password configurable via `USER_PASSWORD`)
- JWT tokens with configurable expiration (default: 48 hours)
- Activity logging tracks user behavior metadata (no content stored)
- Job isolation: Users can only access their own jobs (admins can access all)
- Easy enable/disable: Set `AUTH_ENABLED=false` in `.env` to disable entirely
To disable authentication:
    # In .env file
    AUTH_ENABLED=false

Activity Logging:
The system logs user activity metadata to `/logs/activity.log`:
- Login events (IP, OS, browser, timestamp)
- Generation requests (text length, model, settings)
- Book uploads (filename, size, format)
- Conversion completions (chapters, duration)
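A metadata-only logger along these lines could back the events above. This is a sketch; the record fields are illustrative and `activity_logger.py` may structure entries differently.

```python
import json
import time

def log_event(path: str, event: str, **metadata) -> None:
    """Append one JSON line of activity metadata (no user content stored)."""
    record = {"ts": time.time(), "event": event, **metadata}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

One JSON object per line keeps the log greppable and easy to parse incrementally without loading the whole file.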
Note: This is a basic authentication system suitable for PoC/alpha deployments. For production use with untrusted users, consider implementing:
- HTTPS/TLS encryption
- Password hashing with salts
- Rate limiting
- More sophisticated user management
Authentication Endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | `/auth/login` | Authenticate and receive JWT token |
| POST | `/auth/logout` | Logout (clears auth cookie) |
| GET | `/auth/verify` | Check authentication status |
| GET | `/auth/config` | Get auth configuration (public) |
TTS Endpoints (require authentication when enabled):
| Method | Endpoint | Description |
|---|---|---|
| GET | `/models` | List available TTS models |
| POST | `/generate` | Quick text-to-speech (synchronous) |
| POST | `/convert` | Upload book for conversion (async) |
| GET | `/status/{job_id}` | Get conversion job status |
| POST | `/cancel/{job_id}` | Cancel running job |
Admin Endpoints (require admin password):
| Method | Endpoint | Description |
|---|---|---|
| GET | `/admin/dashboard` | Admin dashboard (placeholder) |
cantarela/
├── backend/
│ ├── app/
│ │ ├── main.py # FastAPI application
│ │ ├── tts_engine.py # TTS wrapper (Piper/Coqui)
│ │ ├── epub_parser.py # Document parsing
│ │ ├── telegram_bot.py # Telegram bot integration
│ │ ├── auth_utils.py # JWT authentication utilities
│ │ ├── auth_middleware.py # Authentication middleware
│ │ └── activity_logger.py # Activity logging system
│ ├── models_data/ # Coqui models (not in repo)
│ ├── piper_models/ # Piper models (not in repo)
│ └── Dockerfile
├── frontend/
│ ├── index.html # Main web app
│ ├── login.html # Login page
│ └── js/app.js # Frontend JavaScript
├── output/ # Generated audio files
├── logs/ # Activity logs
├── docker-compose.yml
├── .env.example
└── README.md
Challenge: Running PyTorch TTS models on consumer GPUs with limited VRAM.
Solution:
- Single TTS semaphore: Only one model inference at a time
- Explicit model unloading: `torch.cuda.empty_cache()` after each job
- Model caching: Reuse loaded model if same as previous request
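The three measures combine roughly as follows. This sketch abstracts the engine behind `load`/`unload` callables so it stays framework-neutral; in the real code `unload` would also call `torch.cuda.empty_cache()`.

```python
import threading

_tts_semaphore = threading.Semaphore(1)  # only one inference at a time
_cache_lock = threading.Lock()
_cached = {"name": None, "model": None}

def synthesize(model_name: str, text: str, load, unload):
    """Run one synthesis job under the VRAM semaphore.

    Reuses the cached model when the requested voice matches the
    previous request; otherwise unloads the old model first so only
    one model ever occupies GPU memory.
    """
    with _tts_semaphore:
        with _cache_lock:
            if _cached["name"] != model_name:
                if _cached["model"] is not None:
                    unload(_cached["model"])
                _cached["model"] = load(model_name)
                _cached["name"] = model_name
            model = _cached["model"]
        return model(text)
```

The semaphore serializes inference while the cache avoids paying the model-load cost on every request, a useful trade-off when most jobs use the same voice.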
Challenge: TTS engines cannot pronounce LaTeX or mathematical symbols.
Solution: Preprocessing pipeline with language-aware substitution dictionaries:
- Greek letters: `α` → "alpha", `β` → "beta"
- Operators: `∫` → "integral", `∑` → "sum"
- LaTeX: `\frac{a}{b}` → "a divided by b", `x^2` → "x squared"
Implemented for English, Spanish, and Galician.
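The substitution pipeline can be sketched as below. The dictionaries here are small illustrative excerpts, not the project's full tables, and only the `\frac` pattern is handled as an example of the LaTeX rules.

```python
import re

# Illustrative per-language excerpts; the real dictionaries are larger.
MATH_SUBS = {
    "en": {"α": " alpha ", "β": " beta ", "∫": " integral ", "∑": " sum "},
    "es": {"α": " alfa ", "β": " beta ", "∫": " integral ", "∑": " sumatorio "},
}

FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")

def speak_math(text: str, lang: str = "en") -> str:
    """Replace math symbols and simple LaTeX with spoken equivalents."""
    # English wording for fractions is assumed here for simplicity;
    # a language-aware version would pick the connective per language.
    text = FRAC.sub(r"\1 divided by \2", text)
    for symbol, spoken in MATH_SUBS.get(lang, {}).items():
        text = text.replace(symbol, spoken)
    return " ".join(text.split())  # collapse the padding whitespace
```

Running the substitutions before synthesis means the TTS engine only ever sees plain pronounceable words.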
- Web requests: Async (FastAPI/Uvicorn)
- TTS generation: Synchronous (thread pool executor)
- Job processing: Background tasks with thread-safe job dictionary
- Model operations: Protected by threading.Lock()
- Single TTS operation at a time (hardware constraint)
- Telegram file size limits: 20MB upload, 50MB download
- In-memory job storage (lost on restart)
- Basic password authentication (suitable for PoC, not production-grade security)
- Limited to languages with available TTS models
Galician TTS Models:
- Developed by Proxecto Nós
- Licensed under CC-BY 4.0
- Source: HuggingFace/ProxectoNos
Licensed under the Apache License, Version 2.0.