victorbarreiro/cantarela

Cantarela: Local TTS Audiobook System

A self-hosted text-to-speech system for converting documents (EPUB, PDF, MOBI, TXT) into audio files. Designed to make TTS research accessible, particularly for minority and under-resourced languages.

Author: Víctor Barreiro, Computer Engineer (victorbarreiro.github.io)


Project Purpose

This project addresses three objectives:

1. Primary Goal: Accessibility Tool for Research-Based TTS

Provide a functional system to make TTS research outputs usable for end users, with focus on languages typically underserved by commercial solutions.

Motivation: Research institutions produce high-quality TTS models for minority languages (Galician, Catalan, Basque, etc.), but these models often remain academic outputs without practical deployment tools. This system bridges that gap by providing a ready-to-use platform where research models can be integrated and made available to speakers of these languages.

Technical implementation:

  • Model-agnostic architecture: supports both Piper (ONNX) and Coqui (PyTorch) engines
  • Filesystem-based model discovery: add new models without code changes
  • Semantic HTML and ARIA-compliant interface for screen reader compatibility
  • Simplified workflow: upload → select voice → download audio
  • Support for long-form content (books, academic papers)
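
The model-agnostic discovery above can be sketched as a filesystem scan at startup. This is a minimal illustration, not the project's actual code: the VoiceModel dataclass is invented here, and the directory layout mirrors the backend/piper_models and backend/models_data/coqui folders described later in this README.

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class VoiceModel:
    name: str
    engine: str  # "piper" or "coqui"
    path: Path

def discover_models(root: Path) -> list[VoiceModel]:
    """Scan known engine directories and register any models found,
    so new voices become available without code changes."""
    models = []
    # Piper voices ship as ONNX files with a JSON config alongside.
    piper_root = root / "piper_models"
    if piper_root.is_dir():
        for onnx in piper_root.glob("*.onnx"):
            if (onnx.parent / (onnx.name + ".json")).exists():
                models.append(VoiceModel(onnx.stem, "piper", onnx))
    # Coqui voices are assumed here to live one directory per voice.
    coqui_root = root / "models_data" / "coqui"
    if coqui_root.is_dir():
        for entry in coqui_root.iterdir():
            if entry.is_dir():
                models.append(VoiceModel(entry.name, "coqui", entry))
    return models
```

Because discovery reads the filesystem at startup, dropping a new voice into the right folder is all that is needed to expose it.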

2. Full-Stack Engineering Practice

Use as a hands-on project to understand the complete development cycle of a production-ready system.

Learning objectives:

  • Document format parsing (EPUB, PDF, MOBI)
  • Asynchronous task processing and job queuing
  • Resource management under hardware constraints (VRAM, CPU)
  • Containerization and deployment
  • Multiple interface implementation (Web UI + Telegram Bot)
  • Mathematical notation handling in TTS context

3. AI-Assisted Development Evaluation

Evaluate the integration of LLM-based tools (Claude, Gemini) in the development workflow.

Areas of assessment:

  • Code generation quality and correctness
  • Architectural decision support
  • Debugging and refactoring assistance
  • Documentation generation
  • Trade-offs between development speed and code ownership

Technical Stack

Backend:

  • Python 3.10, FastAPI
  • PyTorch, ONNX Runtime (both configured for GPU inference)
  • Threading/asyncio for concurrent operations

TTS Engines:

  • Piper (ONNX): Fast synthesis (GPU-accelerated in this implementation)
  • Coqui TTS (PyTorch): High-quality synthesis (GPU-accelerated in this implementation)

Frontend:

  • HTML5/JavaScript (Vanilla), TailwindCSS
  • Telegram Bot (python-telegram-bot)

Infrastructure:

  • Docker + Docker Compose
  • NVIDIA GPU (required for this implementation)
  • NVIDIA Container Toolkit

Note: Both TTS engines support CPU inference, but this implementation is configured for GPU acceleration. Code modifications would be needed to run on CPU-only systems.


System Architecture

┌─────────────────┐     ┌──────────────────┐
│   Web UI        │     │  Telegram Bot    │
│  (Port 8000)    │     │  (Optional)      │
└────────┬────────┘     └────────┬─────────┘
         │                       │
         └───────────┬───────────┘
                     │
         ┌───────────▼────────────┐
         │   FastAPI Backend      │
         │   - Job Queue          │
         │   - File Management    │
         │   - Semaphore Control  │
         └───────────┬────────────┘
                     │
         ┌───────────▼────────────┐
         │    TTS Engine          │
         │  ┌──────────────────┐  │
         │  │ Piper (GPU)      │  │
         │  │ Coqui (GPU)      │  │
         │  └──────────────────┘  │
         └────────────────────────┘

Key Components:

  • Semaphore-based queue: Prevents concurrent TTS operations (VRAM management)
  • Background workers: Long-running book conversions don't block API
  • Model auto-discovery: Filesystem-based detection of available voices
  • Dynamic model loading: Load/unload PyTorch models to conserve GPU memory
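
The semaphore-based queue and background workers can be sketched as follows. Function and field names here are illustrative, not taken from the project's main.py:

```python
import threading

# One permit: only one TTS inference may hold the GPU at a time.
tts_semaphore = threading.Semaphore(1)

def run_job(job_id, jobs, synthesize):
    """Background-worker body: wait for the GPU permit, then run the
    blocking synthesis call and record progress in the shared job dict."""
    jobs[job_id] = {"status": "queued"}
    with tts_semaphore:
        jobs[job_id]["status"] = "running"
        try:
            result = synthesize()
            jobs[job_id].update(status="done", result=result)
        except Exception as exc:
            jobs[job_id].update(status="error", error=str(exc))

def submit(job_id, jobs, synthesize):
    """Run the job in a daemon thread so API request handling never blocks."""
    worker = threading.Thread(target=run_job, args=(job_id, jobs, synthesize), daemon=True)
    worker.start()
    return worker
```

Clients then poll the job dictionary (via the /status endpoint) instead of waiting on the request.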

Features

Document Processing:

  • Multi-format support: EPUB, PDF, MOBI, TXT
  • Automatic chapter detection and segmentation
  • Mathematical notation conversion (LaTeX → spoken form)
  • Language-specific text normalization (ES, GL, EN)
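
As a rough illustration of chapter segmentation on plain text (the real parser works from EPUB/PDF document structure; the heading regex here is only a stand-in):

```python
import re

# Matches heading-like lines such as "Chapter 1" at the start of a line.
CHAPTER_RE = re.compile(r"^(chapter\s+\d+.*)$", re.IGNORECASE | re.MULTILINE)

def split_chapters(text: str) -> list[tuple[str, str]]:
    """Split plain text into (title, body) pairs at chapter headings.
    Falls back to a single segment when no headings are found."""
    parts = CHAPTER_RE.split(text)
    if len(parts) == 1:
        return [("Full text", text.strip())]
    # parts = [preamble, title1, body1, title2, body2, ...]
    return [(parts[i].strip(), parts[i + 1].strip())
            for i in range(1, len(parts), 2)]
```

Each (title, body) pair then becomes one synthesis job and one output audio file.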

Audio Generation:

  • Multiple quality presets (maximum, high, balanced, fast, fastest)
  • MP3 and WAV output formats
  • Per-chapter audio files + complete ZIP archive
  • Progress tracking with time estimation

Interfaces:

  • Web UI with real-time job monitoring
  • Telegram bot with password authentication
  • RESTful API endpoints

Supported Languages:

Pre-installed models:

  • English (US): en_US-lessac-medium (Piper)
  • Spanish (ES): es_ES-sharvard-medium (Piper)

Additional models (require manual installation):

  • Galician (GL): gl_ES-celtia (Proxecto Nós / Coqui)
  • Galician (GL): gl_ES-brais (Proxecto Nós / Coqui)
  • Galician (GL): gl_ES-icia (Proxecto Nós / Coqui)

Screenshots

Book Converter Interface

Upload documents and configure output settings (format, quality, playback speed).

Conversion Progress

Real-time progress tracking with chapter detection and time estimation.

Conversion Complete

Browse and play individual chapters or download the complete audiobook as ZIP.

Mobile Responsive

Full functionality on mobile devices with optimized layout.


Installation

Prerequisites

  • Docker Engine (>= 20.10)
  • Docker Compose (>= 1.29)
  • NVIDIA GPU (required for this implementation)
  • NVIDIA Drivers (>= 470.x)
  • NVIDIA Container Toolkit

Note: The system is configured for GPU acceleration. To run on CPU-only hardware, modify tts_engine.py (remove --cuda flag for Piper, set gpu=False for Coqui) and use onnxruntime instead of onnxruntime-gpu in requirements.txt.
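
The toggle described in the note can be sketched like this. The --cuda flag comes from Piper's CLI and the gpu keyword from Coqui's TTS constructor; the helper functions themselves are illustrative, not the actual tts_engine.py:

```python
USE_GPU = False  # flip to True on hosts with an NVIDIA GPU + CUDA runtime

def piper_command(model_path: str, out_wav: str) -> list[str]:
    """Build a Piper CLI invocation, adding --cuda only when GPU is enabled."""
    cmd = ["piper", "--model", model_path, "--output_file", out_wav]
    if USE_GPU:
        cmd.append("--cuda")  # requires onnxruntime-gpu in requirements.txt
    return cmd

def coqui_kwargs() -> dict:
    """Keyword arguments passed through to Coqui's TTS(...) constructor."""
    return {"gpu": USE_GPU}
```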

Quick Start

  1. Clone repository:

    git clone https://github.com/your-username/cantarela.git
    cd cantarela
  2. TTS Models:

    Pre-installed (automatic during Docker build):

    • English (US): en_US-lessac-medium (Piper)
    • Spanish (ES): es_ES-sharvard-medium (Piper)

    Add additional models (optional):

    • Place Coqui models in backend/models_data/coqui/
    • Place additional Piper models in backend/piper_models/

    The system auto-detects all available models on startup.

  3. Launch:

    docker-compose up --build
  4. Configure Authentication:

    cp .env.example .env

    Edit .env and configure passwords:

    # Enable/disable authentication (true/false)
    AUTH_ENABLED=true
    
    # Admin password (for future admin dashboard)
    ADMIN_PASSWORD=cantarela2026admin
    
    # User password (for regular access)
    USER_PASSWORD=cantarela2026
    
    # JWT secret (change this to a random string)
    JWT_SECRET=your-secret-key-change-this-in-production
    
    # Token expiration in hours (default: 48 hours)
    TOKEN_EXPIRATION_HOURS=48
    

    IMPORTANT: Change the default passwords before deploying to production!

  5. Access:

    • Web UI: http://localhost:8000 (login page will appear if auth is enabled)
    • API docs: http://localhost:8000/docs

Optional: Telegram Bot

  1. Create bot via @BotFather

  2. Edit .env file and add:

    TELEGRAM_BOT_TOKEN=your_token_here
    

    Note: The Telegram bot will use USER_PASSWORD from the authentication settings above. Users must enter the same password to access the bot.

  3. Restart: docker-compose restart

See TELEGRAM_BOT_SETUP.md for details.


Usage

Web Interface

  1. Navigate to http://localhost:8000

  2. Quick Generate Tab:

    • Paste text (< 4000 characters)
    • Select model, quality, format
    • Download audio file
  3. Book Converter Tab:

    • Upload document
    • Select voice model
    • Monitor progress
    • Download chapters individually or as ZIP

Telegram Bot

/start    - Show commands
/generate - Convert text to speech
/convert  - Upload book for conversion
/models   - List available voices
/status   - Check job progress
/cancel   - Cancel running job

First-time users must authenticate with the configured password.


Authentication

The system includes a token-based authentication system designed for alpha testing and controlled deployments:

Features:

  • Password-based access control with two levels:
    • Admin: Future dashboard access (password: configurable via ADMIN_PASSWORD)
    • User: Regular system access (password: configurable via USER_PASSWORD)
  • JWT tokens with configurable expiration (default: 48 hours)
  • Activity logging tracks user behavior metadata (no content stored)
  • Job isolation: Users can only access their own jobs (admins can access all)
  • Easy enable/disable: Set AUTH_ENABLED=false in .env to disable entirely

To disable authentication:

# In .env file
AUTH_ENABLED=false

Activity Logging: The system logs user activity metadata to /logs/activity.log:

  • Login events (IP, OS, browser, timestamp)
  • Generation requests (text length, model, settings)
  • Book uploads (filename, size, format)
  • Conversion completions (chapters, duration)

Note: This is a basic authentication system suitable for PoC/alpha deployments. For production use with untrusted users, consider implementing:

  • HTTPS/TLS encryption
  • Password hashing with salts
  • Rate limiting
  • More sophisticated user management

API Endpoints

Authentication Endpoints:

Method  Endpoint       Description
POST    /auth/login    Authenticate and receive JWT token
POST    /auth/logout   Logout (clears auth cookie)
GET     /auth/verify   Check authentication status
GET     /auth/config   Get auth configuration (public)

TTS Endpoints (require authentication when enabled):

Method  Endpoint            Description
GET     /models             List available TTS models
POST    /generate           Quick text-to-speech (synchronous)
POST    /convert            Upload book for conversion (async)
GET     /status/{job_id}    Get conversion job status
POST    /cancel/{job_id}    Cancel running job
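
A minimal Python client for the login-then-generate flow might look like this. The JSON field names ("password", "token", "text", "model") are assumptions; check the interactive docs at /docs for the actual request schemas.

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def auth_header(token: str) -> dict:
    """Bearer-token header for authenticated TTS endpoints."""
    return {"Authorization": f"Bearer {token}"}

def login(password: str) -> str:
    """POST /auth/login and return the JWT."""
    req = urllib.request.Request(
        f"{BASE}/auth/login",
        data=json.dumps({"password": password}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]

def generate(token: str, text: str, model: str) -> bytes:
    """POST /generate with a bearer token and return the audio bytes."""
    req = urllib.request.Request(
        f"{BASE}/generate",
        data=json.dumps({"text": text, "model": model}).encode(),
        headers={"Content-Type": "application/json", **auth_header(token)},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```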

Admin Endpoints (require admin password):

Method Endpoint Description
GET /admin/dashboard Admin dashboard (placeholder)

Project Structure

cantarela/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI application
│   │   ├── tts_engine.py        # TTS wrapper (Piper/Coqui)
│   │   ├── epub_parser.py       # Document parsing
│   │   ├── telegram_bot.py      # Telegram bot integration
│   │   ├── auth_utils.py        # JWT authentication utilities
│   │   ├── auth_middleware.py   # Authentication middleware
│   │   └── activity_logger.py   # Activity logging system
│   ├── models_data/             # Coqui models (not in repo)
│   ├── piper_models/            # Piper models (not in repo)
│   └── Dockerfile
├── frontend/
│   ├── index.html               # Main web app
│   ├── login.html               # Login page
│   └── js/app.js                # Frontend JavaScript
├── output/                      # Generated audio files
├── logs/                        # Activity logs
├── docker-compose.yml
├── .env.example
└── README.md

Implementation Notes

Resource Management

Challenge: Running PyTorch TTS models on consumer GPUs with limited VRAM.

Solution:

  • Single TTS semaphore: Only one model inference at a time
  • Explicit model unloading: torch.cuda.empty_cache() after each job
  • Model caching: Reuse loaded model if same as previous request
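
A minimal sketch of the load/reuse/unload pattern, with illustrative names (the real engine wraps actual Piper/Coqui loaders):

```python
import threading

_model_lock = threading.Lock()
_cache = {"name": None, "model": None}

def get_model(name, loader):
    """Return the cached model when the same voice is requested again;
    otherwise drop the old model and release cached VRAM before loading."""
    with _model_lock:
        if _cache["name"] == name:
            return _cache["model"]
        if _cache["model"] is not None:
            _cache["model"] = None          # drop the Python reference first
            try:
                import torch
                torch.cuda.empty_cache()    # then release PyTorch's cached VRAM
            except ImportError:
                pass                        # CPU-only sketch environments
        _cache["name"], _cache["model"] = name, loader(name)
        return _cache["model"]
```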

Mathematical Notation

Challenge: TTS engines cannot pronounce LaTeX or mathematical symbols.

Solution: Preprocessing pipeline with language-aware substitution dictionaries:

  • Greek letters: α → "alpha", β → "beta"
  • Operators: ∫ → "integral", ∑ → "sum"
  • LaTeX: \frac{a}{b} → "a divided by b", x^2 → "x squared"

Implemented for English, Spanish, and Galician.
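
A reduced sketch of such a substitution pipeline. The tables here are tiny illustrative subsets, and the spoken templates ("divided by", "squared") are hard-coded to English; the real dictionaries are larger and per-language:

```python
import re

MATH_SUBS = {
    "en": {"α": "alpha", "β": "beta", "∫": "integral", "∑": "sum"},
    "es": {"α": "alfa", "β": "beta", "∫": "integral", "∑": "sumatorio"},
}

FRAC = re.compile(r"\\frac\{([^{}]*)\}\{([^{}]*)\}")   # \frac{a}{b}
SQUARED = re.compile(r"(\w+)\^2")                      # x^2

def speakable(text: str, lang: str = "en") -> str:
    """Rewrite LaTeX fragments and math symbols as pronounceable words."""
    text = FRAC.sub(r"\1 divided by \2", text)
    text = SQUARED.sub(r"\1 squared", text)
    for symbol, word in MATH_SUBS.get(lang, {}).items():
        text = text.replace(symbol, f" {word} ")
    return " ".join(text.split())  # collapse the padding whitespace
```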

Concurrency Model

  • Web requests: Async (FastAPI/Uvicorn)
  • TTS generation: Synchronous (thread pool executor)
  • Job processing: Background tasks with thread-safe job dictionary
  • Model operations: Protected by threading.Lock()
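
The async-web / sync-TTS split above can be sketched with run_in_executor (illustrative, not the project's code):

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Blocking TTS calls run in a thread pool so the async event loop
# (FastAPI/Uvicorn) stays responsive while audio is being generated.
executor = ThreadPoolExecutor(max_workers=2)

def synthesize(text: str) -> str:
    """Stand-in for the blocking Piper/Coqui engine call."""
    return f"audio for: {text}"

async def handle_request(text: str) -> str:
    """Async handler: offload the blocking call and await its result."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, synthesize, text)
```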

Known Limitations

  • Single TTS operation at a time (hardware constraint)
  • Telegram file size limits: 20MB upload, 50MB download
  • In-memory job storage (lost on restart)
  • Basic password authentication (suitable for PoC, not production-grade security)
  • Limited to languages with available TTS models

Attribution

Galician TTS Models:

  • gl_ES-celtia, gl_ES-brais, gl_ES-icia: Proxecto Nós


License

Licensed under the Apache License, Version 2.0.
