
Audify


Convert ebooks and PDFs to audiobooks using AI text-to-speech and translation services.

Audify is an API-based system that transforms written content into high-quality audio using:

  • Multiple TTS Providers - Choose from Kokoro (local), Qwen-TTS (local), OpenAI, AWS Polly, or Google Cloud TTS
  • Ollama + LiteLLM for intelligent translation
  • LLM-powered audiobook generation for engaging audio content

πŸš€ Features

  • πŸ“š Multiple Formats: Convert EPUB ebooks, PDF documents, TXT, and MD files
  • πŸ“ Directory Processing: Create audiobooks from multiple files in a directory
  • πŸŽ™οΈ Audiobook Creation: Generate audiobook-style content from books using LLM
  • οΏ½ Multiple TTS Providers: Choose from Kokoro (local), Qwen-TTS (local), OpenAI, AWS Polly, or Google Cloud TTS
  • 🌍 Multi-language Support: Translate content
  • 🎡 High-Quality TTS: Natural-sounding speech with multiple provider options
  • βš™οΈ Flexible Configuration: Environment-based settings and .keys file support

πŸ“‹ Prerequisites

Core Requirements

For Local TTS Providers (Optional)

Kokoro TTS

  • Docker & Docker Compose (for API services)
  • CUDA-capable GPU (recommended for optimal performance)

Qwen-TTS

  • Qwen-TTS API Server running on port 8890 (see Qwen3-TTS)
  • CUDA-capable GPU (recommended for optimal performance)

For Cloud TTS Providers (Optional)

  • OpenAI TTS: OpenAI API key (get one here)
  • AWS Polly: AWS account with access keys (AWS setup)
  • Google Cloud TTS: Google Cloud project with credentials (GCP setup)

🐳 Quick Start with Docker (For Kokoro TTS)

Note: Docker is only required if you want to use the local Kokoro TTS provider. For Qwen-TTS, you'll need to run the Qwen-TTS API separately (see Qwen-TTS Setup below). You can skip to "Quick Start with Cloud TTS" if you prefer using OpenAI, AWS Polly, or Google Cloud TTS.

1. Clone and Setup

git clone https://github.com/garciadias/audify.git
cd audify

2. Start API Services

# Start Kokoro TTS and Ollama services
docker compose up -d

# Wait for services to be ready (~2-3 minutes)
# Check status: docker compose ps

3. Install Python Dependencies

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync

4. Setup Ollama Models

# Pull required models for translation and audiobook generation
docker compose exec ollama ollama pull qwen3:30b

# Or use lighter models for testing:
# docker compose exec ollama ollama pull llama3.2:3b

5. Convert Your First Book

# Convert EPUB to audiobook (using Kokoro TTS)
task run path/to/your/book.epub

# Convert PDF to audiobook
task run path/to/your/document.pdf

# Create audiobook from EPUB
task audiobook path/to/your/book.epub

πŸš€ Quick Start with Qwen-TTS (Local)

Qwen-TTS is a high-quality, free, and privacy-friendly local TTS solution with excellent multilingual support.

1. Setup Qwen-TTS API

First, set up the Qwen-TTS API server (requires GPU):

# Clone Qwen-TTS API repository
git clone https://github.com/QwenLM/Qwen3-TTS
cd Qwen3-TTS

# Start with Docker (recommended)
make up

# The API will be available at http://localhost:8890

For detailed setup instructions, see the Qwen3-TTS documentation.

2. Install Audify

git clone https://github.com/garciadias/audify.git
cd audify
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync

3. Configure Qwen-TTS

Create a .keys file:

TTS_PROVIDER=qwen
QWEN_API_URL=http://localhost:8890
QWEN_TTS_VOICE=Vivian

4. Convert Your First Book

# Convert using Qwen-TTS
task run path/to/your/book.epub

# Or specify provider explicitly
task --tts-provider qwen run path/to/your/book.epub

πŸš€ Quick Start with Cloud TTS

If you prefer to use cloud TTS providers without Docker:

1. Clone and Install

git clone https://github.com/garciadias/audify.git
cd audify
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync

2. Configure Your TTS Provider

Create a .keys file with your credentials:

cp .keys.example .keys
# Edit .keys and add your provider credentials
# See Configuration section for details

3. Convert Books with Cloud TTS

# Using OpenAI TTS
task --tts-provider openai run "book.epub"

# Using AWS Polly
task --tts-provider aws run "book.epub"

# Using Google Cloud TTS
task --tts-provider google run "book.epub"

πŸ“– Usage Examples

Basic Audiobook Conversion

# English EPUB to audiobook
task run "book.epub"

# PDF with specific language
task --language pt run "document.pdf"

# With translation (English to Spanish)
task --language en --translate es run "book.epub"

Audiobook Generation

# Create audiobook from EPUB
task audiobook "book.epub"

# Limit to first 5 chapters
task audiobook "book.epub" --max-chapters 5

# Custom voice and language
task audiobook "book.epub" --voice af_bella --language en

# With translation
task audiobook "book.epub" --translate pt

Using Commercial APIs (DeepSeek, Claude, GPT-4, Gemini)

Instead of local Ollama models, you can use commercial APIs for better quality or faster processing:

# Using DeepSeek (cost-effective)
task audiobook "book.epub" -m "api:deepseek/deepseek-chat"

# Using Claude 3.5 Sonnet (high quality)
task audiobook "book.epub" -m "api:anthropic/claude-3-5-sonnet-20240620"

# Using GPT-4 (reliable)
task audiobook "book.epub" -m "api:openai/gpt-4-turbo-preview"

# Using Gemini Pro
task audiobook "book.epub" -m "api:gemini/gemini-1.5-pro"

Setup Required: Create a .keys file with your API keys for the provider(s) you intend to use. See Commercial APIs Guide for detailed instructions.

# Copy example file and add your keys
cp .keys.example .keys
# Edit .keys and add keys for your chosen provider(s):
# DEEPSEEK=your-deepseek-api-key-here
# ANTHROPIC=your-anthropic-api-key-here
# OPENAI=your-openai-api-key-here
# GEMINI=your-google-api-key-here

Directory Input (Multi-file Processing)

Process multiple files from a directory into a single audiobook:

# Create audiobook from directory of files
task audiobook "path/to/directory/"

# Process directory with translation
task --translate es audiobook "path/to/articles/" 

# Directory with custom voice
task --voice af_bella --language en audiobook "path/to/papers/" 

Supported file types in directory: EPUB, PDF, TXT, MD

The directory mode will:

  • Process each file as a separate episode
  • Use the filename as the episode title
  • Combine all episodes into a single M4B audiobook with chapter markers
  • Synthesize the title audio for each episode

Advanced Options

# List available languages
task run --list-languages

# List available TTS models
task --list-models run

# Save extracted text
task --save-text run "book.epub"

# Skip confirmation prompts
task -y run "book.epub"

# Use different TTS provider
task --tts-provider openai run "book.epub"    # OpenAI TTS
task --tts-provider aws run "book.epub"       # AWS Polly
task --tts-provider google run "book.epub"    # Google Cloud TTS
task --tts-provider qwen run "book.epub"      # Qwen-TTS (local)

# List available TTS providers
task --list-tts-providers run

βš™οΈ Configuration

TTS Provider Configuration

Audify supports multiple TTS providers. Configure your preferred provider using environment variables or a .keys file:

Option 1: Using .keys File (Recommended)

Create a .keys file in the project root:

cp .keys.example .keys

Edit .keys and add your credentials:

# OpenAI TTS
OPENAI_API_KEY=sk-your-openai-api-key
OPENAI_TTS_MODEL=tts-1-hd
OPENAI_TTS_VOICE=alloy

# AWS Polly
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
AWS_REGION=us-east-1
AWS_POLLY_VOICE=Joanna
AWS_POLLY_ENGINE=neural

# Google Cloud TTS
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
GOOGLE_TTS_VOICE=en-US-Neural2-F
GOOGLE_TTS_LANGUAGE_CODE=en-US

# Qwen-TTS (Local)
QWEN_API_URL=http://localhost:8890
QWEN_TTS_VOICE=Vivian

# Default TTS Provider
TTS_PROVIDER=kokoro  # Options: kokoro, qwen, openai, aws, google

Option 2: Environment Variables

# Kokoro TTS API (Local)
export KOKORO_API_URL="http://localhost:8887/v1/audio"

# OpenAI TTS
export OPENAI_API_KEY="sk-your-key"
export OPENAI_TTS_MODEL="tts-1-hd"  # or "tts-1"
export OPENAI_TTS_VOICE="alloy"     # alloy, echo, fable, onyx, nova, shimmer

# AWS Polly
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"
export AWS_POLLY_VOICE="Joanna"     # Neural voices recommended
export AWS_POLLY_ENGINE="neural"    # "standard" or "neural"

# Google Cloud TTS
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export GOOGLE_TTS_VOICE="en-US-Neural2-F"
export GOOGLE_TTS_LANGUAGE_CODE="en-US"

# Qwen-TTS (Local)
export QWEN_API_URL="http://localhost:8890"
export QWEN_TTS_VOICE="Vivian"

# Default Provider
export TTS_PROVIDER="kokoro"  # Options: kokoro, qwen, openai, aws, google

# Ollama Configuration
export OLLAMA_API_BASE_URL="http://localhost:11434"
export OLLAMA_TRANSLATION_MODEL="qwen3:30b"
export OLLAMA_MODEL="magistral:24b"

Choosing a TTS Provider

| Provider | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Kokoro (Local) | Free, privacy-friendly, GPU-accelerated | Requires local setup | Development, privacy-sensitive projects |
| Qwen-TTS (Local) | Free, privacy-friendly, GPU-accelerated, multilingual | Requires separate API setup | Multilingual projects, privacy-sensitive content |
| OpenAI | High quality, easy setup | Pay per character | Production, high-quality output |
| AWS Polly | Neural voices, scalable | AWS account required | Enterprise, AWS-integrated projects |
| Google Cloud TTS | Natural voices, many languages | GCP account required | Multi-language projects |

Docker Services

The docker-compose.yml configures (only needed for local/Kokoro TTS):

  • Kokoro TTS: Port 8887 (GPU-accelerated speech synthesis, local)
  • Ollama: Port 11434 (LLM for translation and audiobook generation, optional)

Note: Docker services are only required for Kokoro (local TTS). Commercial TTS providers (OpenAI, AWS, Google) and LLM APIs (DeepSeek, Claude, GPT-4, Gemini) work without Docker.

πŸ“ Output Structure

data/output/
β”œβ”€β”€ [book_name]/
β”‚   β”œβ”€β”€ chapters.txt           # Book metadata
β”‚   β”œβ”€β”€ cover.jpg              # Book cover image
β”‚   β”œβ”€β”€ chapters_001.mp3       # Individual chapter audio
β”‚   β”œβ”€β”€ chapters_002.mp3
β”‚   β”œβ”€β”€ chapters_003.mp3
β”‚   β”œβ”€β”€ ...                    # More chapters
β”‚   └── book_name.m4b          # Final audiobook
β”‚
└── audiobooks/
    └── [book_name]/
        β”œβ”€β”€ episodes/
        β”‚   β”œβ”€β”€ episode_001.mp3     # Audiobook episodes
        β”‚   β”œβ”€β”€ episode_002.mp3
        β”‚   └── ...
        β”œβ”€β”€ scripts/                # Generated scripts
        β”‚   β”œβ”€β”€ episode_001_script.txt
        β”‚   β”œβ”€β”€ original_text_001.txt
        β”‚   └── ...
        β”œβ”€β”€ chapters.txt            # FFmpeg metadata
        └── [book_name].m4b         # Final M4B audiobook

Directory audiobook output:

data/output/
└── [directory_name]/
    β”œβ”€β”€ episodes/
    β”‚   β”œβ”€β”€ episode_001.mp3     # Episode from first file
    β”‚   β”œβ”€β”€ episode_002.mp3     # Episode from second file
    β”‚   └── ...
    β”œβ”€β”€ scripts/
    β”‚   β”œβ”€β”€ episode_001_script.txt
    β”‚   └── ...
    β”œβ”€β”€ chapters.txt            # Chapter metadata
    └── [directory_name].m4b    # Combined audiobook
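The `chapters.txt` file feeds ffmpeg's FFMETADATA format, which marks each episode as a chapter with millisecond start/end times. The exact file Audify writes may differ; as a sketch, a chapter listing can be generated from (title, duration) pairs like this:

```python
def chapters_metadata(episodes: list[tuple[str, float]]) -> str:
    """Build an ffmpeg FFMETADATA chapter listing from (title, seconds) pairs."""
    lines = [";FFMETADATA1"]
    start_ms = 0
    for title, seconds in episodes:
        end_ms = start_ms + int(seconds * 1000)
        lines += [
            "[CHAPTER]",
            "TIMEBASE=1/1000",  # START/END are in milliseconds
            f"START={start_ms}",
            f"END={end_ms}",
            f"title={title}",
        ]
        start_ms = end_ms  # chapters are contiguous
    return "\n".join(lines) + "\n"
```

Passing such a file to `ffmpeg -i input.mp3 -i chapters.txt -map_metadata 1 ...` embeds clickable chapter markers in the final M4B.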

πŸ› οΈ Development

Available Tasks

task test        # Run tests with coverage
task format      # Format code with ruff
task run         # Convert ebook to audiobook
task audiobook   # Create audiobook from content
task up          # Start Docker services

Local Development Setup

# Install development dependencies
uv sync --group dev

# Run tests
task test

# Format code
task format

# Type checking (included in pre_test)
mypy ./audify ./tests --ignore-missing-imports

πŸ—οΈ Architecture

Audify uses a flexible multi-provider architecture supporting both local and cloud services:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Audify CLI       β”‚
β”‚ β€’ EPUB/PDF Read     β”‚
β”‚ β€’ Text Process      β”‚
β”‚ β€’ Audio Combine     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
          β”œβ”€β”€ TTS Providers
          β”‚   β”œβ”€ Kokoro (local)
          β”‚   β”œβ”€ Qwen-TTS (local)
          β”‚   β”œβ”€ OpenAI TTS
          β”‚   β”œβ”€ AWS Polly
          β”‚   └─ Google Cloud TTS
          β”‚
          └── LLM APIs
              β”œβ”€ Ollama (local)
              β”œβ”€ DeepSeek
              β”œβ”€ Claude
              β”œβ”€ GPT-4
              └─ Gemini

Key Components

  • Text Extraction: EPUB/PDF parsing with chapter detection
  • Translation: LiteLLM + Commercial/Local LLMs for high-quality translation
  • TTS: Multi-provider support (Kokoro, OpenAI, AWS Polly, Google Cloud TTS)
  • Audiobook Generation: LLM-powered script creation with commercial API support
  • Audio Processing: Pydub for format conversion and combining
  • API Management: Unified API key management via .keys file or environment variables

🌍 Supported Languages

Primary: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Hungarian, Korean, Japanese, Hindi

Translation: Any language pair supported by your Ollama model

πŸ”§ Troubleshooting

Common Issues

Services not responding (Docker/Kokoro):

# Check service status
docker compose ps

# Restart services
docker compose restart

# Check logs
docker compose logs kokoro
docker compose logs ollama

Commercial API errors:

# Verify API key configuration
cat .keys

# Test API connectivity
uv run audify translate test.txt --model api:deepseek/deepseek-chat

# Check API key is loaded
# The system will show an error if the API key is missing or invalid

TTS Provider issues:

# List available TTS providers
uv run audify --list-tts-providers

# Test specific provider
uv run audify translate test.txt --tts-provider openai

# Check provider credentials in .keys file
# OpenAI: OPENAI_API_KEY
# AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
# Google: GOOGLE_APPLICATION_CREDENTIALS (path to JSON file)

Ollama model not found:

# List available models
docker compose exec ollama ollama list

# Pull required model
docker compose exec ollama ollama pull qwen3:30b

GPU issues:

# Check GPU availability
docker compose exec kokoro nvidia-smi

# If no GPU, services will run on CPU (slower)

Performance Tips

  • Use SSD storage for model caching
  • Ensure adequate GPU memory (8GB+ recommended) for Kokoro
  • Use lighter models for testing: llama3.2:3b instead of magistral:24b
  • Commercial TTS providers (OpenAI, AWS, Google) are faster than local Kokoro
  • Commercial LLM APIs often provide better latency than local Ollama
  • Consider running local services on separate machines for large workloads
  • Use cloud providers for production workloads requiring high reliability

πŸ“š Examples

Check the examples/ directory for sample usage patterns and configuration files.

🀝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: task test
  5. Submit a pull request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments
