Skip to content

Latest commit

 

History

History
614 lines (444 loc) · 17.2 KB

File metadata and controls

614 lines (444 loc) · 17.2 KB

Audify

codecov Tests

Convert ebooks and PDFs to audiobooks using AI text-to-speech and translation services.

Audify is a API-based system that transforms written content into high-quality audio using:

  • Multiple TTS Providers - Choose from Kokoro (local), Qwen-TTS (local), OpenAI, AWS Polly, or Google Cloud TTS
  • Ollama + LiteLLM for intelligent translation
  • LLM-powered audiobook generation for engaging audio content

🚀 Features

  • 📚 Multiple Formats: Convert EPUB ebooks, PDF documents, TXT, and MD files
  • 📁 Directory Processing: Create audiobooks from multiple files in a directory
  • 🎙️ Audiobook Creation: Generate audiobook-style content from books using LLM
  • � Multiple TTS Providers: Choose from Kokoro (local), Qwen-TTS (local), OpenAI, AWS Polly, or Google Cloud TTS
  • 🌍 Multi-language Support: Translate content
  • 🎵 High-Quality TTS: Natural-sounding speech with multiple provider options
  • ⚙️ Flexible Configuration: Environment-based settings and .keys file support

📋 Prerequisites

Core Requirements

For Local TTS Providers (Optional)

Kokoro TTS

  • Docker & Docker Compose (for API services)
  • CUDA-capable GPU (recommended for optimal performance)

Qwen-TTS

  • Qwen-TTS API Server running on port 8890 (see Qwen3-TTS)
  • CUDA-capable GPU (recommended for optimal performance)

For Cloud TTS Providers (Optional)

  • OpenAI TTS: OpenAI API key (get one here)
  • AWS Polly: AWS account with access keys (AWS setup)
  • Google Cloud TTS: Google Cloud project with credentials (GCP setup)

🐳 Quick Start with Docker (For Kokoro TTS)

Note: Docker is only required if you want to use the local Kokoro TTS provider. For Qwen-TTS, you'll need to run the Qwen-TTS API separately (see Qwen-TTS Setup below). You can skip to "Quick Start with Cloud TTS" if you prefer using OpenAI, AWS Polly, or Google Cloud TTS.

1. Clone and Setup

git clone https://github.com/garciadias/audify.git
cd audify

2. Start API Services

# Start Kokoro TTS and Ollama services
docker compose up -d

# Wait for services to be ready (~2-3 minutes)
# Check status: docker compose ps

3. Install Python Dependencies

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync

4. Setup Ollama Models

# Pull required models for translation and audiobook generation
docker compose exec ollama ollama pull qwen3:30b

# Or use lighter models for testing:
# docker compose exec ollama ollama pull llama3.2:3b

5. Convert Your First Book

# Convert EPUB to audiobook (using Kokoro TTS)
task run path/to/your/book.epub

# Convert PDF to audiobook
task run path/to/your/document.pdf

# Create audiobook from EPUB
task audiobook path/to/your/book.epub

🚀 Quick Start with Qwen-TTS (Local)

Qwen-TTS is a high-quality, free, and privacy-friendly local TTS solution with excellent multilingual support.

1. Setup Qwen-TTS API

First, set up the Qwen-TTS API server (requires GPU):

# Clone Qwen-TTS API repository
git clone https://github.com/QwenLM/Qwen3-TTS
cd Qwen3-TTS

# Start with Docker (recommended)
make up

# The API will be available at http://localhost:8890

For detailed setup instructions, see the Qwen3-TTS documentation.

2. Install Audify

git clone https://github.com/garciadias/audify.git
cd audify
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync

3. Configure Qwen-TTS

Create a .keys file:

TTS_PROVIDER=qwen
QWEN_API_URL=http://localhost:8890
QWEN_TTS_VOICE=Vivian

4. Convert Your First Book

# Convert using Qwen-TTS
task run path/to/your/book.epub

# Or specify provider explicitly
task --tts-provider qwen run path/to/your/book.epub

🚀 Quick Start with Cloud TTS

If you prefer to use cloud TTS providers without Docker:

1. Clone and Install

git clone https://github.com/garciadias/audify.git
cd audify
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync

2. Configure Your TTS Provider

Create a .keys file with your credentials:

cp .keys.example .keys
# Edit .keys and add your provider credentials
# See Configuration section for details

3. Convert Books with Cloud TTS

# Using OpenAI TTS
task --tts-provider openai run "book.epub"

# Using AWS Polly
task --tts-provider aws run "book.epub"

# Using Google Cloud TTS
task --tts-provider google run "book.epub"

📖 Usage Examples

Basic Audiobook Conversion

# English EPUB to audiobook
task run "book.epub"

# PDF with specific language
task --language pt run "document.pdf"

# With translation (English to Spanish)
task --language en --translate es run "book.epub"

Audiobook Generation

# Create audiobook from EPUB
task audiobook "book.epub"

# Limit to first 5 chapters
task audiobook "book.epub" --max-chapters 5

# Custom voice and language
task audiobook "book.epub" --voice af_bella --language en

# With translation
task audiobook "book.epub" --translate pt

Using Commercial APIs (DeepSeek, Claude, GPT-4, Gemini)

Instead of local Ollama models, you can use commercial APIs for better quality or faster processing:

# Using DeepSeek (cost-effective)
task audiobook "book.epub" -m "api:deepseek/deepseek-chat"

# Using Claude 3.5 Sonnet (high quality)
task audiobook "book.epub" -m "api:anthropic/claude-3-5-sonnet-20240620"

# Using GPT-4 (reliable)
task audiobook "book.epub" -m "api:openai/gpt-4-turbo-preview"

# Using Gemini Pro
task audiobook "book.epub" -m "api:gemini/gemini-1.5-pro"

Setup Required: Create a .keys file with your API keys for the provider(s) you intend to use. See Commercial APIs Guide for detailed instructions.

# Copy example file and add your keys
cp .keys.example .keys
# Edit .keys and add keys for your chosen provider(s):
# DEEPSEEK=your-deepseek-api-key-here
# ANTHROPIC=your-anthropic-api-key-here
# OPENAI=your-openai-api-key-here
# GEMINI=your-google-api-key-here

Directory Input (Multi-file Processing)

Process multiple files from a directory into a single audiobook:

# Create audiobook from directory of files
task audiobook "path/to/directory/"

# Process directory with translation
task --translate es audiobook "path/to/articles/" 

# Directory with custom voice
task --voice af_bella --language en audiobook "path/to/papers/" 

Supported file types in directory: EPUB, PDF, TXT, MD

The directory mode will:

  • Process each file as a separate episode
  • Use the filename as the episode title
  • Combine all episodes into a single M4B audiobook with chapter markers
  • Synthesize the title audio for each episode

Advanced Options

# List available languages
task run --list-languages

# List available TTS models
task --list-models run

# Save extracted text
task --save-text run "book.epub"

# Skip confirmation prompts
task -y run "book.epub"

# Use different TTS provider
task --tts-provider openai run "book.epub"    # OpenAI TTS
task --tts-provider aws run "book.epub"       # AWS Polly
task --tts-provider google run "book.epub"    # Google Cloud TTS
task --tts-provider qwen run "book.epub"      # Qwen-TTS (local)

# List available TTS providers
task --list-tts-providers run

⚙️ Configuration

TTS Provider Configuration

Audify supports multiple TTS providers. Configure your preferred provider using environment variables or a .keys file:

Option 1: Using .keys File (Recommended)

Create a .keys file in the project root:

cp .keys.example .keys

Edit .keys and add your credentials:

# OpenAI TTS
OPENAI_API_KEY=sk-your-openai-api-key
OPENAI_TTS_MODEL=tts-1-hd
OPENAI_TTS_VOICE=alloy

# AWS Polly
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
AWS_REGION=us-east-1
AWS_POLLY_VOICE=Joanna
AWS_POLLY_ENGINE=neural

# Google Cloud TTS
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
GOOGLE_TTS_VOICE=en-US-Neural2-F
GOOGLE_TTS_LANGUAGE_CODE=en-US

# Qwen-TTS (Local)
QWEN_API_URL=http://localhost:8890
QWEN_TTS_VOICE=Vivian

# Default TTS Provider
TTS_PROVIDER=kokoro  # Options: kokoro, qwen, openai, aws, google

Option 2: Environment Variables

# Kokoro TTS API (Local)
export KOKORO_API_URL="http://localhost:8887/v1/audio"

# OpenAI TTS
export OPENAI_API_KEY="sk-your-key"
export OPENAI_TTS_MODEL="tts-1-hd"  # or "tts-1"
export OPENAI_TTS_VOICE="alloy"     # alloy, echo, fable, onyx, nova, shimmer

# AWS Polly
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"
export AWS_POLLY_VOICE="Joanna"     # Neural voices recommended
export AWS_POLLY_ENGINE="neural"    # "standard" or "neural"

# Google Cloud TTS
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export GOOGLE_TTS_VOICE="en-US-Neural2-F"
export GOOGLE_TTS_LANGUAGE_CODE="en-US"

# Qwen-TTS (Local)
export QWEN_API_URL="http://localhost:8890"
export QWEN_TTS_VOICE="Vivian"

# Default Provider
export TTS_PROVIDER="kokoro"  # Options: kokoro, qwen, openai, aws, google

# Ollama Configuration
export OLLAMA_API_BASE_URL="http://localhost:11434"
export OLLAMA_TRANSLATION_MODEL="qwen3:30b"
export OLLAMA_MODEL="magistral:24b"

Choosing a TTS Provider

Provider Pros Cons Best For
Kokoro (Local) Free, privacy-friendly, GPU-accelerated Requires local setup Development, privacy-sensitive projects
Qwen-TTS (Local) Free, privacy-friendly, GPU-accelerated, multilingual Requires separate API setup Multilingual projects, privacy-sensitive content
OpenAI High quality, easy setup Pay per character Production, high-quality output
AWS Polly Neural voices, scalable AWS account required Enterprise, AWS-integrated projects
Google Cloud TTS Natural voices, many languages GCP account required Multi-language projects

Docker Services

The docker-compose.yml configures (only needed for local/Kokoro TTS):

  • Kokoro TTS: Port 8887 (GPU-accelerated speech synthesis, local)
  • Ollama: Port 11434 (LLM for translation and audiobook generation, optional)

Note: Docker services are only required for Kokoro (local TTS). Commercial TTS providers (OpenAI, AWS, Google) and LLM APIs (DeepSeek, Claude, GPT-4, Gemini) work without Docker.

📁 Output Structure

data/output/
├── [book_name]/
│   ├── chapters.txt           # Book metadata
│   ├── cover.jpg              # Book cover image
│   ├── chapters_001.mp3       # Individual chapter audio
│   ├── chapters_002.mp3
│   ├── chapters_003.mp3
│   ├── ...                    # More chapters
│   └── book_name.m4b          # Final audiobook
│
└── audiobooks/
    └── [book_name]/
        ├── episodes/
        │   ├── episode_001.mp3     # Audiobook episodes
        │   ├── episode_002.mp3
        │   └── ...
        ├── scripts/                # Generated scripts
        │   ├── episode_001_script.txt
        │   ├── original_text_001.txt
        │   └── ...
        ├── chapters.txt            # FFmpeg metadata
        └── [book_name].m4b         # Final M4B audiobook

Directory audiobook output:

data/output/
└── [directory_name]/
    ├── episodes/
    │   ├── episode_001.mp3     # Episode from first file
    │   ├── episode_002.mp3     # Episode from second file
    │   └── ...
    ├── scripts/
    │   ├── episode_001_script.txt
    │   └── ...
    ├── chapters.txt            # Chapter metadata
    └── [directory_name].m4b    # Combined audiobook

🛠️ Development

Available Tasks

task test      # Run tests with coverage
task format    # Format code with ruff
task run       # Convert ebook to audiobook
task audiobook   # Create audiobook from content
task up        # Start Docker services

Local Development Setup

# Install development dependencies
uv sync --group dev

# Run tests
task test

# Format code
task format

# Type checking (included in pre_test)
mypy ./audify ./tests --ignore-missing-imports

🏗️ Architecture

Audify uses a flexible multi-provider architecture supporting both local and cloud services:

┌─────────────────────┐
│   Audify CLI        │
│ • EPUB/PDF Read     │
│ • Text Process      │
│ • Audio Combine     │
└──────┬──────────────┘
       │
       ├─── TTS Providers ───────┐
       │    ├─ Kokoro (local)    │
       │    ├─ OpenAI TTS        │
       │    ├─ AWS Polly         │
       │    └─ Google Cloud TTS  │
       │                          │
       └─── LLM APIs ────────────┤
            ├─ Ollama (local)    │
            ├─ DeepSeek          │
            ├─ Claude            │
            ├─ GPT-4             │
            └─ Gemini            │

Key Components

  • Text Extraction: EPUB/PDF parsing with chapter detection
  • Translation: LiteLLM + Commercial/Local LLMs for high-quality translation
  • TTS: Multi-provider support (Kokoro, OpenAI, AWS Polly, Google Cloud TTS)
  • Audiobook Generation: LLM-powered script creation with commercial API support
  • Audio Processing: Pydub for format conversion and combining
  • API Management: Unified API key management via .keys file or environment variables

🌍 Supported Languages

Primary: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Hungarian, Korean, Japanese, Hindi

Translation: Any language pair supported by your Ollama model

🔧 Troubleshooting

Common Issues

Services not responding (Docker/Kokoro):

# Check service status
docker compose ps

# Restart services
docker compose restart

# Check logs
docker compose logs kokoro
docker compose logs ollama

Commercial API errors:

# Verify API key configuration
cat .keys

# Test API connectivity
uv run audify translate test.txt --model api:deepseek-chat

# Check API key is loaded
# The system will show an error if the API key is missing or invalid

TTS Provider issues:

# List available TTS providers
uv run audify --list-tts-providers

# Test specific provider
uv run audify translate test.txt --tts-provider openai

# Check provider credentials in .keys file
# OpenAI: OPENAI_API_KEY
# AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
# Google: GOOGLE_APPLICATION_CREDENTIALS (path to JSON file)

Ollama model not found:

# List available models
docker compose exec ollama ollama list

# Pull required model
docker compose exec ollama ollama pull qwen3:30b

GPU issues:

# Check GPU availability
docker compose exec kokoro nvidia-smi

# If no GPU, services will run on CPU (slower)

Performance Tips

  • Use SSD storage for model caching
  • Ensure adequate GPU memory (8GB+ recommended) for Kokoro
  • Use lighter models for testing: llama3.2:3b instead of magistral:24b
  • Commercial TTS providers (OpenAI, AWS, Google) are faster than local Kokoro
  • Commercial LLM APIs often provide better latency than local Ollama
  • Consider running local services on separate machines for large workloads
  • Use cloud providers for production workloads requiring high reliability

📚 Examples

Check the examples/ directory for sample usage patterns and configuration files.

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Workflow

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: task test
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments