Convert ebooks and PDFs to audiobooks using AI text-to-speech and translation services.
Audify is an API-based system that transforms written content into high-quality audio using:
- Multiple TTS providers - choose from Kokoro (local), Qwen-TTS (local), OpenAI, AWS Polly, or Google Cloud TTS
- Ollama + LiteLLM for intelligent translation
- LLM-powered audiobook generation for engaging audio content
- Multiple Formats: Convert EPUB ebooks, PDF documents, TXT, and MD files
- Directory Processing: Create audiobooks from multiple files in a directory
- Audiobook Creation: Generate audiobook-style content from books using an LLM
- Multiple TTS Providers: Choose from Kokoro (local), Qwen-TTS (local), OpenAI, AWS Polly, or Google Cloud TTS
- Multi-language Support: Translate content between languages
- High-Quality TTS: Natural-sounding speech with multiple provider options
- Flexible Configuration: Environment-based settings and `.keys` file support
- Python 3.10-3.13
- UV package manager (installation guide)
- Docker & Docker Compose (for API services)
- CUDA-capable GPU (recommended for optimal performance)
- Qwen-TTS API Server running on port 8890 (see Qwen3-TTS)
- OpenAI TTS: OpenAI API key (get one here)
- AWS Polly: AWS account with access keys (AWS setup)
- Google Cloud TTS: Google Cloud project with credentials (GCP setup)
Note: Docker is only required if you want to use the local Kokoro TTS provider. For Qwen-TTS, you'll need to run the Qwen-TTS API separately (see Qwen-TTS Setup below). You can skip to "Quick Start with Cloud TTS" if you prefer using OpenAI, AWS Polly, or Google Cloud TTS.
```bash
git clone https://github.com/garciadias/audify.git
cd audify
```

```bash
# Start Kokoro TTS and Ollama services
docker compose up -d

# Wait for services to be ready (~2-3 minutes)
# Check status: docker compose ps
```

```bash
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync
```

```bash
# Pull required models for translation and audiobook generation
docker compose exec ollama ollama pull qwen3:30b

# Or use lighter models for testing:
# docker compose exec ollama ollama pull llama3.2:3b
```

```bash
# Convert EPUB to audiobook (using Kokoro TTS)
task run path/to/your/book.epub

# Convert PDF to audiobook
task run path/to/your/document.pdf

# Create audiobook from EPUB
task audiobook path/to/your/book.epub
```

Qwen-TTS is a high-quality, free, and privacy-friendly local TTS solution with excellent multilingual support.
First, set up the Qwen-TTS API server (requires GPU):
```bash
# Clone the Qwen-TTS API repository
git clone https://github.com/QwenLM/Qwen3-TTS
cd Qwen3-TTS

# Start with Docker (recommended)
make up

# The API will be available at http://localhost:8890
```

For detailed setup instructions, see the Qwen3-TTS documentation.
```bash
git clone https://github.com/garciadias/audify.git
cd audify
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync
```

Create a `.keys` file:

```bash
TTS_PROVIDER=qwen
QWEN_API_URL=http://localhost:8890
QWEN_TTS_VOICE=Vivian
```

```bash
# Convert using Qwen-TTS
task run path/to/your/book.epub

# Or specify provider explicitly
task --tts-provider qwen run path/to/your/book.epub
```

If you prefer to use cloud TTS providers without Docker:
```bash
git clone https://github.com/garciadias/audify.git
cd audify
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv sync
```

Create a `.keys` file with your credentials:

```bash
cp .keys.example .keys
# Edit .keys and add your provider credentials
# See the Configuration section for details
```

```bash
# Using OpenAI TTS
task --tts-provider openai run "book.epub"

# Using AWS Polly
task --tts-provider aws run "book.epub"

# Using Google Cloud TTS
task --tts-provider google run "book.epub"
```

```bash
# English EPUB to audiobook
task run "book.epub"

# PDF with specific language
task --language pt run "document.pdf"

# With translation (English to Spanish)
task --language en --translate es run "book.epub"
```

```bash
# Create audiobook from EPUB
task audiobook "book.epub"

# Limit to first 5 chapters
task audiobook "book.epub" --max-chapters 5

# Custom voice and language
task audiobook "book.epub" --voice af_bella --language en

# With translation
task audiobook "book.epub" --translate pt
```

Instead of local Ollama models, you can use commercial APIs for better quality or faster processing:
```bash
# Using DeepSeek (cost-effective)
task audiobook "book.epub" -m "api:deepseek/deepseek-chat"

# Using Claude 3.5 Sonnet (high quality)
task audiobook "book.epub" -m "api:anthropic/claude-3-5-sonnet-20240620"

# Using GPT-4 (reliable)
task audiobook "book.epub" -m "api:openai/gpt-4-turbo-preview"

# Using Gemini Pro
task audiobook "book.epub" -m "api:gemini/gemini-1.5-pro"
```

Setup Required: Create a `.keys` file with your API keys for the provider(s) you intend to use. See the Commercial APIs Guide for detailed instructions.

```bash
# Copy the example file and add your keys
cp .keys.example .keys

# Edit .keys and add keys for your chosen provider(s):
# DEEPSEEK=your-deepseek-api-key-here
# ANTHROPIC=your-anthropic-api-key-here
# OPENAI=your-openai-api-key-here
# GEMINI=your-google-api-key-here
```

Process multiple files from a directory into a single audiobook:
```bash
# Create audiobook from a directory of files
task audiobook "path/to/directory/"

# Process directory with translation
task --translate es audiobook "path/to/articles/"

# Directory with custom voice
task --voice af_bella --language en audiobook "path/to/papers/"
```

Supported file types in directory: EPUB, PDF, TXT, MD
The directory mode will:
- Process each file as a separate episode
- Use the filename as the episode title
- Combine all episodes into a single M4B audiobook with chapter markers
- Synthesize the title audio for each episode
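The chapter markers come from the `chapters.txt` FFmpeg metadata file. FFmpeg's FFMETADATA format looks roughly like the fragment below; the title and timestamps are purely illustrative, and the exact fields Audify writes may differ:

```ini
;FFMETADATA1
title=My Articles

[CHAPTER]
TIMEBASE=1/1000
START=0
END=90000
title=episode_001

[CHAPTER]
TIMEBASE=1/1000
START=90000
END=210000
title=episode_002
```

With `TIMEBASE=1/1000`, START and END are millisecond offsets into the combined audio.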
```bash
# List available languages
task run --list-languages

# List available TTS models
task --list-models run

# Save extracted text
task --save-text run "book.epub"

# Skip confirmation prompts
task -y run "book.epub"

# Use a different TTS provider
task --tts-provider openai run "book.epub"  # OpenAI TTS
task --tts-provider aws run "book.epub"     # AWS Polly
task --tts-provider google run "book.epub"  # Google Cloud TTS
task --tts-provider qwen run "book.epub"    # Qwen-TTS (local)

# List available TTS providers
task --list-tts-providers run
```

Audify supports multiple TTS providers. Configure your preferred provider using environment variables or a `.keys` file.

Create a `.keys` file in the project root:

```bash
cp .keys.example .keys
```

Edit `.keys` and add your credentials:
```bash
# OpenAI TTS
OPENAI_API_KEY=sk-your-openai-api-key
OPENAI_TTS_MODEL=tts-1-hd
OPENAI_TTS_VOICE=alloy

# AWS Polly
AWS_ACCESS_KEY_ID=your-aws-access-key
AWS_SECRET_ACCESS_KEY=your-aws-secret-key
AWS_REGION=us-east-1
AWS_POLLY_VOICE=Joanna
AWS_POLLY_ENGINE=neural

# Google Cloud TTS
GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
GOOGLE_TTS_VOICE=en-US-Neural2-F
GOOGLE_TTS_LANGUAGE_CODE=en-US

# Qwen-TTS (Local)
QWEN_API_URL=http://localhost:8890
QWEN_TTS_VOICE=Vivian

# Default TTS Provider
TTS_PROVIDER=kokoro  # Options: kokoro, qwen, openai, aws, google
```

```bash
# Kokoro TTS API (Local)
export KOKORO_API_URL="http://localhost:8887/v1/audio"

# OpenAI TTS
export OPENAI_API_KEY="sk-your-key"
export OPENAI_TTS_MODEL="tts-1-hd"  # or "tts-1"
export OPENAI_TTS_VOICE="alloy"     # alloy, echo, fable, onyx, nova, shimmer

# AWS Polly
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"
export AWS_POLLY_VOICE="Joanna"   # Neural voices recommended
export AWS_POLLY_ENGINE="neural"  # "standard" or "neural"

# Google Cloud TTS
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export GOOGLE_TTS_VOICE="en-US-Neural2-F"
export GOOGLE_TTS_LANGUAGE_CODE="en-US"

# Qwen-TTS (Local)
export QWEN_API_URL="http://localhost:8890"
export QWEN_TTS_VOICE="Vivian"

# Default Provider
export TTS_PROVIDER="kokoro"  # Options: kokoro, qwen, openai, aws, google

# Ollama Configuration
export OLLAMA_API_BASE_URL="http://localhost:11434"
export OLLAMA_TRANSLATION_MODEL="qwen3:30b"
export OLLAMA_MODEL="magistral:24b"
```

| Provider | Pros | Cons | Best For |
|---|---|---|---|
| Kokoro (Local) | Free, privacy-friendly, GPU-accelerated | Requires local setup | Development, privacy-sensitive projects |
| Qwen-TTS (Local) | Free, privacy-friendly, GPU-accelerated, multilingual | Requires separate API setup | Multilingual projects, privacy-sensitive content |
| OpenAI | High quality, easy setup | Pay per character | Production, high-quality output |
| AWS Polly | Neural voices, scalable | AWS account required | Enterprise, AWS-integrated projects |
| Google Cloud TTS | Natural voices, many languages | GCP account required | Multi-language projects |
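Whichever provider you pick, selection boils down to validating the `TTS_PROVIDER` setting against the table above. The sketch below is illustrative only; `resolve_provider` is a hypothetical name, not Audify's actual API:

```python
import os

# The five providers documented above; "kokoro" is the documented default.
SUPPORTED_PROVIDERS = {"kokoro", "qwen", "openai", "aws", "google"}

def resolve_provider(default: str = "kokoro") -> str:
    """Pick the TTS backend from the TTS_PROVIDER environment variable."""
    provider = os.environ.get("TTS_PROVIDER", default).strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unknown TTS provider: {provider!r}")
    return provider
```

The `--tts-provider` CLI flag shown earlier overrides this environment-based default.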
The `docker-compose.yml` file configures the following services (only needed for local Kokoro TTS and Ollama):
- Kokoro TTS: Port 8887 (GPU-accelerated speech synthesis, local)
- Ollama: Port 11434 (LLM for translation and audiobook generation, optional)
Note: Docker services are only required for Kokoro (local TTS). Commercial TTS providers (OpenAI, AWS, Google) and LLM APIs (DeepSeek, Claude, GPT-4, Gemini) work without Docker.
```
data/output/
├── [book_name]/
│   ├── chapters.txt          # Book metadata
│   ├── cover.jpg             # Book cover image
│   ├── chapters_001.mp3      # Individual chapter audio
│   ├── chapters_002.mp3
│   ├── chapters_003.mp3
│   ├── ...                   # More chapters
│   └── book_name.m4b         # Final audiobook
│
└── audiobooks/
    └── [book_name]/
        ├── episodes/
        │   ├── episode_001.mp3         # Audiobook episodes
        │   ├── episode_002.mp3
        │   └── ...
        ├── scripts/                    # Generated scripts
        │   ├── episode_001_script.txt
        │   ├── original_text_001.txt
        │   └── ...
        ├── chapters.txt                # FFmpeg metadata
        └── [book_name].m4b             # Final M4B audiobook
```
Directory audiobook output:

```
data/output/
└── [directory_name]/
    ├── episodes/
    │   ├── episode_001.mp3    # Episode from first file
    │   ├── episode_002.mp3    # Episode from second file
    │   └── ...
    ├── scripts/
    │   ├── episode_001_script.txt
    │   └── ...
    ├── chapters.txt           # Chapter metadata
    └── [directory_name].m4b   # Combined audiobook
```
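The chapter metadata in `chapters.txt` marks where each episode starts and ends in the combined file, so the offsets are simply cumulative episode durations. A sketch of that arithmetic (a hypothetical helper, not Audify's actual API):

```python
def chapter_offsets(durations_ms: list[int]) -> list[tuple[int, int]]:
    """Map per-episode durations (in ms) to (start, end) chapter offsets."""
    offsets, start = [], 0
    for duration in durations_ms:
        offsets.append((start, start + duration))
        start += duration
    return offsets

# Three episodes of 90 s, 120 s, and 60 s:
# chapter_offsets([90_000, 120_000, 60_000])
# → [(0, 90000), (90000, 210000), (210000, 270000)]
```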
```bash
task test       # Run tests with coverage
task format     # Format code with ruff
task run        # Convert ebook to audiobook
task audiobook  # Create audiobook from content
task up         # Start Docker services
```

```bash
# Install development dependencies
uv sync --group dev

# Run tests
task test

# Format code
task format

# Type checking (included in pre_test)
mypy ./audify ./tests --ignore-missing-imports
```

Audify uses a flexible multi-provider architecture supporting both local and cloud services:
```
┌──────────────────────┐
│      Audify CLI      │
│  • EPUB/PDF Read     │
│  • Text Process      │
│  • Audio Combine     │
└─────────┬────────────┘
          │
     ┌─── TTS Providers ─────┐
     │  ├─ Kokoro (local)    │
     │  ├─ OpenAI TTS        │
     │  ├─ AWS Polly         │
     │  └─ Google Cloud TTS  │
     │                       │
     ├─── LLM APIs ──────────┤
     │  ├─ Ollama (local)    │
     │  ├─ DeepSeek          │
     │  ├─ Claude            │
     │  ├─ GPT-4             │
     │  └─ Gemini            │
     └───────────────────────┘
```
- Text Extraction: EPUB/PDF parsing with chapter detection
- Translation: LiteLLM + Commercial/Local LLMs for high-quality translation
- TTS: Multi-provider support (Kokoro, OpenAI, AWS Polly, Google Cloud TTS)
- Audiobook Generation: LLM-powered script creation with commercial API support
- Audio Processing: Pydub for format conversion and combining
- API Management: Unified API key management via .keys file or environment variables
Primary: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Hungarian, Korean, Japanese, Hindi
Translation: Any language pair supported by your Ollama model
Services not responding (Docker/Kokoro):

```bash
# Check service status
docker compose ps

# Restart services
docker compose restart

# Check logs
docker compose logs kokoro
docker compose logs ollama
```

Commercial API errors:

```bash
# Verify API key configuration
cat .keys

# Test API connectivity
uv run audify translate test.txt --model api:deepseek-chat

# Check that the API key is loaded
# The system will show an error if the API key is missing or invalid
```

TTS Provider issues:

```bash
# List available TTS providers
uv run audify --list-tts-providers

# Test a specific provider
uv run audify translate test.txt --tts-provider openai

# Check provider credentials in the .keys file:
# OpenAI: OPENAI_API_KEY
# AWS: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
# Google: GOOGLE_APPLICATION_CREDENTIALS (path to JSON file)
```

Ollama model not found:

```bash
# List available models
docker compose exec ollama ollama list

# Pull the required model
docker compose exec ollama ollama pull qwen3:30b
```

GPU issues:

```bash
# Check GPU availability
docker compose exec kokoro nvidia-smi

# If no GPU is found, services will run on CPU (slower)
```

- Use SSD storage for model caching
- Ensure adequate GPU memory (8GB+ recommended) for Kokoro
- Use lighter models for testing: `llama3.2:3b` instead of `magistral:24b`
- Commercial TTS providers (OpenAI, AWS, Google) are faster than local Kokoro
- Commercial LLM APIs often provide better latency than local Ollama
- Consider running local services on separate machines for large workloads
- Use cloud providers for production workloads requiring high reliability
Check the examples/ directory for sample usage patterns and configuration files.
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: `task test`
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Kokoro TTS for high-quality speech synthesis
- Kokoro-FastAPI for making Kokoro accessible via FastAPI
- Ollama for local LLM inference
- LiteLLM for unified LLM API interface
- OpenAI for GPT and TTS APIs
- Anthropic for Claude API
- DeepSeek for DeepSeek API
- Google for Gemini and Cloud TTS
- AWS for the Polly text-to-speech service