A flexible Python tool for converting PDF documents to audio using various TTS providers.
- Modern Web Interface - User-friendly web UI with drag & drop upload (NEW!)
- Batch Processing - Convert multiple PDF documents at once
- Multiple TTS Providers - Extensible architecture to support different TTS services
- Smart Text Splitting - Automatically splits long texts to avoid API limits
- Customizable Voice - Adjust speed, pitch, timbre, emotion and more
- Progress Tracking - Real-time progress display and status updates
- Error Handling - Robust error handling with retry mechanisms
- High Quality Output - MP3 format at 128kbps bitrate
- PPIO - High-quality TTS using MiniMax Speech 2.8 HD model
- Novita AI - High-quality TTS using MiniMax Speech 2.8 Turbo model
- ElevenLabs - Premium quality TTS with realistic voices and voice cloning
- Azure Cognitive Services - Microsoft's TTS with 400+ voices in 140+ languages
- Google Cloud TTS - Google's WaveNet and Neural2 voices in 40+ languages
The easiest way to use this tool is through the modern web interface:
- No Command Line Required - Everything in your browser
- Flexible API Configuration - Use PPIO or custom API endpoints
- Drag & Drop Upload - Simply drag your PDF files
- Real-time Progress - Visual progress bar with status updates
- Audio Preview - Play audio before downloading
- Batch Download - Download all files as ZIP
- Modern Design - Clean, elegant, and intuitive interface
# 1. Install dependencies
pip install -r requirements.txt
# 2. Start the server
python app.py
# Or if port 5000 is occupied: PORT=5001 python app.py
# 3. Open in browser
# Default: http://localhost:5000
# Custom port: http://localhost:5001
That's it! Upload your PDF, enter your API key, and convert!
git clone https://github.com/irismaker/pdf-to-audio.git
cd pdf-to-audio
pip install -r requirements.txt
Method A: Using Config File (Recommended)
cp config_example.py config.py
# Edit config.py and add your API keys and settings
Method B: Using Environment Variables
export TTS_PROVIDER="minimax" # or "novita", "elevenlabs", "azure", "google"
export TTS_API_KEY="your_api_key_here"
Web Interface (Easiest - Recommended)
python app.py
# Default port is 5000
# If port 5000 is already in use (common on macOS):
PORT=5001 python app.py
Then open your browser and visit:
- Default: http://localhost:5000
- Custom port: http://localhost:5001 (or your configured port)
Note for macOS users: Port 5000 is often occupied by AirPlay Receiver. Use PORT=5001 python app.py instead.
Features:
- Upload PDFs directly in your browser
- No need to manage file folders
- Visual progress tracking
- Play audio before downloading
- Download all files as ZIP
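The "download all files as ZIP" feature can be illustrated with the standard library alone. This is a minimal sketch, not the code in app.py: the helper name build_zip and the in-memory approach are assumptions made for illustration.

```python
import io
import zipfile


def build_zip(files):
    """Pack a mapping of {archive_name: bytes} into an in-memory ZIP archive.

    `build_zip` is a hypothetical helper, not part of the project.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()
```

Because MP3 data is already compressed, zipfile.ZIP_STORED would likely give nearly the same archive size with less CPU work.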
Command Line - Quick Start
python quick_start.py
Command Line - Interactive Mode
python pdf_to_audio.py
Follow the prompts to:
- Select TTS provider
- Enter API key (if not in config)
- Specify PDF file or directory
- Customize voice settings (optional)
python quick_start.py
This automatically uses the settings from config.py and scans the current directory for PDFs.
The web interface provides the easiest way to convert PDFs to audio without any command line knowledge.
python app.py
The server will start on http://localhost:5000 by default (or the port specified by the PORT environment variable). Open this URL in your web browser.
Note for macOS users: If port 5000 is already in use by AirPlay Receiver, you can either:
- Disable AirPlay Receiver in System Settings > General > AirDrop & Handoff
- Or run the app on a different port:
PORT=5001 python app.py
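For readers curious how a PORT override like this is usually honored, here is a small sketch. Only the PORT variable and the 5000 default come from the text above; the helper name resolve_port and the range check are assumptions and may not match what app.py actually does.

```python
import os


def resolve_port(environ=None, default=5000):
    """Return the port from the PORT variable, falling back to the default.

    Hypothetical helper for illustration; not taken from app.py.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("PORT", "").strip()
    if not raw:
        return default
    port = int(raw)  # raises ValueError for non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT must be between 1 and 65535, got {port}")
    return port
```

Running with PORT=5001 would then make resolve_port() return 5001, while an unset variable yields 5000.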
- Configure API
  - Select provider (PPIO, Novita AI, ElevenLabs, Azure, or Google Cloud)
  - Enter your API key for the selected provider
  - API URL automatically updates based on provider selection
  - Optionally customize API endpoint
- Upload PDF
  - Click the upload area or drag and drop your PDF file
  - Maximum file size: 50MB
  - Only PDF files are accepted
- Customize Voice (Optional)
  - Select voice type (male or female voices available)
  - Adjust speed (0.5x to 2.0x)
  - Adjust pitch (-12 to +12)
  - Choose emotion (calm, happy, sad, angry, etc.)
- Convert
  - Click "Convert to Audio" button
  - Watch real-time progress
  - View status messages during conversion
- Download Results
  - Play audio files directly in browser
  - Download individual files
  - Download all files as a ZIP archive
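The upload constraints listed in the steps above (PDF only, 50MB maximum) could be enforced along these lines. This is a sketch under assumptions: validate_upload is a hypothetical name, 50MB is interpreted as 50 * 1024 * 1024 bytes, and the "%PDF-" header check is an extra safeguard that the project does not document.

```python
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumes "50MB" means 50 MiB


def validate_upload(filename, data, max_bytes=MAX_UPLOAD_BYTES):
    """Return None if the upload looks acceptable, else an error message.

    Hypothetical helper; the real web interface may validate differently.
    """
    if not filename.lower().endswith(".pdf"):
        return "Only PDF files are accepted"
    if len(data) > max_bytes:
        return "File exceeds the maximum upload size"
    # PDF files begin with the "%PDF-" marker (assumed extra check).
    if not data.startswith(b"%PDF-"):
        return "File does not look like a PDF"
    return None
```

Returning a message instead of raising keeps the caller free to show the text directly in the browser.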
- No File Management: Upload files directly through browser
- Flexible API Configuration: Use PPIO provider or custom API endpoints
- Real-time Progress: Visual progress bar with status messages
- Audio Preview: Play audio before downloading
- Batch Download: Download all generated files as ZIP
- Error Handling: Clear error messages with helpful guidance
- Responsive Design: Works on desktop and mobile devices
- Auto-cleanup: Temporary files automatically deleted after 1 hour
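The one-hour auto-cleanup can be pictured as a periodic sweep over the temporary directory. The sketch below is illustrative only: cleanup_old_files is a hypothetical function, and using file modification time as the age criterion is an assumption.

```python
import os
import time

MAX_AGE_SECONDS = 60 * 60  # one hour, per the feature list


def cleanup_old_files(directory, max_age=MAX_AGE_SECONDS, now=None):
    """Delete regular files in `directory` older than `max_age` seconds.

    Returns the sorted names of the files that were removed.
    Hypothetical helper; age is judged by modification time (an assumption).
    """
    now = time.time() if now is None else now
    with os.scandir(directory) as entries:
        candidates = [entry for entry in entries if entry.is_file()]
    removed = []
    for entry in candidates:
        if now - entry.stat().st_mtime > max_age:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

A server could call this from a background timer or before handling each upload.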
from pdf_to_audio import PDFToAudioConverter
# Create converter with MiniMax
converter = PDFToAudioConverter(
    provider_name="minimax",
    api_key="your_api_key",
    provider_config={"timeout": 60}
)
# Or use other providers
# Novita AI
converter = PDFToAudioConverter(
    provider_name="novita",
    api_key="your_novita_api_key"
)
# ElevenLabs
converter = PDFToAudioConverter(
    provider_name="elevenlabs",
    api_key="your_elevenlabs_api_key"
)
# Azure Cognitive Services
converter = PDFToAudioConverter(
    provider_name="azure",
    api_key="your_azure_key",
    provider_config={"region": "eastus"}
)
# Google Cloud TTS
converter = PDFToAudioConverter(
    provider_name="google",
    api_key="your_google_api_key"
)
# Convert single file
converter.convert_pdf_to_audio(
    pdf_path="document.pdf",
    output_dir="audio_output"
)
# Batch convert directory
converter.batch_convert(
    pdf_dir="./pdfs",
    output_dir="./audio_output"
)
# MiniMax voice settings
voice_settings = {
    "speed": 1.2,  # Speech rate: 0.5-2.0
    "pitch": 2,  # Pitch: -12 to 12
    "vol": 1.5,  # Volume: 0.1-10
    "emotion": "happy",  # Emotion: neutral, happy, sad, angry
    "voice_id": "female-tianmei"  # Voice ID
}
converter.convert_pdf_to_audio(
    pdf_path="document.pdf",
    voice_settings=voice_settings
)

- male-qn-qingse - Young Male
- male-qn-jingying - Professional Male
- male-qn-badao - Commanding Male
- male-qn-daxuesheng - College Student Male
- female-shaonv - Young Female
- female-yujie - Mature Female
- female-chengshu - Sophisticated Female
- female-tianmei - Sweet Female
Edit config.py to configure providers:
# Select provider
TTS_PROVIDER = "minimax"  # or "novita", "elevenlabs", "azure", etc.
# API Keys
API_KEYS = {
    "minimax": "your_minimax_api_key",
    "novita": "your_novita_api_key",
    "elevenlabs": "your_elevenlabs_api_key",
    "azure": "your_azure_subscription_key",
    "google": "your_google_cloud_api_key",
}
# Provider-specific config
MINIMAX_CONFIG = {
    "api_url": "https://api.ppio.com/v3/minimax-speech-2.8-hd",  # Customize your API endpoint
    "timeout": 60,
    "default_voice_settings": {
        "speed": 1.0,
        "pitch": 0,
        "voice_id": "male-qn-qingse"
    }
}
NOVITA_CONFIG = {
    "api_url": "https://api.novita.ai/v3/minimax-speech-2.8-turbo",
    "timeout": 60,
    "default_voice_settings": {
        "speed": 1.0,
        "pitch": 0,
        "voice_id": "male-qn-qingse"
    }
}
ELEVENLABS_CONFIG = {
    "api_url": "https://api.elevenlabs.io/v1/text-to-speech",
    "timeout": 60,
    "default_voice_settings": {
        "voice_id": "21m00Tcm4TlvDq8ikWAM",  # Rachel
        "model_id": "eleven_multilingual_v2",
        "stability": 0.5,
        "similarity_boost": 0.75
    }
}
AZURE_CONFIG = {
    "region": "eastus",
    "timeout": 60,
    "default_voice_settings": {
        "voice_name": "en-US-AriaNeural",
        "rate": "1.0",
        "pitch": "+0Hz"
    }
}
GOOGLE_CONFIG = {
    "api_url": "https://texttospeech.googleapis.com/v1/text:synthesize",
    "timeout": 60,
    "default_voice_settings": {
        "language_code": "en-US",
        "voice_name": "en-US-Neural2-F",
        "speaking_rate": 1.0,
        "pitch": 0.0
    }
}
The project supports custom API endpoints. Default endpoints for each provider:
- PPIO: https://api.ppio.com/v3/minimax-speech-2.8-hd
- Novita AI: https://api.novita.ai/v3/minimax-speech-2.8-turbo
- ElevenLabs: https://api.elevenlabs.io/v1/text-to-speech
- Azure: https://{region}.tts.speech.microsoft.com/cognitiveservices/v1 (region configurable)
- Google Cloud: https://texttospeech.googleapis.com/v1/text:synthesize
You can customize any endpoint in config.py or through the web interface.
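To make the precedence concrete, the sketch below resolves an endpoint by preferring a custom api_url and otherwise falling back to the documented default, filling in the Azure region. The function resolve_endpoint and the DEFAULT_ENDPOINTS table are illustrative assumptions; the project's own lookup may be organized differently.

```python
# Default endpoints as listed in this README, keyed by provider name.
DEFAULT_ENDPOINTS = {
    "minimax": "https://api.ppio.com/v3/minimax-speech-2.8-hd",
    "novita": "https://api.novita.ai/v3/minimax-speech-2.8-turbo",
    "elevenlabs": "https://api.elevenlabs.io/v1/text-to-speech",
    "azure": "https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
    "google": "https://texttospeech.googleapis.com/v1/text:synthesize",
}


def resolve_endpoint(provider, config=None):
    """Return the custom api_url if set, else the default for the provider.

    Hypothetical helper; "eastus" mirrors the region in AZURE_CONFIG.
    """
    config = config or {}
    custom = config.get("api_url")
    if custom:
        return custom
    try:
        template = DEFAULT_ENDPOINTS[provider]
    except KeyError:
        raise ValueError(f"Unsupported provider: {provider}") from None
    # Only the Azure template contains a {region} placeholder.
    return template.format(region=config.get("region", "eastus"))
```

For example, resolve_endpoint("azure", {"region": "westeurope"}) yields the westeurope host, while any provider with an api_url set returns that URL unchanged.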
Simply change the api_url in your provider configuration:
MINIMAX_CONFIG = {
    "api_url": "YOUR_CUSTOM_API_ENDPOINT",
    "timeout": 60,
    # ... other settings
}
NOVITA_CONFIG = {
    "api_url": "YOUR_CUSTOM_API_ENDPOINT",
    "timeout": 60,
    # ... other settings
}

| Parameter | Description | Range | Default |
|---|---|---|---|
| speed | Speech rate | 0.5-2.0 | 1.0 |
| pitch | Voice pitch | -12 to 12 | 0 |
| vol | Volume | 0.1-10 | 1.0 |
| emotion | Emotion | neutral, happy, sad, angry | neutral |
| voice_id | Voice ID | See available voices | male-qn-qingse |
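The ranges in the table lend themselves to a small validation step before a request is sent. The following is a sketch only: normalize_voice_settings is a hypothetical helper, and rejecting out-of-range values (rather than clamping them) is an assumption about desirable behavior, not a description of the project.

```python
# Ranges and defaults copied from the parameter table above.
NUMERIC_RANGES = {"speed": (0.5, 2.0), "pitch": (-12, 12), "vol": (0.1, 10)}
ALLOWED_EMOTIONS = {"neutral", "happy", "sad", "angry"}
DEFAULTS = {
    "speed": 1.0,
    "pitch": 0,
    "vol": 1.0,
    "emotion": "neutral",
    "voice_id": "male-qn-qingse",
}


def normalize_voice_settings(overrides=None):
    """Merge overrides onto the defaults and reject out-of-range values.

    Hypothetical helper for illustration.
    """
    settings = {**DEFAULTS, **(overrides or {})}
    for name, (low, high) in NUMERIC_RANGES.items():
        value = settings[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    if settings["emotion"] not in ALLOWED_EMOTIONS:
        raise ValueError(f"Unknown emotion: {settings['emotion']}")
    return settings
```

Calling normalize_voice_settings({"speed": 1.2}) keeps every other parameter at its documented default.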
pdf-to-audio/
├── providers/                # TTS provider implementations
│   ├── __init__.py           # Provider factory
│   ├── base_provider.py      # Abstract base class
│   └── minimax_provider.py   # MiniMax implementation
├── pdf_to_audio.py           # Main converter class
├── quick_start.py            # Quick start script
├── config_example.py         # Configuration template
├── requirements.txt          # Python dependencies
├── README.md                 # Project documentation
├── LICENSE                   # MIT License
└── .gitignore                # Git ignore rules
- Python 3.7+
- requests >= 2.31.0
- PyPDF2 >= 3.0.0
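With PyPDF2 3.x, page text is typically read through PdfReader and each page's extract_text() method. The sketch below shows that call pattern together with a rough check for PDFs that yield almost no text, which usually indicates scanned images. Both helper names and the 20-character threshold are assumptions, not part of the project.

```python
def read_page_texts(pdf_path):
    """Return the extracted text of every page using PyPDF2 3.x."""
    # Imported lazily so the heuristic below works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def has_extractable_text(page_texts, min_chars=20):
    """Heuristic: a PDF with almost no extracted text is probably scanned.

    The threshold is an arbitrary illustrative value.
    """
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars
```

A caller might run has_extractable_text(read_page_texts("document.pdf")) and skip the TTS request when it returns False.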
The architecture is designed for easy extensibility. To add a new provider:
- Create a new provider class in the providers/ directory:
# providers/custom_provider.py
from .base_provider import BaseTTSProvider

class CustomProvider(BaseTTSProvider):
    def text_to_speech(self, text, output_path, voice_settings=None):
        # Implement your TTS API call
        pass

    def get_available_voices(self):
        return ["voice1", "voice2", "voice3"]

    def get_default_voice_settings(self):
        return {"voice": "voice1", "speed": 1.0}
- Register the provider in providers/__init__.py:
from .minimax_provider import MiniMaxProvider
from .custom_provider import CustomProvider

PROVIDERS = {
    'minimax': MiniMaxProvider,
    'custom': CustomProvider,  # Add your provider here
}
- Update config_example.py with provider-specific settings
- Test your implementation and submit a pull request!
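To show how the pieces in the steps above might fit together, here is a self-contained sketch of an abstract base class, a registry, and a factory. It is not the project's actual code: the constructor signature, the create_provider function, and the toy EchoProvider are all assumptions; only the three method names, the PROVIDERS dictionary, and get_available_providers() appear in this README.

```python
from abc import ABC, abstractmethod


class BaseTTSProvider(ABC):
    """Assumed shape of the base class in providers/base_provider.py."""

    def __init__(self, api_key, config=None):  # signature is an assumption
        self.api_key = api_key
        self.config = config or {}

    @abstractmethod
    def text_to_speech(self, text, output_path, voice_settings=None):
        """Synthesize `text` and write the audio to `output_path`."""

    @abstractmethod
    def get_available_voices(self):
        """Return the voice identifiers this provider supports."""

    @abstractmethod
    def get_default_voice_settings(self):
        """Return the provider's default voice settings."""


class EchoProvider(BaseTTSProvider):
    """Toy provider that writes the text as bytes instead of calling an API."""

    def text_to_speech(self, text, output_path, voice_settings=None):
        with open(output_path, "wb") as handle:
            handle.write(text.encode("utf-8"))
        return output_path

    def get_available_voices(self):
        return ["voice1"]

    def get_default_voice_settings(self):
        return {"voice": "voice1", "speed": 1.0}


PROVIDERS = {"echo": EchoProvider}


def get_available_providers():
    """List the registered provider names."""
    return sorted(PROVIDERS)


def create_provider(name, api_key, config=None):
    """Hypothetical factory: look up the class by name and instantiate it."""
    try:
        provider_cls = PROVIDERS[name]
    except KeyError:
        raise ValueError(
            f"Unsupported provider '{name}'. Available: {get_available_providers()}"
        ) from None
    return provider_cls(api_key, config)
```

Because the base class marks its methods abstract, a provider that forgets to implement one of them fails at instantiation time rather than midway through a conversion.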
- API Keys: Requires valid API keys for the chosen provider
- PDF Format: Only supports PDFs with extractable text (not scanned images)
- Network: Requires stable internet connection for API calls
- API Limits: Be aware of provider-specific rate limits and quotas
- Text Length: Long texts are automatically split (default max 5000 characters per chunk)
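The automatic splitting mentioned in the last note can be approximated as follows. This is a sketch of one possible strategy (pack whole sentences into chunks of up to 5000 characters, hard-splitting any sentence that is itself too long); split_text is a hypothetical name and the project's real splitter may use different boundaries.

```python
import re


def split_text(text, max_chars=5000):
    """Split text into chunks of at most max_chars, preferring sentence ends.

    Hypothetical helper; only Latin sentence punctuation is recognized.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after ".", "!" or "?" when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at fixed width.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the provider as a separate request, producing one audio part per chunk.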
- Cause: PDF may be a scanned image
- Solution: Use OCR tools to convert PDF to searchable text first
- Cause: Invalid or expired API key
- Solution: Check and update your API key in config.py
- Cause: Rate limit exceeded
- Solution: Add delays between requests or wait before retrying
- Cause: Unsupported provider name
- Solution: Check available providers with get_available_providers()
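For the rate-limit case above, "wait before retrying" is commonly implemented as exponential backoff. The sketch below illustrates that idea under stated assumptions: RateLimitError and call_with_retry are hypothetical names, and the delays (1s, 2s, 4s) are example values rather than the project's actual retry policy.

```python
import time


class RateLimitError(Exception):
    """Hypothetical exception a provider might raise on a rate-limit response."""


def call_with_retry(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on RateLimitError with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    The final failure is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep parameter exists so the delay can be replaced in tests; in normal use the default time.sleep applies.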
Contributions are welcome! Here's how you can help:
- Add new TTS providers - See ADDING_NEW_PROVIDERS.md for a complete guide
- Improve text extraction - Better handling of PDF formats and scanned documents
- Add features - OCR support, subtitle generation, audio merging, batch processing
- Fix bugs - Report issues or submit fixes
- Improve documentation - Better examples, tutorials, and translations
- Test providers - Help test and improve existing provider implementations
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- MiniMax - For providing the Text-to-Speech API
- PyPDF2 - For PDF text extraction
- All contributors and users of this project
For questions, suggestions, or issues:
- Open an Issue
- Submit a Pull Request
- Add ElevenLabs provider
- Add Azure TTS provider
- Add Google Cloud TTS provider
- Web interface with modern design
- Multiple TTS provider support
- Implement OCR for scanned PDFs
- Add subtitle/caption generation
- Add audio file merging for multi-part outputs
- Add Docker support
- Add more languages support
- Batch processing improvements
- Voice preview feature
⭐ If this project helps you, please give it a star!
Made with ❤️ by the open source community
