PDF to Audio Converter 🎙️

A flexible Python tool for converting PDF documents to audio using various TTS providers.

✨ Features

🌐 Modern Web Interface - User-friendly web UI with drag & drop upload (NEW!)
📚 Batch Processing - Convert multiple PDF documents at once
🔌 Multiple TTS Providers - Extensible architecture to support different TTS services
✂️ Smart Text Splitting - Automatically splits long texts to avoid API limits
🎵 Customizable Voice - Adjust speed, pitch, timbre, emotion and more
📊 Progress Tracking - Real-time progress display and status updates
🔄 Error Handling - Robust error handling with retry mechanisms
🎧 High Quality Output - MP3 format at a 128 kbps bitrate

Currently Supported Providers

✅ PPIO - High-quality TTS using the MiniMax Speech 2.8 HD model (provider name "minimax" in code and config)
✅ Novita AI - High-quality TTS using MiniMax Speech 2.8 Turbo model
✅ ElevenLabs - Premium quality TTS with realistic voices and voice cloning
✅ Azure Cognitive Services - Microsoft's TTS with 400+ voices in 140+ languages
✅ Google Cloud TTS - Google's WaveNet and Neural2 voices in 40+ languages

🎨 Web Interface

The easiest way to use this tool is through the modern web interface:

Key Features

No Command Line Required - Everything in your browser
Flexible API Configuration - Use PPIO or custom API endpoints
Drag & Drop Upload - Simply drag your PDF files
Real-time Progress - Visual progress bar with status updates
Audio Preview - Play audio before downloading
Batch Download - Download all files as ZIP
Modern Design - Clean, elegant, and intuitive interface

Quick Start with Web Interface

# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the server
python app.py
# Or if port 5000 is occupied: PORT=5001 python app.py

# 3. Open in browser
# Default: http://localhost:5000
# Custom port: http://localhost:5001

That's it! Upload your PDF, enter your API key, and convert!

🚀 Quick Start

1. Clone Repository

git clone https://github.com/irismaker/pdf-to-audio.git
cd pdf-to-audio

2. Install Dependencies

pip install -r requirements.txt

3. Configure API

Method A: Using Config File (Recommended)

cp config_example.py config.py
# Edit config.py and add your API keys and settings

Method B: Using Environment Variables

export TTS_PROVIDER="minimax"  # or "novita", "elevenlabs", "azure", "google"
export TTS_API_KEY="your_api_key_here"
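As a rough illustration of Method B, the sketch below reads the two variables from the environment. The helper name `load_tts_settings` and the fallback to "minimax" when `TTS_PROVIDER` is unset are assumptions made for this example; the project's own loading logic may differ.

```python
import os


def load_tts_settings(env=None):
    """Read the provider name and API key from environment variables.

    Hypothetical helper: only the variable names TTS_PROVIDER and
    TTS_API_KEY come from this README.
    """
    env = os.environ if env is None else env
    # Assumed fallback: use "minimax" when no provider is specified.
    provider = env.get("TTS_PROVIDER", "minimax").strip().lower()
    api_key = env.get("TTS_API_KEY", "").strip()
    if not api_key:
        raise ValueError("TTS_API_KEY is not set")
    return provider, api_key
```

Calling `load_tts_settings()` with no arguments reads the real environment; passing a plain dict makes the function easy to test.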

4. Run the Program

Web Interface (Easiest - Recommended)

python app.py
# Default port is 5000

# If port 5000 is already in use (common on macOS):
PORT=5001 python app.py

Then open your browser and visit:

Default: http://localhost:5000
Custom port: http://localhost:5001 (or your configured port)

Note for macOS users: Port 5000 is often occupied by AirPlay Receiver. Use PORT=5001 python app.py instead.

Features:

Upload PDFs directly in your browser
No need to manage file folders
Visual progress tracking
Play audio before downloading
Download all files as ZIP

Command Line - Quick Start

python quick_start.py

Command Line - Interactive Mode

python pdf_to_audio.py

📖 Usage

Basic Usage

Interactive Mode

python pdf_to_audio.py

Follow the prompts to:

Select TTS provider
Enter API key (if not in config)
Specify PDF file or directory
Customize voice settings (optional)

Quick Start with Config

python quick_start.py

Uses the settings in config.py and scans the current directory for PDFs.
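The directory scan could look roughly like the sketch below. The function name `find_pdfs` is hypothetical, and the choice to ignore subdirectories is an assumption; quick_start.py may behave differently.

```python
from pathlib import Path


def find_pdfs(directory="."):
    """Return the PDF files directly inside `directory`, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        # Match the extension case-insensitively (.pdf, .PDF, ...).
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```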

Web Interface Usage

The web interface provides the easiest way to convert PDFs to audio without any command line knowledge.

Starting the Web Server

python app.py

The server will start on http://localhost:5000 by default (or the port specified by the PORT environment variable). Open this URL in your web browser.

Note for macOS users: If port 5000 is already in use by AirPlay Receiver, you can either:

Disable AirPlay Receiver in System Settings > General > AirDrop & Handoff
Or run the app on a different port: PORT=5001 python app.py

Using the Web Interface

1. Configure API
   - Select provider (PPIO, Novita AI, ElevenLabs, Azure, or Google Cloud)
   - Enter your API key for the selected provider
   - API URL automatically updates based on provider selection
   - Optionally customize API endpoint
2. Upload PDF
   - Click the upload area or drag and drop your PDF file
   - Maximum file size: 50MB
   - Only PDF files are accepted
3. Customize Voice (Optional)
   - Select voice type (male or female voices available)
   - Adjust speed (0.5x to 2.0x)
   - Adjust pitch (-12 to +12)
   - Choose emotion (calm, happy, sad, angry, etc.)
4. Convert
   - Click "Convert to Audio" button
   - Watch real-time progress
   - View status messages during conversion
5. Download Results
   - Play audio files directly in browser
   - Download individual files
   - Download all files as a ZIP archive
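The upload rules above (PDF only, 50MB maximum) can be expressed as a small check. This is a sketch rather than the code in app.py: the name `validate_upload`, the error strings, and the reading of 50MB as 50 × 1024 × 1024 bytes are all assumptions.

```python
from pathlib import Path

# Assumption: "50MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def validate_upload(filename, size_bytes):
    """Return an error message, or None if the upload looks acceptable."""
    if Path(filename).suffix.lower() != ".pdf":
        return "Only PDF files are accepted"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "File exceeds the 50MB limit"
    return None
```

Checking the extension alone does not prove the file is a valid PDF; a stricter check could also inspect the file header.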

Web Interface Features

No File Management: Upload files directly through browser
Flexible API Configuration: Use PPIO provider or custom API endpoints
Real-time Progress: Visual progress bar with status messages
Audio Preview: Play audio before downloading
Batch Download: Download all generated files as ZIP
Error Handling: Clear error messages with helpful guidance
Responsive Design: Works on desktop and mobile devices
Auto-cleanup: Temporary files automatically deleted after 1 hour
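The auto-cleanup behaviour can be pictured with the sketch below, which deletes files whose modification time is more than one hour old. The function name `cleanup_old_files` and the use of modification time as the age criterion are assumptions; the web app may track file age differently.

```python
import time
from pathlib import Path

MAX_AGE_SECONDS = 60 * 60  # 1 hour, as stated above


def cleanup_old_files(directory, max_age=MAX_AGE_SECONDS, now=None):
    """Delete regular files in `directory` older than `max_age` seconds.

    Age is measured from the last modification time. Returns the sorted
    names of the deleted files.
    """
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and now - path.stat().st_mtime > max_age:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```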

Advanced Usage

Use as Python Module

from pdf_to_audio import PDFToAudioConverter

# Create converter with MiniMax
converter = PDFToAudioConverter(
    provider_name="minimax",
    api_key="your_api_key",
    provider_config={"timeout": 60}
)

# Or use other providers
# Novita AI
converter = PDFToAudioConverter(
    provider_name="novita",
    api_key="your_novita_api_key"
)

# ElevenLabs
converter = PDFToAudioConverter(
    provider_name="elevenlabs",
    api_key="your_elevenlabs_api_key"
)

# Azure Cognitive Services
converter = PDFToAudioConverter(
    provider_name="azure",
    api_key="your_azure_key",
    provider_config={"region": "eastus"}
)

# Google Cloud TTS
converter = PDFToAudioConverter(
    provider_name="google",
    api_key="your_google_api_key"
)

# Convert single file
converter.convert_pdf_to_audio(
    pdf_path="document.pdf",
    output_dir="audio_output"
)

# Batch convert directory
converter.batch_convert(
    pdf_dir="./pdfs",
    output_dir="./audio_output"
)

Custom Voice Settings

# MiniMax voice settings
voice_settings = {
    "speed": 1.2,           # Speech rate: 0.5-2.0
    "pitch": 2,             # Pitch: -12 to 12
    "vol": 1.5,             # Volume: 0.1-10
    "emotion": "happy",     # Emotion: neutral, happy, sad, angry
    "voice_id": "female-tianmei"  # Voice ID
}

converter.convert_pdf_to_audio(
    pdf_path="document.pdf",
    voice_settings=voice_settings
)

🎨 Available Voices (MiniMax)

Male Voices

male-qn-qingse - Young Male
male-qn-jingying - Professional Male
male-qn-badao - Commanding Male
male-qn-daxuesheng - College Student Male

Female Voices

female-shaonv - Young Female
female-yujie - Mature Female
female-chengshu - Sophisticated Female
female-tianmei - Sweet Female

⚙️ Configuration

Provider Settings

Edit config.py to configure providers:

# Select provider
TTS_PROVIDER = "minimax"  # or "novita", "elevenlabs", "azure", "google"

# API Keys
API_KEYS = {
    "minimax": "your_minimax_api_key",
    "novita": "your_novita_api_key",
    "elevenlabs": "your_elevenlabs_api_key",
    "azure": "your_azure_subscription_key",
    "google": "your_google_cloud_api_key",
}

# Provider-specific config
MINIMAX_CONFIG = {
    "api_url": "https://api.ppio.com/v3/minimax-speech-2.8-hd",  # Customize your API endpoint
    "timeout": 60,
    "default_voice_settings": {
        "speed": 1.0,
        "pitch": 0,
        "voice_id": "male-qn-qingse"
    }
}

NOVITA_CONFIG = {
    "api_url": "https://api.novita.ai/v3/minimax-speech-2.8-turbo",
    "timeout": 60,
    "default_voice_settings": {
        "speed": 1.0,
        "pitch": 0,
        "voice_id": "male-qn-qingse"
    }
}

ELEVENLABS_CONFIG = {
    "api_url": "https://api.elevenlabs.io/v1/text-to-speech",
    "timeout": 60,
    "default_voice_settings": {
        "voice_id": "21m00Tcm4TlvDq8ikWAM",  # Rachel
        "model_id": "eleven_multilingual_v2",
        "stability": 0.5,
        "similarity_boost": 0.75
    }
}

AZURE_CONFIG = {
    "region": "eastus",
    "timeout": 60,
    "default_voice_settings": {
        "voice_name": "en-US-AriaNeural",
        "rate": "1.0",
        "pitch": "+0Hz"
    }
}

GOOGLE_CONFIG = {
    "api_url": "https://texttospeech.googleapis.com/v1/text:synthesize",
    "timeout": 60,
    "default_voice_settings": {
        "language_code": "en-US",
        "voice_name": "en-US-Neural2-F",
        "speaking_rate": 1.0,
        "pitch": 0.0
    }
}

Custom API Endpoints

The project supports custom API endpoints. Default endpoints for each provider:

PPIO: https://api.ppio.com/v3/minimax-speech-2.8-hd
Novita AI: https://api.novita.ai/v3/minimax-speech-2.8-turbo
ElevenLabs: https://api.elevenlabs.io/v1/text-to-speech
Azure: https://{region}.tts.speech.microsoft.com/cognitiveservices/v1 (region configurable)
Google Cloud: https://texttospeech.googleapis.com/v1/text:synthesize

You can customize any endpoint in config.py or through the web interface.

Simply change the api_url in your provider configuration:

MINIMAX_CONFIG = {
    "api_url": "YOUR_CUSTOM_API_ENDPOINT",
    "timeout": 60,
    # ... other settings
}

NOVITA_CONFIG = {
    "api_url": "YOUR_CUSTOM_API_ENDPOINT",
    "timeout": 60,
    # ... other settings
}
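One way to think about endpoint selection is that an explicit api_url overrides the default, and the Azure default is filled in from the configured region. The sketch below uses the default endpoints listed above; the function name `resolve_api_url` and the fallback region "eastus" (taken from the AZURE_CONFIG example) are assumptions, not the project's actual code.

```python
DEFAULT_API_URLS = {
    "minimax": "https://api.ppio.com/v3/minimax-speech-2.8-hd",
    "novita": "https://api.novita.ai/v3/minimax-speech-2.8-turbo",
    "elevenlabs": "https://api.elevenlabs.io/v1/text-to-speech",
    "azure": "https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
    "google": "https://texttospeech.googleapis.com/v1/text:synthesize",
}


def resolve_api_url(provider, config=None):
    """Pick the endpoint: an explicit api_url wins, else the default."""
    config = config or {}
    url = config.get("api_url") or DEFAULT_API_URLS[provider]
    # Only the Azure default contains a {region} placeholder.
    return url.replace("{region}", config.get("region", "eastus"))
```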

Voice Parameters (MiniMax)

| Parameter | Description | Range | Default |
|-----------|-------------|-------|---------|
| speed | Speech rate | 0.5-2.0 | 1.0 |
| pitch | Voice pitch | -12 to 12 | 0 |
| vol | Volume | 0.1-10 | 1.0 |
| emotion | Emotion | neutral, happy, sad, angry | neutral |
| voice_id | Voice ID | See available voices | male-qn-qingse |
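The ranges and defaults in the table can be checked before a request is sent. The sketch below is a hypothetical helper (`normalize_voice_settings` is not part of the project) that merges user settings over the defaults and rejects out-of-range values. It only knows the four emotions listed in the table; the provider may accept others.

```python
# Ranges, defaults, and emotions copied from the table above.
VOICE_RANGES = {"speed": (0.5, 2.0), "pitch": (-12, 12), "vol": (0.1, 10)}
VOICE_DEFAULTS = {
    "speed": 1.0,
    "pitch": 0,
    "vol": 1.0,
    "emotion": "neutral",
    "voice_id": "male-qn-qingse",
}
EMOTIONS = {"neutral", "happy", "sad", "angry"}


def normalize_voice_settings(settings=None):
    """Merge user settings over the defaults and reject invalid values."""
    merged = {**VOICE_DEFAULTS, **(settings or {})}
    for key, (low, high) in VOICE_RANGES.items():
        if not low <= merged[key] <= high:
            raise ValueError(f"{key}={merged[key]} is outside {low} to {high}")
    if merged["emotion"] not in EMOTIONS:
        raise ValueError(f"Unknown emotion: {merged['emotion']}")
    return merged
```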

📁 Project Structure

pdf-to-audio/
├── providers/                 # TTS provider implementations
│   ├── __init__.py           # Provider factory
│   ├── base_provider.py      # Abstract base class
│   └── minimax_provider.py   # MiniMax implementation
├── app.py                    # Web interface server
├── pdf_to_audio.py           # Main converter class
├── quick_start.py            # Quick start script
├── config_example.py         # Configuration template
├── requirements.txt          # Python dependencies
├── README.md                 # Project documentation
├── ADDING_NEW_PROVIDERS.md   # Guide to adding providers
├── LICENSE                   # MIT License
└── .gitignore               # Git ignore rules

🔧 Requirements

Python 3.7+
requests >= 2.31.0
PyPDF2 >= 3.0.0

🔌 Adding New TTS Providers

The architecture is designed for easy extensibility. To add a new provider:

Create a new provider class in providers/ directory:

# providers/custom_provider.py
from .base_provider import BaseTTSProvider

class CustomProvider(BaseTTSProvider):
    def text_to_speech(self, text, output_path, voice_settings=None):
        # Implement your TTS API call
        pass

    def get_available_voices(self):
        return ["voice1", "voice2", "voice3"]

    def get_default_voice_settings(self):
        return {"voice": "voice1", "speed": 1.0}

Register the provider in providers/__init__.py:

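from .minimax_provider import MiniMaxProvider
from .custom_provider import CustomProvider

# Maps the provider name used in config to its implementation class
PROVIDERS = {
    'minimax': MiniMaxProvider,
    'custom': CustomProvider,  # Add your provider here
}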

Update config_example.py with provider-specific settings
Test your implementation and submit a pull request!
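To show how a registry like this can be used, here is a self-contained sketch of a lookup function over a `PROVIDERS` dictionary. The `BaseTTSProvider` class below is a stand-in written for this example, and the signatures of `get_provider` and `get_available_providers` are assumptions; the real code in providers/ may differ.

```python
from abc import ABC, abstractmethod


class BaseTTSProvider(ABC):
    """Stand-in for providers/base_provider.py (the real class may differ)."""

    def __init__(self, api_key, **config):
        self.api_key = api_key
        self.config = config

    @abstractmethod
    def text_to_speech(self, text, output_path, voice_settings=None):
        ...


class CustomProvider(BaseTTSProvider):
    def text_to_speech(self, text, output_path, voice_settings=None):
        # A real provider would call its TTS API and write audio here.
        raise NotImplementedError


PROVIDERS = {"custom": CustomProvider}


def get_available_providers():
    """Return the registered provider names in alphabetical order."""
    return sorted(PROVIDERS)


def get_provider(name, api_key, **config):
    """Instantiate a registered provider, failing clearly on unknown names."""
    try:
        provider_cls = PROVIDERS[name.lower()]
    except KeyError:
        raise ValueError(
            f"Unknown provider {name!r}; available: {get_available_providers()}"
        ) from None
    return provider_cls(api_key, **config)
```

Keeping the lookup in one function means an unsupported name produces a single, descriptive error instead of a bare `KeyError`.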

📝 Notes

API Keys: Requires valid API keys for the chosen provider
PDF Format: Only supports PDFs with extractable text (not scanned images)
Network: Requires stable internet connection for API calls
API Limits: Be aware of provider-specific rate limits and quotas
Text Length: Long texts are automatically split (default max 5000 characters per chunk)
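The splitting described in the last note can be sketched as follows. This is an illustration, not the project's implementation: the function name `split_text` is hypothetical, sentences are detected only by ".", "!" or "?" followed by whitespace, and sentences are rejoined with a single space.

```python
import re


def split_text(text, max_chars=5000):
    """Split text into chunks of at most `max_chars`, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if current and len(current) + 1 + len(sentence) > max_chars:
            # Adding this sentence would overflow: start a new chunk.
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Splitting at sentence boundaries keeps each audio segment from starting or ending mid-sentence, which tends to sound more natural than cutting at a fixed character offset.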

🐛 Troubleshooting

Empty Text Extraction from PDF

Cause: PDF may be a scanned image
Solution: Use OCR tools to convert PDF to searchable text first
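A quick way to confirm this cause is to check how much text each page yields. The sketch below uses the PyPDF2 3.x `PdfReader` API for extraction; the helper names and the threshold of 20 characters per page are arbitrary assumptions for illustration.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of each page using PyPDF2 (3.x API)."""
    # Imported here so looks_scanned() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]


def looks_scanned(page_texts, min_chars_per_page=20):
    """Heuristic: treat the PDF as scanned if pages yield almost no text."""
    if not page_texts:
        return True
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

For example, `looks_scanned(extract_page_texts("document.pdf"))` returning True suggests the file needs OCR before conversion.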

API Returns 401 Error

Cause: Invalid or expired API key
Solution: Check and update your API key in config.py

API Returns 429 Error

Cause: Rate limit exceeded
Solution: Add delays between requests or wait before retrying
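Retrying with increasing delays can be sketched as below. The wrapper `call_with_backoff` is a hypothetical helper, not part of this project; it assumes the request function returns an object with a `status_code` attribute, such as a `requests.Response`.

```python
import time


def call_with_backoff(send_request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `send_request()` and retry on HTTP 429 with exponential backoff.

    Returns the first non-429 response, or the last response once
    `max_attempts` calls have been made.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        response = send_request()
        if response.status_code != 429 or attempt == max_attempts - 1:
            return response
        # Wait base_delay, then 2x, 4x, ... before the next attempt.
        sleep(base_delay * 2 ** attempt)
```

Some APIs also send a `Retry-After` header with a 429 response; honoring it when present is usually better than a fixed schedule.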

Provider Not Found Error

Cause: Unsupported provider name
Solution: Check available providers with get_available_providers()

🤝 Contributing

Contributions are welcome! Here's how you can help:

Add new TTS providers - See ADDING_NEW_PROVIDERS.md for a complete guide
Improve text extraction - Better handling of PDF formats and scanned documents
Add features - OCR support, subtitle generation, audio merging, batch processing
Fix bugs - Report issues or submit fixes
Improve documentation - Better examples, tutorials, and translations
Test providers - Help test and improve existing provider implementations

Development Process

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

MiniMax - For providing the Text-to-Speech API
PyPDF2 - For PDF text extraction
All contributors and users of this project

📮 Contact

For questions, suggestions, or issues:

Open an Issue
Submit a Pull Request

🗺️ Roadmap

✅ Completed

🔮 Future Enhancements

Implement OCR for scanned PDFs
Add subtitle/caption generation
Add audio file merging for multi-part outputs
Add Docker support
Add support for more languages
Batch processing improvements
Voice preview feature

⭐ If this project helps you, please give it a star!

Made with ❤️ by the open source community
