PDF to Audio Converter 🎙️

A flexible Python tool for converting PDF documents to audio using various TTS providers.

✨ Features

  • 🌐 Modern Web Interface - User-friendly web UI with drag & drop upload (NEW!)
  • 📚 Batch Processing - Convert multiple PDF documents at once
  • 🔌 Multiple TTS Providers - Extensible architecture to support different TTS services
  • ✂️ Smart Text Splitting - Automatically splits long texts to avoid API limits
  • 🎵 Customizable Voice - Adjust speed, pitch, timbre, emotion and more
  • 📊 Progress Tracking - Real-time progress display and status updates
  • 🔄 Error Handling - Robust error handling with retry mechanisms
  • 🎧 High Quality Output - MP3 format at 128kbps bitrate

Currently Supported Providers

  • ✅ PPIO - High-quality TTS using MiniMax Speech 2.8 HD model
  • ✅ Novita AI - High-quality TTS using MiniMax Speech 2.8 Turbo model
  • ✅ ElevenLabs - Premium quality TTS with realistic voices and voice cloning
  • ✅ Azure Cognitive Services - Microsoft's TTS with 400+ voices in 140+ languages
  • ✅ Google Cloud TTS - Google's WaveNet and Neural2 voices in 40+ languages

🎨 Web Interface

The easiest way to use this tool is through the modern web interface:

Key Features

  • No Command Line Required - Everything in your browser
  • Flexible API Configuration - Use any supported provider or a custom API endpoint
  • Drag & Drop Upload - Simply drag your PDF files
  • Real-time Progress - Visual progress bar with status updates
  • Audio Preview - Play audio before downloading
  • Batch Download - Download all files as ZIP
  • Modern Design - Clean, elegant, and intuitive interface

Quick Start with Web Interface

# 1. Install dependencies
pip install -r requirements.txt

# 2. Start the server
python app.py
# Or if port 5000 is occupied: PORT=5001 python app.py

# 3. Open in browser
# Default: http://localhost:5000
# Custom port: http://localhost:5001

That's it! Upload your PDF, enter your API key, and convert!

🚀 Quick Start

1. Clone Repository

git clone https://github.com/irismaker/pdf-to-audio.git
cd pdf-to-audio

2. Install Dependencies

pip install -r requirements.txt

3. Configure API

Method A: Using Config File (Recommended)

cp config_example.py config.py
# Edit config.py and add your API keys and settings

Method B: Using Environment Variables

export TTS_PROVIDER="minimax"  # or "novita", "elevenlabs", "azure", "google"
export TTS_API_KEY="your_api_key_here"
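As a rough illustration of Method B, the two variables could be read like this. This is a minimal sketch, not the project's actual loader: the function name `load_tts_settings` and the fallback to "minimax" are assumptions made for the example.

```python
import os

# Assumption: fall back to "minimax", the value used in the example above.
DEFAULT_PROVIDER = "minimax"


def load_tts_settings(env=None):
    """Return (provider, api_key) read from TTS_PROVIDER and TTS_API_KEY."""
    env = os.environ if env is None else env
    provider = env.get("TTS_PROVIDER", DEFAULT_PROVIDER).strip().lower()
    api_key = env.get("TTS_API_KEY", "").strip()
    if not api_key:
        # Fail early rather than sending an unauthenticated request.
        raise ValueError("TTS_API_KEY is not set")
    return provider, api_key
```

Passing a plain dictionary instead of `os.environ` makes the function easy to check without touching the real environment.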

4. Run the Program

Web Interface (Easiest - Recommended)

python app.py
# Default port is 5000

# If port 5000 is already in use (common on macOS):
PORT=5001 python app.py

Then open your browser and visit:

  • Default: http://localhost:5000
  • Custom port: http://localhost:5001 (or your configured port)

Note for macOS users: Port 5000 is often occupied by AirPlay Receiver. Use PORT=5001 python app.py instead.

Features:

  • Upload PDFs directly in your browser
  • No need to manage file folders
  • Visual progress tracking
  • Play audio before downloading
  • Download all files as ZIP

Command Line - Quick Start

python quick_start.py

Command Line - Interactive Mode

python pdf_to_audio.py

📖 Usage

Basic Usage

Interactive Mode

python pdf_to_audio.py

Follow the prompts to:

  1. Select TTS provider
  2. Enter API key (if not in config)
  3. Specify PDF file or directory
  4. Customize voice settings (optional)

Quick Start with Config

python quick_start.py

This script uses the settings in config.py and scans the current directory for PDFs.
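The directory scan can be pictured with the sketch below. It is an assumption about the behaviour rather than the code in quick_start.py: `find_pdfs` is a hypothetical helper, and it only looks at the top level of the directory.

```python
from pathlib import Path


def find_pdfs(directory="."):
    """Return the PDF files directly inside `directory`, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        # Match the extension case-insensitively so "REPORT.PDF" is included.
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    for pdf in find_pdfs():
        print(pdf.name)
```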

Web Interface Usage

The web interface provides the easiest way to convert PDFs to audio without any command line knowledge.

Starting the Web Server

python app.py

The server will start on http://localhost:5000 by default (or the port specified by the PORT environment variable). Open this URL in your web browser.

Note for macOS users: If port 5000 is already in use by AirPlay Receiver, you can either:

  • Disable AirPlay Receiver in System Settings > General > AirDrop & Handoff
  • Or run the app on a different port: PORT=5001 python app.py
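The PORT override could be resolved along these lines. This is a sketch under assumptions: `resolve_port` is a hypothetical helper, and how app.py actually reads the variable is not shown in this README.

```python
import os


def resolve_port(env=None, default=5000):
    """Return the port from the PORT variable, or `default` when it is unset."""
    env = os.environ if env is None else env
    raw = env.get("PORT", "").strip()
    if not raw:
        return default
    port = int(raw)  # raises ValueError for non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return port
```

Running `PORT=5001 python app.py` would then make the server listen on 5001, while a plain `python app.py` keeps the default of 5000.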

Using the Web Interface

  1. Configure API

    • Select provider (PPIO, Novita AI, ElevenLabs, Azure, or Google Cloud)
    • Enter your API key for the selected provider
    • API URL automatically updates based on provider selection
    • Optionally customize API endpoint
  2. Upload PDF

    • Click the upload area or drag and drop your PDF file
    • Maximum file size: 50MB
    • Only PDF files are accepted
  3. Customize Voice (Optional)

    • Select voice type (male or female voices available)
    • Adjust speed (0.5x to 2.0x)
    • Adjust pitch (-12 to +12)
    • Choose emotion (calm, happy, sad, angry, etc.)
  4. Convert

    • Click "Convert to Audio" button
    • Watch real-time progress
    • View status messages during conversion
  5. Download Results

    • Play audio files directly in browser
    • Download individual files
    • Download all files as a ZIP archive
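The upload rules in step 2 (PDF only, 50MB maximum) could be enforced with a check like the one below. It is a hedged sketch: `validate_upload` is a hypothetical name, and treating 50MB as 50 * 1024 * 1024 bytes is an assumption.

```python
# Assumption: "50MB" is interpreted as 50 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def validate_upload(filename, size_bytes):
    """Return None when the upload is acceptable, otherwise an error message."""
    if not filename.lower().endswith(".pdf"):
        return "Only PDF files are accepted"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "File exceeds the 50MB limit"
    return None
```

Returning a message instead of raising keeps the check easy to surface in a web form.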

Web Interface Features

  • No File Management: Upload files directly through browser
  • Flexible API Configuration: Use any supported provider or a custom API endpoint
  • Real-time Progress: Visual progress bar with status messages
  • Audio Preview: Play audio before downloading
  • Batch Download: Download all generated files as ZIP
  • Error Handling: Clear error messages with helpful guidance
  • Responsive Design: Works on desktop and mobile devices
  • Auto-cleanup: Temporary files automatically deleted after 1 hour
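The auto-cleanup behaviour can be approximated with a function that deletes files older than one hour. This is only a sketch of the idea: `cleanup_old_files` is a hypothetical helper, and the real server may track temporary files differently.

```python
import os
import time

MAX_AGE_SECONDS = 60 * 60  # one hour, as stated in the feature list


def cleanup_old_files(directory, max_age=MAX_AGE_SECONDS, now=None):
    """Delete files in `directory` older than `max_age`; return their names."""
    now = time.time() if now is None else now
    removed = []
    # Materialise the listing first so deletion does not disturb iteration.
    for entry in list(os.scandir(directory)):
        if entry.is_file() and now - entry.stat().st_mtime > max_age:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```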

Advanced Usage

Use as Python Module

from pdf_to_audio import PDFToAudioConverter

# Create converter with MiniMax
converter = PDFToAudioConverter(
    provider_name="minimax",
    api_key="your_api_key",
    provider_config={"timeout": 60}
)

# Or use other providers
# Novita AI
converter = PDFToAudioConverter(
    provider_name="novita",
    api_key="your_novita_api_key"
)

# ElevenLabs
converter = PDFToAudioConverter(
    provider_name="elevenlabs",
    api_key="your_elevenlabs_api_key"
)

# Azure Cognitive Services
converter = PDFToAudioConverter(
    provider_name="azure",
    api_key="your_azure_key",
    provider_config={"region": "eastus"}
)

# Google Cloud TTS
converter = PDFToAudioConverter(
    provider_name="google",
    api_key="your_google_api_key"
)

# Convert single file
converter.convert_pdf_to_audio(
    pdf_path="document.pdf",
    output_dir="audio_output"
)

# Batch convert directory
converter.batch_convert(
    pdf_dir="./pdfs",
    output_dir="./audio_output"
)

Custom Voice Settings

# MiniMax voice settings
voice_settings = {
    "speed": 1.2,           # Speech rate: 0.5-2.0
    "pitch": 2,             # Pitch: -12 to 12
    "vol": 1.5,             # Volume: 0.1-10
    "emotion": "happy",     # Emotion: neutral, happy, sad, angry
    "voice_id": "female-tianmei"  # Voice ID
}

converter.convert_pdf_to_audio(
    pdf_path="document.pdf",
    voice_settings=voice_settings
)

🎨 Available Voices (MiniMax)

Male Voices

  • male-qn-qingse - Young Male
  • male-qn-jingying - Professional Male
  • male-qn-badao - Commanding Male
  • male-qn-daxuesheng - College Student Male

Female Voices

  • female-shaonv - Young Female
  • female-yujie - Mature Female
  • female-chengshu - Sophisticated Female
  • female-tianmei - Sweet Female

⚙️ Configuration

Provider Settings

Edit config.py to configure providers:

# Select provider
TTS_PROVIDER = "minimax"  # or "novita", "elevenlabs", "azure", "google"

# API Keys
API_KEYS = {
    "minimax": "your_minimax_api_key",
    "novita": "your_novita_api_key",
    "elevenlabs": "your_elevenlabs_api_key",
    "azure": "your_azure_subscription_key",
    "google": "your_google_cloud_api_key",
}

# Provider-specific config
MINIMAX_CONFIG = {
    "api_url": "https://api.ppio.com/v3/minimax-speech-2.8-hd",  # Customize your API endpoint
    "timeout": 60,
    "default_voice_settings": {
        "speed": 1.0,
        "pitch": 0,
        "voice_id": "male-qn-qingse"
    }
}

NOVITA_CONFIG = {
    "api_url": "https://api.novita.ai/v3/minimax-speech-2.8-turbo",
    "timeout": 60,
    "default_voice_settings": {
        "speed": 1.0,
        "pitch": 0,
        "voice_id": "male-qn-qingse"
    }
}

ELEVENLABS_CONFIG = {
    "api_url": "https://api.elevenlabs.io/v1/text-to-speech",
    "timeout": 60,
    "default_voice_settings": {
        "voice_id": "21m00Tcm4TlvDq8ikWAM",  # Rachel
        "model_id": "eleven_multilingual_v2",
        "stability": 0.5,
        "similarity_boost": 0.75
    }
}

AZURE_CONFIG = {
    "region": "eastus",
    "timeout": 60,
    "default_voice_settings": {
        "voice_name": "en-US-AriaNeural",
        "rate": "1.0",
        "pitch": "+0Hz"
    }
}

GOOGLE_CONFIG = {
    "api_url": "https://texttospeech.googleapis.com/v1/text:synthesize",
    "timeout": 60,
    "default_voice_settings": {
        "language_code": "en-US",
        "voice_name": "en-US-Neural2-F",
        "speaking_rate": 1.0,
        "pitch": 0.0
    }
}
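One plausible way to combine a provider's default_voice_settings with per-call overrides is a shallow merge, sketched below. Whether the converter merges settings exactly like this is an assumption, and `effective_voice_settings` is a hypothetical helper.

```python
# A trimmed copy of the MiniMax block above, used only for illustration.
MINIMAX_CONFIG = {
    "timeout": 60,
    "default_voice_settings": {
        "speed": 1.0,
        "pitch": 0,
        "voice_id": "male-qn-qingse",
    },
}


def effective_voice_settings(provider_config, overrides=None):
    """Return the defaults with any caller-supplied overrides applied on top."""
    settings = dict(provider_config.get("default_voice_settings", {}))
    settings.update(overrides or {})
    return settings


print(effective_voice_settings(MINIMAX_CONFIG, {"speed": 1.2}))
```

Copying the defaults before updating keeps the configuration dictionary itself unchanged between calls.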

Custom API Endpoints

The project supports custom API endpoints. Default endpoints for each provider:

  • PPIO: https://api.ppio.com/v3/minimax-speech-2.8-hd
  • Novita AI: https://api.novita.ai/v3/minimax-speech-2.8-turbo
  • ElevenLabs: https://api.elevenlabs.io/v1/text-to-speech
  • Azure: https://{region}.tts.speech.microsoft.com/cognitiveservices/v1 (region configurable)
  • Google Cloud: https://texttospeech.googleapis.com/v1/text:synthesize

You can customize any endpoint in config.py or through the web interface.

Simply change the api_url in your provider configuration:

MINIMAX_CONFIG = {
    "api_url": "YOUR_CUSTOM_API_ENDPOINT",
    "timeout": 60,
    # ... other settings
}

NOVITA_CONFIG = {
    "api_url": "YOUR_CUSTOM_API_ENDPOINT",
    "timeout": 60,
    # ... other settings
}
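The Azure default is a template rather than a fixed URL, so the region has to be substituted before use. A small sketch follows; `azure_endpoint` is a hypothetical helper, and the alphanumeric check is an assumption about what a valid region name looks like.

```python
AZURE_ENDPOINT_TEMPLATE = (
    "https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
)


def azure_endpoint(region):
    """Build the Azure TTS endpoint for a region such as "eastus"."""
    region = region.strip().lower()
    # Assumption: region names are plain alphanumeric identifiers.
    if not region.isalnum():
        raise ValueError(f"Unexpected region name: {region!r}")
    return AZURE_ENDPOINT_TEMPLATE.format(region=region)


print(azure_endpoint("eastus"))
```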

Voice Parameters (MiniMax)

Parameter   Description   Range                        Default
speed       Speech rate   0.5-2.0                      1.0
pitch       Voice pitch   -12 to 12                    0
vol         Volume        0.1-10                       1.0
emotion     Emotion       neutral, happy, sad, angry   neutral
voice_id    Voice ID      See available voices         male-qn-qingse
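The ranges in this table can be checked before a request is sent. The sketch below is illustrative only: `validate_voice_settings` is a hypothetical helper, and the README does not say whether the tool validates settings itself.

```python
# Numeric ranges and emotion values copied from the table above.
VOICE_RANGES = {"speed": (0.5, 2.0), "pitch": (-12, 12), "vol": (0.1, 10)}
EMOTIONS = {"neutral", "happy", "sad", "angry"}


def validate_voice_settings(settings):
    """Return a list of problems; an empty list means the settings look valid."""
    errors = []
    for name, (low, high) in VOICE_RANGES.items():
        if name in settings and not low <= settings[name] <= high:
            errors.append(f"{name} must be between {low} and {high}")
    emotion = settings.get("emotion")
    if emotion is not None and emotion not in EMOTIONS:
        errors.append(f"emotion must be one of {sorted(EMOTIONS)}")
    return errors
```

Collecting every problem in one pass lets a caller report all invalid fields at once instead of stopping at the first.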

📁 Project Structure

pdf-to-audio/
├── providers/                 # TTS provider implementations
│   ├── __init__.py           # Provider factory
│   ├── base_provider.py      # Abstract base class
│   └── minimax_provider.py   # MiniMax implementation
├── pdf_to_audio.py           # Main converter class
├── quick_start.py            # Quick start script
├── config_example.py         # Configuration template
├── requirements.txt          # Python dependencies
├── README.md                 # Project documentation
├── LICENSE                   # MIT License
└── .gitignore               # Git ignore rules

🔧 Requirements

  • Python 3.7+
  • requests >= 2.31.0
  • PyPDF2 >= 3.0.0

🔌 Adding New TTS Providers

The architecture is designed for easy extensibility. To add a new provider:

  1. Create a new provider class in providers/ directory:
# providers/custom_provider.py
from .base_provider import BaseTTSProvider

class CustomProvider(BaseTTSProvider):
    def text_to_speech(self, text, output_path, voice_settings=None):
        # Implement your TTS API call
        pass

    def get_available_voices(self):
        return ["voice1", "voice2", "voice3"]

    def get_default_voice_settings(self):
        return {"voice": "voice1", "speed": 1.0}
  2. Register the provider in providers/__init__.py:
from .custom_provider import CustomProvider

PROVIDERS = {
    'minimax': MiniMaxProvider,
    'custom': CustomProvider,  # Add your provider here
}
  3. Update config_example.py with provider-specific settings

  4. Test your implementation and submit a pull request!
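When testing a new provider, an offline stand-in can help exercise the three required methods without calling any API. The sketch below is self-contained and makes several assumptions: the `BaseTTSProvider` shown here is a stand-in whose constructor may differ from the one in providers/base_provider.py, and `EchoProvider` is a hypothetical class that writes the text itself rather than audio.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseTTSProvider(ABC):
    """Stand-in for providers/base_provider.py; the real class may differ."""

    def __init__(self, api_key="", config=None):
        self.api_key = api_key
        self.config = config or {}

    @abstractmethod
    def text_to_speech(self, text, output_path, voice_settings=None):
        ...

    @abstractmethod
    def get_available_voices(self):
        ...

    @abstractmethod
    def get_default_voice_settings(self):
        ...


class EchoProvider(BaseTTSProvider):
    """Hypothetical offline provider: writes the text to disk, not audio."""

    def text_to_speech(self, text, output_path, voice_settings=None):
        # Apply caller overrides on top of this provider's defaults.
        settings = {**self.get_default_voice_settings(), **(voice_settings or {})}
        Path(output_path).write_text(
            f"[{settings['voice']}] {text}", encoding="utf-8"
        )
        return output_path

    def get_available_voices(self):
        return ["voice1", "voice2", "voice3"]

    def get_default_voice_settings(self):
        return {"voice": "voice1", "speed": 1.0}
```

Because the base class is abstract, forgetting to implement one of the three methods raises a TypeError when the provider is instantiated, which surfaces mistakes early.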

📝 Notes

  1. API Keys: Requires valid API keys for the chosen provider
  2. PDF Format: Only supports PDFs with extractable text (not scanned images)
  3. Network: Requires stable internet connection for API calls
  4. API Limits: Be aware of provider-specific rate limits and quotas
  5. Text Length: Long texts are automatically split (default max 5000 characters per chunk)
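The text-length note can be illustrated with a simple splitter that prefers sentence boundaries and falls back to hard cuts for very long sentences. This is a sketch of the general technique, not the project's splitting code; `split_text` is a hypothetical function, and only the 5000-character default comes from the note above.

```python
import re


def split_text(text, max_chars=5000):
    """Split text into chunks of at most `max_chars` characters."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    # Break after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut into fixed slices.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the provider as a separate request, with the resulting audio parts kept in order.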

🐛 Troubleshooting

Empty Text Extraction from PDF

  • Cause: PDF may be a scanned image
  • Solution: Use OCR tools to convert PDF to searchable text first
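A quick check for extractable text can tell a scanned PDF apart from a text PDF before any API call is made. The sketch below uses PyPDF2's `PdfReader` and `extract_text()`; the helper names `has_extractable_text` and `pdf_has_text` are hypothetical, and PyPDF2 is imported lazily so the first function works without it.

```python
def has_extractable_text(page_texts, min_chars=1):
    """Return True if the pages contain at least `min_chars` non-blank characters."""
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars


def pdf_has_text(pdf_path):
    """Return True if PyPDF2 can extract any text from the PDF at `pdf_path`."""
    from PyPDF2 import PdfReader  # requires PyPDF2 >= 3.0.0

    reader = PdfReader(pdf_path)
    return has_extractable_text(page.extract_text() for page in reader.pages)
```

If `pdf_has_text` returns False, the file is probably a scanned image and needs OCR first.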

API Returns 401 Error

  • Cause: Invalid or expired API key
  • Solution: Check and update your API key in config.py

API Returns 429 Error

  • Cause: Rate limit exceeded
  • Solution: Add delays between requests or wait before retrying
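Waiting before retrying is commonly implemented as exponential backoff. The sketch below shows the pattern in isolation: `RateLimitError` and `call_with_backoff` are hypothetical names, and the retry logic built into the tool may behave differently.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the API answers with HTTP 429."""


def call_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `func`, doubling the wait after each rate-limit failure."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the final attempt
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the waiting behaviour can be tested without real delays.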

Provider Not Found Error

  • Cause: Unsupported provider name
  • Solution: Check available providers with get_available_providers()
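A provider lookup that reports the valid names makes this error easier to diagnose. The sketch below is an assumption about how such a registry could work: `MiniMaxProvider` is an empty stand-in, `get_provider_class` is a hypothetical helper, and the real `get_available_providers()` may return a different shape.

```python
class MiniMaxProvider:
    """Empty stand-in for the real MiniMax provider class."""


# Registry in the style of providers/__init__.py.
PROVIDERS = {"minimax": MiniMaxProvider}


def get_available_providers():
    """Return the registered provider names in alphabetical order."""
    return sorted(PROVIDERS)


def get_provider_class(name):
    """Look up a provider class, listing the valid names on failure."""
    try:
        return PROVIDERS[name.strip().lower()]
    except KeyError:
        available = ", ".join(get_available_providers())
        raise ValueError(
            f"Unknown provider {name!r}; available: {available}"
        ) from None
```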

🤝 Contributing

Contributions are welcome! Here's how you can help:

  1. Add new TTS providers - See ADDING_NEW_PROVIDERS.md for a complete guide
  2. Improve text extraction - Better handling of PDF formats and scanned documents
  3. Add features - OCR support, subtitle generation, audio merging, batch processing
  4. Fix bugs - Report issues or submit fixes
  5. Improve documentation - Better examples, tutorials, and translations
  6. Test providers - Help test and improve existing provider implementations

Development Process

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • MiniMax - For providing the Text-to-Speech API
  • PyPDF2 - For PDF text extraction
  • All contributors and users of this project

📮 Contact

For questions, suggestions, or issues:

🗺️ Roadmap

✅ Completed

  • Add ElevenLabs provider
  • Add Azure TTS provider
  • Add Google Cloud TTS provider
  • Web interface with modern design
  • Multiple TTS provider support

🔮 Future Enhancements

  • Implement OCR for scanned PDFs
  • Add subtitle/caption generation
  • Add audio file merging for multi-part outputs
  • Add Docker support
  • Add support for more languages
  • Batch processing improvements
  • Voice preview feature

⭐ If this project helps you, please give it a star!

Made with ❤️ by the open source community
