A complete local-first audio transcription pipeline with AI-powered summarization, using OpenAI's Whisper and GPT models. Built for Python 3.11+ with a modular, scalable architecture.
- Local Audio Transcription: Uses OpenAI Whisper models running locally for privacy and performance
- Automatic Language Detection: Supports English, Hindi, and Kannada with automatic detection
- AI Summarization: Generates concise summaries using OpenAI GPT models (GPT-3.5/GPT-4)
- Multiple Output Formats: Saves results as both structured JSON and readable text files
- Batch Processing: Process multiple audio files simultaneously
- Configurable Models: Switch between different Whisper and GPT models
- Comprehensive Logging: Detailed progress tracking and error handling
- CLI Interface: Easy-to-use command-line interface
transcriber/
├── config.py           # Configuration and environment management
├── transcriber.py      # Whisper-based audio transcription
├── summarizer.py       # OpenAI GPT-based summarization
├── pipeline.py         # Main orchestration and CLI interface
├── requirements.txt    # Python dependencies
├── .env.example        # Environment variable template
├── README.md           # This file
├── input_mp3s/         # Input audio files directory
└── output/             # Generated transcriptions and summaries
- Python 3.11+
- FFmpeg (required for audio processing)
- OpenAI API Key (for summarization)
Ubuntu/Debian:
sudo apt update
sudo apt install ffmpeg
macOS:
brew install ffmpeg
Windows: Download from https://ffmpeg.org/download.html
# Clone or download the project
git clone <repository-url>
cd transcriber
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Copy environment template
cp .env.example .env
# Edit .env file with your OpenAI API key
nano .env # or use your preferred editor
Required Configuration:
# Get your API key from https://platform.openai.com/account/api-keys
OPENAI_API_KEY=your_actual_api_key_here
# Place your MP3 files in input_mp3s/ directory
cp your_audio_file.mp3 input_mp3s/
# Run the pipeline
python pipeline.py --input input_mp3s --output output
# Check results in output/ directory
ls output/
python pipeline.py [OPTIONS]
Options:
-i, --input DIR Input directory with audio files (default: input_mp3s)
-o, --output DIR Output directory for results (default: output)
-m, --model MODEL GPT model for summarization (default: gpt-3.5-turbo)
--whisper-model MODEL Whisper model for transcription (default: base)
-l, --language LANG Force specific language (en/hi/kn, auto-detect if not set)
--info Show system information
-v, --verbose Enable verbose logging
-h, --help Show help message
Basic transcription and summarization:
python pipeline.py --input input_mp3s --output output
Use GPT-4 for better summaries:
python pipeline.py --input audio_files --output results --model gpt-4
Force Hindi language transcription:
python pipeline.py --input input_mp3s --output output --language hi
Use larger Whisper model for better accuracy:
python pipeline.py --input input_mp3s --output output --whisper-model large
Show system information:
python pipeline.py --info
- MP3 (.mp3)
- WAV (.wav)
- M4A (.m4a)
- FLAC (.flac)
- English (en) - Primary support
- Hindi (hi) - Full support
- Kannada (kn) - Full support
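As a rough illustration of how batch processing over the input directory works, the sketch below filters a directory down to the supported extensions listed above. It is a minimal example using pathlib, not the pipeline's actual file-discovery code:

```python
from pathlib import Path

# Extensions accepted by the pipeline (see the list above).
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}

def find_audio_files(input_dir: str) -> list[Path]:
    """Return all supported audio files in input_dir, sorted by name."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.suffix.lower() in SUPPORTED_EXTENSIONS
    )

print(find_audio_files("input_mp3s"))
```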
For each input file, e.g. example.mp3, the pipeline generates:
Filename: example.mp3
Language: en
Processed: 2024-01-15 14:30:25
==================================================
TRANSCRIPTION:
--------------------
[Full transcribed text from the audio]
SUMMARY:
--------------------
[AI-generated concise summary]
==================================================
METADATA:
Model used: gpt-3.5-turbo
Tokens used: 150
{
"filename": "example.mp3",
"language": "en",
"transcription": "[Full transcribed text]",
"summary": "[AI-generated summary]",
"model_used": "gpt-3.5-turbo",
"tokens_used": {
"prompt_tokens": 120,
"completion_tokens": 30,
"total_tokens": 150
},
"segments": [
{
"start": 0.0,
"end": 5.0,
"text": "Hello, this is a test recording."
}
]
}
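The JSON output is convenient for downstream processing. A minimal sketch that loads one result file (the name example.json is a placeholder for your own output) and prints its summary and segment timings:

```python
import json
from pathlib import Path

# Placeholder result file; adjust the name to match your own output.
result_path = Path("output") / "example.json"

with result_path.open(encoding="utf-8") as f:
    result = json.load(f)

print(f"File: {result['filename']} (language: {result['language']})")
print(f"Summary: {result['summary']}")
print(f"Total tokens: {result['tokens_used']['total_tokens']}")

# Each segment carries start/end timestamps in seconds.
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```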
# Required
OPENAI_API_KEY=your_openai_api_key_here
# Optional
DEFAULT_GPT_MODEL=gpt-3.5-turbo # GPT model for summarization
WHISPER_MODEL=base # Whisper model for transcription
INPUT_DIR=input_mp3s # Input directory path
OUTPUT_DIR=output # Output directory path
LOG_LEVEL=INFO # Logging level
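For reference, a minimal sketch of how these variables can be read with the python-dotenv package; the project's own loading logic lives in config.py and may differ:

```python
import os
from dotenv import load_dotenv

# Read key=value pairs from .env into the process environment.
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
DEFAULT_GPT_MODEL = os.getenv("DEFAULT_GPT_MODEL", "gpt-3.5-turbo")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "base")
INPUT_DIR = os.getenv("INPUT_DIR", "input_mp3s")
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "output")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")

if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is not set; see .env.example")
```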
Whisper Models (accuracy vs. speed trade-off):
- tiny - Fastest, least accurate
- base - Good balance (default)
- small - Better accuracy
- medium - High accuracy
- large - Best accuracy, slowest
GPT Models (quality vs. cost trade-off):
- gpt-3.5-turbo - Fast, cost-effective (default)
- gpt-3.5-turbo-16k - Longer context window
- gpt-4 - Highest quality, more expensive
- gpt-4-32k - Highest quality + long context
- gpt-4-turbo-preview - Latest GPT-4 variant
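To compare the Whisper trade-offs directly, outside the pipeline, here is a minimal sketch using the openai-whisper package; the model name and audio path are placeholders:

```python
import whisper

# Larger models are more accurate but slower; see the list above.
model = whisper.load_model("base")

# language=None lets Whisper auto-detect; pass "hi" or "kn" to force a language.
result = model.transcribe("input_mp3s/example.mp3", language=None)

print(result["language"])   # detected language code
print(result["text"])       # full transcription
for seg in result["segments"]:
    print(seg["start"], seg["end"], seg["text"])
```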
Minimum:
- CPU: 2+ cores
- RAM: 4GB
- Storage: 2GB free space
- Python: 3.11+
Recommended:
- CPU: 4+ cores
- RAM: 8GB+
- GPU: NVIDIA GPU with CUDA support (optional, for faster Whisper inference)
- Storage: SSD with 5GB+ free space
To enable GPU acceleration for Whisper:
- Install CUDA-compatible PyTorch:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
- Verify GPU detection:
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
| Audio Length | Whisper Model | Processing Time |
|---|---|---|
| 1 minute | base | 10-30 seconds |
| 5 minutes | base | 30-90 seconds |
| 30 minutes | base | 3-8 minutes |
| 1 hour | large | 8-15 minutes |
| Model | Cost per 1K tokens | Typical Summary Cost |
|---|---|---|
| GPT-3.5 | $0.002 | $0.01-0.05 |
| GPT-4 | $0.06 | $0.10-0.50 |
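Using the per-token prices above, the cost of a single summary can be estimated from the tokens_used field in the JSON output. A rough sketch (prices are illustrative and change over time):

```python
# Approximate per-1K-token prices from the table above (subject to change).
PRICE_PER_1K = {"gpt-3.5-turbo": 0.002, "gpt-4": 0.06}

def estimate_cost(model: str, total_tokens: int) -> float:
    """Rough cost estimate in USD for a single summary."""
    return PRICE_PER_1K[model] * total_tokens / 1000

# Example: the sample output above reported 150 total tokens with gpt-3.5-turbo.
print(f"${estimate_cost('gpt-3.5-turbo', 150):.4f}")  # $0.0003
```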
1. "OpenAI API key not found" error:
# Check your .env file exists and contains the API key
cat .env
# Make sure there are no extra spaces around the key
2. "FFmpeg not found" error:
# Install FFmpeg
sudo apt install ffmpeg # Ubuntu/Debian
brew install ffmpeg # macOS
3. "CUDA out of memory" error:
# Use a smaller Whisper model
python pipeline.py --whisper-model tiny
4. Slow processing:
# Use smaller models or enable GPU acceleration
python pipeline.py --whisper-model base --model gpt-3.5-turbo
Enable verbose logging for detailed troubleshooting:
python pipeline.py --verbose
Check the log file for detailed error information:
tail -f transcription_pipeline.log
Create a test audio file or use online text-to-speech:
# Create test directory structure
mkdir -p input_mp3s output
# Place a test MP3 file in input_mp3s/
# Then run the pipeline
python pipeline.py --input input_mp3s --output output --verbose
# Check results
ls output/
cat output/test_file.txt
# Check system info
python pipeline.py --info
# Should show:
# - Whisper model loaded
# - OpenAI API configured
# - Supported languages and formats
- Local Processing: Audio transcription happens entirely on your machine
- API Calls: Only text transcriptions are sent to OpenAI for summarization
- Data Storage: All outputs are stored locally
- API Key: Store securely in the .env file, never commit it to version control
The pipeline is built with modular components:
- config.py: Centralized configuration management
- transcriber.py: Whisper integration and language detection
- summarizer.py: OpenAI API integration with retry logic
- pipeline.py: Main orchestration and CLI interface
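Conceptually, pipeline.py wires the other modules together once per file. The sketch below illustrates that flow; apart from summarize_text (used in the customization example below), the function and method names are hypothetical placeholders, so consult the actual modules for the real interfaces:

```python
# Conceptual flow only; transcribe_file and the transcriber/summarizer objects
# are hypothetical stand-ins for the interfaces defined in the real modules.
import json
from pathlib import Path

def process_file(audio_path: Path, transcriber, summarizer, output_dir: Path) -> dict:
    """Transcribe one audio file, summarize the text, and write a JSON result."""
    transcription = transcriber.transcribe_file(audio_path)      # hypothetical method
    summary = summarizer.summarize_text(transcription["text"])   # see customization below
    result = {
        "filename": audio_path.name,
        "language": transcription.get("language"),
        "transcription": transcription["text"],
        "summary": summary,
        "segments": transcription.get("segments", []),
    }
    out_path = output_dir / f"{audio_path.stem}.json"
    out_path.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    return result
```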
Add new languages:
# In config.py, add to supported_languages
self.supported_languages: List[str] = ['en', 'hi', 'kn', 'es', 'fr']
Custom summarization prompts:
# In summarizer.py, modify system_prompt
custom_prompt = "Create a technical summary focusing on key decisions..."
summary = summarizer.summarize_text(text, custom_prompt=custom_prompt)
This project is provided as-is for educational and development purposes. Please ensure compliance with OpenAI's usage policies when using their API.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
For issues and questions:
- Check the troubleshooting section above
- Review the logs in transcription_pipeline.log
- Open an issue with detailed error information
Happy transcribing!