A complete local-first audio transcription pipeline with AI-powered summarization, using OpenAI's Whisper and GPT models. Built for Python 3.11+ with a modular, scalable architecture.
- Local Audio Transcription: Uses OpenAI Whisper models running locally for privacy and performance
- Automatic Language Detection: Supports English, Hindi, and Kannada with automatic detection
- AI Summarization: Generates concise summaries using OpenAI GPT models (GPT-3.5/GPT-4)
- Multiple Output Formats: Saves results as both structured JSON and readable text files
- Batch Processing: Process multiple audio files simultaneously
- Configurable Models: Switch between different Whisper and GPT models
- Comprehensive Logging: Detailed progress tracking and error handling
- CLI Interface: Easy-to-use command-line interface
transcriber/
├── config.py           # Configuration and environment management
├── transcriber.py      # Whisper-based audio transcription
├── summarizer.py       # OpenAI GPT-based summarization
├── pipeline.py         # Main orchestration and CLI interface
├── requirements.txt    # Python dependencies
├── .env.example        # Environment variable template
├── README.md           # This file
├── input_mp3s/         # Input audio files directory
└── output/             # Generated transcriptions and summaries
- Python 3.11+
- FFmpeg (required for audio processing)
- OpenAI API Key (for summarization)
Ubuntu/Debian:
sudo apt update
sudo apt install ffmpeg
macOS:
brew install ffmpeg
Windows: Download from https://ffmpeg.org/download.html
# Clone or download the project
git clone <repository-url>
cd transcriber
# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Copy environment template
cp .env.example .env
# Edit .env file with your OpenAI API key
nano .env # or use your preferred editor
Required Configuration:
# Get your API key from https://platform.openai.com/account/api-keys
OPENAI_API_KEY=your_actual_api_key_here
# Place your MP3 files in input_mp3s/ directory
cp your_audio_file.mp3 input_mp3s/
# Run the pipeline
python pipeline.py --input input_mp3s --output output
# Check results in output/ directory
ls output/
python pipeline.py [OPTIONS]
Options:
-i, --input DIR Input directory with audio files (default: input_mp3s)
-o, --output DIR Output directory for results (default: output)
-m, --model MODEL GPT model for summarization (default: gpt-3.5-turbo)
--whisper-model MODEL Whisper model for transcription (default: base)
-l, --language LANG Force specific language (en/hi/kn, auto-detect if not set)
--info Show system information
-v, --verbose Enable verbose logging
-h, --help Show help message
Basic transcription and summarization:
python pipeline.py --input input_mp3s --output output
Use GPT-4 for better summaries:
python pipeline.py --input audio_files --output results --model gpt-4
Force Hindi language transcription:
python pipeline.py --input input_mp3s --output output --language hi
Use larger Whisper model for better accuracy:
python pipeline.py --input input_mp3s --output output --whisper-model large
Show system information:
python pipeline.py --info
- MP3 (.mp3)
- WAV (.wav)
- M4A (.m4a)
- FLAC (.flac)
- English (en) - Primary support
- Hindi (hi) - Full support
- Kannada (kn) - Full support
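As a rough illustration of how batch processing over the input directory works, the sketch below filters a directory down to the supported extensions listed above. It is a minimal example using pathlib, not the pipeline's actual file-discovery code:

```python
from pathlib import Path

# Extensions accepted by the pipeline (see the list above).
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}

def find_audio_files(input_dir: str) -> list[Path]:
    """Return all supported audio files in input_dir, sorted by name."""
    return sorted(
        p for p in Path(input_dir).iterdir()
        if p.suffix.lower() in SUPPORTED_EXTENSIONS
    )

print(find_audio_files("input_mp3s"))
```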
For each input file, e.g. example.mp3, the pipeline generates:
Filename: example.mp3
Language: en
Processed: 2024-01-15 14:30:25
==================================================
TRANSCRIPTION:
--------------------
[Full transcribed text from the audio]
SUMMARY:
--------------------
[AI-generated concise summary]
==================================================
METADATA:
Model used: gpt-3.5-turbo
Tokens used: 150
{
"filename": "example.mp3",
"language": "en",
"transcription": "[Full transcribed text]",
"summary": "[AI-generated summary]",
"model_used": "gpt-3.5-turbo",
"tokens_used": {
"prompt_tokens": 120,
"completion_tokens": 30,
"total_tokens": 150
},
"segments": [
{
"start": 0.0,
"end": 5.0,
"text": "Hello, this is a test recording."
}
]
}
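The JSON output is convenient for downstream processing. A minimal sketch that loads one result file (the name example.json is a placeholder for your own output) and prints its summary and segment timings:

```python
import json
from pathlib import Path

# Placeholder result file; adjust the name to match your own output.
result_path = Path("output") / "example.json"

with result_path.open(encoding="utf-8") as f:
    result = json.load(f)

print(f"File: {result['filename']} (language: {result['language']})")
print(f"Summary: {result['summary']}")
print(f"Total tokens: {result['tokens_used']['total_tokens']}")

# Each segment carries start/end timestamps in seconds.
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text']}")
```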
# Required
OPENAI_API_KEY=your_openai_api_key_here
# Optional
DEFAULT_GPT_MODEL=gpt-3.5-turbo # GPT model for summarization
WHISPER_MODEL=base # Whisper model for transcription
INPUT_DIR=input_mp3s # Input directory path
OUTPUT_DIR=output # Output directory path
LOG_LEVEL=INFO # Logging level
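For reference, a minimal sketch of how these variables can be read with the python-dotenv package; the project's own loading logic lives in config.py and may differ:

```python
import os
from dotenv import load_dotenv

# Read key=value pairs from .env into the process environment.
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
DEFAULT_GPT_MODEL = os.getenv("DEFAULT_GPT_MODEL", "gpt-3.5-turbo")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "base")
INPUT_DIR = os.getenv("INPUT_DIR", "input_mp3s")
OUTPUT_DIR = os.getenv("OUTPUT_DIR", "output")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")

if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is not set; see .env.example")
```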
Whisper Models (accuracy vs. speed trade-off):
- tiny - Fastest, least accurate
- base - Good balance (default)
- small - Better accuracy
- medium - High accuracy
- large - Best accuracy, slowest
GPT Models (quality vs. cost trade-off):
- gpt-3.5-turbo - Fast, cost-effective (default)
- gpt-3.5-turbo-16k - Longer context window
- gpt-4 - Highest quality, more expensive
- gpt-4-32k - Highest quality + long context
- gpt-4-turbo-preview - Latest GPT-4 variant
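To compare the Whisper trade-offs directly, outside the pipeline, here is a minimal sketch using the openai-whisper package; the model name and audio path are placeholders:

```python
import whisper

# Larger models are more accurate but slower; see the list above.
model = whisper.load_model("base")

# language=None lets Whisper auto-detect; pass "hi" or "kn" to force a language.
result = model.transcribe("input_mp3s/example.mp3", language=None)

print(result["language"])   # detected language code
print(result["text"])       # full transcription
for seg in result["segments"]:
    print(seg["start"], seg["end"], seg["text"])
```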
Minimum:
- CPU: 2+ cores
- RAM: 4GB
- Storage: 2GB free space
- Python: 3.11+
Recommended:
- CPU: 4+ cores
- RAM: 8GB+
- GPU: NVIDIA GPU with CUDA support (optional, for faster Whisper inference)
- Storage: SSD with 5GB+ free space
To enable GPU acceleration for Whisper:
- Install CUDA-compatible PyTorch:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
- Verify GPU detection:
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
| Audio Length | Whisper Model | Processing Time |
|---|---|---|
| 1 minute | base | 10-30 seconds |
| 5 minutes | base | 30-90 seconds |
| 30 minutes | base | 3-8 minutes |
| 1 hour | large | 8-15 minutes |
| Model | Cost per 1K tokens | Typical Summary Cost |
|---|---|---|
| GPT-3.5 | $0.002 | $0.01-0.05 |
| GPT-4 | $0.06 | $0.10-0.50 |
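Using the per-token prices above, the cost of a single summary can be estimated from the tokens_used field in the JSON output. A rough sketch (prices are illustrative and change over time):

```python
# Approximate per-1K-token prices from the table above (subject to change).
PRICE_PER_1K = {"gpt-3.5-turbo": 0.002, "gpt-4": 0.06}

def estimate_cost(model: str, total_tokens: int) -> float:
    """Rough cost estimate in USD for a single summary."""
    return PRICE_PER_1K[model] * total_tokens / 1000

# Example: the sample output above reported 150 total tokens with gpt-3.5-turbo.
print(f"${estimate_cost('gpt-3.5-turbo', 150):.4f}")  # $0.0003
```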
1. "OpenAI API key not found" error:
# Check your .env file exists and contains the API key
cat .env
# Make sure there are no extra spaces around the key
2. "FFmpeg not found" error:
# Install FFmpeg
sudo apt install ffmpeg # Ubuntu/Debian
brew install ffmpeg # macOS
3. "CUDA out of memory" error:
# Use a smaller Whisper model
python pipeline.py --whisper-model tiny
4. Slow processing:
# Use smaller models or enable GPU acceleration
python pipeline.py --whisper-model base --model gpt-3.5-turbo
Enable verbose logging for detailed troubleshooting:
python pipeline.py --verbose
Check the log file for detailed error information:
tail -f transcription_pipeline.log
Create a test audio file or use online text-to-speech:
# Create test directory structure
mkdir -p input_mp3s output
# Place a test MP3 file in input_mp3s/
# Then run the pipeline
python pipeline.py --input input_mp3s --output output --verbose
# Check results
ls output/
cat output/test_file.txt
# Check system info
python pipeline.py --info
# Should show:
# - Whisper model loaded
# - OpenAI API configured
# - Supported languages and formats
- Local Processing: Audio transcription happens entirely on your machine
- API Calls: Only text transcriptions are sent to OpenAI for summarization
- Data Storage: All outputs are stored locally
- API Key: Store securely in the .env file, never commit it to version control
The pipeline is built with modular components:
- config.py: Centralized configuration management
- transcriber.py: Whisper integration and language detection
- summarizer.py: OpenAI API integration with retry logic
- pipeline.py: Main orchestration and CLI interface
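Conceptually, pipeline.py wires the other modules together once per file. The sketch below illustrates that flow; apart from summarize_text (used in the customization example below), the function and method names are hypothetical placeholders, so consult the actual modules for the real interfaces:

```python
# Conceptual flow only; transcribe_file and the transcriber/summarizer objects
# are hypothetical stand-ins for the interfaces defined in the real modules.
import json
from pathlib import Path

def process_file(audio_path: Path, transcriber, summarizer, output_dir: Path) -> dict:
    """Transcribe one audio file, summarize the text, and write a JSON result."""
    transcription = transcriber.transcribe_file(audio_path)      # hypothetical method
    summary = summarizer.summarize_text(transcription["text"])   # see customization below
    result = {
        "filename": audio_path.name,
        "language": transcription.get("language"),
        "transcription": transcription["text"],
        "summary": summary,
        "segments": transcription.get("segments", []),
    }
    out_path = output_dir / f"{audio_path.stem}.json"
    out_path.write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding="utf-8")
    return result
```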
Add new languages:
# In config.py, add to supported_languages
self.supported_languages: List[str] = ['en', 'hi', 'kn', 'es', 'fr']
Custom summarization prompts:
# In summarizer.py, modify system_prompt
custom_prompt = "Create a technical summary focusing on key decisions..."
summary = summarizer.summarize_text(text, custom_prompt=custom_prompt)
This project is provided as-is for educational and development purposes. Please ensure compliance with OpenAI's usage policies when using their API.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
For issues and questions:
- Check the troubleshooting section above
- Review the logs in transcription_pipeline.log
- Open an issue with detailed error information
Happy transcribing!