
Audio Transcription & Summarization Pipeline

A complete local-first audio transcription pipeline with AI-powered summarization using OpenAI's Whisper and GPT models. Built for Python 3.11+ with a modular, scalable architecture.

🎯 Features

  • Local Audio Transcription: Uses OpenAI Whisper models running locally for privacy and performance
  • Automatic Language Detection: Detects and transcribes English, Hindi, and Kannada (see the sketch after this list)
  • AI Summarization: Generates concise summaries using OpenAI GPT models (GPT-3.5/GPT-4)
  • Multiple Output Formats: Saves results as both structured JSON and readable text files
  • Batch Processing: Process multiple audio files simultaneously
  • Configurable Models: Switch between different Whisper and GPT models
  • Comprehensive Logging: Detailed progress tracking and error handling
  • CLI Interface: Easy-to-use command-line interface
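
For reference, language detection with the openai-whisper package looks roughly like this (an illustrative sketch; the project's actual logic lives in transcriber.py):

import whisper

# Load a local Whisper model and detect the spoken language of a clip
model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("input_mp3s/example.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")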

📂 Project Structure

transcriber/
├── config.py            # Configuration and environment management
├── transcriber.py       # Whisper-based audio transcription
├── summarizer.py        # OpenAI GPT-based summarization
├── pipeline.py          # Main orchestration and CLI interface
├── requirements.txt     # Python dependencies
├── .env.example         # Environment variable template
├── README.md            # This file
├── input_mp3s/          # Input audio files directory
└── output/              # Generated transcriptions and summaries

🚀 Quick Start

1. Prerequisites

  • Python 3.11+
  • FFmpeg (required for audio processing)
  • OpenAI API Key (for summarization)

Install FFmpeg

Ubuntu/Debian:

sudo apt update
sudo apt install ffmpeg

macOS:

brew install ffmpeg

Windows: Download from https://ffmpeg.org/download.html

2. Installation

# Clone or download the project
git clone <repository-url>
cd transcriber

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

3. Configuration

# Copy environment template
cp .env.example .env

# Edit .env file with your OpenAI API key
nano .env  # or use your preferred editor

Required Configuration:

# Get your API key from https://platform.openai.com/account/api-keys
OPENAI_API_KEY=your_actual_api_key_here

4. Basic Usage

# Place your MP3 files in input_mp3s/ directory
cp your_audio_file.mp3 input_mp3s/

# Run the pipeline
python pipeline.py --input input_mp3s --output output

# Check results in output/ directory
ls output/

📖 Detailed Usage

Command Line Options

python pipeline.py [OPTIONS]

Options:
  -i, --input DIR           Input directory with audio files (default: input_mp3s)
  -o, --output DIR          Output directory for results (default: output)
  -m, --model MODEL         GPT model for summarization (default: gpt-3.5-turbo)
  --whisper-model MODEL     Whisper model for transcription (default: base)
  -l, --language LANG       Force specific language (en/hi/kn, auto-detect if not set)
  --info                    Show system information
  -v, --verbose             Enable verbose logging
  -h, --help                Show help message

Usage Examples

Basic transcription and summarization:

python pipeline.py --input input_mp3s --output output

Use GPT-4 for better summaries:

python pipeline.py --input audio_files --output results --model gpt-4

Force Hindi language transcription:

python pipeline.py --input input_mp3s --output output --language hi

Use larger Whisper model for better accuracy:

python pipeline.py --input input_mp3s --output output --whisper-model large

Show system information:

python pipeline.py --info

Supported File Formats

  • MP3 (.mp3)
  • WAV (.wav)
  • M4A (.m4a)
  • FLAC (.flac)
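
Discovering input files boils down to filtering the input directory by these extensions; a minimal sketch (the helper name here is hypothetical, not the pipeline's actual API):

from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}

# Hypothetical helper for illustration
def find_audio_files(input_dir: str) -> list[Path]:
    """Return all supported audio files in input_dir, sorted by name."""
    return sorted(
        path for path in Path(input_dir).iterdir()
        if path.suffix.lower() in SUPPORTED_EXTENSIONS
    )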

Supported Languages

  • English (en) - Primary support
  • Hindi (hi) - Full support
  • Kannada (kn) - Full support

📄 Output Format

For each input file (e.g., example.mp3), the pipeline generates:

1. Text File (example.txt)

Filename: example.mp3
Language: en
Processed: 2024-01-15 14:30:25
==================================================

TRANSCRIPTION:
--------------------
[Full transcribed text from the audio]

SUMMARY:
--------------------
[AI-generated concise summary]

==================================================
METADATA:
Model used: gpt-3.5-turbo
Tokens used: 150

2. JSON File (example.json)

{
  "filename": "example.mp3",
  "language": "en",
  "transcription": "[Full transcribed text]",
  "summary": "[AI-generated summary]",
  "model_used": "gpt-3.5-turbo",
  "tokens_used": {
    "prompt_tokens": 120,
    "completion_tokens": 30,
    "total_tokens": 150
  },
  "segments": [
    {
      "start": 0.0,
      "end": 5.0,
      "text": "Hello, this is a test recording."
    }
  ]
}
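
The JSON variant is the easier one to post-process; for example, reading a result back in Python:

import json
from pathlib import Path

# Load one pipeline result and print its summary and token usage
result = json.loads(Path("output/example.json").read_text(encoding="utf-8"))
print(result["summary"])
print(f"Total tokens: {result['tokens_used']['total_tokens']}")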

βš™οΈ Configuration Options

Environment Variables (.env file)

# Required
OPENAI_API_KEY=your_openai_api_key_here

# Optional
DEFAULT_GPT_MODEL=gpt-3.5-turbo      # GPT model for summarization
WHISPER_MODEL=base                   # Whisper model for transcription
INPUT_DIR=input_mp3s                 # Input directory path
OUTPUT_DIR=output                    # Output directory path
LOG_LEVEL=INFO                       # Logging level
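
config.py is responsible for loading these values; a minimal sketch of the usual pattern with the python-dotenv package (the actual implementation may differ):

import os
from dotenv import load_dotenv

# Pull variables from the .env file into the process environment
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")        # required
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "base")  # optional, defaults to "base"

if not OPENAI_API_KEY:
    raise RuntimeError("OPENAI_API_KEY is not set; see .env.example")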

Available Models

Whisper Models (accuracy vs. speed trade-off):

  • tiny - Fastest, least accurate
  • base - Good balance (default)
  • small - Better accuracy
  • medium - High accuracy
  • large - Best accuracy, slowest

GPT Models (quality vs. cost trade-off):

  • gpt-3.5-turbo - Fast, cost-effective (default)
  • gpt-3.5-turbo-16k - Longer context window
  • gpt-4 - Highest quality, more expensive
  • gpt-4-32k - Highest quality + long context
  • gpt-4-turbo-preview - Latest GPT-4 variant

🔧 System Requirements

Minimum Requirements

  • CPU: 2+ cores
  • RAM: 4GB
  • Storage: 2GB free space
  • Python: 3.11+

Recommended for Better Performance

  • CPU: 4+ cores
  • RAM: 8GB+
  • GPU: NVIDIA GPU with CUDA support (optional, for faster Whisper inference)
  • Storage: SSD with 5GB+ free space

GPU Acceleration (Optional)

To enable GPU acceleration for Whisper:

  1. Install a CUDA-compatible PyTorch build:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
  2. Verify GPU detection:
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
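
With a CUDA build installed, Whisper can be pinned to the GPU explicitly; a minimal sketch using the openai-whisper API:

import torch
import whisper

# Prefer the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)
print(f"Whisper model loaded on: {device}")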

📊 Performance & Costs

Processing Times (approximate)

Audio Length   Whisper Model   Processing Time
1 minute       base            10-30 seconds
5 minutes      base            30-90 seconds
30 minutes     base            3-8 minutes
1 hour         large           8-15 minutes

OpenAI API Costs (approximate)

Model     Cost per 1K tokens   Typical Summary Cost
GPT-3.5   $0.002               $0.01-0.05
GPT-4     $0.06                $0.10-0.50
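
As a rough sanity check: the entire transcription is sent as the prompt, so summarizing a 30-minute recording (roughly 4,000-5,000 spoken words, on the order of 6,000 tokens) with GPT-3.5 costs about 6,000 × $0.002 / 1,000 ≈ $0.012, consistent with the typical range above.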

πŸ” Troubleshooting

Common Issues

1. "OpenAI API key not found" error:

# Check your .env file exists and contains the API key
cat .env
# Make sure there are no extra spaces around the key

2. "FFmpeg not found" error:

# Install FFmpeg
sudo apt install ffmpeg  # Ubuntu/Debian
brew install ffmpeg      # macOS

3. "CUDA out of memory" error:

# Use a smaller Whisper model
python pipeline.py --whisper-model tiny

4. Slow processing:

# Use smaller models or enable GPU acceleration
python pipeline.py --whisper-model base --model gpt-3.5-turbo

Debug Mode

Enable verbose logging for detailed troubleshooting:

python pipeline.py --verbose

Check the log file for detailed error information:

tail -f transcription_pipeline.log

🧪 Testing

Test with Sample Audio

Create a test audio file or use text-to-speech (a gTTS sketch follows the commands below):

# Create test directory structure
mkdir -p input_mp3s output

# Place a test MP3 file in input_mp3s/
# Then run the pipeline
python pipeline.py --input input_mp3s --output output --verbose

# Check results
ls output/
cat output/test_file.txt
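
If no recording is handy, one way to synthesize a short test clip is the gTTS package (pip install gTTS; not a project dependency):

from gtts import gTTS

# Synthesize a short English test clip into the input directory
text = "Hello, this is a test recording for the transcription pipeline."
gTTS(text, lang="en").save("input_mp3s/test_file.mp3")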

Validate Installation

# Check system info
python pipeline.py --info

# Should show:
# - Whisper model loaded
# - OpenAI API configured
# - Supported languages and formats

🔒 Privacy & Security

  • Local Processing: Audio transcription happens entirely on your machine
  • API Calls: Only text transcriptions are sent to OpenAI for summarization
  • Data Storage: All outputs are stored locally
  • API Key: Store securely in .env file, never commit to version control

πŸ› οΈ Development

Project Architecture

The pipeline is built with modular components:

  • config.py: Centralized configuration management
  • transcriber.py: Whisper integration and language detection
  • summarizer.py: OpenAI API integration with retry logic
  • pipeline.py: Main orchestration and CLI interface
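
Conceptually, pipeline.py wires these modules together per file: transcribe locally, summarize remotely, write outputs. A simplified sketch (the class names here are illustrative; summarize_text appears below under "Extending the Pipeline", the rest may differ):

from pathlib import Path

# Illustrative names only; see the actual modules for the real API
from transcriber import Transcriber
from summarizer import Summarizer

def run_pipeline(input_dir: str, output_dir: str) -> None:
    """Transcribe each audio file locally, then summarize it via the OpenAI API."""
    transcriber = Transcriber()
    summarizer = Summarizer()
    for audio_file in sorted(Path(input_dir).glob("*.mp3")):
        result = transcriber.transcribe(audio_file)           # local Whisper inference
        summary = summarizer.summarize_text(result["text"])   # remote GPT call
        out_txt = Path(output_dir) / f"{audio_file.stem}.txt"
        out_txt.write_text(f"TRANSCRIPTION:\n{result['text']}\n\nSUMMARY:\n{summary}\n")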

Extending the Pipeline

Add new languages:

# In config.py, add to supported_languages
self.supported_languages: List[str] = ['en', 'hi', 'kn', 'es', 'fr']

Custom summarization prompts:

# In summarizer.py, modify system_prompt
custom_prompt = "Create a technical summary focusing on key decisions..."
summary = summarizer.summarize_text(text, custom_prompt=custom_prompt)

πŸ“ License

This project is provided as-is for educational and development purposes. Please ensure compliance with OpenAI's usage policies when using their API.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📞 Support

For issues and questions:

  1. Check the troubleshooting section above
  2. Review the logs in transcription_pipeline.log
  3. Open an issue with detailed error information

Happy transcribing! 🎵➡️📝
