
Bread-Technologies/interactive-podcast-platform


Interactive Podcast Platform

Transform any podcast or YouTube video into an interactive AI conversation experience. This platform processes audio content using advanced AI to create speaker-identified transcripts, enables real-time voice conversations with AI agents about the content, and extracts high-quality speaker audio clips for voice cloning applications.

🚀 Quick Start

Get running in 30 seconds:

# 1. Navigate to frontend directory
cd frontend

# 2. Install all dependencies (frontend + backend)
npm install

# 3. Start everything with one command
npm run dev

Visit: http://localhost:5173

The platform automatically starts both frontend (port 5173) and backend (port 3001) servers concurrently.

📋 Prerequisites

  • Node.js 18+ and npm
  • Python 3.8+
  • ffmpeg (for audio processing)

System Dependencies

# macOS
brew install ffmpeg

# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg

# Windows - Download from https://ffmpeg.org/download.html

Python Dependencies

Essential for both the main platform and the speaker extraction tool:

pip install -r requirements.txt

Includes: google-generativeai, librosa, numpy, python-dotenv, and other AI processing dependencies.

Environment Setup (Optional)

# Create environment file for custom API keys
cp frontend/.env.example frontend/.env

# Edit frontend/.env with your keys:
# GEMINI_API_KEY=your_gemini_api_key_here
# OPENAI_API_KEY=your_openai_api_key_here

🏗️ Architecture Overview

Integrated Single-Command Architecture

The platform uses a unified development experience where both frontend and backend run from a single npm run dev command:

  • Frontend (React + Vite): Port 5173 - User interface, player, voice chat
  • Backend (Express + WebSocket): Port 3001 - API endpoints, processing orchestration
  • Python Processing: AI transcription pipeline with chunked processing
  • Real-time Updates: WebSocket-based progress monitoring

Key Components

Frontend (React + Vite)

  • Port: 5173 (development)
  • Entry: frontend/src/main.jsx
  • Routing: React Router DOM with browser navigation support
  • Navigation Structure:
    • / - Homepage
    • /library - Podcast library
    • /search - Search/add content
    • /confirm?podcastData=... - Episode confirmation
    • /processing/{sessionId} - Real-time processing progress
    • /player/{podcastId} - Interactive player with AI chat

Backend (Express.js + WebSocket)

  • Port: 3001 (proxied through frontend)
  • Entry: frontend/server.js
  • Real-time: WebSocket support for processing updates
  • API Pattern: All endpoints prefixed with /api/

Python Processing Pipeline

  • Main Script: scripts/podcast2jsonl.py - Audio transcription using Gemini API with speaker diarization
  • Speaker Extraction Tool: scripts/extract_speaker_audio.py - Standalone 30-second speaker audio clip extraction
  • Output: JSONL transcripts with speaker identification and timestamps, high-quality WAV speaker clips

🎯 Core Features

Universal Content Processing

  • Podcast Search: Discover any podcast via Podcast Index API
  • YouTube Support: Direct URL processing for any YouTube video
  • Audio Normalization: Automatic conversion to mono 16kHz for optimal AI processing
  • Chunked Processing: Handles long-form content (20-minute chunks with 30-second overlap)
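The normalization step above can be sketched as follows. This is a minimal illustration of the mono 16kHz conversion, not the pipeline's actual code; `build_normalize_cmd` is a hypothetical helper, though the ffmpeg flags (`-ac 1`, `-ar 16000`) are the real ones for that target format:

```python
import subprocess  # a real caller would run the command with subprocess.run

def build_normalize_cmd(src: str, dst: str) -> list[str]:
    """Return an ffmpeg argv that downmixes to mono and resamples to 16 kHz."""
    return [
        "ffmpeg", "-y",   # overwrite output without prompting
        "-i", src,        # input file or URL
        "-ac", "1",       # downmix to a single channel (mono)
        "-ar", "16000",   # resample to 16 kHz for the AI pipeline
        dst,
    ]

cmd = build_normalize_cmd("episode.mp3", "episode_16k.wav")
```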

AI-Powered Transcription

  • Speaker Identification: Google Gemini API identifies and labels speakers
  • Verbatim Transcription: Preserves stutters, emphasis, interruptions, and vocal nuances
  • Parallel Processing: Bio generation (OpenAI o3) runs concurrently with transcription
  • Smart Chunking: Automatic audio splitting for optimal processing efficiency

Interactive Voice Experience

  • Real-time Voice Chat: ElevenLabs voice agents for natural AI conversations
  • Context-Aware: AI uses last N minutes of transcript for relevant responses
  • Dynamic System Prompts: Generated from episode content and speaker biographies
  • Persona Embodiment: AI speaks as the primary speaker from the episode

Rich User Interface

  • Synchronized Transcript: Highlights current audio position in real-time
  • Click-to-Seek: Instant navigation by clicking transcript segments
  • Browser Navigation: Full back/forward/refresh/bookmark support
  • URL State Management: Shareable links to any application state
  • Real-time Progress: WebSocket updates during processing

Speaker Audio Extraction

  • High-Quality Clips: 30-second WAV files optimized for voice cloning applications
  • AI-Powered Selection: Gemini 2.5 Pro identifies optimal speaking regions for each speaker
  • Quality Scoring: Multi-criteria analysis (energy, voice activity, spectral quality)
  • Professional Processing: 44.1kHz mono output with fade in/out and artifact removal
  • Flexible Input: Supports YouTube URLs, podcast URLs, and local audio files
  • Comprehensive Metadata: Detailed extraction logs with quality metrics and processing methods
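The multi-criteria quality scoring might look like the sketch below. The criteria names come from the list above, but the weights and the combination formula are assumptions for illustration; the real tool derives these scores from the audio itself (e.g. via librosa):

```python
def clip_quality_score(energy: float, voice_activity: float, spectral: float) -> float:
    """Combine per-criterion scores (each in [0, 1]) into one quality value.

    Weights are illustrative assumptions, not the tool's actual values.
    """
    weights = {"energy": 0.3, "voice_activity": 0.5, "spectral": 0.2}
    score = (weights["energy"] * energy
             + weights["voice_activity"] * voice_activity
             + weights["spectral"] * spectral)
    return round(score, 3)

# A clip with strong voice activity scores well even with modest spectral quality:
best = clip_quality_score(energy=0.8, voice_activity=0.9, spectral=0.7)
```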

🔄 Processing Pipeline

Data Flow Overview

  1. Content Discovery: User searches podcasts or pastes YouTube URLs
  2. Processing Pipeline:
    • Audio download (yt-dlp/HTTP)
    • Audio normalization (ffmpeg to mono 16kHz)
    • Speaker identification (Gemini API)
    • Parallel bio generation (OpenAI o3) and transcription (Gemini API)
    • Summary generation (Gemini API)
    • System prompt generation for AI chat
  3. Interactive Experience: Real-time audio player with synchronized transcript and voice AI chat

Chunked Processing Architecture

The system uses smart chunking for long-form content:

  • Audio split into 20-minute chunks with 30-second overlap
  • Real-time progress updates via WebSocket
  • Streaming chunk completion markers for incremental loading
  • Parallel bio generation (OpenAI o3 + Anthropic Claude) during transcription
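The 20-minute/30-second split above can be sketched like this (a minimal illustration, not the pipeline's actual code — each chunk starts 19.5 minutes after the previous one so consecutive chunks share 30 seconds of audio):

```python
def chunk_bounds(duration_s: float, chunk_s: int = 1200, overlap_s: int = 30):
    """Return (start, end) second offsets: 20-minute chunks with 30 s overlap."""
    step = chunk_s - overlap_s   # each chunk starts 19.5 min after the last
    start = 0.0
    bounds = []
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        start += step
    return bounds

# A 45-minute (2700 s) episode splits into three overlapping chunks:
bounds = chunk_bounds(2700)
```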

Progress Monitoring

Real-Time Updates: The progress bar provides sparse but accurate updates:

  • 5% - Initialization: Python script starts
  • 5-20%: Audio download, normalization, speaker identification
  • 20-75%: AI transcription (longest phase, chunked processing)
  • 75-85%: Transcript post-processing and merging
  • 85-90%: Episode summary generation
  • 90-95%: System prompt creation for AI chat
  • 100%: Complete with interactive player ready

WebSocket Markers:

  • @@CHUNK_READY@@ - Incremental transcript loading
  • @@PROCESS_COMPLETE@@ - Final completion signal
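A consumer of the subprocess output might dispatch on these markers as sketched below; the event names are illustrative, but the marker strings are the ones the pipeline emits:

```python
CHUNK_READY = "@@CHUNK_READY@@"
PROCESS_COMPLETE = "@@PROCESS_COMPLETE@@"

def classify_line(line: str) -> str:
    """Map one line of subprocess output to an event type."""
    if CHUNK_READY in line:
        return "chunk"       # load the newly finished transcript chunk
    if PROCESS_COMPLETE in line:
        return "complete"    # processing finished; player is ready
    return "log"             # ordinary progress/log output

events = [classify_line(l) for l in [
    "transcribing chunk 2...",
    "@@CHUNK_READY@@ chunk_2",
    "@@PROCESS_COMPLETE@@",
]]
```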

🔧 API Integration

External APIs

  • Gemini API: Audio transcription, speaker identification, content summarization
  • OpenAI o3: Speaker biography generation with web search capabilities
  • Anthropic Claude: Bio content formatting and cleaning
  • ElevenLabs: Voice AI agents for conversational experience
  • Podcast Index API: Podcast search and metadata (credentials included)

Backend REST API

  • POST /api/search-podcasts - Search Podcast Index API
  • POST /api/get-episodes - Get episodes for specific podcast
  • POST /api/process-podcast - Initiate processing pipeline
  • GET /api/status/:sessionId - Processing status
  • GET /api/generated-podcasts - List processed content
  • GET /api/podcast/:id - Get specific podcast data
  • WebSocket /ws - Real-time progress updates
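As an illustration of consuming the status endpoint, here is a network-free polling sketch. The response shape (`{"progress": ...}`) is an assumption; `fetch_status` is injected so the example runs without a server — a real client would issue an HTTP GET, or subscribe to `/ws` instead of polling:

```python
def poll_until_done(session_id: str, fetch_status, max_polls: int = 100):
    """Poll GET /api/status/:sessionId until the session reports completion."""
    for _ in range(max_polls):
        status = fetch_status(f"/api/status/{session_id}")
        if status.get("progress", 0) >= 100:
            return status
    raise TimeoutError(f"session {session_id} did not finish")

# Simulated responses standing in for the backend:
responses = iter([{"progress": 40}, {"progress": 85}, {"progress": 100}])
result = poll_until_done("abc123", lambda _url: next(responses))
```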

🎙️ Standalone Speaker Audio Extraction

Quick Usage

Extract high-quality 30-second speaker clips from any audio source:

# YouTube video
python scripts/extract_speaker_audio.py "https://www.youtube.com/watch?v=VIDEO_ID" --output-dir ./clips

# Podcast episode
python scripts/extract_speaker_audio.py "https://media.example.com/podcast.mp3" --output-dir ./clips

# Local audio file
python scripts/extract_speaker_audio.py "/path/to/audio.mp3" --output-dir ./clips

API Key Setup

# Set Gemini API key for speaker identification
export GEMINI_API_KEY="your-api-key-here"

# Or pass directly via command line
python scripts/extract_speaker_audio.py "SOURCE" --api-key "your-key" --output-dir ./clips

Output Structure

./clips/
├── Episode_Title_Speaker_Name_30s.wav      # High-quality WAV files (44.1kHz mono)
└── Episode_Title_extraction_metadata.json  # Comprehensive processing metadata

Use Cases

  • Voice Cloning: High-quality speaker samples for AI voice synthesis
  • Speaker Verification: Audio samples for identity confirmation systems
  • Podcast Highlights: Key speaker segments for promotional content
  • Research Applications: Clean speaker samples for audio analysis

📊 Monitoring & Logging

Structured Logging System

The platform uses Winston with daily log rotation for comprehensive monitoring:

// Component-specific loggers
const apiLogger = createLogger('api');
const wsLogger = createLogger('websocket');
const processingLogger = createLogger('processing');

Log Categories:

  • API Logs: logs/app-YYYY-MM-DD.log - Request/response tracking with correlation IDs
  • Error Logs: logs/error-YYYY-MM-DD.log - All errors with stack traces (30-day retention)
  • Processing Logs: logs/processing-YYYY-MM-DD.log - Detailed subprocess monitoring (7-day retention)
  • Exception Logs: logs/exceptions-YYYY-MM-DD.log - Uncaught exceptions and rejections

Structured Data:

  • Correlation IDs: Every request gets a UUID for tracing
  • Session Tracking: Processing sessions monitored throughout lifecycle
  • Subprocess Monitoring: Python script output, duration, and resource usage
  • WebSocket Events: Connection lifecycle and message flow

Real-Time Monitoring

WebSocket Implementation:

  • Connection Tracking: Each WebSocket connection gets unique ID
  • Session Subscription: Clients subscribe to specific session updates
  • Progress Broadcasting: Real-time updates sent to subscribed clients
  • Error Propagation: Processing errors immediately broadcast to UI

Session Management:

// Active sessions with complete lifecycle tracking
const activeSessions = new Map();
// Includes: status, progress, timing, errors, results

🛠️ Development Patterns

Error Handling & Recovery

Structured Error Management:

  • Retry Strategies: Exponential backoff for API calls
  • Health Monitoring: API health checks before processing
  • Graceful Degradation: Cache fallbacks for processed content
  • Session Recovery: Processing state preserved across server restarts
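The exponential-backoff retry strategy can be sketched as below; the delays and attempt count are illustrative, not the server's actual values:

```python
import time

def with_retries(call, attempts: int = 4, base_delay: float = 0.5):
    """Run `call`, retrying on failure with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise                              # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...

# Example: a flaky call that succeeds on the third try.
tries = {"n": 0}
def flaky():
    tries["n"] += 1
    if tries["n"] < 3:
        raise ConnectionError("transient API failure")
    return "ok"

result = with_retries(flaky, base_delay=0.01)
```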

Error Boundaries:

  • Component-Level: React error boundaries for UI resilience
  • API-Level: Comprehensive error responses with correlation IDs
  • Processing-Level: Python subprocess error capture and reporting

State Management

  • URL-Based State: Critical application state stored in URL parameters
  • Session Storage: Complex objects backed up in sessionStorage
  • WebSocket State: Real-time processing updates
  • React Router: Browser navigation with history support

Performance Optimization

Chunked Processing:

  • 20-minute chunks to avoid API timeouts
  • Parallel operations (bio generation + transcription)
  • Resource limits (speaker bio generation limited to first 4 speakers)
  • Timeout handling with comprehensive fallback strategies

Caching Strategy:

  • Processed Content: Automatic cache detection for re-processing prevention
  • YouTube URLs: URL-based cache lookup to avoid duplicate processing
  • Metadata Persistence: JSONL format for efficient transcript storage
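The URL-based cache lookup might work as sketched below. Note the key scheme is a hypothetical hash; the actual files are named after the episode title (`{Episode_Title}_transcript.jsonl`), so this is an illustration of the idea, not the server's code:

```python
import hashlib
import pathlib

def cache_key(url: str) -> str:
    """Stable, filesystem-safe key derived from a source URL (assumed scheme)."""
    return hashlib.sha256(url.encode()).hexdigest()[:16]

def is_cached(url: str, out_dir: str = "generated_podcasts") -> bool:
    """True if a transcript for this URL has already been generated."""
    return (pathlib.Path(out_dir) / f"{cache_key(url)}_transcript.jsonl").exists()

key = cache_key("https://www.youtube.com/watch?v=VIDEO_ID")
```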

🎙️ Voice AI Integration

ElevenLabs Voice Agents

  • Context Extraction: Last N minutes of transcript sent to AI agents
  • System Prompts: Dynamically generated based on episode content and speaker biographies
  • Persona Embodiment: AI speaks as the primary speaker from the episode

Configuration (config.yml):

elevenlabs:
  agent_id: "agent_01k0devtnafyhb7cg4ztv3gpa8"
  api_key: ""
  default_context_min: 3

Integration Points:

  • VoiceCallModal.jsx: ElevenLabs voice agent integration
  • useElonVoiceAgent.js: React hook for voice functionality
  • contextExtractor.js: Extracts relevant transcript context for AI
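The "last N minutes" extraction performed by contextExtractor.js can be sketched in Python for illustration. The segment shape (start offset in seconds, speaker, text) mirrors the JSONL transcript entries, and the 3-minute default matches `default_context_min` in config.yml; the function itself is hypothetical:

```python
def recent_context(segments, now_s: float, window_min: int = 3) -> str:
    """Return transcript text from the last `window_min` minutes of playback."""
    cutoff = now_s - window_min * 60
    recent = [s for s in segments if cutoff <= s["start"] <= now_s]
    return "\n".join(f'{s["speaker"]}: {s["text"]}' for s in recent)

segments = [
    {"start": 0,   "speaker": "Host",  "text": "Welcome back."},
    {"start": 200, "speaker": "Guest", "text": "Thanks for having me."},
    {"start": 290, "speaker": "Host",  "text": "Tell us about the launch."},
]
# At the 5-minute mark, only segments from the last 3 minutes are included:
ctx = recent_context(segments, now_s=300, window_min=3)
```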

📁 Project Structure

cast.dread.technology/
├── frontend/                    # Integrated React frontend + Express backend
│   ├── src/
│   │   ├── components/          # React components
│   │   │   ├── PodcastSearch.jsx       # Content discovery interface
│   │   │   ├── ProcessingProgress.jsx  # Real-time progress with WebSocket
│   │   │   ├── Player.jsx              # Audio player with transcript sync
│   │   │   ├── Transcript.jsx          # Interactive transcript viewer
│   │   │   └── VoiceCallModal.jsx      # ElevenLabs voice agent integration
│   │   ├── hooks/               # Custom React hooks
│   │   │   └── useElonVoiceAgent.js    # Voice agent integration hook
│   │   ├── utils/               # Utility functions
│   │   │   ├── configLoader.js         # Configuration loading
│   │   │   ├── contextExtractor.js     # Transcript context extraction
│   │   │   └── time.js                 # Time utilities
│   │   └── main.jsx             # React entry point
│   ├── server.js                # Integrated Express.js backend server
│   ├── logger.js                # Winston logging configuration
│   ├── package.json             # All dependencies (frontend + backend)
│   └── vite.config.js           # Vite configuration with proxy
├── scripts/                     # Python processing pipeline
│   ├── podcast2jsonl.py         # Main AI transcription script
│   ├── extract_speaker_audio.py # Standalone speaker audio extraction tool
│   ├── SPEAKER_EXTRACTION_README.md # Speaker extraction documentation
│   ├── api_health_monitor.py    # API health monitoring utilities
│   ├── logger_config.py         # Structured logging configuration
│   ├── subprocess_utils.py      # Enhanced subprocess monitoring
│   └── txt_to_jsonl.py          # Text-to-JSONL converter
├── generated_podcasts/          # Generated content storage
│   ├── {Episode_Title}_transcript.jsonl
│   ├── {Episode_Title}_metadata.json
│   └── {Episode_Title}_summary.txt
├── speaker_clips/               # Speaker audio extraction output (configurable)
│   ├── {Episode_Title}_{Speaker_Name}_30s.wav
│   └── {Episode_Title}_extraction_metadata.json
├── logs/                        # Structured logging output
│   ├── app-YYYY-MM-DD.log       # General application logs
│   ├── error-YYYY-MM-DD.log     # Error logs (30-day retention)
│   ├── processing-YYYY-MM-DD.log # Processing logs (7-day retention)
│   └── exceptions-YYYY-MM-DD.log # Exception logs
├── config.yml                   # Application configuration
├── requirements.txt             # Python dependencies
└── test-setup.sh               # Comprehensive setup validation

🔍 User Journey & Technical Flow

1. Content Discovery

User Action: Search podcasts or paste YouTube URLs
Technical Process:

  • Frontend calls POST /api/search-podcasts
  • Backend queries Podcast Index API with authentication headers
  • Results cached and returned to user

2. Processing Initiation

User Action: Select episode and confirm processing
Technical Process:

  • Check for existing processed content (cache lookup)
  • If cached: Return immediately with processed data
  • If new: Initiate Python subprocess with session tracking

3. Real-Time Processing

User Action: Monitor progress via WebSocket connection
Technical Process:

  • 5%: Python script initialization
  • 5-20%: Audio download, normalization, speaker identification
  • 20-75%: Chunked AI transcription (longest phase)
  • 75-100%: Post-processing, summary generation, system prompt creation

Progress Updates: Sparse but accurate (not artificial increments)

  • WebSocket broadcasts real Python script output
  • @@CHUNK_READY@@ markers enable incremental transcript loading
  • @@PROCESS_COMPLETE@@ signals final completion

4. Interactive Experience

User Action: Listen and chat with AI about content
Technical Process:

  • Audio player syncs with transcript highlighting
  • Voice chat extracts recent transcript context
  • ElevenLabs voice agent responds as episode speaker
  • System prompt provides context-aware AI personality

⚙️ Configuration

Environment Variables

# Frontend (.env) - Optional API key overrides
GEMINI_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
PORT=3001
NODE_ENV=development
LOG_LEVEL=info

# Note: API keys are also hardcoded in server.js for development
# For production, use environment variables instead

Application Configuration (config.yml)

# Legacy JRE episode configuration
audio_path: "data/jre_elon_2281/joe_rogan_experience_2281_elon_musk_m4a_128k.mp3"
transcript_path: "data/jre_elon_2281/transcript_2281.jsonl"

# AI Integration
obelisk:
  api_url: "https://obelisk.dread.technology/api"
  api_key: "sk-..."
  model_id: "gpt-4.1-base"

# Voice Agent Configuration
elevenlabs:
  agent_id: "agent_01k0devtnafyhb7cg4ztv3gpa8"
  api_key: ""
  default_context_min: 3

🧪 Testing & Validation

Setup Validation Script

./test-setup.sh

Validates:

  • System dependencies (Node.js 18+, Python 3.8+, ffmpeg)
  • Package installations (npm and pip)
  • API configurations and connectivity
  • Directory structure and permissions
  • Environment variable setup

Output: Color-coded status with specific remediation steps for any issues found.

Development Commands

# Start development (both frontend & backend)
cd frontend && npm run dev

# Build for production
cd frontend && npm run build

# Preview production build
cd frontend && npm run preview

# Build and serve production
cd frontend && npm run build-and-serve

🎯 Key Features Deep Dive

Processing Pipeline Details

Transcript Format:

  • Input: Speaker-diarized text with timestamps [HH:MM:SS] Speaker: Text
  • Output: JSONL format with structured speaker/timestamp/text entries
  • Features: Preserves stutters, emphasis, interruptions, and vocal nuances
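Both formats above are easy to handle; the sketch below parses the `[HH:MM:SS]` timestamp prefix and one JSONL entry. The field names (speaker/timestamp/text) follow the description above, but the exact schema of the real files may differ:

```python
import json

def parse_timestamp(ts: str) -> int:
    """'[HH:MM:SS]' -> seconds from episode start."""
    h, m, s = ts.strip("[]").split(":")
    return int(h) * 3600 + int(m) * 60 + int(s)

# One illustrative JSONL transcript entry:
line = '{"speaker": "Elon Musk", "timestamp": "[01:02:03]", "text": "So, uh, yeah."}'
entry = json.loads(line)
offset = parse_timestamp(entry["timestamp"])
```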

Performance Considerations:

  • Chunked Processing: 20-minute chunks to avoid API timeouts
  • Parallel Operations: Bio generation runs concurrently with transcription
  • Resource Limits: Speaker bio generation limited to first 4 speakers
  • Timeout Handling: Comprehensive timeout strategies for long-running operations

Data Storage

Generated Content (generated_podcasts/ directory):

  • Naming Pattern: {Episode_Title}_transcript.jsonl, {Episode_Title}_metadata.json
  • System Prompts: Generated for each episode to enable contextual AI chat
  • Cache Strategy: Automatic detection prevents duplicate processing

Browser Navigation

Full Browser Support:

  • Back/Forward: Natural browser navigation
  • Refresh: State preservation across reloads
  • Bookmarks: Direct URL access to any page state
  • Deep Linking: Shareable URLs for specific content
  • URL Structure: RESTful paths with embedded state data

🔧 Development

File Management

  • Generated podcasts stored in generated_podcasts/ directory
  • Metadata files contain episode information and processing results
  • Transcript files in JSONL format for easy parsing
  • Summary files contain AI-generated episode summaries
  • System prompts generated per episode for contextual AI chat

🚨 Troubleshooting

Common Issues

Application Not Starting

  • Ensure you're in the frontend directory: cd frontend && npm run dev
  • Check for port conflicts on ports 3001 or 5173
  • Verify Node.js 18+ is installed: node --version

Python Dependencies Missing

  • Install requirements: pip install -r requirements.txt
  • Verify Python 3.8+ is installed: python3 --version
  • Check ffmpeg installation: ffmpeg -version

Audio Processing Fails

  • Ensure ffmpeg is installed and in PATH
  • Check episode URL accessibility
  • Verify API keys are configured correctly
  • Check processing logs: logs/processing-YYYY-MM-DD.log

Voice Chat Not Working

  • Check microphone permissions in browser
  • Ensure ElevenLabs agent ID is correct in config.yml
  • Verify internet connection for ElevenLabs API
  • Check browser console for WebRTC errors

Processing Stuck or Slow

  • Monitor logs for Python subprocess output
  • Check available disk space in generated_podcasts/
  • Verify API rate limits not exceeded
  • Long episodes (>2 hours) may take 30+ minutes

Debug Mode

Set environment variables for detailed logging:

# Enable debug logging
export NODE_ENV=development
export LOG_LEVEL=debug

# Start with verbose output
cd frontend && npm run dev

Log Analysis

Check processing progress:

# Monitor real-time processing
tail -f logs/processing-$(date +%Y-%m-%d).log

# Check for errors
grep "ERROR" logs/error-$(date +%Y-%m-%d).log

# API request tracking
grep "api_request" logs/app-$(date +%Y-%m-%d).log

🤝 Credits & Attribution

  • Original Architecture: Aman (cast.dread.technology)
  • AI Processing Integration: Neel Sardana (Bread Technologies)
  • Voice Agent Module: Cameron
  • Transcript Processing: Google Gemini API
  • Voice Agents: ElevenLabs
  • Podcast Data: Podcast Index API
  • Bio Generation: OpenAI o3 + Anthropic Claude

📄 License

MIT License - See LICENSE file for details.

For questions or contributions, please open an issue or contact the maintainers.
