Transform any podcast or YouTube video into an interactive AI conversation experience. This platform processes audio content using advanced AI to create speaker-identified transcripts, enables real-time voice conversations with AI agents about the content, and extracts high-quality speaker audio clips for voice cloning applications.
Get running in 30 seconds:
# 1. Navigate to frontend directory
cd frontend
# 2. Install all dependencies (frontend + backend)
npm install
# 3. Start everything with one command
npm run dev

Visit: http://localhost:5173
The platform automatically starts both frontend (port 5173) and backend (port 3001) servers concurrently.
- Node.js 18+ and npm
- Python 3.8+
- ffmpeg (for audio processing)
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# Windows - Download from https://ffmpeg.org/download.html

Essential for both the main platform and the speaker extraction tool:
pip install -r requirements.txt

Includes: google-generativeai, librosa, numpy, python-dotenv, and other AI processing dependencies.
# Create environment file for custom API keys
cp frontend/.env.example frontend/.env
# Edit frontend/.env with your keys:
# GEMINI_API_KEY=your_gemini_api_key_here
# OPENAI_API_KEY=your_openai_api_key_here

The platform uses a unified development experience where both frontend and backend run from a single npm run dev command:
- Frontend (React + Vite): Port 5173 - User interface, player, voice chat
- Backend (Express + WebSocket): Port 3001 - API endpoints, processing orchestration
- Python Processing: AI transcription pipeline with chunked processing
- Real-time Updates: WebSocket-based progress monitoring
Frontend (React + Vite)
- Port: 5173 (development)
- Entry: frontend/src/main.jsx
- Routing: React Router DOM with browser navigation support
- Navigation Structure:
/ - Homepage
/library - Podcast library
/search - Search/add content
/confirm?podcastData=... - Episode confirmation
/processing/{sessionId} - Real-time processing progress
/player/{podcastId} - Interactive player with AI chat
Backend (Express.js + WebSocket)
- Port: 3001 (proxied through frontend)
- Entry: frontend/server.js
- Real-time: WebSocket support for processing updates
- API Pattern: All endpoints prefixed with /api/
Python Processing Pipeline
- Main Script: scripts/podcast2jsonl.py - Audio transcription using the Gemini API with speaker diarization
- Speaker Extraction Tool: scripts/extract_speaker_audio.py - Standalone 30-second speaker audio clip extraction
- Output: JSONL transcripts with speaker identification and timestamps, plus high-quality WAV speaker clips
- Podcast Search: Discover any podcast via Podcast Index API
- YouTube Support: Direct URL processing for any YouTube video
- Audio Normalization: Automatic conversion to mono 16kHz for optimal AI processing
- Chunked Processing: Handles long-form content (20-minute chunks with 30-second overlap)
- Speaker Identification: Google Gemini API identifies and labels speakers
- Verbatim Transcription: Preserves stutters, emphasis, interruptions, and vocal nuances
- Parallel Processing: Bio generation (OpenAI o3) runs concurrently with transcription
- Smart Chunking: Automatic audio splitting for optimal processing efficiency
- Real-time Voice Chat: ElevenLabs voice agents for natural AI conversations
- Context-Aware: AI uses last N minutes of transcript for relevant responses
- Dynamic System Prompts: Generated from episode content and speaker biographies
- Persona Embodiment: AI speaks as the primary speaker from the episode
- Synchronized Transcript: Highlights current audio position in real-time
- Click-to-Seek: Instant navigation by clicking transcript segments
- Browser Navigation: Full back/forward/refresh/bookmark support
- URL State Management: Shareable links to any application state
- Real-time Progress: WebSocket updates during processing
- High-Quality Clips: 30-second WAV files optimized for voice cloning applications
- AI-Powered Selection: Gemini 2.5 Pro identifies optimal speaking regions for each speaker
- Quality Scoring: Multi-criteria analysis (energy, voice activity, spectral quality)
- Professional Processing: 44.1kHz mono output with fade in/out and artifact removal
- Flexible Input: Supports YouTube URLs, podcast URLs, and local audio files
- Comprehensive Metadata: Detailed extraction logs with quality metrics and processing methods
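As a toy illustration of one scoring criterion from the list above, here is RMS energy computed over a candidate window of samples. The real tool combines energy with voice-activity and spectral measures (via librosa); this stdlib-only version is a deliberate simplification, and the function name is illustrative rather than taken from the codebase.

```python
import math

def rms_energy(samples):
    """Root-mean-square energy of a window of PCM samples scaled to [-1, 1].

    Higher values indicate louder, more consistently voiced audio --
    one signal (of several) that a window is a good speaker clip.
    """
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))
```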
- Content Discovery: User searches podcasts or pastes YouTube URLs
- Processing Pipeline:
- Audio download (yt-dlp/HTTP)
- Audio normalization (ffmpeg to mono 16kHz)
- Speaker identification (Gemini API)
- Parallel bio generation (OpenAI o3) and transcription (Gemini API)
- Summary generation (Gemini API)
- System prompt generation for AI chat
- Interactive Experience: Real-time audio player with synchronized transcript and voice AI chat
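The parallel step in the pipeline above (bio generation alongside transcription) can be sketched with stdlib concurrency. The worker callables here are placeholders for the actual OpenAI and Gemini calls; the helper name is an assumption, not the platform's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(transcribe, generate_bios):
    """Run transcription and bio generation concurrently; return both results.

    Both tasks are I/O-bound API calls, so threads overlap their wait time.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        t = pool.submit(transcribe)
        b = pool.submit(generate_bios)
        return t.result(), b.result()
```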
The system uses smart chunking for long-form content:
- Audio split into 20-minute chunks with 30-second overlap
- Real-time progress updates via WebSocket
- Streaming chunk completion markers for incremental loading
- Parallel bio generation (OpenAI o3 + Anthropic Claude) during transcription
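The chunk-boundary math behind the scheme above (20-minute chunks, each starting 30 seconds before the previous one ended) can be sketched as follows. The function name and signature are illustrative, not taken from podcast2jsonl.py.

```python
def chunk_boundaries(duration_s, chunk_s=20 * 60, overlap_s=30):
    """Return (start, end) second offsets covering duration_s seconds of audio."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        # Next chunk begins 30 s before this one ended, so sentences that
        # straddle a boundary are transcribed twice and can be merged later.
        start = end - overlap_s
    return bounds
```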
Real-Time Updates: The progress bar provides sparse but accurate updates:
- 5% - Initialization: Python script starts
- 5-20%: Audio download, normalization, speaker identification
- 20-75%: AI transcription (longest phase, chunked processing)
- 75-85%: Transcript post-processing and merging
- 85-90%: Episode summary generation
- 90-95%: System prompt creation for AI chat
- 100%: Complete with interactive player ready
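The stage-to-percentage mapping above can be expressed as a simple lookup table. The percentage values come from this README; the stage names themselves are illustrative.

```python
# Upper-bound progress percentage reached at the end of each stage.
STAGE_PROGRESS = {
    "init": 5,
    "audio_prep": 20,       # download, normalization, speaker identification
    "transcription": 75,    # chunked AI transcription (longest phase)
    "post_processing": 85,  # transcript merging
    "summary": 90,
    "system_prompt": 95,
    "complete": 100,
}

def progress_for(stage):
    """Return the progress ceiling for a stage, or 0 for unknown stages."""
    return STAGE_PROGRESS.get(stage, 0)
```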
WebSocket Markers:
- @@CHUNK_READY@@ - Incremental transcript loading
- @@PROCESS_COMPLETE@@ - Final completion signal
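A minimal sketch of how the backend might scan subprocess stdout for these markers before broadcasting over WebSocket. The marker strings are from this README; the classification logic itself is an assumption about the implementation.

```python
CHUNK_READY = "@@CHUNK_READY@@"
PROCESS_COMPLETE = "@@PROCESS_COMPLETE@@"

def classify_line(line):
    """Map one stdout line to an event name, or None for plain log output."""
    if CHUNK_READY in line:
        return "chunk_ready"
    if PROCESS_COMPLETE in line:
        return "process_complete"
    return None
```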
- Gemini API: Audio transcription, speaker identification, content summarization
- OpenAI o3: Speaker biography generation with web search capabilities
- Anthropic Claude: Bio content formatting and cleaning
- ElevenLabs: Voice AI agents for conversational experience
- Podcast Index API: Podcast search and metadata (credentials included)
- POST /api/search-podcasts - Search Podcast Index API
- POST /api/get-episodes - Get episodes for a specific podcast
- POST /api/process-podcast - Initiate processing pipeline
- GET /api/status/:sessionId - Processing status
- GET /api/generated-podcasts - List processed content
- GET /api/podcast/:id - Get specific podcast data
- WebSocket /ws - Real-time progress updates
Extract high-quality 30-second speaker clips from any audio source:
# YouTube video
python scripts/extract_speaker_audio.py "https://www.youtube.com/watch?v=VIDEO_ID" --output-dir ./clips
# Podcast episode
python scripts/extract_speaker_audio.py "https://media.example.com/podcast.mp3" --output-dir ./clips
# Local audio file
python scripts/extract_speaker_audio.py "/path/to/audio.mp3" --output-dir ./clips

# Set Gemini API key for speaker identification
export GEMINI_API_KEY="your-api-key-here"
# Or pass directly via command line
python scripts/extract_speaker_audio.py "SOURCE" --api-key "your-key" --output-dir ./clips

./clips/
├── Episode_Title_Speaker_Name_30s.wav # High-quality WAV files (44.1kHz mono)
└── Episode_Title_extraction_metadata.json # Comprehensive processing metadata
- Voice Cloning: High-quality speaker samples for AI voice synthesis
- Speaker Verification: Audio samples for identity confirmation systems
- Podcast Highlights: Key speaker segments for promotional content
- Research Applications: Clean speaker samples for audio analysis
The platform uses Winston with daily log rotation for comprehensive monitoring:
// Component-specific loggers
const apiLogger = createLogger('api');
const wsLogger = createLogger('websocket');
const processingLogger = createLogger('processing');

Log Categories:
- API Logs: logs/app-YYYY-MM-DD.log - Request/response tracking with correlation IDs
- Error Logs: logs/error-YYYY-MM-DD.log - All errors with stack traces (30-day retention)
- Processing Logs: logs/processing-YYYY-MM-DD.log - Detailed subprocess monitoring (7-day retention)
- Exception Logs: logs/exceptions-YYYY-MM-DD.log - Uncaught exceptions and rejections
Structured Data:
- Correlation IDs: Every request gets a UUID for tracing
- Session Tracking: Processing sessions monitored throughout lifecycle
- Subprocess Monitoring: Python script output, duration, and resource usage
- WebSocket Events: Connection lifecycle and message flow
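The correlation-ID idea above (a UUID attached to every log record for a request) translates directly to the Python side of the pipeline. The Winston setup shown earlier is JavaScript; this stdlib sketch shows the same pattern as a hypothetical logger_config-style helper, and the function name is an assumption.

```python
import logging
import uuid

def logger_with_correlation_id(name):
    """Return (logger adapter, correlation_id); the ID rides along on every record."""
    cid = str(uuid.uuid4())
    logger = logging.getLogger(name)
    # LoggerAdapter injects the extra dict into each LogRecord, so a
    # formatter can include %(correlation_id)s for request tracing.
    adapter = logging.LoggerAdapter(logger, {"correlation_id": cid})
    return adapter, cid
```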
WebSocket Implementation:
- Connection Tracking: Each WebSocket connection gets unique ID
- Session Subscription: Clients subscribe to specific session updates
- Progress Broadcasting: Real-time updates sent to subscribed clients
- Error Propagation: Processing errors immediately broadcast to UI
Session Management:
// Active sessions with complete lifecycle tracking
const activeSessions = new Map();
// Includes: status, progress, timing, errors, results

Structured Error Management:
- Retry Strategies: Exponential backoff for API calls
- Health Monitoring: API health checks before processing
- Graceful Degradation: Cache fallbacks for processed content
- Session Recovery: Processing state preserved across server restarts
Error Boundaries:
- Component-Level: React error boundaries for UI resilience
- API-Level: Comprehensive error responses with correlation IDs
- Processing-Level: Python subprocess error capture and reporting
- URL-Based State: Critical application state stored in URL parameters
- Session Storage: Complex objects backed up in sessionStorage
- WebSocket State: Real-time processing updates
- React Router: Browser navigation with history support
Chunked Processing:
- 20-minute chunks to avoid API timeouts
- Parallel operations (bio generation + transcription)
- Resource limits (speaker bio generation limited to first 4 speakers)
- Timeout handling with comprehensive fallback strategies
Caching Strategy:
- Processed Content: Automatic cache detection for re-processing prevention
- YouTube URLs: URL-based cache lookup to avoid duplicate processing
- Metadata Persistence: JSONL format for efficient transcript storage
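A hypothetical sketch of the URL-based cache lookup described above: hash the source URL to a stable key and check the output directory for an existing transcript. The on-disk naming scheme here is an assumption for illustration, not the platform's actual one.

```python
import hashlib
from pathlib import Path

def cached_transcript(url, out_dir="generated_podcasts"):
    """Return the path of a previously generated transcript, or None."""
    # A short stable hash of the URL avoids filesystem-unsafe characters.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    path = Path(out_dir) / f"{key}_transcript.jsonl"
    return path if path.exists() else None
```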
- Context Extraction: Last N minutes of transcript sent to AI agents
- System Prompts: Dynamically generated from episode content and speaker biographies
- Persona Embodiment: AI speaks as the primary speaker from the episode
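A hedged sketch of the "last N minutes of transcript" extraction (the idea contextExtractor.js is described as implementing, here in Python). Field names like "start", "speaker", and "text" are assumptions about the JSONL schema.

```python
def recent_context(segments, now_s, window_min=3):
    """Join segments whose start time falls within the last window_min minutes."""
    cutoff = now_s - window_min * 60
    lines = [
        f"{seg['speaker']}: {seg['text']}"
        for seg in segments
        if cutoff <= seg["start"] <= now_s
    ]
    return "\n".join(lines)
```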
Configuration (config.yml):
elevenlabs:
agent_id: "agent_01k0devtnafyhb7cg4ztv3gpa8"
api_key: ""
default_context_min: 3

Integration Points:
- VoiceCallModal.jsx: ElevenLabs voice agent integration
- useElonVoiceAgent.js: React hook for voice functionality
- contextExtractor.js: Extracts relevant transcript context for AI
cast.dread.technology/
├── frontend/ # Integrated React frontend + Express backend
│ ├── src/
│ │ ├── components/ # React components
│ │ │ ├── PodcastSearch.jsx # Content discovery interface
│ │ │ ├── ProcessingProgress.jsx # Real-time progress with WebSocket
│ │ │ ├── Player.jsx # Audio player with transcript sync
│ │ │ ├── Transcript.jsx # Interactive transcript viewer
│ │ │ └── VoiceCallModal.jsx # ElevenLabs voice agent integration
│ │ ├── hooks/ # Custom React hooks
│ │ │ └── useElonVoiceAgent.js # Voice agent integration hook
│ │ ├── utils/ # Utility functions
│ │ │ ├── configLoader.js # Configuration loading
│ │ │ ├── contextExtractor.js # Transcript context extraction
│ │ │ └── time.js # Time utilities
│ │ └── main.jsx # React entry point
│ ├── server.js # Integrated Express.js backend server
│ ├── logger.js # Winston logging configuration
│ ├── package.json # All dependencies (frontend + backend)
│ └── vite.config.js # Vite configuration with proxy
├── scripts/ # Python processing pipeline
│ ├── podcast2jsonl.py # Main AI transcription script
│ ├── extract_speaker_audio.py # Standalone speaker audio extraction tool
│ ├── SPEAKER_EXTRACTION_README.md # Speaker extraction documentation
│ ├── api_health_monitor.py # API health monitoring utilities
│ ├── logger_config.py # Structured logging configuration
│ ├── subprocess_utils.py # Enhanced subprocess monitoring
│ └── txt_to_jsonl.py # Text-to-JSONL converter
├── generated_podcasts/ # Generated content storage
│ ├── {Episode_Title}_transcript.jsonl
│ ├── {Episode_Title}_metadata.json
│ └── {Episode_Title}_summary.txt
├── speaker_clips/ # Speaker audio extraction output (configurable)
│ ├── {Episode_Title}_{Speaker_Name}_30s.wav
│ └── {Episode_Title}_extraction_metadata.json
├── logs/ # Structured logging output
│ ├── app-YYYY-MM-DD.log # General application logs
│ ├── error-YYYY-MM-DD.log # Error logs (30-day retention)
│ ├── processing-YYYY-MM-DD.log # Processing logs (7-day retention)
│ └── exceptions-YYYY-MM-DD.log # Exception logs
├── config.yml # Application configuration
├── requirements.txt # Python dependencies
└── test-setup.sh # Comprehensive setup validation
User Action: Search podcasts or paste YouTube URLs
Technical Process:
- Frontend calls POST /api/search-podcasts
- Backend queries Podcast Index API with authentication headers
- Results cached and returned to user
User Action: Select episode and confirm processing
Technical Process:
- Check for existing processed content (cache lookup)
- If cached: Return immediately with processed data
- If new: Initiate Python subprocess with session tracking
User Action: Monitor progress via WebSocket connection
Technical Process:
- 5%: Python script initialization
- 5-20%: Audio download, normalization, speaker identification
- 20-75%: Chunked AI transcription (longest phase)
- 75-100%: Post-processing, summary generation, system prompt creation
Progress Updates: Sparse but accurate (not artificial increments)
- WebSocket broadcasts real Python script output
- @@CHUNK_READY@@ markers enable incremental transcript loading
- @@PROCESS_COMPLETE@@ signals final completion
User Action: Listen and chat with AI about content
Technical Process:
- Audio player syncs with transcript highlighting
- Voice chat extracts recent transcript context
- ElevenLabs voice agent responds as episode speaker
- System prompt provides context-aware AI personality
# Frontend (.env) - Optional API key overrides
GEMINI_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
PORT=3001
NODE_ENV=development
LOG_LEVEL=info
# Note: API keys are also hardcoded in server.js for development
# For production, use environment variables instead

# Legacy JRE episode configuration
audio_path: "data/jre_elon_2281/joe_rogan_experience_2281_elon_musk_m4a_128k.mp3"
transcript_path: "data/jre_elon_2281/transcript_2281.jsonl"
# AI Integration
obelisk:
api_url: "https://obelisk.dread.technology/api"
api_key: "sk-..."
model_id: "gpt-4.1-base"
# Voice Agent Configuration
elevenlabs:
agent_id: "agent_01k0devtnafyhb7cg4ztv3gpa8"
api_key: ""
default_context_min: 3

./test-setup.sh

Validates:
- System dependencies (Node.js 18+, Python 3.8+, ffmpeg)
- Package installations (npm and pip)
- API configurations and connectivity
- Directory structure and permissions
- Environment variable setup
Output: Color-coded status with specific remediation steps for any issues found.
# Start development (both frontend & backend)
cd frontend && npm run dev
# Build for production
cd frontend && npm run build
# Preview production build
cd frontend && npm run preview
# Build and serve production
cd frontend && npm run build-and-serve

Transcript Format:
- Input: Speaker-diarized text with timestamps ([HH:MM:SS] Speaker: Text)
- Output: JSONL format with structured speaker/timestamp/text entries
- Features: Preserves stutters, emphasis, interruptions, and vocal nuances
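Converting one [HH:MM:SS] Speaker: Text line into a structured entry might look like the sketch below (the exact output fields are an assumption; compare scripts/txt_to_jsonl.py for the real converter).

```python
import re

# One timestamped, speaker-attributed transcript line.
LINE_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]\s+([^:]+):\s+(.*)")

def parse_line(line):
    """Parse '[HH:MM:SS] Speaker: Text' into a dict, or None if malformed."""
    m = LINE_RE.match(line)
    if not m:
        return None
    h, mnt, s, speaker, text = m.groups()
    return {
        "start": int(h) * 3600 + int(mnt) * 60 + int(s),  # seconds offset
        "speaker": speaker.strip(),
        "text": text.strip(),
    }
```

Each parsed dict can then be written with json.dumps as one JSONL row.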
Performance Considerations:
- Chunked Processing: 20-minute chunks to avoid API timeouts
- Parallel Operations: Bio generation runs concurrently with transcription
- Resource Limits: Speaker bio generation limited to first 4 speakers
- Timeout Handling: Comprehensive timeout strategies for long-running operations
Generated Content (generated_podcasts/ directory):
- Naming Pattern: {Episode_Title}_transcript.jsonl, {Episode_Title}_metadata.json
- System Prompts: Generated for each episode to enable contextual AI chat
- Cache Strategy: Automatic detection prevents duplicate processing
Full Browser Support:
- Back/Forward: Natural browser navigation
- Refresh: State preservation across reloads
- Bookmarks: Direct URL access to any page state
- Deep Linking: Shareable URLs for specific content
- URL Structure: RESTful paths with embedded state data
- Generated podcasts stored in the generated_podcasts/ directory
- Metadata files contain episode information and processing results
- Transcript files in JSONL format for easy parsing
- Summary files contain AI-generated episode summaries
- System prompts generated per episode for contextual AI chat
Application Not Starting
- Ensure you're in the frontend directory: cd frontend && npm run dev
- Check for port conflicts on ports 3001 or 5173
- Verify Node.js 18+ is installed: node --version
Python Dependencies Missing
- Install requirements: pip install -r requirements.txt
- Verify Python 3.8+ is installed: python3 --version
- Check ffmpeg installation: ffmpeg -version
Audio Processing Fails
- Ensure ffmpeg is installed and in PATH
- Check episode URL accessibility
- Verify API keys are configured correctly
- Check processing logs: logs/processing-YYYY-MM-DD.log
Voice Chat Not Working
- Check microphone permissions in browser
- Ensure the ElevenLabs agent ID is correct in config.yml
- Verify internet connection for the ElevenLabs API
- Check browser console for WebRTC errors
Processing Stuck or Slow
- Monitor logs for Python subprocess output
- Check available disk space in generated_podcasts/
- Verify API rate limits have not been exceeded
- Long episodes (>2 hours) may take 30+ minutes
Set environment variables for detailed logging:
# Enable debug logging
export NODE_ENV=development
export LOG_LEVEL=debug
# Start with verbose output
cd frontend && npm run dev

Check processing progress:
# Monitor real-time processing
tail -f logs/processing-$(date +%Y-%m-%d).log
# Check for errors
grep "ERROR" logs/error-$(date +%Y-%m-%d).log
# API request tracking
grep "api_request" logs/app-$(date +%Y-%m-%d).log

- Original Architecture: Aman (cast.dread.technology)
- AI Processing Integration: Neel Sardana (Bread Technologies)
- Voice Agent Module: Cameron
- Transcript Processing: Google Gemini API
- Voice Agents: ElevenLabs
- Podcast Data: Podcast Index API
- Bio Generation: OpenAI o3 + Anthropic Claude
MIT License - See LICENSE file for details.
For questions or contributions, please open an issue or contact the maintainers.