Transform any podcast or YouTube video into an interactive AI conversation experience. This platform processes audio content using advanced AI to create speaker-identified transcripts, enables real-time voice conversations with AI agents about the content, and extracts high-quality speaker audio clips for voice cloning applications.
Get running in 30 seconds:
# 1. Navigate to frontend directory
cd frontend
# 2. Install all dependencies (frontend + backend)
npm install
# 3. Start everything with one command
npm run dev

Visit: http://localhost:5173
The platform automatically starts both frontend (port 5173) and backend (port 3001) servers concurrently.
- Node.js 18+ and npm
- Python 3.8+
- ffmpeg (for audio processing)
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# Windows - Download from https://ffmpeg.org/download.html

Essential for both the main platform and the speaker extraction tool:
pip install -r requirements.txt

Includes: google-generativeai, librosa, numpy, python-dotenv, and other AI processing dependencies.
# Create environment file for custom API keys
cp frontend/.env.example frontend/.env
# Edit frontend/.env with your keys:
# GEMINI_API_KEY=your_gemini_api_key_here
# OPENAI_API_KEY=your_openai_api_key_here

The platform uses a unified development experience where both frontend and backend run from a single npm run dev command:
- Frontend (React + Vite): Port 5173 - User interface, player, voice chat
- Backend (Express + WebSocket): Port 3001 - API endpoints, processing orchestration
- Python Processing: AI transcription pipeline with chunked processing
- Real-time Updates: WebSocket-based progress monitoring
Frontend (React + Vite)
- Port: 5173 (development)
- Entry: frontend/src/main.jsx
- Routing: React Router DOM with browser navigation support
- Navigation Structure:
/ - Homepage
/library - Podcast library
/search - Search/add content
/confirm?podcastData=... - Episode confirmation
/processing/{sessionId} - Real-time processing progress
/player/{podcastId} - Interactive player with AI chat
Backend (Express.js + WebSocket)
- Port: 3001 (proxied through frontend)
- Entry: frontend/server.js
- Real-time: WebSocket support for processing updates
- API Pattern: All endpoints prefixed with /api/
Python Processing Pipeline
- Main Script: scripts/podcast2jsonl.py - Audio transcription using the Gemini API with speaker diarization
- Speaker Extraction Tool: scripts/extract_speaker_audio.py - Standalone 30-second speaker audio clip extraction
- Output: JSONL transcripts with speaker identification and timestamps, plus high-quality WAV speaker clips
- Podcast Search: Discover any podcast via Podcast Index API
- YouTube Support: Direct URL processing for any YouTube video
- Audio Normalization: Automatic conversion to mono 16kHz for optimal AI processing
- Chunked Processing: Handles long-form content (20-minute chunks with 30-second overlap)
- Speaker Identification: Google Gemini API identifies and labels speakers
- Verbatim Transcription: Preserves stutters, emphasis, interruptions, and vocal nuances
- Parallel Processing: Bio generation (OpenAI o3) runs concurrently with transcription
- Smart Chunking: Automatic audio splitting for optimal processing efficiency
- Real-time Voice Chat: ElevenLabs voice agents for natural AI conversations
- Context-Aware: AI uses last N minutes of transcript for relevant responses
- Dynamic System Prompts: Generated from episode content and speaker biographies
- Persona Embodiment: AI speaks as the primary speaker from the episode
- Synchronized Transcript: Highlights current audio position in real-time
- Click-to-Seek: Instant navigation by clicking transcript segments
- Browser Navigation: Full back/forward/refresh/bookmark support
- URL State Management: Shareable links to any application state
- Real-time Progress: WebSocket updates during processing
- High-Quality Clips: 30-second WAV files optimized for voice cloning applications
- AI-Powered Selection: Gemini 2.5 Pro identifies optimal speaking regions for each speaker
- Quality Scoring: Multi-criteria analysis (energy, voice activity, spectral quality)
- Professional Processing: 44.1kHz mono output with fade in/out and artifact removal
- Flexible Input: Supports YouTube URLs, podcast URLs, and local audio files
- Comprehensive Metadata: Detailed extraction logs with quality metrics and processing methods
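As a toy illustration of one scoring criterion from the list above, here is RMS energy computed over a candidate window of samples. The real tool combines energy with voice-activity and spectral measures (via librosa); this stdlib-only version is a deliberate simplification, and the function name is illustrative rather than taken from the codebase.

```python
import math

def rms_energy(samples):
    """Root-mean-square energy of a window of PCM samples scaled to [-1, 1].

    Higher values indicate louder, more consistently voiced audio --
    one signal (of several) that a window is a good speaker clip.
    """
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))
```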
- Content Discovery: User searches podcasts or pastes YouTube URLs
- Processing Pipeline:
- Audio download (yt-dlp/HTTP)
- Audio normalization (ffmpeg to mono 16kHz)
- Speaker identification (Gemini API)
- Parallel bio generation (OpenAI o3) and transcription (Gemini API)
- Summary generation (Gemini API)
- System prompt generation for AI chat
- Interactive Experience: Real-time audio player with synchronized transcript and voice AI chat
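The parallel step in the pipeline above (bio generation alongside transcription) can be sketched with stdlib concurrency. The worker callables here are placeholders for the actual OpenAI and Gemini calls; the helper name is an assumption, not the platform's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(transcribe, generate_bios):
    """Run transcription and bio generation concurrently; return both results.

    Both tasks are I/O-bound API calls, so threads overlap their wait time.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        t = pool.submit(transcribe)
        b = pool.submit(generate_bios)
        return t.result(), b.result()
```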
The system uses smart chunking for long-form content:
- Audio split into 20-minute chunks with 30-second overlap
- Real-time progress updates via WebSocket
- Streaming chunk completion markers for incremental loading
- Parallel bio generation (OpenAI o3 + Anthropic Claude) during transcription
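The chunk-boundary math behind the scheme above (20-minute chunks, each starting 30 seconds before the previous one ended) can be sketched as follows. The function name and signature are illustrative, not taken from podcast2jsonl.py.

```python
def chunk_boundaries(duration_s, chunk_s=20 * 60, overlap_s=30):
    """Return (start, end) second offsets covering duration_s seconds of audio."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        # Next chunk begins 30 s before this one ended, so sentences that
        # straddle a boundary are transcribed twice and can be merged later.
        start = end - overlap_s
    return bounds
```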
Real-Time Updates: The progress bar provides sparse but accurate updates:
- 5% - Initialization: Python script starts
- 5-20%: Audio download, normalization, speaker identification
- 20-75%: AI transcription (longest phase, chunked processing)
- 75-85%: Transcript post-processing and merging
- 85-90%: Episode summary generation
- 90-95%: System prompt creation for AI chat
- 100%: Complete with interactive player ready
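The stage-to-percentage mapping above can be expressed as a simple lookup table. The percentage values come from this README; the stage names themselves are illustrative.

```python
# Upper-bound progress percentage reached at the end of each stage.
STAGE_PROGRESS = {
    "init": 5,
    "audio_prep": 20,       # download, normalization, speaker identification
    "transcription": 75,    # chunked AI transcription (longest phase)
    "post_processing": 85,  # transcript merging
    "summary": 90,
    "system_prompt": 95,
    "complete": 100,
}

def progress_for(stage):
    """Return the progress ceiling for a stage, or 0 for unknown stages."""
    return STAGE_PROGRESS.get(stage, 0)
```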
WebSocket Markers:
- @@CHUNK_READY@@ - Incremental transcript loading
- @@PROCESS_COMPLETE@@ - Final completion signal
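A minimal sketch of how the backend might scan subprocess stdout for these markers before broadcasting over WebSocket. The marker strings are from this README; the classification logic itself is an assumption about the implementation.

```python
CHUNK_READY = "@@CHUNK_READY@@"
PROCESS_COMPLETE = "@@PROCESS_COMPLETE@@"

def classify_line(line):
    """Map one stdout line to an event name, or None for plain log output."""
    if CHUNK_READY in line:
        return "chunk_ready"
    if PROCESS_COMPLETE in line:
        return "process_complete"
    return None
```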
- Gemini API: Audio transcription, speaker identification, content summarization
- OpenAI o3: Speaker biography generation with web search capabilities
- Anthropic Claude: Bio content formatting and cleaning
- ElevenLabs: Voice AI agents for conversational experience
- Podcast Index API: Podcast search and metadata (credentials included)
- POST /api/search-podcasts - Search Podcast Index API
- POST /api/get-episodes - Get episodes for a specific podcast
- POST /api/process-podcast - Initiate processing pipeline
- GET /api/status/:sessionId - Processing status
- GET /api/generated-podcasts - List processed content
- GET /api/podcast/:id - Get specific podcast data
- WebSocket /ws - Real-time progress updates
Extract high-quality 30-second speaker clips from any audio source:
# YouTube video
python scripts/extract_speaker_audio.py "https://www.youtube.com/watch?v=VIDEO_ID" --output-dir ./clips
# Podcast episode
python scripts/extract_speaker_audio.py "https://media.example.com/podcast.mp3" --output-dir ./clips
# Local audio file
python scripts/extract_speaker_audio.py "/path/to/audio.mp3" --output-dir ./clips

# Set Gemini API key for speaker identification
export GEMINI_API_KEY="your-api-key-here"
# Or pass directly via command line
python scripts/extract_speaker_audio.py "SOURCE" --api-key "your-key" --output-dir ./clips

./clips/
├── Episode_Title_Speaker_Name_30s.wav # High-quality WAV files (44.1kHz mono)
└── Episode_Title_extraction_metadata.json # Comprehensive processing metadata
- Voice Cloning: High-quality speaker samples for AI voice synthesis
- Speaker Verification: Audio samples for identity confirmation systems
- Podcast Highlights: Key speaker segments for promotional content
- Research Applications: Clean speaker samples for audio analysis
The platform uses Winston with daily log rotation for comprehensive monitoring:
// Component-specific loggers
const apiLogger = createLogger('api');
const wsLogger = createLogger('websocket');
const processingLogger = createLogger('processing');

Log Categories:
- API Logs: logs/app-YYYY-MM-DD.log - Request/response tracking with correlation IDs
- Error Logs: logs/error-YYYY-MM-DD.log - All errors with stack traces (30-day retention)
- Processing Logs: logs/processing-YYYY-MM-DD.log - Detailed subprocess monitoring (7-day retention)
- Exception Logs: logs/exceptions-YYYY-MM-DD.log - Uncaught exceptions and rejections
Structured Data:
- Correlation IDs: Every request gets a UUID for tracing
- Session Tracking: Processing sessions monitored throughout lifecycle
- Subprocess Monitoring: Python script output, duration, and resource usage
- WebSocket Events: Connection lifecycle and message flow
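The correlation-ID idea above (a UUID attached to every log record for a request) translates directly to the Python side of the pipeline. The Winston setup shown earlier is JavaScript; this stdlib sketch shows the same pattern as a hypothetical logger_config-style helper, and the function name is an assumption.

```python
import logging
import uuid

def logger_with_correlation_id(name):
    """Return (logger adapter, correlation_id); the ID rides along on every record."""
    cid = str(uuid.uuid4())
    logger = logging.getLogger(name)
    # LoggerAdapter injects the extra dict into each LogRecord, so a
    # formatter can include %(correlation_id)s for request tracing.
    adapter = logging.LoggerAdapter(logger, {"correlation_id": cid})
    return adapter, cid
```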
WebSocket Implementation:
- Connection Tracking: Each WebSocket connection gets unique ID
- Session Subscription: Clients subscribe to specific session updates
- Progress Broadcasting: Real-time updates sent to subscribed clients
- Error Propagation: Processing errors immediately broadcast to UI
Session Management:
// Active sessions with complete lifecycle tracking
const activeSessions = new Map();
// Includes: status, progress, timing, errors, results

Structured Error Management:
- Retry Strategies: Exponential backoff for API calls
- Health Monitoring: API health checks before processing
- Graceful Degradation: Cache fallbacks for processed content
- Session Recovery: Processing state preserved across server restarts
Error Boundaries:
- Component-Level: React error boundaries for UI resilience
- API-Level: Comprehensive error responses with correlation IDs
- Processing-Level: Python subprocess error capture and reporting
- URL-Based State: Critical application state stored in URL parameters
- Session Storage: Complex objects backed up in sessionStorage
- WebSocket State: Real-time processing updates
- React Router: Browser navigation with history support
Chunked Processing:
- 20-minute chunks to avoid API timeouts
- Parallel operations (bio generation + transcription)
- Resource limits (speaker bio generation limited to first 4 speakers)
- Timeout handling with comprehensive fallback strategies
Caching Strategy:
- Processed Content: Automatic cache detection for re-processing prevention
- YouTube URLs: URL-based cache lookup to avoid duplicate processing
- Metadata Persistence: JSONL format for efficient transcript storage
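A hypothetical sketch of the URL-based cache lookup described above: hash the source URL to a stable key and check the output directory for an existing transcript. The on-disk naming scheme here is an assumption for illustration, not the platform's actual one.

```python
import hashlib
from pathlib import Path

def cached_transcript(url, out_dir="generated_podcasts"):
    """Return the path of a previously generated transcript, or None."""
    # A short stable hash of the URL avoids filesystem-unsafe characters.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    path = Path(out_dir) / f"{key}_transcript.jsonl"
    return path if path.exists() else None
```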
- Context Extraction: Last N minutes of transcript sent to AI agents
- System Prompts: Dynamically generated from episode content and speaker biographies
- Persona Embodiment: AI speaks as the primary speaker from the episode
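A hedged sketch of the "last N minutes of transcript" extraction (the idea contextExtractor.js is described as implementing, here in Python). Field names like "start", "speaker", and "text" are assumptions about the JSONL schema.

```python
def recent_context(segments, now_s, window_min=3):
    """Join segments whose start time falls within the last window_min minutes."""
    cutoff = now_s - window_min * 60
    lines = [
        f"{seg['speaker']}: {seg['text']}"
        for seg in segments
        if cutoff <= seg["start"] <= now_s
    ]
    return "\n".join(lines)
```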
Configuration (config.yml):
elevenlabs:
agent_id: "agent_01k0devtnafyhb7cg4ztv3gpa8"
api_key: ""
default_context_min: 3

Integration Points:
- VoiceCallModal.jsx: ElevenLabs voice agent integration
- useElonVoiceAgent.js: React hook for voice functionality
- contextExtractor.js: Extracts relevant transcript context for AI
cast.dread.technology/
├── frontend/ # Integrated React frontend + Express backend
│ ├── src/
│ │ ├── components/ # React components
│ │ │ ├── PodcastSearch.jsx # Content discovery interface
│ │ │ ├── ProcessingProgress.jsx # Real-time progress with WebSocket
│ │ │ ├── Player.jsx # Audio player with transcript sync
│ │ │ ├── Transcript.jsx # Interactive transcript viewer
│ │ │ └── VoiceCallModal.jsx # ElevenLabs voice agent integration
│ │ ├── hooks/ # Custom React hooks
│ │ │ └── useElonVoiceAgent.js # Voice agent integration hook
│ │ ├── utils/ # Utility functions
│ │ │ ├── configLoader.js # Configuration loading
│ │ │ ├── contextExtractor.js # Transcript context extraction
│ │ │ └── time.js # Time utilities
│ │ └── main.jsx # React entry point
│ ├── server.js # Integrated Express.js backend server
│ ├── logger.js # Winston logging configuration
│ ├── package.json # All dependencies (frontend + backend)
│ └── vite.config.js # Vite configuration with proxy
├── scripts/ # Python processing pipeline
│ ├── podcast2jsonl.py # Main AI transcription script
│ ├── extract_speaker_audio.py # Standalone speaker audio extraction tool
│ ├── SPEAKER_EXTRACTION_README.md # Speaker extraction documentation
│ ├── api_health_monitor.py # API health monitoring utilities
│ ├── logger_config.py # Structured logging configuration
│ ├── subprocess_utils.py # Enhanced subprocess monitoring
│ └── txt_to_jsonl.py # Text-to-JSONL converter
├── generated_podcasts/ # Generated content storage
│ ├── {Episode_Title}_transcript.jsonl
│ ├── {Episode_Title}_metadata.json
│ └── {Episode_Title}_summary.txt
├── speaker_clips/ # Speaker audio extraction output (configurable)
│ ├── {Episode_Title}_{Speaker_Name}_30s.wav
│ └── {Episode_Title}_extraction_metadata.json
├── logs/ # Structured logging output
│ ├── app-YYYY-MM-DD.log # General application logs
│ ├── error-YYYY-MM-DD.log # Error logs (30-day retention)
│ ├── processing-YYYY-MM-DD.log # Processing logs (7-day retention)
│ └── exceptions-YYYY-MM-DD.log # Exception logs
├── config.yml # Application configuration
├── requirements.txt # Python dependencies
└── test-setup.sh # Comprehensive setup validation
User Action: Search podcasts or paste YouTube URLs
Technical Process:
- Frontend calls POST /api/search-podcasts
- Backend queries Podcast Index API with authentication headers
- Results cached and returned to user
User Action: Select episode and confirm processing
Technical Process:
- Check for existing processed content (cache lookup)
- If cached: Return immediately with processed data
- If new: Initiate Python subprocess with session tracking
User Action: Monitor progress via WebSocket connection
Technical Process:
- 5%: Python script initialization
- 5-20%: Audio download, normalization, speaker identification
- 20-75%: Chunked AI transcription (longest phase)
- 75-100%: Post-processing, summary generation, system prompt creation
Progress Updates: Sparse but accurate (not artificial increments)
- WebSocket broadcasts real Python script output
- @@CHUNK_READY@@ markers enable incremental transcript loading
- @@PROCESS_COMPLETE@@ signals final completion
User Action: Listen and chat with AI about content
Technical Process:
- Audio player syncs with transcript highlighting
- Voice chat extracts recent transcript context
- ElevenLabs voice agent responds as episode speaker
- System prompt provides context-aware AI personality
# Frontend (.env) - Optional API key overrides
GEMINI_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
PORT=3001
NODE_ENV=development
LOG_LEVEL=info
# Note: API keys are also hardcoded in server.js for development
# For production, use environment variables instead

# Legacy JRE episode configuration
audio_path: "data/jre_elon_2281/joe_rogan_experience_2281_elon_musk_m4a_128k.mp3"
transcript_path: "data/jre_elon_2281/transcript_2281.jsonl"
# AI Integration
obelisk:
api_url: "https://obelisk.dread.technology/api"
api_key: "sk-..."
model_id: "gpt-4.1-base"
# Voice Agent Configuration
elevenlabs:
agent_id: "agent_01k0devtnafyhb7cg4ztv3gpa8"
api_key: ""
default_context_min: 3

./test-setup.sh

Validates:
- System dependencies (Node.js 18+, Python 3.8+, ffmpeg)
- Package installations (npm and pip)
- API configurations and connectivity
- Directory structure and permissions
- Environment variable setup
Output: Color-coded status with specific remediation steps for any issues found.
# Start development (both frontend & backend)
cd frontend && npm run dev
# Build for production
cd frontend && npm run build
# Preview production build
cd frontend && npm run preview
# Build and serve production
cd frontend && npm run build-and-serve

Transcript Format:
- Input: Speaker-diarized text with timestamps ([HH:MM:SS] Speaker: Text)
- Output: JSONL format with structured speaker/timestamp/text entries
- Features: Preserves stutters, emphasis, interruptions, and vocal nuances
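Converting one [HH:MM:SS] Speaker: Text line into a structured entry might look like the sketch below (the exact output fields are an assumption; compare scripts/txt_to_jsonl.py for the real converter).

```python
import re

# One timestamped, speaker-attributed transcript line.
LINE_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]\s+([^:]+):\s+(.*)")

def parse_line(line):
    """Parse '[HH:MM:SS] Speaker: Text' into a dict, or None if malformed."""
    m = LINE_RE.match(line)
    if not m:
        return None
    h, mnt, s, speaker, text = m.groups()
    return {
        "start": int(h) * 3600 + int(mnt) * 60 + int(s),  # seconds offset
        "speaker": speaker.strip(),
        "text": text.strip(),
    }
```

Each parsed dict can then be written with json.dumps as one JSONL row.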
Performance Considerations:
- Chunked Processing: 20-minute chunks to avoid API timeouts
- Parallel Operations: Bio generation runs concurrently with transcription
- Resource Limits: Speaker bio generation limited to first 4 speakers
- Timeout Handling: Comprehensive timeout strategies for long-running operations
Generated Content (generated_podcasts/ directory):
- Naming Pattern: {Episode_Title}_transcript.jsonl, {Episode_Title}_metadata.json
- System Prompts: Generated for each episode to enable contextual AI chat
- Cache Strategy: Automatic detection prevents duplicate processing
Full Browser Support:
- Back/Forward: Natural browser navigation
- Refresh: State preservation across reloads
- Bookmarks: Direct URL access to any page state
- Deep Linking: Shareable URLs for specific content
- URL Structure: RESTful paths with embedded state data
- Generated podcasts stored in the generated_podcasts/ directory
- Metadata files contain episode information and processing results
- Transcript files in JSONL format for easy parsing
- Summary files contain AI-generated episode summaries
- System prompts generated per episode for contextual AI chat
Application Not Starting
- Ensure you're in the frontend directory: cd frontend && npm run dev
- Check for port conflicts on ports 3001 or 5173
- Verify Node.js 18+ is installed: node --version
Python Dependencies Missing
- Install requirements: pip install -r requirements.txt
- Verify Python 3.8+ is installed: python3 --version
- Check ffmpeg installation: ffmpeg -version
Audio Processing Fails
- Ensure ffmpeg is installed and in PATH
- Check episode URL accessibility
- Verify API keys are configured correctly
- Check processing logs: logs/processing-YYYY-MM-DD.log
Voice Chat Not Working
- Check microphone permissions in browser
- Ensure the ElevenLabs agent ID is correct in config.yml
- Verify internet connection for the ElevenLabs API
- Check browser console for WebRTC errors
Processing Stuck or Slow
- Monitor logs for Python subprocess output
- Check available disk space in generated_podcasts/
- Verify API rate limits have not been exceeded
- Long episodes (>2 hours) may take 30+ minutes
Set environment variables for detailed logging:
# Enable debug logging
export NODE_ENV=development
export LOG_LEVEL=debug
# Start with verbose output
cd frontend && npm run dev

Check processing progress:
# Monitor real-time processing
tail -f logs/processing-$(date +%Y-%m-%d).log
# Check for errors
grep "ERROR" logs/error-$(date +%Y-%m-%d).log
# API request tracking
grep "api_request" logs/app-$(date +%Y-%m-%d).log

- Original Architecture: Aman (cast.dread.technology)
- AI Processing Integration: Neel Sardana (Bread Technologies)
- Voice Agent Module: Cameron
- Transcript Processing: Google Gemini API
- Voice Agents: ElevenLabs
- Podcast Data: Podcast Index API
- Bio Generation: OpenAI o3 + Anthropic Claude
MIT License - See LICENSE file for details.
For questions or contributions, please open an issue or contact the maintainers.