This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
```bash
cd frontend
npm run dev
```

This starts both the frontend (port 5173) and backend (port 3001) in a single command using `concurrently`.
```bash
cd frontend
npm run build            # Build frontend for production
npm run build-and-serve  # Build and start production server
npm run preview          # Preview production build
```

```bash
./test-setup.sh                  # Comprehensive setup validation script
pip install -r requirements.txt  # Install Python dependencies (includes librosa, numpy, python-dotenv)
```

```bash
# Extract 30-second speaker clips from any audio source
python scripts/extract_speaker_audio.py "https://youtube.com/watch?v=VIDEO_ID" --output-dir ./clips

# Set Gemini API key for speaker identification
export GEMINI_API_KEY="your-api-key-here"
```

This is an interactive podcast platform that processes audio content (podcasts, YouTube videos) into transcripts with AI-powered voice chat functionality. The architecture uses an integrated frontend/backend approach where both servers run from a single command.
### Frontend (React + Vite)
- Port: 5173 (development)
- Entry: `frontend/src/main.jsx`
- Routing: React Router DOM with browser navigation support
- Navigation Structure:
  - `/` - Homepage
  - `/library` - Podcast library
  - `/search` - Search/add content
  - `/confirm?podcastData=...` - Episode confirmation
  - `/processing/{sessionId}` - Real-time processing progress
  - `/player/{podcastId}` - Interactive player with AI chat
### Backend (Express.js + WebSocket)
- Port: 3001 (proxied through frontend)
- Entry: `frontend/server.js`
- Real-time: WebSocket support for processing updates
- API Pattern: All endpoints prefixed with `/api/`
### Python Processing Pipeline
- Main Script: `scripts/podcast2jsonl.py` - Audio transcription using Gemini API with speaker diarization
- Speaker Extraction Tool: `scripts/extract_speaker_audio.py` - Standalone 30-second speaker audio clip extraction
- Output: JSONL transcripts with speaker identification and timestamps, plus high-quality WAV speaker clips
- Content Discovery: User searches podcasts or pastes YouTube URLs
- Processing Pipeline:
  - Audio download (yt-dlp/HTTP)
  - Audio normalization (ffmpeg to mono 16kHz)
  - Speaker identification (Gemini API)
  - Parallel bio generation (OpenAI o3) and transcription (Gemini API)
  - Summary generation (Gemini API)
  - System prompt generation for AI chat
- Interactive Experience: Real-time audio player with synchronized transcript and voice AI chat
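The normalization step above (ffmpeg to mono 16kHz) can be sketched as a command builder. This is an illustrative sketch, not the pipeline's actual code; the real invocation lives in the Python scripts and may use different flags.

```python
def normalize_cmd(src: str, dst: str) -> list[str]:
    """Build an ffmpeg command that downmixes to mono and resamples to 16 kHz.

    -ac 1    : one audio channel (mono)
    -ar 16000: 16 kHz sample rate
    -y       : overwrite the output file if it exists
    """
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst]

cmd = normalize_cmd("episode.mp3", "episode_16k.wav")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`.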
The system uses chunked processing for long-form content:
- Audio split into 20-minute chunks with 30-second overlap
- Real-time progress updates via WebSocket
- Streaming chunk completion markers for incremental loading
- Parallel bio generation (OpenAI o3 + Anthropic Claude) during transcription
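The 20-minute/30-second figures above imply chunk boundaries like the following. This is a sketch of the arithmetic only, under the assumption that each chunk after the first starts 30 seconds before the previous one ended; the real splitting logic is in the Python pipeline.

```python
def chunk_bounds(duration_s: float, chunk_s: int = 20 * 60, overlap_s: int = 30):
    """Return (start, end) second offsets for overlapping chunks."""
    bounds, start = [], 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end == duration_s:
            break
        start = end - overlap_s  # back up to create the overlap
    return bounds

# A 50-minute (3000 s) file yields three overlapping chunks.
print(chunk_bounds(3000))  # [(0, 1200), (1170, 2370), (2340, 3000)]
```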
- Podcast Storage: `generated_podcasts/` directory
- Speaker Clips Storage: Configurable output directories (default: `./speaker_clips/`)
- Naming Pattern: `{Episode_Title}_transcript.jsonl`, `{Episode_Title}_metadata.json`, etc.
- Speaker Clips: `{Episode_Title}_{Speaker_Name}_30s.wav` + extraction metadata JSON
- System Prompts: Generated for each episode to enable contextual AI chat
- Main Config: `config.yml` (legacy JRE configuration + ElevenLabs agent settings)
- Environment: `frontend/.env` (optional API keys)
- Package Management: `frontend/package.json` contains both frontend and backend dependencies
- `PodcastSearch.jsx`: Content discovery interface
- `ProcessingProgress.jsx`: Real-time processing status with WebSocket
- `Player.jsx`: Audio player with transcript synchronization
- `VoiceCallModal.jsx`: ElevenLabs voice agent integration
- `Transcript.jsx`: Interactive transcript with click-to-seek
- Gemini API: Audio transcription, speaker identification, content summarization
- OpenAI o3: Speaker biography generation with web search
- Anthropic Claude: Bio content formatting and cleaning
- ElevenLabs: Voice AI agents for conversational experience
- Podcast Index API: Podcast search and metadata (credentials included in codebase)
- `POST /api/search-podcasts` - Search Podcast Index API
- `POST /api/get-episodes` - Get episodes for specific podcast
- `POST /api/process-podcast` - Initiate processing pipeline
- `GET /api/status/:sessionId` - Processing status
- `GET /api/generated-podcasts` - List processed content
- `WebSocket /ws` - Real-time progress updates
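A request to the search endpoint can be sketched with the standard library. The `query` field name is an assumption for illustration; check `frontend/server.js` for the actual request body the backend expects.

```python
import json
from urllib import request

def build_search_request(term: str, base: str = "http://localhost:3001"):
    """Build (but do not send) a POST to /api/search-podcasts.

    NOTE: the {"query": ...} payload shape is a hypothetical example,
    not taken from the server code.
    """
    body = json.dumps({"query": term}).encode()
    return request.Request(
        f"{base}/api/search-podcasts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_search_request("history podcasts")
# Send with: request.urlopen(req) while the dev server is running on port 3001.
```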
The extract_speaker_audio.py tool provides standalone speaker audio extraction capabilities, generating high-quality 30-second audio clips optimized for voice cloning applications and speaker analysis.
- AI-Powered Speaker Analysis: Uses Gemini 2.5 Pro to identify speakers and optimal speaking regions
- Multiple Input Sources: Supports YouTube URLs, podcast URLs, and local audio files
- Quality Scoring System: Multi-criteria scoring (0-100) based on energy, voice activity, spectral quality
- Professional Audio Processing: 44.1kHz mono WAV output with fade processing
- Comprehensive Metadata: Detailed extraction logs with quality metrics and processing methods
```bash
# YouTube video extraction
python scripts/extract_speaker_audio.py "https://www.youtube.com/watch?v=VIDEO_ID" --output-dir ./clips

# Podcast URL extraction
python scripts/extract_speaker_audio.py "https://media.example.com/podcast.mp3" --output-dir ./clips

# Local audio file extraction
python scripts/extract_speaker_audio.py "/path/to/audio.mp3" --output-dir ./clips

# With API key specification
python scripts/extract_speaker_audio.py "SOURCE_URL" --api-key "your-gemini-key" --output-dir ./clips
```

The tool leverages the existing podcast processing infrastructure:
- Audio Processing: Reuses proven download and normalization functions from main pipeline
- AI Integration: Integrates with Gemini 2.5 Pro's 1M context window for speaker analysis
- Signal Analysis: Uses librosa for precision audio quality assessment
- Logging: Structured logging with stage-specific progress tracking
```
./speaker_clips/
├── Episode_Title_Speaker_Name_30s.wav       # High-quality WAV files
└── Episode_Title_extraction_metadata.json   # Comprehensive processing metadata
```
- Energy Score (30%): RMS energy levels in optimal speech range
- Voice Activity (25%): Percentage of frames with active speech
- Spectral Quality (20%): Spectral centroid in speech frequency range
- Silence Penalty (15%): Deduction for long silence gaps (>2 seconds)
- Zero Crossing Rate (10%): Optimal range for human speech
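One plausible reading of the weights above, with each sub-score normalized to 0..1 and the silence penalty applied as a deduction, is sketched below. The actual scoring logic lives in `scripts/extract_speaker_audio.py` and may combine the criteria differently.

```python
def quality_score(energy: float, voice_activity: float, spectral: float,
                  zcr: float, silence_penalty: float) -> float:
    """Combine normalized (0..1) sub-scores into a 0-100 quality score.

    Weights follow the documented breakdown: energy 30%, voice activity 25%,
    spectral quality 20%, silence penalty 15%, zero-crossing rate 10%.
    silence_penalty is 0 (no long gaps) to 1 (maximal deduction).
    """
    return (30 * energy
            + 25 * voice_activity
            + 20 * spectral
            + 10 * zcr
            + 15 * (1 - silence_penalty))

print(quality_score(0.9, 0.8, 0.7, 0.6, silence_penalty=0.1))
```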
While standalone, the tool shares core components with the main podcast processing pipeline:
- `subprocess_utils.py`: Enhanced subprocess monitoring for yt-dlp and ffmpeg
- `logger_config.py`: Structured logging configuration
- `api_health_monitor.py`: API health monitoring and retry strategies
- Structured Logging: Winston with daily log rotation (`logs/` directory)
- Component Loggers: API, WebSocket, and processing-specific loggers
- Error Recovery: Retry strategies with exponential backoff for API calls
- Health Monitoring: API health checks before processing
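The retry-with-exponential-backoff pattern described above can be sketched as follows. The backend implements this in JavaScript; this is a language-agnostic illustration in Python, and the attempt count and delays are assumptions, not the project's actual values.

```python
import time

def retry_with_backoff(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Call fn(), retrying on any exception with exponentially growing delays.

    Sleeps base_delay * 2**attempt between tries; re-raises after the
    final attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

For example, `retry_with_backoff(call_api, max_attempts=4, base_delay=1.0)` waits 1 s, 2 s, then 4 s between attempts before giving up.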
- URL-Based State: Critical application state stored in URL parameters
- Session Storage: Complex objects backed up in sessionStorage
- WebSocket State: Real-time processing updates
- React Router: Browser navigation with history support
- Progress Markers: Sparse updates (5% → 85% → 100%)
- Chunk Streaming: `@@CHUNK_READY@@` markers for incremental transcript loading
- Completion Signals: `@@PROCESS_COMPLETE@@` markers
- Resource Monitoring: Memory and duration tracking for subprocess operations
- Input: Speaker-diarized text with timestamps: `[HH:MM:SS] Speaker: Text`
- Output: JSONL format with structured speaker/timestamp/text entries
- Features: Preserves stutters, emphasis, interruptions, and vocal nuances
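The input-to-JSONL conversion can be sketched as below. The exact JSON key names in `scripts/podcast2jsonl.py` may differ; `timestamp`/`speaker`/`text` here follow the field list above but are otherwise assumptions.

```python
import json
import re

# Matches lines of the form "[HH:MM:SS] Speaker: Text".
LINE_RE = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]\s*([^:]+):\s*(.*)")

def to_jsonl_entry(line: str):
    """Convert one diarized transcript line to a JSONL record, or None."""
    m = LINE_RE.match(line)
    if not m:
        return None
    ts, speaker, text = m.groups()
    return json.dumps({"timestamp": ts, "speaker": speaker.strip(), "text": text})

print(to_jsonl_entry("[00:01:05] Alice: Welcome back to the show."))
```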
- Context Extraction: Last N minutes of transcript sent to AI agents
- System Prompts: Dynamically generated based on episode content and speaker biographies
- Persona Embodiment: AI speaks as the primary speaker from the episode
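The "last N minutes" context extraction can be sketched as a timestamp filter over transcript entries. The window size and entry shape are illustrative assumptions; the real extraction happens server-side before prompting the agent.

```python
def to_seconds(ts: str) -> int:
    """Convert an HH:MM:SS timestamp to seconds."""
    h, m, s = map(int, ts.split(":"))
    return h * 3600 + m * 60 + s

def recent_context(entries, playhead_s: int, window_min: int = 5):
    """Keep only entries within the last window_min minutes of the playhead."""
    cutoff = playhead_s - window_min * 60
    return [e for e in entries if to_seconds(e["timestamp"]) >= cutoff]

entries = [
    {"timestamp": "00:00:10", "text": "intro"},
    {"timestamp": "00:09:00", "text": "current topic"},
]
# At the 10-minute mark, only the entry from the last 5 minutes survives.
print(recent_context(entries, playhead_s=600))
```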
- Chunked Processing: 20-minute chunks to avoid API timeouts
- Parallel Operations: Bio generation runs concurrently with transcription
- Resource Limits: Speaker bio generation limited to first 4 speakers
- Timeout Handling: Comprehensive timeout strategies for long-running operations
Use `./test-setup.sh` to validate:
- System dependencies (Node.js 18+, Python 3.8+, ffmpeg)
- Package installations
- API configurations
- Directory structure
The script provides color-coded status output and specific remediation steps for any issues found.