CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Development Commands

Starting the Application

cd frontend
npm run dev

This starts both the frontend (port 5173) and backend (port 3001) in a single command using concurrently.

Building and Production

cd frontend
npm run build                    # Build frontend for production
npm run build-and-serve          # Build and start production server
npm run preview                  # Preview production build

Testing Setup

./test-setup.sh                  # Comprehensive setup validation script

Python Environment

pip install -r requirements.txt  # Install Python dependencies (includes librosa, numpy, python-dotenv)

Speaker Audio Extraction Tool

# Extract 30-second speaker clips from any audio source
python scripts/extract_speaker_audio.py "https://youtube.com/watch?v=VIDEO_ID" --output-dir ./clips

# Set Gemini API key for speaker identification
export GEMINI_API_KEY="your-api-key-here"
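In Python, the key can be read from the environment along these lines (a minimal sketch; the pipeline itself may load it via python-dotenv, which is listed in requirements.txt):

```python
import os

def get_gemini_api_key() -> str:
    """Read the Gemini API key from the environment, failing fast if unset."""
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before running the extraction tool."
        )
    return key
```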

Architecture Overview

This is an interactive podcast platform that processes audio content (podcasts, YouTube videos) into transcripts with AI-powered voice chat functionality. The architecture uses an integrated frontend/backend approach where both servers run from a single command.

Key Components

Frontend (React + Vite)

  • Port: 5173 (development)
  • Entry: frontend/src/main.jsx
  • Routing: React Router DOM with browser navigation support
  • Navigation Structure:
    • / - Homepage
    • /library - Podcast library
    • /search - Search/add content
    • /confirm?podcastData=... - Episode confirmation
    • /processing/{sessionId} - Real-time processing progress
    • /player/{podcastId} - Interactive player with AI chat

Backend (Express.js + WebSocket)

  • Port: 3001 (proxied through frontend)
  • Entry: frontend/server.js
  • Real-time: WebSocket support for processing updates
  • API Pattern: All endpoints prefixed with /api/

Python Processing Pipeline

  • Main Script: scripts/podcast2jsonl.py - Audio transcription using Gemini API with speaker diarization
  • Speaker Extraction Tool: scripts/extract_speaker_audio.py - Standalone 30-second speaker audio clip extraction
  • Output: JSONL transcripts with speaker identification and timestamps, plus high-quality WAV speaker clips

Data Flow

  1. Content Discovery: User searches podcasts or pastes YouTube URLs
  2. Processing Pipeline:
    • Audio download (yt-dlp/HTTP)
    • Audio normalization (ffmpeg to mono 16kHz)
    • Speaker identification (Gemini API)
    • Parallel bio generation (OpenAI o3) and transcription (Gemini API)
    • Summary generation (Gemini API)
    • System prompt generation for AI chat
  3. Interactive Experience: Real-time audio player with synchronized transcript and voice AI chat

Processing Architecture

The system uses chunked processing for long-form content:

  • Audio split into 20-minute chunks with 30-second overlap
  • Real-time progress updates via WebSocket
  • Streaming chunk completion markers for incremental loading
  • Parallel bio generation (OpenAI o3 + Anthropic Claude) during transcription
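The chunk boundaries implied by those parameters can be sketched as follows (hypothetical helper; the actual pipeline code may differ):

```python
def chunk_bounds(duration_s: float, chunk_s: int = 1200, overlap_s: int = 30):
    """Return (start, end) second offsets for 20-minute chunks with 30 s overlap."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # next chunk re-covers the last 30 seconds
    return bounds
```

For a 50-minute (3000 s) file this yields (0, 1200), (1170, 2370), (2340, 3000), so every chunk boundary is transcribed twice and can be stitched without losing words.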

File Structure & Key Locations

Generated Content

  • Podcast Storage: generated_podcasts/ directory
  • Speaker Clips Storage: Configurable output directories (default: ./speaker_clips/)
  • Naming Pattern: {Episode_Title}_transcript.jsonl, {Episode_Title}_metadata.json, etc.
  • Speaker Clips: {Episode_Title}_{Speaker_Name}_30s.wav + extraction metadata JSON
  • System Prompts: Generated for each episode to enable contextual AI chat

Configuration

  • Main Config: config.yml (legacy JRE configuration + ElevenLabs agent settings)
  • Environment: frontend/.env (optional API keys)
  • Package Management: frontend/package.json contains both frontend and backend dependencies

Core React Components

  • PodcastSearch.jsx: Content discovery interface
  • ProcessingProgress.jsx: Real-time processing status with WebSocket
  • Player.jsx: Audio player with transcript synchronization
  • VoiceCallModal.jsx: ElevenLabs voice agent integration
  • Transcript.jsx: Interactive transcript with click-to-seek

API Integration

External APIs

  • Gemini API: Audio transcription, speaker identification, content summarization
  • OpenAI o3: Speaker biography generation with web search
  • Anthropic Claude: Bio content formatting and cleaning
  • ElevenLabs: Voice AI agents for conversational experience
  • Podcast Index API: Podcast search and metadata (credentials included in codebase)

API Endpoints (Backend)

  • POST /api/search-podcasts - Search Podcast Index API
  • POST /api/get-episodes - Get episodes for specific podcast
  • POST /api/process-podcast - Initiate processing pipeline
  • GET /api/status/:sessionId - Processing status
  • GET /api/generated-podcasts - List processed content
  • WebSocket /ws - Real-time progress updates

Speaker Audio Extraction Tool

Overview

The extract_speaker_audio.py tool extracts high-quality 30-second audio clips per speaker, optimized for voice cloning and speaker analysis, without requiring the full processing pipeline.

Key Features

  • AI-Powered Speaker Analysis: Uses Gemini 2.5 Pro to identify speakers and optimal speaking regions
  • Multiple Input Sources: Supports YouTube URLs, podcast URLs, and local audio files
  • Quality Scoring System: Multi-criteria scoring (0-100) based on energy, voice activity, and spectral quality
  • Professional Audio Processing: 44.1kHz mono WAV output with fade processing
  • Comprehensive Metadata: Detailed extraction logs with quality metrics and processing methods

Usage Examples

# YouTube video extraction
python scripts/extract_speaker_audio.py "https://www.youtube.com/watch?v=VIDEO_ID" --output-dir ./clips

# Podcast URL extraction  
python scripts/extract_speaker_audio.py "https://media.example.com/podcast.mp3" --output-dir ./clips

# Local audio file extraction
python scripts/extract_speaker_audio.py "/path/to/audio.mp3" --output-dir ./clips

# With API key specification
python scripts/extract_speaker_audio.py "SOURCE_URL" --api-key "your-gemini-key" --output-dir ./clips
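The CLI surface used above could be declared with argparse roughly as follows (a sketch inferred from the usage examples, not the script's actual code):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Extract 30-second speaker clips from an audio source."
    )
    parser.add_argument("source", help="YouTube URL, podcast URL, or local audio path")
    parser.add_argument("--output-dir", default="./speaker_clips/",
                        help="Directory for WAV clips and metadata JSON")
    parser.add_argument("--api-key", default=None,
                        help="Gemini API key (falls back to GEMINI_API_KEY)")
    return parser
```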

Technical Architecture

The tool leverages the existing podcast processing infrastructure:

  • Audio Processing: Reuses proven download and normalization functions from main pipeline
  • AI Integration: Integrates with Gemini 2.5 Pro's 1M context window for speaker analysis
  • Signal Analysis: Uses librosa for precision audio quality assessment
  • Logging: Structured logging with stage-specific progress tracking

Output Structure

./speaker_clips/
├── Episode_Title_Speaker_Name_30s.wav     # High-quality WAV files
└── Episode_Title_extraction_metadata.json # Comprehensive processing metadata

Quality Standards

  • Energy Score (30%): RMS energy levels in optimal speech range
  • Voice Activity (25%): Percentage of frames with active speech
  • Spectral Quality (20%): Spectral centroid in speech frequency range
  • Silence Penalty (15%): Deduction for long silence gaps (>2 seconds)
  • Zero Crossing Rate (10%): Optimal range for human speech
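Assuming each criterion is normalized to 0-100 (with the silence criterion expressed as a score where 100 means no long gaps), the weighted combination can be sketched as:

```python
def quality_score(energy, voice_activity, spectral, silence, zero_crossing):
    """Combine per-criterion scores (each 0-100) using the documented weights.

    `silence` is assumed here to already be a score (100 = no long gaps), so
    the 15% silence penalty enters as a weighted term like the others.
    """
    weights = {
        "energy": 0.30, "voice_activity": 0.25, "spectral": 0.20,
        "silence": 0.15, "zero_crossing": 0.10,
    }
    return (weights["energy"] * energy
            + weights["voice_activity"] * voice_activity
            + weights["spectral"] * spectral
            + weights["silence"] * silence
            + weights["zero_crossing"] * zero_crossing)
```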

Integration with Main Platform

While standalone, the tool shares core components with the main podcast processing pipeline:

  • subprocess_utils.py: Enhanced subprocess monitoring for yt-dlp and ffmpeg
  • logger_config.py: Structured logging configuration
  • api_health_monitor.py: API health monitoring and retry strategies

Development Patterns

Error Handling & Logging

  • Structured Logging: Winston with daily log rotation (logs/ directory)
  • Component Loggers: API, WebSocket, and processing-specific loggers
  • Error Recovery: Retry strategies with exponential backoff for API calls
  • Health Monitoring: API health checks before processing

State Management

  • URL-Based State: Critical application state stored in URL parameters
  • Session Storage: Complex objects backed up in sessionStorage
  • WebSocket State: Real-time processing updates
  • React Router: Browser navigation with history support

Processing Pipeline Monitoring

  • Progress Markers: Sparse updates (5% → 85% → 100%)
  • Chunk Streaming: @@CHUNK_READY@@ markers for incremental transcript loading
  • Completion Signals: @@PROCESS_COMPLETE@@ markers
  • Resource Monitoring: Memory and duration tracking for subprocess operations
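A consumer of the subprocess's stdout stream might watch for those markers like this (a sketch; the payload carried alongside each marker is assumed):

```python
def scan_markers(lines):
    """Yield ('chunk', line) per chunk marker; stop at the completion marker."""
    for line in lines:
        if "@@PROCESS_COMPLETE@@" in line:
            yield ("complete", line)
            return
        if "@@CHUNK_READY@@" in line:
            yield ("chunk", line)
```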

Important Implementation Notes

Transcript Format

  • Input: Speaker-diarized text with timestamps [HH:MM:SS] Speaker: Text
  • Output: JSONL format with structured speaker/timestamp/text entries
  • Features: Preserves stutters, emphasis, interruptions, and vocal nuances
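The input-to-JSONL mapping can be illustrated with a small parser (a sketch; the field names in the real JSONL output may differ):

```python
import json
import re

LINE_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]\s*([^:]+):\s*(.*)")

def parse_transcript_line(line: str) -> dict:
    """Turn '[HH:MM:SS] Speaker: Text' into a structured entry."""
    m = LINE_RE.match(line)
    if not m:
        raise ValueError(f"unrecognized transcript line: {line!r}")
    h, mnt, s, speaker, text = m.groups()
    return {
        "timestamp": int(h) * 3600 + int(mnt) * 60 + int(s),
        "speaker": speaker.strip(),
        "text": text,
    }

def to_jsonl(lines):
    return "\n".join(json.dumps(parse_transcript_line(l)) for l in lines)
```

Note that the text field is kept verbatim, so stutters and emphasis survive the conversion.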

Voice AI Integration

  • Context Extraction: Last N minutes of transcript sent to AI agents
  • System Prompts: Dynamically generated based on episode content and speaker biographies
  • Persona Embodiment: AI speaks as the primary speaker from the episode
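Extracting the last N minutes of transcript can be sketched as follows (assumes entries carry a `timestamp` field in seconds, per the transcript format; the real context-extraction code may differ):

```python
def recent_context(entries, current_time_s, window_min=5):
    """Return transcript entries whose timestamp falls within the last N minutes."""
    cutoff = current_time_s - window_min * 60
    return [e for e in entries if cutoff <= e["timestamp"] <= current_time_s]
```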

Performance Considerations

  • Chunked Processing: 20-minute chunks to avoid API timeouts
  • Parallel Operations: Bio generation runs concurrently with transcription
  • Resource Limits: Speaker bio generation limited to first 4 speakers
  • Timeout Handling: Comprehensive timeout strategies for long-running operations

Testing and Validation

Use ./test-setup.sh to validate:

  • System dependencies (Node.js 18+, Python 3.8+, ffmpeg)
  • Package installations
  • API configurations
  • Directory structure

The script provides color-coded status output and specific remediation steps for any issues found.