CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Development Commands

Starting the Application

cd frontend
npm run dev

This starts both the frontend (port 5173) and backend (port 3001) in a single command using concurrently.

Building and Production

cd frontend
npm run build                    # Build frontend for production
npm run build-and-serve          # Build and start production server
npm run preview                  # Preview production build

Testing Setup

./test-setup.sh                  # Comprehensive setup validation script

Python Environment

pip install -r requirements.txt  # Install Python dependencies (includes librosa, numpy, python-dotenv)

Speaker Audio Extraction Tool

# Extract 30-second speaker clips from any audio source
python scripts/extract_speaker_audio.py "https://youtube.com/watch?v=VIDEO_ID" --output-dir ./clips

# Set Gemini API key for speaker identification
export GEMINI_API_KEY="your-api-key-here"
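In Python, the key can be read from the environment along these lines (a minimal sketch; the pipeline itself may load it via python-dotenv, which is listed in requirements.txt):

```python
import os

def get_gemini_api_key() -> str:
    """Read the Gemini API key from the environment, failing fast if unset."""
    key = os.environ.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before running the extraction tool."
        )
    return key
```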

Architecture Overview

This is an interactive podcast platform that processes audio content (podcasts, YouTube videos) into transcripts with AI-powered voice chat functionality. The architecture uses an integrated frontend/backend approach where both servers run from a single command.

Key Components

Frontend (React + Vite)

  • Port: 5173 (development)
  • Entry: frontend/src/main.jsx
  • Routing: React Router DOM with browser navigation support
  • Navigation Structure:
    • / - Homepage
    • /library - Podcast library
    • /search - Search/add content
    • /confirm?podcastData=... - Episode confirmation
    • /processing/{sessionId} - Real-time processing progress
    • /player/{podcastId} - Interactive player with AI chat

Backend (Express.js + WebSocket)

  • Port: 3001 (proxied through frontend)
  • Entry: frontend/server.js
  • Real-time: WebSocket support for processing updates
  • API Pattern: All endpoints prefixed with /api/

Python Processing Pipeline

  • Main Script: scripts/podcast2jsonl.py - Audio transcription using Gemini API with speaker diarization
  • Speaker Extraction Tool: scripts/extract_speaker_audio.py - Standalone 30-second speaker audio clip extraction
  • Output: JSONL transcripts with speaker identification and timestamps, plus high-quality WAV speaker clips

Data Flow

  1. Content Discovery: User searches podcasts or pastes YouTube URLs
  2. Processing Pipeline:
    • Audio download (yt-dlp/HTTP)
    • Audio normalization (ffmpeg to mono 16kHz)
    • Speaker identification (Gemini API)
    • Parallel bio generation (OpenAI o3) and transcription (Gemini API)
    • Summary generation (Gemini API)
    • System prompt generation for AI chat
  3. Interactive Experience: Real-time audio player with synchronized transcript and voice AI chat

Processing Architecture

The system uses chunked processing for long-form content:

  • Audio split into 20-minute chunks with 30-second overlap
  • Real-time progress updates via WebSocket
  • Streaming chunk completion markers for incremental loading
  • Parallel bio generation (OpenAI o3 + Anthropic Claude) during transcription
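The chunk boundaries implied by those parameters can be sketched as follows (hypothetical helper; the actual pipeline code may differ):

```python
def chunk_bounds(duration_s: float, chunk_s: int = 1200, overlap_s: int = 30):
    """Return (start, end) second offsets for 20-minute chunks with 30 s overlap."""
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # next chunk re-covers the last 30 seconds
    return bounds
```

For a 50-minute (3000 s) file this yields (0, 1200), (1170, 2370), (2340, 3000), so every chunk boundary is transcribed twice and can be stitched without losing words.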

File Structure & Key Locations

Generated Content

  • Podcast Storage: generated_podcasts/ directory
  • Speaker Clips Storage: Configurable output directories (default: ./speaker_clips/)
  • Naming Pattern: {Episode_Title}_transcript.jsonl, {Episode_Title}_metadata.json, etc.
  • Speaker Clips: {Episode_Title}_{Speaker_Name}_30s.wav + extraction metadata JSON
  • System Prompts: Generated for each episode to enable contextual AI chat

Configuration

  • Main Config: config.yml (legacy JRE configuration + ElevenLabs agent settings)
  • Environment: frontend/.env (optional API keys)
  • Package Management: frontend/package.json contains both frontend and backend dependencies

Core React Components

  • PodcastSearch.jsx: Content discovery interface
  • ProcessingProgress.jsx: Real-time processing status with WebSocket
  • Player.jsx: Audio player with transcript synchronization
  • VoiceCallModal.jsx: ElevenLabs voice agent integration
  • Transcript.jsx: Interactive transcript with click-to-seek

API Integration

External APIs

  • Gemini API: Audio transcription, speaker identification, content summarization
  • OpenAI o3: Speaker biography generation with web search
  • Anthropic Claude: Bio content formatting and cleaning
  • ElevenLabs: Voice AI agents for conversational experience
  • Podcast Index API: Podcast search and metadata (credentials included in codebase)

API Endpoints (Backend)

  • POST /api/search-podcasts - Search Podcast Index API
  • POST /api/get-episodes - Get episodes for specific podcast
  • POST /api/process-podcast - Initiate processing pipeline
  • GET /api/status/:sessionId - Processing status
  • GET /api/generated-podcasts - List processed content
  • WebSocket /ws - Real-time progress updates

Speaker Audio Extraction Tool

Overview

The extract_speaker_audio.py tool extracts high-quality 30-second audio clips per speaker, optimized for voice cloning and speaker analysis, without requiring the full processing pipeline.

Key Features

  • AI-Powered Speaker Analysis: Uses Gemini 2.5 Pro to identify speakers and optimal speaking regions
  • Multiple Input Sources: Supports YouTube URLs, podcast URLs, and local audio files
  • Quality Scoring System: Multi-criteria scoring (0-100) based on energy, voice activity, and spectral quality
  • Professional Audio Processing: 44.1kHz mono WAV output with fade processing
  • Comprehensive Metadata: Detailed extraction logs with quality metrics and processing methods

Usage Examples

# YouTube video extraction
python scripts/extract_speaker_audio.py "https://www.youtube.com/watch?v=VIDEO_ID" --output-dir ./clips

# Podcast URL extraction  
python scripts/extract_speaker_audio.py "https://media.example.com/podcast.mp3" --output-dir ./clips

# Local audio file extraction
python scripts/extract_speaker_audio.py "/path/to/audio.mp3" --output-dir ./clips

# With API key specification
python scripts/extract_speaker_audio.py "SOURCE_URL" --api-key "your-gemini-key" --output-dir ./clips
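The CLI surface used above could be declared with argparse roughly as follows (a sketch inferred from the usage examples, not the script's actual code):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Extract 30-second speaker clips from an audio source."
    )
    parser.add_argument("source", help="YouTube URL, podcast URL, or local audio path")
    parser.add_argument("--output-dir", default="./speaker_clips/",
                        help="Directory for WAV clips and metadata JSON")
    parser.add_argument("--api-key", default=None,
                        help="Gemini API key (falls back to GEMINI_API_KEY)")
    return parser
```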

Technical Architecture

The tool leverages the existing podcast processing infrastructure:

  • Audio Processing: Reuses proven download and normalization functions from main pipeline
  • AI Integration: Integrates with Gemini 2.5 Pro's 1M context window for speaker analysis
  • Signal Analysis: Uses librosa for precision audio quality assessment
  • Logging: Structured logging with stage-specific progress tracking

Output Structure

./speaker_clips/
├── Episode_Title_Speaker_Name_30s.wav     # High-quality WAV files
└── Episode_Title_extraction_metadata.json # Comprehensive processing metadata

Quality Standards

  • Energy Score (30%): RMS energy levels in optimal speech range
  • Voice Activity (25%): Percentage of frames with active speech
  • Spectral Quality (20%): Spectral centroid in speech frequency range
  • Silence Penalty (15%): Deduction for long silence gaps (>2 seconds)
  • Zero Crossing Rate (10%): Optimal range for human speech
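Assuming each criterion is normalized to 0-100 (with the silence criterion expressed as a score where 100 means no long gaps), the weighted combination can be sketched as:

```python
def quality_score(energy, voice_activity, spectral, silence, zero_crossing):
    """Combine per-criterion scores (each 0-100) using the documented weights.

    `silence` is assumed here to already be a score (100 = no long gaps), so
    the 15% silence penalty enters as a weighted term like the others.
    """
    weights = {
        "energy": 0.30, "voice_activity": 0.25, "spectral": 0.20,
        "silence": 0.15, "zero_crossing": 0.10,
    }
    return (weights["energy"] * energy
            + weights["voice_activity"] * voice_activity
            + weights["spectral"] * spectral
            + weights["silence"] * silence
            + weights["zero_crossing"] * zero_crossing)
```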

Integration with Main Platform

While standalone, the tool shares core components with the main podcast processing pipeline:

  • subprocess_utils.py: Enhanced subprocess monitoring for yt-dlp and ffmpeg
  • logger_config.py: Structured logging configuration
  • api_health_monitor.py: API health monitoring and retry strategies

Development Patterns

Error Handling & Logging

  • Structured Logging: Winston with daily log rotation (logs/ directory)
  • Component Loggers: API, WebSocket, and processing-specific loggers
  • Error Recovery: Retry strategies with exponential backoff for API calls
  • Health Monitoring: API health checks before processing

State Management

  • URL-Based State: Critical application state stored in URL parameters
  • Session Storage: Complex objects backed up in sessionStorage
  • WebSocket State: Real-time processing updates
  • React Router: Browser navigation with history support

Processing Pipeline Monitoring

  • Progress Markers: Sparse updates (5% → 85% → 100%)
  • Chunk Streaming: @@CHUNK_READY@@ markers for incremental transcript loading
  • Completion Signals: @@PROCESS_COMPLETE@@ markers
  • Resource Monitoring: Memory and duration tracking for subprocess operations
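A consumer of the subprocess's stdout stream might watch for those markers like this (a sketch; the payload carried alongside each marker is assumed):

```python
def scan_markers(lines):
    """Yield ('chunk', line) per chunk marker; stop at the completion marker."""
    for line in lines:
        if "@@PROCESS_COMPLETE@@" in line:
            yield ("complete", line)
            return
        if "@@CHUNK_READY@@" in line:
            yield ("chunk", line)
```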

Important Implementation Notes

Transcript Format

  • Input: Speaker-diarized text with timestamps [HH:MM:SS] Speaker: Text
  • Output: JSONL format with structured speaker/timestamp/text entries
  • Features: Preserves stutters, emphasis, interruptions, and vocal nuances
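The input-to-JSONL mapping can be illustrated with a small parser (a sketch; the field names in the real JSONL output may differ):

```python
import json
import re

LINE_RE = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]\s*([^:]+):\s*(.*)")

def parse_transcript_line(line: str) -> dict:
    """Turn '[HH:MM:SS] Speaker: Text' into a structured entry."""
    m = LINE_RE.match(line)
    if not m:
        raise ValueError(f"unrecognized transcript line: {line!r}")
    h, mnt, s, speaker, text = m.groups()
    return {
        "timestamp": int(h) * 3600 + int(mnt) * 60 + int(s),
        "speaker": speaker.strip(),
        "text": text,
    }

def to_jsonl(lines):
    return "\n".join(json.dumps(parse_transcript_line(l)) for l in lines)
```

Note that the text field is kept verbatim, so stutters and emphasis survive the conversion.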

Voice AI Integration

  • Context Extraction: Last N minutes of transcript sent to AI agents
  • System Prompts: Dynamically generated based on episode content and speaker biographies
  • Persona Embodiment: AI speaks as the primary speaker from the episode
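Extracting the last N minutes of transcript can be sketched as follows (assumes entries carry a `timestamp` field in seconds, per the transcript format; the real context-extraction code may differ):

```python
def recent_context(entries, current_time_s, window_min=5):
    """Return transcript entries whose timestamp falls within the last N minutes."""
    cutoff = current_time_s - window_min * 60
    return [e for e in entries if cutoff <= e["timestamp"] <= current_time_s]
```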

Performance Considerations

  • Chunked Processing: 20-minute chunks to avoid API timeouts
  • Parallel Operations: Bio generation runs concurrently with transcription
  • Resource Limits: Speaker bio generation limited to first 4 speakers
  • Timeout Handling: Comprehensive timeout strategies for long-running operations

Testing and Validation

Use ./test-setup.sh to validate:

  • System dependencies (Node.js 18+, Python 3.8+, ffmpeg)
  • Package installations
  • API configurations
  • Directory structure

The script provides color-coded status output and specific remediation steps for any issues found.