┌─────────────────────────────────────────────────────────────────────────────────┐
│ CONTENT SOURCES │
├─────────────────────────┬───────────────────────┬───────────────────────────────┤
│ YouTube Videos │ Podcast Index API │ Direct Podcast URLs │
│ (youtube.com/...) │ (podcastindex.org) │ (RSS feeds/MP3 URLs) │
└─────────┬───────────────┼───────────────────────┼─────────────────┬─────────────┘
│ │ │ │
┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐
│ STREAMING │ │ STREAMING │ │ STREAMING │ │ STREAMING │
│ TO USER │ │ TO USER │ │ TO USER │ │ TO USER │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │ │
│ ┌─────┴─────┐ ┌─────┴─────┐ │
│ │BACKGROUND │ │BACKGROUND │ │
│ │PROCESSING │ │PROCESSING │ │
│ │(COPYING) │ │(COPYING) │ │
│ └─────┬─────┘ └─────┬─────┘ │
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ USER INTERFACE (React Frontend) │
│ │
│ USER SEES: BACKGROUND PROCESSING: │
│ • YouTube embedded player • Audio downloaded via yt-dlp │
│ • Direct podcast stream links • Audio processed for transcription │
│ • Interactive transcript overlay • AI analysis of downloaded audio │
│ • AI chat powered by transcript • Generated content stored locally │
└─────────────────────────┬───────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ DUAL-TRACK ARCHITECTURE: STREAM + PROCESS │
├─────────────────────────────────────────────────────────────────────────────────┤
│ │
│ TRACK 1: CONTENT DELIVERY TO USER (NO COPYING) │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ • YouTube: Native embed player (YouTube serves content) │ │
│ │ • Podcasts: Direct stream from RSS feed/MP3 URL │ │
│ │ • No local storage of original audio/video for user consumption │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
│ TRACK 2: BACKGROUND PROCESSING (COPYING FOR AI FEATURES) │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ ⚠️ COPYING OCCURS HERE: │ │
│ │ • yt-dlp downloads audio for transcript generation │ │
│ │ • ffmpeg processes audio (mono 16kHz, chunking) │ │
│ │ • Gemini API transcribes downloaded audio │ │
│ │ • AI generates summaries, bios, chat context │ │
│ │ • Original downloaded audio deleted after processing │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────┬───────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ WHAT WE STORE LOCALLY │
├─────────────────────────────────────────────────────────────────────────────────┤
│ • Transcripts (text derived from audio) │
│ • AI-generated summaries (transformative content) │
│ • Speaker biographies (AI-generated) │
│ • Chat system prompts (AI-generated) │
│ • Metadata (episode info, timestamps) │
│ │
│ ❌ WE DO NOT STORE: │
│ • Original audio files (deleted after transcription) │
│ • Video files │
│ • Podcast episodes for user playback │
└─────────────────────────────────────────────────────────────────────────────────┘
What Content We Copy:
- Audio from YouTube videos and podcasts is temporarily downloaded using the yt-dlp tool
- Audio is processed through ffmpeg for format conversion and segmentation
- Purpose: Generate AI transcripts, summaries, and interactive features
- Duration: Original audio files are deleted after AI processing completes
What Content We Stream Only:
- YouTube Videos: Users view content through YouTube's native embedded player - we never store video
- Podcast Episodes: Users listen through direct streaming from RSS feeds or original MP3 URLs
- User Experience: All audio/video consumption happens via original source streaming
What We Store Permanently:
- Text transcripts (derived/transformative content from audio)
- AI-generated summaries and speaker biographies
- Metadata (episode titles, timestamps, speaker names)
- Chat system prompts for AI interaction
- High-quality 30-second speaker audio clips (standalone extraction tool output)
What We Do NOT Store:
- Original audio files (deleted post-processing)
- Video files of any kind
- Podcast episodes for user playback
YouTube Video Experience:
- User Interface: YouTube's native embedded player within our interface
- Content Delivery: YouTube serves all video/audio content directly to user
- Our Role: Provide interactive transcript overlay and AI chat features
- No Copying for User: User watches/listens via YouTube's streaming infrastructure
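Embedded playback depends on pointing the iframe at YouTube's standard embed URL scheme so that YouTube, not our servers, delivers the stream. A minimal sketch of deriving that URL from a watch link (the helper name is ours, not part of the codebase):

```python
from urllib.parse import urlparse, parse_qs

def embed_url(watch_url: str) -> str:
    """Derive the iframe embed URL so YouTube itself serves the stream.

    Handles both youtube.com/watch?v=... and youtu.be/... short links.
    """
    parsed = urlparse(watch_url)
    # Prefer the ?v= query parameter; fall back to the last path segment
    # for youtu.be short links.
    video_id = parse_qs(parsed.query).get("v", [""])[0] \
        or parsed.path.rstrip("/").split("/")[-1]
    return f"https://www.youtube.com/embed/{video_id}"

print(embed_url("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
```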
Podcast Episode Experience:
- User Interface: HTML5 audio player with direct RSS feed or MP3 URL
- Content Delivery: Original podcast host serves audio content to user
- Our Role: Provide interactive transcript, summaries, and AI chat features
- No Copying for User: User listens via streaming from original podcast source
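Resolving the direct stream URL for the HTML5 player amounts to reading the `<enclosure>` element from each RSS item, which is where podcast feeds publish the hosted MP3 URL. A sketch, assuming a standard RSS 2.0 feed (the sample feed and helper name are illustrative):

```python
import xml.etree.ElementTree as ET

def extract_enclosure_urls(rss_xml: str) -> list[dict]:
    """Pull episode titles and direct MP3 enclosure URLs from a podcast RSS feed.

    The HTML5 <audio> element is pointed at these URLs directly, so playback
    streams from the original podcast host and nothing is stored locally.
    """
    root = ET.fromstring(rss_xml)
    episodes = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        enclosure = item.find("enclosure")  # <enclosure url="..." type="audio/mpeg"/>
        if enclosure is not None and enclosure.get("url"):
            episodes.append({"title": title, "url": enclosure.get("url")})
    return episodes

# Minimal hypothetical feed for illustration:
sample_feed = """<rss version="2.0"><channel>
  <item><title>Episode 1</title>
    <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="1234"/>
  </item>
</channel></rss>"""

print(extract_enclosure_urls(sample_feed))
```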
Interactive Features:
- Real-time transcript highlighting synchronized with streaming audio
- Click-to-seek navigation within transcript
- AI voice chat based on transcript content
- Speaker biography information displayed alongside content
Audio Download Process:
Content URL → yt-dlp download → Temporary local audio file → AI processing → File deletion
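The download-then-delete lifecycle above can be sketched as follows. The yt-dlp flags shown (`-x` for audio-only extraction, `-o` for the output template) are standard, but the exact options and helper names here are illustrative, not the pipeline's literal code:

```python
import os
import tempfile

def build_ytdlp_command(url: str, out_template: str) -> list[str]:
    """Assemble a yt-dlp invocation for the temporary audio download.

    -x extracts audio only (no video is ever saved); -o sets the output
    template for the temporary file.
    """
    return ["yt-dlp", "-x", "--audio-format", "mp3", "-o", out_template, url]

def process_then_delete(audio_path: str, process) -> None:
    """Run AI processing on the temporary file, then delete it no matter what."""
    try:
        process(audio_path)
    finally:
        if os.path.exists(audio_path):
            os.remove(audio_path)

# Demonstrate the lifecycle with a placeholder file and a no-op "pipeline":
fd, tmp = tempfile.mkstemp(suffix=".mp3")
os.close(fd)
process_then_delete(tmp, lambda p: None)  # stand-in for transcription/analysis
print(os.path.exists(tmp))  # False: the temporary copy is gone after processing
```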
Step-by-Step Copying Process:
1. Download Phase (⚠️ Copyright-sensitive):
   - The yt-dlp tool downloads audio from YouTube or podcast URLs
   - Audio saved temporarily to local filesystem for processing
   - Alternative HTTP download method for direct MP3 URLs
2. Processing Phase:
   - ffmpeg converts audio to mono 16kHz format (optimal for AI transcription)
   - Audio split into 20-minute chunks with 30-second overlap
   - Processed audio files remain on local system during AI analysis
3. AI Analysis Phase:
   - Google Gemini 2.5 Pro API receives processed audio for transcription
   - Speaker identification and diarization performed
   - Content summarization generated
   - OpenAI and Claude APIs generate speaker biographies
4. Speaker Audio Extraction (⚠️ Additional copying for standalone tool):
   - extract_speaker_audio.py creates high-quality 30-second speaker clips
   - Uses Gemini 2.5 Pro for optimal speaking region identification
   - Applies professional audio processing (44.1kHz mono WAV with fade processing)
   - Generated for voice cloning applications and speaker analysis
5. Cleanup Phase:
   - Original downloaded audio files deleted after transcript generation
   - Only derived text content (transcripts, summaries) retained
   - Speaker audio clips (30-second processed segments) may be retained
   - No permanent storage of full source audio material
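The processing phase's chunking scheme (20-minute chunks, 30-second overlap so no speech is lost at chunk edges) can be sketched as below. The ffmpeg flags (`-ac 1` for mono, `-ar 16000` for 16 kHz, `-ss`/`-t` for the chunk window) are standard ffmpeg options, but the helper names are ours:

```python
def chunk_boundaries(duration_s: float, chunk_s: float = 20 * 60,
                     overlap_s: float = 30) -> list[tuple[float, float]]:
    """Compute (start, end) times for transcription chunks.

    Each chunk is chunk_s long; the next chunk starts overlap_s before the
    previous one ends, so chunk-edge speech appears in both transcripts.
    """
    step = chunk_s - overlap_s
    bounds, start = [], 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        start += step
    return bounds

def ffmpeg_chunk_command(src: str, dst: str, start: float, end: float) -> list[str]:
    """ffmpeg invocation for one chunk: mono, 16 kHz, windowed by -ss/-t."""
    return ["ffmpeg", "-i", src, "-ac", "1", "-ar", "16000",
            "-ss", str(start), "-t", str(end - start), dst]

# A 45-minute episode yields three overlapping chunks:
print(chunk_boundaries(45 * 60))
```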
Generated Files Stored Long-term:
generated_podcasts/
├── {Episode_Title}_transcript.jsonl # Text transcript with speaker/timestamp data
├── {Episode_Title}_metadata.json # Episode metadata, speaker names, duration
├── {Episode_Title}_summary.txt # AI-generated episode summary
├── {Episode_Title}_system_prompt.txt # AI chat system context
└── {Episode_Title}_transcript_summary_and_toc.txt # Table of contents
speaker_clips/ # Standalone speaker extraction output
├── {Episode_Title}_{Speaker_Name}_30s.wav # High-quality 30-second speaker clips
└── {Episode_Title}_extraction_metadata.json # Extraction processing metadata
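The `_transcript.jsonl` format stores one JSON object per line, which keeps large transcripts streamable. A reading sketch (the field names in the sample are assumptions; the actual schema may differ):

```python
import json

# Hypothetical transcript lines; real field names in the .jsonl may differ.
sample_jsonl = "\n".join([
    json.dumps({"speaker": "Host", "start": 0.0, "end": 4.2, "text": "Welcome."}),
    json.dumps({"speaker": "Guest", "start": 4.2, "end": 9.8,
                "text": "Thanks for having me."}),
])

def load_transcript(jsonl_text: str) -> list[dict]:
    """Parse one JSON object per line into a list of segment dicts."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

segments = load_transcript(sample_jsonl)
print(segments[1]["speaker"])  # Guest
```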
File Content Analysis:
- Transcripts: Text representations of spoken content with speaker identification
- Metadata: Factual information (titles, timestamps, speaker names)
- Summaries: AI-generated transformative content describing episode themes
- System Prompts: AI-generated context for chat functionality
- Biographies: AI-researched and generated speaker background information
- Speaker Audio Clips: 30-second processed audio segments optimized for voice applications
Retention Policy:
- Generated text files stored indefinitely for user access
- Original audio files deleted immediately after AI processing
- No backup or archival of source audio material
Content Discovery APIs:
- Podcast Index API: Search publicly available podcast metadata and RSS feeds
- No Content Copying: Only metadata and RSS feed URLs obtained
AI Processing APIs:
- Google Gemini 2.5 Pro: Receives processed audio for transcription services
- OpenAI o3: Processes transcript text for speaker biography generation
- Anthropic Claude: Processes and formats AI-generated content
- ElevenLabs: Provides AI voice agents using transcript context
Data Transmission:
- Processed audio sent to Gemini API for transcription (temporary processing)
- Text transcripts sent to OpenAI/Claude for biography generation
- No raw audio transmitted to biography generation APIs
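The transmission rule above (audio only to the transcription API; text-only payloads to the biography APIs) can be made explicit as a dispatch policy. A sketch under those assumptions; the function and payload shapes are illustrative, not actual API request formats:

```python
def payload_for(api: str, audio_path: str, transcript_text: str) -> dict:
    """Enforce the data-transmission rule: audio goes only to the
    transcription API; biography/formatting APIs receive text only."""
    if api == "gemini":
        return {"audio": audio_path, "task": "transcription"}
    if api in ("openai", "claude"):
        return {"text": transcript_text, "task": "biography"}
    raise ValueError(f"no payload defined for API: {api}")
```

This centralizes the guarantee that no raw audio can reach the biography-generation APIs, rather than relying on each call site to remember it.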
Separation of User Experience and Processing:
- Users consume content via original streaming sources (YouTube, podcast hosts)
- Background AI processing occurs independently from user content consumption
- Generated features (transcripts, chat) enhance but don't replace original content
Temporary vs. Permanent Storage:
- Audio copying is temporary and solely for AI feature generation
- Permanent storage limited to derived/transformative text content
- No mechanism for users to access downloaded audio files
Access Control:
- Generated transcript files served only to users of our platform
- No public distribution of derived content outside our application
- User access tied to original content source availability
This architecture ensures that users access copyrighted content only through its original streaming sources, while AI-powered interactive features are enabled by temporary audio processing and permanent storage of derived text content only.