System Architecture Overview

Content Delivery vs. Background Processing Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                            CONTENT SOURCES                                      │
├─────────────────────────┬───────────────────────┬───────────────────────────────┤
│    YouTube Videos       │    Podcast Index API  │    Direct Podcast URLs        │
│   (youtube.com/...)     │   (podcastindex.org)  │   (RSS feeds/MP3 URLs)        │
└─────────┬───────────────┼───────────────────────┼─────────────────┬─────────────┘
          │               │                       │                 │
    ┌─────┴─────┐   ┌─────┴─────┐           ┌─────┴─────┐     ┌─────┴─────┐
    │ STREAMING │   │ STREAMING │           │ STREAMING │     │ STREAMING │
    │ TO USER   │   │ TO USER   │           │ TO USER   │     │ TO USER   │
    └─────┬─────┘   └─────┬─────┘           └─────┬─────┘     └─────┬─────┘
          │               │                       │                 │
          │         ┌─────┴─────┐           ┌─────┴─────┐           │
          │         │BACKGROUND │           │BACKGROUND │           │
          │         │PROCESSING │           │PROCESSING │           │
          │         │(COPYING)  │           │(COPYING)  │           │
          │         └─────┬─────┘           └─────┬─────┘           │
          │               │                       │                 │
          ▼               ▼                       ▼                 ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                     USER INTERFACE (React Frontend)                             │
│                                                                                 │
│  USER SEES:                          BACKGROUND PROCESSING:                     │
│  • YouTube embedded player          • Audio downloaded via yt-dlp               │
│  • Direct podcast stream links      • Audio processed for transcription         │
│  • Interactive transcript overlay   • AI analysis of downloaded audio           │
│  • AI chat powered by transcript    • Generated content stored locally          │
└─────────────────────────┬───────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│              DUAL-TRACK ARCHITECTURE: STREAM + PROCESS                          │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│  TRACK 1: CONTENT DELIVERY TO USER (NO COPYING)                                 │
│  ┌─────────────────────────────────────────────────────────────────────────┐    │
│  │  • YouTube: Native embed player (YouTube serves content)                │    │
│  │  • Podcasts: Direct stream from RSS feed/MP3 URL                        │    │
│  │  • No local storage of original audio/video for user consumption        │    │
│  └─────────────────────────────────────────────────────────────────────────┘    │
│                                                                                 │
│  TRACK 2: BACKGROUND PROCESSING (COPYING FOR AI FEATURES)                       │
│  ┌─────────────────────────────────────────────────────────────────────────┐    │
│  │  ⚠️ COPYING OCCURS HERE:                                                │    │
│  │  • yt-dlp downloads audio for transcript generation                     │    │
│  │  • ffmpeg processes audio (mono 16kHz, chunking)                        │    │
│  │  • Gemini API transcribes downloaded audio                              │    │
│  │  • AI generates summaries, bios, chat context                           │    │
│  │  • Original downloaded audio deleted after processing                   │    │
│  └─────────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────┬───────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                     WHAT WE STORE LOCALLY                                       │
├─────────────────────────────────────────────────────────────────────────────────┤
│  • Transcripts (text derived from audio)                                        │
│  • AI-generated summaries (transformative content)                              │
│  • Speaker biographies (AI-generated)                                           │
│  • Chat system prompts (AI-generated)                                           │
│  • Metadata (episode info, timestamps)                                          │
│                                                                                 │
│  ❌ WE DO NOT STORE:                                                            │
│  • Original audio files (deleted after transcription)                           │
│  • Video files                                                                  │
│  • Podcast episodes for user playback                                           │
└─────────────────────────────────────────────────────────────────────────────────┘

Legal Evaluation Notes

Copyright-Sensitive Operations Summary

What Content We Copy:

  • Audio from YouTube videos and podcasts is temporarily downloaded using the yt-dlp tool
  • Audio is processed through ffmpeg for format conversion and segmentation
  • Purpose: Generate AI transcripts, summaries, and interactive features
  • Retention: Original audio files are deleted after AI processing completes

What Content We Stream Only:

  • YouTube Videos: Users view content through YouTube's native embedded player; we never store video files
  • Podcast Episodes: Users listen through direct streaming from RSS feeds or original MP3 URLs
  • User Experience: All audio/video consumption happens via original source streaming

What We Store Permanently:

  • Text transcripts (derived/transformative content from audio)
  • AI-generated summaries and speaker biographies
  • Metadata (episode titles, timestamps, speaker names)
  • Chat system prompts for AI interaction
  • High-quality 30-second speaker audio clips (standalone extraction tool output)

What We Do NOT Store:

  • Original audio files (deleted post-processing)
  • Video files of any kind
  • Podcast episodes for user playback

Detailed Content Flow for Legal Review

1. How Users Access Content

YouTube Video Experience:

  • User Interface: YouTube's native embedded player within our interface
  • Content Delivery: YouTube serves all video/audio content directly to user
  • Our Role: Provide interactive transcript overlay and AI chat features
  • No Copying for User: User watches/listens via YouTube's streaming infrastructure

Podcast Episode Experience:

  • User Interface: HTML5 audio player with direct RSS feed or MP3 URL
  • Content Delivery: Original podcast host serves audio content to user
  • Our Role: Provide interactive transcript, summaries, and AI chat features
  • No Copying for User: User listens via streaming from original podcast source

Interactive Features:

  • Real-time transcript highlighting synchronized with streaming audio
  • Click-to-seek navigation within transcript
  • AI voice chat based on transcript content
  • Speaker biography information displayed alongside content

2. Background Processing Architecture (Where Copying Occurs)

Audio Download Process:

Content URL → yt-dlp download → Temporary local audio file → AI processing → File deletion

Step-by-Step Copying Process:

  1. Download Phase (⚠️ Copyright-sensitive):

    • yt-dlp tool downloads audio from YouTube or podcast URLs
    • Audio saved temporarily to local filesystem for processing
    • Alternative HTTP download method for direct MP3 URLs
  2. Processing Phase:

    • ffmpeg converts audio to mono 16kHz format (optimal for AI transcription)
    • Audio split into 20-minute chunks with 30-second overlap
    • Processed audio files remain on local system during AI analysis
  3. AI Analysis Phase:

    • Google Gemini 2.5 Pro API receives processed audio for transcription
    • Speaker identification and diarization performed
    • Content summarization generated
    • OpenAI and Claude APIs generate speaker biographies
  4. Speaker Audio Extraction (⚠️ Additional copying for standalone tool):

    • extract_speaker_audio.py creates high-quality 30-second speaker clips
    • Uses Gemini 2.5 Pro for optimal speaking region identification
    • Applies professional audio processing (44.1kHz mono WAV with fade processing)
    • Generated for voice cloning applications and speaker analysis
  5. Cleanup Phase:

    • Original downloaded audio files deleted after transcript generation
    • Only derived text content (transcripts, summaries) retained
    • Speaker audio clips (30-second processed segments) may be retained
    • No permanent storage of full source audio material
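The five phases above can be sketched as a single pipeline. This is a minimal illustration, not the actual implementation: file names, the duration parameter, and the overall structure are assumptions, though the tool invocations (`yt-dlp -x` for audio-only download, `ffmpeg -ac 1 -ar 16000` for mono 16 kHz conversion, `-ss`/`-t` for chunk extraction) match standard usage of those tools. The `finally` block mirrors the cleanup guarantee: downloaded audio never outlives processing.

```python
import os
import subprocess
import tempfile

CHUNK_SECONDS = 20 * 60   # 20-minute chunks, per the processing phase
OVERLAP_SECONDS = 30      # 30-second overlap between consecutive chunks


def chunk_spans(total_seconds):
    """Compute (start, duration) spans covering the audio with overlap."""
    spans, start = [], 0
    while start < total_seconds:
        spans.append((start, min(CHUNK_SECONDS, total_seconds - start)))
        start += CHUNK_SECONDS - OVERLAP_SECONDS
    return spans


def process_episode(url, duration_seconds):
    """Download, convert, chunk, transcribe, then delete the source audio."""
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "episode.m4a")
    mono = os.path.join(workdir, "episode_16k.wav")
    try:
        # Download phase (copyright-sensitive): fetch audio only.
        subprocess.run(["yt-dlp", "-x", "-o", src, url], check=True)
        # Processing phase: mono 16 kHz is the transcription-friendly format.
        subprocess.run(
            ["ffmpeg", "-i", src, "-ac", "1", "-ar", "16000", mono], check=True
        )
        for i, (start, dur) in enumerate(chunk_spans(duration_seconds)):
            chunk = os.path.join(workdir, f"chunk_{i:03d}.wav")
            subprocess.run(
                ["ffmpeg", "-i", mono, "-ss", str(start), "-t", str(dur), chunk],
                check=True,
            )
            # ... AI analysis phase: send each chunk for transcription here ...
    finally:
        # Cleanup phase: delete downloaded/processed audio unconditionally.
        for path in (src, mono):
            if os.path.exists(path):
                os.remove(path)
```

For a 70-minute episode, `chunk_spans` yields four spans starting at 0, 1170, 2340, and 3510 seconds: each chunk begins 30 seconds before the previous one ends, so no speech is lost at chunk boundaries.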

3. Data Storage and Retention

Generated Files Stored Long-term:

generated_podcasts/
├── {Episode_Title}_transcript.jsonl      # Text transcript with speaker/timestamp data
├── {Episode_Title}_metadata.json         # Episode metadata, speaker names, duration
├── {Episode_Title}_summary.txt           # AI-generated episode summary
├── {Episode_Title}_system_prompt.txt     # AI chat system context
└── {Episode_Title}_transcript_summary_and_toc.txt  # Table of contents

speaker_clips/                            # Standalone speaker extraction output
├── {Episode_Title}_{Speaker_Name}_30s.wav # High-quality 30-second speaker clips
└── {Episode_Title}_extraction_metadata.json # Extraction processing metadata
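One line of the transcript JSONL file might look like the following. The field names here are hypothetical, chosen to match the "speaker/timestamp data" description above; the actual schema may differ.

```python
import json

# Hypothetical shape of one line in {Episode_Title}_transcript.jsonl:
# each line is a standalone JSON object with speaker and timestamp data.
segment = {
    "speaker": "Jane Doe",
    "start": 12.4,   # seconds from episode start
    "end": 18.9,
    "text": "So the key idea here is...",
}
line = json.dumps(segment)

# Reading the file back is one json.loads() call per line.
restored = json.loads(line)
print(restored["speaker"])  # → Jane Doe
```

The one-object-per-line layout lets the frontend stream and highlight segments without parsing the whole file at once.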

File Content Analysis:

  • Transcripts: Text representations of spoken content with speaker identification
  • Metadata: Factual information (titles, timestamps, speaker names)
  • Summaries: AI-generated transformative content describing episode themes
  • System Prompts: AI-generated context for chat functionality
  • Biographies: AI-researched and generated speaker background information
  • Speaker Audio Clips: 30-second processed audio segments optimized for voice applications

Retention Policy:

  • Generated text files stored indefinitely for user access
  • Original audio files deleted immediately after AI processing
  • No backup or archival of source audio material

4. External API Usage and Data Flow

Content Discovery APIs:

  • Podcast Index API: Search publicly available podcast metadata and RSS feeds
  • No Content Copying: Only metadata and RSS feed URLs obtained

AI Processing APIs:

  • Google Gemini 2.5 Pro: Receives processed audio for transcription services
  • OpenAI o3: Processes transcript text for speaker biography generation
  • Anthropic Claude: Processes and formats AI-generated content
  • ElevenLabs: Provides AI voice agents using transcript context

Data Transmission:

  • Processed audio sent to Gemini API for transcription (temporary processing)
  • Text transcripts sent to OpenAI/Claude for biography generation
  • No raw audio transmitted to biography generation APIs

5. Technical Implementation for Copyright Compliance

Separation of User Experience and Processing:

  • Users consume content via original streaming sources (YouTube, podcast hosts)
  • Background AI processing occurs independently from user content consumption
  • Generated features (transcripts, chat) enhance but don't replace original content

Temporary vs. Permanent Storage:

  • Audio copying is temporary and solely for AI feature generation
  • Permanent storage limited to derived/transformative text content
  • No mechanism for users to access downloaded audio files

Access Control:

  • Generated transcript files served only to users of our platform
  • No public distribution of derived content outside our application
  • User access tied to original content source availability

This architecture ensures that users access copyrighted content through original streaming sources while enabling AI-powered interactive features through temporary processing and permanent storage of derived text content only.