Skip to content

Latest commit

 

History

History
300 lines (244 loc) · 12.7 KB

File metadata and controls

300 lines (244 loc) · 12.7 KB

Technical Explanation

1. Agent Workflow

PodVibe.fm uses an agentic AI architecture that processes YouTube podcast URLs through a structured ReAct (Reasoning + Acting) pattern. Here's the step-by-step workflow:

Step 1: Receive User Input

  • User provides a YouTube video URL through the web interface (React frontend) or Streamlit UI
  • The request is routed to the Flask API backend (/api/summarize endpoint)
  • Input is validated to ensure it's a valid YouTube URL format

Step 2: Planning Phase (ReAct: Reasoning)

The Planner (planner.py) creates an execution plan by breaking down the summarization task into discrete sub-tasks:

  1. Extract Video ID - Parse the YouTube URL to extract the unique video identifier
  2. Fetch Transcript - Retrieve the full transcript from YouTube using the YouTube Transcript API
  3. Generate Summary - Use Google Gemini API to create an intelligent summary following the 80/20 principle (extracting the 20% of content that delivers 80% of the value)
  4. Extract Keywords - Use Gemini to identify 10 semantic keywords that represent core concepts
  5. Store Result - Save the final result with all metadata

Each task in the plan includes:

  • Step number and action type
  • Tool to be used (url_parser, youtube_api, gemini_api, etc.)
  • Status tracking (pending → in_progress → completed/failed)
  • Context from previous steps

Step 3: Execution Phase (ReAct: Acting)

The Executor (executor.py) sequentially executes each planned task:

  • Task 1: Extract Video ID (url_parser tool)

    • Parses various YouTube URL formats (youtube.com/watch, youtu.be, embed URLs)
    • Extracts the video ID using URL parsing logic
    • Returns video_id for use in subsequent steps
  • Task 2: Fetch Transcript (youtube_api tool)

    • Uses youtube-transcript-api library to fetch English transcripts
    • Retrieves transcript segments with timestamps for later use
    • Handles errors gracefully (missing transcripts, language issues)
    • Returns full transcript text and structured segments
  • Task 3: Generate Summary (gemini_api tool)

    • Sends transcript to Google Gemini 2.5 Flash model
    • Uses specialized prompts designed for "Learning Mode" that focus on:
      • Novel frameworks and mental models
      • Counter-intuitive insights
      • Actionable steps with clear guidance
      • Evidence-based conclusions
      • High-utility knowledge
    • Filters out low-value content (intros, ads, filler, repetitive content)
    • Supports multiple summary types: comprehensive, brief, key_points
    • Returns structured summary text
  • Task 4: Extract Keywords (keyword_extractor tool)

    • Uses Gemini to analyze the summary and extract exactly 10 semantic keywords
    • Focuses on meaningful concepts, not just frequent words
    • Prioritizes multi-word phrases that capture important ideas
    • Returns ordered list of keywords by importance
  • Task 5: Store Result (memory_store tool)

    • Prepares final result dictionary with all accumulated data
    • Includes video_id, transcript, summary, keywords, metadata
    • Ready for storage or API response

Step 4: Memory & Observability

The Memory (memory.py) component logs every step of the process:

  • Event Types Logged:

    • user_input - Initial user request
    • plan_created - Execution plan with all tasks
    • task_started - When each task begins
    • task_completed - Successful task completion with results
    • task_failed - Task failures with error details
    • final_result - Complete summary with all metadata
  • Memory Features:

    • Session-based tracking with unique session IDs
    • Timestamped events for full audit trail
    • Execution timeline for chronological analysis
    • Optional persistence to JSON files in logs/ directory
    • Session summaries with statistics (total events, completion rates)

Step 5: Return Final Output

The agent returns a comprehensive result dictionary containing:

  • Video ID and original URL
  • Full transcript text and structured segments with timestamps
  • AI-generated summary (comprehensive, brief, or key points)
  • 10 semantic keywords
  • Execution metadata (timestamps, plan summary)
  • Complete memory log for observability

2. Key Modules

Planner (planner.py)

Purpose: Breaks down high-level goals into executable sub-tasks

Key Methods:

  • create_plan(user_input) - Generates execution plan from user input
  • update_task_status(plan, step, status, result) - Updates task status as execution progresses
  • get_next_task(plan) - Returns the next pending task to execute
  • is_plan_complete(plan) - Checks if all tasks are finished
  • get_plan_summary(plan) - Provides statistics on plan execution

Design Pattern: Template-based planning with predefined task sequences that can be customized based on user input (e.g., summary_type).

Executor (executor.py)

Purpose: Executes planned tasks using appropriate tools and APIs

Key Tools:

  • url_parser - Extracts video ID from YouTube URLs
  • youtube_api - Fetches transcripts using youtube-transcript-api
  • gemini_api - Generates summaries using Google Gemini 2.5 Flash
  • keyword_extractor - Extracts semantic keywords using Gemini
  • keyword_timestamp_finder - Finds where keywords are discussed in video (uses Gemini to analyze transcript with timestamps)
  • memory_store - Prepares results for storage

Key Methods:

  • execute_task(task, context) - Main execution method that routes to appropriate tool
  • Each tool method handles its specific operation and error cases

Design Pattern: Tool registry pattern where each tool is a method that can be called by name. Tools receive task context and return structured results.

Memory (memory.py)

Purpose: Provides full observability into agent decisions and actions

Key Methods:

  • log_event(event_type, data) - Core logging method
  • log_plan_creation(plan) - Logs when plan is created
  • log_task_start(task) - Logs task initiation
  • log_task_completion(task, result) - Logs successful task completion
  • log_task_failure(task, error) - Logs task failures
  • get_memory(event_type) - Retrieves logged events (optionally filtered)
  • get_session_summary() - Provides session statistics
  • get_execution_timeline() - Returns chronological execution events
  • export_memory(filepath) - Exports memory to JSON file

Design Pattern: Event-driven logging with structured data. Supports both in-memory and persistent storage.

YouTube Summarizer (youtube_summarizer.py)

Purpose: Main orchestrator that coordinates Planner, Executor, and Memory

Key Methods:

  • process_youtube_url(url, summary_type) - Main workflow method that:
    1. Creates execution plan
    2. Executes tasks sequentially
    3. Updates context between tasks
    4. Logs all activities
    5. Returns final result
  • find_keyword_timestamp(video_id, keyword, transcript_segments) - Advanced feature to locate where topics are discussed
  • get_memory_log() - Access to full observability data

Design Pattern: Orchestrator pattern that coordinates the three core components while maintaining separation of concerns.

3. Tool Integration

Google Gemini API

Usage: Primary AI engine for summarization and keyword extraction

Integration Points:

  • Summary Generation: Uses gemini-2.5-flash model with specialized prompts

    • Comprehensive prompts designed for "Learning Mode" following 80/20 principle
    • Filters high-value content (frameworks, insights, actionable steps)
    • Excludes low-value content (intros, ads, filler)
  • Keyword Extraction: Analyzes summaries to extract 10 semantic keywords

    • Focuses on meaningful concepts and multi-word phrases
    • Orders by importance/relevance
  • Timestamp Finding: Analyzes transcript segments to locate where keywords/topics are discussed

    • Uses semantic understanding to find concepts even if exact words aren't used
    • Returns precise timestamps for video navigation

API Configuration:

  • API key from GEMINI_API_KEY environment variable
  • Configured using google.generativeai library
  • Model: gemini-2.5-flash (latest stable model)

YouTube Transcript API

Usage: Fetches video transcripts for summarization

Integration Points:

  • Uses youtube-transcript-api Python library
  • Fetches English transcripts by default
  • Retrieves structured segments with timestamps
  • Handles errors (missing transcripts, language issues)

Error Handling:

  • Graceful fallback when transcripts unavailable
  • Clear error messages for debugging

YouTube Data API (Optional)

Usage: Fetches trending podcast videos by category

Integration Points:

  • Used in trending.py for discovering content
  • Searches for podcasts uploaded within last 14 days
  • Filters by: English language, podcast tags, minimum 1 hour duration
  • Falls back to sample data if API key not configured

API Configuration:

  • API key from YOUTUBE_API_KEY environment variable
  • Optional - system works without it using sample data

4. Observability & Testing

Logging System

Location: logs/ directory (created automatically)

Log Format: JSON files with structure:

{
  "session_summary": {
    "session_id": "20251212_103000",
    "total_events": 12,
    "event_breakdown": {...},
    "start_time": "...",
    "end_time": "..."
  },
  "memory": [
    {
      "session_id": "...",
      "timestamp": "...",
      "event_type": "task_started",
      "data": {...}
    },
    ...
  ]
}

What's Logged:

  • Every user input
  • Complete execution plans
  • Task start/completion/failure events
  • Tool calls and their results
  • Final outputs
  • Error details

Access Methods:

  • get_memory_log() - Full event log
  • get_session_summary() - Session statistics
  • get_execution_timeline() - Chronological task execution
  • export_memory(filepath) - Export to JSON file

Testing

Manual Testing:

  • Streamlit UI (streamlit run src/app.py) for interactive testing
  • Flask API (python src/api.py) for API endpoint testing
  • Direct Python usage (python src/youtube_summarizer.py)

Test Scenarios:

  • Valid YouTube URLs with transcripts
  • Invalid URLs (error handling)
  • Videos without transcripts (error handling)
  • Different summary types (comprehensive, brief, key_points)
  • Keyword extraction accuracy
  • Memory logging completeness

5. Known Limitations

API Dependencies

  • Gemini API: Requires valid API key and internet connection. Rate limits may apply for high-volume usage.
  • YouTube Transcript API: Some videos may not have transcripts available, especially:
    • Very new videos (transcripts may not be generated yet)
    • Videos with disabled captions
    • Videos in languages other than English (though we request English specifically)
  • YouTube Data API: Optional for trending features. Quota limits apply if using real API.

Performance Considerations

  • Long Transcripts: Very long podcasts (3+ hours) may hit token limits. Current implementation handles this by truncating transcript segments for timestamp finding, but full transcripts are still processed for summaries.
  • API Latency: Summary generation depends on Gemini API response time, which can vary based on:
    • Transcript length
    • API load
    • Network conditions
  • Sequential Execution: Tasks execute sequentially (by design for clarity), which means total time is sum of all task times. Could be parallelized for performance improvements.

Content Quality

  • Summary Quality: Depends on Gemini model capabilities and prompt engineering. The 80/20 principle prompts are designed to extract high-value content, but subjective quality may vary.
  • Keyword Extraction: Limited to 10 keywords. May miss some important concepts if content is very diverse.
  • Language Support: Currently optimized for English content. Other languages may work but aren't specifically tested.

Error Handling

  • Network Failures: Basic retry logic not implemented. Network failures will cause task failures.
  • Partial Failures: If a task fails mid-execution, the plan stops. No automatic retry or recovery mechanisms.
  • Invalid Inputs: URL validation is basic. Some edge-case YouTube URL formats may not be recognized.

Scalability

  • Single-User Design: Current architecture is designed for single-user, synchronous processing. Not optimized for:
    • Concurrent requests
    • Background job processing
    • Large-scale batch processing
  • Memory Storage: In-memory storage by default. File-based persistence available but not optimized for high-volume scenarios.

Future Improvements

  • Implement retry logic for API calls
  • Add support for multiple languages
  • Parallel task execution where possible
  • Caching of transcripts and summaries
  • Background job processing for scalability
  • Enhanced error recovery mechanisms
  • Support for other video platforms (Vimeo, etc.)