PodVibe.fm uses an agentic AI architecture that processes YouTube podcast URLs through a structured ReAct (Reasoning + Acting) pattern. Here's the step-by-step workflow:
- User provides a YouTube video URL through the web interface (React frontend) or Streamlit UI
- The request is routed to the Flask API backend (`/api/summarize` endpoint)
- Input is validated to ensure it's a valid YouTube URL format
The Planner (planner.py) creates an execution plan by breaking down the summarization task into discrete sub-tasks:
- Extract Video ID - Parse the YouTube URL to extract the unique video identifier
- Fetch Transcript - Retrieve the full transcript from YouTube using the YouTube Transcript API
- Generate Summary - Use Google Gemini API to create an intelligent summary following the 80/20 principle (extracting the 20% of content that delivers 80% of the value)
- Extract Keywords - Use Gemini to identify 10 semantic keywords that represent core concepts
- Store Result - Save the final result with all metadata
Each task in the plan includes:
- Step number and action type
- Tool to be used (url_parser, youtube_api, gemini_api, etc.)
- Status tracking (pending → in_progress → completed/failed)
- Context from previous steps
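A single planned task might be represented like this (an illustrative sketch — the exact field names used in planner.py may differ):

```python
# Hypothetical shape of one task in an execution plan; field names are
# illustrative, not necessarily identical to those in planner.py.
task = {
    "step": 2,
    "action": "fetch_transcript",
    "tool": "youtube_api",
    "status": "pending",           # pending -> in_progress -> completed/failed
    "context": {"video_id": None}  # filled in from results of previous steps
}
```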
The Executor (executor.py) sequentially executes each planned task:
- **Task 1: Extract Video ID** (`url_parser` tool)
  - Parses various YouTube URL formats (youtube.com/watch, youtu.be, embed URLs)
  - Extracts the video ID using URL parsing logic
  - Returns `video_id` for use in subsequent steps
- **Task 2: Fetch Transcript** (`youtube_api` tool)
  - Uses the `youtube-transcript-api` library to fetch English transcripts
  - Retrieves transcript segments with timestamps for later use
  - Handles errors gracefully (missing transcripts, language issues)
  - Returns full transcript text and structured segments
- **Task 3: Generate Summary** (`gemini_api` tool)
  - Sends the transcript to the Google Gemini 2.5 Flash model
  - Uses specialized prompts designed for "Learning Mode" that focus on:
    - Novel frameworks and mental models
    - Counter-intuitive insights
    - Actionable steps with clear guidance
    - Evidence-based conclusions
    - High-utility knowledge
  - Filters out low-value content (intros, ads, filler, repetitive content)
  - Supports multiple summary types: comprehensive, brief, key_points
  - Returns structured summary text
- **Task 4: Extract Keywords** (`keyword_extractor` tool)
  - Uses Gemini to analyze the summary and extract exactly 10 semantic keywords
  - Focuses on meaningful concepts, not just frequent words
  - Prioritizes multi-word phrases that capture important ideas
  - Returns an ordered list of keywords by importance
- **Task 5: Store Result** (`memory_store` tool)
  - Prepares the final result dictionary with all accumulated data
  - Includes video_id, transcript, summary, keywords, metadata
  - Ready for storage or API response
The Memory (memory.py) component logs every step of the process:
- **Event Types Logged:**
  - `user_input` - Initial user request
  - `plan_created` - Execution plan with all tasks
  - `task_started` - When each task begins
  - `task_completed` - Successful task completion with results
  - `task_failed` - Task failures with error details
  - `final_result` - Complete summary with all metadata
- **Memory Features:**
  - Session-based tracking with unique session IDs
  - Timestamped events for a full audit trail
  - Execution timeline for chronological analysis
  - Optional persistence to JSON files in the `logs/` directory
  - Session summaries with statistics (total events, completion rates)
The agent returns a comprehensive result dictionary containing:
- Video ID and original URL
- Full transcript text and structured segments with timestamps
- AI-generated summary (comprehensive, brief, or key points)
- 10 semantic keywords
- Execution metadata (timestamps, plan summary)
- Complete memory log for observability
Purpose: Breaks down high-level goals into executable sub-tasks
Key Methods:
- `create_plan(user_input)` - Generates an execution plan from user input
- `update_task_status(plan, step, status, result)` - Updates task status as execution progresses
- `get_next_task(plan)` - Returns the next pending task to execute
- `is_plan_complete(plan)` - Checks if all tasks are finished
- `get_plan_summary(plan)` - Provides statistics on plan execution
Design Pattern: Template-based planning with predefined task sequences that can be customized based on user input (e.g., summary_type).
Purpose: Executes planned tasks using appropriate tools and APIs
Key Tools:
- `url_parser` - Extracts the video ID from YouTube URLs
- `youtube_api` - Fetches transcripts using youtube-transcript-api
- `gemini_api` - Generates summaries using Google Gemini 2.5 Flash
- `keyword_extractor` - Extracts semantic keywords using Gemini
- `keyword_timestamp_finder` - Finds where keywords are discussed in the video (uses Gemini to analyze the transcript with timestamps)
- `memory_store` - Prepares results for storage
Key Methods:
- `execute_task(task, context)` - Main execution method that routes to the appropriate tool
- Each tool method handles its specific operation and error cases
Design Pattern: Tool registry pattern where each tool is a method that can be called by name. Tools receive task context and return structured results.
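The tool-registry routing can be illustrated with a minimal sketch (the real executor.py registers the actual API-backed tools; the two tools below are placeholders):

```python
class Executor:
    def __init__(self):
        # Registry: tool name -> bound method. Real tools call external APIs.
        self.tools = {
            "url_parser": self._url_parser,
            "memory_store": self._memory_store,
        }

    def execute_task(self, task: dict, context: dict) -> dict:
        """Route a task to its tool by name and return a structured result."""
        tool = self.tools.get(task["tool"])
        if tool is None:
            return {"status": "failed", "error": f"unknown tool {task['tool']}"}
        try:
            return {"status": "completed", "result": tool(task, context)}
        except Exception as exc:
            return {"status": "failed", "error": str(exc)}

    def _url_parser(self, task, context):
        # Placeholder: the real tool parses context["url"] properly
        return {"video_id": context.get("url", "").rsplit("=", 1)[-1]}

    def _memory_store(self, task, context):
        # Placeholder: the real tool assembles the final result dictionary
        return dict(context)
```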
Purpose: Provides full observability into agent decisions and actions
Key Methods:
- `log_event(event_type, data)` - Core logging method
- `log_plan_creation(plan)` - Logs when a plan is created
- `log_task_start(task)` - Logs task initiation
- `log_task_completion(task, result)` - Logs successful task completion
- `log_task_failure(task, error)` - Logs task failures
- `get_memory(event_type)` - Retrieves logged events (optionally filtered)
- `get_session_summary()` - Provides session statistics
- `get_execution_timeline()` - Returns chronological execution events
- `export_memory(filepath)` - Exports memory to a JSON file
Design Pattern: Event-driven logging with structured data. Supports both in-memory and persistent storage.
Purpose: Main orchestrator that coordinates Planner, Executor, and Memory
Key Methods:
- `process_youtube_url(url, summary_type)` - Main workflow method that:
  - Creates the execution plan
  - Executes tasks sequentially
  - Updates context between tasks
  - Logs all activities
  - Returns the final result
- `find_keyword_timestamp(video_id, keyword, transcript_segments)` - Advanced feature to locate where topics are discussed
- `get_memory_log()` - Access to full observability data
Design Pattern: Orchestrator pattern that coordinates the three core components while maintaining separation of concerns.
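Stripped to its essentials, the orchestration loop reduces to something like the following (a hedged sketch of the control flow, not the actual agent code — the real orchestrator works with the Planner/Executor/Memory objects described above):

```python
def run_plan(plan, tools, log):
    """Execute planned tasks sequentially, threading context between steps.

    plan:  list of task dicts ({"step", "tool", "status"})
    tools: dict mapping tool name -> callable(context) -> dict of new context
    log:   list collecting (event_type, step) tuples for observability
    """
    context = {}
    for task in plan:
        log.append(("task_started", task["step"]))
        try:
            # Each tool sees the accumulated context and contributes to it
            context.update(tools[task["tool"]](context))
        except Exception:
            task["status"] = "failed"
            log.append(("task_failed", task["step"]))
            break  # no automatic retry: the plan stops on first failure
        task["status"] = "completed"
        log.append(("task_completed", task["step"]))
    return context
```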
Usage: Primary AI engine for summarization and keyword extraction
Integration Points:
- **Summary Generation**: Uses the `gemini-2.5-flash` model with specialized prompts
  - Comprehensive prompts designed for "Learning Mode" following the 80/20 principle
  - Filters high-value content (frameworks, insights, actionable steps)
  - Excludes low-value content (intros, ads, filler)
- **Keyword Extraction**: Analyzes summaries to extract 10 semantic keywords
  - Focuses on meaningful concepts and multi-word phrases
  - Orders by importance/relevance
- **Timestamp Finding**: Analyzes transcript segments to locate where keywords/topics are discussed
  - Uses semantic understanding to find concepts even when the exact words aren't used
  - Returns precise timestamps for video navigation
API Configuration:
- API key from the `GEMINI_API_KEY` environment variable
- Configured using the `google.generativeai` library
- Model: `gemini-2.5-flash` (latest stable model)
Usage: Fetches video transcripts for summarization
Integration Points:
- Uses the `youtube-transcript-api` Python library
- Fetches English transcripts by default
- Retrieves structured segments with timestamps
- Handles errors (missing transcripts, language issues)
Error Handling:
- Graceful fallback when transcripts unavailable
- Clear error messages for debugging
Usage: Fetches trending podcast videos by category
Integration Points:
- Used in `trending.py` for discovering content
- Searches for podcasts uploaded within the last 14 days
- Filters by: English language, podcast tags, minimum 1 hour duration
- Falls back to sample data if API key not configured
API Configuration:
- API key from the `YOUTUBE_API_KEY` environment variable
- Optional - the system works without it, using sample data
Location: logs/ directory (created automatically)
Log Format: JSON files with structure:
{
"session_summary": {
"session_id": "20251212_103000",
"total_events": 12,
"event_breakdown": {...},
"start_time": "...",
"end_time": "..."
},
"memory": [
{
"session_id": "...",
"timestamp": "...",
"event_type": "task_started",
"data": {...}
},
...
]
}

What's Logged:
- Every user input
- Complete execution plans
- Task start/completion/failure events
- Tool calls and their results
- Final outputs
- Error details
Access Methods:
- `get_memory_log()` - Full event log
- `get_session_summary()` - Session statistics
- `get_execution_timeline()` - Chronological task execution
- `export_memory(filepath)` - Export to a JSON file
Manual Testing:
- Streamlit UI (`streamlit run src/app.py`) for interactive testing
- Flask API (`python src/api.py`) for API endpoint testing
- Direct Python usage (`python src/youtube_summarizer.py`)
Test Scenarios:
- Valid YouTube URLs with transcripts
- Invalid URLs (error handling)
- Videos without transcripts (error handling)
- Different summary types (comprehensive, brief, key_points)
- Keyword extraction accuracy
- Memory logging completeness
- Gemini API: Requires valid API key and internet connection. Rate limits may apply for high-volume usage.
- YouTube Transcript API: Some videos may not have transcripts available, especially:
- Very new videos (transcripts may not be generated yet)
- Videos with disabled captions
- Videos in languages other than English (though we request English specifically)
- YouTube Data API: Optional for trending features. Quota limits apply if using real API.
- Long Transcripts: Very long podcasts (3+ hours) may hit token limits. Current implementation handles this by truncating transcript segments for timestamp finding, but full transcripts are still processed for summaries.
- API Latency: Summary generation depends on Gemini API response time, which can vary based on:
- Transcript length
- API load
- Network conditions
- Sequential Execution: Tasks execute sequentially (by design, for clarity), so the total time is the sum of all task times. Execution could be parallelized for performance improvements.
- Summary Quality: Depends on Gemini model capabilities and prompt engineering. The 80/20 principle prompts are designed to extract high-value content, but subjective quality may vary.
- Keyword Extraction: Limited to 10 keywords. May miss some important concepts if content is very diverse.
- Language Support: Currently optimized for English content. Other languages may work but aren't specifically tested.
- Network Failures: Basic retry logic not implemented. Network failures will cause task failures.
- Partial Failures: If a task fails mid-execution, the plan stops. No automatic retry or recovery mechanisms.
- Invalid Inputs: URL validation is basic. Some edge-case YouTube URL formats may not be recognized.
- Single-User Design: Current architecture is designed for single-user, synchronous processing. Not optimized for:
- Concurrent requests
- Background job processing
- Large-scale batch processing
- Memory Storage: In-memory storage by default. File-based persistence available but not optimized for high-volume scenarios.
- Implement retry logic for API calls
- Add support for multiple languages
- Parallel task execution where possible
- Caching of transcripts and summaries
- Background job processing for scalability
- Enhanced error recovery mechanisms
- Support for other video platforms (Vimeo, etc.)
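The first improvement on the list, retry logic, could be as small as a backoff wrapper around the API calls — a sketch of one possible approach, not part of the current codebase:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying on exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```

Each tool call in the Executor (e.g. the transcript fetch or Gemini request) could then be wrapped as `with_retries(lambda: tool(task, context))` without changing the tool interfaces.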