YuhHearDem3 - Complete Implementation Guide

Executive Summary

YuhHearDem3 is a hybrid vector/graph search and conversational AI system for Barbados Parliament debates. It combines:

Video Transcription: Gemini 2.5 Flash with iterative segment processing
Knowledge Graph: LLM-first extraction with canonical IDs and provenance
Conversational Search: Thread-based chat with Hybrid Graph-RAG
Hybrid Search: Vector similarity + BM25 full-text + graph traversal

System Architecture

┌────────────────────────────────────────────────────────────────────────────┐
│                             YuhHearDem3 System                             │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────────────┐ │
│  │   Video Input   │───▶│  Transcription  │───▶│  Three-Tier Storage     │ │
│  │  (YouTube/GCS)  │    │  (Gemini 2.5)   │    │  (PostgreSQL + pgvector)│ │
│  └─────────────────┘    └─────────────────┘    └─────────────────────────┘ │
│           │                      │                         │               │
│           ▼                      ▼                         ▼               │
│  ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────────────┐ │
│  │ Order Papers    │    │ Knowledge Graph │    │     Search API          │ │
│  │ (PDF Parsing)   │───▶│    Extraction   │───▶│  - Hybrid Search        │ │
│  └─────────────────┘    └─────────────────┘    │  - Conversational AI    │ │
│                                                │  - Graph Traversal      │ │
│                                                └─────────────────────────┘ │
└────────────────────────────────────────────────────────────────────────────┘

FastAPI serves the built frontend from `frontend/dist`, so the frontend calls the API on the same origin.

Core Modules

Video Transcription (`transcribe.py`)

Features:

Iterative segment processing with configurable duration (default 30 min)
Speaker diarization with fuzzy matching across segments
Legislation/bill identification
Order paper integration for context
Overlap handling for continuity

Usage:

python transcribe.py --order-file order.txt --segment-minutes 30 --start-minutes 0
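
To picture how `--segment-minutes`, `--overlap-minutes` and `--start-minutes` interact, here is a minimal sketch of segment planning. It is not the code in `transcribe.py`: the helper name `plan_segments`, its `total_minutes` argument, and the rule that each segment starts `overlap_minutes` before the previous one ended are assumptions made for illustration.

```python
from typing import List, Tuple


def plan_segments(
    total_minutes: float,
    segment_minutes: float = 30,
    overlap_minutes: float = 1,
    start_minutes: float = 0,
) -> List[Tuple[float, float]]:
    """Return (start, end) minute pairs that cover the video.

    Hypothetical helper: every segment after the first begins
    `overlap_minutes` before the previous segment ended, so speech that
    straddles a boundary appears in both segments.
    """
    if segment_minutes <= overlap_minutes:
        raise ValueError("segment_minutes must exceed overlap_minutes")
    segments: List[Tuple[float, float]] = []
    start = start_minutes
    while start < total_minutes:
        end = min(start + segment_minutes, total_minutes)
        segments.append((start, end))
        if end >= total_minutes:
            break  # the last segment reached the end of the video
        start = end - overlap_minutes
    return segments


print(plan_segments(70))
```

With the defaults, a 70-minute video is split into three segments, and each boundary is covered twice by one minute of audio.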

Knowledge Graph Extraction (`lib/knowledge_graph/`)

Architecture:

OSS Two-Pass: Improved entity/relation extraction
Window Builder: Configurable window size/stride (default: 30/18)
Canonical IDs: Hash-based stable identifiers
Vector Context: Top-K similar nodes per window

Relationship Types:

Conceptual (11): AMENDS, GOVERNS, MODERNIZES, AIMS_TO_REDUCE, REQUIRES_APPROVAL, IMPLEMENTED_BY, RESPONSIBLE_FOR, ASSOCIATED_WITH, CAUSES, ADDRESSES, PROPOSES
Discourse (4): RESPONDS_TO, AGREES_WITH, DISAGREES_WITH, QUESTIONS

Node Types:

foaf:Person, schema:Legislation, schema:Organization, schema:Place, skos:Concept

Usage:

python scripts/kg_extract_from_video.py --youtube-video-id "VIDEO_ID" --window-size 30 --stride 18
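
The `--window-size` and `--stride` options describe a sliding window over utterances. The sketch below shows one plausible reading of those two numbers; `build_windows` is a hypothetical name, and the real Window Builder in `lib/knowledge_graph/` may handle the final partial window differently.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def build_windows(
    utterances: Sequence[T], window_size: int = 30, stride: int = 18
) -> Iterator[List[T]]:
    """Yield overlapping windows of utterances (illustrative only).

    Consecutive windows share `window_size - stride` utterances, so a
    relation that spans a window boundary is still seen in full once.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    for start in range(0, len(utterances), stride):
        yield list(utterances[start:start + window_size])
        if start + window_size >= len(utterances):
            break  # this window already reached the last utterance


windows = list(build_windows(list(range(50))))
print([(w[0], w[-1]) for w in windows])
```

With the defaults of 30 and 18, neighbouring windows overlap by 12 utterances.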

Conversational Search (`lib/chat_agent_v2.py`)

Components:

KGChatAgentV2: Main chat agent class
KGAgentLoop: Handles LLM tool calls and Graph-RAG
Thread Storage: PostgreSQL-backed conversation history
Citation Engine: Grounded answers with transcript citations

Tracing:

CHAT_TRACE=1 python -m uvicorn api.search_api:app --reload

Hybrid Search (`lib/kg_hybrid_graph_rag.py`)

Pipeline:

Vector search over kg_nodes (semantic similarity)
Graph expansion (N-hop traversal)
Citation retrieval with timestamps
Re-ranking by relevance
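
The graph expansion step can be sketched as a bounded breadth-first search from the nodes returned by vector search. This is an assumption-laden illustration, not the code in `lib/kg_hybrid_graph_rag.py`: the `expand_hops` name, the in-memory adjacency dict standing in for `kg_edges`, and the directed traversal are all hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List


def expand_hops(
    seeds: Iterable[str], adjacency: Dict[str, List[str]], max_hops: int = 1
) -> Dict[str, int]:
    """Return node id -> hop distance for every node within `max_hops` of a seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(distance)  # start the search from the seed nodes
    while queue:
        node = queue.popleft()
        if distance[node] >= max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Toy graph with made-up node ids.
toy_graph = {
    "bill": ["minister", "ministry"],
    "minister": ["party"],
    "party": ["leader"],
}
print(expand_hops(["bill"], toy_graph, max_hops=2))
```

The hop distance is a convenient signal for the later re-ranking step, since nodes closer to a vector hit are usually more relevant.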

Data Flow

Transcription Flow

Video URL → yt-dlp metadata → Gemini API → Segment transcription → Speaker normalization → JSON output
JSON output → scripts/ingest_transcript_json.py → Transcript tables (videos/paragraphs/sentences/entities)
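
The speaker normalization step reconciles slightly different spellings of the same speaker across segments. The project lists `rapidfuzz` as an optional dependency; the sketch below uses the standard library's `difflib` instead so that it runs without extra packages. The function name, the 0.85 threshold, and the placeholder names are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def normalize_speaker(name: str, known: List[str], threshold: float = 0.85) -> str:
    """Map `name` onto the closest already-seen speaker, or register it as new."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    best: Optional[str] = None
    best_score = 0.0
    for candidate in known:
        score = SequenceMatcher(None, cleaned.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best is not None and best_score >= threshold:
        return best  # treat as the same speaker
    known.append(cleaned)  # first sighting: remember this spelling
    return cleaned


speakers = ["Hon. Jane Doe"]  # placeholder names, not real data
print(normalize_speaker("Hon Jane Doe", speakers))
print(normalize_speaker("John Smith", speakers))
```

A threshold this high favours creating a duplicate speaker over merging two different people, which is usually the safer failure mode for citations.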

Order Paper Flow

Order paper PDF → scripts/ingest_order_paper_pdf.py → order_papers/order_paper_items → context + role seeding

Bill Ingestion Flow

Bill site → scripts/ingest_bills.py → bills + bill_excerpts (embeddings)

Knowledge Graph Flow

Transcript → Window Builder (30 utterances, stride 18) → LLM extraction → Canonicalization → KG Store → PostgreSQL
Bill excerpts → BillWindowBuilder → LLM extraction → Canonicalization → KG Store → PostgreSQL
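
Canonicalization relies on hash-based stable identifiers. One way such an ID could be derived is sketched below; the normalization rules, the SHA-256 choice, the `kg_` prefix, and the 16-character length are assumptions rather than the project's actual scheme.

```python
import hashlib
import unicodedata


def canonical_id(node_type: str, label: str, length: int = 16) -> str:
    """Derive a stable ID from a node type and a normalized label (illustrative)."""
    normalized = unicodedata.normalize("NFKC", label).casefold()
    normalized = " ".join(normalized.split())  # collapse whitespace
    digest = hashlib.sha256(f"{node_type}|{normalized}".encode("utf-8")).hexdigest()
    return "kg_" + digest[:length]


# Spelling variants of the same (made-up) label map to the same ID.
a = canonical_id("schema:Legislation", "Example  Act")
b = canonical_id("schema:Legislation", " example act ")
print(a == b)
```

Because the ID depends only on the type and the normalized label, re-running extraction on the same video produces the same node IDs and upserts stay idempotent.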

Chat Flow

User Query → Embedding → Vector Search → Graph Expansion → LLM Synthesis → Grounded Answer + Citations

Database Schema

Core Tables

Transcript Tables:

paragraphs: Paragraphs with embeddings
sentences: Individual sentences with provenance
speakers: Speaker information
speaker_video_roles: Speaker roles per video

Knowledge Graph Tables:

kg_nodes: Canonical nodes with embeddings
kg_aliases: Normalized alias index
kg_edges: Edges with provenance (evidence, timestamps, citations)

Chat Tables:

chat_threads: Conversation threads
chat_messages: Messages with role and content
chat_thread_state: Persisted state for follow-ups

API Endpoints

Method	Path	Description
POST	`/search`	Hybrid search (vector + graph + BM25)
POST	`/search/temporal`	Search with date/speaker/entity filters
GET	`/search/trends`	Trend analysis for entities
GET	`/speakers`	List all speakers
GET	`/speakers/{speaker_id}`	Speaker details
GET	`/videos/{youtube_video_id}/speakers/{speaker_id}/roles`	Speaker roles for a video
POST	`/chat/threads`	Create new thread
POST	`/chat/threads/{thread_id}/messages`	Add message to thread
GET	`/chat/threads/{thread_id}/messages/stream`	Stream message response (SSE)
GET	`/health`	Health check
GET	`/api`	API metadata
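
As a rough illustration of calling the chat endpoints, the sketch below builds a POST request for `/chat/threads/{thread_id}/messages` with the standard library. The base URL assumes a local server on port 8000, and the `content` field name is hypothetical: check the request models in `api/search_api.py` for the real payload shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local development server


def build_json_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request against the API."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def message_request(thread_id: str, text: str) -> urllib.request.Request:
    # "content" is a hypothetical field name, not confirmed by this guide.
    return build_json_post(f"/chat/threads/{thread_id}/messages", {"content": text})


req = message_request("THREAD_ID", "What was said about water rates?")
print(req.get_method(), req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```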

Scripts Reference

Script	Purpose
`transcribe.py`	Main video transcription
`scripts/ingest_transcript_json.py`	Ingest transcript JSON into Postgres
`scripts/kg_extract_from_video.py`	Extract KG from video
`scripts/kg_extract_from_bills.py`	Extract KG from bill excerpts
`scripts/cron_transcription.py`	Automated transcription jobs
`scripts/migrate_chat_schema.py`	Chat schema migration
`scripts/clear_kg.py`	Clear KG tables
`scripts/ingest_order_paper_pdf.py`	Ingest order paper PDFs
`scripts/ingest_bills.py`	Scrape/process bills and ingest
`scripts/list_channel_videos.py`	List channel videos

Configuration

Environment Variables

Variable	Description
`GOOGLE_API_KEY`	Google AI Studio API key
`CHAT_TRACE`	Enable chat tracing (1/true/on)
`ENABLE_THINKING`	Enable model thinking
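
A flag that accepts `1`, `true` or `on` is typically parsed along these lines. This is a sketch under assumptions: `env_flag` is a hypothetical helper, and treating the value case-insensitively is a guess about the project's behaviour.

```python
import os

_TRUTHY = {"1", "true", "on"}


def env_flag(name: str, default: bool = False) -> bool:
    """Return True when the environment variable holds a truthy value."""
    raw = os.environ.get(name)
    if raw is None:
        return default  # variable not set at all
    return raw.strip().lower() in _TRUTHY


os.environ["CHAT_TRACE"] = "on"
print(env_flag("CHAT_TRACE"))
```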

Command-Line Options

transcribe.py:

--order-file: Path to order paper file
--order-paper-id: Order paper ID from database
--segment-minutes: Segment duration (default: 30)
--overlap-minutes: Segment overlap (default: 1)
--start-minutes: Start position (default: 0)
--max-segments: Limit segments processed
--video: YouTube ID/URL or gs:// URI

kg_extract_from_video.py:

--youtube-video-id: Video ID to process
--window-size: Utterances per window (default: 30)
--stride: Utterances between windows (default: 18)
--max-windows: Limit windows processed

Testing

# Run all tests
python -m pytest tests/ -v

# Run specific test
python -m pytest tests/test_chat_agent_v2_unit.py -v

# Lint
ruff check .

# Type check
mypy lib/

Quick Reference

Essential Commands

# Transcribe video
python transcribe.py --order-file order.txt

# Extract knowledge graph
python scripts/kg_extract_from_video.py --youtube-video-id "ID"

# Start API
python -m uvicorn api.search_api:app --reload --port 8000

# Run tests
python -m pytest tests/ -v

# Lint
ruff check . --fix

File Locations

Component	Location
Chat API	`api/search_api.py`
Chat Agent	`lib/chat_agent_v2.py`
KG Extraction	`lib/knowledge_graph/`
Order Papers	`lib/order_papers/`
Tests	`tests/`

Dependencies

Core

google-genai>=0.8.0: Gemini API client
fastapi>=0.109.0: Web framework
psycopg[binary,pool]>=3.2.0: PostgreSQL
pydantic>=2.5.0: Data validation
yt-dlp>=2024.0.0: Video metadata

Optional

rapidfuzz>=3.6.0: Fuzzy string matching
tenacity>=8.2.0: Retry logic
beautifulsoup4>=4.12.0: HTML parsing

Documentation

Document	Description
README.md	Project overview
QUICK_REFERENCE.md	Command quick reference
CHAT_TRACE.md	Debug tracing
DATE_NORMALIZATION.md	Date handling

License

MIT
