YuhHearDem3 - Code Map and Review
YuhHearDem3 is a parliamentary transcription and knowledge graph system that processes video recordings of parliament sessions, extracts structured information, and enables conversational search over debates.
┌─────────────────────────────────────────────────────────────────────────────┐
│ YuhHearDem3 System │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐ │
│ │ Video Input │───▶│ Transcription │───▶│ Three-Tier Storage │ │
│ │ (YouTube/GCS) │ │ (Gemini 2.5) │ │ (PostgreSQL + pgvector)│ │
│ └─────────────────┘ └─────────────────┘ └─────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────────┐ │
│ │ Order Papers │ │ Knowledge Graph │ │ Search API │ │
│ │ (PDF Parsing) │───▶│ Extraction │───▶│ - Hybrid Search │ │
│ └─────────────────┘ └─────────────────┘ │ - Conversational AI │ │
│ │ - Graph Traversal │ │
│ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Code Flow Diagram (Mermaid)
flowchart LR
subgraph Sources
YT[YouTube or GCS video]
OPDF[Order paper PDF]
BillsSite[Parliament bills site]
end
subgraph Transcription
Transcribe[transcribe.py]
JSONOut[transcription_output.json]
end
subgraph TranscriptIngest
IngestScript[scripts/ingest_transcript_json.py]
Ingestor[lib/transcripts/ingestor.py]
end
subgraph OrderPapers
OPIngest[scripts/ingest_order_paper_pdf.py]
OPParser[lib/order_papers/*.py]
end
subgraph Bills
BillIngest[scripts/ingest_bills.py]
BillScraper[lib/scraping/bill_scraper.py]
BillProcessor[lib/processors/bill_ingestor.py]
end
subgraph KGExtraction
KGVideo[scripts/kg_extract_from_video.py]
KGBills[scripts/kg_extract_from_bills.py]
WindowBuilder[lib/knowledge_graph/window_builder.py]
BillWindowBuilder[lib/knowledge_graph/bill_window_builder.py]
Extractor[lib/knowledge_graph/oss_kg_extractor.py + kg_extractor.py]
KGStore[lib/knowledge_graph/kg_store.py]
end
subgraph Storage[(PostgreSQL + pgvector)]
Tables[Transcript + search + KG tables]
end
subgraph SearchAPI
API[api/search_api.py]
ChatAgent[lib/chat_agent_v2.py]
AgentLoop[lib/kg_agent_loop.py]
HybridRAG[lib/kg_hybrid_graph_rag.py]
AdvSearch[lib/advanced_search_features.py]
end
subgraph Frontend
UI[frontend/src (Vite + React)]
end
YT --> Transcribe --> JSONOut --> IngestScript --> Ingestor --> Tables
OPDF --> OPIngest --> OPParser --> Tables
BillsSite --> BillIngest --> BillScraper --> BillProcessor --> Tables
Tables --> WindowBuilder --> KGVideo
Tables --> BillWindowBuilder --> KGBills
KGVideo --> Extractor --> KGStore --> Tables
KGBills --> Extractor --> KGStore
Tables --> API
API --> ChatAgent --> AgentLoop --> HybridRAG --> Tables
API --> AdvSearch --> Tables
UI --> API
API --> UI
Loading
File
Lines
Purpose
transcribe.py
~650
Main video transcription script
api/search_api.py
~400
FastAPI application with chat and search endpoints
File
Lines
Purpose
chat_agent_v2.py
551
Conversational AI agent with thread management
kg_agent_loop.py
699
KG-powered agent loop with tool calling
kg_hybrid_graph_rag.py
~400
Hybrid Graph-RAG retrieval
advanced_search_features.py
~450
Temporal search, trends, graph queries
Knowledge Graph (lib/knowledge_graph/)
File
Lines
Purpose
oss_kg_extractor.py
~800
OSS KG extraction (two-pass)
oss_two_pass.py
677
OSS two-pass entity extraction
window_builder.py
287
Window-based processing for transcripts
bill_window_builder.py
~200
Bill excerpt window construction
kg_store.py
~350
KG storage operations
kg_extractor.py
~550
Main KG extraction logic
base_kg_seeder.py
~300
Base KG seeding
model_compare.py
~300
Model comparison utilities
window_benchmark.py
~160
Window performance benchmarks
Order Papers (lib/order_papers/)
File
Lines
Purpose
pdf_parser.py
192
PDF order paper parsing
video_matcher.py
344
Match papers to YouTube videos
ingestor.py
95
Order paper ingestion
parser.py
129
Order paper parsing
models.py
34
Order paper models
role_extract.py
27
Speaker role extraction
Transcripts (lib/transcripts/)
File
Lines
Purpose
ingestor.py
433
Transcript ingestion
Embeddings (lib/embeddings/)
File
Lines
Purpose
google_client.py
~200
Embedding generation client
File
Lines
Purpose
postgres_client.py
~100
PostgreSQL connection pool
chat_schema.py
~150
Chat schema management
Processors (lib/processors/)
File
Lines
Purpose
three_tier_transcription.py
147
Three-tier transcript processing
paragraph_splitter.py
115
Paragraph grouping
bill_entity_extractor.py
341
Bill entity extraction
bill_ingestor.py
184
Bill ingestion
File
Lines
Purpose
bill_scraper.py
305
Bill scraping from parliament website
File
Lines
Purpose
config.py
85
Configuration management
File
Lines
Purpose
id_generators.py
~100
ID generation utilities
roles.py
~120
Speaker role normalization utilities
File
Purpose
kg_extract_from_video.py
Extract KG from video
kg_extract_from_bills.py
Extract KG from bill excerpts
cron_transcription.py
Automated transcription jobs
migrate_chat_schema.py
Chat schema migration
clear_kg.py
Clear KG tables
ingest_order_paper_pdf.py
Ingest order paper PDFs
ingest_transcript_json.py
Ingest transcript JSON into Postgres
ingest_bills.py
Scrape/process bills and ingest
ingest_knowledge_graph.py
Ingest KG data
list_channel_videos.py
List channel videos
match_order_papers_to_videos.py
Match papers to videos
backfill_speaker_video_roles.py
Backfill speaker roles
compare_kg_models.py
Compare KG models
migrate_transcripts.py
Migrate transcripts
export_order_paper.py
Export order papers
deploy.sh
Deployment script
kg_export_html.py
Export KG to HTML
kg_seed_base.py
Seed base KG
File
Purpose
test_chat_agent_v2_unit.py
Chat agent tests
test_kg_agent_loop_unit.py
KG agent loop tests
test_kg_hybrid_graph_rag_unit.py
Graph-RAG tests
test_oss_two_pass.py
OSS two-pass tests
test_window_builder.py
Window builder tests
test_trace_helpers_unit.py
Trace helper tests
test_order_paper_*.py
Order paper tests
test_bill_*.py
Bill tests
test_database.py
Database tests
transcribe.py
├── google.genai (Gemini API)
├── yt_dlp (YouTube metadata)
├── rapidfuzz (fuzzy matching)
├── pydantic (data validation)
├── tenacity (retry logic)
└── lib/order_papers (order paper context)
api/search_api.py
├── fastapi (web framework)
├── lib.chat_agent_v2
├── lib.kg_agent_loop
├── lib.kg_hybrid_graph_rag
└── lib.db.postgres_client
lib/chat_agent_v2.py
├── lib.db.chat_schema
├── lib.kg_agent_loop
└── lib.utils.config
lib/kg_agent_loop.py
├── lib.kg_hybrid_graph_rag
├── google.genai
└── lib.utils.config
lib/knowledge_graph/*.py
├── google.genai (LLM extraction)
├── lib.db.postgres_client
└── tenacity (retry logic)
lib/order_papers/*.py
├── lib.db.postgres_client
└── pdf parsing libraries
1. Canonical ID Generation
# Hash-based stable IDs for consistency
kg_ < hash (type:label )> [:12 ] # Node IDs
kge_ < hash (source | predicate | target | video | seconds | evidence )> [:12 ] # Edge IDs
2. Window-Based Processing
# Configurable window size and stride
window_size = 30 # utterances per window
stride = 18 # utterances between windows
# 40% overlap for continuity
# Persistent conversation threads
chat_threads (table )
├── id : UUID
├── title : str
├── created_at : timestamp
└── updated_at : timestamp
chat_messages (table )
├── id : UUID
├── thread_id : FK
├── role : 'user' | 'assistant'
├── content : text
└── created_at : timestamp
# Every answer grounded in evidence
answer → cite_utterance_ids → transcript sentences
Clear Architecture : Separation between transcription, KG extraction, and chat
Modern Python : Type hints, dataclasses, pydantic models
Error Handling : Tenacity decorators for retry logic
Database Design : Well-designed schema with proper indexes
Citation Tracking : Full provenance for answers
Testing : Comprehensive unit tests for core functionality
Large Functions : Some functions exceed 100 lines
Type Coverage : Not all functions have type annotations
Documentation : Some modules lack docstrings
Error Messages : Could be more descriptive
Chat Agent V2 : Complete rewrite with thread-based conversations
KG Extraction : Added OSS two-pass extraction for improved accuracy
Order Papers : New PDF parsing and video matching
Tracing : Comprehensive debug tracing with CHAT_TRACE
Follow-Up Questions : LLM-generated contextual follow-ups
YuhHearDem3/
├── api/ # FastAPI endpoints
├── lib/
│ ├── chat_agent_v2.py # Chat agent
│ ├── kg_*.py # Knowledge graph
│ ├── order_papers/ # Order paper processing
│ ├── transcripts/ # Transcript processing
│ ├── db/ # Database clients
│ ├── processors/ # Data processors
│ ├── scraping/ # Web scrapers
│ └── utils/ # Utilities
├── scripts/ # CLI scripts
├── tests/ # Unit tests
├── docs/ # Documentation
└── schema/ # Database schema
The codebase is well-organized with clear separation of concerns. Core functionality includes:
Transcription : Iterative video processing with speaker diarization
Knowledge Graph : LLM-first extraction with canonical IDs
Conversational Search : Thread-based chat with grounded answers
Order Papers : PDF parsing and video matching
Overall quality: Good - Solid architecture with comprehensive testing.