This file contains build commands, testing procedures, and coding style guidelines for agentic coding.
The docs/ directory contains comprehensive project documentation:
| Document | Purpose | Lines |
|---|---|---|
| README.md | Main project documentation - transcription, knowledge graphs, chat | 425+ |
| COMPLETE_GUIDE.md | Comprehensive 5-phase implementation guide with architecture | 1121+ |
| QUICK_REFERENCE.md | Quick reference card for commands and API endpoints | 376+ |
| README_SEARCH_SYSTEM.md | System architecture and database queries | 328 |
| CODE_MAP_AND_REVIEW.md | Complete code map, architecture diagram, and code review | ~400 |
| DATE_NORMALIZATION.md | Technical documentation for date normalization | 136 |
| KG_SUMMARY.md | Knowledge graph statistics and sample data | 71 |
| CHAT_TRACE.md | Chat trace logging documentation | 456 |
Quick Links:
- Start here: README.md for transcription, KG extraction, and chat
- For development: QUICK_REFERENCE.md for commands
- For architecture: COMPLETE_GUIDE.md for full system overview
- For code understanding: CODE_MAP_AND_REVIEW.md
- For tracing: CHAT_TRACE.md for debug output format
# Run linter
ruff check .
# Fix auto-fixable issues
ruff check . --fix
# Check specific file
ruff check lib/chat_agent_v2.py
# Show detailed output with line numbers
ruff check . --output-format=full

# Run type checker on lib modules
mypy lib/
# Check specific file
mypy lib/kg_agent_loop.py
# More verbose output
mypy lib/ --show-error-codes

Transcribe video (main script):
python transcribe.py --order-file order.txt --segment-minutes 30 --start-minutes 0

Extract knowledge graph from video:
python scripts/kg_extract_from_video.py --youtube-video-id "VIDEO_ID" --window-size 10 --stride 6

Chat API server:
python -m uvicorn api.search_api:app --reload --host 0.0.0.0 --port 8000

Cron transcription:
python scripts/cron_transcription.py --process
python scripts/cron_transcription.py --list
python scripts/cron_transcription.py --add "VIDEO_ID"

Migrate chat schema:
python scripts/migrate_chat_schema.py

Run all tests:
python -m pytest tests/ -v

Run specific test file:
python -m pytest tests/test_chat_agent_v2_unit.py -v
python -m pytest tests/test_kg_agent_loop_unit.py -v

Quick manual testing:
- Run the script with `--max-segments 2` for quick iteration
- Verify JSON output structure matches expected format
- Check for runtime errors with `python transcribe.py --help`
- Order: Standard library → third-party → local (if any)
- No blank line between imports from the same package
- Use `from collections.abc import Callable` for types
- Import from `typing` module for backward compatibility if needed
Example:
import argparse
import json
from datetime import datetime
from typing import Any

import pydantic
import tenacity
from google import genai
from google.genai.types import GenerateContentConfig, Part

- Use modern Python 3.13+ syntax: `dict[str, str]` instead of `Dict[str, str]`
- Always annotate function parameters and return types
- Use `| None` for optional types instead of `Optional[str]`
- Use `list[T]` instead of `List[T]`
- Use `dataclass` for simple data objects
- Use `pydantic.BaseModel` for validated data structures
Example:
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class VideoSegment:
    start: timedelta
    end: timedelta

def get_video_duration(video: Video) -> timedelta | None:
    """Fetch video duration."""
    # implementation

- Functions and variables: `snake_case`
- Classes: `CamelCase`
- Constants: `UPPER_SNAKE_CASE`
- Private methods: `_leading_underscore`
- Module-level constants: Define at top after imports
- Enum classes: `CamelCase` with `UPPER_CASE` values
Example:
import enum

from google.genai.types import GenerateContentConfig

class Model(enum.Enum):
    GEMINI_2_5_FLASH = "gemini-2.5-flash"

DEFAULT_CONFIG = GenerateContentConfig(temperature=0.0)

def format_known_speakers(known_speakers: dict) -> str:
    """Format known speakers for prompt context."""
    pass

- Keep it simple: One-line triple-quoted string on first line of function
- Use Google-style docstrings only for complex functions needing more explanation
- No docstring for trivial functions (obvious from name and signature)
Example:
def get_video_metadata_ytdlp(video: Video) -> VideoMetadataInfo:
    """Fetch video metadata using yt-dlp."""
    with yt_dlp.YoutubeDL({"quiet": True}) as ydl:
        info = ydl.extract_info(video.value, download=False)
    return VideoMetadataInfo(...)

- Use try/except for operations that may fail (API calls, file I/O)
- Print errors to console with context (e.g., `print(f"Error: {e}")`)
- Return empty/default values on failure when appropriate (e.g., `return []`)
- Raise ValueError with descriptive message for invalid user input
- Use tenacity for retrying transient errors (API rate limits)
Example:
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
def extract_entities(self, transcript_entry: dict) -> list[tuple]:
    """Extract entities from transcript."""
    try:
        ...  # implementation
    except Exception as e:
        # Exceptions caught here never reach tenacity, so they are not retried.
        print(f"Error extracting entities: {e}")
        return []

- Target line length: 88-100 characters (flexible for URLs, long strings)
- ruff will auto-fix most issues with `ruff check . --fix`
- Use f-strings for string formatting (not `.format()` or `%`)
- Use match/case instead of if/elif chains for complex conditionals
- Entry point: Always use `if __name__ == "__main__":` pattern
- argparse setup in main() or at module level if simple
- Data models: Define after imports, before functions
- Helper functions: Organize logically, typically grouped by purpose
- Use tenacity decorators for retry logic on API calls
- Set sensible timeout and max retries (3-7 attempts)
- Handle ClientError specifically for API errors
- Log/retry transient errors (429 rate limits, 503 service unavailable)
- Progress indicators: Use `--batch-size` for periodic updates
- Error prefixes: Use `❌` for errors, `⚠️` for warnings, `✅` for success
- Use print() for all output (no logging module currently)
- Section headers with `=` or `─` separators for readability
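
The output conventions above can be sketched as two small formatting helpers. The function names (`format_status`, `format_header`) and the default width of 60 characters are illustrative assumptions, not names taken from the codebase.

```python
# Hypothetical helpers illustrating the print()-based output conventions.
# The names and the default width are assumptions made for this sketch.

ERROR = "❌"
WARNING = "⚠️"
SUCCESS = "✅"


def format_status(prefix: str, message: str) -> str:
    """Prefix a message with a status emoji."""
    return f"{prefix} {message}"


def format_header(title: str, width: int = 60, char: str = "=") -> str:
    """Frame a section title between two separator lines."""
    line = char * width
    return f"{line}\n{title}\n{line}"


if __name__ == "__main__":
    print(format_header("Knowledge graph extraction"))
    print(format_status(SUCCESS, "Window processed"))
    print(format_status(WARNING, "No known nodes returned for this window"))
    print(format_status(ERROR, "Request failed"))
```

Returning strings rather than printing inside the helpers keeps them easy to test while the call sites still use plain `print()`.
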
Use modern Python 3.10+ match/case for cleaner conditional logic:
match err.code:
    case 400 if "try again" in err.message:
        retry = True
    case 429:
        retry = True
    case _:
        retry = False

- Enum for video sources: `Video` class with YouTube URLs
- Enum for models: `Model` with Gemini variants
- Time formats: Use `timedelta` for all time calculations
- Speaker IDs: Format `s_<name>_<number>` (normalized, lowercase, underscore-separated)
- Legislation IDs: Format `L_BILLNUMBER_<number>`
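
As a rough illustration of the speaker ID format, the sketch below builds an `s_<name>_<number>` string from a display name. The helper name `make_speaker_id` is hypothetical, and the interpretation of "normalized" (accents stripped, punctuation collapsed to underscores) is an assumption; the project's actual normalization may differ.

```python
import re
import unicodedata


def make_speaker_id(name: str, number: int) -> str:
    """Build an ID of the form s_<name>_<number> (hypothetical helper)."""
    # Assumption: "normalized" means accents are stripped to plain ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Lowercase, then collapse every run of non-alphanumerics into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")
    if not slug:
        raise ValueError(f"Cannot derive a speaker ID from name: {name!r}")
    return f"s_{slug}_{number}"


if __name__ == "__main__":
    print(make_speaker_id("Jane O'Neil", 1))
```
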
- LLM-first approach: Single pass extraction using Gemini, no NER pre-filtering
- Window-based extraction: Concept windows (10 utterances, stride 6) with 40% overlap
- OSS two-pass extraction: `oss_two_pass.py` for improved entity/relation extraction
- Node IDs: Hash-based stable IDs: `kg_<hash(type:label)>[:12]`
- Edge IDs: Hash-based: `kge_<hash(source|predicate|target|video|seconds|evidence)>[:12]`
- Predicates: 15 predicates (11 conceptual + 4 discourse) extracted in single pass
- Timestamp accuracy: Edges use timestamps from specific utterances referenced by each edge
- Vector context: Top-K vector search provides relevant known nodes to LLM per window
- Node types: `foaf:Person`, `schema:Legislation`, `schema:Organization`, `schema:Place`, `skos:Concept`
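
The windowing and ID scheme above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and the list does not say which hash function is used or exactly what `[:12]` truncates, so SHA-256 and a 12-character hex prefix are assumptions.

```python
import hashlib


def concept_windows(
    utterances: list[str], size: int = 10, stride: int = 6
) -> list[list[str]]:
    """Split utterances into overlapping windows (hypothetical helper)."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    windows: list[list[str]] = []
    for start in range(0, len(utterances), stride):
        windows.append(utterances[start : start + size])
        if start + size >= len(utterances):
            break  # this window already reaches the last utterance
    return windows


def node_id(node_type: str, label: str) -> str:
    """Stable node ID: 'kg_' plus 12 hex characters of a digest."""
    # Assumption: SHA-256 over "type:label", truncated to 12 hex characters.
    digest = hashlib.sha256(f"{node_type}:{label}".encode("utf-8")).hexdigest()
    return f"kg_{digest[:12]}"


if __name__ == "__main__":
    sample = [f"utterance {i}" for i in range(22)]
    for window in concept_windows(sample):
        print(window[0], "to", window[-1])
    print(node_id("skos:Concept", "Budget"))
```

With a window of 10 and a stride of 6, consecutive windows share 4 utterances, which is the 40% overlap noted above. A cryptographic digest is used here because Python's built-in `hash()` is randomized per process for strings and would not give stable IDs across runs.
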
- KGChatAgentV2: Main chat agent class
- Thread-based conversations: Creates and manages chat threads in PostgreSQL
- KGAgentLoop: Handles LLM tool calls for knowledge graph retrieval
- Hybrid Graph-RAG: Uses `kg_hybrid_graph_rag` tool for subgraph retrieval
- Tracing: `CHAT_TRACE=1` environment variable enables detailed debug output
- PDF parsing: Extracts order paper content from parliamentary PDFs
- Video matching: Matches order papers to YouTube videos
- Role extraction: Extracts speaker roles from order papers
- Use `python scripts/clear_kg.py --yes` to clear all KG tables for fresh extraction
- Clears: `kg_edges`, `kg_aliases`, `kg_nodes` (in that order due to FK constraints)
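
The reason for the deletion order can be illustrated with a self-contained sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table columns are a hypothetical minimal schema; only the table names and the order come from the list above.

```python
import sqlite3

# Deletion order from the list above: referencing tables first, kg_nodes last.
CLEAR_ORDER = ("kg_edges", "kg_aliases", "kg_nodes")


def clear_kg(conn: sqlite3.Connection) -> None:
    """Delete all rows from the KG tables, children before parents."""
    with conn:  # commits on success, rolls back on error
        for table in CLEAR_ORDER:
            conn.execute(f"DELETE FROM {table}")  # names are fixed constants


def demo() -> int:
    """Populate a toy schema, clear it, and return the remaining row count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    # Hypothetical minimal schema; the real tables have more columns.
    conn.executescript(
        """
        CREATE TABLE kg_nodes (id TEXT PRIMARY KEY);
        CREATE TABLE kg_aliases (alias TEXT, node_id TEXT REFERENCES kg_nodes(id));
        CREATE TABLE kg_edges (
            id TEXT PRIMARY KEY,
            source_id TEXT REFERENCES kg_nodes(id),
            target_id TEXT REFERENCES kg_nodes(id)
        );
        INSERT INTO kg_nodes VALUES ('kg_a'), ('kg_b');
        INSERT INTO kg_aliases VALUES ('a', 'kg_a');
        INSERT INTO kg_edges VALUES ('kge_1', 'kg_a', 'kg_b');
        """
    )
    clear_kg(conn)
    remaining = sum(
        conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        for table in CLEAR_ORDER
    )
    conn.close()
    return remaining


if __name__ == "__main__":
    print(demo())
```

Deleting from `kg_nodes` first would fail in this toy schema because rows in `kg_edges` and `kg_aliases` still reference it.
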
# Create virtual environment
python -m venv venv
source venv/bin/activate # or venv\Scripts\activate on Windows
# Install all dependencies
pip install -r requirements.txt
# Set API key (Google AI Studio)
export GOOGLE_API_KEY="your-api-key"

- google-genai: Gemini API client
- fastapi: Web framework for chat API
- uvicorn: ASGI server
- pydantic: Data validation
- tenacity: Retry logic
- yt-dlp: Video metadata
- rapidfuzz: Fuzzy string matching
- psycopg: PostgreSQL client
- pgvector: Vector similarity search

- `chat_threads`: Conversation threads
- `chat_messages`: Individual messages with role and content
- `chat_thread_state`: Persisted state for follow-up questions

- `kg_nodes`: Knowledge graph nodes with embeddings
- `kg_aliases`: Normalized alias index
- `kg_edges`: Knowledge graph edges with provenance

- `paragraphs`: Transcript paragraphs with embeddings
- `sentences`: Individual sentences (no embeddings)
- `speakers`: Speaker information
- `speaker_video_roles`: Speaker roles in specific videos