Commit e44731c

docs: update architecture and API references
1 parent f39eb3b commit e44731c

6 files changed: +204 -75 lines changed


docs/CODE_MAP_AND_REVIEW.md

Lines changed: 90 additions & 1 deletion
@@ -26,6 +26,78 @@ YuhHearDem3 is a parliamentary transcription and knowledge graph system that pro
2626
└─────────────────────────────────────────────────────────────────────────────┘
2727
```
2828

29+
## Code Flow Diagram (Mermaid)
30+
31+
```mermaid
32+
flowchart LR
33+
subgraph Sources
34+
YT[YouTube or GCS video]
35+
OPDF[Order paper PDF]
36+
BillsSite[Parliament bills site]
37+
end
38+
39+
subgraph Transcription
40+
Transcribe[transcribe.py]
41+
JSONOut[transcription_output.json]
42+
end
43+
44+
subgraph TranscriptIngest
45+
IngestScript[scripts/ingest_transcript_json.py]
46+
Ingestor[lib/transcripts/ingestor.py]
47+
end
48+
49+
subgraph OrderPapers
50+
OPIngest[scripts/ingest_order_paper_pdf.py]
51+
OPParser[lib/order_papers/*.py]
52+
end
53+
54+
subgraph Bills
55+
BillIngest[scripts/ingest_bills.py]
56+
BillScraper[lib/scraping/bill_scraper.py]
57+
BillProcessor[lib/processors/bill_ingestor.py]
58+
end
59+
60+
subgraph KGExtraction
61+
KGVideo[scripts/kg_extract_from_video.py]
62+
KGBills[scripts/kg_extract_from_bills.py]
63+
WindowBuilder[lib/knowledge_graph/window_builder.py]
64+
BillWindowBuilder[lib/knowledge_graph/bill_window_builder.py]
65+
Extractor[lib/knowledge_graph/oss_kg_extractor.py + kg_extractor.py]
66+
KGStore[lib/knowledge_graph/kg_store.py]
67+
end
68+
69+
subgraph Storage["PostgreSQL + pgvector"]
70+
Tables[Transcript + search + KG tables]
71+
end
72+
73+
subgraph SearchAPI
74+
API[api/search_api.py]
75+
ChatAgent[lib/chat_agent_v2.py]
76+
AgentLoop[lib/kg_agent_loop.py]
77+
HybridRAG[lib/kg_hybrid_graph_rag.py]
78+
AdvSearch[lib/advanced_search_features.py]
79+
end
80+
81+
subgraph Frontend
82+
UI["frontend/src (Vite + React)"]
83+
end
84+
85+
YT --> Transcribe --> JSONOut --> IngestScript --> Ingestor --> Tables
86+
OPDF --> OPIngest --> OPParser --> Tables
87+
BillsSite --> BillIngest --> BillScraper --> BillProcessor --> Tables
88+
89+
Tables --> WindowBuilder --> KGVideo
90+
Tables --> BillWindowBuilder --> KGBills
91+
KGVideo --> Extractor --> KGStore --> Tables
92+
KGBills --> Extractor --> KGStore
93+
94+
Tables --> API
95+
API --> ChatAgent --> AgentLoop --> HybridRAG --> Tables
96+
API --> AdvSearch --> Tables
97+
UI --> API
98+
API --> UI
99+
```
100+
29101
## Code Map
30102

31103
### Entry Points
@@ -50,12 +122,15 @@ YuhHearDem3 is a parliamentary transcription and knowledge graph system that pro
50122

51123
| File | Lines | Purpose |
52124
|------|-------|---------|
125+
| `oss_kg_extractor.py` | ~800 | OSS KG extraction (two-pass) |
53126
| `oss_two_pass.py` | 677 | OSS two-pass entity extraction |
54127
| `window_builder.py` | 287 | Window-based processing for transcripts |
128+
| `bill_window_builder.py` | ~200 | Bill excerpt window construction |
55129
| `kg_store.py` | ~350 | KG storage operations |
56130
| `kg_extractor.py` | ~550 | Main KG extraction logic |
57131
| `base_kg_seeder.py` | ~300 | Base KG seeding |
58132
| `model_compare.py` | ~300 | Model comparison utilities |
133+
| `window_benchmark.py` | ~160 | Window performance benchmarks |
59134

60135
#### Order Papers (`lib/order_papers/`)
61136

@@ -74,6 +149,12 @@ YuhHearDem3 is a parliamentary transcription and knowledge graph system that pro
74149
|------|-------|---------|
75150
| `ingestor.py` | 433 | Transcript ingestion |
76151

152+
#### Embeddings (`lib/embeddings/`)
153+
154+
| File | Lines | Purpose |
155+
|------|-------|---------|
156+
| `google_client.py` | ~200 | Embedding generation client |
157+
77158
#### Database (`lib/db/`)
78159

79160
| File | Lines | Purpose |
@@ -101,18 +182,26 @@ YuhHearDem3 is a parliamentary transcription and knowledge graph system that pro
101182
| File | Lines | Purpose |
102183
|------|-------|---------|
103184
| `config.py` | 85 | Configuration management |
104-
| `roles.py` | ~50 | Role utilities |
185+
186+
#### Utilities (`lib/`)
187+
188+
| File | Lines | Purpose |
189+
|------|-------|---------|
105190
| `id_generators.py` | ~100 | ID generation utilities |
191+
| `roles.py` | ~120 | Speaker role normalization utilities |
106192

107193
### Scripts (`scripts/`)
108194

109195
| File | Purpose |
110196
|------|---------|
111197
| `kg_extract_from_video.py` | Extract KG from video |
198+
| `kg_extract_from_bills.py` | Extract KG from bill excerpts |
112199
| `cron_transcription.py` | Automated transcription jobs |
113200
| `migrate_chat_schema.py` | Chat schema migration |
114201
| `clear_kg.py` | Clear KG tables |
115202
| `ingest_order_paper_pdf.py` | Ingest order paper PDFs |
203+
| `ingest_transcript_json.py` | Ingest transcript JSON into Postgres |
204+
| `ingest_bills.py` | Scrape/process bills and ingest |
116205
| `ingest_knowledge_graph.py` | Ingest KG data |
117206
| `list_channel_videos.py` | List channel videos |
118207
| `match_order_papers_to_videos.py` | Match papers to videos |

docs/COMPLETE_GUIDE.md

Lines changed: 31 additions & 7 deletions
@@ -33,6 +33,8 @@ A hybrid vector/graph search and conversational AI system for Barbados Parliamen
3333
└─────────────────────────────────────────────────────────────────────────────┘
3434
```
3535

36+
The frontend is served by FastAPI from `frontend/dist` and talks to the same API origin.
37+
3638
---
3739

3840
## Core Modules
@@ -109,12 +111,26 @@ CHAT_TRACE=1 python -m uvicorn api.search_api:app --reload
109111

110112
```
111113
Video URL → yt-dlp metadata → Gemini API → Segment transcription → Speaker normalization → JSON output
114+
JSON output → scripts/ingest_transcript_json.py → Transcript tables (videos/paragraphs/sentences/entities)
115+
```
116+
117+
### Order Paper Flow
118+
119+
```
120+
Order paper PDF → scripts/ingest_order_paper_pdf.py → order_papers/order_paper_items → context + role seeding
121+
```
122+
123+
### Bill Ingestion Flow
124+
125+
```
126+
Bill site → scripts/ingest_bills.py → bills + bill_excerpts (embeddings)
112127
```
113128

114129
### Knowledge Graph Flow
115130

116131
```
117132
Transcript → Window Builder (30 utterances, stride 18) → LLM extraction → Canonicalization → KG Store → PostgreSQL
133+
Bill excerpts → BillWindowBuilder → LLM extraction → Canonicalization → KG Store → PostgreSQL
118134
```
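The window parameters in the Knowledge Graph Flow (30 utterances, stride 18) imply that consecutive windows overlap by 12 utterances. A minimal sketch of that sliding-window idea follows. `build_windows` is a hypothetical helper written for illustration; it is not taken from `lib/knowledge_graph/window_builder.py`, whose actual behavior (especially at the tail of a transcript) may differ.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def build_windows(items: Sequence[T], size: int = 30, stride: int = 18) -> List[List[T]]:
    """Split items into overlapping windows (hypothetical helper, not the project's API).

    Each window holds up to `size` items and starts `stride` items after the
    previous one, so neighbours share `size - stride` items.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")

    windows: List[List[T]] = []
    start = 0
    while start < len(items):
        windows.append(list(items[start:start + size]))
        # Stop once the current window already reaches the end of the input.
        if start + size >= len(items):
            break
        start += stride
    return windows


utterances = list(range(60))  # stand-in for 60 transcript utterances
windows = build_windows(utterances)
```

With the defaults, each window repeats the last 12 utterances of the previous one, which gives the extractor some shared context across window boundaries.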
119135

120136
### Chat Flow
@@ -155,14 +171,16 @@ User Query → Embedding → Vector Search → Graph Expansion → LLM Synthesis
155171
| Method | Path | Description |
156172
|--------|------|-------------|
157173
| POST | `/search` | Hybrid search (vector + graph + BM25) |
158-
| POST | `/chat` | Conversational AI with citations |
159-
| GET | `/chat/threads` | List chat threads |
160-
| POST | `/chat/threads` | Create new thread |
161-
| GET | `/chat/threads/{id}` | Get thread messages |
162-
| POST | `/chat/threads/{id}` | Add message to thread |
163-
| GET | `/graph` | Graph data for entity |
174+
| POST | `/search/temporal` | Search with date/speaker/entity filters |
175+
| GET | `/search/trends` | Trend analysis for entities |
164176
| GET | `/speakers` | List all speakers |
165-
| GET | `/speakers/{id}` | Speaker details |
177+
| GET | `/speakers/{speaker_id}` | Speaker details |
178+
| GET | `/videos/{youtube_video_id}/speakers/{speaker_id}/roles` | Speaker roles for a video |
179+
| POST | `/chat/threads` | Create new thread |
180+
| POST | `/chat/threads/{thread_id}/messages` | Add message to thread |
181+
| GET | `/chat/threads/{thread_id}/messages/stream` | Stream message response (SSE) |
182+
| GET | `/health` | Health check |
183+
| GET | `/api` | API metadata |
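The streaming endpoint above is described as SSE. As a rough sketch of what a client has to do with such a stream, the helper below groups `data:` lines into events following the generic Server-Sent Events framing rules. `iter_sse_data` is a hypothetical name, and the event payloads this API actually emits are not documented here, so the sample lines are placeholders.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event (hypothetical helper).

    Follows the generic SSE framing: `data:` lines accumulate, a blank line
    dispatches the event, and lines starting with `:` are comments.
    """
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment / keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "data":
            buffer.append(value)


# Placeholder stream; real payloads from the API may look different.
sample = ["data: hello", "", ": ping", "data: a", "data: b", ""]
events = list(iter_sse_data(sample))
```

In practice the lines would come from an HTTP client reading `/chat/threads/{thread_id}/messages/stream` incrementally rather than from a list.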
166184

167185
---
168186

@@ -171,11 +189,14 @@ User Query → Embedding → Vector Search → Graph Expansion → LLM Synthesis
171189
| Script | Purpose |
172190
|--------|---------|
173191
| `transcribe.py` | Main video transcription |
192+
| `scripts/ingest_transcript_json.py` | Ingest transcript JSON into Postgres |
174193
| `scripts/kg_extract_from_video.py` | Extract KG from video |
194+
| `scripts/kg_extract_from_bills.py` | Extract KG from bill excerpts |
175195
| `scripts/cron_transcription.py` | Automated transcription jobs |
176196
| `scripts/migrate_chat_schema.py` | Chat schema migration |
177197
| `scripts/clear_kg.py` | Clear KG tables |
178198
| `scripts/ingest_order_paper_pdf.py` | Ingest order paper PDFs |
199+
| `scripts/ingest_bills.py` | Scrape/process bills and ingest |
179200
| `scripts/list_channel_videos.py` | List channel videos |
180201

181202
---
@@ -195,9 +216,12 @@ User Query → Embedding → Vector Search → Graph Expansion → LLM Synthesis
195216
**transcribe.py:**
196217

197218
- `--order-file`: Path to order paper file
219+
- `--order-paper-id`: Order paper ID from database
198220
- `--segment-minutes`: Segment duration (default: 30)
221+
- `--overlap-minutes`: Segment overlap (default: 1)
199222
- `--start-minutes`: Start position (default: 0)
200223
- `--max-segments`: Limit segments processed
224+
- `--video`: YouTube ID/URL or gs:// URI
201225

202226
**kg_extract_from_video.py:**
203227

docs/QUICK_REFERENCE.md

Lines changed: 21 additions & 7 deletions
@@ -24,6 +24,9 @@ python transcribe.py --order-file order.txt --max-segments 2
2424
# Extract KG from video
2525
python scripts/kg_extract_from_video.py --youtube-video-id "VIDEO_ID"
2626

27+
# Extract KG from bill excerpts
28+
python scripts/kg_extract_from_bills.py --max-bills 10
29+
2730
# With custom window parameters
2831
python scripts/kg_extract_from_video.py --youtube-video-id "VIDEO_ID" --window-size 15 --stride 10
2932

@@ -71,6 +74,12 @@ python scripts/migrate_chat_schema.py
7174

7275
# Backfill speaker roles
7376
python scripts/backfill_speaker_video_roles.py
77+
78+
# Ingest transcript JSON into Postgres
79+
python scripts/ingest_transcript_json.py --transcript-file transcription_output.json --youtube-video-id "VIDEO_ID"
80+
81+
# Ingest bills into Postgres
82+
python scripts/ingest_bills.py --scrape
7483
```
7584

7685
### Order Papers
@@ -111,14 +120,16 @@ mypy lib/
111120
| Method | Endpoint | Description |
112121
|--------|----------|-------------|
113122
| POST | `/search` | Hybrid search |
114-
| POST | `/chat` | Conversational AI |
115-
| GET | `/chat/threads` | List threads |
116-
| POST | `/chat/threads` | Create thread |
117-
| GET | `/chat/threads/{id}` | Get thread |
118-
| POST | `/chat/threads/{id}` | Send message |
119-
| GET | `/graph` | Graph data |
123+
| POST | `/search/temporal` | Search with date/speaker/entity filters |
124+
| GET | `/search/trends` | Trend analysis for entities |
120125
| GET | `/speakers` | List speakers |
121-
| GET | `/speakers/{id}` | Speaker details |
126+
| GET | `/speakers/{speaker_id}` | Speaker details |
127+
| GET | `/videos/{youtube_video_id}/speakers/{speaker_id}/roles` | Speaker roles for a video |
128+
| POST | `/chat/threads` | Create thread |
129+
| POST | `/chat/threads/{thread_id}/messages` | Send message |
130+
| GET | `/chat/threads/{thread_id}/messages/stream` | Stream message response |
131+
| GET | `/health` | Health check |
132+
| GET | `/api` | API metadata |
122133

123134
## Environment Variables
124135

@@ -156,10 +167,13 @@ mypy lib/
156167
| Option | Default | Description |
157168
|--------|---------|-------------|
158169
| `--order-file` | Required | Path to order file |
170+
| `--order-paper-id` | None | Order paper ID from database |
159171
| `--segment-minutes` | 30 | Segment duration |
172+
| `--overlap-minutes` | 1 | Segment overlap |
160173
| `--start-minutes` | 0 | Start position |
161174
| `--max-segments` | None | Limit segments |
162175
| `--output-file` | Varies | Output file path |
176+
| `--video` | None | YouTube ID/URL or gs:// URI |
163177

164178
### kg_extract_from_video.py
165179

docs/README.md

Lines changed: 18 additions & 0 deletions
@@ -27,6 +27,12 @@ A comprehensive parliamentary transcription and search system that processes vid
2727
- **Follow-Up Suggestions**: Generates contextual follow-up questions
2828
- **Full Citation Tracing**: Every answer grounded in transcript evidence
2929

30+
### Frontend UI
31+
32+
- **React + Vite**: Single-page app served from `frontend/dist`
33+
- **Streaming Chat**: SSE-based progress updates
34+
- **Graph View**: Explore entity connections visually
35+
3036
### Search System
3137

3238
- **Hybrid Search**: Combines vector similarity, BM25 full-text, and graph traversal
@@ -56,6 +62,8 @@ A comprehensive parliamentary transcription and search system that processes vid
5662
└─────────────────────────────────────────────────────────────────────────┘
5763
```
5864

65+
The frontend is served by FastAPI from `frontend/dist` and talks to the same API origin.
66+
5967
## Quick Start
6068

6169
### Prerequisites
@@ -94,6 +102,12 @@ python transcribe.py --order-file order.txt --segment-minutes 30
94102
python scripts/kg_extract_from_video.py --youtube-video-id "VIDEO_ID"
95103
```
96104

105+
### Ingest Transcript JSON
106+
107+
```bash
108+
python scripts/ingest_transcript_json.py --transcript-file transcription_output.json --youtube-video-id "VIDEO_ID"
109+
```
110+
97111
### Start Chat API
98112

99113
```bash
@@ -111,6 +125,7 @@ YuhHearDem3/
111125
│ ├── kg_agent_loop.py # KG-powered agent loop
112126
│ ├── kg_hybrid_graph_rag.py # Hybrid Graph-RAG retrieval
113127
│ ├── advanced_search_features.py # Temporal search, trends, graph queries
128+
│ ├── embeddings/ # Embedding clients
114129
│ ├── knowledge_graph/
115130
│ │ ├── oss_two_pass.py # OSS two-pass extraction
116131
│ │ ├── window_builder.py # Window-based processing
@@ -127,6 +142,7 @@ YuhHearDem3/
127142
│ ├── cron_transcription.py # Automated transcription
128143
│ ├── migrate_chat_schema.py # Chat schema migration
129144
│ └── clear_kg.py # Clear KG tables
145+
├── frontend/ # React frontend (Vite)
130146
├── tests/ # Unit tests
131147
└── docs/ # Documentation
132148
```
@@ -139,6 +155,8 @@ YuhHearDem3/
139155
| [QUICK_REFERENCE.md](QUICK_REFERENCE.md) | Command quick reference |
140156
| [CHAT_TRACE.md](CHAT_TRACE.md) | Debug tracing documentation |
141157
| [DATE_NORMALIZATION.md](DATE_NORMALIZATION.md) | Date handling |
158+
| [CODE_MAP_AND_REVIEW.md](CODE_MAP_AND_REVIEW.md) | Code map and flow diagram |
159+
| [README_SEARCH_SYSTEM.md](README_SEARCH_SYSTEM.md) | Search system details |
142160

143161
## Technology Stack
144162

docs/README_SEARCH_SYSTEM.md

Lines changed: 12 additions & 3 deletions
@@ -137,15 +137,24 @@ LIMIT 50;
137137
}
138138
```
139139

140-
### POST /chat
140+
### POST /search/temporal
141141

142142
```json
143143
{
144-
"thread_id": "uuid",
145-
"message": "What did they say about healthcare?"
144+
"query": "healthcare reform",
145+
"limit": 20,
146+
"alpha": 0.6,
147+
"start_date": "2024-01-01",
148+
"end_date": "2024-12-31",
149+
"speaker_id": "s_john_doe_1",
150+
"entity_type": "schema:Legislation"
146151
}
147152
```
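To show how the request body above might be assembled by a client, here is a minimal sketch. The function name `build_temporal_search_payload` is hypothetical, and two behaviors are assumptions rather than documented API rules: unset optional filters are simply omitted, and a start date later than the end date is rejected on the client side.

```python
import json
from datetime import date
from typing import Any, Dict, Optional


def build_temporal_search_payload(
    query: str,
    limit: int = 20,
    alpha: float = 0.6,
    start_date: Optional[date] = None,
    end_date: Optional[date] = None,
    speaker_id: Optional[str] = None,
    entity_type: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a JSON-serializable body for POST /search/temporal (hypothetical helper)."""
    if start_date and end_date and start_date > end_date:
        raise ValueError("start_date must not be after end_date")

    payload: Dict[str, Any] = {"query": query, "limit": limit, "alpha": alpha}
    optional = {
        "start_date": start_date.isoformat() if start_date else None,
        "end_date": end_date.isoformat() if end_date else None,
        "speaker_id": speaker_id,
        "entity_type": entity_type,
    }
    # Assumption: filters that are not set can be left out of the body.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


payload = build_temporal_search_payload(
    "healthcare reform",
    start_date=date(2024, 1, 1),
    end_date=date(2024, 12, 31),
    speaker_id="s_john_doe_1",
    entity_type="schema:Legislation",
)
body = json.dumps(payload)  # string to send as the request body
```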
148153

154+
### GET /search/trends
155+
156+
Query params: `entity_id`, `days`, `window_size`
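As an illustration of how the three query parameters combine into a request URL, the sketch below uses only the standard library. The base URL, the helper name `build_trends_url`, and the example values are assumptions; valid ranges and defaults for `days` and `window_size` are not stated in this document.

```python
from urllib.parse import urlencode


def build_trends_url(
    entity_id: str,
    days: int,
    window_size: int,
    base_url: str = "http://localhost:8000",  # assumed local dev address
) -> str:
    """Build a GET /search/trends URL (hypothetical helper)."""
    params = {"entity_id": entity_id, "days": days, "window_size": window_size}
    return f"{base_url}/search/trends?{urlencode(params)}"


url = build_trends_url("example_entity", days=30, window_size=7)
```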
157+
149158
## Performance
150159

151160
| Operation | Typical Latency |
