Description
Summary
Add first-class bill text excerpts as retrievable and citable evidence in chat answers, while keeping existing transcript utterance citations working unchanged. Use precomputed embeddings for bill excerpts for speed and stable citations.
Problem
Currently:
- Bills are scraped and stored in DB with `source_text` available (schema/init.sql:129)
- `kg_hybrid_graph_rag` only returns transcript utterance citations from `sentences` table - no bill-document citations
- Chat sources are utterance-centric (`utterance_id`, youtube timestamp) - no bill evidence
- `TranscriptIngestor` creates bill rows from transcript legislation but `source_text` is polluted with `"audio"`/`"visual"` modality strings instead of actual bill text
Goal
- Make bill text excerpts retrievable and citable as first-class evidence
- Keep existing transcript `utterance` citations working unchanged
- Use precomputed embeddings for bill excerpts (no on-demand embedding at query time)
Success Criteria
- Query about a bill returns at least one bill excerpt source when available
- Chat can cite both transcript utterances and bill excerpts in one answer
- Existing chat clients do not break if they only understand utterance sources
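- End-to-end latency remains acceptable: bill excerpt embeddings are precomputed, so retrieval adds no embedding of bill text at query time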
Implementation Plan
Phase 1: Data Model + Migration
- Add new table `bill_excerpts`:
  - `id TEXT PRIMARY KEY` (stable ID: `bex_<bill_id>_<chunk_index>`)
  - `bill_id TEXT NOT NULL` FK -> `bills(id)`
  - `chunk_index INTEGER NOT NULL`
  - `text TEXT NOT NULL`
  - `char_start INTEGER`, `char_end INTEGER`
  - `embedding vector(768)` (precomputed)
  - `tsv tsvector`
  - `source_url TEXT`
  - `created_at`, `updated_at`
  - unique `(bill_id, chunk_index)`
- Indexes: ivfflat on `embedding`, GIN on `tsv`, btree on `bill_id`
- Trigger: `bill_excerpts_tsv_trigger()` to auto-populate `tsv` from `text`
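A possible shape for the Phase 1 migration, held as a Python string so a migration runner could apply it. This is a sketch, not the final schema: only the column list, index kinds, and trigger name come from the plan above. The index and trigger object names, the `TIMESTAMPTZ` types, the `vector_cosine_ops` operator class, the `'english'` text search config, and the `excerpt_id` helper are assumptions. It also assumes the pgvector extension is installed and PostgreSQL 11+ (for `EXECUTE FUNCTION`).

```python
# Sketch of the bill_excerpts migration. Names marked "assumed" are not in the plan.
BILL_EXCERPTS_DDL = """
CREATE TABLE IF NOT EXISTS bill_excerpts (
    id          TEXT PRIMARY KEY,                      -- bex_<bill_id>_<chunk_index>
    bill_id     TEXT NOT NULL REFERENCES bills(id),
    chunk_index INTEGER NOT NULL,
    text        TEXT NOT NULL,
    char_start  INTEGER,
    char_end    INTEGER,
    embedding   vector(768),                           -- precomputed
    tsv         tsvector,
    source_url  TEXT,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),    -- type assumed
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT now(),    -- type assumed
    UNIQUE (bill_id, chunk_index)
);

-- Index names and the cosine operator class are assumed.
CREATE INDEX IF NOT EXISTS bill_excerpts_embedding_idx
    ON bill_excerpts USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX IF NOT EXISTS bill_excerpts_tsv_idx
    ON bill_excerpts USING gin (tsv);
CREATE INDEX IF NOT EXISTS bill_excerpts_bill_id_idx
    ON bill_excerpts (bill_id);

CREATE OR REPLACE FUNCTION bill_excerpts_tsv_trigger() RETURNS trigger AS $$
BEGIN
    NEW.tsv := to_tsvector('english', NEW.text);       -- text search config assumed
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER bill_excerpts_tsv_update
    BEFORE INSERT OR UPDATE ON bill_excerpts
    FOR EACH ROW EXECUTE FUNCTION bill_excerpts_tsv_trigger();
"""


def excerpt_id(bill_id: str, chunk_index: int) -> str:
    """Build the stable excerpt ID described in the plan (helper name is assumed)."""
    return f"bex_{bill_id}_{chunk_index}"
```

Because the ID is derived only from `bill_id` and `chunk_index`, it stays stable across re-runs as long as chunking is deterministic.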
Phase 2: Chunking + Embedding Pipeline
- Create chunker module: `lib/bills/excerpt_chunker.py`
  - Deterministic chunking (for stable IDs)
  - Default: split by paragraph, merge/split to ~900 chars, 150 char overlap
  - Skip tiny/noisy chunks, preserve offsets
- Extend `BillIngestor` in `lib/processors/bill_ingestor.py`:
  - After bill upsert, build chunks from `source_text` (fallback to `description`)
  - Batch-generate embeddings, upsert `bill_excerpts`
  - Safe re-run: upsert by `(bill_id, chunk_index)`
- Fix transcript-derived bill writes in `lib/transcripts/ingestor.py`:
  - Stop setting `source_text` to `"audio"`/`"visual"` modality strings
  - Set `source_text` only when real textual content exists
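One way the chunker could work, as a minimal sketch rather than the planned implementation. It splits on blank lines, hard-splits oversized paragraphs, greedily merges neighbours up to the target size, then extends each chunk backwards for overlap. The names `Chunk` and `chunk_text`, and the `min_chars=40` cutoff for "tiny" chunks, are assumptions; the 900/150 defaults come from the plan.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Chunk:
    chunk_index: int
    text: str
    char_start: int  # offsets into the original source text
    char_end: int


def chunk_text(source: str, target: int = 900, overlap: int = 150,
               min_chars: int = 40) -> List[Chunk]:
    # 1. Paragraph spans (blank-line separated); hard-split any longer than target.
    pieces: List[Tuple[int, int]] = []
    for m in re.finditer(r"\S[\s\S]*?(?=\n\s*\n|\Z)", source):
        start = m.start()
        end = start + len(m.group().rstrip())
        for s in range(start, end, target):
            pieces.append((s, min(s + target, end)))

    # 2. Greedily merge adjacent pieces while the window stays within target.
    windows: List[Tuple[int, int]] = []
    for s, e in pieces:
        if windows and e - windows[-1][0] <= target:
            windows[-1] = (windows[-1][0], e)
        else:
            windows.append((s, e))

    # 3. Add overlap by pulling each start back, never past the previous window start.
    chunks: List[Chunk] = []
    for i, (s, e) in enumerate(windows):
        start = s if i == 0 else max(windows[i - 1][0], s - overlap)
        text = source[start:e]
        if len(text.strip()) < min_chars:  # skip tiny chunks (threshold assumed)
            continue
        chunks.append(Chunk(len(chunks), text, start, e))
    return chunks
```

The function has no randomness or external state, so the same `source_text` always yields the same `(chunk_index, char_start, char_end)` triples, which is what the stable IDs and the upsert by `(bill_id, chunk_index)` rely on. Note that a chunk can exceed the target by up to the overlap length.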
Phase 3: Backfill Existing Bills
Add script: scripts/backfill_bill_excerpts.py
- Scan `bills` where `source_text` or `description` has usable content
- Chunk, embed, upsert
- Flags: `--max-bills`, `--rebuild`, `--skip-embeddings`, `--only-missing`
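A rough sketch of the script's command-line surface and its "usable content" check. The four flag names come from the plan; their help text and exact semantics, the `build_parser` and `usable_text` names, and the assumption that polluted rows hold exactly `"audio"` or `"visual"` are all guesses to be confirmed against the real data.

```python
import argparse
from typing import Optional

# Modality strings that TranscriptIngestor is known to have written into source_text.
MODALITY_NOISE = {"audio", "visual"}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Backfill bill_excerpts from existing bills rows.")
    parser.add_argument("--max-bills", type=int, default=None,
                        help="stop after this many bills (assumed meaning)")
    parser.add_argument("--rebuild", action="store_true",
                        help="re-chunk bills that already have excerpts (assumed meaning)")
    parser.add_argument("--skip-embeddings", action="store_true",
                        help="write excerpt rows without computing embeddings (assumed meaning)")
    parser.add_argument("--only-missing", action="store_true",
                        help="process only bills with no excerpts yet (assumed meaning)")
    return parser


def usable_text(source_text: Optional[str], description: Optional[str]) -> Optional[str]:
    """Prefer source_text, fall back to description, reject empty or modality-only values."""
    for candidate in (source_text, description):
        cleaned = (candidate or "").strip()
        if cleaned and cleaned.lower() not in MODALITY_NOISE:
            return cleaned
    return None
```

For example, `build_parser().parse_args(["--max-bills", "10", "--only-missing"])` would limit a run to ten bills that have no excerpts yet.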
Phase 4: Retrieval Integration (Hybrid Graph-RAG)
- Extend `lib/kg_hybrid_graph_rag.py`:
  - Add `_retrieve_bill_excerpts(...)`: vector similarity + BM25/FTS
  - Optional boost for seed legislation nodes
  - Add `bill_citations` to tool output with: `citation_id`, `bill_id`, `bill_number`, `bill_title`, `excerpt`, `source_url`, `score`
- Add knobs: `max_bill_citations` (default 8)
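The plan does not say how the vector and FTS result lists are combined. One common option is reciprocal rank fusion, sketched below as an illustration only. The function name `fuse_bill_hits`, the `k=60` constant, and the `seed_boost=1.25` multiplier are assumptions; `max_bill_citations=8` is the default from the plan. Inputs are excerpt IDs in the `bex_<bill_id>_<chunk_index>` format, already ranked best-first by each retriever.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple


def fuse_bill_hits(vector_hits: Sequence[str],
                   fts_hits: Sequence[str],
                   seed_bill_ids: FrozenSet[str] = frozenset(),
                   max_bill_citations: int = 8,
                   k: int = 60,
                   seed_boost: float = 1.25) -> List[Tuple[str, float]]:
    """Merge two ranked lists of excerpt IDs with reciprocal rank fusion."""
    scores: Dict[str, float] = {}
    for hits in (vector_hits, fts_hits):
        for rank, excerpt_id in enumerate(hits, start=1):
            scores[excerpt_id] = scores.get(excerpt_id, 0.0) + 1.0 / (k + rank)

    # Optional boost for excerpts whose bill is a seed legislation node.
    for excerpt_id in scores:
        bill_id = excerpt_id[len("bex_"):].rsplit("_", 1)[0]
        if bill_id in seed_bill_ids:
            scores[excerpt_id] *= seed_boost

    # Sort by score descending, then by ID so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_bill_citations]
```

Rank-based fusion avoids having to normalize cosine distances against FTS scores, at the cost of discarding score magnitudes; a weighted sum of normalized scores would be an equally valid choice.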
Phase 5: Chat Source/Citation Model Upgrade
- Update `lib/chat_agent_v2.py`:
  - Add `source_kind` enum: `utterance` | `bill_excerpt`
  - Add bill fields to source model
  - Support `#src:bill:<bill_id>:<chunk_index>` citation IDs
  - Merge transcript + bill citations in `_sources_from_retrieval`
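A small sketch of the `source_kind` enum and a parser for the `#src:bill:<bill_id>:<chunk_index>` citation format. The class and function names are assumptions, and the existing utterance citation format is deliberately left out since it is not specified here.

```python
import re
from enum import Enum
from typing import Optional, Tuple


class SourceKind(str, Enum):
    UTTERANCE = "utterance"
    BILL_EXCERPT = "bill_excerpt"


# Greedy bill_id so that IDs containing ':' still parse; chunk_index is the last field.
_BILL_CITATION_RE = re.compile(r"#src:bill:(?P<bill_id>.+):(?P<chunk_index>\d+)")


def format_bill_citation(bill_id: str, chunk_index: int) -> str:
    return f"#src:bill:{bill_id}:{chunk_index}"


def parse_bill_citation(citation_id: str) -> Optional[Tuple[str, int]]:
    """Return (bill_id, chunk_index), or None if this is not a bill citation."""
    match = _BILL_CITATION_RE.fullmatch(citation_id)
    if match is None:
        return None
    return match.group("bill_id"), int(match.group("chunk_index"))
```

Returning `None` for anything that does not match lets the caller fall through to the existing utterance handling unchanged.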
Phase 6: Agent Prompt + Tool Contract
- Update `lib/kg_agent_loop.py` tool schema with `max_bill_citations`
- Update system instructions to encourage bill-excerpt citations for bill-content questions
Phase 7: API + Frontend Compatibility
- Update `api/search_api.py` `ChatSource` model with optional bill fields + `source_kind`
- Frontend: show source badge, bill card with title + excerpt + link
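To illustrate the compatibility goal, here is a sketch of a source model where every bill field is optional and `source_kind` defaults to `utterance`. It uses a plain dataclass to stay self-contained; the real `ChatSource` in `api/search_api.py` may be built differently, and the `to_payload` helper is hypothetical. The bill field names mirror the `bill_citations` fields from Phase 4.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ChatSource:
    source_kind: str = "utterance"       # "utterance" | "bill_excerpt"
    utterance_id: Optional[str] = None
    # Bill fields are optional so utterance-only payloads keep their shape.
    bill_id: Optional[str] = None
    bill_number: Optional[str] = None
    bill_title: Optional[str] = None
    excerpt: Optional[str] = None
    source_url: Optional[str] = None


def to_payload(source: ChatSource) -> Dict[str, Any]:
    """Serialize a source, omitting unset fields (hypothetical helper)."""
    return {key: value for key, value in asdict(source).items() if value is not None}
```

With this shape an utterance source gains only the additive `source_kind` key, so a client that ignores unknown fields keeps working, while a client that knows `source_kind` can render the bill card.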
Phase 8: Tests
- Unit: chunker, upsert idempotency, retrieval ranking, citation parsing, mixed source serialization
- Integration: seed bill, query, verify bill_citations returned
- Regression: utterance-only flows unchanged
Phase 9: Rollout
- Feature flag: `ENABLE_BILL_EVIDENCE` (default off)
- Deploy sequence:
  - schema migration
  - ingestion + retrieval code
  - backfill excerpts
  - enable flag in staging, validate
  - enable in prod
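A minimal sketch of reading the flag, assuming it is supplied as an environment variable. The function name and the set of accepted truthy values are assumptions; only the flag name and the default-off behaviour come from the plan.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}  # accepted spellings are assumed


def bill_evidence_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when ENABLE_BILL_EVIDENCE is explicitly switched on."""
    env = os.environ if env is None else env
    return env.get("ENABLE_BILL_EVIDENCE", "").strip().lower() in _TRUTHY
```

Treating a missing or unrecognized value as off means the retrieval and chat code can ship before the backfill finishes without changing behaviour.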
Files Likely Touched
- `schema/init.sql`
- `schema/migrations/<new>_bill_excerpts.sql`
- `lib/processors/bill_ingestor.py`
- `lib/transcripts/ingestor.py`
- `lib/kg_hybrid_graph_rag.py`
- `lib/kg_agent_loop.py`
- `lib/chat_agent_v2.py`
- `api/search_api.py`
- `frontend/src/App.tsx`
- new: `lib/bills/excerpt_chunker.py`
- new: `scripts/backfill_bill_excerpts.py`
- tests under `tests/`