Implement bill excerpts as citable evidence in chat #3

@hammertoe

Summary

Add first-class bill text excerpts as retrievable and citable evidence in chat answers, while keeping existing transcript utterance citations working unchanged. Use precomputed embeddings for bill excerpts for speed and stable citations.

Problem

Currently:

  • Bills are scraped and stored in DB with source_text available (schema/init.sql:129)
  • kg_hybrid_graph_rag only returns transcript utterance citations from sentences table - no bill-document citations
  • Chat sources are utterance-centric (utterance_id, youtube timestamp) - no bill evidence
  • TranscriptIngestor creates bill rows from transcript legislation but source_text is polluted with "audio"/"visual" modality strings instead of actual bill text

Goal

  • Make bill text excerpts retrievable and citable as first-class evidence
  • Keep existing transcript utterance citations working unchanged
  • Use precomputed embeddings for bill excerpts (no on-demand embedding at query time)

Success Criteria

  • Query about a bill returns at least one bill excerpt source when available
  • Chat can cite both transcript utterances and bill excerpts in one answer
  • Existing chat clients do not break if they only understand utterance sources
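  • End-to-end chat latency does not regress noticeably once bill retrieval is enabled (excerpt embeddings are precomputed, not generated at query time)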

Implementation Plan

Phase 1: Data Model + Migration

  1. Add new table bill_excerpts:

    • id TEXT PRIMARY KEY (stable ID: bex_<bill_id>_<chunk_index>)
    • bill_id TEXT NOT NULL FK -> bills(id)
    • chunk_index INTEGER NOT NULL
    • text TEXT NOT NULL
    • char_start INTEGER, char_end INTEGER
    • embedding vector(768) (precomputed)
    • tsv tsvector
    • source_url TEXT
    • created_at, updated_at
    • unique (bill_id, chunk_index)
  2. Indexes: ivfflat on embedding, GIN on tsv, btree on bill_id

  3. Trigger: bill_excerpts_tsv_trigger() to auto-populate tsv from text
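
As a rough illustration of the stable ID scheme above, the sketch below builds `bex_<bill_id>_<chunk_index>` IDs in Python. The `excerpt_id` helper and the `BillExcerptRow` dataclass are hypothetical names for this example only; the field names follow the column list above, and the embedding and tsv columns are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional


def excerpt_id(bill_id: str, chunk_index: int) -> str:
    """Build the stable excerpt ID in the form bex_<bill_id>_<chunk_index>."""
    if chunk_index < 0:
        raise ValueError("chunk_index must be non-negative")
    return f"bex_{bill_id}_{chunk_index}"


@dataclass(frozen=True)
class BillExcerptRow:
    """In-memory mirror of a bill_excerpts row (hypothetical helper type)."""

    bill_id: str
    chunk_index: int
    text: str
    char_start: Optional[int] = None
    char_end: Optional[int] = None
    source_url: Optional[str] = None

    @property
    def id(self) -> str:
        # The ID is derived, never stored separately, so it cannot drift
        # from the (bill_id, chunk_index) unique key.
        return excerpt_id(self.bill_id, self.chunk_index)
```

Deriving the ID from the unique key means a re-run over the same bill yields the same IDs as long as chunking is deterministic.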

Phase 2: Chunking + Embedding Pipeline

  1. Create chunker module: lib/bills/excerpt_chunker.py

    • Deterministic chunking (for stable IDs)
    • Default: split by paragraph, merge/split to ~900 chars, 150 char overlap
    • Skip tiny/noisy chunks, preserve offsets
  2. Extend BillIngestor in lib/processors/bill_ingestor.py:

    • After bill upsert, build chunks from source_text (fallback to description)
    • Batch-generate embeddings, upsert bill_excerpts
    • Safe re-run: upsert by (bill_id, chunk_index)
  3. Fix transcript-derived bill writes in lib/transcripts/ingestor.py:

    • Stop setting source_text to "audio"/"visual" modality strings
    • Set source_text only when real textual content exists
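
A minimal sketch of what the deterministic chunker could look like, assuming paragraphs are separated by blank lines and offsets are character positions into source_text. The names `Chunk`, `chunk_text`, and the `min_chars` threshold of 40 are assumptions for illustration; only the ~900 char target and 150 char overlap come from the plan above.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    chunk_index: int
    text: str
    char_start: int  # inclusive offset into the source text
    char_end: int    # exclusive offset into the source text


_PARA_BREAK = re.compile(r"\n\s*\n")


def chunk_text(source: str, target: int = 900, overlap: int = 150,
               min_chars: int = 40) -> List[Chunk]:
    """Split text into overlapping chunks, preferring paragraph boundaries."""
    if not 0 <= overlap < target:
        raise ValueError("overlap must satisfy 0 <= overlap < target")
    n = len(source)
    # Candidate cut points: the start of each paragraph break, plus end of text.
    cuts = [m.start() for m in _PARA_BREAK.finditer(source)] + [n]
    chunks: List[Chunk] = []
    pos = 0  # end of the previous chunk; every new chunk must end beyond it
    while pos < n:
        start = max(pos - overlap, 0)
        limit = start + target
        # Take the last paragraph boundary that fits; otherwise hard-split.
        fitting = [c for c in cuts if pos < c <= limit]
        end = fitting[-1] if fitting else min(limit, n)
        piece = source[start:end]
        stripped = piece.strip()
        if len(stripped) >= min_chars:  # skip tiny or whitespace-only chunks
            lead = len(piece) - len(piece.lstrip())
            char_start = start + lead
            chunks.append(Chunk(len(chunks), stripped, char_start,
                                char_start + len(stripped)))
        pos = end
    return chunks
```

The function is pure, so the same source_text always yields the same chunk indexes and offsets. One caveat of this sketch: the overlap is measured in characters, so a chunk may begin mid-word, and changing `target`, `overlap`, or `min_chars` renumbers chunks and would call for a rebuild.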

Phase 3: Backfill Existing Bills

Add script: scripts/backfill_bill_excerpts.py

  • Scan bills where source_text or description has usable content
  • Chunk, embed, upsert
  • Flags: --max-bills, --rebuild, --skip-embeddings, --only-missing
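
The flag set above might be wired up with argparse roughly as follows. The help strings are guesses at the intended semantics, not confirmed behaviour, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI for scripts/backfill_bill_excerpts.py (flag meanings are assumed)."""
    parser = argparse.ArgumentParser(
        prog="backfill_bill_excerpts.py",
        description="Chunk, embed, and upsert excerpts for existing bills.",
    )
    parser.add_argument("--max-bills", type=int, default=None,
                        help="stop after this many bills (default: no limit)")
    parser.add_argument("--rebuild", action="store_true",
                        help="regenerate excerpts even if they already exist")
    parser.add_argument("--skip-embeddings", action="store_true",
                        help="write chunks without computing embeddings")
    parser.add_argument("--only-missing", action="store_true",
                        help="process only bills that have no excerpts yet")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```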

Phase 4: Retrieval Integration (Hybrid Graph-RAG)

  1. Extend lib/kg_hybrid_graph_rag.py:

    • Add _retrieve_bill_excerpts(...): vector similarity + BM25/FTS
    • Optional boost for seed legislation nodes
    • Add bill_citations to tool output with: citation_id, bill_id, bill_number, bill_title, excerpt, source_url, score
  2. Add a tuning knob: max_bill_citations (default 8)
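
One possible way to combine the two retrieval signals is a weighted sum with an additive seed boost, sketched below. This is an assumption, not the planned scoring: the plan only says "vector similarity + BM25/FTS" with an optional boost. The names `rank_bill_excerpts` and `RankedExcerpt`, and the weights 0.7 and 0.1, are hypothetical. Inputs are assumed to be per-excerpt scores already fetched from the database, keyed by (bill_id, chunk_index).

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

ExcerptKey = Tuple[str, int]  # (bill_id, chunk_index)


@dataclass(frozen=True)
class RankedExcerpt:
    bill_id: str
    chunk_index: int
    score: float


def _scale_to_unit(scores: Dict[ExcerptKey, float]) -> Dict[ExcerptKey, float]:
    """Divide by the top score so both signals land in [0, 1]."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {key: 0.0 for key in scores}
    return {key: max(value, 0.0) / top for key, value in scores.items()}


def rank_bill_excerpts(vector_scores: Dict[ExcerptKey, float],
                       fts_scores: Dict[ExcerptKey, float],
                       seed_bill_ids: Set[str],
                       max_bill_citations: int = 8,
                       vector_weight: float = 0.7,
                       seed_boost: float = 0.1) -> List[RankedExcerpt]:
    vec = _scale_to_unit(vector_scores)
    fts = _scale_to_unit(fts_scores)
    ranked = []
    for key in set(vec) | set(fts):
        bill_id, chunk_index = key
        score = (vector_weight * vec.get(key, 0.0)
                 + (1.0 - vector_weight) * fts.get(key, 0.0))
        if bill_id in seed_bill_ids:
            score += seed_boost  # favour bills that seeded the graph query
        ranked.append(RankedExcerpt(bill_id, chunk_index, score))
    # Tie-break on the key so the output order is deterministic.
    ranked.sort(key=lambda r: (-r.score, r.bill_id, r.chunk_index))
    return ranked[:max_bill_citations]
```

Scaling each signal by its own maximum keeps raw FTS ranks, which are unbounded, from swamping cosine similarity. Reciprocal rank fusion would be a reasonable alternative if raw scores prove hard to compare.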

Phase 5: Chat Source/Citation Model Upgrade

  1. Update lib/chat_agent_v2.py:
    • Add source_kind enum: utterance | bill_excerpt
    • Add bill fields to source model
    • Support #src:bill:<bill_id>:<chunk_index> citation IDs
    • Merge transcript + bill citations in _sources_from_retrieval
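
To make the `#src:bill:<bill_id>:<chunk_index>` format concrete, here is a small parsing sketch. The function and type names are hypothetical. It assumes bill IDs may themselves contain colons, so the chunk index is taken from the final colon-separated field.

```python
import re
from typing import NamedTuple, Optional


class BillCitation(NamedTuple):
    bill_id: str
    chunk_index: int


# Greedy bill_id plus an anchored numeric suffix: the last ":<digits>" is
# always read as the chunk index, even if bill_id contains colons.
_BILL_CITATION = re.compile(r"^#src:bill:(?P<bill_id>.+):(?P<chunk_index>\d+)$")


def format_bill_citation(bill_id: str, chunk_index: int) -> str:
    return f"#src:bill:{bill_id}:{chunk_index}"


def parse_bill_citation(token: str) -> Optional[BillCitation]:
    """Return the parsed citation, or None if the token is not a bill citation."""
    match = _BILL_CITATION.match(token.strip())
    if match is None:
        return None
    return BillCitation(match["bill_id"], int(match["chunk_index"]))
```

Returning None for anything that is not a bill citation lets the existing utterance citation handling run unchanged on every other token.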

Phase 6: Agent Prompt + Tool Contract

  1. Update lib/kg_agent_loop.py tool schema with max_bill_citations

  2. Update system instructions to encourage bill-excerpt citations for bill-content questions

Phase 7: API + Frontend Compatibility

  1. Update api/search_api.py ChatSource model with optional bill fields + source_kind

  2. Frontend: show source badge, bill card with title + excerpt + link
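
A sketch of how the source model could stay backward compatible: bill fields are optional, source_kind defaults to utterance, and unset fields are dropped from the payload. A plain dataclass stands in for whatever model class api/search_api.py actually uses. `youtube_timestamp` and `to_payload` are hypothetical names; the bill field names follow the bill_citations list in Phase 4.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Literal, Optional

SourceKind = Literal["utterance", "bill_excerpt"]


@dataclass
class ChatSource:
    source_kind: SourceKind = "utterance"
    # Existing utterance fields (names assumed for this sketch).
    utterance_id: Optional[str] = None
    youtube_timestamp: Optional[str] = None
    # New, optional bill fields.
    bill_id: Optional[str] = None
    bill_number: Optional[str] = None
    bill_title: Optional[str] = None
    excerpt: Optional[str] = None
    source_url: Optional[str] = None

    def to_payload(self) -> Dict[str, Any]:
        # Omit unset fields so utterance payloads gain only source_kind.
        return {k: v for k, v in asdict(self).items() if v is not None}


def utterance_sources_only(sources: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """What an older client could do: ignore kinds it does not understand."""
    return [s for s in sources if s.get("source_kind", "utterance") == "utterance"]
```

Treating a missing source_kind as utterance keeps responses produced before the upgrade valid as well.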

Phase 8: Tests

  • Unit: chunker, upsert idempotency, retrieval ranking, citation parsing, mixed source serialization
  • Integration: seed bill, query, verify bill_citations returned
  • Regression: utterance-only flows unchanged

Phase 9: Rollout

  1. Feature flag: ENABLE_BILL_EVIDENCE (default off)

  2. Deploy sequence:

    • schema migration
    • ingestion + retrieval code
    • backfill excerpts
    • enable flag in staging, validate
    • enable in prod
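
If ENABLE_BILL_EVIDENCE is read from the process environment (an assumption; the plan does not say where the flag lives), the check could be as small as this. `bill_evidence_enabled` is a hypothetical helper name.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


def bill_evidence_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when ENABLE_BILL_EVIDENCE is explicitly switched on."""
    env = os.environ if env is None else env
    # A missing or unrecognised value means off, matching "default off".
    return env.get("ENABLE_BILL_EVIDENCE", "").strip().lower() in _TRUTHY
```

Passing the mapping in explicitly keeps the check easy to exercise in tests without touching the real environment.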

Files Likely Touched

  • schema/init.sql
  • schema/migrations/<new>_bill_excerpts.sql
  • lib/processors/bill_ingestor.py
  • lib/transcripts/ingestor.py
  • lib/kg_hybrid_graph_rag.py
  • lib/kg_agent_loop.py
  • lib/chat_agent_v2.py
  • api/search_api.py
  • frontend/src/App.tsx
  • new: lib/bills/excerpt_chunker.py
  • new: scripts/backfill_bill_excerpts.py
  • tests under tests/
