Plan: Add bill sliding-window KG extraction and graph integration #5

@hammertoe

Context

Bills and bill excerpts are currently ingested and vector-searchable, but bill concepts and relationships are not extracted into knowledge graph (KG) nodes and edges the way transcript windows are.

Goal: make bill concepts first-class in the knowledge graph using a sliding-window pipeline, with provenance and citations that work end-to-end in chat/retrieval.

Proposed approach

Phase 1: Bill window extraction MVP

  • Add BillWindowBuilder to construct overlapping windows from bill_excerpts per bill.
    • Default: window size 4 chunks, stride 2 chunks.
  • Reuse existing KG extractor flow (OssKGExtractor) against bill windows.
  • Add script: scripts/kg_extract_from_bills.py to run for one bill or all bills.
  • Store extracted nodes/edges through canonical store pipeline.
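A minimal sketch of the windowing step, assuming each bill's excerpts arrive as an ordered list of chunk strings. The BillWindow shape and the build_bill_windows function name are hypothetical and only illustrate the default of 4 chunks per window with a stride of 2; the real BillWindowBuilder may look different.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class BillWindow:
    """Hypothetical container for one overlapping window of bill chunks."""
    bill_id: str
    chunk_indices: Tuple[int, ...]  # positions of the chunks inside the bill
    text: str


def build_bill_windows(
    bill_id: str,
    chunks: Sequence[str],
    window_size: int = 4,
    stride: int = 2,
) -> List[BillWindow]:
    """Slide a fixed-size window over the chunks of a single bill.

    The last window may be shorter than window_size so that trailing
    chunks are never dropped.
    """
    if window_size < 1 or stride < 1:
        raise ValueError("window_size and stride must be positive")

    windows: List[BillWindow] = []
    start = 0
    while start < len(chunks):
        end = min(start + window_size, len(chunks))
        windows.append(
            BillWindow(
                bill_id=bill_id,
                chunk_indices=tuple(range(start, end)),
                text="\n".join(chunks[start:end]),
            )
        )
        if end == len(chunks):
            break  # every chunk is covered; avoid emitting a redundant tail window
        start += stride
    return windows
```

With the defaults, a six-chunk bill yields two windows covering chunks 0-3 and 2-5, so each window shares two chunks with its neighbour.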

Phase 2: Provenance model upgrade (recommended now)

  • Extend edge provenance to support both transcript and bill evidence without overloading transcript-only fields.
  • Add edge-level provenance fields (example):
    • source_kind (transcript | bill)
    • source_ref_id (video id or bill id)
    • evidence_ids (utterance ids or bill chunk ids)
  • Keep backward compatibility with existing transcript edges.
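One way the provenance fields above could be modelled, with a fallback for existing transcript edges. This is a sketch only: the EdgeProvenance class and the legacy column names video_id and utterance_ids are assumptions for illustration, not the current schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

VALID_SOURCE_KINDS = ("transcript", "bill")


@dataclass(frozen=True)
class EdgeProvenance:
    """Source-agnostic evidence pointer for one KG edge."""
    source_kind: str                # "transcript" or "bill"
    source_ref_id: str              # video id or bill id
    evidence_ids: Tuple[str, ...]   # utterance ids or bill chunk ids

    def __post_init__(self) -> None:
        if self.source_kind not in VALID_SOURCE_KINDS:
            raise ValueError(f"unknown source_kind: {self.source_kind!r}")
        if not self.evidence_ids:
            raise ValueError("an edge needs at least one evidence id")


def provenance_from_edge_row(row: Dict[str, Any]) -> EdgeProvenance:
    """Build provenance from a stored edge row.

    Rows written after the upgrade carry the new fields. Older rows are
    assumed (hypothetically) to hold only video_id and utterance_ids, and
    are read as transcript evidence so existing edges keep working.
    """
    if row.get("source_kind"):
        return EdgeProvenance(
            source_kind=row["source_kind"],
            source_ref_id=str(row["source_ref_id"]),
            evidence_ids=tuple(str(e) for e in row["evidence_ids"]),
        )
    return EdgeProvenance(
        source_kind="transcript",
        source_ref_id=str(row["video_id"]),
        evidence_ids=tuple(str(u) for u in row["utterance_ids"]),
    )
```

Reading legacy rows through a single adapter keeps the transcript path untouched while new bill edges use the explicit fields.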

Phase 3: Retrieval + citation hydration integration

  • Extend retrieval hydration path so KG edges sourced from bills resolve to bill_excerpts citations.
  • Ensure chat answers can cite bill-backed graph edges with stable citation IDs (bill:<bill_id>:<chunk_index>).
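The citation ID format can be kept stable by funnelling it through one formatter and one parser. The helpers below are a hypothetical sketch; they assume chunk_index is a non-negative integer and tolerate colons inside bill_id by splitting the index off from the right.

```python
from typing import NamedTuple


class BillCitation(NamedTuple):
    bill_id: str
    chunk_index: int


def format_bill_citation(bill_id: str, chunk_index: int) -> str:
    """Render the citation ID as bill:<bill_id>:<chunk_index>."""
    if not bill_id:
        raise ValueError("bill_id must be non-empty")
    if chunk_index < 0:
        raise ValueError("chunk_index must be non-negative")
    return f"bill:{bill_id}:{chunk_index}"


def parse_bill_citation(citation_id: str) -> BillCitation:
    """Invert format_bill_citation, rejecting anything malformed."""
    prefix, sep, rest = citation_id.partition(":")
    if prefix != "bill" or not sep:
        raise ValueError(f"not a bill citation: {citation_id!r}")
    # Split from the right so a bill_id that itself contains ':' survives.
    bill_id, sep, index_text = rest.rpartition(":")
    if not sep or not bill_id:
        raise ValueError(f"missing bill id or chunk index: {citation_id!r}")
    if not (index_text.isascii() and index_text.isdigit()):
        raise ValueError(f"chunk index is not an integer: {citation_id!r}")
    return BillCitation(bill_id=bill_id, chunk_index=int(index_text))
```

During hydration, the parsed pair would be the lookup key into bill_excerpts; how that lookup is performed depends on the existing retrieval code and is not shown here.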

Phase 4: Quality controls + operations

  • Deduplicate repeated legal boilerplate edges.
  • Preserve idempotency for reruns (kg_run_id, deterministic IDs/upserts).
  • Add tunable runtime/cost flags:
    • --max-bills
    • --max-windows-per-bill
    • --top-k
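The controls above might be wired up roughly as follows. Everything here is an assumption made for illustration: the None defaults, the normalisation rule, the hash-based ID scheme, and the in-memory dict standing in for the canonical store. The issue does not define what --top-k selects, so the sketch only parses it.

```python
import argparse
import hashlib
from typing import Dict, Iterable


def build_parser() -> argparse.ArgumentParser:
    """CLI flags named in the plan; None means 'no limit' in this sketch."""
    parser = argparse.ArgumentParser(description="Extract KG edges from bill windows")
    parser.add_argument("--max-bills", type=int, default=None)
    parser.add_argument("--max-windows-per-bill", type=int, default=None)
    parser.add_argument("--top-k", type=int, default=None)
    return parser


def _normalise(value: str) -> str:
    # Collapse case and whitespace so trivially different spellings collide.
    return " ".join(value.lower().split())


def deterministic_edge_id(source: str, predicate: str, target: str) -> str:
    """Derive the same ID for the same (source, predicate, target) on every run."""
    key = "\x1f".join(_normalise(part) for part in (source, predicate, target))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def upsert_edge(
    store: Dict[str, dict],
    source: str,
    predicate: str,
    target: str,
    evidence_ids: Iterable[str],
) -> str:
    """Insert the edge or merge new evidence into the existing one."""
    edge_id = deterministic_edge_id(source, predicate, target)
    edge = store.setdefault(
        edge_id,
        {"source": source, "predicate": predicate, "target": target, "evidence_ids": []},
    )
    for evidence_id in evidence_ids:
        if evidence_id not in edge["evidence_ids"]:
            edge["evidence_ids"].append(evidence_id)
    return edge_id
```

Because the ID depends only on normalised content, rerunning extraction over the same windows updates existing edges instead of duplicating them, and repeated boilerplate clauses collapse into one edge that accumulates evidence.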

Test-first plan

  1. Add failing unit tests for:
    • bill window creation/overlap behavior
    • validation acceptance for bill evidence IDs
    • canonical storage handling bill-sourced windows
    • retrieval hydration for bill-backed edge citations
  2. Add one small integration test with fixture bills:
    • run extraction,
    • assert KG nodes/edges increased,
    • assert citations resolve from hydrated results.

Acceptance criteria

  • Bill text is processed via sliding windows and produces KG edges.
  • Bill-derived nodes/edges are queryable in graph retrieval.
  • Chat responses can cite bill-backed graph evidence with resolved sources.
  • Existing transcript KG behavior remains unchanged.
  • Extraction is idempotent and safe to rerun.

Suggested implementation order

  1. Provenance schema + compatibility layer
  2. Bill window builder + extraction script
  3. Retrieval hydration for bill-backed edges
  4. Tests + staging validation run

Notes

  • Existing bill excerpt vector retrieval remains in place and should continue to work.
  • This issue is specifically to add graph-native bill concepts/relations on top of current bill ingestion.
