Context
Bills and bill excerpts are currently ingested and vector-searchable, but bill concepts/relationships are not extracted into KG nodes/edges the way transcript windows are.
Goal: make bill concepts first-class in the knowledge graph using a sliding-window pipeline, with provenance and citations that work end-to-end in chat/retrieval.
Proposed approach
Phase 1: Bill window extraction MVP
- Add BillWindowBuilder to construct overlapping windows from bill_excerpts per bill.
- Default: window size 4 chunks, stride 2 chunks.
- Reuse existing KG extractor flow (OssKGExtractor) against bill windows.
- Add script: scripts/kg_extract_from_bills.py to run for one bill or all bills.
- Store extracted nodes/edges through canonical store pipeline.
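The windowing step above could be sketched roughly as follows. This is a minimal illustration, not the actual BillWindowBuilder: the `BillWindow` dataclass, the `build_windows` function, and the `(chunk_id, text)` input shape are assumptions made for the example; only the defaults (window size 4, stride 2) come from this issue.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BillWindow:
    """One overlapping window of bill chunks (hypothetical shape)."""
    bill_id: str
    start_index: int              # position of the first chunk in the window
    chunk_ids: tuple[str, ...]    # evidence IDs carried along for provenance
    text: str


def build_windows(
    bill_id: str,
    chunks: Sequence[tuple[str, str]],  # assumed (chunk_id, text), in bill order
    window_size: int = 4,
    stride: int = 2,
) -> list[BillWindow]:
    """Build overlapping windows; the last window may be shorter than window_size."""
    if window_size < 1 or stride < 1:
        raise ValueError("window_size and stride must be >= 1")
    windows: list[BillWindow] = []
    if not chunks:
        return windows
    start = 0
    while True:
        part = chunks[start:start + window_size]
        windows.append(
            BillWindow(
                bill_id=bill_id,
                start_index=start,
                chunk_ids=tuple(chunk_id for chunk_id, _ in part),
                text="\n".join(text for _, text in part),
            )
        )
        # Stop once this window reaches the end of the bill.
        if start + window_size >= len(chunks):
            break
        start += stride
    return windows


if __name__ == "__main__":
    demo = [(f"c{i}", f"text {i}") for i in range(7)]
    for window in build_windows("B1", demo):
        print(window.start_index, window.chunk_ids)
```

With the defaults, consecutive windows share two chunks, so a relation that spans a chunk boundary still appears whole in at least one window. A stride larger than the window size would leave gaps, which the real builder may want to reject.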
Phase 2: Provenance model upgrade (recommended now)
- Extend edge provenance to support both transcript and bill evidence without overloading transcript-only fields.
- Add edge-level provenance fields (example):
  - source_kind (transcript | bill)
  - source_ref_id (video id or bill id)
  - evidence_ids (utterance ids or bill chunk ids)
- Keep backward compatibility with existing transcript edges.
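One possible shape for the provenance record is sketched below. The field names source_kind, source_ref_id, and evidence_ids come from the example in this issue; the `EdgeProvenance` class, the `provenance_from_row` helper, and the legacy column names `video_id` and `utterance_ids` are hypothetical stand-ins for whatever the existing transcript edges actually store.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Mapping

SOURCE_KINDS = ("transcript", "bill")


@dataclass(frozen=True)
class EdgeProvenance:
    source_kind: str               # "transcript" or "bill"
    source_ref_id: str             # video id or bill id
    evidence_ids: tuple[str, ...]  # utterance ids or bill chunk ids

    def __post_init__(self) -> None:
        if self.source_kind not in SOURCE_KINDS:
            raise ValueError(f"unknown source_kind: {self.source_kind!r}")
        if not self.evidence_ids:
            raise ValueError("an edge needs at least one evidence id")


def provenance_from_row(row: Mapping[str, Any]) -> EdgeProvenance:
    """Read either the new fields or an older transcript-only row."""
    if "source_kind" in row:
        return EdgeProvenance(
            source_kind=row["source_kind"],
            source_ref_id=row["source_ref_id"],
            evidence_ids=tuple(row["evidence_ids"]),
        )
    # Compatibility path: "video_id" and "utterance_ids" are assumed
    # legacy field names, used here only for illustration.
    return EdgeProvenance(
        source_kind="transcript",
        source_ref_id=row["video_id"],
        evidence_ids=tuple(row["utterance_ids"]),
    )
```

Reading legacy rows through a single adapter keeps existing transcript edges valid without a backfill, and it rejects unknown source kinds before they reach storage.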
Phase 3: Retrieval + citation hydration integration
- Extend retrieval hydration path so KG edges sourced from bills resolve to bill_excerpts citations.
- Ensure chat answers can cite bill-backed graph edges with stable citation IDs (bill:<bill_id>:<chunk_index>).
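As a rough sketch of the citation ID format named above, the helpers below build and parse `bill:<bill_id>:<chunk_index>` strings. The function names are hypothetical, and splitting on the last colon is an assumption made so that a bill id containing a colon would still round-trip.

```python
from __future__ import annotations

PREFIX = "bill:"


def format_bill_citation(bill_id: str, chunk_index: int) -> str:
    """Build a citation ID of the form bill:<bill_id>:<chunk_index>."""
    if not bill_id:
        raise ValueError("bill_id must be non-empty")
    if chunk_index < 0:
        raise ValueError("chunk_index must be >= 0")
    return f"{PREFIX}{bill_id}:{chunk_index}"


def parse_bill_citation(citation_id: str) -> tuple[str, int]:
    """Split a citation ID back into (bill_id, chunk_index)."""
    if not citation_id.startswith(PREFIX):
        raise ValueError(f"not a bill citation: {citation_id!r}")
    rest = citation_id[len(PREFIX):]
    # Split on the LAST colon so the bill id itself may contain colons.
    bill_id, sep, index_text = rest.rpartition(":")
    if not sep or not bill_id or not index_text.isdecimal():
        raise ValueError(f"malformed bill citation: {citation_id!r}")
    return bill_id, int(index_text)
```

During hydration, the parsed pair could serve as the lookup key into bill_excerpts, so the same ID resolves to the same excerpt across reruns.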
Phase 4: Quality controls + operations
- Deduplicate repeated legal boilerplate edges.
- Preserve idempotency for reruns (kg_run_id, deterministic IDs/upserts).
- Add tunable runtime/cost flags:
  - --max-bills
  - --max-windows-per-bill
  - --top-k
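The idempotency and flag items above might look something like the sketch below. The three flag names come from this issue; everything else is an assumption for illustration: the `edge_id` hashing scheme, the dict-backed `upsert_edge` store, and the edge field names. The real canonical store pipeline may key and merge edges differently, and the meaning of --top-k is not specified here, so it is only parsed.

```python
from __future__ import annotations

import argparse
import hashlib


def edge_id(source: str, relation: str, target: str,
            source_kind: str, source_ref_id: str) -> str:
    """Derive a stable ID so the same edge always maps to the same key."""
    # Normalize only the free-text parts; reference ids are kept verbatim.
    parts = [source.strip().lower(), relation.strip().lower(),
             target.strip().lower(), source_kind, source_ref_id]
    key = "\x1f".join(parts)  # unit separator avoids ambiguous joins
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]


def upsert_edge(store: dict[str, dict], edge: dict) -> str:
    """Insert the edge, or merge evidence into the existing record."""
    eid = edge_id(edge["source"], edge["relation"], edge["target"],
                  edge["source_kind"], edge["source_ref_id"])
    existing = store.get(eid)
    if existing is None:
        store[eid] = {**edge, "evidence_ids": sorted(set(edge["evidence_ids"]))}
    else:
        merged = set(existing["evidence_ids"]) | set(edge["evidence_ids"])
        existing["evidence_ids"] = sorted(merged)
    return eid


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Extract KG edges from bills")
    parser.add_argument("--max-bills", type=int, default=None,
                        help="stop after this many bills")
    parser.add_argument("--max-windows-per-bill", type=int, default=None,
                        help="cap the windows processed per bill")
    parser.add_argument("--top-k", type=int, default=None,
                        help="passed through to the extractor (semantics assumed)")
    return parser
```

Because the ID depends only on the normalized triple and its source, rerunning extraction over the same bill merges evidence into existing edges instead of creating duplicates. Overlapping windows that yield the same edge collapse into one record in the same way.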
Test-first plan
- Add failing unit tests for:
  - bill window creation/overlap behavior
  - validation acceptance for bill evidence IDs
  - canonical storage handling bill-sourced windows
  - retrieval hydration for bill-backed edge citations
- Add one small integration test with fixture bills:
  - run extraction,
  - assert KG nodes/edges increased,
  - assert citations resolve from hydrated results.
Acceptance criteria
- Bill text is processed via sliding windows and produces KG edges.
- Bill-derived nodes/edges are queryable in graph retrieval.
- Chat responses can cite bill-backed graph evidence with resolved sources.
- Existing transcript KG behavior remains unchanged.
- Extraction is idempotent and safe to rerun.
Suggested implementation order
1. Provenance schema + compatibility layer
2. Bill window builder + extraction script
3. Retrieval hydration for bill-backed edges
4. Tests + staging validation run
Notes
- Existing bill excerpt vector retrieval remains in place and should continue to work.
- This issue is specifically to add graph-native bill concepts/relations on top of current bill ingestion.