docs(07.1-02): complete KG Builder agent implementation plan

Siddharth-Khattar · Siddharth-Khattar · commit 6030135b6a8d · 2026-02-08T12:26:52.000+01:00
Tasks completed: 2/2
- Task 1: KG Builder Agent -- Runner, Factory, Prompt, Input Assembly, DB Write
- Task 2: Pipeline Wiring -- Replace Programmatic KG Builder with LLM Agent

SUMMARY: .planning/phases/07.1-llm-kg-builder-agent/07.1-02-SUMMARY.md
diff --git a/.planning/STATE.md b/.planning/STATE.md
@@ -1,7 +1,7 @@
 # Holmes Project State
 
 **Last Updated:** 2026-02-08
-**Current Phase:** 7.1 of 12 (LLM-Based KG Builder Agent) — IN PROGRESS (1/2 plans)
+**Current Phase:** 7.1 of 12 (LLM-Based KG Builder Agent) — COMPLETE (2/2 plans)
 **Next Phase:** 7.2 (D3.js KG Frontend Enhancement) → 8 (Synthesis)
 **Current Milestone:** M1 - Holmes v1.0
 
@@ -18,7 +18,7 @@
 | 5 | Agent Flow | COMPLETE | 2026-02-04 | 2026-02-05 | SSE pipeline complete; HITL infra built but verification deferred to Phase 6+ |
 | 6 | Domain Agents | COMPLETE | 2026-02-06 | 2026-02-06 | 5 plans (14 commits) + 21 post-plan commits (35 total): refactoring, routing HITL, production hardening, live-testing bugfixes |
 | 7 | Knowledge Storage & Domain Agent Enrichment | COMPLETE | 2026-02-07 | 2026-02-07 | 6 plans (11 commits), 8/8 verified: 9 DB models + migration, KG/findings schemas, KG Builder + findings service, prompt enrichment, 10 API endpoints, pipeline wiring |
-| 7.1 | LLM-Based KG Builder Agent | IN_PROGRESS | 2026-02-08 | - | Plan 01/02 complete: schema evolution + Pydantic schemas |
+| 7.1 | LLM-Based KG Builder Agent | COMPLETE | 2026-02-08 | 2026-02-08 | 2 plans (4 commits): schema evolution, Pydantic schemas, agent runner/prompt/factory, pipeline wiring |
 | 7.2 | KG Frontend (D3.js Enhancement) | NOT_STARTED | - | - | Epstein-inspired D3.js improvements |
 | 7.3 | KG Frontend (vis-network) | DEFERRED | - | - | Optional; only if D3.js proves insufficient |
 | 8 | Intelligence Layer & Geospatial | NOT_STARTED | - | - | |
@@ -53,14 +53,12 @@
   - vis-network deferred to optional Phase 7.3
 - New phase structure: 7.1 (LLM KG Builder) → 7.2 (D3.js Enhancement) → 7.3 (vis-network, optional)
 
-**Phase 7.1 Plan 01 complete** (2026-02-08): Schema evolution + Pydantic schemas -- 2 plans, 2 commits
-  - Alembic migration adding 10 new nullable columns (5 entity, 5 relationship)
-  - KgBuilderOutput/KgBuilderEntity/KgBuilderRelationship Pydantic schemas with integer ID cross-referencing
-  - EntityResponse and RelationshipResponse updated with new optional fields
-  - Frontend API types regenerated
+**Phase 7.1 Complete** (2026-02-08): LLM-Based KG Builder Agent -- 2 plans, 4 commits
+  - Plan 01: Alembic migration adding 10 new nullable columns + KgBuilderOutput Pydantic schemas with integer ID cross-referencing
+  - Plan 02: KgBuilderAgentRunner with text-only input, KG_BUILDER_SYSTEM_PROMPT (8+1 entity taxonomy), AgentFactory.create_kg_builder_agent(), DB writer with clear-and-rebuild, pipeline Stage 7 replaced with LLM invocation
+  - Full pipeline: Triage -> Orchestrator -> Domain -> Strategy -> HITL -> Save Findings -> LLM KG Builder -> Backfill Entity IDs -> Final
 
 **What's next:**
-- Phase 7.1 Plan 02: KG Builder agent runner, prompt, DB writer, pipeline wiring
 - Phase 7.2: D3.js KG Frontend Enhancement — Epstein-inspired layout, physics, sidebars, filtering, document excerpts
 - Phase 8: Synthesis Agent & Intelligence Layer — cross-referencing, hypotheses, contradictions, gaps, timeline
 
@@ -373,6 +371,10 @@ All frontend features need these backend endpoints:
 | KG API data access | Direct model queries vs Service layer | Direct model queries | Simple CRUD doesn't need service abstraction; findings uses service for complex search |
 | EntityCreateRequest metadata mapping | Direct field vs Renamed | metadata -> properties | Schema field "metadata" maps to DB column "properties" (SQLAlchemy reserved attribute) |
 | KG Builder entity cross-referencing | Name matching vs Integer IDs | Integer IDs (1, 2, 3...) | Eliminates name-matching inconsistencies; LLM assigns sequential IDs, mapped to DB UUIDs during write |
+| KG Builder input format | Multimodal files vs Text-only | Text-only (findings + entities + case description) | Domain agents already processed raw evidence; KG Builder only needs pre-processed text |
+| KG Builder rebuild strategy | Incremental merge vs Clear-and-rebuild | Clear-and-rebuild | Clean slate every run; delete all KG data then insert curated LLM output |
+| KG Builder failure handling | Block pipeline vs Non-blocking | Non-blocking (try/except, continue) | KG Builder failure emits SSE error, pipeline continues; KG page shows empty state |
+| KG Builder media resolution | HIGH vs None | None (text-only input) | No generate_content_config needed; KG Builder receives text, not files |
 
 ---
 
@@ -385,7 +387,7 @@ None currently.
 ## Session Continuity
 
 Last session: 2026-02-08
-Stopped at: Completed 07.1-01-PLAN.md (schema evolution + Pydantic schemas)
+Stopped at: Completed 07.1-02-PLAN.md (KG Builder agent runner, prompt, pipeline wiring)
 Resume file: None
 
 ---
diff --git a/.planning/phases/07.1-llm-kg-builder-agent/07.1-02-SUMMARY.md b/.planning/phases/07.1-llm-kg-builder-agent/07.1-02-SUMMARY.md
@@ -0,0 +1,113 @@
+---
+phase: 07.1-llm-kg-builder-agent
+plan: 02
+subsystem: agents
+tags: [gemini-pro, adk, knowledge-graph, llm-agent, pipeline, sse]
+
+# Dependency graph
+requires:
+  - phase: 07.1-01
+    provides: "Alembic migration + ORM models + KgBuilderOutput Pydantic schema"
+  - phase: 07
+    provides: "KgEntity, KgRelationship ORM models, programmatic KG builder, pipeline wiring"
+provides:
+  - "KgBuilderAgentRunner subclass with text-only content preparation"
+  - "KG_BUILDER_SYSTEM_PROMPT with 8+1 entity taxonomy and semantic relationship instructions"
+  - "AgentFactory.create_kg_builder_agent() with Pro model and high thinking"
+  - "Input assembly from case_findings + domain entities + case description"
+  - "DB write with clear-and-rebuild, lenient parsing, entity degree computation"
+  - "Pipeline Stage 7 replaced with LLM KG Builder invocation in try/except"
+  - "SSE lifecycle events for KG Builder (started, complete, error)"
+affects: [07.2, 08, 09]
+
+# Tech tracking
+tech-stack:
+  added: []
+  patterns:
+    - "Text-only DomainAgentRunner subclass (no multimodal files)"
+    - "Clear-and-rebuild KG strategy (delete all, insert curated data)"
+    - "Lenient LLM output parsing (skip malformed, log warning)"
+    - "Non-blocking pipeline stage (try/except, continues on failure)"
+
+key-files:
+  created:
+    - "backend/app/agents/kg_builder.py"
+    - "backend/app/agents/prompts/kg_builder.py"
+  modified:
+    - "backend/app/agents/factory.py"
+    - "backend/app/services/kg_builder.py"
+    - "backend/app/services/pipeline.py"
+
+key-decisions:
+  - "KG Builder receives text-only input (no multimodal files) -- findings + entities + case description"
+  - "Clear-and-rebuild strategy: delete all existing KG data before writing curated data"
+  - "KG Builder failure is non-blocking: try/except in pipeline, emits SSE error, continues"
+  - "Old programmatic builder functions preserved as deprecated (not deleted)"
+
+patterns-established:
+  - "Text-only DomainAgentRunner subclass pattern for agents consuming pre-processed text"
+  - "Non-blocking pipeline stage pattern with SSE error emission on failure"
+
+# Metrics
+duration: 5min
+completed: 2026-02-08
+---
+
+# Phase 7.1 Plan 02: KG Builder Agent Implementation Summary
+
+**LLM-based KG Builder agent with Gemini Pro, text-only input assembly, clear-and-rebuild DB writer, and non-blocking pipeline integration with SSE lifecycle events**
+
+## Performance
+
+- **Duration:** 5 min
+- **Started:** 2026-02-08T11:19:59Z
+- **Completed:** 2026-02-08T11:25:07Z
+- **Tasks:** 2
+- **Files changed:** 5 (2 created, 3 modified)
+
+## Accomplishments
+- KgBuilderAgentRunner, a DomainAgentRunner subclass following the Strategy pattern, with text-only content preparation from case findings, domain entities, and case description
+- System prompt with 8+1 entity taxonomy (PERSON, ORGANIZATION, LOCATION, EVENT, ASSET, FINANCIAL_ENTITY, COMMUNICATION, DOCUMENT, OTHER), semantic relationship instructions, deduplication rules, and evidence grounding requirements
+- AgentFactory.create_kg_builder_agent() producing fresh LlmAgent with Pro model, high thinking, and text-only input (no media resolution)
+- DB writer that clears existing KG data, inserts curated entities/relationships with integer-ID-to-UUID mapping, computes entity degrees, and uses lenient parsing (skip malformed, log warning)
+- Pipeline Stage 7 replaced: run_kg_builder() is now called instead of build_knowledge_graph(), wrapped in try/except so a KG Builder failure does not block the pipeline, with SSE events emitted for started/complete/error
+
+## Task Commits
+
+Each task was committed atomically:
+
+1. **Task 1: KG Builder Agent -- Runner, Factory, Prompt, Input Assembly, DB Write** - `07428c7` (feat)
+2. **Task 2: Pipeline Wiring -- Replace Programmatic KG Builder with LLM Agent** - `618958b` (feat)
+
+## Files Created/Modified
+- `backend/app/agents/kg_builder.py` - KgBuilderAgentRunner, assemble_kg_builder_input(), write_kg_from_llm_output(), run_kg_builder()
+- `backend/app/agents/prompts/kg_builder.py` - KG_BUILDER_SYSTEM_PROMPT with entity taxonomy and relationship instructions
+- `backend/app/agents/factory.py` - Added create_kg_builder_agent() static method
+- `backend/app/services/kg_builder.py` - Marked 4 old functions as deprecated (extract_entities_from_output, build_relationships_from_findings, deduplicate_entities, build_knowledge_graph)
+- `backend/app/services/pipeline.py` - Stage 7 replaced with LLM KG Builder invocation, SSE lifecycle events added
+
+## Decisions Made
+- KG Builder receives text-only input assembled from case_findings (with [FINDING:uuid] prefixes), DomainEntity JSON lists, and case description -- no multimodal files needed since domain agents already processed raw evidence
+- Clear-and-rebuild strategy for KG data: delete all kg_entities and kg_relationships for the case before writing curated data from LLM output
+- Lenient parsing: each entity and relationship insertion wrapped in try/except, malformed items skipped with warning log, partial results preserved
+- Old programmatic builder functions (extract_entities_from_output, build_relationships_from_findings, deduplicate_entities, build_knowledge_graph) marked deprecated but preserved for backward compatibility and audit trail
+
+## Deviations from Plan
+
+None - plan executed exactly as written.
+
+## Issues Encountered
+- Import sorting: ruff pre-commit hook caught unsorted import of run_kg_builder in pipeline.py -- fixed by moving import to alphabetical position among app.agents.* imports
+
+## User Setup Required
+None - no external service configuration required.
+
+## Next Phase Readiness
+- LLM KG Builder agent is fully wired into the pipeline and ready for end-to-end testing
+- Phase 7.2 (D3.js KG Frontend Enhancement) can proceed -- existing KG API endpoints return curated data without changes
+- Phase 8 (Synthesis) can proceed -- KG Builder produces entities with semantic relationships for synthesis agent consumption
+- No blockers
+
+---
+*Phase: 07.1-llm-kg-builder-agent*
+*Completed: 2026-02-08*
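
The summary above describes a DB writer that maps the LLM's sequential integer IDs to DB UUIDs, skips malformed items with a warning, and computes entity degrees. The sketch below illustrates that idea in memory only; it is not the code from `backend/app/agents/kg_builder.py`. The function name `map_llm_output` and the dict keys (`id`, `name`, `type`, `source_id`, `target_id`, `label`) are hypothetical, and the clear-and-rebuild delete step and all database access are omitted.

```python
import logging
import uuid
from collections import Counter

logger = logging.getLogger(__name__)


def map_llm_output(entities, relationships):
    """Hypothetical helper: turn LLM output dicts into rows keyed by UUID.

    Field names are assumptions for illustration, not the real schema.
    """
    id_map = {}        # LLM integer ID -> generated UUID
    entity_rows = []

    for item in entities:
        try:
            local_id = int(item["id"])
            name = str(item["name"])
            entity_type = str(item.get("type", "OTHER"))
        except (KeyError, TypeError, ValueError, AttributeError) as exc:
            # Lenient parsing: skip the malformed entity, keep the rest.
            logger.warning("Skipping malformed entity %r: %s", item, exc)
            continue
        db_id = uuid.uuid4()
        id_map[local_id] = db_id
        entity_rows.append({"id": db_id, "name": name, "type": entity_type, "degree": 0})

    relationship_rows = []
    degree = Counter()

    for item in relationships:
        try:
            # An unknown integer ID raises KeyError and is skipped as well.
            source = id_map[int(item["source_id"])]
            target = id_map[int(item["target_id"])]
            label = str(item["label"])
        except (KeyError, TypeError, ValueError) as exc:
            logger.warning("Skipping malformed relationship %r: %s", item, exc)
            continue
        relationship_rows.append({"source": source, "target": target, "label": label})
        degree[source] += 1
        degree[target] += 1

    # Degree = number of kept relationships that touch the entity.
    for row in entity_rows:
        row["degree"] = degree[row["id"]]

    return entity_rows, relationship_rows


if __name__ == "__main__":
    sample_entities = [
        {"id": 1, "name": "Alice", "type": "PERSON"},
        {"id": 2, "name": "Acme", "type": "ORGANIZATION"},
        {"name": "entity without an id"},
    ]
    sample_relationships = [
        {"source_id": 1, "target_id": 2, "label": "EMPLOYED_BY"},
        {"source_id": 1, "target_id": 99, "label": "REFERS_TO_UNKNOWN"},
    ]
    rows, rels = map_llm_output(sample_entities, sample_relationships)
    for row in rows:
        print(row["name"], row["type"], row["degree"])
```

Resolving relationships only through `id_map` means a relationship that points at a skipped or nonexistent entity is dropped rather than written with a dangling reference, which matches the "partial results preserved" behavior described in the decisions.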