Commit 4131f7a

docs: log fuzzy matching and entity count issues for Phase 8
1 parent b339e5f commit 4131f7a

2 files changed: +27 -0 lines changed

.planning/MINOR_ISSUES.md

Lines changed: 23 additions & 0 deletions
@@ -20,3 +20,26 @@ Deferred issues that are non-blocking but should be addressed in future work.
- **Fix**: Either annotate concrete functions to return `tuple[DomainAgentOutput | None, UUID | None]`, or use a generic `TypeVar` on the return type.
- **Priority**: Low (no runtime impact, passes pyright default mode)
- **Added**: 2025-02-07, QC review of Phase 7 execution ID fix

## MI-003: Fuzzy entity matching produces false positives on numeric/temporal values

- **File**: `backend/app/services/kg_builder.py`, `deduplicate_entities()` (~line 270-288)
- **Detail**: Fuzzy string matching (rapidfuzz ratio >=85%) flags semantically distinct values as potential duplicates when they are string-similar but numerically different. Observed in live testing:
  - `'2016-05-02 20:00'` vs `'2016-05-02 22:00'` (92%): different timestamps (8 PM vs 10 PM)
  - `'50,000 dollars'` vs `'2,000 dollars'` (88%): 25x difference in amount
  - `'$5,000'` vs `'$50,000'` (88%): 10x difference in amount
- **Impact**: No data corruption (pairs are only flagged, not auto-merged). However, these false flags add noise to Phase 8 LLM resolution, wasting tokens and potentially confusing the synthesis agent.
- **Fix**: Add type-aware matching logic before fuzzy comparison. When `entity_type` is `timestamp`, `monetary_amount`, or `date` (or `other` and the value looks numeric), parse the actual value and compare semantically instead of string-matching. For example, extract the number from a monetary amount and compare magnitudes; parse timestamps and compare the actual time difference.
- **Priority**: Medium (should be fixed before or during Phase 8 to avoid noisy LLM resolution input)
- **Phase**: Fix during Phase 8 (Synthesis) when fuzzy flags are consumed
- **Added**: 2026-02-07, live pipeline testing
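
The type-aware pre-check described in the fix could look roughly like the sketch below. It is not the code in `kg_builder.py`: the function names (`parse_timestamp`, `parse_amount`, `values_differ`), the timestamp formats, and the currency symbols are all assumptions made for illustration. The check is deliberately conservative: it only reports that two values differ when both parse cleanly, and otherwise leaves the decision to the existing fuzzy comparison.

```python
import re
from datetime import datetime
from decimal import Decimal
from typing import Optional

# Assumed timestamp layouts; the real pipeline may emit others.
_TIMESTAMP_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d")
_NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")
# Assumed definition of "looks numeric": optional currency symbol, then a number.
_NUMERIC_ONLY_RE = re.compile(r"[$€£]?\d[\d,]*(?:\.\d+)?")


def parse_timestamp(text: str) -> Optional[datetime]:
    """Parse text with the first matching assumed format, or return None."""
    for fmt in _TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def parse_amount(text: str) -> Optional[Decimal]:
    """Return the first number in text with thousands separators removed."""
    match = _NUMBER_RE.search(text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


def values_differ(a: str, b: str, entity_type: str) -> bool:
    """True only when both values parse and their parsed values are unequal."""
    if entity_type in ("timestamp", "date"):
        ta, tb = parse_timestamp(a), parse_timestamp(b)
        return ta is not None and tb is not None and ta != tb
    looks_numeric = bool(
        _NUMERIC_ONLY_RE.fullmatch(a.strip()) and _NUMERIC_ONLY_RE.fullmatch(b.strip())
    )
    if entity_type == "monetary_amount" or (entity_type == "other" and looks_numeric):
        na, nb = parse_amount(a), parse_amount(b)
        return na is not None and nb is not None and na != nb
    # Unknown type or unparseable value: defer to the fuzzy comparison.
    return False
```

A caller would skip the fuzzy flag for a pair whenever `values_differ(a, b, entity_type)` returns `True`, and run the rapidfuzz comparison as before otherwise. All three pairs observed in live testing are rejected by this check, while equal amounts written differently (such as `$5,000` and `$5,000.00`) still reach the fuzzy stage.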

## MI-004: Pipeline summary log mixes two different entity count semantics

- **File**: `backend/app/services/pipeline.py` line ~1089
- **Log**: `entities=40 kg_entities=61` is confusing because both fields sound like entity counts but measure different things.
- **Detail**: `entities` is the sum of the triage file-level entities (line 1048: `sum(len(fr.entities) for fr in triage_output.file_results)`) and the domain agents' top-level `output.entities` (line 1043). `kg_entities` is the KG Builder total, which includes both top-level and per-finding entities. The KG count is always >= the domain entity count because KG Builder also extracts from `finding.entities` lists. The naming makes the two fields appear comparable when they are not.
- **Fix**: Rename `entities` to `triage_entities` and `total_domain_entities` to `domain_output_entities` in the log, or remove the triage entity count from this summary (it is already logged in the triage stage). The `processing-complete` SSE event (line 1065), which is sent to the frontend, also sums these two counts; verify that the frontend does not display the sum misleadingly.
- **Priority**: Low (cosmetic log clarity, no data impact)
- **Phase**: Fix opportunistically during Phase 8 pipeline extension
- **Added**: 2026-02-07, live pipeline testing
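
One way the renamed summary line might be produced is sketched below. The helper `format_entity_summary` and the logger name are hypothetical, and the 12/28 split of the observed `entities=40` is made up for the example; only the field names and the `kg_entities=61` value come from this issue.

```python
import logging

logger = logging.getLogger("pipeline")  # hypothetical logger name


def format_entity_summary(
    triage_entities: int, domain_output_entities: int, kg_entities: int
) -> str:
    """Render each entity count under its own name instead of a combined total."""
    return (
        f"triage_entities={triage_entities} "
        f"domain_output_entities={domain_output_entities} "
        f"kg_entities={kg_entities}"
    )


# Illustrative values: 12 + 28 is an invented split of the observed entities=40.
summary = format_entity_summary(12, 28, 61)
logger.info("pipeline summary: %s", summary)
```

Keeping the three counters separate makes it visible that `kg_entities` is not expected to equal either of the other two fields or their sum.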

.planning/ROADMAP.md

Lines changed: 4 additions & 0 deletions
@@ -830,6 +830,10 @@ Plans:
- Task deduplication via existing task list injection into synthesis prompt
- Stored in investigation_tasks table

**Known Issues to Resolve (from `.planning/MINOR_ISSUES.md`):**
- **MI-003**: Fuzzy entity deduplication produces false positives on numeric/temporal values (e.g., `$5,000` vs `$50,000` flagged at 88% similarity). Fix: add type-aware matching in `kg_builder.py:deduplicate_entities()` before fuzzy comparison, parsing numeric entity types semantically instead of string-matching. Must be fixed before synthesis consumes fuzzy flags.
- **MI-004**: Pipeline summary log reports the sum of the triage and domain entity counts as `entities=N`, confusingly alongside `kg_entities=M`. Fix: rename or separate the counters in `pipeline.py` for clarity.

**Technical Notes:**
- Synthesis Agent runs in fresh stage-isolated ADK session (consistent with existing pattern)
- Input is TEXT from PostgreSQL (case_findings.finding_text), NOT multimodal file content
