- **Detail**: Fuzzy string matching (rapidfuzz ratio >= 85%) flags semantically distinct values as potential duplicates when their strings are similar but the values they denote differ. Observed in live testing:
  - `'2016-05-02 20:00'` vs `'2016-05-02 22:00'` (92%): different timestamps (8 PM vs 10 PM)
  - `'50,000 dollars'` vs `'2,000 dollars'` (88%): 25x difference in amount
  - `'$5,000'` vs `'$50,000'` (88%): 10x difference in amount
- **Impact**: No data corruption, because matches are only flagged, never auto-merged. The false flags still add noise to Phase 8 LLM resolution, wasting tokens and potentially confusing the synthesis agent.
- **Fix**: Add type-aware matching before the fuzzy comparison. When `entity_type` is `timestamp`, `monetary_amount`, or `date` (or `other` with a value that looks numeric), parse the value and compare it semantically instead of string-matching. For monetary amounts, extract the number and compare magnitudes; for timestamps, parse both values and compare the actual time difference.
- **Priority**: Medium (fix before or during Phase 8 to avoid noisy LLM resolution input)
- **Phase**: Fix during Phase 8 (Synthesis), when fuzzy flags are consumed
- **Added**: 2026-02-07, live pipeline testing

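The type-aware check proposed for MI-003 could be sketched roughly as follows. This is a minimal illustration, not the code in `kg_builder.py`: the helper names (`parse_amount`, `parse_timestamp`, `typed_match`) and the list of timestamp formats are assumptions, and currency units are ignored. The function returns `None` whenever it cannot decide, so the caller can fall back to the existing fuzzy comparison.

```python
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Assumed timestamp layouts; the real pipeline may need more formats.
TIMESTAMP_FORMATS = ("%Y-%m-%d %H:%M", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_amount(text: str) -> Optional[Decimal]:
    """Extract the first number from text such as '$5,000' or '50,000 dollars'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    try:
        # Thousands separators are dropped; the currency unit is ignored.
        return Decimal(match.group().replace(",", ""))
    except InvalidOperation:
        return None


def parse_timestamp(text: str) -> Optional[datetime]:
    """Try each assumed format in turn; return None if none of them fits."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def typed_match(entity_type: str, a: str, b: str) -> Optional[bool]:
    """Compare two values by meaning.

    Returns True or False when both values parse, and None when the
    entity type is not value-like or parsing fails (use fuzzy matching).
    """
    if entity_type == "monetary_amount":
        left, right = parse_amount(a), parse_amount(b)
    elif entity_type in ("timestamp", "date"):
        left, right = parse_timestamp(a), parse_timestamp(b)
    else:
        return None
    if left is None or right is None:
        return None
    return left == right
```

With this in place, the three pairs observed in live testing would be rejected as duplicates before the fuzzy ratio is ever computed. The `other` entity type with numeric-looking values is not covered by this sketch and would need its own rule.
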
## MI-004: Pipeline summary log mixes two different entity count semantics

- **File**: `backend/app/services/pipeline.py` line ~1089
- **Log**: `entities=40 kg_entities=61` is confusing because both fields sound like entity counts but measure different things.
- **Detail**: `entities` = triage file-level entities (line 1048: `sum(len(fr.entities) for fr in triage_output.file_results)`) + domain agent top-level `output.entities` (line 1043). `kg_entities` = KG Builder total, which includes both top-level and per-finding entities. The KG count is always >= the domain entity count because KG Builder also extracts from the `finding.entities` lists. The naming makes the two fields appear comparable when they are not.
- **Fix**: Rename `entities` to `triage_entities` and `total_domain_entities` to `domain_output_entities` in the log, or remove the triage entity count from this summary (it is already logged in the triage stage). The `processing-complete` SSE event (line 1065), which is sent to the frontend, also sums these two counts; verify that the frontend does not display the sum misleadingly.
- **Priority**: Low (cosmetic log clarity, no data impact)
- **Phase**: Fix opportunistically during Phase 8 pipeline extension
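
To make the difference between the counters concrete, here is a small sketch of a summary line that keeps them separate. The dataclasses and `summarize_counts` are hypothetical stand-ins for the real pipeline types, and the KG total is simplified to top-level plus per-finding entities with no deduplication, which the actual KG Builder may not match.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    entities: List[str] = field(default_factory=list)


@dataclass
class DomainOutput:
    entities: List[str] = field(default_factory=list)      # top-level entities
    findings: List[Finding] = field(default_factory=list)  # each finding carries its own entities


@dataclass
class FileResult:
    entities: List[str] = field(default_factory=list)      # triage file-level entities


def summarize_counts(file_results: List[FileResult],
                     domain_outputs: List[DomainOutput]) -> str:
    """Build a log line in which each counter has one unambiguous meaning."""
    triage_entities = sum(len(fr.entities) for fr in file_results)
    domain_output_entities = sum(len(o.entities) for o in domain_outputs)
    finding_entities = sum(
        len(f.entities) for o in domain_outputs for f in o.findings
    )
    # Simplifying assumption: KG total = top-level + per-finding, no dedup.
    kg_entities = domain_output_entities + finding_entities
    return (
        f"triage_entities={triage_entities} "
        f"domain_output_entities={domain_output_entities} "
        f"kg_entities={kg_entities}"
    )
```

Under these assumptions `kg_entities` can never be smaller than `domain_output_entities`, which is the relationship described in the Detail entry, and the triage count no longer inflates a field that reads as if it were comparable to the KG total.
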

.planning/ROADMAP.md: 4 additions, 0 deletions
@@ -830,6 +830,10 @@ Plans:
- Task deduplication via existing task list injection into synthesis prompt
- Stored in investigation_tasks table

**Known Issues to Resolve (from `.planning/MINOR_ISSUES.md`):**
- **MI-003**: Fuzzy entity deduplication produces false positives on numeric and temporal values (e.g., `$5,000` vs `$50,000` flagged at 88% similarity). Fix: add type-aware matching in `kg_builder.py:deduplicate_entities()` before the fuzzy comparison, parsing numeric and temporal entity types semantically instead of string-matching them. Must be fixed before synthesis consumes fuzzy flags.
- **MI-004**: The pipeline summary log reports the sum of the triage and domain entity counts as `entities=N`, confusingly alongside `kg_entities=M`. Fix: rename or separate the counters in `pipeline.py` for clarity.

**Technical Notes:**
- Synthesis Agent runs in fresh stage-isolated ADK session (consistent with existing pattern)
- Input is TEXT from PostgreSQL (case_findings.finding_text), NOT multimodal file content