LLMs can miss details or introduce inaccuracies when analyzing complex compliance documents. High-stakes use cases (security audits, compliance verification, technical design review) require exhaustive, validated fact extraction.
Extract structured facts from documents using:
- Maximum depth extraction - Extract 5-10 facts per paragraph with enhanced metadata
- Real-time validation - Verify every fact against source text immediately
- Multiple evidence quotes - Support facts synthesized from multiple locations (V5)
- Quality post-processing - Filter and score facts for optimal quality/density
Add consensus layer:
- Run multiple LLM instances in parallel with different random seeds
- Use non-LLM semantic comparison to identify consensus
- Discard outliers and resolve contradictions with judge model
- Result: high-confidence answers with exact citations
Documents (PDF/Markdown)
↓
PyPDF2/OCR → Chunk (sliding window + resume)
↓
Enhanced Extract (Claude Sonnet via CLI)
- Maximum depth (5-10 facts/paragraph)
- 8 metadata fields
- Multiple evidence quotes (V5)
- Section-aware prompting
- Parallel workers (5-11)
↓
Real-Time Validation
- Quote verification
- Line number validation
- Fuzzy matching (70%)
- Fact recovery (40-79%)
↓
Post-Processing Pipeline
- Retroactive QV tagging
- Quality scoring
- Aggressive filtering (35% QV target)
↓
Structured Facts (JSON) → Validated 100%
Validated Facts (from Phase 1)
↓
Embed Facts (sentence transformers)
↓
Cluster by Semantic Similarity
↓
Swarm Extraction (N instances, different seeds)
↓
Detect Consensus (plurality threshold)
↓
Detect Contradictions
↓
Judge Resolution (Claude Opus) ← only for contradictions
↓
Interactive Q&A + Report Generation
- Input: PDFs via PyPDF2 (fast) or Tesseract OCR (fallback)
- Encrypted PDFs: Handled automatically (pycryptodome)
- Chunking: Configurable (default: 500 lines, 100 overlap)
- Resume: Smart session tracking for interrupted extractions
- Target: SOC2 reports (50-150 pages), pentest reports (20-100 pages)
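The sliding-window chunking above (default 500 lines with 100 lines of overlap) can be sketched as follows. This is a minimal illustration, not the project's actual chunker, which also tracks session state for resume; the `start_line` field is an assumed bookkeeping detail for the line-number validation step.

```python
def chunk_lines(lines, chunk_size=500, overlap=100):
    """Split a document's lines into overlapping chunks.

    Overlap ensures facts spanning a chunk boundary appear whole
    in at least one chunk.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(lines), step):
        chunks.append({
            "start_line": start + 1,          # 1-indexed, for citation validation
            "lines": lines[start:start + chunk_size],
        })
        if start + chunk_size >= len(lines):  # last window already covers the tail
            break
    return chunks
```

With the defaults, a 1,000-line document yields three chunks starting at lines 1, 401, and 801, each sharing 100 lines with its neighbor.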
- Model: Claude Sonnet via CLI (headless mode)
- Extraction Mode: Maximum depth (5-10 facts per paragraph)
- Schema: Enhanced with 8 metadata fields
  fact_type, control_family, specificity_score, entities, quantitative_values, process_details, section_context, related_control_ids
- V5 Feature: Multiple evidence quotes per fact
- Parallel Processing: 5-11 workers (ThreadPoolExecutor)
- Output: Structured facts (JSON schema)
{
"claim": "specific assertion",
"source_doc": "filename.pdf",
"source_location": "Lines 42-45",
"evidence_quotes": [ // V5: Multiple quotes
{
"quote": "exact text from source",
"source_location": "Lines 42-45",
"relevance": "why this supports claim"
}
],
"confidence": 0.95,
// Enhanced metadata (8 fields)
"fact_type": "technical_control",
"control_family": "access_control",
"specificity_score": 0.85,
"entities": ["AWS", "TLS 1.2"],
"quantitative_values": ["daily", "90 days"],
"process_details": {"who": "IT team", "when": "quarterly", "how": "automated"},
"section_context": "Control Testing",
"related_control_ids": ["CC6.1"]
}
- Quote Verification: Check every quote exists in source text
- Line Number Validation: Verify quotes in specified ranges
- Fuzzy Matching: 70% threshold for OCR artifacts
- Fact Recovery: LLM-assisted recovery for 40-79% matches
- Philosophy: 100% validation rate before saving
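The validation rules above can be sketched with the standard library's `difflib`. This is an illustrative approximation: the document's bands overlap (fuzzy matching at a 70% threshold, recovery for 40-79% matches), so this sketch assumes fuzzy acceptance takes precedence, and the sliding-window scan is a simplification of whatever matching the real validator does.

```python
from difflib import SequenceMatcher

def verify_quote(quote, source_text, threshold=0.70):
    """Classify an evidence quote against the source text.

    Returns "exact" for a verbatim match, "fuzzy" when similarity clears
    the threshold (tolerating OCR artifacts), "recoverable" for the
    0.40-0.70 band handed to LLM-assisted recovery, else "rejected".
    """
    if quote in source_text:
        return "exact"
    # Best similarity of the quote against sliding windows of the source.
    window = len(quote)
    best = 0.0
    for i in range(0, max(1, len(source_text) - window + 1), max(1, window // 4)):
        ratio = SequenceMatcher(None, quote, source_text[i:i + window]).ratio()
        best = max(best, ratio)
    if best >= threshold:
        return "fuzzy"
    if best >= 0.40:
        return "recoverable"
    return "rejected"
```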
- Retroactive QV Tagging: Scan claims for missed quantitative values
- Quality Scoring: Specificity + entities + process details
- Aggressive Filtering: Achieve target QV coverage (35%)
- Output: High-quality, high-density fact set
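A quality score combining the three signals named above might look like the sketch below. The weights and the three-entity saturation point are invented for illustration; the document only states that scoring combines specificity, entities, and process details.

```python
def quality_score(fact):
    """Blend specificity, entity richness, and process detail into [0, 1].

    Weights are illustrative, not the project's actual formula.
    """
    specificity = fact.get("specificity_score", 0.0)
    entity_signal = min(len(fact.get("entities", [])) / 3, 1.0)   # saturates at 3 entities
    process_signal = 1.0 if fact.get("process_details") else 0.0  # any who/when/how present
    return round(0.6 * specificity + 0.25 * entity_signal + 0.15 * process_signal, 3)
```

Aggressive filtering would then drop facts below a cutoff until the 35% quantitative-value coverage target is met.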
- Storage: Local session directories
- Resume: Track processed chunks, resume from interruption
- Progress: Real-time progress tracking
- Consolidation: Auto-consolidate facts per session
- Embedding: Sentence transformers (local, fast, no API costs)
- Clustering: Cosine similarity (threshold: 0.85)
- Plurality: Configurable threshold (default: 0.8 or "all but one")
- Outliers: Facts below threshold are discarded as errors
- Philosophy: Precision over recall
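A minimal sketch of the clustering step, assuming claims have already been embedded (the planned pipeline would use sentence-transformers for that). The greedy first-match strategy here is an assumption; the document specifies only cosine similarity with a 0.85 threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cluster_facts(embeddings, threshold=0.85):
    """Greedy single-pass clustering by cosine similarity.

    Each fact joins the first cluster whose representative (its first
    member) is similar enough; otherwise it starts a new cluster.
    Deterministic, local, and API-free.
    """
    clusters = []  # each cluster is a list of fact indices
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(emb, embeddings[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```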
- Detection: Semantically similar facts with conflicting claims
- Resolution: Judge model (Claude Opus) examines source docs
- Output: Resolved fact + reasoning documented in report appendix
- Model: Claude Opus or latest Sonnet (highest quality)
- Input: Consensus facts + resolved contradictions
- Output: Final answer with confidence score and citations
- Query: Natural language questions over extracted facts
- Search: Semantic search for relevant facts
- Synthesis: Generate answers with citations
User Question: "Does the system implement 2FA?"
Swarm Extraction (5 instances):
- Instance 1: Extracts 3 facts about 2FA
- Instance 2: Extracts 3 facts (2 match Instance 1)
- Instance 3: Extracts 2 facts (both match previous)
- Instance 4: Extracts 4 facts (3 match, 1 is a hallucination)
- Instance 5: Extracts 3 facts (all match consensus)
Consensus:
- Cluster 1: "System uses 2FA" (5/5 instances) → Consensus ✓
- Cluster 2: "2FA via SMS and TOTP" (4/5 instances) → Consensus ✓
- Cluster 3: "Hardware tokens required" (1/5 instances) → Outlier ✗
Judge Synthesis:
- Combines consensus facts
- Confidence: 95% (strong consensus)
- Answer: "Yes, 2FA is implemented via SMS and TOTP"
Report:
- Direct answer with confidence
- Supporting facts with exact page/section citations
- Appendix: 1 hallucination corrected (hardware tokens)
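The consensus tally in the example above reduces to a simple support count per cluster. A sketch, assuming each cluster records which swarm instances contributed a matching fact:

```python
def consensus_verdicts(cluster_members, swarm_size, plurality=0.8):
    """Label each cluster "consensus" or "outlier" by instance support.

    cluster_members maps a cluster label to the swarm instances that
    produced a matching fact; a cluster reaches consensus when the
    agreeing fraction meets the plurality threshold.
    """
    verdicts = {}
    for label, instances in cluster_members.items():
        support = len(set(instances)) / swarm_size
        verdicts[label] = "consensus" if support >= plurality else "outlier"
    return verdicts
```

Applied to the example: 5/5 and 4/5 support clear the default 0.8 threshold, while the 1/5 hardware-token cluster is discarded as an outlier.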
Using embeddings instead of LLM-based comparison:
- Fast: Local computation, no API latency
- Cheap: No API costs for comparison
- Deterministic: Reproducible clustering
- Judge LLM only invoked for contradictions
- Low-occurrence facts (1-2 instances) are assumed errors
- Saves human time by not flagging low-confidence findings
- Trade-off: May miss rare but true facts
- All facts must link to exact text in source
- No paraphrasing in evidence quotes
- Enables verification and trust
- Session state persists across workflow steps
- Parallel swarm execution with retries
- Resume sessions across CLI invocations
CLI flags:
- --swarm-size: Number of LLM instances (default: 5)
- --consensus-threshold: Plurality threshold (default: 0.8)
- --similarity-threshold: Semantic clustering (default: 0.85)
- --swarm-model: Model for swarm (default: claude-sonnet-4)
- --judge-model: Model for judge (default: claude-opus-4)
$ frfr process documents/report.pdf documents/spec.md
Loading documents...
Session started: sess_abc123
> does the system implement 2FA?
[Processing with 5 instances...]
Answer: Yes, 2FA is implemented via SMS and TOTP.
Confidence: 95%
Type 'report' to view full details.
> report
# Full markdown report displayed #
> what TOTP apps are supported?
[Querying extracted facts...]
Answer: Google Authenticator and Authy.
Confidence: 88%
> exit
Session ended.

Markdown report:
- Executive Summary: Direct answer
- Confidence Score: 0-100%
- Supporting Facts: All consensus facts with citations
- Methodology: Swarm size, consensus stats, outliers
- Appendix:
- Corrected hallucinations
- Resolved contradictions
- Low-confidence facts (informational)
Language: Python 3.10+
Dependencies:
- anthropic: Claude API (via CLI)
- pypdf2, pytesseract: Document processing
- click, rich: CLI
- textual: TUI interface
- temporalio: Workflow orchestration (planned) 🔮
- sentence-transformers: Embeddings (planned) 🔮
Module Structure:
frfr/
├── documents/ # Parsing, chunking, storage
├── extraction/ # LLM fact extraction, swarm coordination
├── consensus/ # 🔮 Embeddings, clustering, comparison (planned)
├── judge/ # 🔮 Contradiction resolution, synthesis (planned)
├── workflows/ # 🔮 Workflow orchestration (planned)
├── reporting/ # 🔮 Markdown generation, formatting (planned)
└── cli.py # Interactive CLI
- Security Audits: "Does this pentest report identify critical vulnerabilities?"
- Compliance: "Does this SOC2 report implement the controls in the reference spec?"
- Design Review: "Does this architecture doc address scaling requirements?"
- Governance: "What data retention policies are described?"
- Speed: Multiple LLM calls are slower than single instance
- Cost: N instances = N × API cost (mitigated by smaller swarm model)
- Rare Facts: Low-occurrence truths may be discarded as outliers
- Context Windows: Very large documents require chunking
- Multi-model swarm (different models for diversity)
- Dynamic prompt generation (meta-LLM creates variations)
- Session persistence across CLI invocations
- Incremental fact extraction for multi-document sessions
- Web UI wrapper
- Advanced comparison modes (compliance checking, feature verification)
Precision over recall: Better to say "I don't know" than hallucinate.
Transparency: Show all work in appendix.
Exact citations: Never paraphrase evidence.
Observability: Full execution history and session tracking.
Phase 1 (Extraction & Validation): ✅ COMPLETE - Production Ready
- Document processing: ✅ Complete (PyPDF2 + OCR)
- Enhanced extraction: ✅ Complete (V5 with 8 metadata fields)
- Real-time validation: ✅ Complete (100% rate achieved)
- Post-processing: ✅ Complete (QV tagging + filtering)
- CLI: ✅ Complete (7 commands)
- Session management: ✅ Complete (resume capability)
Phase 2 (Consensus & Q&A): 🔮 Planned
- Swarm consensus: 🔮 Planned
- Semantic clustering: 🔮 Planned
- Contradiction resolution: 🔮 Planned
- Enhanced interactive Q&A: 🔮 Planned
- Workflow orchestration: 🔮 Planned
Production Metrics (V5):
- 1,011 validated facts from 155-page SOC2 report
- 35.0% quantitative value coverage (target achieved)
- 0.878 average specificity (high quality)
- 28 minutes extraction time (170 chunks, 11 workers)
- 100% validation rate
License: TBD (intended for open source release)