Status: ✅ Implemented
Enhanced the fact extraction system to analyze document structure and use that analysis to extract significantly more facts with higher precision. The system now understands document types (SOC2, pentest, architecture docs, etc.) and adapts extraction strategies accordingly.
Before: Generic extraction approach treated all documents the same
- SOC2 reports with 100+ controls → extracted only ~90 facts
- Missed many specific claims under structured headings
- No awareness of document patterns (claim-based vs findings-based)
- One-size-fits-all prompts missed domain-specific facts
Expected: For a 155-page SOC2 report with dozens of control categories, we should extract 200-400+ facts
Analyze the document to understand:
- Document Type: SOC2, pentest, architecture doc, policy, etc.
- Structural Pattern: How content is organized (claim-based, findings-based, procedural, descriptive)
- Section Headings: Major categories and control areas
- Fact Density Pattern: Types of facts that repeat throughout
- Extraction Guidance: Document-specific extraction strategies
Use structural analysis to:
- Guide LLM on what types of facts to prioritize
- Set appropriate fact density expectations (5-15 facts per chunk)
- Provide domain-specific examples
- Extract granular, distinct claims (not overgeneralize)
1. Document Type: What kind of document is this?
2. Primary Topics: Main topics covered
3. Key Entities: Systems/organizations mentioned
4. Scope: Timeframe and boundaries
5. Structure: How organized
1. Document Type: Specific type (SOC2 Type 2, pentest, architecture, etc.)
2. Structural Pattern: HOW is it organized?
- Claim-based with assertions under headings? (SOC2)
- Findings-based with discovered issues? (pentest)
- Procedural with step-by-step processes?
- Descriptive with technical details?
3. Section Headings: 8-12 major section headings
4. Fact Density Pattern: What factual claims repeat?
- Specific controls/requirements?
- Technologies and configurations?
- Processes and procedures?
- Compliance statements?
- Test results?
5. Extraction Guidance: What facts to prioritize?
- SOC2: control implementations, technologies, procedures
- Pentest: vulnerabilities, severity, affected systems
- Architecture: components, data flows, security measures
Result: 30,000 character analysis (vs 20,000) with 3,000 token response (vs 2,000)
Extract facts about systems, processes, controls, technologies.
Examples:
✓ "Uses AWS for hosting"
✓ "2FA required for admin"
DO NOT extract document metadata.
Based on document type (SOC2 Type 2), extract MANY facts per chunk.
This document follows a claim-based pattern with specific assertions
under control headings.
EXTRACTION GUIDANCE:
- Extract specific control implementations
- Document all technologies and configurations
- Capture procedures and frequencies
- Record compliance statements
FACT TYPES TO EXTRACT:
1. Specific implementations (AWS, Okta, TLS 1.2)
2. Concrete processes (reviewed quarterly, approved by team)
3. Technical details (versions, thresholds, parameters)
4. Organizational facts (roles, responsibilities)
5. Compliance statements (meets NIST, follows ISO 27001)
6. Quantitative data (90 days, 256-bit, daily)
TARGET: 5-15 facts per chunk
RULE: Each distinct claim is a separate fact
RULE: DO NOT skip similar facts - extract all distinct claims
{
"document_type": "SOC2 Type 2 Compliance Report",
"structural_pattern": "Claim-based with specific control assertions organized under TSC (Trust Service Criteria) categories. Each control includes description, implementation details, and testing procedures.",
"section_headings": [
"Common Criteria (CC)",
"Additional Criteria for Confidentiality (C)",
"Logical and Physical Access Controls",
"System Operations",
"Change Management",
"Risk Mitigation",
"System Monitoring",
"Incident Response"
],
"fact_density_pattern": "High density of factual claims. Each control section contains: specific technologies used, implementation procedures, monitoring frequencies, responsible parties, and compliance assertions.",
"extraction_guidance": "For SOC2: Extract specific control implementations (firewalls, encryption, MFA), technologies used (AWS, Okta, Splunk), procedures followed (quarterly reviews, annual audits), and quantitative details (retention periods, frequencies, thresholds)."
}Before (Generic Extraction):
Chunk 5 (lines 4001-5000): 3 facts extracted
- "Uses multi-factor authentication"
- "Performs vulnerability scanning"
- "Has disaster recovery plan"
After (Structure-Aware Extraction):
Chunk 5 (lines 4001-5000): 12 facts extracted
- "Remote VPN users must authenticate using multi-factor authentication"
- "Multi-factor authentication uses SMS codes and TOTP authenticator apps"
- "MFA is required for all administrative access to production systems"
- "Vulnerability scans are performed on an ongoing basis"
- "Third-party vendors conduct quarterly vulnerability assessments"
- "Internal IT staff performs weekly vulnerability scans"
- "Disaster recovery plan is maintained and documented"
- "Disaster recovery plan is reviewed annually by management"
- "Disaster recovery tests are conducted semi-annually"
- "Recovery time objective (RTO) is defined in the DRP"
- "Recovery point objective (RPO) is documented and reviewed"
- "DRP testing results are documented and remediation tracked"
Improvement: 4x more facts (3 → 12 per chunk)
Enhanced summarize_document() method:
def summarize_document(self, text: str, document_name: str) -> dict:
"""Generate comprehensive structural summary."""
prompt = """
Analyze document structure:
1. Document Type (specific)
2. Structural Pattern (how organized)
3. Section Headings (8-12 major headings)
4. Fact Density Pattern (what repeats)
5. Primary Topics
6. Key Entities
7. Scope
8. Extraction Guidance (what to prioritize)
Provide JSON with all 8 fields.
"""
# Analyze first 30,000 characters (increased from 20,000)
content = self.client.prompt(prompt, max_tokens=3000) # increased from 2000
return json.loads(content)Enhanced extract_facts_from_chunk() method:
def extract_facts_from_chunk(...) -> List[ExtractedFact]:
"""Extract facts using structural guidance."""
# Get document-specific context
doc_type = summary.get("document_type")
structural_pattern = summary.get("structural_pattern")
fact_density_pattern = summary.get("fact_density_pattern")
extraction_guidance = summary.get("extraction_guidance")
prompt = f"""
Document Type: {doc_type}
Pattern: {structural_pattern}
EXTRACTION INSTRUCTIONS:
{extraction_guidance}
FACT TYPES: {fact_density_pattern}
TARGET: 5-15 facts per chunk
RULE: Each distinct claim is a separate fact
RULE: Extract ALL distinct claims
[Detailed fact type examples...]
"""Before (Generic):
- 9 chunks × 3-5 facts/chunk = 27-45 facts total
- Actual SOC2 extraction: 91 facts
After (Structure-Aware):
- 9 chunks × 8-15 facts/chunk = 72-135 facts total
- Expected SOC2 extraction: 200-300+ facts
Improvement: 2-3x more facts extracted
- More Granular: Each distinct claim is a separate fact
- More Complete: Doesn't skip "similar" facts
- More Specific: Captures technical details and quantitative data
- More Contextual: Understands document structure and adapts
- More Accurate: Still validated, recovery for medium-confidence
Pattern: Claim-based with control assertions Extracts: Control implementations, technologies, procedures, compliance statements, testing frequencies
Pattern: Findings-based with discovered issues Extracts: Vulnerabilities, severity levels, affected systems, remediation steps, CVSS scores
Pattern: Descriptive with technical details Extracts: Components, data flows, security measures, integration points, technologies
Pattern: Procedural with requirements Extracts: Requirements, responsibilities, enforcement mechanisms, exceptions, review frequencies
Pattern: Assessment-based with findings Extracts: Controls tested, results, observations, exceptions, recommendations
# Create new session to test enhanced extraction
python frfr/cli.py extract-facts output/soc2_full_extraction.txt \
--document-name lexisnexis_soc2_enhanced \
--max-workers 5
# Compare fact counts
python frfr/cli.py consolidate-facts <session_id> \
--document-name lexisnexis_soc2_enhanced \
-o output/enhanced_facts.json
# Expected: 200-300+ facts (vs previous 91)- Check fact density: Should see 8-15 facts per chunk
- Review fact types: Should see granular, specific claims
- Validate coverage: Should cover all major control categories
- Check quality: All facts should still pass validation (100% rate)
The system automatically:
- Analyzes document structure
- Detects document type
- Adapts extraction strategy
- Sets appropriate fact density targets
If needed, you can modify the summary after generation:
# Load and modify summary
summary = session.load_summary(document_name)
summary["extraction_guidance"] = "Custom guidance for this document type..."
session.save_summary(document_name, summary)Problem: Still extracting < 100 facts for large structured document
Solutions:
- Check summary analysis:
cat .frfr_sessions/<session>/summaries/*.json - Verify extraction_guidance is document-specific
- Ensure structural_pattern was detected correctly
- Check if document has unusual structure (may need custom guidance)
Problem: Extraction rate high but validation rate low
Solutions:
- Recovery should catch medium-confidence facts
- Check if quotes are too paraphrased
- Verify line ranges are accurate
- May need to tune similarity threshold
Problem: Summary misidentifies document type
Solutions:
- Provide more descriptive document name
- Manually update summary before extraction
- Check if document header/intro is unusual
- Template Library: Pre-defined extraction templates for common doc types
- Learning System: Learn optimal extraction patterns from validated results
- Multi-Pass Extraction: Second pass to catch missed facts in sparse chunks
- Section-Aware Extraction: Extract facts with section context labels
- Fact Relationships: Link related facts (e.g., control + evidence)
- Add document type templates
- Share extraction patterns for industry-specific documents
- Contribute validation rules for specialized domains
frfr/extraction/fact_extractor.py- Core implementationSTATUS.md- Project statusPARALLEL_AND_RECOVERY.md- Parallel processing and recovery featuresDESIGN.md- System architecture