
Enhanced Fact Extraction - High-Specificity System

Status: ✅ Implemented

Overview

This document describes the major enhancements made to address extraction quality issues identified by judge LLM evaluation. The goal: shift from "control awareness" to "control specificity" by capturing technical details, quantitative values, and contextual information that make facts actionable.


Problems Addressed

Critical Issues from Judge Evaluation

  1. Overly Generic Claims: Facts lacked actionable specificity

    • ❌ "LNRS engineers use several monitoring tools"
    • ✅ "LNRS uses Splunk Enterprise for log aggregation and Datadog for infrastructure monitoring with alerts sent when CPU exceeds 80%"
  2. Missing Context Connections: Facts extracted in isolation without relationships

  3. Structural Information Lost:

    • WHO performs control ❌ (rarely captured)
    • WHEN/HOW OFTEN ⚠️ (inconsistently captured)
    • WHAT TOOLS/SYSTEMS 🔴 (almost never captured)
    • Metrics and thresholds 🔴 (never captured)
  4. Table Structure Ignored: SOC2 3-column format conflated different fact types

  5. CUECs Not Extracted: Complementary User Entity Controls completely missed

  6. Test Procedures Underrepresented: Only 15% coverage of test methodology


Solution Architecture

1. Enhanced Fact Schema

File: frfr/extraction/schemas.py:7-67

Added rich metadata fields to ExtractedFact:

from typing import List, Optional

from pydantic import BaseModel


class ExtractedFact(BaseModel):
    # Core fields (unchanged)
    claim: str
    source_doc: str
    source_location: str
    evidence_quote: str
    confidence: float

    # NEW: Enhanced metadata
    fact_type: Optional[str]  # technical_control, organizational, process, metric, CUEC, test_result
    control_family: Optional[str]  # access_control, encryption, monitoring, etc.
    specificity_score: Optional[float]  # 0.0=generic, 1.0=highly specific
    entities: Optional[List[str]]  # Named entities: AWS, TLS 1.2, NIST, Splunk
    quantitative_values: Optional[List[str]]  # 90 days, 256-bit, 99.9%
    process_details: Optional[dict]  # {who: role, when: frequency, how: procedure}
    section_context: Optional[str]  # System Description, Control Testing, CUEC
    related_control_ids: Optional[List[str]]  # CC6.1, A.1.2

Benefits:

  • Enables filtering by fact type (e.g., show only CUECs; see the sketch below)
  • Quantifies specificity (identify generic claims for review)
  • Captures WHO/WHEN/HOW structure
  • Links related facts via control IDs
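
For example, a minimal post-hoc filtering sketch (field names follow the schema above; the consolidated output layout is the one used in the Quality Validation section below):

import json

with open("output/consolidated_facts.json") as f:
    data = json.load(f)

for doc_name, doc_data in data["documents"].items():
    # Keep only CUEC facts, e.g. for a customer-responsibility review
    cuecs = [fact for fact in doc_data["facts"] if fact.get("fact_type") == "CUEC"]
    print(f"{doc_name}: {len(cuecs)} CUEC facts")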

2. Enhanced Document Summarization

File: frfr/extraction/fact_extractor.py:45-130

Expanded from 8 to 10 analysis fields:

New Fields:

  • section_types: Categorized section types with extraction priorities
    {
      "section_type": "Control Testing",
      "characteristics": "technical implementation + test evidence",
      "extraction_priority": "high"
    }
  • table_structure: Detects and describes table formats
    {
      "column_structure": "3-column: Control | Test Performed | Results",
      "extraction_guidance": "Extract facts from each column separately"
    }

Enhanced Extraction Guidance:

  • Emphasizes capturing WHO/WHEN/HOW details
  • Requires quantitative values extraction
  • Demands technical specifications (protocols, algorithms, versions)

3. Section-Aware Extraction

File: frfr/extraction/fact_extractor.py:260-450

Section Detection

Automatically detects chunk context:

  • System Description → organizational/architectural facts
  • Control Testing → technical implementations + test evidence
  • CUEC → customer responsibilities
  • Privacy/Confidentiality → data handling specifics
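
A minimal sketch of keyword-based detection (the detect_section_type helper and keyword lists here are illustrative, not the exact logic in fact_extractor.py):

def detect_section_type(chunk_text: str) -> str:
    """Illustrative keyword-based section detection; the production logic may differ."""
    text = chunk_text.lower()
    if "complementary user entity" in text or "user entity controls" in text:
        return "CUEC"
    if "test performed" in text or "tests of operating effectiveness" in text:
        return "Control Testing"
    if "system description" in text or "description of the system" in text:
        return "System Description"
    if "privacy" in text or "confidentiality" in text:
        return "Privacy/Confidentiality"
    return "General"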

Adaptive Prompting

Different extraction strategies per section type:

if "control testing" in section_context.lower():
    "Focus on: control implementations, test procedures, test results, technologies used"
elif "cuec" in section_context.lower():
    "Focus on: customer responsibilities, required controls, user entity requirements"

Table Structure Recognition

When tables are detected, the extraction prompt adds:

When extracting from tables:
- Extract facts from EACH column separately
- Do NOT conflate "Control Statement" with "Test Performed" with "Results"
- Each column represents a different fact type
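
As an illustration, a single hypothetical row of a 3-column SOC2 testing table should yield separate facts per column rather than one blended claim (the row content and fact values below are invented; fact_type values follow the options listed in the schema above):

# Hypothetical table row and the distinct facts each column should produce
row = {
    "Control": "CC6.1: MFA is required for all production system access via Okta",
    "Test Performed": "Inspected the Okta MFA policy and a sample of 25 production user accounts",
    "Results": "No exceptions noted",
}
expected_facts = [
    {"claim": "MFA is required for all production system access, enforced via Okta",
     "fact_type": "technical_control", "related_control_ids": ["CC6.1"], "entities": ["Okta"]},
    {"claim": "The auditor inspected the Okta MFA policy and a sample of 25 production user accounts",
     "fact_type": "test_result", "quantitative_values": ["25 accounts"]},
    {"claim": "No exceptions were noted for control CC6.1",
     "fact_type": "test_result", "related_control_ids": ["CC6.1"]},
]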

4. Named Entity Recognition & Quantitative Extraction

File: frfr/extraction/fact_extractor.py:374-420

Enhanced prompts explicitly request:

Named Entities:

  • Technology brands/products: AWS, Splunk, Okta, Palo Alto
  • Software versions: PostgreSQL 13.7, TLS 1.3
  • Protocols/algorithms: AES-256, SHA-256, bcrypt
  • Standards/frameworks: NIST SP 800-53, ISO 27001, OWASP Top 10

Quantitative Values:

  • Numbers with units: 90 days, 256-bit, 8 characters
  • Percentages: 99.9%, 5% variance
  • Frequencies: daily, weekly, monthly, quarterly
  • Thresholds: > 80°F, < 3 attempts, at least 12 characters
  • Metrics: RTO of 4 hours, 99.95% uptime SLA
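
These categories also lend themselves to a quick downstream spot-check. A minimal regex sketch (not part of the extraction pipeline; the pattern list is illustrative and not exhaustive) for flagging evidence quotes that contain quantitative content the extractor should have captured:

import re

# Illustrative patterns mirroring the quantitative categories listed above
QUANT_PATTERNS = [
    r"\d+(\.\d+)?\s*%",                                          # percentages: 99.9%, 5%
    r"\b\d+[\s-]*(days?|hours?|minutes?|characters?|bits?)\b",   # numbers with units: 90 days, 256-bit
    r"\b(daily|weekly|monthly|quarterly|annually)\b",            # frequencies
]

def has_quantitative_content(text: str) -> bool:
    """Return True if the text matches any quantitative pattern above."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in QUANT_PATTERNS)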

Process Details:

{
  "who": "IT Security team",
  "when": "quarterly",
  "how": "automated vulnerability scanner with manual review"
}

5. Specificity Scoring

File: frfr/extraction/fact_extractor.py:399-404

Facts now include a specificity score (0.0-1.0):

Examples:

  • 0.2: "Temperature and humidity levels are monitored"
  • 0.5: "Temperature monitored with alerts triggered on variance"
  • 0.9: "Data center temperature maintained at 68°F with alerts triggered at ±5°F variance"

Usage: Filter out low-specificity facts (< 0.5) or prioritize high-specificity for review.
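
For example, a minimal filtering sketch (assuming facts is a list of fact dicts, as loaded in the Quality Validation section below):

LOW_SPECIFICITY_THRESHOLD = 0.5

# Split facts into generic (flag for manual review) and specific
generic_facts = [f for f in facts if (f.get("specificity_score") or 0) < LOW_SPECIFICITY_THRESHOLD]
specific_facts = [f for f in facts if (f.get("specificity_score") or 0) >= LOW_SPECIFICITY_THRESHOLD]
print(f"{len(generic_facts)} generic facts flagged for review out of {len(facts)}")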


6. Multi-Pass Extraction Strategy

File: frfr/extraction/fact_extractor.py:491-630

Optional multi-pass mode for deep extraction:

Pass 1: General Extraction

Standard comprehensive extraction (current default)

Pass 2: CUEC Extraction

Focused on:

  • Customer/user entity responsibilities
  • Controls customers must implement
  • Actions the customer must take
  • Look for: "user entity", "customer responsibility", "complementary controls"

Pass 3: Test Procedures

Focused on:

  • What tests were performed
  • How tests were conducted
  • Test results and outcomes
  • Sample sizes and populations

Pass 4: Quantitative Extraction

Focused on:

  • All numbers with units
  • Percentages and frequencies
  • Thresholds and limits
  • Performance metrics

Pass 5: Technical Specifications

Focused on:

  • Technology brands/products
  • Software versions
  • Protocols and algorithms
  • Standards and frameworks
  • Configurations

Enable: --multipass flag in CLI
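
Conceptually, each specialized pass re-runs extraction over the same chunks with a narrower focus prompt. A minimal sketch of the orchestration (the FOCUSED_PASSES prompts and the extract_fn signature are illustrative, not the actual fact_extractor.py API):

FOCUSED_PASSES = {
    "cuec": "Extract only complementary user entity controls and customer responsibilities.",
    "test_procedures": "Extract only test procedures, methods, sample sizes, and test results.",
    "quantitative": "Extract only facts containing numbers, thresholds, frequencies, or metrics.",
    "technical_specs": "Extract only technologies, versions, protocols, algorithms, standards, and configurations.",
}

def multipass_extract(chunk_text: str, extract_fn) -> list:
    """Run a general pass plus each focused pass over one chunk.

    extract_fn(chunk, focus) is an illustrative stand-in for the real extraction call.
    """
    facts = extract_fn(chunk_text, focus=None)         # Pass 1: general extraction
    for focus in FOCUSED_PASSES.values():              # Passes 2-5: specialized extraction
        facts.extend(extract_fn(chunk_text, focus=focus))
    return facts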


Usage

Maximum Depth Extraction (Default)

The system now operates in aggressive extraction mode by default, designed to exceed human analysis capabilities.

python frfr/cli.py extract-facts output/soc2_full_extraction.txt \
  --document-name my_soc2_report \
  --chunk-size 500 \
  --overlap 100 \
  --max-workers 11

Result: Maximum depth extraction with:

  • Every technical detail extracted as separate facts
  • Every quantitative value captured
  • Every process step documented with WHO/WHEN/HOW
  • Every tool, technology, protocol, and standard identified
  • Enhanced metadata (entities, quantitative_values, process_details, specificity_score)

Philosophy: "If a paragraph describes a control, extract 5-10 distinct facts from it"

Multi-Pass Extraction (Maximum Depth)

python frfr/cli.py extract-facts output/soc2_full_extraction.txt \
  --document-name my_soc2_report \
  --chunk-size 500 \
  --overlap 100 \
  --max-workers 11 \
  --multipass

Result: Additional specialized passes extract CUECs, test procedures, quantitative data, and technical specs that might be missed in the general pass.


Expected Improvements

Coverage Improvements

Based on judge evaluation baseline:

| Category | Before | After (Target) | Improvement |
| --- | --- | --- | --- |
| High-level control existence | 80-85% | 90%+ | +10% |
| Technical implementation specifics | 20-30% | 70-80% | +50% |
| Tool/technology names | 5% | 60-70% | +60% |
| Metrics/thresholds/SLAs | <5% | 50-60% | +55% |
| CUECs | 0% | 80-90% | +90% |
| Test procedures | 15% | 60-70% | +50% |
| Process details (who/when/how) | 30-40% | 70-80% | +40% |

Fact Count Improvements

Before (generic extraction): ~275 facts
After (enhanced + multipass): 500-800+ facts (estimated)

Specificity Improvements

Before:

  • Average specificity: ~0.4 (generic)
  • Few facts mention specific tools/versions

After:

  • Average specificity: ~0.7 (specific)
  • Most facts include named entities and quantitative values

Quality Validation

Specificity Score Distribution

After extraction, analyze specificity:

import json

with open("output/consolidated_facts.json") as f:
    data = json.load(f)

for doc_name, doc_data in data["documents"].items():
    facts = doc_data["facts"]

    # Treat missing or null scores as 0; guard against documents with no facts
    scores = [f.get("specificity_score") or 0 for f in facts]
    avg_specificity = sum(scores) / len(scores) if scores else 0.0

    print(f"Document: {doc_name}")
    print(f"  Average specificity: {avg_specificity:.2f}")
    print(f"  High specificity (>0.7): {sum(1 for s in scores if s > 0.7)}")
    print(f"  Low specificity (<0.4): {sum(1 for s in scores if s < 0.4)}")

Named Entity Coverage

Check extraction of technical terms:

all_entities = []
for fact in facts:
    all_entities.extend(fact.get("entities") or [])

unique_entities = set(all_entities)
print(f"Unique entities extracted: {len(unique_entities)}")
print(f"Examples: {list(unique_entities)[:10]}")

Process Details Coverage

Check WHO/WHEN/HOW extraction:

facts_with_process = [f for f in facts if f.get("process_details")]
print(f"Facts with process details: {len(facts_with_process)}/{len(facts)}")

for fact in facts_with_process[:5]:
    print(f"  {fact['claim'][:60]}...")
    print(f"    WHO: {fact['process_details'].get('who')}")
    print(f"    WHEN: {fact['process_details'].get('when')}")
    print(f"    HOW: {fact['process_details'].get('how')}")

Troubleshooting

Low Specificity Scores

Problem: Most facts have specificity < 0.5

Solutions:

  1. Check if document actually contains specific details
  2. Review extraction_guidance in summary - may need manual tuning
  3. Try --multipass for deeper extraction
  4. Increase max_tokens in extraction prompt (currently 4000)

Missing Named Entities

Problem: entities field mostly empty

Solutions:

  1. Check if document uses brand names or generic terms
  2. Review chunk text - entities may be in different sections
  3. Use --multipass with technical_specs pass
  4. Add custom entity patterns to prompt

CUECs Not Extracted (Even with Multipass)

Problem: No facts with fact_type="CUEC"

Solutions:

  1. Confirm document has CUEC section (search for "user entity")
  2. Check section_types in summary - CUEC section detected?
  3. Manually review CUEC section line ranges
  4. May need to extract CUEC section separately with targeted prompt

Process Details Incomplete

Problem: process_details missing "who", "when", or "how"

Solutions:

  1. Some facts don't have all three - that's expected
  2. Check if source text provides these details
  3. May be implicit (e.g., "automatically" → no "who")
  4. Review evidence_quote to see what's available

Performance Considerations

Multi-Pass Overhead

  • Each specialized pass adds ~30% extraction time
  • 4 specialized passes = ~2-3x total time
  • Recommended for final production runs, not development

Token Usage

  • Enhanced prompts use more tokens (~6k vs 4k per chunk)
  • Responses with metadata use more tokens (~2k vs 1k)
  • Budget: ~8k tokens per chunk (up from ~5k)
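  • As a rough illustration (assuming the ~8k per-chunk budget above), a 100-chunk document runs on the order of 800k tokens for a single general pass; each specialized pass in --multipass mode adds to this roughly in line with the per-pass overhead noted under Multi-Pass Overhead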

Parallel Workers

  • More metadata = more memory per worker
  • Reduce max_workers if OOM errors occur
  • Recommended: 8-11 workers on 16GB RAM, 5 workers on 8GB RAM

Examples

Generic Fact (Old System)

{
  "claim": "LNRS uses monitoring tools",
  "confidence": 0.7,
  "evidence_quote": "LNRS engineers use several monitoring tools..."
}

Specific Fact (Enhanced System)

{
  "claim": "LNRS uses Splunk Enterprise for log aggregation with real-time alerting when error rates exceed 5%",
  "confidence": 0.95,
  "evidence_quote": "LNRS has implemented Splunk Enterprise 9.0 for centralized log aggregation across all production systems. Real-time alerts are configured to trigger when error rates exceed 5% threshold...",
  "fact_type": "technical_control",
  "control_family": "monitoring",
  "specificity_score": 0.92,
  "entities": ["Splunk Enterprise", "Splunk Enterprise 9.0"],
  "quantitative_values": ["5%", "real-time"],
  "process_details": {
    "who": "IT Operations team",
    "when": "real-time",
    "how": "automated alerting with threshold-based triggers"
  },
  "section_context": "Control Testing",
  "related_control_ids": ["CC7.2"]
}

Related Documentation

  • STATUS.md - Project status
  • STRUCTURAL_EXTRACTION.md - Structure-aware extraction approach
  • PARALLEL_AND_RECOVERY.md - Parallel processing and fact recovery
  • DESIGN.md - System architecture

Future Enhancements

  1. Machine Learning Entity Extraction: Use NER models for entity detection
  2. Relationship Graphs: Build knowledge graph of related facts
  3. Automated Specificity Improvement: Post-process to enrich generic facts
  4. Template Library: Pre-defined extraction patterns for common doc types
  5. Confidence Calibration: Adjust confidence based on specificity score
  6. Semantic Deduplication: Merge semantically similar facts
  7. Fact Validation Rules: Domain-specific validation (e.g., valid ports, protocols)

Conclusion

The enhanced extraction system shifts from "awareness" to "specificity" by:

  1. ✅ Capturing named entities (tools, protocols, standards)
  2. ✅ Extracting quantitative values (metrics, thresholds, frequencies)
  3. ✅ Recording process details (WHO/WHEN/HOW)
  4. ✅ Scoring specificity to identify generic claims
  5. ✅ Using section context for adaptive extraction
  6. ✅ Recognizing table structures
  7. ✅ Supporting multi-pass extraction for deep coverage
  8. ✅ Extracting CUECs and test procedures

Result: Actionable, detailed facts suitable for technical due diligence, compliance validation, and security assessment.