Status: ✅ Implemented
This document describes the major enhancements made to address extraction quality issues identified by judge LLM evaluation. The goal: shift from "control awareness" to "control specificity" by capturing technical details, quantitative values, and contextual information that make facts actionable.
- Overly Generic Claims: Facts lacked actionable specificity
  - ❌ "LNRS engineers use several monitoring tools"
  - ✅ "LNRS uses Splunk Enterprise for log aggregation and Datadog for infrastructure monitoring with alerts sent when CPU exceeds 80%"
- Missing Context Connections: Facts extracted in isolation without relationships
- Structural Information Lost:
  - WHO performs control ❌ (rarely captured)
  - WHEN/HOW OFTEN ⚠️ (inconsistently captured)
  - WHAT TOOLS/SYSTEMS 🔴 (almost never captured)
  - Metrics and thresholds 🔴 (never captured)
- Table Structure Ignored: SOC2 3-column format conflated different fact types
- CUECs Not Extracted: Complementary User Entity Controls completely missed
- Test Procedures Underrepresented: Only 15% coverage of test methodology
File: frfr/extraction/schemas.py:7-67
Added rich metadata fields to ExtractedFact:
```python
from typing import List, Optional

from pydantic import BaseModel


class ExtractedFact(BaseModel):
    # Core fields (unchanged)
    claim: str
    source_doc: str
    source_location: str
    evidence_quote: str
    confidence: float

    # NEW: Enhanced metadata
    fact_type: Optional[str]                  # technical_control, organizational, process, metric, CUEC, test_result
    control_family: Optional[str]             # access_control, encryption, monitoring, etc.
    specificity_score: Optional[float]        # 0.0 = generic, 1.0 = highly specific
    entities: Optional[List[str]]             # Named entities: AWS, TLS 1.2, NIST, Splunk
    quantitative_values: Optional[List[str]]  # 90 days, 256-bit, 99.9%
    process_details: Optional[dict]           # {who: role, when: frequency, how: procedure}
    section_context: Optional[str]            # System Description, Control Testing, CUEC
    related_control_ids: Optional[List[str]]  # CC6.1, A.1.2
```

Benefits:
- Enables filtering by fact type (show me all CUECs)
- Quantifies specificity (identify generic claims for review)
- Captures WHO/WHEN/HOW structure
- Links related facts via control IDs
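For example, a minimal filtering sketch over the consolidated output (same JSON layout as the verification snippets later in this document):

```python
import json

# Load the consolidated output (layout matches the verification snippets below)
with open("output/consolidated_facts.json") as f:
    data = json.load(f)

all_facts = [
    fact
    for doc_data in data["documents"].values()
    for fact in doc_data["facts"]
]

# "Show me all CUECs"
cuecs = [f for f in all_facts if f.get("fact_type") == "CUEC"]
print(f"CUECs extracted: {len(cuecs)}")
```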
File: frfr/extraction/fact_extractor.py:45-130
Expanded from 8 to 10 analysis fields:
New Fields:
- section_types: Categorized section types with extraction priorities

```json
{
  "section_type": "Control Testing",
  "characteristics": "technical implementation + test evidence",
  "extraction_priority": "high"
}
```

- table_structure: Detects and describes table formats

```json
{
  "column_structure": "3-column: Control | Test Performed | Results",
  "extraction_guidance": "Extract facts from each column separately"
}
```
Enhanced Extraction Guidance:
- Emphasizes capturing WHO/WHEN/HOW details
- Requires quantitative values extraction
- Demands technical specifications (protocols, algorithms, versions)
File: frfr/extraction/fact_extractor.py:260-450
Automatically detects chunk context:
- System Description → organizational/architectural facts
- Control Testing → technical implementations + test evidence
- CUEC → customer responsibilities
- Privacy/Confidentiality → data handling specifics
Different extraction strategies per section type:
if "control testing" in section_context.lower():
"Focus on: control implementations, test procedures, test results, technologies used"
elif "cuec" in section_context.lower():
"Focus on: customer responsibilities, required controls, user entity requirements"When tables detected:
When extracting from tables:
- Extract facts from EACH column separately
- Do NOT conflate "Control Statement" with "Test Performed" with "Results"
- Each column represents a different fact type
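As an illustration only (not the extractor's actual code), splitting one row of a 3-column SOC2 test table into per-column fact candidates might look like:

```python
# Hypothetical illustration: each column of a 3-column SOC2 test table becomes
# its own fact candidate instead of one conflated claim.
row = {
    "Control": "Logical access to production systems is restricted to authorized personnel.",
    "Test Performed": "Inspected access provisioning records for a sample of in-scope users.",
    "Results": "No exceptions noted.",
}

fact_candidates = [
    {
        "claim": text,
        "section_context": "Control Testing",
        "source_location": f"table column: {column}",
    }
    for column, text in row.items()
]

for candidate in fact_candidates:
    print(candidate["source_location"], "->", candidate["claim"])
```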
File: frfr/extraction/fact_extractor.py:374-420
Enhanced prompts explicitly request:
Named Entities:
- Technology brands/products: AWS, Splunk, Okta, Palo Alto
- Software versions: PostgreSQL 13.7, TLS 1.3
- Protocols/algorithms: AES-256, SHA-256, bcrypt
- Standards/frameworks: NIST SP 800-53, ISO 27001, OWASP Top 10
Quantitative Values:
- Numbers with units: 90 days, 256-bit, 8 characters
- Percentages: 99.9%, 5% variance
- Frequencies: daily, weekly, monthly, quarterly
- Thresholds: > 80°F, < 3 attempts, at least 12 characters
- Metrics: RTO of 4 hours, 99.95% uptime SLA
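The extractor requests these via the prompt; as a separate, hedged sketch, a lightweight sanity check over an extracted quantitative_values list might look like this (patterns illustrative, not exhaustive):

```python
import re

# Rough patterns for the value kinds listed above (illustrative, not exhaustive)
QUANT_PATTERN = re.compile(
    r"(\d+(\.\d+)?[\s-]*(%|days?|hours?|characters?|bits?|attempts?|°F))"
    r"|(daily|weekly|monthly|quarterly|annually)",
    re.IGNORECASE,
)


def looks_quantitative(value: str) -> bool:
    """Return True if a string resembles a number-with-unit, percentage, or frequency."""
    return bool(QUANT_PATTERN.search(value))


for example in ["90 days", "256-bit", "99.9%", "quarterly", "RTO of 4 hours"]:
    assert looks_quantitative(example), example
```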
Process Details:
```json
{
  "who": "IT Security team",
  "when": "quarterly",
  "how": "automated vulnerability scanner with manual review"
}
```

File: frfr/extraction/fact_extractor.py:399-404
Facts now include a specificity score (0.0-1.0):
Examples:
- 0.2: "Temperature and humidity levels are monitored"
- 0.5: "Temperature monitored with alerts triggered on variance"
- 0.9: "Data center temperature maintained at 68°F with alerts triggered at ±5°F variance"
Usage: Filter out low-specificity facts (< 0.5) or prioritize high-specificity for review.
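A minimal sketch of that filtering (assumes facts is a list of fact dictionaries as loaded in the verification snippets below):

```python
# Keep only reasonably specific facts, most specific first
specific_facts = [f for f in facts if (f.get("specificity_score") or 0) >= 0.5]
review_queue = sorted(specific_facts, key=lambda f: f["specificity_score"], reverse=True)

for fact in review_queue[:10]:
    print(f'{fact["specificity_score"]:.2f}  {fact["claim"][:80]}')
```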
File: frfr/extraction/fact_extractor.py:491-630
Optional multi-pass mode for deep extraction:
Pass 1: Standard comprehensive extraction (the current default)

Pass 2: CUEC extraction, focused on:
- Customer/user entity responsibilities
- Controls customers must implement
- Actions the customer must take
- Look for: "user entity", "customer responsibility", "complementary controls"

Pass 3: Test procedures, focused on:
- What tests were performed
- How tests were conducted
- Test results and outcomes
- Sample sizes and populations

Pass 4: Quantitative data, focused on:
- All numbers with units
- Percentages and frequencies
- Thresholds and limits
- Performance metrics

Pass 5: Technical specifications, focused on:
- Technology brands/products
- Software versions
- Protocols and algorithms
- Standards and frameworks
- Configurations
Enable: --multipass flag in CLI
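A minimal sketch of how the specialized passes could be chained; run_extraction and its signature are hypothetical stand-ins for the real orchestration in fact_extractor.py:

```python
# Hypothetical orchestration of the specialized passes (illustrative only).
# run_extraction(chunks, focus) is an assumed helper, not the real API.
PASS_FOCUSES = {
    "cuec": "customer responsibilities, required controls, user entity requirements",
    "test_procedures": "tests performed, how they were conducted, results, sample sizes",
    "quantitative": "numbers with units, percentages, frequencies, thresholds, metrics",
    "technical_specs": "products, versions, protocols, algorithms, standards, configurations",
}


def multipass_extract(chunks, run_extraction):
    """Run the standard pass, then one specialized pass per focus area."""
    facts = run_extraction(chunks, focus=None)  # Pass 1: comprehensive extraction
    for focus in PASS_FOCUSES.values():
        facts.extend(run_extraction(chunks, focus=focus))  # Passes 2-5
    return facts
```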
The system now operates in aggressive extraction mode by default, designed to exceed human analysis capabilities.
```bash
python frfr/cli.py extract-facts output/soc2_full_extraction.txt \
  --document-name my_soc2_report \
  --chunk-size 500 \
  --overlap 100 \
  --max-workers 11
```

Result: Maximum depth extraction with:
- Every technical detail extracted as separate facts
- Every quantitative value captured
- Every process step documented with WHO/WHEN/HOW
- Every tool, technology, protocol, and standard identified
- Enhanced metadata (entities, quantitative_values, process_details, specificity_score)
Philosophy: "If a paragraph describes a control, extract 5-10 distinct facts from it"
```bash
python frfr/cli.py extract-facts output/soc2_full_extraction.txt \
  --document-name my_soc2_report \
  --chunk-size 500 \
  --overlap 100 \
  --max-workers 11 \
  --multipass
```

Result: Additional specialized passes extract CUECs, test procedures, quantitative data, and technical specs that might be missed in the general pass.
Based on judge evaluation baseline:
| Category | Before | After (Target) | Improvement |
|---|---|---|---|
| High-level control existence | 80-85% | 90%+ | +10% |
| Technical implementation specifics | 20-30% | 70-80% | +50% |
| Tool/technology names | 5% | 60-70% | +60% |
| Metrics/thresholds/SLAs | <5% | 50-60% | +55% |
| CUECs | 0% | 80-90% | +90% |
| Test procedures | 15% | 60-70% | +50% |
| Process details (who/when/how) | 30-40% | 70-80% | +40% |
Before (generic extraction): ~275 facts
After (enhanced + multipass): 500-800+ facts (estimated)
Before:
- Average specificity: ~0.4 (generic)
- Few facts mention specific tools/versions
After:
- Average specificity: ~0.7 (specific)
- Most facts include named entities and quantitative values
After extraction, analyze specificity:
```python
import json

with open("output/consolidated_facts.json") as f:
    data = json.load(f)

for doc_name, doc_data in data["documents"].items():
    facts = doc_data["facts"]
    # Treat missing or null scores as 0
    scores = [f.get("specificity_score") or 0 for f in facts]
    avg_specificity = sum(scores) / len(scores) if scores else 0.0
    print(f"Document: {doc_name}")
    print(f"  Average specificity: {avg_specificity:.2f}")
    print(f"  High specificity (>0.7): {sum(1 for s in scores if s > 0.7)}")
    print(f"  Low specificity (<0.4): {sum(1 for s in scores if s < 0.4)}")
```

Check extraction of technical terms:
```python
all_entities = []
for fact in facts:
    all_entities.extend(fact.get("entities") or [])

unique_entities = set(all_entities)
print(f"Unique entities extracted: {len(unique_entities)}")
print(f"Examples: {list(unique_entities)[:10]}")
```

Check WHO/WHEN/HOW extraction:
```python
facts_with_process = [f for f in facts if f.get("process_details")]
print(f"Facts with process details: {len(facts_with_process)}/{len(facts)}")

for fact in facts_with_process[:5]:
    print(f"  {fact['claim'][:60]}...")
    print(f"  WHO: {fact['process_details'].get('who')}")
    print(f"  WHEN: {fact['process_details'].get('when')}")
    print(f"  HOW: {fact['process_details'].get('how')}")
```

Problem: Most facts have specificity < 0.5
Solutions:
- Check if document actually contains specific details
- Review extraction_guidance in summary - may need manual tuning
- Try --multipass for deeper extraction
- Increase max_tokens in extraction prompt (currently 4000)
Problem: entities field mostly empty
Solutions:
- Check if document uses brand names or generic terms
- Review chunk text - entities may be in different sections
- Use --multipass with technical_specs pass
- Add custom entity patterns to prompt
Problem: No facts with fact_type="CUEC"
Solutions:
- Confirm document has CUEC section (search for "user entity")
- Check section_types in summary - CUEC section detected?
- Manually review CUEC section line ranges
- May need to extract CUEC section separately with targeted prompt
Problem: process_details missing "who", "when", or "how"
Solutions:
- Some facts don't have all three - that's expected
- Check if source text provides these details
- May be implicit (e.g., "automatically" → no "who")
- Review evidence_quote to see what's available
Multipass overhead:
- Each specialized pass adds ~30% extraction time
- 4 specialized passes = ~2-3x total time
- Recommended for final production runs, not development

Token usage:
- Enhanced prompts use more tokens (~6k vs 4k per chunk)
- Responses with metadata use more tokens (~2k vs 1k)
- Budget: ~8k tokens per chunk (up from ~5k)

Memory:
- More metadata = more memory per worker
- Reduce max_workers if OOM errors occur
- Recommended: 8-11 workers on 16GB RAM, 5 workers on 8GB RAM
Before:

```json
{
  "claim": "LNRS uses monitoring tools",
  "confidence": 0.7,
  "evidence_quote": "LNRS engineers use several monitoring tools..."
}
```

After:

```json
{
  "claim": "LNRS uses Splunk Enterprise for log aggregation with real-time alerting when error rates exceed 5%",
  "confidence": 0.95,
  "evidence_quote": "LNRS has implemented Splunk Enterprise 9.0 for centralized log aggregation across all production systems. Real-time alerts are configured to trigger when error rates exceed 5% threshold...",
  "fact_type": "technical_control",
  "control_family": "monitoring",
  "specificity_score": 0.92,
  "entities": ["Splunk Enterprise", "Splunk Enterprise 9.0"],
  "quantitative_values": ["5%", "real-time"],
  "process_details": {
    "who": "IT Operations team",
    "when": "real-time",
    "how": "automated alerting with threshold-based triggers"
  },
  "section_context": "Control Testing",
  "related_control_ids": ["CC7.2"]
}
```

Related documents:
- STATUS.md - Project status
- STRUCTURAL_EXTRACTION.md - Structure-aware extraction approach
- PARALLEL_AND_RECOVERY.md - Parallel processing and fact recovery
- DESIGN.md - System architecture
Future enhancements:
- Machine Learning Entity Extraction: Use NER models for entity detection
- Relationship Graphs: Build knowledge graph of related facts
- Automated Specificity Improvement: Post-process to enrich generic facts
- Template Library: Pre-defined extraction patterns for common doc types
- Confidence Calibration: Adjust confidence based on specificity score
- Semantic Deduplication: Merge semantically similar facts
- Fact Validation Rules: Domain-specific validation (e.g., valid ports, protocols)
The enhanced extraction system shifts from "awareness" to "specificity" by:
- ✅ Capturing named entities (tools, protocols, standards)
- ✅ Extracting quantitative values (metrics, thresholds, frequencies)
- ✅ Recording process details (WHO/WHEN/HOW)
- ✅ Scoring specificity to identify generic claims
- ✅ Using section context for adaptive extraction
- ✅ Recognizing table structures
- ✅ Supporting multi-pass extraction for deep coverage
- ✅ Extracting CUECs and test procedures
Result: Actionable, detailed facts suitable for technical due diligence, compliance validation, and security assessment.