Maximum Depth Extraction Mode (Now Default)

Status: ✅ Active by Default
Philosophy: Exceed human analysis capabilities

What Changed

The extraction system now operates in aggressive extraction mode by default, designed to extract more facts than a human analyst would typically capture.

Key Philosophy

"This system aims to EXCEED human analysis capabilities. Extract facts with MAXIMUM DEPTH and SPECIFICITY. Every distinct technical detail, every specific configuration, every quantitative value, every process step is a separate fact."

Depth Instructions

The system now follows these rules:

If a paragraph describes a control, extract 5-10 distinct facts from it
If a sentence contains multiple technical details, create a separate fact for each
If a list has 5 items, create 5 separate facts (one per item)
Extract facts about the same control at different specificity levels:
- High-level: "Uses firewalls for network security"
- Mid-level: "Firewall rules restrict inbound traffic"
- Detailed: "Firewall configured to allow only ports 80 and 443 for inbound HTTPS traffic with stateful packet inspection"
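
As a rough illustration of the one-fact-per-item rule and the three specificity levels, the sketch below models facts in Python. The `Fact` class, its `level` field, and `facts_from_list` are hypothetical names invented for this example, and the list input is illustrative only. The actual extraction is driven by prompt instructions, not by this code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """Hypothetical container for one extracted fact."""
    claim: str
    level: str  # "high", "mid", or "detailed"


def facts_from_list(subject: str, items: list[str]) -> list[Fact]:
    """Create one fact per list item instead of one fact for the whole list."""
    return [Fact(claim=f"{subject} {item}", level="detailed") for item in items]


# The same control captured at three specificity levels.
firewall_facts = [
    Fact("Uses firewalls for network security", "high"),
    Fact("Firewall rules restrict inbound traffic", "mid"),
    Fact("Firewall configured to allow only ports 80 and 443 for inbound "
         "HTTPS traffic with stateful packet inspection", "detailed"),
]

# Illustrative input: a 5-item list yields 5 separate facts.
list_facts = facts_from_list(
    "The system uses",
    ["AWS", "AWS RDS", "AWS EC2", "Splunk Enterprise 9.0", "Okta SSO"],
)
print(len(list_facts))  # 5
```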

What Gets Extracted Now

1. Specific Implementations - Leave NO technical detail behind

Extract:

EVERY technology mentioned: AWS, AWS RDS, AWS EC2, Splunk Enterprise 9.0, Okta SSO
EVERY version: PostgreSQL 13.7, TLS 1.2, TLS 1.3, Python 3.10
EVERY protocol and cryptographic algorithm: TLS 1.3, AES-256-GCM, SHA-256, RSA-4096
EVERY standard: NIST SP 800-53 Rev 5, ISO 27001:2013, OWASP Top 10, HIPAA, SOX
EVERY third-party: vendor names, service providers, subservice organizations

Example:

Original text: "LNRS uses AWS for hosting with RDS PostgreSQL databases and EC2 instances,
secured with TLS 1.3 encryption and monitored by Splunk Enterprise."

Extracted facts:
1. "LNRS uses AWS for cloud hosting"
2. "LNRS uses AWS RDS for database hosting"
3. "LNRS databases run on PostgreSQL"
4. "LNRS uses AWS EC2 for compute instances"
5. "LNRS secures connections with TLS 1.3 encryption"
6. "LNRS uses Splunk Enterprise for monitoring"

2. Concrete Processes - Extract EVERY process detail

Extract:

WHO: Every role, title, team, department (IT manager, Security team, CISO, VP of Engineering, authorized personnel, third-party auditor)
WHEN: Every frequency, schedule, timeframe (daily, weekly, monthly, quarterly, annually, semi-annually, real-time, within 24 hours, every 90 days)
HOW: Every procedure, methodology, workflow step (automated script, manual review, ticketing system, approval workflow)

Example:

Original text: "The IT Security team reviews firewall rules quarterly using an automated
compliance tool, with changes requiring CISO approval before implementation."

Extracted facts:
1. "IT Security team reviews firewall rules quarterly" (WHO: IT Security team, WHEN: quarterly, HOW: review)
2. "Firewall rule reviews use an automated compliance tool" (HOW: automated tool)
3. "Firewall rule changes require CISO approval" (WHO: CISO, WHAT: approval)
4. "CISO approval is required before firewall rule implementation" (WHEN: before implementation)
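
One way to picture the WHO/WHEN/HOW annotations is as optional fields on each fact. The sketch below is an assumption about shape only: `ProcessFact` and `facts_involving` are hypothetical names, and the real metadata schema may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessFact:
    """Hypothetical record pairing a claim with its process metadata."""
    claim: str
    who: Optional[str] = None
    when: Optional[str] = None
    how: Optional[str] = None


# The four facts from the firewall review example above.
process_facts = [
    ProcessFact("IT Security team reviews firewall rules quarterly",
                who="IT Security team", when="quarterly", how="review"),
    ProcessFact("Firewall rule reviews use an automated compliance tool",
                how="automated tool"),
    ProcessFact("Firewall rule changes require CISO approval", who="CISO"),
    ProcessFact("CISO approval is required before firewall rule implementation",
                when="before implementation"),
]


def facts_involving(role: str, facts: list[ProcessFact]) -> list[ProcessFact]:
    """Return the facts whose WHO field names the given role."""
    return [fact for fact in facts if fact.who == role]


for fact in facts_involving("CISO", process_facts):
    print(fact.claim)
```

Keeping the fields optional lets a fact carry only the dimensions the source text actually states.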

3. Technical Details - Extract EVERY quantitative value

Extract:

Numbers with units: 90 days, 365 days, 256-bit, 4096-bit, 8 characters, 8GB RAM
Percentages: 99.9%, 99.95%, 5% error rate, 80% CPU threshold
Frequencies: daily at 2 AM, weekly on Sundays, monthly on first Monday
Thresholds: temperature >80°F, <3 failed login attempts, CPU >80%, disk >90%
Recovery and availability metrics: RTO of 4 hours, RPO of 15 minutes, 99.95% uptime SLA

Example:

Original text: "Backups run daily at 2 AM with 90-day retention. System achieves 99.95%
uptime with RTO of 4 hours and RPO of 15 minutes."

Extracted facts:
1. "Backups run daily at 2 AM"
2. "Backups are retained for 90 days"
3. "System achieves 99.95% uptime"
4. "System has RTO of 4 hours"
5. "System has RPO of 15 minutes"
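
To show what "every quantitative value" means for this passage, here is a deliberately narrow regular-expression sketch. It is not how the extractor works (that is prompt-driven); `QUANTITY_PATTERN` and `find_quantities` are hypothetical, and the pattern only covers the units that appear in this one example.

```python
import re

# Matches a number followed by "%" or by a small set of units.
# Assumption: only days, hours, minutes, AM, and PM need to be recognized here.
QUANTITY_PATTERN = re.compile(
    r"\d+(?:\.\d+)?(?:%|[ -](?:days?|hours?|minutes?|AM|PM)\b)"
)


def find_quantities(text: str) -> list[str]:
    """Return every number-with-unit span found in the text, in order."""
    return QUANTITY_PATTERN.findall(text)


text = (
    "Backups run daily at 2 AM with 90-day retention. System achieves 99.95% "
    "uptime with RTO of 4 hours and RPO of 15 minutes."
)
print(find_quantities(text))
# ['2 AM', '90-day', '99.95%', '4 hours', '15 minutes']
```

The five matches line up with the five extracted facts above, one quantitative value per fact.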

4. Organizational Facts - Extract EVERY organizational detail

Extract:

Team sizes: 8-person IT team, 3 security engineers, 50+ developers
Locations: Alpharetta GA, data center in Virginia, office in London
Reporting structures: reports to CISO, overseen by Board, managed by VP
Responsibilities: responsible for patch management, accountable for backups

5. Compliance Statements - Extract EVERY compliance detail

Extract:

Standards: meets NIST SP 800-53, follows ISO 27001:2013, complies with GDPR Article 32
Certifications: SOC 2 Type 2 certified, PCI DSS Level 1, HIPAA compliant
Requirements: required by policy, mandated by regulation, enforced by contract

6. Test Results - Extract EVERY test detail

Extract:

What was tested: user authentication, firewall rules, backup restoration
How it was tested: inspection, observation, inquiry, re-performance, automated testing
Sample sizes: 25 of 100 users, all 50 servers, representative sample of 10%
Results: no exceptions noted, 3 deviations found, all tests passed

Critical: Exact Quotes Required

The system now emphasizes EXACT, VERBATIM quotes:

**CRITICAL**: Each fact must have an EXACT, VERBATIM quote as evidence from the chunk
- Copy the text WORD-FOR-WORD from the chunk
- DO NOT paraphrase, summarize, or rephrase the quote
- DO NOT change wording, even slightly
- If the exact text is unclear, extract a longer quote to be safe

This addresses the problem of paraphrased evidence quotes, which fail validation against the source, while keeping extraction aggressive.
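
A minimal sketch of what a verbatim check could look like follows. The function names are hypothetical, and collapsing whitespace before comparing is an assumption made here so that quotes spanning a line wrap still match. The real validator may be stricter or looser.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace so wrapped lines compare equal (an assumption)."""
    return " ".join(text.split())


def is_verbatim(quote: str, chunk: str) -> bool:
    """Return True only if the quote appears word-for-word in the chunk."""
    return normalize(quote) in normalize(chunk)


chunk = (
    "LNRS uses AWS for hosting with RDS PostgreSQL databases and EC2 instances,\n"
    "secured with TLS 1.3 encryption and monitored by Splunk Enterprise."
)

# Copied word-for-word across the line break: accepted.
print(is_verbatim("EC2 instances, secured with TLS 1.3 encryption", chunk))  # True

# Paraphrased wording: rejected, even though the meaning is similar.
print(is_verbatim("LNRS secures connections with TLS 1.3", chunk))  # False
```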

Expected Results

Fact Count Increase

Previous: 193 facts from SOC 2 document (with enhanced prompts but not maximum depth)

Expected with Maximum Depth: 400-800+ facts from same document

Every technical detail becomes a fact
Every process step becomes a fact
Every quantitative value becomes a fact
Lists become multiple facts (5 items = 5 facts)

Quality Maintenance

Specificity score should remain high (avg >0.7)
All facts must validate (exact quotes from source)
Metadata completeness should improve (more entities, quantitative_values)

Validation Rate

May see increased rejection rate initially due to more aggressive extraction
Quote corrector should recover many medium-confidence facts (40-79% match)
Overall validated fact count should be 2-3x higher than previous run
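
The match percentages above can be pictured with Python's standard `difflib`. This is a sketch under stated assumptions, not the quote corrector itself: only the 40-79% medium band comes from this document, while the "high" and "low" labels, the sentence splitting, and the function names are invented for illustration.

```python
import re
from difflib import SequenceMatcher


def best_match_ratio(quote: str, chunk: str) -> float:
    """Return the highest similarity between the quote and any sentence in the chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    return max(SequenceMatcher(None, quote, s).ratio() for s in sentences)


def confidence_band(ratio: float) -> str:
    """Map a 0..1 similarity ratio to a band; only 'medium' (40-79%) is from the doc."""
    if ratio >= 0.80:
        return "high"    # assumed label for matches of 80% and above
    if ratio >= 0.40:
        return "medium"  # candidates for recovery by the quote corrector
    return "low"         # assumed label for matches below 40%


chunk = (
    "Backups run daily at 2 AM with 90-day retention. System achieves 99.95% "
    "uptime with RTO of 4 hours and RPO of 15 minutes."
)

exact = "Backups run daily at 2 AM with 90-day retention."
paraphrased = "Backups run every day at 2 AM and are kept for 90 days."

print(confidence_band(best_match_ratio(exact, chunk)))  # high
print(confidence_band(best_match_ratio(paraphrased, chunk)))
```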

Token Usage Impact

Increased Limits:

Max tokens per chunk: 4000 → 6000 (50% increase)
Allows extraction of 20-40 facts per chunk (vs 10-15 previously)

Cost Implications:

~50% more tokens per chunk
But 2-3x more facts extracted
Better value: if both estimates hold, cost per validated fact decreases
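
The arithmetic behind that claim can be worked through directly. The sketch assumes cost scales linearly with the per-chunk token limit and that the 2-3x figure refers to validated facts. Both are simplifications, and `relative_cost_per_fact` is a hypothetical helper.

```python
def relative_cost_per_fact(token_multiplier: float, fact_multiplier: float) -> float:
    """Cost per fact relative to the previous run (1.0 means unchanged)."""
    return token_multiplier / fact_multiplier


token_multiplier = 6000 / 4000  # max tokens per chunk: 4000 -> 6000, i.e. 1.5x

for fact_multiplier in (2.0, 3.0):
    cost = relative_cost_per_fact(token_multiplier, fact_multiplier)
    print(f"{fact_multiplier}x facts -> {cost:.2f}x cost per fact")
# 2.0x facts -> 0.75x cost per fact
# 3.0x facts -> 0.50x cost per fact
```

Under these assumptions, cost per validated fact falls to between 50% and 75% of the previous level. If validated facts grow by less than 1.5x, the cost per fact rises instead.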

Usage

No changes required - this is now the default:

python frfr/cli.py extract-facts output/soc2_full_extraction.txt \
  --document-name my_soc2_report \
  --chunk-size 500 \
  --overlap 100 \
  --max-workers 11

You'll now get:

Maximum depth extraction by default
Every technical detail captured
Every process fully documented
Every quantitative value extracted
Comprehensive metadata for all facts

Examples

Example 1: Physical Security

Original text:

"The data center has 24/7 security guards, badge access system with mantraps,
and surveillance cameras recording at all entrances. Temperature is maintained
at 68°F with alerts at ±5°F variance."

Extracted facts (Maximum Depth):

"Data center has 24/7 security guards"
"Data center uses badge access system"
"Data center has mantraps"
"Data center has surveillance cameras at all entrances"
"Surveillance cameras record activities"
"Data center temperature is maintained at 68°F"
"Temperature alerts trigger at ±5°F variance"
"Temperature alerts trigger when below 63°F"
"Temperature alerts trigger when above 73°F"

9 facts from 2 sentences (about 4.5x a one-fact-per-sentence human baseline)

Example 2: Access Control

Original text:

"Remote VPN access requires multi-factor authentication using SMS codes or
TOTP authenticator apps. Accounts lock after 3 failed login attempts within
15 minutes and remain locked for 30 minutes."

Extracted facts (Maximum Depth):

"Remote VPN access requires multi-factor authentication"
"Multi-factor authentication uses SMS codes"
"Multi-factor authentication uses TOTP authenticator apps"
"Accounts lock after 3 failed login attempts"
"Failed login attempts are counted within 15-minute windows"
"Locked accounts remain locked for 30 minutes"

6 facts from 2 sentences (3x human baseline)

Example 3: Backup and Recovery

Original text:

"Daily incremental backups run at 2 AM UTC with weekly full backups on Sundays.
Backups are encrypted with AES-256 and retained for 90 days. Backup restoration
tests are performed quarterly by the IT Operations team."

Extracted facts (Maximum Depth):

"Daily incremental backups run at 2 AM UTC"
"Weekly full backups run on Sundays"
"Backups are encrypted with AES-256"
"Backups are retained for 90 days"
"Backup restoration tests are performed quarterly"
"Backup restoration tests are performed by IT Operations team"

6 facts from 3 sentences (2x human baseline)

Benefits

Exceeds Human Analysis: Captures details humans might overlook or consider redundant
Complete Coverage: No technical detail left behind
Machine-Queryable: Every detail is a discrete, searchable fact
Audit Trail: Every claim is traceable to exact source quote
Automated Due Diligence: Can answer highly specific questions without re-reading source
Compliance Validation: Can verify specific technical requirements are documented

Trade-offs

Pros

✅ Maximum information extraction
✅ Exceeds human analysis capabilities
✅ Comprehensive coverage of all details
✅ Better for automated analysis and querying
✅ More facts = more data for consensus algorithms

Cons

⚠️ Higher token usage (~50% increase)
⚠️ More facts to validate
⚠️ May include some redundant information
⚠️ Requires quote corrector to recover rejected facts

Net Result: For automated analysis systems, the benefits are expected to outweigh the costs, subject to the measurements listed under Next Steps.

Next Steps

Run Test Extraction with maximum depth mode on SOC 2 document
Measure Results: Compare fact count, specificity, coverage
Integrate Quote Corrector: Auto-recover rejected facts during extraction
Build Deduplication: Post-process to merge highly similar facts if needed
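
For the deduplication step, one possible starting point is a greedy pass that drops any fact too similar to one already kept. This is only a sketch: the 0.9 threshold, the case-insensitive comparison, and the `deduplicate` name are assumptions, and this version keeps the first variant rather than merging metadata.

```python
from difflib import SequenceMatcher


def deduplicate(facts: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a fact only if it is not highly similar to any fact already kept."""
    kept: list[str] = []
    for fact in facts:
        is_duplicate = any(
            SequenceMatcher(None, fact.lower(), other.lower()).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(fact)
    return kept


facts = [
    "Backups are retained for 90 days",
    "Backups are retained for 90 days.",  # near-duplicate of the first fact
    "Backups run daily at 2 AM",
]
print(deduplicate(facts))
# ['Backups are retained for 90 days', 'Backups run daily at 2 AM']
```

The pairwise comparison is quadratic in the number of facts, which may matter at the 400-800+ fact counts expected here.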

Conclusion

The system now operates at maximum depth by default, designed to exceed human analysis capabilities. This makes it suitable for:

Automated compliance validation
Technical due diligence
Security assessments
Detailed audit trail generation
Machine-queryable knowledge bases

Philosophy: Extract everything, validate rigorously, let downstream systems handle deduplication if needed.
