Status: ✅ Implemented
Two major enhancements have been added to improve extraction speed and fact recovery:
- Parallel Chunk Processing - Process multiple chunks simultaneously
- Medium-Confidence Fact Recovery - LLM-assisted recovery of facts with partial validation
Processes multiple document chunks in parallel using a thread pool, significantly reducing total extraction time.
- Configurable Workers: Control max parallel processes (default: 5)
- Thread Pool: Uses `ThreadPoolExecutor` for safe concurrent execution
- Progress Tracking: Real-time progress bar showing completed chunks
- Ordered Results: Facts are combined in chunk order regardless of completion order
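The behavior described above (bounded thread pool, progress tracking, ordered results) can be sketched with `ThreadPoolExecutor` and `as_completed`. This is an illustrative sketch; the function and variable names are hypothetical, not the project's actual API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_chunks(chunks, worker, max_workers=5):
    """Run worker(chunk) concurrently and return results in chunk order."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its chunk index so order can be restored
        futures = {executor.submit(worker, chunk): i for i, chunk in enumerate(chunks)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            print(f"Progress: {done}/{len(chunks)} chunks complete")
    # Combine in original chunk order regardless of completion order
    return [results[i] for i in range(len(chunks))]
```

Because results are keyed by chunk index, completion order never affects the final ordering.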
```bash
python frfr/cli.py extract-facts <text_file> \
    --document-name <doc_name> \
    --max-workers 5  # Default: 5, adjust based on system resources
```

Examples:

```bash
# Use default (5 parallel workers)
python frfr/cli.py extract-facts output/soc2_full_extraction.txt \
    --document-name my_doc

# Increase to 10 workers for faster processing (requires more resources)
python frfr/cli.py extract-facts output/soc2_full_extraction.txt \
    --document-name my_doc \
    --max-workers 10

# Reduce to 2 workers for resource-constrained environments
python frfr/cli.py extract-facts output/soc2_full_extraction.txt \
    --document-name my_doc \
    --max-workers 2
```

Sequential Processing (max-workers=1):
- 9 chunks × ~2 min/chunk = ~18 minutes
Parallel Processing (max-workers=5):
- 9 chunks ÷ 5 workers ≈ 2 batches
- ~2 min × 2 batches = ~4-5 minutes (3-4x speedup)
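The arithmetic above generalizes: with a roughly fixed per-chunk time, wall-clock time is about one per-chunk duration per batch of workers. A back-of-envelope helper (illustrative only; real chunk times vary with API latency):

```python
import math

def estimate_minutes(num_chunks, max_workers, minutes_per_chunk=2.0):
    """Rough wall-clock estimate: chunks run in waves of at most max_workers."""
    batches = math.ceil(num_chunks / max_workers)
    return batches * minutes_per_chunk

print(estimate_minutes(9, 1))  # 18.0 minutes, sequential
print(estimate_minutes(9, 5))  # 4.0 minutes, two batches of 5 workers
```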
File: `frfr/extraction/fact_extractor.py`

```python
# Thread pool executor processes chunks concurrently
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
    future_to_chunk = {}
    for chunk_info in chunks_to_process:
        future = executor.submit(
            self._process_single_chunk,
            chunk_info, document_name, summary, session, validator, len(all_facts)
        )
        future_to_chunk[future] = chunk_info[0]

    # Collect results as they complete
    for future in as_completed(future_to_chunk):
        chunk_id, validated_facts, stats = future.result()
        chunk_results[chunk_id] = (validated_facts, stats)
```

When a fact has a 40-79% validation match (medium confidence), the system attempts to recover it by:
- Using LLM to search for the correct evidence quote
- Verifying the recovered quote exists in the source
- Updating the fact with corrected evidence
This saves facts that would otherwise be rejected, improving overall extraction yield.
- Smart Threshold: Only attempts recovery for 40-79% matches
- LLM Search: Uses Claude to find exact supporting quotes
- Verification: Validates recovered quotes against source text
- Fact Update: Automatically updates evidence_quote and source_location
- Tracking: Marks recovered facts and includes them in stats
Without recovery:

```
Extract Fact → Validate Quote → Quote Found (>80% match) → ✓ Accept
                              → Quote Not Found (<40% match) → ✗ Reject
```

With recovery:

```
Extract Fact → Validate Quote → Quote Found (>80% match) → ✓ Accept
                              → Medium Match (40-79%) → Attempt Recovery
                                                      → Recovery Succeeds → ✓ Accept (with updated quote)
                                                      → Recovery Fails → ✗ Reject
                              → Quote Not Found (<40% match) → ✗ Reject
```
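The decision flow above can be expressed as a small routing function. This is an illustrative sketch, not the project's code: the `match_score` input and the recovery callable's signature are assumptions.

```python
def validate_with_recovery(fact, match_score, attempt_recovery):
    """Route a fact to accept / recover / reject based on its quote-match score.

    match_score: 0-1 fraction of the evidence quote found in the source.
    attempt_recovery: callable(claim, quote) -> (new_quote, location) or None.
    """
    if match_score >= 0.8:
        return fact, "accepted"
    if match_score >= 0.4:
        # Medium confidence: ask the LLM for a better supporting quote
        recovered = attempt_recovery(fact["claim"], fact["evidence_quote"])
        if recovered:
            new_quote, location = recovered
            updated = {**fact, "evidence_quote": new_quote, "source_location": location}
            return updated, "recovered"
    return fact, "rejected"
```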
Original Fact (60% match):

```json
{
  "claim": "Remote user VPN connections utilize multi-factor authentication",
  "evidence_quote": "VPN connections require MFA for access",
  "source_location": "Lines 100-102"
}
```

Recovery Process:
- LLM searches lines 80-122 (expanded context)
- Finds exact quote: "Remote users connecting via VPN must authenticate using multi-factor authentication methods"
- Verifies quote exists at lines 98-100
Recovered Fact:

```json
{
  "claim": "Remote user VPN connections utilize multi-factor authentication",
  "evidence_quote": "Remote users connecting via VPN must authenticate using multi-factor authentication methods",
  "source_location": "Lines 80-122"
}
```

Recovery is enabled by default when the Claude client is available. No additional configuration is needed.
To disable recovery (not recommended):
```python
# In fact_extractor.py, pass claude_client=None
validator = FactValidator(text_file, claude_client=None)
```

File: `frfr/validation/fact_validator.py`
```python
def attempt_fact_recovery(
    self, claim: str, original_quote: str, search_context: str,
    start_line: int, end_line: int
) -> Optional[Tuple[str, str]]:
    """Use LLM to find correct quote for medium-confidence facts."""
    prompt = f"""Find the exact quote supporting this claim:
Claim: {claim}
Original Quote: {original_quote}
Context: {search_context}
"""
    response = self.claude_client.prompt(prompt)
    result = json.loads(response)
    if result["found"] and result["confidence"] >= 0.8:
        recovered_quote = result["quote"]
        # Verify quote exists in context
        if self.find_quote_in_text(recovered_quote, search_context):
            return recovered_quote, f"Lines {start_line}-{end_line}"
    return None
```

Recovery stats are included in the extraction output:
```
Chunk 5 complete: 50 extracted, 35 validated (7 recovered), 15 rejected
```

- 35 validated: Total facts passing validation
- 7 recovered: Facts that were medium-confidence and recovered
- 15 rejected: Facts that failed validation and recovery
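The verification step ("validates recovered quotes against source text") can be as simple as a whitespace-normalized substring check. A minimal sketch; the project's actual `find_quote_in_text` may be more tolerant of punctuation or line breaks:

```python
import re

def find_quote_in_text(quote, text):
    """True if the quote appears in the text, ignoring whitespace and case."""
    normalize = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return normalize(quote) in normalize(text)
```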
Before (sequential, no recovery):
- Time: ~18 minutes for 9 chunks
- Facts: 91 validated (many medium-confidence facts rejected)

After (parallel with recovery):
- Time: ~4-5 minutes for 9 chunks
- Facts: 91+ validated (recovered medium-confidence facts included)
- Speedup: 3-4x faster
- Quality: Higher fact yield with verified evidence
Files changed:

- `frfr/extraction/fact_extractor.py`
  - Added `max_workers` parameter
  - Implemented parallel chunk processing with ThreadPoolExecutor
  - Integrated fact recovery in validation flow
- `frfr/validation/fact_validator.py`
  - Added `claude_client` parameter for recovery
  - Implemented `attempt_fact_recovery()` method
  - Enhanced `ValidationResult` with recovery tracking
  - Updated `validate_fact()` to attempt recovery for medium-confidence facts
- `frfr/cli.py`
  - Added `--max-workers` parameter (default: 5)
  - Enhanced progress reporting with real-time bar
  - Pass extractor `max_workers` setting
- Start with default (5 workers) - Good balance of speed and resources
- Increase for powerful machines - Up to 10 workers if you have:
  - 16+ GB RAM
  - Fast SSD
  - High API rate limits
- Decrease for constraints - Use 2-3 workers if:
  - Limited RAM (< 8GB)
  - API rate limits
  - Shared system resources
- Monitor recovery rate - Check logs for recovery success rate
- Review recovered facts - Spot-check recovered facts for accuracy
- Trust the system - Recovery includes verification, false positives are rare
- Dynamic worker adjustment based on system load
- Batch size optimization
- Memory usage monitoring
- Multi-level recovery (try increasingly relaxed thresholds)
- Semantic similarity-based quote matching
- User-configurable recovery confidence threshold
- Recovery retry with different context windows
Problem: Out of memory errors
- Solution: Reduce `--max-workers` to 2-3

Problem: API rate limit errors
- Solution: Reduce `--max-workers` or add delays

Problem: Chunk results out of order
- Solution: This is expected! Results are combined in the correct order automatically

Problem: Too many facts being recovered (potential false positives)
- Solution: Recovery includes verification, but you can review logs for patterns

Problem: Recovery is too slow
- Solution: Recovery only runs for medium-confidence facts (40-79% match), so overhead should be minimal

Problem: No recovery happening
- Solution: Check that the Claude client is initialized (enabled by default)
To test the new features:

```bash
# Test parallel processing with progress bar
python frfr/cli.py extract-facts output/test_doc.txt \
    --document-name test \
    --max-workers 3

# Check for recovery in logs
grep "Recovered fact" <log_output>

# Verify speedup
time python frfr/cli.py extract-facts ... --max-workers 1  # Sequential
time python frfr/cli.py extract-facts ... --max-workers 5  # Parallel
```

Related documents:
- ../STATUS.md - Project status and progress
- DESIGN.md - System architecture