
Enhanced Ingest System

crocodilestick edited this page Sep 23, 2025 · 2 revisions

CWA's Enhanced Ingest System

CWA's Enhanced Ingest System processes books reliably, with timeout protection, intelligent queuing, and comprehensive status tracking.


Quick Start

Basic Usage - Ingest Dir 📁

  1. Add Books: Simply drop your ebook files into the /cwa-book-ingest folder (or whatever you've mapped it to in your docker-compose)
  2. Wait for Processing: CWA will automatically detect, process, and add the books to your library
  3. Check Status: Monitor progress through the CWA web interface or check the logs

Web Interface Upload 🌐

CWA's web interface upload system seamlessly integrates with the enhanced ingest system:

New Book Upload

  1. Navigate to any page in the CWA web interface
  2. Use the "Upload" button in the top navigation
  3. Select your ebook files
  4. Click "Upload"
  5. Files are processed through the same enhanced ingest pipeline

Adding Format to Existing Book

  1. Go to the Edit Book page for any book
  2. Use the "Upload Format" button
  3. Select the new format file
  4. The system automatically adds it as an additional format to the existing book

Upload Process Flow

🌐 Web Upload → 📁 Atomic Save → 🔄 Ingest Pipeline → 📚 Library
                      ↓
                 📋 Manifest (for format uploads)

Behind the Scenes:

  • Files are saved with unique timestamped names to prevent conflicts
  • Atomic operations ensure no partial uploads are processed
  • Manifest system handles special actions (like adding formats to existing books)
  • Same processing pipeline as folder-based ingest for consistency

Supported Formats

The ingest system supports 27+ ebook formats:

  • Common: EPUB, MOBI, AZW3, PDF, TXT
  • Comics: CBZ, CBR, CB7, CBC
  • Documents: DOCX, ODT, HTML, HTMLZ
  • Specialized: KEPUB, FB2, LIT, LRF, PRC, PDB, PML, RB, RTF, SNB, TCR, TXTZ
  • Audio: M4B, M4A, MP4 (audiobooks)
  • Metadata: CWA.JSON (custom metadata files)

How It Works

Processing Pipeline

📁 Ingest Folder  → 🔍 Detection → ⏱️ Stability Check → 🔒 Lock Check → 📚 Processing → ✅ Library
🌐 Web Upload     ↗                                          ↓
                                                         📋 Queue (if busy)
                                                              ↓
                                                         🔄 Retry Later

Input Methods

The enhanced ingest system supports multiple input methods:

  1. Folder-Based Ingest (Traditional)

    • Drop files into the ingest folder
    • Real-time detection with inotifywait
    • Bulk processing support
  2. Web Interface Upload (Modern)

    • Upload through CWA web interface
    • Atomic file operations
    • Manifest-driven special actions
    • Progress tracking in Tasks view

Detailed Process Flow

Universal Processing Steps

  1. File Detection: Files are detected either by folder monitoring or web upload completion
  2. Stability Check: Ensures files are completely uploaded before processing
  3. Format Validation: Checks if the file format is supported and not in the ignore list
  4. Lock Management: Uses a lock file to prevent concurrent processing
  5. Processing: Converts, fixes, and adds the book to your Calibre library
  6. Cleanup: Removes processed files and updates status
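
The format validation step (step 3) can be sketched as below. This is a minimal illustration, not CWA's actual code: the `SUPPORTED` set is copied from the Supported Formats list on this page, and `should_ingest` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Iterable

# Extensions copied from the "Supported Formats" list on this page.
# The real list inside CWA may differ; treat this set as an assumption.
SUPPORTED = {
    "epub", "mobi", "azw3", "pdf", "txt",
    "cbz", "cbr", "cb7", "cbc",
    "docx", "odt", "html", "htmlz",
    "kepub", "fb2", "lit", "lrf", "prc", "pdb", "pml",
    "rb", "rtf", "snb", "tcr", "txtz",
    "m4b", "m4a", "mp4",
}


def should_ingest(filename: str, ignored: Iterable[str] = ()) -> bool:
    """Return True if the extension is supported and not in the ignore list."""
    ext = Path(filename).suffix.lower().lstrip(".")
    ignored_set = {e.lower().lstrip(".") for e in ignored}
    return ext in SUPPORTED and ext not in ignored_set
```

Because only the final extension is checked, a partially uploaded `book.epub.uploading` file is rejected by this sketch.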

Web Upload Specific Features

  • Atomic Operations: Files are saved with .uploading extension, then atomically renamed
  • Unique Naming: Timestamped filenames prevent conflicts (YYYYMMDD_HHMMSS_microseconds_filename)
  • Manifest System: Special .cwa.json files instruct the processor for specific actions
  • User Context: Upload tracking tied to specific users for better task management
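
The atomic save and unique naming described above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: `unique_upload_name` and `atomic_save` are hypothetical names, and the only details taken from this page are the `.uploading` suffix and the `YYYYMMDD_HHMMSS_microseconds_filename` pattern.

```python
import os
from datetime import datetime
from pathlib import Path
from typing import Optional


def unique_upload_name(filename: str, now: Optional[datetime] = None) -> str:
    """Prefix the filename with YYYYMMDD_HHMMSS_microseconds."""
    now = now or datetime.now()
    return f"{now:%Y%m%d_%H%M%S}_{now.microsecond:06d}_{filename}"


def atomic_save(data: bytes, ingest_dir: Path, filename: str) -> Path:
    """Write to a .uploading temp file, then rename it into place."""
    final_path = ingest_dir / unique_upload_name(filename)
    temp_path = final_path.with_name(final_path.name + ".uploading")
    with open(temp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes are on disk before renaming
    # os.replace is atomic when source and target are on the same filesystem,
    # so a watcher never sees a half-written file under the final name.
    os.replace(temp_path, final_path)
    return final_path
```

Writing the temp file into the same directory as the final file keeps the rename on one filesystem, which is what makes it atomic.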

Key Features

🛡️ Timeout Protection

  • Coordinated Timeouts: The processor enforces the configured timeout internally, while the service applies a safety timeout of 3x the configured value as a last resort
  • Automatic Timeout: Prevents hanging processes with configurable timeout (default: 15 minutes)
  • Failed File Backup: Problematic files are automatically backed up with timestamps
  • Service Continuity: Ensures the ingest service never gets stuck
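
To make the safety timeout arithmetic concrete, here is a rough sketch of the service side only. It assumes the processor is run as a child process; `run_with_safety_timeout` is a hypothetical name, and the returned strings merely mirror the state names used on this page. The processor's own internal timeout is not modelled.

```python
import subprocess
from typing import Sequence


def run_with_safety_timeout(
    cmd: Sequence[str],
    configured_minutes: float = 15,  # default timeout stated on this page
    multiplier: int = 3,             # safety timeout is 3x the configured value
) -> str:
    """Run the processor command and report how it ended."""
    safety_seconds = configured_minutes * 60 * multiplier
    try:
        # subprocess.run kills the child and raises if the timeout expires.
        result = subprocess.run(cmd, timeout=safety_seconds)
    except subprocess.TimeoutExpired:
        return "safety_timeout"
    return "completed" if result.returncode == 0 else "error"
```

With the default of 15 minutes, the safety timeout in this sketch fires after 45 minutes.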

🔒 Process Lock Management

  • Robust Locking: Advanced ProcessLock class with PID tracking and stale lock detection
  • Automatic Recovery: System automatically cleans up orphaned processes and stale locks
  • Race Condition Prevention: Proper file locking prevents multiple processors from running simultaneously
  • Container Restart Safety: Process recovery service ensures clean state after container restarts
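
PID tracking with stale lock detection can be illustrated with a small POSIX-only sketch. It is not the `ProcessLock` class itself, whose internals are not shown on this page; `pid_is_alive` and `try_acquire_lock` are hypothetical helpers.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Check whether a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the check without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def try_acquire_lock(lock_path: str) -> bool:
    """Create the lock file exclusively; reclaim it if its owner is dead."""
    for _ in range(2):  # the second pass runs only after removing a stale lock
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            try:
                with open(lock_path) as f:
                    owner = int(f.read().strip())
            except (ValueError, FileNotFoundError):
                # Unreadable lock: treated as stale here. A production lock
                # would need to handle an owner that has not written its PID yet.
                owner = None
            if owner is not None and pid_is_alive(owner):
                return False  # a live process holds the lock
            try:
                os.remove(lock_path)
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        return True
    return False
```

The `O_EXCL` flag is what prevents two processes from both creating the lock; the PID inside the file only matters for deciding whether an existing lock is stale.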

📋 Intelligent Queuing

  • Busy Handling: Files that can't be processed immediately are queued for retry
  • Persistent Queue: Queue survives container restarts and updates
  • Size Management: Configurable maximum queue size with FIFO overflow handling
  • Automatic Retry: Queued files are automatically retried after successful processing
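
A persistent, size-limited queue of this kind can be sketched with a plain text file holding one path per line. The sketch assumes that "FIFO overflow handling" means the oldest entries are dropped first, which this page does not state explicitly; `enqueue` and `dequeue` are hypothetical names.

```python
from pathlib import Path
from typing import List, Optional


def enqueue(queue_file: Path, item: str, max_size: int = 50) -> List[str]:
    """Append an item; return any oldest entries dropped to respect max_size."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    lines = queue_file.read_text().splitlines() if queue_file.exists() else []
    if item not in lines:  # avoid queuing the same file twice
        lines.append(item)
    dropped = lines[:-max_size] if len(lines) > max_size else []
    lines = lines[-max_size:]
    queue_file.write_text("".join(line + "\n" for line in lines))
    return dropped


def dequeue(queue_file: Path) -> Optional[str]:
    """Remove and return the oldest entry, or None if the queue is empty."""
    if not queue_file.exists():
        return None
    lines = queue_file.read_text().splitlines()
    if not lines:
        return None
    head, rest = lines[0], lines[1:]
    queue_file.write_text("".join(line + "\n" for line in rest))
    return head
```

Keeping the queue in a file is what lets it survive a container restart; an in-memory list would be lost.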

📊 Status Tracking

  • Real-time Status: Current processing status available at /config/cwa_ingest_status
  • Detailed Information: Shows current file, timestamp, and processing state
  • Queue Monitoring: Track how many files are waiting in the retry queue

🔧 Enhanced Error Handling

  • Timestamped Backups: Failed files are saved with timestamps for investigation
  • Detailed Logging: Comprehensive logging for troubleshooting
  • Graceful Degradation: System continues operating even when individual files fail
  • Database Connection Safety: All database operations use proper context managers to prevent leaks
  • Permission Error Recovery: Graceful handling of permission errors in network share environments

📄 Manifest System

  • Smart Actions: Special .cwa.json files enable advanced processing instructions
  • Format Addition: Add new formats to existing books without creating duplicates
  • User Attribution: Web uploads are tracked and attributed to specific users
  • Atomic Operations: Manifest and file operations are coordinated to prevent inconsistencies

Manifest File Format

{
    "action": "add_format",
    "book_id": 123,
    "original_filename": "my-book.epub"
}

Supported Actions:

  • add_format - Add a new format to an existing book by ID
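
A manifest like the one above could be checked before it is acted on. The sketch below is illustrative only: `parse_manifest` is a hypothetical function, and the validation rules (positive integer `book_id`, non-empty `original_filename`) are assumptions inferred from the example rather than documented requirements.

```python
import json

# Only action listed under "Supported Actions" on this page.
SUPPORTED_ACTIONS = {"add_format"}


def parse_manifest(text: str) -> dict:
    """Parse manifest JSON and raise ValueError if it looks malformed."""
    data = json.loads(text)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("manifest must be a JSON object")
    if data.get("action") not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported action: {data.get('action')!r}")
    book_id = data.get("book_id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(book_id, bool) or not isinstance(book_id, int) or book_id <= 0:
        raise ValueError("book_id must be a positive integer")
    name = data.get("original_filename")
    if not isinstance(name, str) or not name:
        raise ValueError("original_filename must be a non-empty string")
    return data
```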

Configuration

Timeout Settings

Via Web Interface (Recommended)

  1. Navigate to Admin Panel → CWA Settings
  2. Find the "Ingest Processing Timeout" section
  3. Set your desired timeout in minutes (5-120 range)
  4. Click Save Settings

Via Environment Variables

# In your docker-compose.yml or environment
# (this variable controls the retry queue size; the timeout itself is set via the web interface)
CWA_INGEST_MAX_QUEUE_SIZE=50  # Maximum files in retry queue (default: 50)

Database Configuration

The timeout setting is stored in your CWA database:

-- View current timeout setting
SELECT ingest_timeout_minutes FROM cwa_settings;

-- Manually update timeout (not recommended, use web interface)
UPDATE cwa_settings SET ingest_timeout_minutes = 20;

File Format Configuration

Configure which formats to ignore through the CWA Settings page:

  • Auto-Convert Ignored Formats: Formats to skip during conversion
  • Auto-Ingest Ignored Formats: Formats to completely ignore during ingest

Status Monitoring

Status File Format

The status file (/config/cwa_ingest_status) contains:

state:filename:timestamp:detail

Possible States (note that error lines place the code before the timestamp):

  • idle - Service is waiting for new files
  • processing:filename:timestamp - Currently processing a file
  • queued:filename:timestamp - File added to retry queue
  • completed:filename:timestamp - File successfully processed
  • timeout:filename:timestamp - File processing timed out (internal processor timeout)
  • safety_timeout:filename:timestamp - File hit safety timeout (indicates serious issue)
  • error:filename:code:timestamp - Processing error occurred

Programmatic Status Checking

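# Example: Read ingest status in Python
STATUS_FILE = '/config/cwa_ingest_status'

def get_ingest_status(path=STATUS_FILE):
    """Parse the colon-separated status line into a dict."""
    try:
        with open(path, 'r') as f:
            status_line = f.read().strip()
    except FileNotFoundError:
        return {'state': 'unknown'}
    # Pad to four fields so short lines such as "idle" still parse.
    # Caveat: a filename containing ':' would shift the later fields.
    parts = (status_line.split(':', 3) + [''] * 4)[:4]
    state, filename, third, fourth = parts
    if state == 'error':
        # Error lines are laid out as error:filename:code:timestamp
        return {'state': state, 'filename': filename,
                'timestamp': fourth, 'detail': third}
    return {'state': state, 'filename': filename,
            'timestamp': third, 'detail': fourth}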

Queue Monitoring

# Check queue size
wc -l /config/cwa_ingest_retry_queue

# View queued files
cat /config/cwa_ingest_retry_queue

Troubleshooting

Common Issues

Files Not Being Processed

  1. Check File Format: Ensure the file format is supported
  2. Check Ignore Lists: Verify the format isn't in the ignore list
  3. Check Permissions: Ensure files are owned by the container user, not root
  4. Check Status: Look at /config/cwa_ingest_status for current state
  5. Web Upload Issues: Check the Tasks page for upload status and errors

Web Upload Specific Issues

  1. Upload Fails to Start: Check browser console for JavaScript errors
  2. Upload Hangs: Verify file size isn't exceeding server limits
  3. Format Not Added: For existing books, ensure the book ID is valid
  4. Manifest Errors: Check logs for .cwa.json processing errors

Files Timing Out

  1. Check File Size: Very large files may need longer timeout
  2. Increase Timeout: Adjust timeout in CWA Settings
  3. Check Failed Backups: Look in /config/processed_books/failed/ for problematic files
  4. Check Logs: Review container logs for detailed error information
  5. Web Upload Timeouts: Large uploads may need increased web server timeout settings

Process Lock Issues

  1. Check Lock Status: Verify no stale lock files exist in /tmp/
  2. Container Restart: The process recovery service automatically cleans up on restart
  3. Manual Cleanup: If needed, remove lock files: rm -f /tmp/ingest_processor.lock
  4. Check Orphaned Processes: Look for hung Python processes: ps aux | grep ingest_processor

Permission Problems

  1. Network Shares: Set NETWORK_SHARE_MODE=true for NFS/SMB environments
  2. File Ownership: Ensure files are owned by container user (abc:abc)
  3. Directory Permissions: Verify ingest directory is writable
  4. Failed Backup Location: Check /config/processed_books/failed/ for permission error backups

Failed File Investigation

Failed files are saved under timestamped names that include the failure reason:

/config/processed_books/failed/
├── 20250902_143052_timeout_large-book.epub
├── 20250902_143127_retry_timeout_corrupted-file.pdf
├── 20250902_143200_safety_timeout_problematic-book.mobi
└── 20250902_143301_permission_error_network-file.epub

Filename Format:

  • YYYYMMDD_HHMMSS_timeout_originalname.ext - Files that timed out on first attempt (processor timeout)
  • YYYYMMDD_HHMMSS_retry_timeout_originalname.ext - Files that timed out during retry
  • YYYYMMDD_HHMMSS_safety_timeout_originalname.ext - Files that hit the safety timeout (serious issue)
  • YYYYMMDD_HHMMSS_permission_error_originalname.ext - Files with permission/access issues
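
When many failed files accumulate, the naming scheme above can be parsed to group them by reason. This is a small sketch based only on the four patterns listed here; `parse_failed_name` is a hypothetical helper, and any other reason tags CWA may use are not covered.

```python
import re
from datetime import datetime
from typing import Optional

# Longer reason tags come first so that "retry_timeout" and "safety_timeout"
# are not mistaken for something else before plain "timeout" is tried.
FAILED_NAME = re.compile(
    r"^(?P<stamp>\d{8}_\d{6})_"
    r"(?P<reason>retry_timeout|safety_timeout|permission_error|timeout)_"
    r"(?P<original>.+)$"
)


def parse_failed_name(name: str) -> Optional[dict]:
    """Split a failed-backup filename into timestamp, reason, and original name."""
    match = FAILED_NAME.match(name)
    if match is None:
        return None
    return {
        "when": datetime.strptime(match["stamp"], "%Y%m%d_%H%M%S"),
        "reason": match["reason"],
        "original": match["original"],
    }
```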

Log Analysis

# Monitor ingest service in real-time
# (2>&1 also captures lines the container wrote to stderr)
docker logs -f calibre-web-automated 2>&1 | grep "cwa-ingest-service"

# Check for specific patterns
docker logs calibre-web-automated 2>&1 | grep -E "(TIMEOUT|ERROR|queue)"

System Recovery Features

The Enhanced Ingest System includes automatic recovery mechanisms:

  1. Process Recovery Service: Automatically runs on container startup to:

    • Clean up stale temporary files older than 1 hour
    • Reset stuck processing status
    • Identify orphaned CWA processes
  2. Lock File Management: Robust locking system that:

    • Tracks process IDs to detect stale locks
    • Automatically cleans up locks from dead processes
    • Prevents race conditions in concurrent access
  3. Database Connection Safety: All database operations use context managers to prevent connection leaks

  4. Coordinated Timeout System:

    • Processor handles internal timeout logic (configurable timeout)
    • Service provides safety timeout (3x configured timeout) as last resort
    • Prevents conflicts between different timeout mechanisms

Advanced Configuration

Network Share Optimization

For network shares (NFS/SMB), the system falls back to polling mode. You can also force this behavior explicitly:

# Force network share mode
NETWORK_SHARE_MODE=true

# Force polling mode
CWA_WATCH_MODE=poll

Custom Stability Checks

Fine-tune file stability detection:

# Number of size checks to perform
CWA_INGEST_STABLE_CHECKS=6

# Number of consecutive matching sizes required
CWA_INGEST_STABLE_CONSEC_MATCH=2

# Interval between checks (seconds)
CWA_INGEST_STABLE_INTERVAL=0.5
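
How these three settings might interact can be sketched as a size-polling loop. This is one plausible reading of the variable descriptions above, not the actual CWA logic: `wait_until_stable` is a hypothetical function, and the default arguments simply reuse the values shown in the example.

```python
import os
import time


def wait_until_stable(
    path,
    checks=6,          # CWA_INGEST_STABLE_CHECKS
    consec_match=2,    # CWA_INGEST_STABLE_CONSEC_MATCH
    interval=0.5,      # CWA_INGEST_STABLE_INTERVAL (seconds)
    get_size=os.path.getsize,
    sleep=time.sleep,
):
    """Return True once the size is unchanged consec_match times in a row."""
    last_size = None
    matches = 0
    for _ in range(checks):
        size = get_size(path)
        if size == last_size:
            matches += 1
            if matches >= consec_match:
                return True
        else:
            # Size changed: the file is still being written, start over.
            matches = 0
            last_size = size
        sleep(interval)
    return False
```

In this reading, raising the interval or the required number of matches makes detection slower but safer for files that arrive in bursts, such as copies over a slow network share.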

Performance Tuning

For High-Volume Environments

# Increase queue size for high-volume ingestion
CWA_INGEST_MAX_QUEUE_SIZE=100

# Reduce timeout for faster throughput (if files are small)
# Set via web interface: 5-10 minutes

For Large File Environments

# Increase timeout for large files
# Set via web interface: 30-60 minutes

# Keep default queue size
CWA_INGEST_MAX_QUEUE_SIZE=50

File Flow Diagram

graph TD
    A1[📁 File Added to Ingest Folder] --> B[🔍 File Detected by inotifywait]
    A2[🌐 Web Interface Upload] --> B2[💾 Atomic Save with .uploading]
    B2 --> B3[🔄 Atomic Rename to Final Name]
    B3 --> B
    
    B --> C[⏱️ Wait for File Stability]
    C --> D{📝 Valid Format?}
    D -->|No| E[🗑️ Ignore File]
    D -->|Yes| F{📄 Has Manifest?}
    
    F -->|Yes| F1[📋 Process Manifest Action]
    F1 --> F2{🎯 Add Format to Existing Book?}
    F2 -->|Yes| F3[➕ Add Format to Book ID]
    F2 -->|No| G
    F3 --> J
    
    F -->|No| G{🔒 Processor Available?}
    G -->|Yes| H[⚡ Start Processing with Timeout]
    G -->|No| I[📋 Add to Retry Queue]
    
    H --> K{⏰ Processing Complete?}
    K -->|Success| J[✅ File Processed Successfully]
    K -->|Timeout| L[⚠️ Move to Failed Backup]
    K -->|Error| M[❌ Log Error]
    
    J --> N[🔄 Process Retry Queue]
    N --> O{📋 Queue Empty?}
    O -->|No| P[🔄 Retry Next File]
    O -->|Yes| Q[😴 Return to Idle]
    
    I --> R[⏳ Wait for Processor Available]
    R --> P
    
    P --> K
    L --> Q
    M --> Q
    
    style A1 fill:#e1f5fe
    style A2 fill:#f3e5f5
    style J fill:#e8f5e8
    style L fill:#fff3e0
    style M fill:#ffebee
    style F1 fill:#f3e5f5

Best Practices

File Management

  • Don't download directly to ingest folder - Complete downloads elsewhere first, then move
  • Use proper file permissions - Ensure files are owned by your user, not root
  • Monitor disk space - Failed backups and queue files require storage
  • Web uploads are atomic - No need to worry about partial uploads being processed
  • Unique filenames - Web uploads automatically get unique names to prevent conflicts

Performance

  • Batch processing - Add multiple files at once rather than one-by-one
  • Monitor queue size - Adjust CWA_INGEST_MAX_QUEUE_SIZE based on your usage
  • Tune timeout - Set appropriate timeout based on your typical file sizes
  • Web upload efficiency - Use web interface for single files, folder ingest for bulk operations
  • Format management - Use "Add Format" feature instead of re-uploading entire books

Monitoring

  • Check status regularly - Use the status file to monitor processing
  • Review failed files - Investigate files in the failed backup folder
  • Monitor logs - Watch container logs for processing issues

Integration with Other CWA Features

The Enhanced Ingest System works seamlessly with:

  • Auto-Convert: Automatic format conversion during processing
  • EPUB Fixer: Automatic EPUB repair and optimization
  • Metadata Enforcement: Automatic metadata and cover enforcement
  • Backup System: Automatic backup of processed files
  • Stats Tracking: Processing statistics in CWA Stats page

This enhanced system provides enterprise-grade reliability while maintaining the simplicity and ease-of-use that makes CWA great for home users.
