This document summarizes the implementation work for processing invoice images, including OCR improvements, dashboard enhancements, and database cleanup utilities.
File: ingestion/image_processor.py
- Issue: OCR was configured for English only (`lang="en"`), causing failures on Chinese invoices
- Fix: Changed default language to Chinese (`lang="ch"`), which supports both Chinese and English
- Impact: Enables proper text extraction from Chinese invoices (e.g., 机动车销售统一发票)
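A minimal sketch of the language switch, assuming PaddleOCR's 2.x-style constructor arguments (`lang`, `use_angle_cls`). The helper names `build_ocr_kwargs` and `create_ocr` are hypothetical and are not taken from `ingestion/image_processor.py`:

```python
def build_ocr_kwargs(lang: str = "ch") -> dict:
    """Return constructor arguments for PaddleOCR.

    "ch" selects the Chinese model, which also recognizes English text.
    """
    return {
        "lang": lang,
        "use_angle_cls": True,  # text direction detection (PaddleOCR 2.x flag name)
    }


def create_ocr(lang: str = "ch"):
    """Build the OCR engine.

    paddleocr is imported lazily so the configuration helper above
    stays usable on machines without the dependency installed.
    """
    from paddleocr import PaddleOCR  # third-party package

    return PaddleOCR(**build_ocr_kwargs(lang))
```

Keeping the language as a parameter with a `"ch"` default leaves room to pass `"en"` explicitly for English-only batches.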
File: brain/extractor.py
- Issue: LLM extraction prompts didn't handle Chinese invoice field names
- Fix: Added Chinese field name mappings:
  - 销售方 (seller) → vendor_name
  - 发票号码 → invoice_number
  - 价税合计 → total_amount
  - 税额 → tax_amount
  - 货物或应税劳务名称 → line items
- Impact: Improved extraction accuracy for Chinese invoices
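The mappings above live in the LLM prompt. The same table can also be written as a plain dictionary, for example to render into a prompt or to normalize keys after extraction. This is a sketch only: `CHINESE_FIELD_MAP`, `normalize_field_names`, the `line_items` key, and the sample values are all illustrative assumptions.

```python
# Chinese invoice labels mapped to the schema field names listed above.
CHINESE_FIELD_MAP = {
    "销售方": "vendor_name",
    "发票号码": "invoice_number",
    "价税合计": "total_amount",
    "税额": "tax_amount",
    "货物或应税劳务名称": "line_items",  # assumed key for "line items"
}


def normalize_field_names(raw: dict) -> dict:
    """Rename known Chinese labels; leave every other key unchanged."""
    return {CHINESE_FIELD_MAP.get(key, key): value for key, value in raw.items()}


# Made-up sample values, for illustration only.
extracted = {"销售方": "某汽车销售有限公司", "发票号码": "00123456", "currency": "CNY"}
print(normalize_field_names(extracted))
```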
File: core/models.py
- Issue: Type mismatches between model definitions and database schema
- Fixes:
  - Changed numeric fields from `String` to `Numeric` type (subtotal, tax_amount, tax_rate, total_amount, etc.)
  - Changed date fields from `DateTime` to `Date` type (invoice_date, due_date)
  - Added proper imports for `Numeric` and `Date` types
- Impact: Resolves database insertion errors and ensures type consistency
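On the Python side, `Numeric` columns pair naturally with `decimal.Decimal` and `Date` columns with `datetime.date`. The sketch below shows one way extracted values could be coerced before insertion, using only the standard library. The function name `coerce_invoice_fields` and the ISO date format are assumptions, not part of `core/models.py`.

```python
from datetime import date, datetime
from decimal import Decimal

NUMERIC_FIELDS = ("subtotal", "tax_amount", "tax_rate", "total_amount")
DATE_FIELDS = ("invoice_date", "due_date")


def coerce_invoice_fields(record: dict) -> dict:
    """Convert extracted values to the types that Numeric and Date columns expect."""
    out = dict(record)
    for name in NUMERIC_FIELDS:
        if out.get(name) is not None:
            # Go through str() so floats do not carry binary rounding noise.
            out[name] = Decimal(str(out[name]))
    for name in DATE_FIELDS:
        value = out.get(name)
        if isinstance(value, datetime):
            out[name] = value.date()  # drop the time component
        elif isinstance(value, str):
            out[name] = date.fromisoformat(value)  # expects YYYY-MM-DD
    return out
```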
Files: interface/dashboard/app.py, interface/dashboard/queries.py
- Added new columns: Hash, Size, Vendor, Amount, Duration, Processed timestamp
- Visual indicators: Status emojis (✅ ❌ ⏳), duplicate detection (🔄 icon)
- Enhanced metrics: Total Invoices, Unique Files count
- Better formatting: File sizes in KB/MB, amounts with currency, processing duration
- Metadata expander: Shows duplicate file detection summary
- Dropdown selector: Easy invoice selection from list
- Manual UUID input: Alternative method with validation
- Enhanced information display:
  - File information section (name, type, size, hash, version)
  - Processing status with duration calculation
  - Extracted data with financial details
  - Validation results with summary metrics
- Helper section: Instructions on how to get Invoice ID
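Two of the dashboard behaviors above, file sizes shown in KB/MB and validation of manually entered UUIDs, can be sketched with the standard library. The function names are hypothetical, and the 1024-byte unit and one-decimal rounding are assumptions rather than the dashboard's confirmed formatting.

```python
import uuid


def format_file_size(num_bytes: int) -> str:
    """Render a byte count as B, KB, or MB (1 KB = 1024 bytes)."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    if num_bytes < 1024 ** 2:
        return f"{num_bytes / 1024:.1f} KB"
    return f"{num_bytes / 1024 ** 2:.1f} MB"


def is_valid_uuid(text: str) -> bool:
    """Return True when the text parses as a UUID."""
    try:
        uuid.UUID(text.strip())
    except ValueError:
        return False
    return True
```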
File: ingestion/orchestrator.py
- Issue: The version increment did not always read the latest version when multiple invoices share the same hash
- Fix: Added ordering by version DESC to ensure latest version is used for increment
- Impact: Correct version numbering for reprocessed files
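The ordering fix can be illustrated with an in-memory SQLite database standing in for PostgreSQL. The table and column names (`invoices`, `file_hash`, `version`) and the function `next_version` are assumptions for this sketch, not the orchestrator's actual query.

```python
import sqlite3


def next_version(conn: sqlite3.Connection, file_hash: str) -> int:
    """Return 1 for a new hash, otherwise the highest stored version + 1."""
    row = conn.execute(
        "SELECT version FROM invoices WHERE file_hash = ? "
        "ORDER BY version DESC LIMIT 1",
        (file_hash,),
    ).fetchone()
    return 1 if row is None else row[0] + 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (file_hash TEXT, version INTEGER)")
# Insert versions out of order to show that insertion order does not matter.
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [("abc", 2), ("abc", 3), ("abc", 1)],
)
print(next_version(conn, "abc"))  # 4
print(next_version(conn, "new"))  # 1
```

Without the `ORDER BY version DESC`, the database may return any matching row, so the computed version could collide with an existing one.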
File: scripts/process_all_invoices.py
- Processes all invoice images in `data/` directory
- Shows progress for each file
- Displays summary of successful/failed processing
- Supports custom API URL and data directory
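A rough sketch of such a batch loop follows. The flag names `--api-url` and `--data-dir`, the default URL, and the image suffix set are all assumptions, and the upload step is passed in as a callable (`process_file`) so the loop makes no claim about the real API.

```python
import argparse
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image types


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Process all invoice images")
    parser.add_argument("--api-url", default="http://localhost:8000")  # hypothetical default
    parser.add_argument("--data-dir", default="data")
    return parser.parse_args(argv)


def process_all(data_dir, process_file):
    """Call process_file on each image; return (succeeded, failed) name lists."""
    files = sorted(
        p for p in Path(data_dir).iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    succeeded, failed = [], []
    for index, path in enumerate(files, start=1):
        print(f"[{index}/{len(files)}] {path.name}")
        try:
            process_file(path)
            succeeded.append(path.name)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            print(f"  failed: {exc}")
            failed.append(path.name)
    print(f"Done: {len(succeeded)} succeeded, {len(failed)} failed")
    return succeeded, failed
```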
File: scripts/cleanup_vectors.py
- Cleans up pgvector/embedding data
- Finds tables with vector columns
- Finds LlamaIndex-related tables
- Currently: No vector data found (LlamaIndex uses in-memory storage)
File: scripts/cleanup_invoices.py
- Cleans up invoice records and related data
- Deletes: invoices, extracted_data, validation_results, processing_jobs
- Features:
  - Dry-run mode (`--dry-run`)
  - File path filtering (`--file-path-filter`)
  - Safety confirmation (requires typing "DELETE ALL")
- Respects foreign key constraints
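The cleanup behavior described above (dry run, path filter, typed confirmation, child tables deleted before `invoices`) might look roughly like this. SQLite stands in for PostgreSQL, the column names `id`, `invoice_id`, and `file_path` are assumed, and `cleanup_invoices` is an illustrative function rather than the script's real interface.

```python
import sqlite3

# Child tables first, so rows that reference invoices go before their parent.
DELETE_ORDER = ["validation_results", "extracted_data", "processing_jobs", "invoices"]


def cleanup_invoices(conn, dry_run=True, file_path_filter=None, confirm=input):
    """Delete invoices and related rows; return the number of invoices removed."""
    where, params = "", ()
    if file_path_filter:
        where, params = " WHERE file_path LIKE ?", (f"%{file_path_filter}%",)
    ids = [r[0] for r in conn.execute(f"SELECT id FROM invoices{where}", params)]
    if dry_run:
        print(f"[dry-run] would delete {len(ids)} invoice(s) and related rows")
        return 0
    if not ids:
        return 0
    if confirm('Type "DELETE ALL" to continue: ') != "DELETE ALL":
        print("Aborted")
        return 0
    marks = ",".join("?" * len(ids))  # one placeholder per invoice id
    for table in DELETE_ORDER:
        column = "id" if table == "invoices" else "invoice_id"
        conn.execute(f"DELETE FROM {table} WHERE {column} IN ({marks})", ids)
    conn.commit()
    return len(ids)
```

Passing `confirm` as a parameter keeps the prompt replaceable in tests. Defaulting `dry_run` to `True` is a design choice of this sketch, not something the source states.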
File: docs/duplicate-processing-logic.md
- Detailed explanation of duplicate file handling
- File hashing and version management
- Example scenarios
- Use cases and best practices
- Engine: PaddleOCR
- Default Language: Chinese (`ch`) - supports both Chinese and English
- Features: Text direction detection enabled
- Numeric Fields: Use `Numeric(precision, scale)` type
- Date Fields: Use `Date` type (not `DateTime`)
- File Hash: SHA-256 for duplicate detection
- Version Tracking: Increments for reprocessed files
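Computing the SHA-256 file hash for duplicate detection takes only `hashlib`. This is a sketch: the function name `file_sha256` and the chunk size are assumptions.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Hash a file in chunks so large scans are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical bytes produce the same hex digest regardless of file name, which is what lets a re-uploaded invoice be recognized and given a new version.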
- File ingestion → Calculate SHA-256 hash
- Check for existing invoice with same hash
- Determine version (increment if exists)
- OCR/Text extraction (PaddleOCR for images)
- AI extraction (LlamaIndex RAG)
- Validation (business rules)
- Self-correction (if validation fails)
- Storage (PostgreSQL)
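The validation and self-correction steps in the flow above can be pictured as a bounded retry loop. The source does not describe how self-correction is implemented, so this is only one possible shape: `extract_with_self_correction`, the callable signatures, and `max_attempts` are all hypothetical.

```python
def extract_with_self_correction(text, extract, validate, max_attempts=2):
    """Run extract, validate the result, and retry with the failures as feedback.

    extract(text, feedback) -> dict of extracted fields
    validate(data) -> list of failed rule names (empty when everything passes)
    Returns (data, failures) from the last attempt.
    """
    feedback = []
    data = {}
    for _ in range(max_attempts):
        data = extract(text, feedback)
        feedback = validate(data)
        if not feedback:
            break  # all rules passed, no further correction needed
    return data, feedback
```

Capping the number of attempts keeps a persistently invalid invoice from looping forever. The remaining failures are returned so they can be stored alongside the record.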
- `math_check_subtotal_tax`: Validates subtotal + tax = total
- `line_item_math`: Validates line items sum = subtotal
- `vendor_sanity`: Validates vendor name exists
- `date_consistency`: Validates due_date >= invoice_date
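The four rules above can be written as simple checks over a dictionary. This is a sketch under assumptions: the input shape (Decimal amounts, `line_items` entries with an `amount` key), the 0.01 tolerance, and the function name `validate_invoice` are not taken from the source.

```python
from datetime import date
from decimal import Decimal

TOLERANCE = Decimal("0.01")  # assumed rounding tolerance


def validate_invoice(inv: dict) -> dict:
    """Return {rule_name: passed} for the four validation rules."""
    line_sum = sum((item["amount"] for item in inv["line_items"]), Decimal("0"))
    totals_gap = abs(inv["subtotal"] + inv["tax_amount"] - inv["total_amount"])
    return {
        "math_check_subtotal_tax": totals_gap <= TOLERANCE,
        "line_item_math": abs(line_sum - inv["subtotal"]) <= TOLERANCE,
        "vendor_sanity": bool((inv.get("vendor_name") or "").strip()),
        "date_consistency": inv["due_date"] >= inv["invoice_date"],
    }


# Made-up sample invoice, for illustration only.
sample = {
    "vendor_name": "ACME",
    "subtotal": Decimal("100.00"),
    "tax_amount": Decimal("13.00"),
    "total_amount": Decimal("113.00"),
    "line_items": [{"amount": Decimal("60.00")}, {"amount": Decimal("40.00")}],
    "invoice_date": date(2024, 1, 15),
    "due_date": date(2024, 2, 14),
}
print(validate_invoice(sample))
```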
- `ingestion/image_processor.py` - OCR language fix
- `brain/extractor.py` - Chinese invoice support
- `core/models.py` - Database type fixes
- `ingestion/orchestrator.py` - Version increment fix
- `interface/dashboard/app.py` - Enhanced UI
- `interface/dashboard/queries.py` - Improved queries
- `docs/duplicate-processing-logic.md` - Technical documentation
- `scripts/process_all_invoices.py` - Batch processing
- `scripts/cleanup_vectors.py` - Vector cleanup utility
- `scripts/cleanup_invoices.py` - Invoice cleanup utility
- Verified OCR works with Chinese invoices
- Fixed database type mismatches
- Enhanced dashboard displays all invoice data
- Created cleanup utilities for maintenance