
OCR Check Processing with OLMoCR

Extract structured data from checks in PDF bank statements using the OLMoCR-7B vision model via Google Colab.

Overview

This project provides a complete workflow for:

  1. Extracting check images from PDF bank statements (local)
  2. Processing with OCR using OLMoCR-7B model (Google Colab with free T4 GPU)
  3. Parsing structured data (check number, date, payee, amount, memo)
  4. Exporting results as CSV and JSON

Architecture

┌─────────────────────────┐
│  1. Local Machine       │
│  Extract check images   │
│  Input: PDF statements  │
│  Output: PNG images     │
└───────────┬─────────────┘
            │
            │ Upload to Google Colab
            ▼
┌─────────────────────────┐
│  2. Google Colab        │
│  OCR with OLMoCR        │
│  Parse data fields      │
│  Generate CSV & JSON    │
└───────────┬─────────────┘
            │
            │ Download results
            ▼
┌─────────────────────────┐
│  3. Local Machine       │
│  View and use results   │
│  CSV ready for Excel    │
└─────────────────────────┘

Requirements

Local Requirements

  • Python 3.8+
  • PyMuPDF (fitz): pip install PyMuPDF
  • PIL/Pillow: pip install Pillow
  • Optional: numpy, scipy (for quality checking)

Google Colab Requirements

  • Google account (free)
  • T4 GPU runtime (free tier)
  • Packages installed automatically in notebook

File Structure

ocr/
├── README.md                              # This file
├── olmocr_check_processor.ipynb           # Main Colab notebook
│
├── checkorc/                              # Original extraction script
│   ├── extract_images_from_pdf.py        # Simple single-PDF extractor
│   └── test.pdf                          # Sample PDF for testing
│
├── local_scripts/                         # Enhanced local scripts
│   ├── batch_extract.py                  # Batch PDF processing
│   ├── check_quality.py                  # Image quality validation
│   └── results_viewer.py                 # View OCR results
│
└── results/                               # Output directory
    ├── extracted_checks/                 # PNG check images
    ├── processed/                        # OCR results (CSV/JSON)
    └── extraction_manifest.json          # Extraction metadata

Quick Start

Step 1: Extract Check Images (Local)

Option A: Simple Single PDF

cd checkorc
python extract_images_from_pdf.py

This extracts checks from test.pdf in the same directory.

Option B: Batch Processing (Recommended)

cd local_scripts
python batch_extract.py /path/to/pdf/folder -o ../results/extracted_checks

Features:

  • Process multiple PDFs at once
  • Generate JSON manifest with metadata
  • Optional quality filtering
  • Organized output structure
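The manifest is plain JSON, so it is easy to inspect or post-process. A sketch of what an entry might look like — the field names here are illustrative assumptions, not the exact schema batch_extract.py emits, so check your own manifest:

```python
import json

# Field names below are illustrative assumptions, not the exact schema
# that batch_extract.py emits -- inspect your own manifest to confirm.
manifest = {
    "source_pdfs": ["statement_jan.pdf"],
    "images": [
        {"file": "check_5_0.png", "source_pdf": "statement_jan.pdf",
         "page": 5, "width": 1200, "height": 520},
    ],
}

# Round-trip through JSON exactly as the manifest file is written and read
loaded = json.loads(json.dumps(manifest, indent=2))
for img in loaded["images"]:
    print(f'{img["file"]}: page {img["page"]}, {img["width"]}x{img["height"]} px')
```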

Advanced options:

# Process with quality threshold (minimum 100 DPI)
python batch_extract.py /path/to/pdfs -o ../results/extracted_checks -q 100

# Recursive directory search
python batch_extract.py /path/to/pdfs -r

# Custom manifest location
python batch_extract.py /path/to/pdfs -m ../results/my_manifest.json

Step 2: Validate Image Quality (Optional)

cd local_scripts
python check_quality.py ../results/extracted_checks

This validates:

  • ✓ Resolution/DPI (default minimum: 100 DPI)
  • ✓ Blur detection
  • ✓ Contrast levels
  • ✓ File integrity

Output shows which images might have OCR issues.
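A PNG extracted from a PDF carries no physical size, so the DPI check can only estimate resolution. One plausible heuristic — an assumption for illustration; check_quality.py may use a different method — divides pixel width by a standard check width:

```python
# A US personal check is roughly 6 x 2.75 inches; treating 6 in as the
# physical width lets us estimate effective DPI from pixel dimensions.
# This sketches the idea, not necessarily what check_quality.py does.
CHECK_WIDTH_INCHES = 6.0

def estimate_dpi(pixel_width, physical_width=CHECK_WIDTH_INCHES):
    return pixel_width / physical_width

def passes_min_dpi(pixel_width, min_dpi=100):
    return estimate_dpi(pixel_width) >= min_dpi

print(estimate_dpi(1200))    # 1200 px over 6 in = 200 DPI
print(passes_min_dpi(500))   # ~83 DPI, below the 100 DPI default
```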

Step 3: Process with OCR (Google Colab)

  1. Open the notebook:

    • Go to Google Colab
    • Upload olmocr_check_processor.ipynb
    • Or: File → Open notebook → Upload
  2. Enable GPU:

    • Runtime → Change runtime type → Select "T4 GPU"
  3. Run all cells:

    • Runtime → Run all
    • Or press Ctrl+F9
  4. Upload check images:

    • When prompted, upload PNG files from results/extracted_checks/
    • Or mount Google Drive if you prefer
  5. Wait for processing:

    • Model loading: ~2-3 minutes (first time)
    • Per check: ~2-3 seconds
    • Example: 50 checks ≈ 2-3 minutes (after the model is loaded)
  6. Download results:

    • check_ocr_results.csv - Structured data for Excel/Sheets
    • check_ocr_results.json - Full data with raw OCR text

Step 4: View Results (Local)

cd local_scripts
python results_viewer.py ../results/processed/check_ocr_results.json

Options:

# Show only analysis summary
python results_viewer.py results.json -a

# View specific check
python results_viewer.py results.json -c 5

# Include raw OCR text
python results_viewer.py results.json --show-raw

Output Formats

CSV Output

File: check_ocr_results.csv

Check Number | Date       | Pay to the Order of | Amount  | For/Memo      | Bank Name   | Account (Last 4) | Source Image
1234         | 01/15/2024 | ABC Company Inc     | 1250.00 | Invoice #5678 | Wells Fargo | 5678             | check_5_0.png
1235         | 01/20/2024 | John Smith          | 500.00  | Contractor    | Wells Fargo | 5678             | check_5_1.png
  • Ready to import into Excel, Google Sheets, or accounting software
  • Empty cells indicate unreadable handwriting or missing data
  • Amounts are numeric for easy calculations
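Because amounts are numeric, the CSV can be totaled directly with only the standard library. A minimal sketch using an inline sample with a subset of the columns above (a real run would open check_ocr_results.csv instead):

```python
import csv
import io

# Inline sample standing in for check_ocr_results.csv (columns abbreviated)
sample_csv = """Check Number,Date,Pay to the Order of,Amount,For/Memo
1234,01/15/2024,ABC Company Inc,1250.00,Invoice #5678
1235,01/20/2024,John Smith,500.00,Contractor
"""

total = 0.0
for row in csv.DictReader(io.StringIO(sample_csv)):
    if row["Amount"]:              # empty cell = unreadable handwriting
        total += float(row["Amount"])

print(f"Total: ${total:,.2f}")
```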

JSON Output

File: check_ocr_results.json

[
  {
    "check_number": "1234",
    "date": "01/15/2024",
    "payee_name": "ABC Company Inc",
    "amount": 1250.00,
    "memo": "Invoice #5678",
    "bank_name": "Wells Fargo",
    "account_last4": "5678",
    "image_file": "check_5_0.png",
    "raw_text": "Full OCR output..."
  }
]
  • Includes raw OCR text for debugging
  • Suitable for programmatic processing
  • Preserves all extracted information
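The JSON is a flat list of records, so programmatic checks are straightforward — for example, flagging checks whose amount could not be read. A sketch with an inline sample in the schema above (a real run would use json.load on check_ocr_results.json):

```python
import json

# Inline sample in the schema shown above (fields abbreviated);
# null stands for a field the OCR could not read.
sample = '''[
  {"check_number": "1234", "payee_name": "ABC Company Inc", "amount": 1250.0,
   "image_file": "check_5_0.png"},
  {"check_number": "1235", "payee_name": "John Smith", "amount": null,
   "image_file": "check_5_1.png"}
]'''

checks = json.loads(sample)
unreadable = [c["image_file"] for c in checks if c["amount"] is None]
print(f"{len(checks)} checks, {len(unreadable)} need manual review: {unreadable}")
```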

Cost Analysis

Google Colab Free Tier

  • Cost: $0/month
  • GPU time: ~12-15 hours/day
  • Best for:
    • Development and testing
    • Processing 100-300 checks/day
    • Occasional batch processing

Limitations:

  • Sessions time out after ~90 minutes of inactivity
  • Daily GPU quota
  • Manual file upload/download

Google Colab Pro

  • Cost: $10/month
  • GPU time: ~24 hours/day
  • Best for:
    • Regular processing (500+ checks/day)
    • Longer sessions
    • Priority GPU access

When to Consider AWS/Other Platforms

  • Need 24/7 API availability
  • Processing >1000 checks/day
  • Automated workflows required
  • Sub-5-second response time needed

For most use cases, Colab Free is sufficient to start!

Troubleshooting

Issue: No checks extracted

Solution:

  • Check that PDF contains embedded images (not scanned pages)
  • Verify check dimensions match filters in script
  • Try adjusting is_check_image() parameters

Issue: Poor OCR quality

Solution:

  • Run check_quality.py to validate images
  • Ensure minimum 100 DPI resolution
  • Check for blur or low contrast
  • Re-extract at higher quality

Issue: Missing data fields

Solution:

  • Review raw OCR text in JSON output
  • Adjust regex patterns in parsing function
  • Some handwriting may be unreadable (expected)
  • Consider using LLM (Claude API) for complex parsing

Issue: Google Colab session timeout

Solution:

  • Process in batches before 90-minute timeout
  • Upgrade to Colab Pro for longer sessions
  • Save model to Google Drive for faster reload

Issue: Out of GPU memory

Solution:

  • Restart runtime (Runtime → Restart runtime)
  • Ensure T4 GPU is selected (not lower tier)
  • Model requires ~12GB, T4 has 15GB (should fit)

Customization

Adjust Check Extraction Filters

Edit is_check_image() in batch_extract.py:

def is_check_image(width, height):
    # Customize these values based on your checks
    min_width, max_width = 600, 2500      # Width range (pixels)
    min_height, max_height = 250, 800     # Height range (pixels)
    min_aspect_ratio = 2.0                # Width/height minimum
    max_aspect_ratio = 4.5                # Width/height maximum

    aspect_ratio = width / height
    return (min_width <= width <= max_width
            and min_height <= height <= max_height
            and min_aspect_ratio <= aspect_ratio <= max_aspect_ratio)

Adjust OCR Parsing

Edit parse_check_data() in the Colab notebook (Cell 5):

# Add custom patterns for your check format
payee_patterns = [
    r'(?:Pay to):?\s*(.+?)(?:\n|$)',     # Add your patterns here
]
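Each pattern in the list is tried against the raw OCR text until one matches. A sketch of how such a pattern list is typically applied — the exact loop in the notebook may differ, and the sample text is illustrative:

```python
import re

payee_patterns = [
    r'(?:Pay to the Order of|Pay to):?\s*(.+?)(?:\n|$)',
]

# Raw text as OLMoCR might return it (illustrative sample)
raw_text = "Pay to the Order of: ABC Company Inc\nAmount: $1,250.00"

payee = None
for pattern in payee_patterns:
    match = re.search(pattern, raw_text, re.IGNORECASE)
    if match:
        payee = match.group(1).strip()
        break

print(payee)  # ABC Company Inc
```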

Change Quality Thresholds

# Extract only high-quality images (150 DPI minimum)
python batch_extract.py pdfs/ -q 150

# Validate with strict thresholds
python check_quality.py images/ --min-dpi 150 --min-contrast 75

Advanced Usage

Using Google Drive (Colab)

Uncomment in Cell 2 of notebook:

from google.colab import drive
drive.mount('/content/drive')
CHECK_FOLDER = '/content/drive/MyDrive/checks'

Benefits:

  • No manual upload needed
  • Persistent storage
  • Easier for large batches

Gradio Web Interface (Colab)

Run Cell 8 (optional) to create a web interface:

  • Upload checks via browser
  • See results immediately
  • Share public URL (72-hour validity)

Batch Processing Multiple Runs

# Process January statements
python batch_extract.py statements/2024-01/ -o results/jan_checks -m results/jan_manifest.json

# Process February statements
python batch_extract.py statements/2024-02/ -o results/feb_checks -m results/feb_manifest.json

# Run Colab for each batch separately

Performance

Typical Processing Times

Task             | Time     | Notes
Model loading    | 2-3 min  | First time only
Single check OCR | 2-3 sec  | Per check
Parsing          | <0.1 sec | Per check
10 checks        | ~30 sec  | After model loaded
50 checks        | ~2.5 min | After model loaded
100 checks       | ~5 min   | After model loaded
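These numbers imply a simple cost model: a fixed model-load time plus a per-check cost. A back-of-envelope estimate using the table's midpoint figures (rough averages, not guarantees):

```python
def estimated_minutes(n_checks, load_min=2.5, per_check_sec=2.5):
    """Rough wall-clock estimate for one Colab session, using
    midpoint figures from the table above."""
    return load_min + n_checks * per_check_sec / 60

for n in (10, 50, 100):
    print(f"{n} checks: ~{estimated_minutes(n):.1f} min total")
```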

Tips for Faster Processing

  1. Process all checks in one Colab session (avoid reloading model)
  2. Use batch processing locally (faster extraction)
  3. Skip quality validation for trusted sources
  4. Consider Colab Pro for priority GPU access

Model Information

OLMoCR-7B

  • Source: Allen Institute for AI
  • Model: allenai/olmocr-7b-1024-preview
  • Type: Vision-Language Model (VLM)
  • Size: ~7B parameters, ~3-4GB download
  • GPU Requirements: 12GB VRAM (fits on T4 with 15GB)
  • License: Open source (Apache 2.0)

Why OLMoCR?

  • Specifically trained on document OCR
  • Handles handwriting better than Tesseract
  • Preserves document structure
  • Good performance on checks and forms

Future Enhancements

Potential Improvements

  1. Fine-tuning: Train on your specific check formats
  2. LLM Parsing: Use Claude API for more robust field extraction
  3. Automated Pipeline: Scheduled Colab runs via Cloud Functions
  4. Direct Integration: Connect to accounting software APIs
  5. Multi-language: Support non-English checks

Scaling Options

When you outgrow Colab:

  1. Hugging Face Inference: Serverless deployment (~$10-20/month)
  2. RunPod/Vast.ai: Dedicated GPU instances (~$0.20-0.40/hour)
  3. AWS SageMaker: Production deployment with auto-scaling
  4. Local GPU: One-time hardware investment for unlimited processing

Support

Common Questions

Q: Can this work with scanned PDF pages?
A: No, this extracts embedded images. For scanned pages, you need to:

  1. Convert PDF pages to images (use pdf2image)
  2. Crop check regions manually or with detection
  3. Then use Colab OCR

Q: How accurate is the OCR?
A: Depends on image quality:

  • Printed text: 95-99% accuracy
  • Clear handwriting: 70-90% accuracy
  • Poor handwriting: 30-60% accuracy
  • Damaged/blurry: Variable

Q: Can I process thousands of checks?
A: Yes, but in batches:

  • Colab Free: ~200-300/day
  • Colab Pro: ~500-1000/day
  • For more, consider dedicated deployment

Q: Is my data private?
A:

  • Local processing: Completely private
  • Google Colab: Follows Google's terms of service
  • Model runs in your Colab session (not shared)
  • Consider Colab Pro for business use

Q: Can this replace manual data entry?
A: Partially:

  • Good for printed checks: Yes, mostly
  • Handwritten checks: Reduces work, needs review
  • Always validate financial data manually

License

This project is provided as-is for educational and development purposes.

Components:

  • Scripts: MIT License (free to use/modify)
  • OLMoCR Model: Apache 2.0 (Allen Institute for AI)
  • PyMuPDF: AGPL (check license for commercial use)

Credits

  • OLMoCR Model: Allen Institute for AI
  • PyMuPDF: Artifex Software
  • Workflow Design: Based on practical check extraction needs

Need help? Open an issue or refer to the troubleshooting section above.

Ready to start? Jump to Quick Start!
