Extract structured data from checks in PDF bank statements using the OLMoCR-7B vision model via Google Colab.
This project provides a complete workflow for:
- Extracting check images from PDF bank statements (local)
- Processing with OCR using OLMoCR-7B model (Google Colab with free T4 GPU)
- Parsing structured data (check number, date, payee, amount, memo)
- Exporting results as CSV and JSON
```
┌─────────────────────────┐
│ 1. Local Machine        │
│ Extract check images    │
│ Input: PDF statements   │
│ Output: PNG images      │
└───────────┬─────────────┘
            │
            │ Upload to Google Colab
            ▼
┌─────────────────────────┐
│ 2. Google Colab         │
│ OCR with OLMoCR         │
│ Parse data fields       │
│ Generate CSV & JSON     │
└───────────┬─────────────┘
            │
            │ Download results
            ▼
┌─────────────────────────┐
│ 3. Local Machine        │
│ View and use results    │
│ CSV ready for Excel     │
└─────────────────────────┘
```
**Local machine:**
- Python 3.8+
- PyMuPDF (fitz): `pip install PyMuPDF`
- Pillow: `pip install Pillow`
- Optional: numpy, scipy (for quality checking)

**Google Colab:**
- Google account (free)
- T4 GPU runtime (free tier)
- Packages installed automatically in the notebook
```
ocr/
├── README.md                      # This file
├── olmocr_check_processor.ipynb   # Main Colab notebook
│
├── checkorc/                      # Original extraction script
│   ├── extract_images_from_pdf.py # Simple single-PDF extractor
│   └── test.pdf                   # Sample PDF for testing
│
├── local_scripts/                 # Enhanced local scripts
│   ├── batch_extract.py           # Batch PDF processing
│   ├── check_quality.py           # Image quality validation
│   └── results_viewer.py          # View OCR results
│
└── results/                       # Output directory
    ├── extracted_checks/          # PNG check images
    ├── processed/                 # OCR results (CSV/JSON)
    └── extraction_manifest.json   # Extraction metadata
```
```bash
cd checkorc
python extract_images_from_pdf.py
```

This extracts checks from `test.pdf` in the same directory.
```bash
cd local_scripts
python batch_extract.py /path/to/pdf/folder -o ../results/extracted_checks
```

Features:
- Process multiple PDFs at once
- Generate JSON manifest with metadata
- Optional quality filtering
- Organized output structure
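The JSON manifest can be consumed programmatically. The exact schema depends on `batch_extract.py`; this sketch assumes hypothetical `images` and `source_pdf` field names, so adjust them to match your actual manifest:

```python
import json

def summarize_manifest(path):
    # Count source PDFs and extracted images from a manifest file.
    # The "images"/"source_pdf" keys are assumed, not guaranteed.
    with open(path) as f:
        manifest = json.load(f)
    images = manifest.get("images", [])
    pdfs = {img.get("source_pdf") for img in images}
    return {"pdf_count": len(pdfs), "image_count": len(images)}
```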
Advanced options:
```bash
# Process with quality threshold (minimum 100 DPI)
python batch_extract.py /path/to/pdfs -o ../results/extracted_checks -q 100

# Recursive directory search
python batch_extract.py /path/to/pdfs -r

# Custom manifest location
python batch_extract.py /path/to/pdfs -m ../results/my_manifest.json
```

```bash
cd local_scripts
python check_quality.py ../results/extracted_checks
```

This validates:
- ✓ Resolution/DPI (default minimum: 100 DPI)
- ✓ Blur detection
- ✓ Contrast levels
- ✓ File integrity
Output shows which images might have OCR issues.
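To illustrate the kind of checks involved, here is a library-free sketch; it is not the actual `check_quality.py` code, and the 6-inch check width and threshold values are assumptions:

```python
def estimated_dpi(width_px, check_width_inches=6.0):
    # A personal check is roughly 6 inches wide (an assumption here),
    # so pixel width / 6 gives a rough effective DPI.
    return width_px / check_width_inches

def contrast_ok(gray_pixels, min_spread=75):
    # Crude contrast test: dynamic range of grayscale values (0-255)
    # must span at least min_spread levels.
    return (max(gray_pixels) - min(gray_pixels)) >= min_spread
```

For example, a 600-pixel-wide check image works out to roughly 100 DPI, the default minimum above.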
1. **Open the notebook:**
   - Go to Google Colab
   - Upload `olmocr_check_processor.ipynb`
   - Or: File → Open notebook → Upload

2. **Enable GPU:**
   - Runtime → Change runtime type → Select "T4 GPU"

3. **Run all cells:**
   - Runtime → Run all
   - Or press Ctrl+F9

4. **Upload check images:**
   - When prompted, upload PNG files from `results/extracted_checks/`
   - Or mount Google Drive if you prefer

5. **Wait for processing:**
   - Model loading: ~2-3 minutes (first time)
   - Per check: ~2-3 seconds
   - Example: 50 checks ≈ 2-3 minutes after the model is loaded

6. **Download results:**
   - `check_ocr_results.csv` - Structured data for Excel/Sheets
   - `check_ocr_results.json` - Full data with raw OCR text
```bash
cd local_scripts
python results_viewer.py ../results/processed/check_ocr_results.json
```

Options:

```bash
# Show only analysis summary
python results_viewer.py results.json -a

# View specific check
python results_viewer.py results.json -c 5

# Include raw OCR text
python results_viewer.py results.json --show-raw
```

File: `check_ocr_results.csv`
| Check Number | Date | Pay to the Order of | Amount | For/Memo | Bank Name | Account (Last 4) | Source Image |
|---|---|---|---|---|---|---|---|
| 1234 | 01/15/2024 | ABC Company Inc | 1250.00 | Invoice #5678 | Wells Fargo | 5678 | check_5_0.png |
| 1235 | 01/20/2024 | John Smith | 500.00 | Contractor | Wells Fargo | 5678 | check_5_1.png |
- Ready to import into Excel, Google Sheets, or accounting software
- Empty cells indicate unreadable handwriting or missing data
- Amounts are numeric for easy calculations
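As a quick sanity check before importing into a spreadsheet, the CSV can be totaled with the standard library. This is a small sketch assuming the column headers shown above:

```python
import csv

def total_amount(csv_path):
    # Sum the Amount column, skipping blanks (unreadable fields).
    total = 0.0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("Amount"):
                total += float(row["Amount"])
    return total
```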
File: `check_ocr_results.json`

```json
[
  {
    "check_number": "1234",
    "date": "01/15/2024",
    "payee_name": "ABC Company Inc",
    "amount": 1250.00,
    "memo": "Invoice #5678",
    "bank_name": "Wells Fargo",
    "account_last4": "5678",
    "image_file": "check_5_0.png",
    "raw_text": "Full OCR output..."
  }
]
```

- Includes raw OCR text for debugging
- Suitable for programmatic processing
- Preserves all extracted information
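A short sketch of programmatic processing, assuming the JSON schema shown above (checks with a null `amount` are skipped as unreadable):

```python
import json
from collections import defaultdict

def totals_by_payee(results_path):
    # Aggregate check amounts per payee from the results JSON.
    with open(results_path) as f:
        results = json.load(f)
    totals = defaultdict(float)
    for check in results:
        if check.get("amount") is not None:
            totals[check.get("payee_name", "UNKNOWN")] += check["amount"]
    return dict(totals)
```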
**Colab Free:**
- Cost: $0/month
- GPU time: ~12-15 hours/day
- Best for:
- Development and testing
- Processing 100-300 checks/day
- Occasional batch processing
Limitations:
- Session timeouts after 90 minutes idle
- Daily GPU quota
- Manual file upload/download
**Colab Pro:**
- Cost: $10/month
- GPU time: ~24 hours/day
- Best for:
- Regular processing (500+ checks/day)
- Longer sessions
- Priority GPU access
**Consider dedicated deployment if you:**
- Need 24/7 API availability
- Process >1000 checks/day
- Require automated workflows
- Need sub-5-second response times
For most use cases, Colab Free is sufficient to start!
**Problem: No check images are extracted**

Solution:
- Check that the PDF contains embedded images (not scanned pages)
- Verify that check dimensions match the filters in the script
- Try adjusting the `is_check_image()` parameters
**Problem: Extracted images are low quality**

Solution:
- Run `check_quality.py` to validate images
- Ensure a minimum resolution of 100 DPI
- Check for blur or low contrast
- Re-extract at higher quality
**Problem: Fields are parsed incorrectly**

Solution:
- Review raw OCR text in JSON output
- Adjust regex patterns in parsing function
- Some handwriting may be unreadable (expected)
- Consider using LLM (Claude API) for complex parsing
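As a starting point for adjusting the parsing patterns, here is a minimal regex extraction sketch; the patterns are illustrative, not the notebook's actual ones:

```python
import re

# Illustrative patterns: a dollar amount with cents, and a M/D/Y date.
AMOUNT_RE = re.compile(r"\$?\s*([\d,]+\.\d{2})")
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2,4})\b")

def parse_amount(raw_text):
    # Return the first dollar amount found as a float, or None.
    m = AMOUNT_RE.search(raw_text)
    return float(m.group(1).replace(",", "")) if m else None

def parse_date(raw_text):
    # Return the first date-like string found, or None.
    m = DATE_RE.search(raw_text)
    return m.group(1) if m else None
```

Returning `None` for missing fields matches the output convention above, where empty cells mean unreadable or absent data.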
**Problem: The Colab session times out**

Solution:
- Process in batches before 90-minute timeout
- Upgrade to Colab Pro for longer sessions
- Save model to Google Drive for faster reload
**Problem: Out-of-memory errors on the GPU**

Solution:
- Restart the runtime (Runtime → Restart runtime)
- Ensure a T4 GPU is selected (not a lower tier)
- The model requires ~12GB; the T4 has 15GB, so it should fit
Edit `is_check_image()` in `batch_extract.py`:

```python
def is_check_image(width, height):
    # Customize these values based on your checks
    min_width, max_width = 600, 2500   # Width range
    min_height, max_height = 250, 800  # Height range
    min_aspect_ratio = 2.0             # Width/height minimum
    max_aspect_ratio = 4.5             # Width/height maximum

    aspect_ratio = width / height
    return (min_width <= width <= max_width
            and min_height <= height <= max_height
            and min_aspect_ratio <= aspect_ratio <= max_aspect_ratio)
```

Edit `parse_check_data()` in the Colab notebook (Cell 5):

```python
# Add custom patterns for your check format
payee_patterns = [
    r'(?:Pay to):?\s*(.+?)(?:\n|$)',  # Add your patterns here
]
```

```bash
# Extract only high-quality images (150 DPI minimum)
python batch_extract.py pdfs/ -q 150

# Validate with strict thresholds
python check_quality.py images/ --min-dpi 150 --min-contrast 75
```

Uncomment in Cell 2 of the notebook:

```python
from google.colab import drive
drive.mount('/content/drive')
CHECK_FOLDER = '/content/drive/MyDrive/checks'
```

Benefits:
- No manual upload needed
- Persistent storage
- Easier for large batches
Run Cell 8 (optional) to create a web interface:
- Upload checks via browser
- See results immediately
- Share public URL (72-hour validity)
```bash
# Process January statements
python batch_extract.py statements/2024-01/ -o results/jan_checks -m results/jan_manifest.json

# Process February statements
python batch_extract.py statements/2024-02/ -o results/feb_checks -m results/feb_manifest.json

# Run Colab for each batch separately
```

| Task | Time | Notes |
|---|---|---|
| Model loading | 2-3 min | First time only |
| Single check OCR | 2-3 sec | Per check |
| Parsing | <0.1 sec | Per check |
| 10 checks | ~30 sec | After model loaded |
| 50 checks | ~2.5 min | After model loaded |
| 100 checks | ~5 min | After model loaded |
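For planning batch sizes, the table reduces to a simple estimate; the figures below are rough averages from it, and real times vary with GPU availability and image size:

```python
def estimate_total_seconds(n_checks, model_load_s=150, per_check_s=2.5):
    # One-time model load plus a fixed per-check cost.
    # Defaults approximate the table above; treat them as estimates.
    return model_load_s + n_checks * per_check_s
```

For example, 50 checks works out to about 275 seconds (~4.5 minutes including the one-time model load), consistent with the table.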
- Process all checks in one Colab session (avoid reloading model)
- Use batch processing locally (faster extraction)
- Skip quality validation for trusted sources
- Consider Colab Pro for priority GPU access
**OLMoCR-7B**
- Source: Allen Institute for AI
- Model: `allenai/olmocr-7b-1024-preview`
- Type: Vision-Language Model (VLM)
- Size: ~7B parameters, ~3-4GB download
- GPU requirements: 12GB VRAM (fits on a T4 with 15GB)
- License: Open source (Apache 2.0)
Why OLMoCR?
- Specifically trained on document OCR
- Handles handwriting better than Tesseract
- Preserves document structure
- Good performance on checks and forms
- Fine-tuning: Train on your specific check formats
- LLM Parsing: Use Claude API for more robust field extraction
- Automated Pipeline: Scheduled Colab runs via Cloud Functions
- Direct Integration: Connect to accounting software APIs
- Multi-language: Support non-English checks
When you outgrow Colab:
- Hugging Face Inference: Serverless deployment (~$10-20/month)
- RunPod/Vast.ai: Dedicated GPU instances (~$0.20-0.40/hour)
- AWS SageMaker: Production deployment with auto-scaling
- Local GPU: One-time hardware investment for unlimited processing
**Q: Can this work with scanned PDF pages?**
A: No, this extracts embedded images. For scanned pages, you need to:
- Convert PDF pages to images (use pdf2image)
- Crop check regions manually or with detection
- Then use Colab OCR
**Q: How accurate is the OCR?**
A: It depends on image quality:
- Printed text: 95-99% accuracy
- Clear handwriting: 70-90% accuracy
- Poor handwriting: 30-60% accuracy
- Damaged/blurry: Variable
**Q: Can I process thousands of checks?**
A: Yes, but in batches:
- Colab Free: ~200-300/day
- Colab Pro: ~500-1000/day
- For more, consider dedicated deployment
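A minimal way to split a large job into session-sized chunks; the 200-item default is an assumed batch size matching the free-tier figure above, not a hard limit:

```python
def batches(items, size=200):
    # Yield successive fixed-size chunks so each Colab session
    # stays within its daily quota.
    for i in range(0, len(items), size):
        yield items[i:i + size]
```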
**Q: Is my data private?**
A:
- Local processing: Completely private
- Google Colab: Follows Google's terms of service
- Model runs in your Colab session (not shared)
- Consider Colab Pro for business use
**Q: Can this replace manual data entry?**
A: Partially:
- Good for printed checks: Yes, mostly
- Handwritten checks: Reduces work, needs review
- Always validate financial data manually
This project is provided as-is for educational and development purposes.
Components:
- Scripts: MIT License (free to use/modify)
- OLMoCR Model: Apache 2.0 (Allen Institute for AI)
- PyMuPDF: AGPL (check license for commercial use)
- OLMoCR Model: Allen Institute for AI
- PyMuPDF: Artifex Software
- Workflow Design: Based on practical check extraction needs
Need help? Open an issue or refer to the troubleshooting section above.
Ready to start? Jump to Quick Start!