Quick Start Guide

Get started with check OCR processing in 5 minutes!

Prerequisites

  • Python 3.8+ installed
  • Google account (for Colab)
  • Bank statement PDFs with embedded check images

Step-by-Step

1. Install Dependencies (30 seconds)

cd ocr
pip install -r requirements.txt

2. Extract Checks from PDF (1 minute)

cd local_scripts
python batch_extract.py /path/to/your/statement.pdf -o ../results/extracted_checks

Output: PNG images in results/extracted_checks/
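
To confirm the extraction produced what you expect, you can list the PNG files in the output folder. This is a minimal sketch and not part of the project's scripts: the helper name list_extracted_checks is hypothetical, and it assumes the output files follow the check_*.png naming shown under Expected Results.

```python
from pathlib import Path


def list_extracted_checks(output_dir):
    """Return the sorted names of check_*.png files found in output_dir."""
    folder = Path(output_dir)
    if not folder.is_dir():
        # A missing folder usually means the extraction step has not run yet.
        return []
    return sorted(p.name for p in folder.glob("check_*.png"))
```

For example, len(list_extracted_checks("../results/extracted_checks")) should match the number of checks you expect from the statement.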

3. Validate Quality - Optional (30 seconds)

python check_quality.py ../results/extracted_checks

Fix any issues before OCR processing.

4. Process with OCR (3 minutes)

  1. Open Google Colab
  2. Upload olmocr_check_processor.ipynb
  3. Enable GPU: Runtime → Change runtime type → T4 GPU
  4. Run all: Runtime → Run all (Ctrl+F9)
  5. Upload images when prompted (from results/extracted_checks/)
  6. Wait for processing (~2-3 min for 50 checks)
  7. Download check_ocr_results.csv and check_ocr_results.json, then save them to results/processed/ (the path Step 5 reads from)
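
To plan larger batches, the "~2-3 min for 50 checks" figure can be scaled up. The sketch below assumes processing time grows linearly with the number of checks and ignores upload and model-loading time, so treat the result as a rough guide only. The function name estimate_ocr_minutes is illustrative.

```python
def estimate_ocr_minutes(n_checks, ref_checks=50, ref_minutes=(2.0, 3.0)):
    """Scale the quoted 2-3 minutes per 50 checks linearly to n_checks."""
    low, high = ref_minutes
    scale = n_checks / ref_checks
    return (low * scale, high * scale)


low, high = estimate_ocr_minutes(200)
print(f"200 checks: roughly {low:.0f}-{high:.0f} minutes")
```

Under these assumptions, 200 checks would take roughly 8 to 12 minutes.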

5. View Results (30 seconds)

python results_viewer.py ../results/processed/check_ocr_results.json

Done! Your check data is now in CSV format, ready for Excel or Google Sheets.
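
If you would rather inspect the CSV from Python before opening a spreadsheet, the standard csv module is enough. This sketch makes no assumption about column names, since this guide does not list them. It only reports how many rows and which columns the file contains. The helper name summarize_results_csv is hypothetical.

```python
import csv


def summarize_results_csv(csv_path):
    """Return (row_count, column_names) for an OCR results CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        # fieldnames is None for an empty file, so fall back to an empty list.
        columns = list(reader.fieldnames or [])
    return len(rows), columns
```

Calling summarize_results_csv("../results/processed/check_ocr_results.csv") gives a quick check that the row count matches the number of images you uploaded.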

Command Cheat Sheet

# Extract checks from multiple PDFs
python batch_extract.py /folder/with/pdfs/ -o ../results/extracted_checks

# Extract with quality filter (min 100 DPI)
python batch_extract.py /folder/with/pdfs/ -q 100

# Validate image quality
python check_quality.py ../results/extracted_checks

# View OCR results summary
python results_viewer.py results.json -a

# View specific check
python results_viewer.py results.json -c 5
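
The extraction command can also be driven from Python, one PDF at a time, which makes it easier to see which statement failed. This is a sketch under assumptions: it uses only the single-PDF form and the -o flag shown in this guide, it must be run from the local_scripts folder, and writing each PDF's output to its own subfolder is a choice made here to avoid filename collisions, not documented behavior. The function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def build_extract_commands(pdf_dir, output_dir):
    """Build one batch_extract.py command per *.pdf file in pdf_dir."""
    commands = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # Assumption: a separate output subfolder per statement.
        target = Path(output_dir) / pdf.stem
        commands.append(
            [sys.executable, "batch_extract.py", str(pdf), "-o", str(target)]
        )
    return commands


def run_all(pdf_dir, output_dir):
    """Run the commands in order, stopping at the first failure."""
    for command in build_extract_commands(pdf_dir, output_dir):
        subprocess.run(command, check=True)
```

Note that glob("*.pdf") is case-sensitive on Linux, so files ending in .PDF would be skipped.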

Expected Results

Input: Bank statement PDF with 10 checks

After Step 2: 10 PNG files (check_0_0.png, check_1_0.png, ...)

After Step 4:

  • check_ocr_results.csv (ready for spreadsheet)
  • check_ocr_results.json (full data with raw OCR)

Accuracy:

  • Printed text: about 95% or higher
  • Handwriting: about 70-80% (varies with legibility)

Troubleshooting

"No checks extracted"

  • Your PDF might not have embedded images
  • Try adjusting dimension filters in batch_extract.py
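
For orientation, a dimension filter usually looks something like the sketch below. This is not the code in batch_extract.py: the function name, parameters, and thresholds are all hypothetical, chosen only to illustrate the idea that checks are wide, short rectangles (a 6 x 2.75 inch personal check has an aspect ratio of about 2.2).

```python
def looks_like_check(width_px, height_px, min_width=400, min_height=150,
                     min_aspect=1.8, max_aspect=3.0):
    """Heuristic: accept images that are large enough and check-shaped."""
    if height_px <= 0 or width_px <= 0:
        return False
    if width_px < min_width or height_px < min_height:
        # Too small: likely a logo or icon rather than a check scan.
        return False
    aspect = width_px / height_px
    return min_aspect <= aspect <= max_aspect
```

If a filter like this rejects everything, loosening the minimum size or widening the aspect-ratio range is the first thing to try.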

"GPU not available" in Colab

  • Go to: Runtime → Change runtime type → Select "T4 GPU"
  • Click Save

"Poor OCR results"

  • Run check_quality.py to check image quality
  • Ensure images are at least 100 DPI
  • Some handwriting is naturally hard to read
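
Extracted images often carry no reliable DPI metadata, so a rough estimate from pixel width can help. The sketch below assumes the image is cropped to a standard US personal check about 6 inches wide. That width is an assumption, and check_quality.py may measure resolution differently. Both function names are illustrative.

```python
def estimated_dpi(width_px, physical_width_in=6.0):
    """Approximate DPI as pixel width divided by physical width in inches."""
    if physical_width_in <= 0:
        raise ValueError("physical_width_in must be positive")
    return width_px / physical_width_in


def meets_min_dpi(width_px, min_dpi=100, physical_width_in=6.0):
    """Return True if the estimated DPI reaches the minimum."""
    return estimated_dpi(width_px, physical_width_in) >= min_dpi
```

Under this assumption, a check image needs to be at least 600 pixels wide to reach 100 DPI.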

"Session timeout" in Colab

  • Upload and process all checks in a single batch so the run finishes before the session expires
  • Or upgrade to Colab Pro ($10/month) for longer sessions

What's Next?

  1. Import to a spreadsheet: Open check_ocr_results.csv in Excel or Google Sheets
  2. Review accuracy: Spot-check a few results against original images
  3. Adjust parsing: If needed, customize the regex patterns in Cell 5 of the Colab notebook
  4. Automate: Process monthly statements as they arrive
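
As an illustration of the "Adjust parsing" step, here is what a regex for dollar amounts might look like. The pattern and function name are hypothetical and are not copied from the notebook, so compare them with what Cell 5 actually contains before changing anything.

```python
import re
from decimal import Decimal

# Hypothetical pattern: optional "$", digits with optional thousands
# separators, then a required two-digit cents part.
AMOUNT_PATTERN = re.compile(r"\$?\s*(\d{1,3}(?:,\d{3})+|\d+)\.(\d{2})\b")


def parse_amount(ocr_text):
    """Return the first dollar amount in ocr_text as a Decimal, or None."""
    match = AMOUNT_PATTERN.search(ocr_text)
    if match is None:
        return None
    whole = match.group(1).replace(",", "")
    return Decimal(f"{whole}.{match.group(2)}")
```

A pattern this simple will also match things that merely look like amounts (parts of dotted dates, for instance), so spot-check the parsed values against the images.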

Tips

DO:

  • Process all checks in one Colab session (avoids reloading model)
  • Validate a few results manually the first time
  • Keep original PDFs as backup
  • Use quality validation for important batches
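
For manual validation, picking records at random avoids only ever checking the first few. This is a minimal sketch using the standard random module. The function name pick_spot_checks is hypothetical, and records can be any list, for example rows loaded from the results file.

```python
import random


def pick_spot_checks(records, sample_size=5, seed=None):
    """Return a random sample of records to compare against the images."""
    rng = random.Random(seed)
    items = list(records)
    # Never ask for more items than exist.
    sample_size = min(sample_size, len(items))
    return rng.sample(items, sample_size)
```

Passing a fixed seed makes the selection repeatable, which helps if you want to re-check the same records after adjusting the parsing.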

DON'T:

  • Trust the results 100% without verification (especially for financial data)
  • Upload sensitive PDFs to public repositories
  • Process thousands of checks in Colab Free (use Pro or dedicated GPU)

Need Help?

  • Full documentation: See README.md
  • Script options: Run any script with the --help flag
  • Colab issues: Check notebook markdown cells for instructions

Ready? Start with Step 1 above! 🚀