Get started with check OCR processing in 5 minutes!
- Python 3.8+ installed
- Google account (for Colab)
- Bank statement PDFs with embedded check images
cd ocr
pip install -r requirements.txt
cd local_scripts
python batch_extract.py /path/to/your/statement.pdf -o ../results/extracted_checks
Output: PNG images in results/extracted_checks/
python check_quality.py ../results/extracted_checks
Fix any issues before OCR processing.
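For readers curious what a resolution check can look like, here is a minimal, standard-library-only sketch that reads the dimensions and DPI metadata from a PNG. It is not the logic of check_quality.py, which is not shown here; the function name `read_png_info` is hypothetical. Note that images extracted from PDFs often carry no DPI metadata, in which case this sketch reports `None`.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_info(data: bytes) -> dict:
    """Return width, height and DPI (None if absent) from raw PNG bytes.

    Hypothetical helper for illustration; not part of this project's scripts.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    info = {"width": None, "height": None, "dpi": None}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"IHDR":
            info["width"], info["height"] = struct.unpack(">II", body[:8])
        elif ctype == b"pHYs" and len(body) == 9:
            x_ppu, _y_ppu, unit = struct.unpack(">IIB", body)
            if unit == 1:  # unit 1 = pixels per metre
                info["dpi"] = round(x_ppu * 0.0254)
        elif ctype in (b"IDAT", b"IEND"):
            break  # the metadata chunks of interest precede the image data
        pos += 12 + length
    return info
```

A caller could read a file with `Path(name).read_bytes()`, pass the bytes to `read_png_info`, and flag any image whose `dpi` is `None` or below 100 for manual review.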
- Open Google Colab
- Upload olmocr_check_processor.ipynb
- Enable GPU: Runtime → Change runtime type → T4 GPU
- Run all: Runtime → Run all (Ctrl+F9)
- Upload images when prompted (from results/extracted_checks/)
- Wait for processing (~2-3 min for 50 checks)
- Download the check_ocr_results.csv and .json files
python results_viewer.py ../results/processed/check_ocr_results.json
Done! Your check data is now in CSV format, ready for Excel or Google Sheets.
# Extract checks from multiple PDFs
python batch_extract.py /folder/with/pdfs/ -o ../results/extracted_checks
# Extract with quality filter (min 100 DPI)
python batch_extract.py /folder/with/pdfs/ -q 100
# Validate image quality
python check_quality.py ../results/extracted_checks
# View OCR results summary
python results_viewer.py results.json -a
# View specific check
python results_viewer.py results.json -c 5
Input: Bank statement PDF with 10 checks
After Step 2: 10 PNG files (check_0_0.png, check_1_0.png, ...)
After Step 4:
- check_ocr_results.csv (ready for spreadsheet)
- check_ocr_results.json (full data with raw OCR)
Accuracy:
- Printed text: ~95%+
- Handwriting: ~70-80% (varies)
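The extracted files are named like check_0_0.png and check_1_0.png, so a plain alphabetical sort would place check_10_0.png before check_2_0.png. The sketch below orders such names numerically. What the two numbers mean is not stated in this guide, so the code only treats them as a pair of integers; `sort_check_files` is a hypothetical helper, not one of the project's scripts.

```python
import re
from pathlib import Path

# Matches names of the form check_<int>_<int>.png, as in the example output.
NAME_RE = re.compile(r"^check_(\d+)_(\d+)\.png$")


def sort_check_files(names):
    """Return matching file names sorted by their two numeric parts.

    Names that do not follow the check_<int>_<int>.png pattern are skipped.
    """
    keyed = []
    for name in names:
        match = NAME_RE.match(Path(name).name)
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            keyed.append((key, name))
    return [name for _, name in sorted(keyed)]
```

For example, passing the names from `Path("../results/extracted_checks").iterdir()` gives a stable order for comparing images against rows in the results file.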
- Your PDF might not have embedded images
- Try adjusting dimension filters in batch_extract.py
- Go to: Runtime → Change runtime type → Select "T4 GPU"
- Click Save
- Run check_quality.py to check image quality
- Ensure images are at least 100 DPI
- Some handwriting is naturally hard to read
- Process faster by batching all checks at once
- Or upgrade to Colab Pro ($10/month) for longer sessions
- Import to Excel: Open check_ocr_results.csv in Excel/Google Sheets
- Review accuracy: Spot-check a few results against original images
- Adjust parsing: If needed, customize regex patterns in Colab notebook Cell 5
- Automate: Process monthly statements as they arrive
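As an illustration of the kind of regex parsing that can be customized, here is a small sketch that pulls a dollar amount and a date out of raw OCR text. The patterns and the `parse_check_text` function are assumptions made for this example; they are not the patterns used in the notebook's Cell 5, and real check text will likely need looser or additional rules.

```python
import re
from decimal import Decimal

# Hypothetical patterns for illustration only.
AMOUNT_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{2})?)")
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b")


def parse_check_text(text: str) -> dict:
    """Extract the first dollar amount and first date found in OCR text."""
    amount = AMOUNT_RE.search(text)
    date = DATE_RE.search(text)
    return {
        # Decimal avoids float rounding surprises with currency values.
        "amount": Decimal(amount.group(1).replace(",", "")) if amount else None,
        "date": date.group(0) if date else None,
    }
```

Fields that fail to match come back as `None`, which makes it easy to filter for rows that need manual review.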
✅ DO:
- Process all checks in one Colab session (avoids reloading model)
- Validate a few results manually the first time
- Keep original PDFs as backup
- Use quality validation for important batches
❌ DON'T:
- Trust results 100% without verification (especially for financial data)
- Upload sensitive PDFs to public repositories
- Process thousands of checks in Colab Free (use Pro or dedicated GPU)
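One lightweight way to verify results is to pull a small random sample and compare those records against the original images. The sketch below assumes the results have already been loaded into a list of records (for instance with `json.load`); the structure of check_ocr_results.json is not documented here, so that is an assumption, and `pick_spot_checks` is a hypothetical helper.

```python
import random


def pick_spot_checks(records, k=5, seed=0):
    """Return up to k records chosen at random for manual verification.

    A fixed seed makes the selection repeatable across runs.
    """
    records = list(records)
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))
```

Changing `seed` on each review pass selects a different subset, so repeated spot-checks gradually cover more of the batch.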
- Full documentation: See README.md
- Script options: Run any script with the --help flag
- Colab issues: Check notebook markdown cells for instructions
Ready? Start with Step 1 above! 🚀