Extract structured data from checks in PDF bank statements using the OLMoCR-7B vision model via Google Colab.
This project provides a complete workflow for:
- Extracting check images from PDF bank statements (local)
- Processing with OCR using OLMoCR-7B model (Google Colab with free T4 GPU)
- Parsing structured data (check number, date, payee, amount, memo)
- Exporting results as CSV and JSON
```
┌─────────────────────────┐
│ 1. Local Machine        │
│ Extract check images    │
│ Input: PDF statements   │
│ Output: PNG images      │
└───────────┬─────────────┘
            │
            │ Upload to Google Colab
            ▼
┌─────────────────────────┐
│ 2. Google Colab         │
│ OCR with OLMoCR         │
│ Parse data fields       │
│ Generate CSV & JSON     │
└───────────┬─────────────┘
            │
            │ Download results
            ▼
┌─────────────────────────┐
│ 3. Local Machine        │
│ View and use results    │
│ CSV ready for Excel     │
└─────────────────────────┘
```
**Local machine:**
- Python 3.8+
- PyMuPDF (fitz): `pip install PyMuPDF`
- Pillow: `pip install Pillow`
- Optional: numpy, scipy (for quality checking)

**Google Colab:**
- Google account (free)
- T4 GPU runtime (free tier)
- Packages installed automatically in the notebook
```
ocr/
├── README.md                      # This file
├── olmocr_check_processor.ipynb   # Main Colab notebook
│
├── checkorc/                      # Original extraction script
│   ├── extract_images_from_pdf.py # Simple single-PDF extractor
│   └── test.pdf                   # Sample PDF for testing
│
├── local_scripts/                 # Enhanced local scripts
│   ├── batch_extract.py           # Batch PDF processing
│   ├── check_quality.py           # Image quality validation
│   └── results_viewer.py          # View OCR results
│
└── results/                       # Output directory
    ├── extracted_checks/          # PNG check images
    ├── processed/                 # OCR results (CSV/JSON)
    └── extraction_manifest.json   # Extraction metadata
```
```bash
cd checkorc
python extract_images_from_pdf.py
```

This extracts checks from `test.pdf` in the same directory.
```bash
cd local_scripts
python batch_extract.py /path/to/pdf/folder -o ../results/extracted_checks
```

Features:
- Process multiple PDFs at once
- Generate JSON manifest with metadata
- Optional quality filtering
- Organized output structure
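The JSON manifest can be consumed programmatically. The exact schema depends on `batch_extract.py`; this sketch assumes hypothetical `images` and `source_pdf` field names, so adjust them to match your actual manifest:

```python
import json

def summarize_manifest(path):
    # Count source PDFs and extracted images from a manifest file.
    # The "images"/"source_pdf" keys are assumed, not guaranteed.
    with open(path) as f:
        manifest = json.load(f)
    images = manifest.get("images", [])
    pdfs = {img.get("source_pdf") for img in images}
    return {"pdf_count": len(pdfs), "image_count": len(images)}
```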
Advanced options:
```bash
# Process with quality threshold (minimum 100 DPI)
python batch_extract.py /path/to/pdfs -o ../results/extracted_checks -q 100

# Recursive directory search
python batch_extract.py /path/to/pdfs -r

# Custom manifest location
python batch_extract.py /path/to/pdfs -m ../results/my_manifest.json
```

```bash
cd local_scripts
python check_quality.py ../results/extracted_checks
```

This validates:
- ✓ Resolution/DPI (default minimum: 100 DPI)
- ✓ Blur detection
- ✓ Contrast levels
- ✓ File integrity
Output shows which images might have OCR issues.
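To illustrate the kind of checks involved, here is a library-free sketch; it is not the actual `check_quality.py` code, and the 6-inch check width and threshold values are assumptions:

```python
def estimated_dpi(width_px, check_width_inches=6.0):
    # A personal check is roughly 6 inches wide (an assumption here),
    # so pixel width / 6 gives a rough effective DPI.
    return width_px / check_width_inches

def contrast_ok(gray_pixels, min_spread=75):
    # Crude contrast test: dynamic range of grayscale values (0-255)
    # must span at least min_spread levels.
    return (max(gray_pixels) - min(gray_pixels)) >= min_spread
```

For example, a 600-pixel-wide check image works out to roughly 100 DPI, the default minimum above.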
1. **Open the notebook:**
   - Go to Google Colab
   - Upload `olmocr_check_processor.ipynb`
   - Or: File → Open notebook → Upload

2. **Enable GPU:**
   - Runtime → Change runtime type → Select "T4 GPU"

3. **Run all cells:**
   - Runtime → Run all
   - Or press Ctrl+F9

4. **Upload check images:**
   - When prompted, upload PNG files from `results/extracted_checks/`
   - Or mount Google Drive if you prefer

5. **Wait for processing:**
   - Model loading: ~2-3 minutes (first time)
   - Per check: ~2-3 seconds
   - Example: 50 checks ≈ 2-3 minutes after the model is loaded

6. **Download results:**
   - `check_ocr_results.csv` - Structured data for Excel/Sheets
   - `check_ocr_results.json` - Full data with raw OCR text
```bash
cd local_scripts
python results_viewer.py ../results/processed/check_ocr_results.json
```

Options:

```bash
# Show only analysis summary
python results_viewer.py results.json -a

# View specific check
python results_viewer.py results.json -c 5

# Include raw OCR text
python results_viewer.py results.json --show-raw
```

File: `check_ocr_results.csv`
| Check Number | Date | Pay to the Order of | Amount | For/Memo | Bank Name | Account (Last 4) | Source Image |
|---|---|---|---|---|---|---|---|
| 1234 | 01/15/2024 | ABC Company Inc | 1250.00 | Invoice #5678 | Wells Fargo | 5678 | check_5_0.png |
| 1235 | 01/20/2024 | John Smith | 500.00 | Contractor | Wells Fargo | 5678 | check_5_1.png |
- Ready to import into Excel, Google Sheets, or accounting software
- Empty cells indicate unreadable handwriting or missing data
- Amounts are numeric for easy calculations
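As a quick sanity check before importing into a spreadsheet, the CSV can be totaled with the standard library. This is a small sketch assuming the column headers shown above:

```python
import csv

def total_amount(csv_path):
    # Sum the Amount column, skipping blanks (unreadable fields).
    total = 0.0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("Amount"):
                total += float(row["Amount"])
    return total
```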
File: `check_ocr_results.json`

```json
[
  {
    "check_number": "1234",
    "date": "01/15/2024",
    "payee_name": "ABC Company Inc",
    "amount": 1250.00,
    "memo": "Invoice #5678",
    "bank_name": "Wells Fargo",
    "account_last4": "5678",
    "image_file": "check_5_0.png",
    "raw_text": "Full OCR output..."
  }
]
```

- Includes raw OCR text for debugging
- Suitable for programmatic processing
- Preserves all extracted information
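A short sketch of programmatic processing, assuming the JSON schema shown above (checks with a null `amount` are skipped as unreadable):

```python
import json
from collections import defaultdict

def totals_by_payee(results_path):
    # Aggregate check amounts per payee from the results JSON.
    with open(results_path) as f:
        results = json.load(f)
    totals = defaultdict(float)
    for check in results:
        if check.get("amount") is not None:
            totals[check.get("payee_name", "UNKNOWN")] += check["amount"]
    return dict(totals)
```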
**Colab Free:**
- Cost: $0/month
- GPU time: ~12-15 hours/day
- Best for:
- Development and testing
- Processing 100-300 checks/day
- Occasional batch processing
Limitations:
- Session timeouts after 90 minutes idle
- Daily GPU quota
- Manual file upload/download
**Colab Pro:**
- Cost: $10/month
- GPU time: ~24 hours/day
- Best for:
- Regular processing (500+ checks/day)
- Longer sessions
- Priority GPU access
**Consider dedicated deployment if you:**
- Need 24/7 API availability
- Process >1000 checks/day
- Require automated workflows
- Need sub-5-second response times
For most use cases, Colab Free is sufficient to start!
**Problem: No check images are extracted**

Solution:
- Check that the PDF contains embedded images (not scanned pages)
- Verify that check dimensions match the filters in the script
- Try adjusting the `is_check_image()` parameters
**Problem: Extracted images are low quality**

Solution:
- Run `check_quality.py` to validate images
- Ensure a minimum resolution of 100 DPI
- Check for blur or low contrast
- Re-extract at higher quality
**Problem: Fields are parsed incorrectly**

Solution:
- Review raw OCR text in JSON output
- Adjust regex patterns in parsing function
- Some handwriting may be unreadable (expected)
- Consider using LLM (Claude API) for complex parsing
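As a starting point for adjusting the parsing patterns, here is a minimal regex extraction sketch; the patterns are illustrative, not the notebook's actual ones:

```python
import re

# Illustrative patterns: a dollar amount with cents, and a M/D/Y date.
AMOUNT_RE = re.compile(r"\$?\s*([\d,]+\.\d{2})")
DATE_RE = re.compile(r"\b(\d{1,2}/\d{1,2}/\d{2,4})\b")

def parse_amount(raw_text):
    # Return the first dollar amount found as a float, or None.
    m = AMOUNT_RE.search(raw_text)
    return float(m.group(1).replace(",", "")) if m else None

def parse_date(raw_text):
    # Return the first date-like string found, or None.
    m = DATE_RE.search(raw_text)
    return m.group(1) if m else None
```

Returning `None` for missing fields matches the output convention above, where empty cells mean unreadable or absent data.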
**Problem: The Colab session times out**

Solution:
- Process in batches before 90-minute timeout
- Upgrade to Colab Pro for longer sessions
- Save model to Google Drive for faster reload
**Problem: Out-of-memory errors on the GPU**

Solution:
- Restart the runtime (Runtime → Restart runtime)
- Ensure a T4 GPU is selected (not a lower tier)
- The model requires ~12GB; the T4 has 15GB, so it should fit
Edit `is_check_image()` in `batch_extract.py`:

```python
def is_check_image(width, height):
    # Customize these values based on your checks
    min_width, max_width = 600, 2500   # Width range
    min_height, max_height = 250, 800  # Height range
    min_aspect_ratio = 2.0             # Width/height minimum
    max_aspect_ratio = 4.5             # Width/height maximum

    aspect_ratio = width / height
    return (min_width <= width <= max_width
            and min_height <= height <= max_height
            and min_aspect_ratio <= aspect_ratio <= max_aspect_ratio)
```

Edit `parse_check_data()` in the Colab notebook (Cell 5):

```python
# Add custom patterns for your check format
payee_patterns = [
    r'(?:Pay to):?\s*(.+?)(?:\n|$)',  # Add your patterns here
]
```

```bash
# Extract only high-quality images (150 DPI minimum)
python batch_extract.py pdfs/ -q 150

# Validate with strict thresholds
python check_quality.py images/ --min-dpi 150 --min-contrast 75
```

Uncomment in Cell 2 of the notebook:

```python
from google.colab import drive
drive.mount('/content/drive')
CHECK_FOLDER = '/content/drive/MyDrive/checks'
```

Benefits:
- No manual upload needed
- Persistent storage
- Easier for large batches
Run Cell 8 (optional) to create a web interface:
- Upload checks via browser
- See results immediately
- Share public URL (72-hour validity)
```bash
# Process January statements
python batch_extract.py statements/2024-01/ -o results/jan_checks -m results/jan_manifest.json

# Process February statements
python batch_extract.py statements/2024-02/ -o results/feb_checks -m results/feb_manifest.json

# Run Colab for each batch separately
```

| Task | Time | Notes |
|---|---|---|
| Model loading | 2-3 min | First time only |
| Single check OCR | 2-3 sec | Per check |
| Parsing | <0.1 sec | Per check |
| 10 checks | ~30 sec | After model loaded |
| 50 checks | ~2.5 min | After model loaded |
| 100 checks | ~5 min | After model loaded |
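For planning batch sizes, the table reduces to a simple estimate; the figures below are rough averages from it, and real times vary with GPU availability and image size:

```python
def estimate_total_seconds(n_checks, model_load_s=150, per_check_s=2.5):
    # One-time model load plus a fixed per-check cost.
    # Defaults approximate the table above; treat them as estimates.
    return model_load_s + n_checks * per_check_s
```

For example, 50 checks works out to about 275 seconds (~4.5 minutes including the one-time model load), consistent with the table.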
- Process all checks in one Colab session (avoid reloading model)
- Use batch processing locally (faster extraction)
- Skip quality validation for trusted sources
- Consider Colab Pro for priority GPU access
**OLMoCR-7B**
- Source: Allen Institute for AI
- Model: `allenai/olmocr-7b-1024-preview`
- Type: Vision-Language Model (VLM)
- Size: ~7B parameters, ~3-4GB download
- GPU requirements: 12GB VRAM (fits on a T4 with 15GB)
- License: Open source (Apache 2.0)
Why OLMoCR?
- Specifically trained on document OCR
- Handles handwriting better than Tesseract
- Preserves document structure
- Good performance on checks and forms
- Fine-tuning: Train on your specific check formats
- LLM Parsing: Use Claude API for more robust field extraction
- Automated Pipeline: Scheduled Colab runs via Cloud Functions
- Direct Integration: Connect to accounting software APIs
- Multi-language: Support non-English checks
When you outgrow Colab:
- Hugging Face Inference: Serverless deployment (~$10-20/month)
- RunPod/Vast.ai: Dedicated GPU instances (~$0.20-0.40/hour)
- AWS SageMaker: Production deployment with auto-scaling
- Local GPU: One-time hardware investment for unlimited processing
**Q: Can this work with scanned PDF pages?**
A: No, this extracts embedded images. For scanned pages, you need to:
- Convert PDF pages to images (use pdf2image)
- Crop check regions manually or with detection
- Then use Colab OCR
**Q: How accurate is the OCR?**
A: It depends on image quality:
- Printed text: 95-99% accuracy
- Clear handwriting: 70-90% accuracy
- Poor handwriting: 30-60% accuracy
- Damaged/blurry: Variable
**Q: Can I process thousands of checks?**
A: Yes, but in batches:
- Colab Free: ~200-300/day
- Colab Pro: ~500-1000/day
- For more, consider dedicated deployment
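A minimal way to split a large job into session-sized chunks; the 200-item default is an assumed batch size matching the free-tier figure above, not a hard limit:

```python
def batches(items, size=200):
    # Yield successive fixed-size chunks so each Colab session
    # stays within its daily quota.
    for i in range(0, len(items), size):
        yield items[i:i + size]
```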
**Q: Is my data private?**
A:
- Local processing: Completely private
- Google Colab: Follows Google's terms of service
- Model runs in your Colab session (not shared)
- Consider Colab Pro for business use
**Q: Can this replace manual data entry?**
A: Partially:
- Good for printed checks: Yes, mostly
- Handwritten checks: Reduces work, needs review
- Always validate financial data manually
This project is provided as-is for educational and development purposes.
Components:
- Scripts: MIT License (free to use/modify)
- OLMoCR Model: Apache 2.0 (Allen Institute for AI)
- PyMuPDF: AGPL (check license for commercial use)
- OLMoCR Model: Allen Institute for AI
- PyMuPDF: Artifex Software
- Workflow Design: Based on practical check extraction needs
Need help? Open an issue or refer to the troubleshooting section above.
Ready to start? Jump to Quick Start!