A robust preprocessing pipeline for document OCR that significantly improves Tesseract accuracy on mobile-captured ID documents.
# Install from PyPI
pip install RobustDocOCR
# Install with OCR support
pip install RobustDocOCR[ocr]
# Install with development dependencies
pip install RobustDocOCR[dev]

from robustdococr import preprocess_document, load_image
# Load your document image
image = load_image("document.jpg")
# Apply preprocessing pipeline
results = preprocess_document(image, show_steps=True)
# Access preprocessed image
preprocessed_image = results['final']

# Process single image
robustdococr input.jpg --output output.jpg
# Process with intermediate steps display
robustdococr input.jpg --show-steps

- Deskewing: Straightens rotated documents using Hough transform
- Binarization: Converts images to black & white using adaptive thresholding
- Noise Removal: Cleans up artifacts using two-stage denoising
- OCR Ready: Produces optimized images for Tesseract OCR
- Adaptive Thresholding: Handles varying lighting conditions (shadows, glare)
- Hough Transform Deskewing: Robust rotation correction (±45°)
- Two-Stage Denoising: Preserves text while removing artifacts
- 96% Text Retention: Minimal text loss during preprocessing
- Tesseract Optimized: Produces images ideal for OCR engines
| Metric | Value |
|---|---|
| Text Retention Rate | 96% |
| Character Improvement | +12% |
| Quality Distribution | 85% Excellent, 12% Good, 3% Fair |
| Rotation Correction | Handles ±45° rotation effectively |
robustdococr/
├── preprocessing/          # Core preprocessing modules
│   ├── deskewing.py        # Image straightening
│   ├── binarization.py     # Adaptive thresholding
│   ├── noise_removal.py    # Artifact cleaning
│   └── pipeline.py         # Complete pipeline
├── utils/                  # Utility functions
│   ├── image_utils.py      # Image utilities
│   ├── ocr_utils.py        # OCR utilities
│   └── visualization.py    # Visualization tools
├── cli.py                  # CLI entry point
├── main.py                 # Main module
└── __init__.py             # Package initialization
tests/ # Test suite
examples/ # Example scripts
notebooks/ # Jupyter notebooks
docs/ # Documentation
- Python 3.8+
- OpenCV
- NumPy
- Pillow
- Matplotlib (for visualization)
- Tesseract OCR (optional, for OCR features)
# Basic installation
pip install RobustDocOCR
# Development installation (includes test and dev dependencies)
pip install RobustDocOCR[dev]
# Installation with OCR support
pip install RobustDocOCR[ocr]
# Installation with all extras
pip install RobustDocOCR[all]

- Edge Detection: Canny edge detector with thresholds (50, 150)
- Line Detection: Hough Line Transform with threshold 200
- Angle Calculation: Median angle from detected lines for robustness
- Rotation: Affine transformation with cubic interpolation
- CLAHE Enhancement: Contrast Limited Adaptive Histogram Equalization
  - clipLimit: 2.0
  - tileGridSize: (8, 8)
- Adaptive Thresholding: Gaussian-weighted local thresholding
  - blockSize: 25
  - C: 10
  - Method: ADAPTIVE_THRESH_GAUSSIAN_C
- Stage 1: Non-Local Means Denoising (h=10) applied before binarization
- Stage 2: Morphological operations (2×2 kernel, 1 iteration) applied after binarization
Run the test suite:
pytest

Run tests with coverage:
pytest --cov=robustdococr --cov-report=html

- Architecture Documentation
- Usage Guide
- Decision Log
- API Reference
- Kaggle Notebook - Complete preprocessing pipeline demonstration
We welcome contributions!
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this pipeline in your research, please cite:
@misc{robust-doc-ocr-preprocessing,
author = {3BSALAM},
title = {Robust Document OCR Preprocessing Pipeline},
year = {2026},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/3bsalam-1/RobustDocOCR}}
}

This package is available on PyPI: https://pypi.org/project/RobustDocOCR/
© 2026 Robust Document OCR Preprocessing Pipeline