Robust Document OCR Preprocessing Pipeline

License: MIT Python 3.8+ Code Style: Black PyPI version

A robust preprocessing pipeline for document OCR that significantly improves Tesseract accuracy on mobile-captured ID documents.

🚀 Quick Start

Installation

# Install from PyPI
pip install RobustDocOCR

# Install with OCR support (the quotes stop shells such as zsh from expanding the brackets)
pip install "RobustDocOCR[ocr]"

# Install with development dependencies
pip install "RobustDocOCR[dev]"

Basic Usage

from robustdococr import preprocess_document, load_image

# Load your document image
image = load_image("document.jpg")

# Apply preprocessing pipeline
results = preprocess_document(image, show_steps=True)

# Access preprocessed image
preprocessed_image = results['final']

Command Line Interface

# Process single image
robustdococr input.jpg --output output.jpg

# Process with intermediate steps display
robustdococr input.jpg --show-steps
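
To process a whole folder, the single-image command can be driven from Python. The following is a minimal sketch that uses only the `--output` flag documented above. The helper names (`build_command`, `batch_commands`), the directory layout, and the `.jpg` filter are illustrative assumptions, not part of the package.

```python
from pathlib import Path
from typing import List


def build_command(src: str, dst: str) -> List[str]:
    """Build the argument list for one documented CLI call."""
    return ["robustdococr", src, "--output", dst]


def batch_commands(in_dir: str, out_dir: str) -> List[List[str]]:
    """Build one command per .jpg file found in in_dir (hypothetical layout)."""
    out = Path(out_dir)
    return [
        build_command(str(path), str(out / path.name))
        for path in sorted(Path(in_dir).glob("*.jpg"))
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the package is installed and `robustdococr` is on the PATH.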

📦 Features

4-Stage Preprocessing Pipeline

  1. Deskewing: Straightens rotated documents using the Hough transform
  2. Binarization: Converts images to black and white using adaptive thresholding
  3. Noise Removal: Cleans up artifacts in two stages, non-local means denoising before binarization and a morphological pass after it
  4. OCR Ready: Produces optimized images for Tesseract OCR
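
The way a staged pipeline can expose its intermediate images may be sketched as follows. This is not the package's implementation: `run_pipeline`, the `keep_steps` parameter, and the stage names are hypothetical. Only the `'final'` key mirrors the result dictionary shown in Basic Usage.

```python
from typing import Any, Callable, Dict, List, Tuple

# A stage is a (name, function) pair; each function maps an image to an image.
Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(image: Any, stages: List[Stage], keep_steps: bool = False) -> Dict[str, Any]:
    """Apply the stages in order and return the results keyed by stage name."""
    results: Dict[str, Any] = {"original": image}
    current = image
    for name, stage_fn in stages:
        current = stage_fn(current)
        if keep_steps:
            # Keep the intermediate output so it can be inspected or plotted.
            results[name] = current
    results["final"] = current
    return results
```

Keeping the stages as plain callables makes it easy to reorder them or to swap one stage for an alternative while comparing OCR output.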

Key Technical Features

  • Adaptive Thresholding: Handles varying lighting conditions (shadows, glare)
  • Hough Transform Deskewing: Robust rotation correction (±45°)
  • Two-Stage Denoising: Preserves text while removing artifacts
  • 96% Text Retention: Minimal text loss during preprocessing
  • Tesseract Optimized: Produces images ideal for OCR engines

🎯 Performance Metrics

| Metric | Value |
|---|---|
| Text Retention Rate | 96% |
| Character Improvement | +12% |
| Quality Distribution | 85% Excellent, 12% Good, 3% Fair |
| Rotation Correction | Handles ±45° rotation effectively |

📂 Project Structure

robustdococr/
 ├── preprocessing/          # Core preprocessing modules
 │   ├── deskewing.py        # Image straightening
 │   ├── binarization.py     # Adaptive thresholding
 │   ├── noise_removal.py    # Artifact cleaning
 │   └── pipeline.py         # Complete pipeline
 ├── utils/                  # Utility functions
 │   ├── image_utils.py      # Image utilities
 │   ├── ocr_utils.py        # OCR utilities
 │   └── visualization.py    # Visualization tools
 ├── cli.py                  # CLI entry point
 ├── main.py                 # Main module
 └── __init__.py             # Package initialization
tests/                      # Test suite
examples/                   # Example scripts
notebooks/                  # Jupyter notebooks
docs/                       # Documentation

🔧 Configuration

Requirements

  • Python 3.8+
  • OpenCV
  • NumPy
  • Pillow
  • Matplotlib (for visualization)
  • Tesseract OCR (optional, for OCR features)

Installation Options

# Basic installation
pip install RobustDocOCR

# Development installation (includes test and dev dependencies)
pip install "RobustDocOCR[dev]"

# Installation with OCR support
pip install "RobustDocOCR[ocr]"

# Installation with all extras
pip install "RobustDocOCR[all]"

📊 Technical Specifications

Deskewing Algorithm

  • Edge Detection: Canny edge detector with thresholds (50, 150)
  • Line Detection: Hough Line Transform with threshold 200
  • Angle Calculation: Median angle from detected lines for robustness
  • Rotation: Affine transformation with cubic interpolation

Binarization Algorithm

  • CLAHE Enhancement: Contrast Limited Adaptive Histogram Equalization
    • clipLimit: 2.0
    • tileGridSize: (8, 8)
  • Adaptive Thresholding: Gaussian-weighted local thresholding
    • blockSize: 25
    • C: 10
    • Method: ADAPTIVE_THRESH_GAUSSIAN_C

Noise Removal Algorithm

  • Stage 1: Non-Local Means Denoising (h=10) applied before binarization
  • Stage 2: Morphological operations (2×2 kernel, 1 iteration) applied after binarization

🧪 Testing

Run the test suite:

pytest

Run tests with coverage:

pytest --cov=robustdococr --cov-report=html

📚 Documentation

🤝 Contributing

We welcome contributions! Please see our contributing guidelines and code of conduct in the repository.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🎓 Citation

If you use this pipeline in your research, please cite:

@misc{robust-doc-ocr-preprocessing,
  author = {3BSALAM},
  title = {Robust Document OCR Preprocessing Pipeline},
  year = {2026},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/3bsalam-1/RobustDocOCR}}
}

🔗 Related Projects

📦 PyPI

This package is available on PyPI: https://pypi.org/project/RobustDocOCR/


© 2026 Robust Document OCR Preprocessing Pipeline
