OCRApp 📄🔍

Config-Driven OCR Pipeline for Structured Data Extraction

A desktop application that extracts structured data from images
using template-based OCR and automatically inserts it into a database.

Built with Python, Tesseract OCR, and a modular extraction engine.

🎬 Demo

OCRApp_Demo.mp4

Example workflow:

Image → Preprocessing → OCR → Extraction → Validation → Database

❗ Problem

Extracting structured information from documents is difficult because:

OCR returns unstructured text
Documents have different layouts
Data must often be manually entered into databases

This project solves the problem by using template-driven OCR extraction.

💡 Solution

OCRApp allows users to define templates describing:

What data to extract
Where it appears in the document
How to validate it

The application then:

Runs OCR on the image
Extracts fields using configured methods
Validates the data
Inserts results directly into a database

🏗 Architecture

OCRAPP
│
├── configs/        → OCR templates
├── data/           → extracted database
├── db/             → database layer
├── logs/           → application logs
├── ocr_core/       → OCR engine & extractors
├── ui/             → desktop interface
└── utils/          → shared utilities

Core Components

OCR Engine
Handles preprocessing and text extraction.

Extractors
Multiple extraction strategies:

Bounding Box extraction
Regex extraction
Label-nearby detection
Registry lookup

Template System
Allows users to configure extraction logic without modifying code.

✨ Features

Template-driven OCR extraction
Multiple extraction methods
Built-in preprocessing pipeline
Data validation
Automatic database insertion
Editable results via UI
Modular architecture

🔎 Extraction Methods

OCRApp supports multiple extraction strategies:

Method	Description
bbox_region	Extract text from bounding box
regex	Extract using pattern matching
label_nearby	Detect value near label
registry	Lookup structured values

⚙️ Installation

Clone the repository:

git clone https://github.com/yourname/ocrapp.git
cd ocrapp

Install dependencies:

pip install -r requirements.txt

Install Tesseract OCR

Windows:
https://github.com/UB-Mannheim/tesseract/wiki

Make sure the path is configured:

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

▶️ Run the Application

python app.py

🧾 Template Example

{
  "fields": {
    "name": {
      "method": "label_nearby",
      "label": "Name"
    },
    "credits": {
      "method": "bbox_region",
      "bbox": [0.3,0.4,0.6,0.5]
    }
  }
}

📚 What I Learned

Designing modular Python project architectures
Building config-driven systems using templates
Implementing an OCR processing pipeline
Improving OCR accuracy through image preprocessing
Creating multiple extraction strategies (bbox, regex, label detection)
Structuring applications into logical modules
Integrating SQLite for structured data storage
Implementing logging and debugging workflows

🚀 Future Improvements

PDF batch processing
ML-based layout detection
Automatic template generation
REST API support
Batch OCR processing for large datasets

📜 License

This project is licensed under the MIT License.

MIT License

Copyright (c) 2026

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OCRApp 📄🔍

🎬 Demo

❗ Problem

💡 Solution

🏗 Architecture

Core Components

✨ Features

🔎 Extraction Methods

⚙️ Installation

▶️ Run the Application

🧾 Template Example

📚 What I Learned

🚀 Future Improvements

📜 License

About

Uh oh!

Releases 1

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
configs/templates		configs/templates
data		data
db		db
logs		logs
ocr_core		ocr_core
ui		ui
utils		utils
.gitignore		.gitignore
README.md		README.md
app.py		app.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

OCRApp 📄🔍

🎬 Demo

❗ Problem

💡 Solution

🏗 Architecture

Core Components

✨ Features

🔎 Extraction Methods

⚙️ Installation

▶️ Run the Application

🧾 Template Example

📚 What I Learned

🚀 Future Improvements

📜 License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Contributors

Uh oh!

Languages