shlokareddy1102/docuflow

📄 DocuFlow

DocuFlow is an AI-based document processing and Excel automation system that converts user-uploaded documents into clean, structured Excel records through a deterministic and explainable backend pipeline.

The system is designed to be easy to explain and defend in an academic review while remaining practically useful. It focuses on real-world document workflows rather than black-box automation.


✨ Key Features

  • 📂 Single & multi-file document upload
  • 🧭 User-selected document type (Receipts, Invoices, ID Cards)
  • 🖼️ Image preprocessing using OpenCV
  • 🔤 OCR with spatial text extraction (Tesseract)
  • 📐 Layout-aware text grouping & analysis
  • 🧠 Rule-based field extraction (deterministic)
  • ✅ Validation & confidence scoring
  • 📦 Batch processing with per-file error isolation
  • 📊 Automatic Excel (.xlsx) generation
  • ⬇️ Secure Excel download endpoint
  • 🖥️ Clean frontend with progressive disclosure UI
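To illustrate the layout-aware grouping feature listed above, here is a minimal sketch of how OCR words with bounding boxes might be grouped into text lines. It assumes each word comes with `left`, `top`, and `height` values (pytesseract's `image_to_data`, for example, reports these fields). The `Word` class, the `group_into_lines` function, and the `tolerance` parameter are hypothetical names for illustration and are not taken from the repository.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: int    # x coordinate of the box's left edge, in pixels
    top: int     # y coordinate of the box's top edge, in pixels
    height: int  # box height, in pixels


def group_into_lines(words, tolerance=0.5):
    """Group words with similar vertical centres into lines, read left to right.

    A word joins the current line when its centre is within
    `tolerance * height` of that line's first word (an assumed heuristic).
    """
    lines = []
    for word in sorted(words, key=lambda w: w.top + w.height / 2):
        centre = word.top + word.height / 2
        if lines and abs(centre - lines[-1]["centre"]) <= tolerance * word.height:
            lines[-1]["words"].append(word)
        else:
            lines.append({"centre": centre, "words": [word]})
    return [
        " ".join(w.text for w in sorted(line["words"], key=lambda w: w.left))
        for line in lines
    ]


words = [
    Word("Total", left=10, top=100, height=20),
    Word("326", left=200, top=102, height=20),
    Word("Swiggy", left=10, top=10, height=24),
]
lines = group_into_lines(words)
print(lines)
```

Grouping on vertical centres rather than exact `top` values tolerates the small baseline jitter that OCR boxes usually have.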

🧠 System Philosophy

DocuFlow is intentionally designed to be:

  • Deterministic – no black-box ML decisions
  • Explainable – every processing step is inspectable
  • Modular – each pipeline stage is independent
  • Realistic – mirrors real-world document processing systems

This makes the system ideal for academic evaluation, project demos, and interview discussions.
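As one way to picture "deterministic and explainable" extraction, the sketch below applies an ordered list of regular-expression rules and records which rule produced each field. The rule names, patterns, and the `extract_fields` function are assumptions made for this example; the project's actual rules in `extraction.py` may differ.

```python
import re

# Hypothetical rules: (field name, rule id, pattern). The first match wins.
RULES = [
    ("date", "date_dd_mm_yyyy", re.compile(r"\b(\d{2}-\d{2}-\d{4})\b")),
    ("amount", "amount_after_total",
     re.compile(r"\btotal\b\D*(\d+(?:\.\d{1,2})?)", re.IGNORECASE)),
]


def extract_fields(text):
    """Return (fields, trace); trace lists the rule that filled each field."""
    fields, trace = {}, []
    for name, rule_id, pattern in RULES:
        if name in fields:
            continue  # an earlier rule already filled this field
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
            trace.append((name, rule_id))
    return fields, trace


fields, trace = extract_fields("Swiggy\nDate: 05-01-2025\nTotal 326")
print(fields)
print(trace)
```

Because the same input always triggers the same rules, the trace can be shown to a reviewer to justify every extracted value.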


🏗️ High-Level Architecture

Upload → Preprocessing → OCR → Layout Analysis
     → Field Extraction → Validation → Excel Generation
     → API Response → Frontend Download
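The flow above can be sketched as a list of independent stage functions applied in order. The stages here are placeholders that return canned values so the example runs without OpenCV or Tesseract; `run_pipeline` and the stage bodies are hypothetical and only illustrate the composition, not the project's real implementation.

```python
def run_pipeline(document, stages):
    """Run each (name, function) stage in order and record the stage names."""
    trace = []
    data = document
    for name, stage in stages:
        data = stage(data)
        trace.append(name)
    return data, trace


# Placeholder stages: each takes a dict and returns a new dict with more keys.
stages = [
    ("preprocessing", lambda doc: {**doc, "image": "cleaned"}),
    ("ocr", lambda doc: {**doc, "text": "Total 326"}),
    ("layout", lambda doc: {**doc, "lines": doc["text"].splitlines()}),
    ("extraction",
     lambda doc: {**doc, "fields": {"amount": doc["lines"][0].split()[-1]}}),
    ("validation",
     lambda doc: {**doc, "valid": doc["fields"]["amount"].isdigit()}),
]

result, trace = run_pipeline({"filename": "receipt.png"}, stages)
print(trace)
print(result["fields"], result["valid"])
```

Keeping each stage as a plain function of its input makes it possible to test or replace one stage without touching the others.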

📁 Project Structure

docuflow-backend/
├── app/
│   ├── main.py
│   ├── api/
│   │   ├── routes.py
│   │   ├── upload.py
│   │   └── download.py
│   └── core/
│       ├── storage.py
│       ├── preprocessing.py
│       ├── ocr.py
│       ├── layout.py
│       ├── extraction.py
│       ├── validation.py
│       └── excel.py
├── uploads/
├── processed/
├── exports/
├── index.html   # Frontend
├── requirements.txt
└── README.md
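The structure above includes an `exports/` directory and a `download.py` module. As a rough sketch of what a safe lookup inside a download handler might involve, the function below accepts only plain `.xlsx` filenames that resolve inside the export directory. `resolve_export` and `EXPORT_DIR` are hypothetical names; the repository's actual checks are not shown in this README.

```python
from pathlib import Path

EXPORT_DIR = Path("exports")  # assumed location, taken from the tree above


def resolve_export(filename, export_dir=EXPORT_DIR):
    """Return the resolved path for an export, or None if the name is unsafe.

    Rejects names containing directory components and non-.xlsx names.
    The caller should still check that the file exists before serving it.
    """
    if Path(filename).name != filename or not filename.endswith(".xlsx"):
        return None
    candidate = (export_dir / filename).resolve()
    if candidate.parent != export_dir.resolve():
        return None
    return candidate


print(resolve_export("report.xlsx"))
print(resolve_export("../app/main.py"))
```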

⚙️ Tech Stack

Backend

  • FastAPI – API framework
  • OpenCV – image preprocessing
  • Tesseract OCR – text extraction
  • Pillow – image handling
  • openpyxl – Excel generation

Frontend

  • HTML, CSS, Vanilla JavaScript
  • Single-page application (no framework)

🚀 Getting Started

1️⃣ Clone the Repository

git clone https://github.com/your-username/docuflow.git
cd docuflow-backend

2️⃣ Create Virtual Environment

python3 -m venv venv
source venv/bin/activate

3️⃣ Install Dependencies

python3 -m pip install -r requirements.txt

4️⃣ Install Tesseract OCR

macOS

brew install tesseract

Verify:

tesseract --version

▶️ Run the Backend

python3 -m uvicorn app.main:app --reload

Access:

  • API Docs: http://127.0.0.1:8000/docs

🖥️ Run the Frontend

  • Open index.html directly in your browser
  • Ensure the backend is running at http://127.0.0.1:8000

📤 Usage Flow

  1. Select document type (Receipt / Invoice / ID Card)
  2. Upload one or more documents
  3. Click Process Documents
  4. View processing summary and warnings
  5. Download generated Excel file

Each document corresponds to one row in the Excel output.
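A minimal sketch of the one-document-per-row mapping, assuming each processed document is a dict and using the receipt columns from the output example in this README. The key names and the `to_row` and `build_sheet` helpers are assumptions for illustration.

```python
COLUMNS = ["Date", "Merchant", "Amount", "Confidence", "Warnings"]


def to_row(record):
    """Flatten one document's result into a list matching COLUMNS."""
    return [
        record.get("date", ""),
        record.get("merchant", ""),
        record.get("amount", ""),
        record.get("confidence", ""),
        "; ".join(record.get("warnings", [])),
    ]


def build_sheet(records):
    """Return a header row followed by one row per document."""
    return [COLUMNS] + [to_row(record) for record in records]


sheet = build_sheet([
    {"date": "05-01-2025", "merchant": "Swiggy", "amount": 326, "confidence": 0.85},
    {"merchant": "Unknown", "warnings": ["missing date", "missing amount"]},
])
for row in sheet:
    print(row)
```

With openpyxl, rows shaped like these can be written by passing each list to the worksheet's `append` method.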


📊 Output Example (Excel)

Date       | Merchant | Amount | Confidence | Warnings
05-01-2025 | Swiggy   | 326    | 0.85       |
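The Confidence and Warnings columns could be produced by a simple rule-based check such as the one sketched here. The scoring formula (share of checks passed) and the `validate` function are assumptions for this example; they do not reproduce how the 0.85 in the example row was computed.

```python
REQUIRED = ("date", "merchant", "amount")  # assumed required receipt fields


def validate(fields):
    """Return (confidence, warnings) for one document's extracted fields."""
    warnings = []
    for name in REQUIRED:
        if not fields.get(name):
            warnings.append(f"missing {name}")
    amount = fields.get("amount")
    if amount:
        try:
            if float(amount) <= 0:
                warnings.append("amount is not positive")
        except ValueError:
            warnings.append("amount is not numeric")
    checks = len(REQUIRED) + 1  # three presence checks plus one amount check
    confidence = round(1 - len(warnings) / checks, 2)
    return confidence, warnings


print(validate({"date": "05-01-2025", "merchant": "Swiggy", "amount": "326"}))
print(validate({"date": "05-01-2025", "amount": "abc"}))
```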

🧪 Batch Processing Behavior

  • Each file is processed independently
  • Failure in one document does not stop others
  • Partial extraction is allowed with warnings
  • Excel is generated from successfully processed files only
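The isolation behavior described above can be sketched with a try/except around each file, so that one failure is recorded and the loop continues. `process_batch` and `fake_process` are hypothetical stand-ins for the real per-file pipeline.

```python
def process_batch(files, process_one):
    """Process each file independently; collect results and per-file errors."""
    rows, errors = [], {}
    for name in files:
        try:
            rows.append(process_one(name))
        except Exception as exc:  # isolate the failure to this file
            errors[name] = str(exc)
    return rows, errors


def fake_process(name):
    """Stand-in for the real pipeline: fails on unsupported extensions."""
    if name.endswith(".txt"):
        raise ValueError("unsupported file type")
    return {"file": name, "warnings": []}


rows, errors = process_batch(["a.png", "b.txt", "c.jpg"], fake_process)
print(rows)
print(errors)
```

Only `rows` would feed the Excel output, while `errors` can be reported back in the processing summary.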

📌 One-Line Summary

DocuFlow processes user-selected document types through a structured pipeline of image preprocessing, OCR, layout analysis, rule-based field extraction, validation, and automatic Excel generation.


👤 Author

Developed as an academic and practical project to demonstrate document intelligence pipelines.


If you’re reviewing this project:

  • Check /docs for API clarity
  • Upload sample documents
  • Inspect generated Excel outputs

Thank you for exploring DocuFlow 🚀
