DocuFlow is an AI-based document processing and Excel automation system that converts user-uploaded documents into clean, structured Excel records through a deterministic and explainable backend pipeline.
The system is designed to be mentor-safe, academically defensible, and practically useful, focusing on real-world document workflows rather than black-box automation.
- 📂 Single & multi-file document upload
- 🧭 User-selected document type (Receipts, Invoices, ID Cards)
- 🖼️ Image preprocessing using OpenCV
- 🔤 OCR with spatial text extraction (Tesseract)
- 📐 Layout-aware text grouping & analysis
- 🧠 Rule-based field extraction (deterministic)
- ✅ Validation & confidence scoring
- 📦 Batch processing with per-file error isolation
- 📊 Automatic Excel (.xlsx) generation
- ⬇️ Secure Excel download endpoint
- 🖥️ Clean frontend with progressive disclosure UI
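
To make the "layout-aware text grouping" feature concrete, here is a minimal sketch of one way OCR word boxes could be clustered into text lines by vertical position. It assumes each word arrives with pixel coordinates (Tesseract wrappers such as pytesseract's `image_to_data` expose `left`, `top`, `width`, and `height`). The `Word` type, the `group_into_lines` function, and the `tolerance` parameter are hypothetical names for illustration, not DocuFlow's actual `layout.py` API.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: int    # x of the box's left edge, in pixels
    top: int     # y of the box's top edge, in pixels
    height: int  # box height, in pixels


def group_into_lines(words, tolerance=0.5):
    """Cluster words into lines (hypothetical helper).

    A word joins the current line when its vertical centre lies within
    tolerance * word.height of the centre of that line's first word.
    """
    lines = []
    for word in sorted(words, key=lambda w: w.top + w.height / 2):
        centre = word.top + word.height / 2
        if lines and abs(centre - lines[-1]["centre"]) <= tolerance * word.height:
            lines[-1]["words"].append(word)
        else:
            lines.append({"centre": centre, "words": [word]})
    # Within each line, restore reading order from left to right.
    return [
        " ".join(w.text for w in sorted(line["words"], key=lambda w: w.left))
        for line in lines
    ]


# Made-up word boxes, deliberately out of reading order.
words = [
    Word("326", left=200, top=52, height=20),
    Word("Total", left=10, top=50, height=20),
    Word("Swiggy", left=10, top=10, height=20),
]
lines = group_into_lines(words)
print(lines)
```

Because the grouping depends only on box coordinates, the same input always yields the same lines, which keeps this stage deterministic.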
DocuFlow is intentionally designed to be:
- Deterministic – no black-box ML decisions
- Explainable – every processing step is inspectable
- Modular – each pipeline stage is independent
- Realistic – mirrors real-world document processing systems
This makes the system ideal for academic evaluation, project demos, and interview discussions.
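
As an illustration of what "deterministic" and "explainable" can mean in practice, the sketch below extracts receipt fields with fixed rules and records which rule produced each value. The regular expressions, the "merchant is the first non-empty line" heuristic, and the `extract_receipt_fields` name are assumptions made for this example. The project's real rules in `extraction.py` may differ.

```python
import re

# Assumed formats: dates like 05-01-2025, amounts following the word "Total".
DATE_RE = re.compile(r"\b(\d{2}-\d{2}-\d{4})\b")
AMOUNT_RE = re.compile(r"\btotal\b\D*(\d+(?:\.\d{1,2})?)", re.IGNORECASE)


def extract_receipt_fields(lines):
    """Apply fixed rules to OCR lines; return the fields and a rule trace."""
    fields = {"date": None, "merchant": None, "amount": None}
    trace = []
    non_empty = [line.strip() for line in lines if line.strip()]

    # Heuristic (assumption): the merchant name is the first non-empty line.
    if non_empty:
        fields["merchant"] = non_empty[0]
        trace.append("merchant <- first non-empty line")

    for line in non_empty:
        if fields["date"] is None:
            match = DATE_RE.search(line)
            if match:
                fields["date"] = match.group(1)
                trace.append(f"date <- DATE_RE matched {line!r}")
        if fields["amount"] is None:
            match = AMOUNT_RE.search(line)
            if match:
                fields["amount"] = float(match.group(1))
                trace.append(f"amount <- AMOUNT_RE matched {line!r}")
    return fields, trace


fields, trace = extract_receipt_fields(["Swiggy", "Date: 05-01-2025", "Total 326"])
print(fields)
print(trace)
```

The trace is what makes each decision inspectable: every extracted value points back to the rule and the line that produced it.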
Upload → Preprocessing → OCR → Layout Analysis
→ Field Extraction → Validation → Excel Generation
→ API Response → Frontend Download
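
The flow above can be read as a chain of independent stages, each taking the document state and returning an updated copy. The following is a rough sketch under that assumption. The `run_pipeline` function and the placeholder stage bodies are hypothetical and return canned values in place of real OpenCV or Tesseract calls.

```python
def run_pipeline(document, stages):
    """Run (name, stage) pairs in order and record which stages completed."""
    state = dict(document)
    state["trace"] = []
    for name, stage in stages:
        state = stage(state)
        state["trace"].append(name)
    return state


# Placeholder stages: each returns a new dict and leaves its input unmodified.
def preprocess(state):
    return {**state, "image": "cleaned:" + state["path"]}


def ocr(state):
    # Canned text standing in for real OCR output.
    return {**state, "text": "Swiggy\nTotal 326"}


def extract(state):
    return {**state, "fields": {"merchant": state["text"].splitlines()[0]}}


result = run_pipeline(
    {"path": "receipt.png"},
    [("preprocessing", preprocess), ("ocr", ocr), ("field_extraction", extract)],
)
print(result["trace"])
print(result["fields"])
```

Keeping every stage behind the same dict-in, dict-out shape means a stage can be replaced or tested on its own without touching the others.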
docuflow-backend/
├── app/
│ ├── main.py
│ ├── api/
│ │ ├── routes.py
│ │ ├── upload.py
│ │ └── download.py
│ └── core/
│   ├── storage.py
│   ├── preprocessing.py
│   ├── ocr.py
│   ├── layout.py
│   ├── extraction.py
│   ├── validation.py
│   └── excel.py
├── uploads/
├── processed/
├── exports/
├── index.html # Frontend
├── requirements.txt
└── README.md
- FastAPI – API framework
- OpenCV – image preprocessing
- Tesseract OCR – text extraction
- Pillow – image handling
- openpyxl – Excel generation
- HTML, CSS, Vanilla JavaScript
- Single-page application (no framework)
git clone https://github.com/your-username/docuflow.git
cd docuflow-backend
python3 -m venv venv
source venv/bin/activate
python3 -m pip install -r requirements.txt
macOS:
brew install tesseract
Verify:
tesseract --version
python3 -m uvicorn app.main:app --reload
Access:
- API Docs: http://127.0.0.1:8000/docs
- Open index.html directly in your browser
- Ensure backend is running at http://127.0.0.1:8000
- Select document type (Receipt / Invoice / ID Card)
- Upload one or more documents
- Click Process Documents
- View processing summary and warnings
- Download generated Excel file
Each document corresponds to one row in the Excel output.
| Date | Merchant | Amount | Confidence | Warnings |
|---|---|---|---|---|
| 05-01-2025 | Swiggy | 326 | 0.85 | — |
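
One possible way to derive the Confidence and Warnings columns is sketched below. It assumes confidence is simply the fraction of required fields that pass validation and that dates use the `DD-MM-YYYY` format. Both are assumptions for illustration, and this scoring rule is not meant to reproduce the 0.85 shown in the sample row. The names `CHECKS`, `validate`, and `to_row` are hypothetical.

```python
from datetime import datetime


def _valid_date(value):
    """True when value parses as DD-MM-YYYY (assumed format)."""
    try:
        datetime.strptime(value, "%d-%m-%Y")
        return True
    except (TypeError, ValueError):
        return False


# One check per required field, evaluated in column order.
CHECKS = {
    "date": _valid_date,
    "merchant": lambda v: isinstance(v, str) and bool(v.strip()),
    "amount": lambda v: isinstance(v, (int, float)) and v > 0,
}


def validate(fields):
    """Return (confidence, warnings) for one document's extracted fields."""
    warnings = [
        f"{name}: missing or invalid"
        for name, check in CHECKS.items()
        if not check(fields.get(name))
    ]
    confidence = round(1 - len(warnings) / len(CHECKS), 2)
    return confidence, warnings


def to_row(fields):
    """Build one spreadsheet row: Date, Merchant, Amount, Confidence, Warnings."""
    confidence, warnings = validate(fields)
    return [
        fields.get("date"),
        fields.get("merchant"),
        fields.get("amount"),
        confidence,
        "; ".join(warnings) or "-",
    ]


row = to_row({"date": "05-01-2025", "merchant": "Swiggy", "amount": 326})
partial = to_row({"date": "2025/01/05", "merchant": "Swiggy", "amount": None})
print(row)
print(partial)
```

A row list in this shape could then be written with a spreadsheet library, for example via openpyxl's `Worksheet.append`.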
- Each file is processed independently
- Failure in one document does not stop others
- Partial extraction is allowed with warnings
- Excel is generated from successfully processed files only
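
The error-handling rules above amount to wrapping each file in its own try/except so that one failure is recorded instead of aborting the batch. A minimal sketch follows. The `process_batch` function, the `fake_process` stand-in, and the shape of the summary dict are hypothetical, not the project's actual API response.

```python
def process_batch(paths, process_one):
    """Process each file independently and collect successes and failures."""
    processed, failed = [], []
    for path in paths:
        try:
            processed.append({"file": path, "fields": process_one(path)})
        except Exception as exc:
            # Isolate the failure to this file and continue with the rest.
            failed.append({"file": path, "error": str(exc)})
    return {"processed": processed, "failed": failed}


def fake_process(path):
    """Stand-in for the real per-file pipeline (assumption for the demo)."""
    if path.endswith(".txt"):
        raise ValueError("unsupported file type")
    return {"merchant": "Swiggy"}


summary = process_batch(["a.png", "notes.txt", "b.png"], fake_process)
print(summary)
```

Only the entries under `processed` would feed the Excel export, while `failed` can be surfaced to the user as part of the processing summary.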
DocuFlow processes user-selected document types through a structured pipeline involving image preprocessing, OCR, layout analysis, validation, and automatic Excel generation.
Developed as an academic and practical project for demonstrating document intelligence pipelines.
If you’re reviewing this project:
- Check /docs for API clarity
- Upload sample documents
- Inspect generated Excel outputs
Thank you for exploring DocuFlow 🚀