Date: 2025-11-29
Status: Accepted
The platform requires an Optical Character Recognition (OCR) engine to extract text from uploaded documents (images and PDFs). Requirements:
- Accuracy: High accuracy for English and Indian languages (Hindi, Tamil, etc.).
- Open Source: Must be open-source and free to use.
- Local Execution: Must run locally on the server (no external API calls for data privacy).
- Layout Analysis: Ability to detect text blocks and layout is preferred.
We selected PaddleOCR (specifically the en_PP-OCRv3 and multilingual models) as the primary OCR engine.
- Pros: Standard open-source choice, widely available.
- Cons: Lower accuracy on complex layouts, requires a separate installation of the `tesseract-ocr` binary, and is harder to configure for deep-learning-based layout analysis.
- Pros: PyTorch-based, easy to install, supports many languages.
- Cons: Slower than PaddleOCR on CPU, slightly lower accuracy for English scene text compared to PP-OCRv3.
- Pros: Best-in-class accuracy.
- Cons: Paid services, data leaves the premises (privacy concern), requires internet connectivity.
- High Performance: PaddleOCR offers state-of-the-art accuracy for lightweight models.
- Language Support: Excellent support for 80+ languages including Indian languages.
- Deployment: Can be installed via `pip` (though the `paddlepaddle` dependency can be tricky).
- Dependency Size: Requires `paddlepaddle`, which is a large library.
- Output Format Changes: As experienced during development, PaddleOCR's API output format can vary between versions (list of lists vs list of dicts), requiring robust parsing logic.
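
The "robust parsing logic" mentioned for the output format changes could look roughly like the sketch below. It is illustrative only and is not the project's actual parser. The function name `normalize_ocr_result` and the output keys `text`, `score`, and `box` are hypothetical. It also assumes two input shapes: an older one where each page is a list of `[box, (text, score)]` entries, and a newer one where each page is a dict-like object with parallel `rec_texts`, `rec_scores`, and `rec_polys` fields. Both shapes should be verified against the PaddleOCR version that is actually installed.

```python
from collections.abc import Mapping
from typing import Any, Dict, List


def normalize_ocr_result(raw: Any) -> List[Dict[str, Any]]:
    """Flatten either assumed output shape into [{"text", "score", "box"}, ...]."""
    lines: List[Dict[str, Any]] = []
    for page in raw or []:
        if page is None:
            # Assumption: a page with no detected text may come back as None.
            continue
        if isinstance(page, Mapping):
            # Assumed newer shape: one mapping per page with parallel lists.
            texts = page.get("rec_texts", [])
            scores = page.get("rec_scores", [])
            boxes = page.get("rec_polys", [])
            for i, text in enumerate(texts):
                lines.append({
                    "text": text,
                    "score": float(scores[i]) if i < len(scores) else None,
                    "box": boxes[i] if i < len(boxes) else None,
                })
        else:
            # Assumed older shape: each entry is [box, (text, score)].
            for box, (text, score) in page:
                lines.append({"text": text, "score": float(score), "box": box})
    return lines
```

Keeping this normalization in one place means the rest of the pipeline depends only on the flattened shape, so a PaddleOCR upgrade that changes the raw output should require changes in this function alone.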