|
| 1 | +--- |
| 2 | +parent: Decision Records |
| 3 | +nav_order: 47 |
| 4 | +--- |
| 5 | +# OCR Engine Selection for JabRef |
| 6 | + |
| 7 | +## Context and Problem Statement |
| 8 | + |
| 9 | +JabRef requires an OCR engine to extract text from scanned PDFs and image-based academic documents. Tesseract is currently implemented, but accuracy varies significantly with document quality and type. Academic documents present unique challenges: mathematical notation, multiple languages, complex layouts, tables, and mixed handwritten/printed content. Which OCR engine(s) should JabRef adopt to best serve its academic user base while balancing accuracy, cost, privacy, and implementation complexity? |
| 10 | + |
| 11 | +## Decision Drivers |
| 12 | + |
| 13 | +* Accuracy on academic document types (printed papers, scanned books, handwritten notes) |
| 14 | +* Privacy requirements for unpublished research materials |
| 15 | +* Cost constraints (open-source project with limited funding) |
| 16 | +* Language support for the international academic community |
| 17 | +* Support for mathematical and scientific notation |
| 18 | +* Offline capability for secure research environments |
| 19 | +* Processing speed for batch operations |
| 20 | +* Implementation and maintenance complexity |
| 21 | +* Table and structure extraction capabilities |
| 22 | +* Long-term sustainability and community support |
| 23 | + |
| 24 | +## Considered Options |
| 25 | + |
| 26 | +* Option 1: Tesseract OCR (current implementation) |
| 27 | +* Option 2: Google Cloud Vision API |
| 28 | +* Option 3: AWS Textract |
| 29 | +* Option 4: Microsoft Azure Computer Vision |
| 30 | +* Option 5: EasyOCR |
| 31 | +* Option 6: PaddleOCR |
| 32 | +* Option 7: ABBYY FineReader SDK |
| 33 | + |
| 34 | +### Pros and Cons of the Options |
| 35 | + |
| 36 | +#### Option 1: Tesseract OCR |
| 37 | + |
| 38 | +Tesseract was originally developed by Hewlett‑Packard as proprietary software in the 1980s. It was released as open source in 2005, and Google began sponsoring its development in 2006. |
| 39 | + |
| 40 | +* **Good**, because version 4 adds an LSTM‑based OCR engine and models for many additional languages and scripts, bringing the total to 116 languages and 37 scripts |
| 41 | +* **Good**, because it is completely free and open‑source (Apache 2.0 license) |
| 42 | +* **Good**, because it runs entirely offline, preserving document privacy |
| 43 | +* **Good**, because it has an active developer community that regularly updates it, fixes bugs, and improves performance based on user feedback |
| 44 | +* **Good**, because it can be trained to recognize additional languages or custom fonts beyond the 100+ languages supported out of the box |
| 45 | +* **Good**, because it processes right‑to‑left text such as Arabic or Hebrew, many Indic scripts, and CJK quite well |
| 46 | +* **Neutral**, because Tesseract OCR is an open‑source product that can be used for free |
| 47 | +* **Neutral**, because accuracy depends heavily on preprocessing: in the cited MDPI study, preprocessing boosted the character‑level accuracy of Tesseract 4.0 from 0.134 to 0.616 (+359 % relative change) and the F1 score from 0.163 to 0.729 (+347 % relative change) |
| 48 | +* **Bad**, because it recognizes text in well‑scanned documents (such as a scanned email) pretty well, but for handwritten letters or smartphone‑captured documents it may output nonsense or nothing |
| 49 | +* **Bad**, because it performs best on documents with straightforward layouts and may struggle with complex layouts, requiring additional pre‑ or post‑processing |
| 50 | +* **Bad**, because it may perform worse on noisy scans than other solutions |
| 51 | + |
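As a quick check of the preprocessing figures quoted above, relative change is (new − old) / old. The derivation below only re-computes the percentages from the four values already given in the list; it adds no new measurements.

$$
\frac{0.616 - 0.134}{0.134} \approx 3.597 \;(\approx +359.7\,\%),
\qquad
\frac{0.729 - 0.163}{0.163} \approx 3.472 \;(\approx +347.2\,\%)
$$

Both results are consistent with the quoted +359 % and +347 %, which appear to be truncated rather than rounded.
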
| 52 | +##### Sources |
| 53 | + |
| 54 | +* https://www.mdpi.com/2073-8994/12/5/715 |
| 55 | +* https://nanonets.com/blog/ocr-with-tesseract/ |
| 56 | +* https://en.wikipedia.org/wiki/Tesseract_(software) |
| 57 | + |
| 58 | +#### Option 2: Google Cloud Vision API |
| 59 | + |
| 60 | +Cloud‑based OCR service from Google. |
| 61 | + |
| 62 | +* **Good**, because Vision OCR reaches 98 % text accuracy on a diverse data set |
| 63 | +* **Good**, because it performs well across complex, multilingual, and handwritten documents without language hints |
| 64 | +* **Good**, because language support rivals Azure and Tesseract and surpasses AWS Textract |
| 65 | +* **Good**, because among the engines tested in the cited comparison it is the only viable option for reliable handwriting recognition |
| 66 | +* **Neutral**, because it costs \$1.50 per 1 000 pages (first 1 000 pages free); new customers get a \$300 credit (~200 000 pages) |
| 67 | +* **Bad**, because it requires an internet connection and a Google Cloud account |
| 68 | +* **Bad**, because documents are processed on Google servers (privacy concern) |
| 69 | + |
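The pricing figures above can be turned into a rough cost estimate. The following is a minimal sketch, not an official pricing calculator: the class and method names are hypothetical, the free allowance is treated as a single 1 000-page deduction, and pages beyond it are assumed to be billed pro rata at \$1.50 per 1 000 pages. Current pricing should be checked before relying on it.

```java
/** Back-of-the-envelope cost estimate using only the prices quoted above. */
final class OcrCostSketch {
    // Figures taken from the list above (assumed still current).
    static final long FREE_PAGES = 1_000;
    static final double USD_PER_1000_PAGES = 1.50;

    /** Estimated cost in USD, assuming pro-rata billing beyond the free allowance. */
    static double estimateUsd(long pages) {
        long billable = Math.max(0L, pages - FREE_PAGES);
        return billable * USD_PER_1000_PAGES / 1000.0;
    }

    /** Number of billable pages that a credit of the given size would cover. */
    static long pagesCoveredBy(double creditUsd) {
        return Math.round(creditUsd / USD_PER_1000_PAGES * 1000.0);
    }

    public static void main(String[] args) {
        System.out.println(estimateUsd(5_000));    // prints 6.0
        System.out.println(pagesCoveredBy(300.0)); // prints 200000
    }
}
```

The second call reproduces the "~200 000 pages" figure for the \$300 new-customer credit.
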
| 70 | +##### Sources |
| 71 | + |
| 72 | +* https://nanonets.com/blog/ocr-with-tesseract/ |
| 73 | +* https://www.plugger.ai/blog/comparison-of-paddle-ocr-easyocr-kerasocr-and-tesseract-ocr |
| 74 | +* https://en.wikipedia.org/wiki/Tesseract_(software) |
| 75 | + |
| 76 | +#### Option 3: AWS Textract |
| 77 | + |
| 78 | +Amazon’s document‑analysis service. |
| 79 | + |
| 80 | +* **Good**, because it moderately outperforms the other engines on noisy scans |
| 81 | +* **Good**, because it excels at extracting tables and form fields |
| 82 | +* **Good**, because it focuses on scanned, structured documents (e.g., forms) |
| 83 | +* **Neutral**, because pricing is similar to Google Cloud Vision |
| 84 | +* **Bad**, because support for non‑Latin scripts is limited |
| 85 | +* **Bad**, because handwriting recognition is weak |
| 86 | + |
| 87 | +##### Sources |
| 88 | + |
| 89 | +* https://news.ycombinator.com/item?id=20470439 |
| 90 | +* https://blog.roboflow.com/best-ocr-models-text-recognition/ |
| 91 | +* https://unstract.com/blog/guide-to-optical-character-recognition-with-tesseract-ocr/ |
| 92 | +* https://nanonets.com/blog/ocr-with-tesseract/ |
| 93 | +* https://www.plugger.ai/blog/comparison-of-paddle-ocr-easyocr-kerasocr-and-tesseract-ocr |
| 94 | +* https://en.wikipedia.org/wiki/Tesseract_(software) |
| 95 | + |
| 96 | +#### Option 4: Microsoft Azure Computer Vision |
| 97 | + |
| 98 | +* **Good**, because it leads the digital‑screenshot category (Category 1) of the cited benchmark with 99.8 % accuracy |
| 99 | +* **Good**, because it handles complex layouts (invoices, receipts, ID cards) well |
| 100 | +* **Good**, because it supports 25+ languages |
| 101 | +* **Bad**, because handwriting recognition is poor |
| 102 | +* **Bad**, because it requires cloud processing |
| 103 | + |
| 104 | +##### Sources |
| 105 | + |
| 106 | +* https://nanonets.com/blog/ocr-with-tesseract/ |
| 107 | +* https://blog.roboflow.com/best-ocr-models-text-recognition/ |
| 108 | +* https://dida.do/blog/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project |
| 109 | + |
| 110 | +#### Option 5: EasyOCR |
| 111 | + |
| 112 | +* **Good**, because it often matches or exceeds other open‑source engines |
| 113 | +* **Good**, because it supports 70+ languages and runs locally |
| 114 | +* **Good**, because it is optimized for speed, enabling real‑time processing |
| 115 | +* **Bad**, because it still trails top large multimodal models (LMMs) in pure accuracy |
| 116 | +* **Bad**, because published benchmark data is limited |
| 117 | + |
| 118 | +##### Sources |
| 119 | + |
| 120 | +* https://www.klippa.com/en/blog/information/tesseract-ocr/ |
| 121 | + |
| 122 | +#### Option 6: PaddleOCR |
| 123 | + |
| 124 | +* **Good**, because it achieves state‑of‑the‑art scores on ICDAR benchmarks |
| 125 | +* **Good**, because it supports major Asian and Latin scripts and is fast |
| 126 | +* **Bad**, because it supports fewer languages than Tesseract or EasyOCR |
| 127 | +* **Bad**, because its community is smaller and its ecosystem less mature |
| 128 | + |
| 129 | +##### Sources |
| 130 | + |
| 131 | +* https://dida.do/blog/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project |
| 132 | + |
| 133 | +#### Option 7: ABBYY FineReader SDK |
| 134 | + |
| 135 | +* **Good**, because it preserves document structure (tables, zones) in output |
| 136 | +* **Good**, because it excels at tabular data extraction |
| 137 | +* **Neutral**, because commercial licensing is required |
| 138 | +* **Bad**, because handwriting recognition is very poor |
| 139 | +* **Bad**, because cost is high (pricing is not publicly listed) |
| 140 | + |
| 141 | +##### Sources |
| 142 | + |
| 143 | +* https://nanonets.com/blog/ocr-with-tesseract/ |
| 144 | +* https://en.wikipedia.org/wiki/Tesseract_(software) |
| 145 | + |
| 146 | +## Decision Outcome |
| 147 | + |
| 148 | +Chosen option: "Option 1: Tesseract OCR", with planned addition of "Option 2: Google Cloud Vision API" as an optional premium feature, because Tesseract provides a solid free foundation while Google Vision offers superior accuracy for users willing to trade privacy for performance. |
| 149 | + |
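The decision above implies a "local by default, cloud only on explicit opt-in" selection rule. The following is a minimal sketch of that rule, not JabRef's actual implementation: `OcrEngineSelectionSketch`, `Engine`, `OcrSettings`, and the idea of gating on an API key are all hypothetical names and assumptions introduced for illustration.

```java
/** Sketch of "local by default, cloud only on explicit opt-in" engine selection. */
final class OcrEngineSelectionSketch {

    /** The two engines named in the decision; the enum itself is hypothetical. */
    enum Engine { TESSERACT, GOOGLE_CLOUD_VISION }

    /** Hypothetical user settings; not JabRef's actual preferences API. */
    record OcrSettings(boolean cloudOcrOptIn, String apiKey) {}

    /** Returns the cloud engine only if the user opted in AND supplied a non-blank key. */
    static Engine select(OcrSettings settings) {
        boolean hasKey = settings.apiKey() != null && !settings.apiKey().isBlank();
        if (settings.cloudOcrOptIn() && hasKey) {
            return Engine.GOOGLE_CLOUD_VISION;
        }
        // Default and fallback: documents never leave the machine.
        return Engine.TESSERACT;
    }

    public static void main(String[] args) {
        System.out.println(select(new OcrSettings(false, null)));          // TESSERACT
        System.out.println(select(new OcrSettings(true, "  ")));           // TESSERACT
        System.out.println(select(new OcrSettings(true, "example-key")));  // GOOGLE_CLOUD_VISION
    }
}
```

Falling back to Tesseract whenever the opt-in is incomplete keeps the privacy-preserving engine as the default, which is the property the decision is meant to guarantee.
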
| 150 | +### Consequences |
| 151 | + |
| 152 | +* **Good**, because it maintains a free, privacy-preserving option as the default |
| 153 | +* **Good**, because it allows users to opt in to higher accuracy when needed |
| 154 | +* **Good**, because Tesseract's 100+ language support covers academic needs |
| 155 | +* **Good**, because the Tesseract implementation is already complete and tested |
| 156 | +* **Bad**, because Tesseract struggles with handwritten text |
| 157 | +* **Bad**, because cloud integration requires additional development |
| 158 | +* **Bad**, because supporting multiple engines increases support complexity |
| 159 | + |
| 160 | +## Full Source Overview |
| 161 | + |
| 162 | +The web resources that informed this ADR: |
| 163 | + |
| 164 | +1. <https://www.mdpi.com/2073-8994/12/5/715> |
| 165 | +2. <https://nanonets.com/blog/ocr-with-tesseract/> |
| 166 | +3. <https://en.wikipedia.org/wiki/Tesseract_(software)> |
| 167 | +4. <https://www.plugger.ai/blog/comparison-of-paddle-ocr-easyocr-kerasocr-and-tesseract-ocr> |
| 168 | +5. <https://news.ycombinator.com/item?id=20470439> |
| 169 | +6. <https://blog.roboflow.com/best-ocr-models-text-recognition/> |
| 170 | +7. <https://unstract.com/blog/guide-to-optical-character-recognition-with-tesseract-ocr/> |
| 171 | +8. <https://dida.do/blog/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project> |
| 172 | +9. <https://www.klippa.com/en/blog/information/tesseract-ocr/> |