Commit 122d522

Add ADR 0047 for OCR engine selection and fix requested changes from JabRef/user-documentation#577

1 file changed: +172 -0

---
parent: Decision Records
nav_order: 47
---
# OCR Engine Selection for JabRef

## Context and Problem Statement

JabRef requires an OCR engine to extract text from scanned PDFs and image-based academic documents. Tesseract is currently implemented, but accuracy varies significantly with document quality and type. Academic documents present unique challenges: mathematical notation, multiple languages, complex layouts, tables, and mixed handwritten/printed content. Which OCR engine(s) should JabRef adopt to best serve its academic user base while balancing accuracy, cost, privacy, and implementation complexity?

## Decision Drivers

* Accuracy on academic document types (printed papers, scanned books, handwritten notes)
* Privacy requirements for unpublished research materials
* Cost constraints (open-source project with limited funding)
* Language support for the international academic community
* Support for mathematical and scientific notation
* Offline capability for secure research environments
* Processing speed for batch operations
* Implementation and maintenance complexity
* Table and structure extraction capabilities
* Long-term sustainability and community support

## Considered Options

* Option 1: Tesseract OCR (current implementation)
* Option 2: Google Cloud Vision API
* Option 3: AWS Textract
* Option 4: Microsoft Azure Computer Vision
* Option 5: EasyOCR
* Option 6: PaddleOCR
* Option 7: ABBYY FineReader SDK

### Pros and Cons of the Options

#### Option 1: Tesseract OCR

Originally developed by Hewlett-Packard as proprietary software in the 1980s, Tesseract was released as open source in 2005, and its development has been sponsored by Google since 2006.

* **Good**, because version 4 adds an LSTM-based OCR engine and models for many additional languages and scripts, bringing the total to 116 languages and 37 supported scripts
* **Good**, because it is completely free and open-source (Apache 2.0 license)
* **Good**, because it runs entirely offline, preserving document privacy
* **Good**, because it has an active developer community that regularly updates it, fixes bugs, and improves performance based on user feedback
* **Good**, because it supports over 100 languages out of the box and can be trained to recognize additional languages or custom fonts
* **Good**, because it handles right-to-left scripts such as Arabic and Hebrew, many Indic scripts, and CJK quite well
* **Neutral**, because Tesseract OCR is an open-source product that can be used for free
* **Neutral**, because preprocessing can boost the character-level accuracy of Tesseract 4.0 from 0.134 to 0.616 (+359 % relative change) and the F1 score from 0.163 to 0.729 (+347 % relative change), as illustrated in the sketch after this list
* **Bad**, because while it recognizes text in clean, well-scanned documents reliably, it may output nonsense or nothing for handwritten letters or smartphone-captured documents
* **Bad**, because it performs best on documents with straightforward layouts but may struggle with complex layouts, requiring additional pre- or post-processing
* **Bad**, because it may perform worse on noisy scans than other solutions
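
Because the preprocessing gains cited above depend on feeding Tesseract a cleaned-up image, here is a minimal sketch of that integration, assuming the tess4j Java wrapper around Tesseract; the `binarize` and `ocrScannedPage` helpers and the tessdata path are illustrative assumptions, not JabRef's actual implementation.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class TesseractOcrSketch {

    /** Naive binarization: redraw the page into a 1-bit image so Tesseract sees black-on-white text. */
    static BufferedImage binarize(BufferedImage source) {
        BufferedImage binary = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        Graphics2D g = binary.createGraphics();
        g.drawImage(source, 0, 0, null);
        g.dispose();
        return binary;
    }

    /** Runs Tesseract fully offline on one scanned page image. */
    static String ocrScannedPage(File pageImage) throws IOException, TesseractException {
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath("/usr/share/tessdata"); // installation-specific location of the traineddata files
        tesseract.setLanguage("eng+deu");             // any combination of the installed language models
        return tesseract.doOCR(binarize(ImageIO.read(pageImage)));
    }
}
```

The global threshold here is only a stand-in; the accuracy figures in the cited study likely rely on more elaborate preprocessing, but the integration point, cleaning the image before calling `doOCR`, is the same.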

##### Sources

* https://www.mdpi.com/2073-8994/12/5/715
* https://nanonets.com/blog/ocr-with-tesseract/
* https://en.wikipedia.org/wiki/Tesseract_(software)

#### Option 2: Google Cloud Vision API

Cloud-based OCR service from Google.

* **Good**, because Vision OCR reaches 98 % text accuracy on a diverse data set
* **Good**, because it performs well across complex, multilingual, and handwritten documents without language hints
* **Good**, because language support rivals Azure and Tesseract and surpasses AWS Textract
* **Good**, because it is the only viable option among the tested engines for reliable handwriting recognition
* **Neutral**, because it costs \$1.50 per 1 000 pages (first 1 000 pages free) and new customers get \$300 credit (~200 000 pages)
* **Bad**, because it requires an internet connection and a Google Cloud account (the sketch after this list shows the call path)
* **Bad**, because documents are processed on Google servers (privacy concern)
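
For comparison with the offline path above, a hedged sketch of what an opt-in call could look like using the official google-cloud-vision Java client; the helper name and error handling are illustrative only, and a real integration would also need explicit user consent and credential management.

```java
import com.google.cloud.vision.v1.AnnotateImageRequest;
import com.google.cloud.vision.v1.AnnotateImageResponse;
import com.google.cloud.vision.v1.BatchAnnotateImagesResponse;
import com.google.cloud.vision.v1.Feature;
import com.google.cloud.vision.v1.Image;
import com.google.cloud.vision.v1.ImageAnnotatorClient;
import com.google.protobuf.ByteString;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CloudVisionSketch {

    /** Sends one page image to the Cloud Vision API and returns the recognized text. */
    static String ocrViaCloudVision(Path pageImage) throws IOException {
        ByteString content = ByteString.copyFrom(Files.readAllBytes(pageImage));
        Image image = Image.newBuilder().setContent(content).build();
        // DOCUMENT_TEXT_DETECTION is tuned for dense text such as scanned papers.
        Feature feature = Feature.newBuilder().setType(Feature.Type.DOCUMENT_TEXT_DETECTION).build();
        AnnotateImageRequest request = AnnotateImageRequest.newBuilder()
                .addFeatures(feature)
                .setImage(image)
                .build();

        // Credentials come from GOOGLE_APPLICATION_CREDENTIALS; the page leaves the user's machine here.
        try (ImageAnnotatorClient client = ImageAnnotatorClient.create()) {
            BatchAnnotateImagesResponse response = client.batchAnnotateImages(List.of(request));
            AnnotateImageResponse pageResponse = response.getResponses(0);
            if (pageResponse.hasError()) {
                throw new IOException("Cloud Vision error: " + pageResponse.getError().getMessage());
            }
            return pageResponse.getFullTextAnnotation().getText();
        }
    }
}
```

Unlike the Tesseract path, the page image leaves the user's machine, which is exactly the privacy trade-off the decision below accepts only as an opt-in.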

##### Sources

* https://nanonets.com/blog/ocr-with-tesseract/
* https://www.plugger.ai/blog/comparison-of-paddle-ocr-easyocr-kerasocr-and-tesseract-ocr
* https://en.wikipedia.org/wiki/Tesseract_(software)

#### Option 3: AWS Textract

Amazon's document-analysis service.

* **Good**, because it moderately outperforms others on noisy scans
* **Good**, because it excels at extracting tables and form fields
* **Good**, because it focuses on scanned, structured documents (e.g., forms)
* **Neutral**, because pricing is similar to Google's
* **Bad**, because support for non-Latin scripts is limited
* **Bad**, because handwriting recognition is weak

##### Sources

* https://news.ycombinator.com/item?id=20470439
* https://blog.roboflow.com/best-ocr-models-text-recognition/
* https://unstract.com/blog/guide-to-optical-character-recognition-with-tesseract-ocr/
* https://nanonets.com/blog/ocr-with-tesseract/
* https://www.plugger.ai/blog/comparison-of-paddle-ocr-easyocr-kerasocr-and-tesseract-ocr
* https://en.wikipedia.org/wiki/Tesseract_(software)

#### Option 4: Microsoft Azure Computer Vision

* **Good**, because it leads the cited comparison's first category (digital screenshots) with 99.8 % accuracy
* **Good**, because it handles complex layouts (invoices, receipts, ID cards) well
* **Good**, because it supports 25+ languages
* **Bad**, because handwriting recognition is poor
* **Bad**, because it requires cloud processing

##### Sources

* https://nanonets.com/blog/ocr-with-tesseract/
* https://blog.roboflow.com/best-ocr-models-text-recognition/
* https://dida.do/blog/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project

#### Option 5: EasyOCR

* **Good**, because among open-source engines it often matches or exceeds its peers
* **Good**, because it supports 70+ languages and runs locally
* **Good**, because it is optimized for speed, enabling real-time processing
* **Bad**, because it still trails the top LMMs in pure accuracy
* **Bad**, because little benchmark data has been published for it

##### Sources

* https://www.klippa.com/en/blog/information/tesseract-ocr/

#### Option 6: PaddleOCR

* **Good**, because it achieves state-of-the-art scores on ICDAR benchmarks
* **Good**, because it supports major Asian and Latin scripts and is fast
* **Bad**, because it supports fewer languages than Tesseract or EasyOCR
* **Bad**, because its community is smaller and its ecosystem less mature

##### Sources

* https://dida.do/blog/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project

#### Option 7: ABBYY FineReader SDK

* **Good**, because it preserves document structure (tables, zones) in its output
* **Good**, because it excels at tabular data extraction
* **Neutral**, because commercial licensing is required
* **Bad**, because handwriting recognition is very poor
* **Bad**, because of its high cost (pricing is not publicly listed)

##### Sources

* https://nanonets.com/blog/ocr-with-tesseract/
* https://en.wikipedia.org/wiki/Tesseract_(software)

## Decision Outcome

Chosen option: "Option 1: Tesseract OCR", with the planned addition of "Option 2: Google Cloud Vision API" as an optional premium feature, because Tesseract provides a solid free foundation while Google Vision offers superior accuracy for users willing to trade privacy for performance.
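
A minimal sketch of how the two engines could sit behind a single abstraction, using hypothetical `OcrEngine` and `OcrService` names rather than JabRef's actual classes:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical abstraction; names do not reflect JabRef's actual code. */
interface OcrEngine {
    String extractText(Path scannedPage) throws IOException;
}

final class OcrService {
    private final OcrEngine offlineEngine;          // Tesseract: always available, never leaves the machine
    private final Optional<OcrEngine> cloudEngine;  // Google Cloud Vision: present only if configured

    OcrService(OcrEngine offlineEngine, Optional<OcrEngine> cloudEngine) {
        this.offlineEngine = offlineEngine;
        this.cloudEngine = cloudEngine;
    }

    /** Privacy-preserving default first; the cloud engine is used only on explicit opt-in. */
    String extractText(Path scannedPage, boolean userOptedIntoCloud) throws IOException {
        if (userOptedIntoCloud && cloudEngine.isPresent()) {
            return cloudEngine.get().extractText(scannedPage);
        }
        return offlineEngine.extractText(scannedPage);
    }
}
```

Keeping the cloud engine optional keeps the default configuration fully offline, which matches the privacy and cost drivers listed above.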

### Consequences

* **Good**, because it maintains a free, privacy-preserving option as the default
* **Good**, because it allows users to opt in to higher accuracy when needed
* **Good**, because Tesseract's 100+ language support covers academic needs
* **Good**, because the Tesseract implementation is already complete and tested
* **Bad**, because Tesseract struggles with handwritten text
* **Bad**, because the cloud integration requires additional development
* **Bad**, because supporting multiple engines increases support complexity

## Full Source Overview

The web resources that informed this ADR:

1. <https://www.mdpi.com/2073-8994/12/5/715>
2. <https://nanonets.com/blog/ocr-with-tesseract/>
3. <https://en.wikipedia.org/wiki/Tesseract_(software)>
4. <https://www.plugger.ai/blog/comparison-of-paddle-ocr-easyocr-kerasocr-and-tesseract-ocr>
5. <https://news.ycombinator.com/item?id=20470439>
6. <https://blog.roboflow.com/best-ocr-models-text-recognition/>
7. <https://unstract.com/blog/guide-to-optical-character-recognition-with-tesseract-ocr/>
8. <https://dida.do/blog/comparison-of-ocr-tools-how-to-choose-the-best-tool-for-your-project>
9. <https://www.klippa.com/en/blog/information/tesseract-ocr/>
