diff --git a/en/SUMMARY.md b/en/SUMMARY.md index 4600897d5..f40f968f3 100644 --- a/en/SUMMARY.md +++ b/en/SUMMARY.md @@ -75,6 +75,7 @@ * [XMP metadata support in JabRef](advanced/xmp.md) * [Remote operation](advanced/remote.md) * [Custom themes](advanced/custom-themes.md) + * [OCR engine selection](advanced/ocr-engine-selection.md) * [Journal abbreviations](advanced/journalabbreviations.md) * [New subdatabase based on AUX file](advanced/newbasedonaux.md) * [How to expand first names of an entry](advanced/how-to-expand-firstnames.md) diff --git a/en/advanced/ocr-engine-selection.md b/en/advanced/ocr-engine-selection.md new file mode 100644 index 000000000..fb4592a85 --- /dev/null +++ b/en/advanced/ocr-engine-selection.md @@ -0,0 +1,153 @@ +# ADR-002: OCR Engine Selection for JabRef + +**Status:** proposed +**Date:** 2025-07-13 +**Decision-makers:** JabRef development team, GSoC contributor (Kaan Erdem) +**Consulted:** Academic users, library scientists, OCR technology experts +**Informed:** JabRef community, contributors, institutional users + +## OCR Engine Selection for Academic Document Processing + +### Context and Problem Statement + +JabRef requires an OCR engine to extract text from scanned PDFs and image-based academic documents. Tesseract is currently implemented, but accuracy varies significantly with document quality and type. Academic documents present unique challenges: mathematical notation, multiple languages, complex layouts, tables, and mixed handwritten/printed content. Which OCR engine(s) should JabRef adopt to best serve its academic user base while balancing accuracy, cost, privacy, and implementation complexity? + +### Decision Drivers + +* Accuracy on academic document types (printed papers, scanned books, handwritten notes) +* Privacy requirements for unpublished research materials +* Cost constraints (open-source project with limited funding) +* Language support for international academic community +* Support for mathematical and scientific notation +* Offline capability for secure research environments +* Processing speed for batch operations +* Implementation and maintenance complexity +* Table and structure extraction capabilities +* Long-term sustainability and community support + +### Considered Options + +* Option 1: Tesseract OCR (current implementation) +* Option 2: Google Cloud Vision API +* Option 3: AWS Textract +* Option 4: Microsoft Azure Computer Vision +* Option 5: EasyOCR +* Option 6: PaddleOCR +* Option 7: ABBYY FineReader SDK + +### Decision Outcome + +Chosen option: "Option 1: Tesseract OCR", with planned addition of "Option 2: Google Cloud Vision API" as an optional premium feature, because Tesseract provides a solid free foundation while Google Vision offers superior accuracy for users willing to trade privacy for performance. 
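
To make the chosen two‑engine setup concrete, the sketch below shows one way a pluggable OCR abstraction could be wired, with Tesseract as the offline default and any cloud engine kept behind an explicit opt‑in. This is illustrative only: the `OcrEngine`, `TesseractOcrEngine`, and `OcrEngineFactory` names are hypothetical rather than JabRef's actual classes, and it assumes the Tess4J Java binding for Tesseract.

```java
// Hypothetical sketch (not JabRef's actual API): a pluggable OCR engine
// abstraction with Tesseract as the offline default, via the Tess4J binding.
import java.io.File;
import java.util.Optional;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

interface OcrEngine {
    /** Returns the recognized text, or Optional.empty() if recognition failed. */
    Optional<String> extractText(File scanImage);
}

class TesseractOcrEngine implements OcrEngine {
    private final Tesseract tesseract = new Tesseract();

    TesseractOcrEngine(String tessdataPath, String language) {
        tesseract.setDatapath(tessdataPath);   // e.g. "/usr/share/tessdata"
        tesseract.setLanguage(language);       // e.g. "eng" or "eng+deu"
    }

    @Override
    public Optional<String> extractText(File scanImage) {
        try {
            return Optional.of(tesseract.doOCR(scanImage));
        } catch (TesseractException e) {
            return Optional.empty();
        }
    }
}

class OcrEngineFactory {
    /** Tesseract stays the privacy-preserving default; cloud engines are strictly opt-in. */
    static OcrEngine fromPreferences(boolean cloudOptIn) {
        if (cloudOptIn) {
            // A cloud-backed engine (e.g. the Vision sketch under Option 2 below)
            // would be returned here once integrated.
            throw new UnsupportedOperationException("Cloud OCR not yet integrated");
        }
        return new TesseractOcrEngine("/usr/share/tessdata", "eng");
    }
}
```

Keeping cloud engines behind an explicit preference flag preserves the privacy‑by‑default behaviour described in the decision outcome above.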
+ +### Consequences + +* **Good**, because maintains free, privacy-preserving option as default +* **Good**, because allows users to opt-in to higher accuracy when needed +* **Good**, because Tesseract's 100+ language support covers academic needs +* **Good**, because implementation is already complete and tested +* **Bad**, because Tesseract struggles with handwritten text +* **Bad**, because requires additional development for cloud integration +* **Bad**, because increases support complexity with multiple engines + +### Confirmation + +* Benchmark tests on sample academic documents (100+ PDFs) +* Privacy audit for any cloud service integration +* Performance metrics tracking (accuracy, speed, error rates) +* User feedback collection after 6‑month deployment +* Regular accuracy testing against new document types + +### Pros and Cons of the Options + +#### Option 1: Tesseract OCR + +Originally developed by Hewlett‑Packard as proprietary software in the 1980s, released as open source in 2005 and development was sponsored by Google in 2006. + +* **Good**, because Version 4 adds LSTM‑based OCR engine and models for many additional languages and scripts, bringing the total to 116 languages. Additionally 37 scripts are supported +* **Good**, because completely free and open‑source (Apache 2.0 license) +* **Good**, because runs entirely offline preserving document privacy +* **Good**, because has an active developer community that regularly updates it, fixes bugs, and improves performance based on user feedback +* **Good**, because supports over 100 languages out of the box and can be trained to recognize additional languages or custom fonts +* **Good**, because Tesseract can process right‑to‑left text such as Arabic or Hebrew, many Indic scripts as well as CJK quite well +* **Neutral**, because Tesseract OCR is an open‑source product that can be used for free +* **Neutral**, because preprocessing can boost the character‑level accuracy of Tesseract 4.0 from 0.134 to 0.616 (+359 % relative change) and the F1 score from 0.163 to 0.729 (+347 % relative change) +* **Bad**, because Tesseract OCR recognizes the text in the well‑scanned email pretty well. However, for handwritten letters or smartphone‑captured documents it may output nonsense or nothing +* **Bad**, because performs best on documents with straightforward layouts but may struggle with complex layouts, requiring additional pre‑ or post‑processing +* **Bad**, because it may perform poorer on noisy scans compared to other solutions + +#### Option 2: Google Cloud Vision API + +Cloud‑based OCR service from Google. + +* **Good**, because Vision OCR reaches 98 % text accuracy on a diverse data set +* **Good**, because it performs well across complex, multilingual, and handwritten documents without language hints +* **Good**, because language support rivals Azure and Tesseract and surpasses AWS Textract +* **Good**, because it is the only viable option among the tested engines for reliable handwriting recognition +* **Neutral**, because costs \$1.50 per 1 000 pages (first 1 000 pages free) and new customers get \$300 credit (~200 000 pages) +* **Bad**, because requires an internet connection and a Google Cloud account +* **Bad**, because documents are processed on Google servers (privacy concern) + +#### Option 3: AWS Textract + +Amazon’s document‑analysis service. 

#### Option 3: AWS Textract

Amazon’s document‑analysis service.

* **Good**, because it moderately outperforms the other engines on noisy scans
* **Good**, because it excels at extracting tables and form fields
* **Good**, because it is optimized for scanned, structured documents (e.g., forms)
* **Neutral**, because pricing is similar to Google Cloud Vision
* **Bad**, because it has limited support for non‑Latin scripts
* **Bad**, because handwriting recognition is weak

#### Option 4: Microsoft Azure Computer Vision

* **Good**, because it leads Category 1 (digital screenshots) with 99.8 % accuracy
* **Good**, because it handles complex layouts (invoices, receipts, ID cards) well
* **Good**, because it supports 25+ languages
* **Bad**, because handwriting recognition is poor
* **Bad**, because it requires cloud processing

#### Option 5: EasyOCR

* **Good**, because it often matches or exceeds its open‑source peers
* **Good**, because it supports 70+ languages and runs locally
* **Good**, because it is optimized for speed, enabling real‑time processing
* **Bad**, because it still trails top large multimodal models (LMMs) in pure accuracy
* **Bad**, because published benchmark data is limited

#### Option 6: PaddleOCR

* **Good**, because it achieves state‑of‑the‑art scores on ICDAR benchmarks
* **Good**, because it supports major Asian and Latin scripts and is fast
* **Bad**, because it supports fewer languages than Tesseract or EasyOCR
* **Bad**, because its community is smaller and its ecosystem less mature

#### Option 7: ABBYY FineReader SDK

* **Good**, because it preserves document structure (tables, zones) in its output
* **Good**, because it excels at tabular data extraction
* **Neutral**, because commercial licensing is required
* **Bad**, because handwriting recognition is very poor
* **Bad**, because of its high cost (pricing not publicly listed)

### More Information

* The current implementation uses Tesseract 4.x with the LSTM engine
* In benchmarks, Google Cloud Vision shows the highest overall accuracy
* Handwriting‑heavy documents (benchmark categories 2 & 3) are the main differentiator among engines
* JabRef’s privacy policy must be updated if cloud services are integrated
* Institutions may wish to provide shared API keys for Vision/Textract
* Image preprocessing has a large impact on Tesseract accuracy (see the sketch below)
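
To illustrate the preprocessing point above, the deliberately simple sketch below converts a scan to grayscale and applies a fixed global threshold before OCR. The class name and threshold value are arbitrary, and a production pipeline would rather use adaptive thresholding, deskewing, and denoising.

```java
// Minimal illustration of why preprocessing matters: convert a scan to
// grayscale and apply a fixed global threshold before handing it to Tesseract.
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

class ScanPreprocessor {

    /** Returns a black-and-white version of the input scan. */
    static BufferedImage binarize(File scan, int threshold) throws IOException {
        BufferedImage source = ImageIO.read(scan);
        BufferedImage result = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_BINARY);

        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Luminance-weighted grayscale value.
                int gray = (int) (0.299 * r + 0.587 * g + 0.114 * b);
                result.setRGB(x, y, gray < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        BufferedImage cleaned = binarize(new File("scan.png"), 160);
        ImageIO.write(cleaned, "png", new File("scan-binarized.png"));
        // The binarized file can then be passed to Tesseract (e.g. via Tess4J's doOCR).
    }
}
```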