EasyOCR Unstructured is an Optical Character Recognition (OCR) library that extracts text from PDFs and then groups the extracted text by proximity.
It is intended for PDF files whose text does not follow the standard left-to-right, top-to-bottom reading order.
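To give a feel for what grouping by proximity means, here is a minimal sketch. It is not the library's actual algorithm: the helper `group_by_proximity`, the `(text, x, y)` input shape, and the `max_gap` threshold are all assumptions made for illustration. It links any two text items whose center points lie within `max_gap` of each other, and links are transitive.

```python
from math import hypot


def group_by_proximity(items, max_gap=50.0):
    """Group (text, x, y) items whose centers lie within max_gap of each other.

    Hypothetical helper, not part of easyocr-unstructured. Links are
    transitive: if A is near B and B is near C, all three share a group.
    """
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the groups of every pair of items that are close enough.
    for i, (_, xi, yi) in enumerate(items):
        for j in range(i + 1, len(items)):
            _, xj, yj = items[j]
            if hypot(xi - xj, yi - yj) <= max_gap:
                parent[find(j)] = find(i)

    # Collect texts per group, keeping the original item order.
    groups = {}
    for i, (text, _, _) in enumerate(items):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())


# Made-up sample detections: two items close together, one far away.
detections = [
    ("Invoice", 100, 40),
    ("No. 1234", 130, 60),
    ("Total due", 500, 700),
]
print(group_by_proximity(detections))
```

The pairwise comparison is quadratic in the number of items, which is usually acceptable for a single page of text.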
pip install easyocr-unstructured
from pprint import pprint as pp
# Assumes the class is exported at the top level of the package
from easyocr_unstructured import EasyocrUnstructured
# Initialize the EasyOCR Unstructured object
easyocr = EasyocrUnstructured()
# Invoke the OCR process on your PDF file
result = easyocr.invoke('/path/to/your_pdf_file.pdf')
# result is a list of lists of strings: one inner list per group of nearby text
pp(result)
The output will look something like this:
[
["This is the first piece of text. Nothing near it"],
["This is the second piece of text.", "This is the third piece of text that was close to the second"],
["This is the fourth piece of text. Nothing near it"],
...
]
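Because the result is a plain list of lists of strings, it can be post-processed with ordinary Python. The sketch below uses made-up sample data in the same shape as the output above. The helper names `groups_to_paragraphs` and `flatten_groups` are hypothetical and not part of the library.

```python
# Made-up sample in the same shape as the library's output:
# a list of groups, each group a list of strings.
sample_result = [
    ["Page header"],
    ["Name:", "Jane Doe"],
    ["Footer note"],
]


def groups_to_paragraphs(groups):
    """Join the strings of each group into one paragraph string."""
    return [" ".join(group) for group in groups]


def flatten_groups(groups):
    """Return every string in order, discarding the grouping."""
    return [text for group in groups for text in group]


for paragraph in groups_to_paragraphs(sample_result):
    print(paragraph)
```

Joining each group keeps text that was found close together in one paragraph, while flattening is handy when only the raw strings matter.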
- Python 3.12+
There are no tests yet.
- Wing Pro
- Python 3.12
- numpy
- easyocr
- pdf2image
- hashlib
Contributions are welcome: any sensible and safe change will be added!
Author: Kevin Fink
License: MIT