Description
Is your feature request related to a problem? Please describe.
I recently began an OCR data-cleaning project called RemarkableOCR, built on top of PyTesseract and specialized for books, journal articles, and newspapers. This is the first time I am exploring PDF data structures, and I would like to use PyMuPDF's native PDF text extraction rather than convert PDFs to images and OCR the images. My question concerns what guarantees PDF data blocks provide, which you are familiar with and I am not.
Describe the solution you'd like
fitz.open(pdf_filename)[0].get_text(option="rawdict") returns a collection of blocks, each of which contains a collection of lines, which contain spans, which in turn contain individual characters with their precise bounding boxes. This data is "easy enough" to parse into word-sized bounding boxes that replicate the PyTesseract data output. My question is how you would approach this problem. The blocks can be sorted into left-to-right, top-to-bottom order, but there are layouts in which that order does not correspond to a natural reading order. Can you provide your insight into how 'block' and 'line' are defined here compared with how PyTesseract defines them, and in which edge cases the two align and do not align? Thank you for entering into this discussion with me.
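For concreteness, the parsing step described above could be sketched roughly as follows. This is only an illustration: the function names `union_bbox` and `words_from_rawdict` and the output field names are hypothetical, and the sketch assumes that words are separated by whitespace characters in the extracted data, which may not hold for every PDF. It takes a dictionary shaped like the "rawdict" output (blocks, lines, spans, chars with "c" and "bbox" keys) and does not itself import PyMuPDF.

```python
def union_bbox(a, b):
    """Return the smallest (x0, y0, x1, y1) box covering both a and b."""
    if a is None:
        return tuple(b)
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def words_from_rawdict(page_dict):
    """Group rawdict characters into whitespace-delimited words.

    Returns a list of dicts with hypothetical keys:
    text, bbox, block, line, word.
    """
    words = []
    for block_no, block in enumerate(page_dict.get("blocks", [])):
        if block.get("type", 0) != 0:  # type 0 is text; skip image blocks
            continue
        for line_no, line in enumerate(block.get("lines", [])):
            # Flatten all spans of the line into one character sequence.
            chars = [ch for span in line.get("spans", [])
                     for ch in span.get("chars", [])]
            text, box, word_no = "", None, 0
            # A trailing sentinel space flushes the last word of the line.
            for ch in chars + [{"c": " ", "bbox": None}]:
                if ch["c"].isspace():
                    if text:
                        words.append({"text": text, "bbox": box,
                                      "block": block_no, "line": line_no,
                                      "word": word_no})
                        word_no += 1
                    text, box = "", None
                else:
                    text += ch["c"]
                    box = union_bbox(box, ch["bbox"])
    return words
```

With PyMuPDF installed, the input would come from `fitz.open(pdf_filename)[0].get_text("rawdict")`. PyMuPDF also offers `get_text("words")`, which returns word tuples with bounding boxes and block, line, and word numbers directly, so a hand-rolled grouping like this is mainly useful when per-character control is needed.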