Alignment between PyMuPDF output format and PyTesseract? #3871

@markelwin

Description

Is your feature request related to a problem? Please describe.
I recently began an OCR data-cleaning project called RemarkableOCR, built on top of PyTesseract and specialized for books, journal articles, and newspapers. This is the first time I am exploring PDF data structures, and I would like to use PyMuPDF's native PDF data extraction rather than convert PDFs to images and OCR the images. This question concerns the guarantees of PDF data blocks, which you are familiar with and I am not.

Describe the solution you'd like
fitz.open(pdf_filename)[0].get_text(option="rawdict") returns a collection of blocks; each block contains lines, each line contains spans, and each span contains individual characters with their precise bounding boxes. This data is "easy enough" to parse into word-sized bounding boxes that replicate the PyTesseract data output. My question is how you would approach this problem. The blocks can be sorted into left-to-right, top-to-bottom order, but some layouts do not lend themselves to a structured reading order that way. Can you share your insight into how PyMuPDF defines 'block' and 'line' compared with how PyTesseract defines them, and in which edge cases the two align or diverge? Thank you for entering into this discussion with me.
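
For concreteness, the parsing described above might look roughly like the sketch below. It walks a rawdict-shaped dictionary (blocks, lines, spans, chars) and groups characters into word-level rows whose keys mimic a subset of the PyTesseract data output. The function names rawdict_to_words and _make_row are hypothetical, the zero-based numbering is a choice of this sketch, and splitting words on whitespace characters is an assumption that may not hold for every PDF. Coordinates stay in PDF points rather than image pixels, and there is no equivalent of PyTesseract's conf or par_num here.

```python
def _make_row(block_num, line_num, word_num, chars):
    """Merge character bboxes (x0, y0, x1, y1) into one word-level row."""
    x0 = min(c["bbox"][0] for c in chars)
    y0 = min(c["bbox"][1] for c in chars)
    x1 = max(c["bbox"][2] for c in chars)
    y1 = max(c["bbox"][3] for c in chars)
    return {
        "block_num": block_num,
        "line_num": line_num,
        "word_num": word_num,
        "left": x0,
        "top": y0,
        "width": x1 - x0,
        "height": y1 - y0,
        "text": "".join(c["c"] for c in chars),
    }


def rawdict_to_words(page_dict):
    """Group rawdict characters into words (hypothetical helper).

    page_dict is expected to look like the result of
    page.get_text("rawdict"); only text blocks (type 0) are used.
    """
    rows = []
    for block_num, block in enumerate(page_dict.get("blocks", [])):
        if block.get("type", 0) != 0:  # skip image blocks
            continue
        for line_num, line in enumerate(block.get("lines", [])):
            # A word can straddle spans (e.g. a font change mid-word),
            # so flatten the line's characters before splitting.
            chars = [
                ch
                for span in line.get("spans", [])
                for ch in span.get("chars", [])
            ]
            word_num = 0
            current = []
            # The trailing sentinel space flushes the last word of the line.
            for ch in chars + [{"c": " ", "bbox": None}]:
                if ch["c"].isspace():
                    if current:
                        rows.append(
                            _make_row(block_num, line_num, word_num, current)
                        )
                        word_num += 1
                        current = []
                else:
                    current.append(ch)
    return rows
```

Assuming PyMuPDF is installed, this would be called as rawdict_to_words(fitz.open(pdf_filename)[0].get_text("rawdict")). PyMuPDF's get_text("words") option already returns word-level boxes with block, line, and word numbers, so a sketch like this is mainly useful when per-character control over the grouping is needed.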
