Alignment between PyMuPDF output format and PyTesseract? #3871

**Is your feature request related to a problem? Please describe.**
I recently began an OCR data-cleaning project called [RemarkableOCR](https://github.com/markelwin/RemarkableOCR), built on top of PyTesseract and specialized for books, journal articles, and newspapers. This is the first time I am exploring PDF data structures, and I would like to use PyMuPDF's native PDF text extraction rather than convert PDFs to images and OCR the images. My question concerns what guarantees PyMuPDF makes about the text blocks it returns, which you are familiar with and I am not.

**Describe the solution you'd like**
`fitz.open(pdf_filename)[0].get_text(option="rawdict")` returns a collection of blocks. Each block contains lines, each line contains spans, and each span contains individual characters with their precise bounding boxes. This data is "easy enough" to parse into word-sized bounding boxes that replicate the PyTesseract data output. My question is how you would approach this problem. The blocks can be sorted into left-to-right, top-to-bottom order, but there are layouts in which that order does not correspond to a sensible reading order. Can you share your insight into how PyMuPDF defines 'block' and 'line' compared with how PyTesseract defines them, and in which edge cases the two do and do not align? Thank you for entering into this discussion with me.
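To make the intended parsing concrete, here is a minimal sketch of how I picture grouping `rawdict` characters into word-sized boxes. It is only an illustration, not a claim about how PyMuPDF or Tesseract segment text. The function names `rawdict_to_words` and `_word_row` are hypothetical, words are split purely on whitespace characters, and the output keys imitate the columns of PyTesseract's `image_to_data`. Several values are assumptions: `level` is fixed at 5 (Tesseract's word level), `par_num` is a placeholder because `rawdict` has no paragraph layer, `conf` is omitted because native extraction has no confidence score, and coordinates stay in PDF points rather than image pixels.

```python
def _word_row(chars, page_num, block_num, line_num, word_num):
    """Merge character boxes into one word row (hypothetical helper)."""
    x0 = min(c["bbox"][0] for c in chars)
    y0 = min(c["bbox"][1] for c in chars)
    x1 = max(c["bbox"][2] for c in chars)
    y1 = max(c["bbox"][3] for c in chars)
    return {
        "level": 5,            # assumption: mimic Tesseract's word level
        "page_num": page_num,
        "block_num": block_num,
        "par_num": 1,          # placeholder: rawdict has no paragraph layer
        "line_num": line_num,
        "word_num": word_num,
        "left": x0,            # PDF points, not pixels
        "top": y0,
        "width": x1 - x0,
        "height": y1 - y0,
        "text": "".join(c["c"] for c in chars),
    }


def rawdict_to_words(page_dict, page_num=1):
    """Split every rawdict line into whitespace-delimited word rows."""
    rows = []
    # Keep text blocks only; image blocks carry type 1 and have no lines.
    blocks = [b for b in page_dict.get("blocks", []) if b.get("type", 0) == 0]
    for block_num, block in enumerate(blocks, start=1):
        for line_num, line in enumerate(block.get("lines", []), start=1):
            # Flatten spans so a word that changes font mid-word stays whole.
            chars = [ch for span in line.get("spans", [])
                     for ch in span.get("chars", [])]
            word_num, current = 0, []
            # A trailing sentinel space flushes the last word of the line.
            for ch in chars + [{"c": " "}]:
                if ch["c"].isspace():
                    if current:
                        word_num += 1
                        rows.append(_word_row(current, page_num, block_num,
                                              line_num, word_num))
                        current = []
                else:
                    current.append(ch)
    return rows
```

With PyMuPDF installed, this would be called as `rawdict_to_words(fitz.open(pdf_filename)[0].get_text(option="rawdict"))`. Block and line numbers here simply follow the order in which PyMuPDF returns them, which is exactly the part I am unsure maps onto PyTesseract's notion of blocks and lines.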


