The get_text() method always works on a TextPage object: either one created internally for that single call, or one passed in via the textpage parameter.
The main point is that a TextPage already holds the page text preprocessed by MuPDF in a way that takes many criteria into account: font size, the font itself, horizontal character distance and character width, vertical distance to the previous and subsequent lines, writing direction/angle, and several more.
Based on these heuristics, a text hierarchy is established: block -> line -> span -> character.
This algorithm is executed while parsing the page's /Contents source (which contains all appearance-relevant comm…
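
As a rough illustration of the points above, here is a minimal sketch that creates one TextPage, passes it to several get_text() calls, and then walks the block -> line -> span -> character hierarchy of the "rawdict" output. The helper names `walk_text_hierarchy` and `extract_with_shared_textpage` are made up for this example, and the sketch assumes that reusing the TextPage avoids re-parsing the page on every call.

```python
def walk_text_hierarchy(page_dict):
    """Yield (block_no, line_no, span_no, char) from a dict shaped like
    the output of page.get_text("rawdict")."""
    for b, block in enumerate(page_dict.get("blocks", [])):
        if block.get("type", 0) != 0:  # type 0 = text block; skip image blocks
            continue
        for l, line in enumerate(block.get("lines", [])):
            for s, span in enumerate(line.get("spans", [])):
                for ch in span.get("chars", []):
                    yield b, l, s, ch["c"]


def extract_with_shared_textpage(path, page_number=0):
    """Hypothetical helper: build the TextPage once, reuse it twice."""
    import fitz  # PyMuPDF, imported lazily so the walker above has no dependency

    doc = fitz.open(path)
    try:
        page = doc[page_number]
        tp = page.get_textpage()  # the page is parsed here, once
        plain = page.get_text("text", textpage=tp)   # reuses tp
        raw = page.get_text("rawdict", textpage=tp)  # reuses tp again
        return plain, list(walk_text_hierarchy(raw))
    finally:
        doc.close()
```

Calling `extract_with_shared_textpage("input.pdf")` (the file name is a placeholder) returns the plain text plus a flat list of characters tagged with their block, line and span index, both derived from the same TextPage.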

Answer selected by milinddeore
Category: Q&A
Labels: enhancement wontfix no intention to resolve

This discussion was converted from issue #2525 on July 11, 2023 09:08.