get_text sort should be configurable based on feature. #2530
Is your feature request related to a problem? Please describe. Here is the example with
Describe the solution you'd like What I did was sort the text (after removing images) by Y-axis position, so that I can join the lines that are close to each other. A newline or a break between paragraphs shows up as a larger distance between lines. The following code snippet can help you understand what I am proposing.
Describe alternatives you've considered
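The snippet referred to above is not reproduced here. As a stand-in, the following is a minimal sketch of the idea and not the original poster's code. It assumes each text line is available as an `(x0, y0, x1, y1, text)` tuple, which matches the leading fields of the tuples returned by `page.get_text("blocks")` in PyMuPDF. The function name `group_by_vertical_gap` and the `max_gap` threshold are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# (x0, y0, x1, y1, text): assumed input shape, mirroring the leading
# fields of the tuples from page.get_text("blocks").
Line = Tuple[float, float, float, float, str]


def group_by_vertical_gap(lines: List[Line], max_gap: float = 5.0) -> List[str]:
    """Sort lines top to bottom and join neighbours separated by a small gap.

    Hypothetical helper: max_gap is in page units (points) and would need
    tuning per document.
    """
    ordered = sorted(lines, key=lambda ln: (ln[1], ln[0]))  # by y0, then x0
    groups: List[List[str]] = []
    prev_bottom: Optional[float] = None
    for x0, y0, x1, y1, text in ordered:
        if prev_bottom is not None and y0 - prev_bottom <= max_gap:
            groups[-1].append(text.strip())  # close to previous line: same group
        else:
            groups.append([text.strip()])  # large gap: start a new group
        prev_bottom = y1
    return [" ".join(g) for g in groups]


sample = [
    (50, 130, 300, 142, "Second paragraph."),
    (50, 100, 300, 112, "First paragraph,"),
    (50, 114, 300, 126, "continued here."),
]
paragraphs = group_by_vertical_gap(sample, max_gap=3.0)
print(paragraphs)
```

A single threshold like this only suits single-column pages with regular line spacing; multi-column layouts would need a different grouping rule.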
Replies: 2 comments
All of the above may break, of course. If you see text columns on a page, then you may be looking at text that should just be read column-wise, from left to right. So my main point is that PDF text placement can be quite arbitrary. No sorting rule or approach will always work. There are cases where you need to sort single-character bboxes after rounding the coordinates in some suitable way. You may want to look at layout-preserving text extraction, which ultimately looks at single characters to reproduce a faithful text layout. Or take a look at this utility script, which aims to detect text columns.
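To illustrate the remark about sorting single-character bboxes after rounding, here is a rough sketch under stated assumptions. Characters are given as `(x0, y0, x1, y1, char)` tuples, a simplified stand-in for the per-character entries that `page.get_text("rawdict")` provides. The helper name `chars_to_lines` and the `tolerance` value are hypothetical. It is one possible rounding scheme, not the method used by PyMuPDF itself.

```python
from typing import Dict, List, Tuple

# (x0, y0, x1, y1, char): assumed simplified form of one character's bbox.
Char = Tuple[float, float, float, float, str]


def chars_to_lines(chars: List[Char], tolerance: float = 3.0) -> List[str]:
    """Bucket characters by the rounded bottom of their bbox, then sort by x.

    Hypothetical helper: characters whose bottoms differ by only a little
    usually get the same key, but values near a bucket boundary can still
    be split, so this is a heuristic only.
    """
    rows: Dict[int, List[Tuple[float, str]]] = {}
    for x0, y0, x1, y1, c in chars:
        key = round(y1 / tolerance)  # coarse vertical bucket
        rows.setdefault(key, []).append((x0, c))
    # Buckets top to bottom, characters left to right within each bucket.
    return ["".join(c for _, c in sorted(rows[key])) for key in sorted(rows)]


jittered = [
    (20, 10.2, 26, 20.1, "b"),
    (10, 10.0, 16, 19.9, "a"),
    (10, 30.0, 16, 40.2, "c"),
    (20, 30.1, 26, 39.8, "d"),
]
lines = chars_to_lines(jittered, tolerance=3.0)
print(lines)
```

As the reply notes, no such rule always works: the right tolerance depends on font size and line spacing, and rotated or column-wise text needs separate handling.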
@JorjMcKie Many thanks for explaining it in detail, I really appreciate it! I completely understand the challenge of extracting data from highly formatted documents. Could you please give me one more piece of information? I was comparing it with the Adobe Extract APIs, which use Sensei AI to extract blocks and document structure, and the results are quite neat. Does PyMuPDF also have such an AI tool in the pipeline?
The `get_text()` method always uses a `TextPage` object - which can be either internally created for this execution of the method only, or the one provided via parameter.
The main point is that a TextPage already has the page text preprocessed (by MuPDF) in a way that takes many criteria into account, like font size, the font itself, horizontal character distance and character width, vertical distances to previous and subsequent lines, writing direction/angle and several more.
Based on these heuristics, a text hierarchy like this is established: `block -> line -> span -> character`.
This algorithm is executed while parsing the page's `/Contents` source (which contains all appearance-relevant comm…