get_text sort should be configurable based on feature. #2530

milinddeore · 2023-07-10T18:17:31Z

milinddeore
Jul 10, 2023

Is your feature request related to a problem? Please describe.
During text extraction, i.e. get_text, we pass a parameter sort, which sorts based on the order of the various blocks. For highly formatted documents this doesn't work well, and hence line-by-line text reading fails. The same can be achieved quite well with the pdftotext library.

Here is an example with pdftotext, which preserves the layout to some extent, so at least you can read line by line:

import pdftotext

# Load your PDF
with open("ONEPAGER.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Each item of pdf is the text of one page; print it line by line
for page_text in pdf:
    for line in page_text.split("\n"):
        print(line)

Describe the solution you'd like

What I did is sort the text (after removing images) by Y-axis position; this way I can connect the lines that are close to each other. A new paragraph or gap between lines shows up as a larger distance. The following snippet should help you understand what I am proposing.

import fitz  # PyMuPDF

doc = fitz.open("ONEPAGE.pdf")

for page in doc:
    data = page.get_text("dict", sort=True)

    # Sort blocks by the top edge (y0) of their bbox
    data_sorted = sorted(data["blocks"], key=lambda block: float(block["bbox"][1]))

    # Remove image blocks, keeping text blocks only (type 0)
    data_sorted = [block for block in data_sorted if block["type"] == 0]

Describe alternatives you've considered
Sorting should be configurable with multiple options, for example by Y-axis, by X-axis, by block type (text/images), and the like.
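As an illustration of what such a configurable sort might look like on the caller's side, here is a minimal sketch. The helper `sort_blocks` and its `by` and `text_only` parameters are hypothetical (they are not part of PyMuPDF). It only assumes blocks shaped like the entries of `page.get_text("dict")["blocks"]`, i.e. dicts with a `bbox` of `(x0, y0, x1, y1)` and a `type` of 0 for text and 1 for images. The `name` key in the sample data is added purely for readability.

```python
# Index of each coordinate inside a bbox tuple (x0, y0, x1, y1).
COORD_INDEX = {"x0": 0, "y0": 1, "x1": 2, "y1": 3}


def sort_blocks(blocks, by=("y0", "x0"), text_only=False):
    """Hypothetical helper: sort block dicts by the given bbox coordinates.

    blocks    -- list of dicts with "bbox" and "type" keys
    by        -- coordinate names used as sort keys, most significant first
    text_only -- if True, drop image blocks (type != 0) before sorting
    """
    if text_only:
        blocks = [b for b in blocks if b["type"] == 0]
    indices = [COORD_INDEX[name] for name in by]
    return sorted(blocks, key=lambda b: tuple(float(b["bbox"][i]) for i in indices))


# Synthetic blocks standing in for page.get_text("dict")["blocks"].
sample_blocks = [
    {"type": 0, "bbox": (300.0, 72.0, 500.0, 90.0), "name": "right column"},
    {"type": 1, "bbox": (72.0, 100.0, 200.0, 200.0), "name": "image"},
    {"type": 0, "bbox": (72.0, 72.0, 280.0, 90.0), "name": "left column"},
    {"type": 0, "bbox": (72.0, 40.0, 500.0, 60.0), "name": "title"},
]

# Top-to-bottom, then left-to-right, text blocks only.
top_down = sort_blocks(sample_blocks, by=("y0", "x0"), text_only=True)

# Left-to-right, then top-to-bottom, images included.
left_right = sort_blocks(sample_blocks, by=("x0", "y0"))
```

With a real page, `sample_blocks` would be replaced by `page.get_text("dict")["blocks"]`.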

Answered by JorjMcKie

Jul 11, 2023


JorjMcKie · 2023-07-11T09:07:19Z

JorjMcKie
Jul 11, 2023
Maintainer

The get_text() method always uses a TextPage object - which can be either internally created for this execution of the method only, or the one provided via parameter.
The main point is that a TextPage already has the page text preprocessed (by MuPDF) in a way that takes many criteria into account, like font size, the font itself, horizontal character distance and character width, vertical distances to previous and subsequent lines, writing direction/angle and several more.
Based on these heuristics, a text hierarchy like this is established: block -> line -> span -> character.
This algorithm is executed while parsing the page's /Contents source (which contains all appearance-relevant commands) sequentially, from front to back.
No effort is made here to sort the text in any way. For pages created in a decent, canonical way, the result should be fine.
To support other cases (where blocks may occur in some arbitrary sequence), the PyMuPDF get_text() parameter sort sorts the text blocks by ascending y, then x coordinates. More precisely, if bbox is the text block's rectangle, the sort key is (bbox.y1, bbox.x0). This applies to text extraction variants "text", "blocks", "dict" and "rawdict".
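To make the sort key concrete, here is a minimal sketch of the same ordering done by hand, followed by a walk through the block -> line -> span hierarchy. It assumes blocks shaped like those in `page.get_text("dict")["blocks"]`. The sample data and the helper names `sort_like_get_text`, `blocks_to_text` and `make_block` are made up for illustration, and this is not PyMuPDF's internal code.

```python
def sort_like_get_text(blocks):
    """Order blocks by (bottom, left), i.e. the key (bbox.y1, bbox.x0)."""
    return sorted(blocks, key=lambda b: (b["bbox"][3], b["bbox"][0]))


def blocks_to_text(blocks):
    """Walk block -> line -> span and join the span texts line by line."""
    out = []
    for block in blocks:
        if block["type"] != 0:  # skip image blocks, they have no "lines"
            continue
        for line in block["lines"]:
            out.append("".join(span["text"] for span in line["spans"]))
    return "\n".join(out)


def make_block(bbox, *line_texts):
    """Build a synthetic text block with one span per line (illustration only)."""
    return {
        "type": 0,
        "bbox": bbox,
        "lines": [{"spans": [{"text": t}]} for t in line_texts],
    }


# Blocks listed in an arbitrary sequence, as they might occur in /Contents.
blocks = [
    make_block((72, 200, 300, 230), "Footer note"),
    make_block((320, 72, 540, 100), "Right header"),
    make_block((72, 72, 300, 100), "Left header"),
    make_block((72, 110, 540, 180), "Body line 1", "Body line 2"),
]

text = blocks_to_text(sort_like_get_text(blocks))
print(text)
```

With a real page, `blocks` would come from `page.get_text("dict")["blocks"]` instead of the synthetic list.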

For get_text("words"), the same sort key is used, but for each single word bbox - ignoring the above mentioned hierarchy.

All of the above may break, of course. If you see text columns on a page, then you may be looking at text that should just be read column-wise, from left to right.
But you may also be looking at the columns of a table. In this case, you should read row-wise.
In addition, because we sort by the bbox bottom value as the primary sort key, things will go wrong if these bottom values (floats!!!) are not exactly the same for items that belong on the same line. Maybe we should rather have sorted by the bbox top-left coordinates, who knows.
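The float problem can be shown on word tuples of the form `(x0, y0, x1, y1, text, block_no, line_no, word_no)`, which is the shape `get_text("words")` returns. In this sketch the coordinates are invented, and the rounding grid of 3 points (helper `sort_words_rounded`, parameter `step`) is an assumption to be tuned per document, not something PyMuPDF does.

```python
def sort_words_exact(words):
    """Sort by the exact key (y1, x0)."""
    return sorted(words, key=lambda w: (w[3], w[0]))


def sort_words_rounded(words, step=3.0):
    """Sort by (y1 snapped to a grid of `step` points, x0).

    Words whose bottoms differ by less than the grid size usually land in
    the same bucket; values straddling a grid boundary can still be split.
    """
    return sorted(words, key=lambda w: (round(w[3] / step), w[0]))


# One visual line whose words have slightly different bottom values,
# followed by a second line further down the page.
words = [
    (72.0, 100.0, 100.0, 112.04, "Total", 0, 0, 0),
    (110.0, 100.0, 150.0, 111.98, "amount", 0, 0, 1),
    (160.0, 100.0, 190.0, 112.01, "due", 0, 0, 2),
    (72.0, 120.0, 120.0, 132.0, "Thanks", 1, 0, 0),
]

exact = [w[4] for w in sort_words_exact(words)]
rounded = [w[4] for w in sort_words_rounded(words)]
print(exact)    # word order within the first line is scrambled
print(rounded)  # reading order within the first line is restored
```

Rounding is only one option; comparing against a tolerance or clustering the bottom values would be alternatives with their own edge cases.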

So my main point is, that PDF text placement can be quite arbitrary. No sorting rule or approach will always work. There are cases where you need to sort single character bboxes, after rounding the coordinates in some suitable way.
A never-ending story.
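For the single-character case, a rough sketch could look like the following. It assumes character dicts shaped like the `chars` entries of `get_text("rawdict")` (keys `c` and `bbox`). The sample characters and the helper `chars_to_lines` with its `step` parameter are hypothetical, and real documents typically need more care (spaces, rotated text, mixed font sizes).

```python
from collections import defaultdict


def chars_to_lines(chars, step=2.0):
    """Group characters into lines by rounded bottom coordinate, then sort by x0."""
    rows = defaultdict(list)
    for ch in chars:
        x0, _, _, y1 = ch["bbox"]
        rows[round(y1 / step)].append((x0, ch["c"]))
    lines = []
    for key in sorted(rows):  # top to bottom
        lines.append("".join(c for _, c in sorted(rows[key])))  # left to right
    return lines


# Characters in arbitrary order with slightly noisy vertical positions.
chars = [
    {"c": "b", "bbox": (80.0, 100.0, 86.0, 110.1)},
    {"c": "d", "bbox": (72.0, 120.0, 78.0, 130.0)},
    {"c": "a", "bbox": (72.0, 100.0, 78.0, 109.9)},
    {"c": "c", "bbox": (88.0, 100.0, 94.0, 110.0)},
    {"c": "e", "bbox": (80.0, 120.0, 86.0, 129.8)},
]

lines = chars_to_lines(chars)
print(lines)
```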

You may want to look at layout preserving text extraction, which ultimately looks at single characters to reproduce a faithful text layout. Or take a look at this utility script, which aims to detect text columns.


milinddeore · 2023-07-17T03:18:24Z

milinddeore
Jul 17, 2023
Author

@JorjMcKie Many thanks for explaining it in detail, totally appreciate it!

I completely understand the challenge of extracting data from highly formatted documents. Could you please share one more piece of information? I was comparing it with the Adobe Extract API, which uses Sensei AI to extract blocks and document structure, and the results are quite neat. Does PyMuPDF also have such an AI tool in the pipeline?

