Mid sentence font change disrupts word order #2663
-
My hope is to extract text from a PDF in "reading order". I have an example where the sequencing is not what I would expect. I have seen related past discussions here.

In the attached picture, the words labelled 269 and 276 are not in the right order. The same problem occurs elsewhere; these are just two examples. For both of these words, you can see that the bottom of the bounding box does not match the prior words in the same line. Despite this, I would have thought that the sorting would cope with it.

"Resorting" as discussed in #702 is an option, I suppose, but this seems likely to be a common occurrence and shouldn't be a user issue.

The PDF document is publicly available.

Thanks in advance for any advice.

PS: PyMuPDF 1.23.2, macOS 13.5.2
Replies: 3 comments 9 replies
-
The word sorting happens by bottom y (y1), then left x (x0). Even if words appear to be on the same line, a different font or other pesky stuff can cause minute y1-value differences - this is also visible in your picture. So some additional logic is needed that introduces some "forgivingness" here:

import fitz
doc = fitz.open("test.pdf")
page = doc[4]
words = page.get_text("words", sort=True)
line = []  # temp store for 1 line
lines = []  # of all lines
for w in words:
    if not line:
        line.append(w)
        continue
    y1old = line[-1][3]  # get y1 of last word in line
    y1new = w[3]  # get y1 of this word
    if abs(y1new - y1old) <= 3:  # still same line?
        line.append(w)
    else:  # new line
        line.sort(key=lambda w: w[0])  # sort left-to-right
        lines.append(" ".join([w[4] for w in line]))
        line = [w]
if line:  # add last line's text
    line.sort(key=lambda w: w[0])  # sort left-to-right
    lines.append(" ".join([w[4] for w in line]))
for l in lines:
    print(l)
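The same tolerance idea can be tried without a PDF at hand. What follows is only a minimal sketch on made-up word tuples: the function name `group_words_into_lines`, the `tolerance` parameter and the sample coordinates are assumptions for illustration and are not part of PyMuPDF. The tuples only mimic the first five fields of what `page.get_text("words")` returns.

```python
def group_words_into_lines(words, tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines of text.

    Words are first ordered by (y1, x0); a word whose y1 differs from the
    previous word's y1 by at most `tolerance` is treated as being on the
    same line. Each finished line is then re-sorted left to right.
    """
    ordered = sorted(words, key=lambda w: (w[3], w[0]))
    lines = []
    current = []
    for w in ordered:
        if current and abs(w[3] - current[-1][3]) > tolerance:
            lines.append(current)  # y1 jumped: close the current line
            current = []
        current.append(w)
    if current:
        lines.append(current)  # do not forget the last line
    return [
        " ".join(w[4] for w in sorted(line, key=lambda w: w[0]))
        for line in lines
    ]


# Made-up sample: "sentence" and "disrupts" sit slightly higher than
# their neighbours, as can happen after a font change.
sample_words = [
    (10.0, 100.0, 40.0, 112.0, "Mid"),
    (45.0, 99.0, 75.0, 111.4, "sentence"),
    (80.0, 100.0, 120.0, 112.0, "font"),
    (10.0, 114.0, 55.0, 126.0, "change"),
    (60.0, 113.0, 110.0, 125.5, "disrupts"),
]

grouped = group_words_into_lines(sample_words)
for text in grouped:
    print(text)
```

A plain sort by (y1, x0) would put "sentence" before "Mid" here, because its y1 is 0.6 smaller; the tolerance keeps both words on one line before the left-to-right sort.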
-
@JorjMcKie How about something like this, using a Python comparison function adapted to a key function using functools.cmp_to_key? More testing is required, but it seems to fix the reading order on this page (using the magic number of 3 as the code above suggests).

from functools import cmp_to_key
# (x0, y0, x1, y1, "word", block_no, line_no, word_no)
def word_compare(word1, word2):
    # Sort comparison function by two keys: first by the y1 baseline,
    # then by x0. Allow for coordinate jitter of 3 in the
    # y-axis, which may occur due to font change, for example.
    y1_1 = word1[3]
    y1_2 = word2[3]
    ydelta = y1_1 - y1_2
    if abs(ydelta) <= 3:
        return word1[0] - word2[0]  # delta x
    else:
        return ydelta

And for now, called using something like

words = page.get_text('words', clip=rect, sort=False)
wsort = sorted(words, key=cmp_to_key(word_compare))
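To avoid hard-coding the magic number, the comparison could be wrapped in a small factory. This is just a sketch under assumptions: `make_word_key` and the sample tuples are hypothetical names and data for illustration, and only `functools.cmp_to_key` and `sorted` are standard Python.

```python
from functools import cmp_to_key


def make_word_key(tolerance=3.0):
    """Return a sort key that treats y1 values within `tolerance` as equal."""

    def compare(word1, word2):
        ydelta = word1[3] - word2[3]  # difference of the bbox bottoms
        if abs(ydelta) <= tolerance:
            return word1[0] - word2[0]  # same line: order by x0
        return ydelta  # different lines: order by y1

    return cmp_to_key(compare)


# Made-up tuples in the (x0, y0, x1, y1, "word") layout, deliberately
# listed out of reading order.
sample_words = [
    (60.0, 113.0, 110.0, 125.5, "disrupts"),
    (45.0, 99.0, 75.0, 111.4, "sentence"),
    (10.0, 114.0, 55.0, 126.0, "change"),
    (80.0, 100.0, 120.0, 112.0, "font"),
    (10.0, 100.0, 40.0, 112.0, "Mid"),
]

ordered = sorted(sample_words, key=make_word_key(3.0))
ordered_text = [w[4] for w in ordered]
print(" ".join(ordered_text))
```

One caveat, as far as I can tell: a tolerance-based comparison is not strictly transitive, so where y1 drifts gradually across many words the result may depend on the input order. On lines that are clearly separated, as in this sample, that should not matter.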
-
Like I mentioned before: you will always find examples that won't work. But there is always hope. Here is a script version that resolves this. Changes: