Mid sentence font change disrupts word order #2663

glangford · 2023-09-11T12:37:18Z

glangford
Sep 11, 2023

My hope is to extract text from PDF in "reading order". I have an example where the sequencing from get_text('words', sort=True) is disrupted in a strange way when the font changes in the middle of a sentence.

I have seen related past discussions here:
#195
#702
#2396

Here, words labelled 269 and 276 are not in the right order. The same problem occurs elsewhere; these are just two examples.

For both these words, you can see that the bottom of the bounding box does not match prior words in the same line. Despite this, I would have thought that sort, which "sort(s) the output by vertical, then horizontal coordinates" would get the order correct.

"Resorting" as discussed in #702 is an option I suppose but it seems like this is likely to be a common occurrence and shouldn't be a user issue.

The PDF document is publicly available here
https://ai.meta.com/research/publications/seamlessm4t-massively-multilingual-multimodal-machine-translation/

Thanks in advance for any advice.

ps. PyMuPDF 1.23.2, MacOS 13.5.2


JorjMcKie · 2023-09-11T13:26:57Z

JorjMcKie
Sep 11, 2023
Maintainer

The word sorting happens by bottom y (y1), then left x (x0). Even if words appear to be on the same line, a different font or other pesky stuff can cause minute y1-value differences. This is also visible in your picture.
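To make the effect concrete, here is a minimal sketch with made-up word tuples (the coordinates are invented, not taken from the PDF in question). It mimics a sort by (y1, x0) in plain Python, as described above, and shows a word with a slightly smaller y1 jumping ahead of its neighbours on the same visual line.

```python
# Synthetic word tuples in the "words" layout:
# (x0, y0, x1, y1, "word", block_no, line_no, word_no)
# All coordinates are invented for illustration.
words = [
    (10.0, 100.0, 40.0, 112.0, "the", 0, 0, 0),
    (45.0, 100.5, 90.0, 111.6, "italic", 0, 0, 1),  # other font: y1 is 0.4 smaller
    (95.0, 100.0, 130.0, 112.0, "word", 0, 0, 2),
]

# Sort by bottom (y1), then left (x0), mimicking the behaviour described above.
by_bottom_then_left = sorted(words, key=lambda w: (w[3], w[0]))
print([w[4] for w in by_bottom_then_left])  # "italic" comes out first
```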

So some additional logic is needed that introduces some "forgivingness" here:

import fitz

doc = fitz.open("test.pdf")
page = doc[4]
words = page.get_text("words", sort=True)
line = []  # temp store for 1 line
lines = []  # of all lines
for w in words:
    if not line:
        line.append(w)
        continue
    y1old = line[-1][3]  # get y1 of last word in line
    y1new = w[3]  # get y1 of this word
    if abs(y1new - y1old) <= 3:  # still same line?
        line.append(w)
    else:  # new line
        line.sort(key=lambda w: w[0])  # sort left-to-right
        lines.append(" ".join([w[4] for w in line]))
        line = [w]
if line:  # add last line's text
    line.sort(key=lambda w: w[0])  # the last line needs the left-to-right sort too
    lines.append(" ".join([w[4] for w in line]))
for l in lines:
    print(l)
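The same grouping logic can be wrapped in a function so it can be tried without a PDF. This is only a sketch: the name `group_into_lines`, the `tol` parameter and the sample tuples are made up for illustration, and the default of 3 is simply the magic number from the script above.

```python
def group_into_lines(words, tol=3):
    """Group word tuples (already sorted by y1, then x0) into text lines.

    `tol` is the largest y1 difference still treated as "same line".
    """
    lines = []
    line = []  # words collected for the current line
    for w in words:
        if line and abs(w[3] - line[-1][3]) > tol:
            # bottom moved too far: finish the current line
            line.sort(key=lambda item: item[0])  # left-to-right
            lines.append(" ".join(item[4] for item in line))
            line = []
        line.append(w)
    if line:  # flush the last line, sorted as well
        line.sort(key=lambda item: item[0])
        lines.append(" ".join(item[4] for item in line))
    return lines


# Invented tuples, in the order a (y1, x0) sort would deliver them.
sample = [
    (45.0, 100.5, 90.0, 111.6, "italic", 0, 0, 1),
    (10.0, 100.0, 40.0, 112.0, "the", 0, 0, 0),
    (95.0, 100.0, 130.0, 112.0, "word", 0, 0, 2),
    (10.0, 114.0, 40.0, 126.0, "next", 0, 1, 0),
    (50.0, 114.0, 80.0, 126.0, "line", 0, 1, 1),
]
for text in group_into_lines(sample):
    print(text)
```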

4 replies

glangford Sep 11, 2023
Author

Thanks for this @JorjMcKie, this is very helpful.

Will this absolute value of the difference in y approach be something that is included in the library? It seems like the current approach of sorting by the bottom of y is problematic.

What about sorting guided by the midline? Would that be more robust, in your view?

JorjMcKie Sep 11, 2023
Maintainer

> What about sorting guided by the midline? Would that be more robust, in your view?

Could be. As usual, there are always a few cases where things break. For example, some PDF creators write the characters on the page in an arbitrary permutation, just to inhibit successful extraction. This would cause the above logic to fail.
Actually, a benevolent PDF creator will try to write words so that they appear well positioned with respect to their bottom. MuPDF also assumes this and generates the TextPage object accordingly in the hierarchy blocks -> lines -> spans -> characters.
The "words" extraction simply ignores the span level and produces words as those strings that do not contain delimiters.
Therefore, y1 values (what MuPDF calls the "baseline") should be more consistent/stable than other things, because MuPDF wouldn't otherwise have put them in the same line. Think for example of footnote references immediately following a word: they usually have a y0 further up than the characters of the word to which they are attached. A "y0/y1" average would introduce an inaccuracy in this case.
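The footnote argument can be put into numbers. The sketch below uses invented bboxes and assumes, as the comment above argues, that a footnote mark shares the bottom of the word it follows while its top sits higher. Under that assumption the bottoms agree exactly, whereas the midlines drift apart.

```python
# Invented bboxes (x0, y0, x1, y1, "word") for a word and its footnote mark.
# Assumption: same bottom (y1), but the mark's top (y0) is further up (smaller y).
word = (10.0, 100.0, 60.0, 112.0, "result")
mark = (60.5, 96.0, 65.0, 112.0, "3")


def midline(w):
    """Vertical centre of a bbox: the (y0 + y1) / 2 average."""
    return (w[1] + w[3]) / 2


bottom_delta = abs(word[3] - mark[3])
mid_delta = abs(midline(word) - midline(mark))
print(bottom_delta, mid_delta)
```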

As for replacing the current sort algorithm:
I am not sure. The "words" extraction is a high-speed variant. Even the extremely fast Python sort keeps it that way. My recipe above is a relative slow-down.
What I might consider is accepting a user-provided callable that does the sorting. Or sort by fitz.IRect versions of the bboxes - that has a good chance to equal out the worst cases.

glangford Sep 11, 2023
Author

> Or sort by fitz.IRect versions of the bboxes - that has a good chance to equal out the worst cases.

I liked the sound of this at first - but I tweaked the rendering of the bboxes for this document and it looks like the rounding for IRect doesn't result in a consistent baseline. At least in this case.

Thanks for the guidance, I will experiment further.
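Why integer rounding may not give a consistent baseline can be seen with two numbers. This sketch does not call fitz.IRect; `math.ceil` stands in for integer rounding as an assumption, and the coordinates are invented. Two bottoms only 0.3 apart land on different integers because they straddle a rounding boundary.

```python
import math


def rounded_bottom(word):
    # Assumption for this sketch: the bottom is rounded up to the next integer.
    return math.ceil(word[3])


a = (10.0, 100.0, 40.0, 111.9, "the", 0, 0, 0)
b = (45.0, 100.5, 90.0, 112.2, "italic", 0, 0, 1)

print(abs(a[3] - b[3]) <= 3)                 # within the tolerance used above
print(rounded_bottom(a), rounded_bottom(b))  # yet the rounded bottoms differ
```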

JorjMcKie Sep 11, 2023
Maintainer

I know - it doesn't do it here 😒.

glangford · 2023-09-11T19:02:33Z

glangford
Sep 11, 2023
Author

> What I might consider is accepting a user-provided callable that does the sorting.

@JorjMcKie How about something like this, using a Python comparison function adapted to a key function using functools. This allows comparison in both y and x directions in one call.

More testing required, but it seems to work to fix the reading order on this page (using the magic number of 3 as the code above suggests).

from functools import cmp_to_key
# (x0, y0, x1, y1, "word", block_no, line_no, word_no)
def word_compare(word1, word2):
    # Sort comparison function by two keys: first by the y1 baseline,
    # then by x0. Allow for coordinate jitter of 3 in the 
    # y-axis, which may occur due to font change, for example.
    y1_1 = word1[3]
    y1_2 = word2[3]
    ydelta = y1_1 - y1_2
    if abs(ydelta) <= 3:
        return word1[0] - word2[0] # delta x
    else:
        return ydelta

And for now, called using something like

words = page.get_text('words', clip=rect, sort=False)
wsort = sorted( words, key=cmp_to_key(word_compare) )
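One property worth checking before relying on a tolerance-based comparison is transitivity. The sketch below repeats the comparison function from above (with the tolerance exposed as a `tol` parameter, which is an addition for illustration) and feeds it three invented words whose bottoms are 2.5 apart each. Neighbours count as "same line", but the first and the last do not, so the comparison contradicts itself and the sorted order may then depend on the input order.

```python
from functools import cmp_to_key


def word_compare(word1, word2, tol=3):
    """Compare by y1 first, then x0, treating y1 differences <= tol as equal."""
    ydelta = word1[3] - word2[3]
    if abs(ydelta) <= tol:
        return word1[0] - word2[0]  # same line: compare x0
    return ydelta


# Invented tuples: bottoms at 100, 102.5 and 105, with decreasing x0.
a = (50.0, 88.0, 70.0, 100.0, "a", 0, 0, 0)
b = (30.0, 90.5, 45.0, 102.5, "b", 0, 0, 1)
c = (10.0, 93.0, 25.0, 105.0, "c", 0, 0, 2)

print(word_compare(a, b) > 0)  # a sorts after b (same line, larger x0)
print(word_compare(b, c) > 0)  # b sorts after c (same line, larger x0)
print(word_compare(a, c) < 0)  # but a sorts before c (different lines)

# sorted() still returns some permutation of the input.
result = sorted([a, b, c], key=cmp_to_key(word_compare))
```

For typical pages, where line spacing is much larger than the tolerance, such chains should be rare, but this is an assumption about the input rather than a guarantee.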

1 reply

JorjMcKie Sep 11, 2023
Maintainer

Yes, something like that. Although the sort sequence of words within the same line is not often the problem ... but it does happen too!

glangford · 2023-09-11T21:06:19Z

glangford
Sep 11, 2023
Author

Here is another odd example from the same document (which evades the solution above). The percent symbols 71 and 95 are fine, but the bounding boxes for 149 and 162 descend well below the line. Something to do with ~ maybe.

4 replies

JorjMcKie Sep 11, 2023
Maintainer

Like I mentioned before: you will always find examples that won't work. But there is always hope. Here is a script version that resolves this. Changes:

- set the global variable that makes extraction use minimum bbox heights
- compare not only the word bottom but also the top: if either is sufficiently close, we accept the word as belonging to the same line.

import fitz

fitz.TOOLS.set_small_glyph_heights(True)
doc = fitz.open("test.pdf")
page = doc[5]
words = page.get_text("words", sort=True)
line = []  # temp store for 1 line
lines = []  # of all lines
for w in words:
    if not line:
        line.append(w)
        continue
    y1old = line[-1][3]  # get y1 of last word in line
    y0old = line[-1][1]  # get y0 of last word in line
    y1new = w[3]  # get y1 of this word
    y0new = w[1]  # get y0 of this word
    if abs(y1new - y1old) <= 3 or abs(y0new - y0old) <= 3:  # still same line?
        line.append(w)
    else:  # new line
        line.sort(key=lambda w: w[0])  # sort left-to-right
        lines.append(" ".join([w[4] for w in line]))
        line = [w]
if line:  # add last line's text
    line.sort(key=lambda w: w[0])  # the last line needs the left-to-right sort too
    lines.append(" ".join([w[4] for w in line]))
for l in lines:
    print(l)
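The "top or bottom" test on its own can be tried with invented numbers. In this sketch the helper name `same_line` and the tuples are hypothetical; the second word imitates a bbox that descends well below the line, so only its top matches the neighbour.

```python
def same_line(prev, cur, tol=3):
    """True if `cur` shares (approximately) its top or its bottom with `prev`."""
    bottoms_close = abs(cur[3] - prev[3]) <= tol
    tops_close = abs(cur[1] - prev[1]) <= tol
    return bottoms_close or tops_close


# Invented tuples: (x0, y0, x1, y1, "word", block_no, line_no, word_no)
percent = (10.0, 100.0, 40.0, 112.0, "71%", 0, 0, 0)
tilde = (45.0, 100.2, 80.0, 118.0, "~149", 0, 0, 1)  # bottom descends by 6
below = (10.0, 114.0, 40.0, 126.0, "next", 0, 1, 0)  # a word on the next line

print(same_line(percent, tilde))  # tops are 0.2 apart
print(same_line(tilde, below))    # neither top nor bottom is close
```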

Answer selected by glangford

JorjMcKie Sep 11, 2023
Maintainer

The minimum height variable takes care of bboxes of the following sizes:

glangford Sep 11, 2023
Author

Thanks @JorjMcKie , I didn't know if it was legitimate in the general case to include comparison of the word top as well. Good stuff.

Here is the compare version updated to compare word tops similarly, and compacted a bit for readability. I have tested it on the problem areas of this particular document and it seems to work well. I will keep this thread open for a bit in case there are additional ideas!

from functools import cmp_to_key
# (x0, y0, x1, y1, "word", block_no, line_no, word_no)
def word_compare(word1, word2):
    # Sort comparison function by multiple keys: first by the y coordinates,
    # then by x0. Allow for coordinate jitter of 3 in the 
    # y-axis, which may occur due to font change, for example.
    ydelta_bottom = word1[3] - word2[3]
    ydelta_top = word1[1] - word2[1]
    # If either y value is within jitter, sort by x0
    if abs(ydelta_bottom) <= 3 or abs(ydelta_top) <= 3:
        return word1[0] - word2[0] # delta x
    else:
        return ydelta_bottom

JorjMcKie Sep 12, 2023
Maintainer

Well, my argument would be:
If two words (approximately) share either their top or their bottom coordinate, what situation would we have to imagine in which they do not belong to the same line?
More generally, we might also l
The algorithm silently depends on the following:

- The initial sort delivers contiguous sub-sequences of words that belong to the same line (even if they are not initially in the right left-to-right sequence). Maybe your idea of sorting by the mean values (y0 + y1)/2 would help ensure this.
- The characters of each word occur contiguously (mostly the case, but not guaranteed). If single characters have been stored separately, additional logic would have to join characters back into words whenever char1.x1 is close enough to the following char2.x0. Which immediately raises the question: WTF is "close enough"?
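As a rough illustration of the character-joining idea, here is a sketch that merges characters of one line into words whenever the horizontal gap stays below a threshold. Everything here is hypothetical: the `join_chars` name, the `(x0, x1, char)` layout, and above all the `max_gap` value of 1.0, which is exactly the "close enough" that has no general answer.

```python
def join_chars(chars, max_gap=1.0):
    """Join characters into words.

    `chars` is a list of (x0, x1, character) tuples from one line, sorted by x0.
    A new word starts when the gap to the previous character exceeds `max_gap`.
    """
    words = []
    current = ""
    prev_x1 = None
    for x0, x1, ch in chars:
        if prev_x1 is not None and x0 - prev_x1 > max_gap:
            words.append(current)  # gap too wide: close the current word
            current = ""
        current += ch
        prev_x1 = x1
    if current:
        words.append(current)
    return words


# Invented character positions with one wide gap between "i" and "y".
chars = [
    (0.0, 5.0, "H"), (5.2, 9.0, "i"),
    (13.0, 18.0, "y"), (18.1, 23.0, "o"), (23.3, 28.0, "u"),
]
print(join_chars(chars))
```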
