Can't extract sequential text using sort = True #1901

hemilparmar · 2022-08-30T06:01:30Z

When I extract text using sort=True, the text gets mixed up seemingly at random. When I do not use sort, I get the right output.
Can you please help me?

fitz version : 1.20.2

files :
test1.pdf
test2.pdf

        Code:
        import fitz  # PyMuPDF

        with fitz.open(filename) as doc:
            for page in doc:
                print(f"*******************{page}*******************")
                text = page.get_text(sort=True)  # wrong output
                print(text)
                # ---------------------------------------------------------------
                text = page.get_text()  # right output
                print(text)

Answered by JorjMcKie

hemilparmar (Author) · 2022-08-30T06:02:37Z

Please check the issue above.

JorjMcKie (Maintainer) · 2022-08-30T07:51:34Z

This is not an issue, but a Discussions item.

JorjMcKie (Maintainer) · 2022-08-30T08:45:16Z

Please be more specific: what does "randomly" mean? These are multi-page documents, so a page number would help.
In test1.pdf, page 0, there is no difference between using sort=True and not using it.

As a general comment:
There is never a guarantee that sort=True will deliver text in the sequence you want. The reason is how PDF works, not (Py)MuPDF. Every single character can be stored internally in an arbitrary sequence. For the 1090 characters of page 0 in test1.pdf, this means there are 1090! ≈ 2.1E+2839 different ways to produce the exact same page appearance. I tried to demonstrate this with the two files file1 and file2. They look the same, but file2 has its characters stored in random sequence. For a file like that, you will never be able to extract the text in reading sequence, whether you use sort=True or not. There, sorting by each character's coordinates and then stitching the result together is the only way.
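To make "sorting by each character's coordinate and then stitching together the result" more concrete, here is a minimal sketch in plain Python. It is not PyMuPDF code: the `(x, y, char)` tuple shape, the function name `stitch_chars` and the `line_tol` tolerance are assumptions made for illustration. In practice the coordinates could come from something like the character origins in `page.get_text("rawdict")`, and real pages would need more care (missing spaces, varying font sizes, rotated text).

```python
import random
from typing import List, Tuple

# Hypothetical input shape: (x, y, char), with y growing downward
# as in PyMuPDF page coordinates.
Char = Tuple[float, float, str]


def stitch_chars(chars: List[Char], line_tol: float = 2.0) -> str:
    """Rebuild text from characters that are stored in arbitrary order."""
    lines: List[List[Char]] = []
    # Visit characters top to bottom; a character joins the current line
    # if its y value is within line_tol of that line's first character.
    for ch in sorted(chars, key=lambda c: c[1]):
        if lines and abs(lines[-1][0][1] - ch[1]) <= line_tol:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Inside each line, order characters left to right and join them.
    return "\n".join(
        "".join(c[2] for c in sorted(line, key=lambda c: c[0]))
        for line in lines
    )


# Two made-up lines of text whose characters are then shuffled,
# similar in spirit to the file2 demonstration.
page_chars = [(10.0 + 6 * i, 100.0, c) for i, c in enumerate("Hello")]
page_chars += [(10.0 + 6 * i, 114.0, c) for i, c in enumerate("World")]
random.shuffle(page_chars)

print(stitch_chars(page_chars))
```

The storage order no longer matters here because only the coordinates decide the output, which is the point of the comment above.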

Back to your case.

"Naive" text extraction, get_text("text"), reads each character in its natively stored sequence.
Block-based extraction, get_text("blocks"), relies on MuPDF's heuristics to group the single characters into text lines, and those in turn into text blocks. These heuristics use a number of criteria such as inter-character distance in relation to font size, inter-line distance, etc. More often than not, this strategy leads to a satisfactory page structure. Naive extraction with sort=True actually uses the blocks method and sorts the block text by the block rectangle coordinates.

But again, there can be no guarantee that you get text in your preferred reading sequence. Your PDF has a two-column page layout. Text in each column may consist of multiple blocks. If you sort those by (vertical, horizontal) coordinates, which is what sort=True does, then you obviously get confusing output.
PyMuPDF is not aware of the two-column page layout (on some pages, not all). So it doesn't know that in this case it might be better to stick with the native character sequence.
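A small illustration of why sorting blocks by (vertical, horizontal) coordinates mixes up two columns. This is a sketch on made-up data, not PyMuPDF itself: the block tuples only mimic the first five fields of a "blocks" item `(x0, y0, x1, y1, text)`, the sort key `(y1, x0)` is assumed to match the one used in the helper function posted in this thread, and `sort_blocks` and the 600-point page width are hypothetical.

```python
from typing import List, Tuple

# Mimics the first five fields of a PyMuPDF "blocks" item.
Block = Tuple[float, float, float, float, str]


def sort_blocks(blocks: List[Block]) -> List[Block]:
    """Sort by bottom edge first, then by left edge (assumed sort key)."""
    return sorted(blocks, key=lambda b: (b[3], b[0]))


# Made-up two-column page, 600 points wide. The blocks in the two
# columns do not end at the same heights.
blocks: List[Block] = [
    (50, 50, 290, 100, "L1 "),
    (50, 110, 290, 160, "L2 "),
    (310, 50, 550, 90, "R1 "),
    (310, 100, 550, 160, "R2 "),
]

# Sorting all blocks together interleaves the columns.
global_order = "".join(b[4] for b in sort_blocks(blocks))

# Sorting each page half separately keeps every column intact.
mid = 600 / 2
left = [b for b in blocks if b[2] <= mid]
right = [b for b in blocks if b[0] >= mid]
column_order = "".join(b[4] for b in sort_blocks(left) + sort_blocks(right))

print(global_order)
print(column_order)
```

The first line printed starts with a right-column block because that block happens to end higher on the page than its left neighbour; the second line reads the left column completely before the right one.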

1 reply

JorjMcKie Aug 30, 2022
Maintainer

What might help in your case is this little function. It checks whether the page has two text columns. If so, it sorts the left and the right blocks separately and returns their text joined together; otherwise it returns the text in standard sorting order:

import fitz  # PyMuPDF


def check2columns(page):
    """Return the page text, left column before right column if the page has two columns."""

    def sort_key(bb):
        # sort by bottom coordinate (y1), then by left coordinate (x0)
        return (bb[3], bb[0])

    prect = page.rect
    left = +prect  # copy of the page rectangle: left half
    left.x1 = prect.width / 2
    right = +prect  # copy of the page rectangle: right half
    right.x0 = prect.width / 2
    lblocks = []  # blocks contained in the left page half
    rblocks = []  # blocks contained in the right page half
    blocks = page.get_text("blocks", flags=fitz.TEXTFLAGS_TEXT)
    for b in blocks:
        bbox = fitz.Rect(b[:4])  # block rectangle
        if bbox in left:
            lblocks.append(b)
        elif bbox in right:
            rblocks.append(b)
        else:
            # any block overlapping left AND right leads to
            # outputting everything in standard sorting order
            return "".join([b[4] for b in sorted(blocks, key=sort_key)])
    # sort left and right blocks separately, then join their text
    return "".join(
        [b[4] for b in sorted(lblocks, key=sort_key) + sorted(rblocks, key=sort_key)]
    )
