Can't extract sequential text using sort = True #1901
When I try to extract text with sort=True, the text gets mixed up randomly. When I do not use sort, I get the right output. fitz version: 1.20.2
Replies: 3 comments 1 reply
Please check the above issue.
This is not an issue, but a discussion item.
Back to your case.
But again, there is no guarantee that text comes out in your preferred reading sequence. Your PDF has a two-column page layout, and the text in each column may consist of multiple blocks. If you sort those blocks by (vertical, horizontal) coordinates, which is what sort=True does, you will obviously get confusing output.
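The two-column effect described above can be reproduced without any PDF at all. The following is only a minimal sketch: the block coordinates and texts are made up, the tuple layout just loosely mirrors what page.get_text("blocks") returns, and both helper functions as well as the split position split_x are hypothetical names introduced for illustration, not part of PyMuPDF.

```python
# Hypothetical text blocks of a two-column page: (x0, y0, x1, y1, text).
# Coordinates and strings are invented for illustration only.
blocks = [
    (50, 100, 280, 190, "L1"),
    (50, 200, 280, 290, "L2"),
    (50, 300, 280, 390, "L3"),
    (320, 110, 550, 180, "R1"),
    (320, 190, 550, 310, "R2"),
    (320, 320, 550, 390, "R3"),
]


def sort_top_down(blocks):
    """Order blocks by (vertical, horizontal) position, ignoring columns."""
    return sorted(blocks, key=lambda b: (b[1], b[0]))


def sort_by_column(blocks, split_x):
    """Order blocks column by column, assuming one known split position."""
    left = [b for b in blocks if b[0] < split_x]
    right = [b for b in blocks if b[0] >= split_x]
    return sort_top_down(left) + sort_top_down(right)


# A purely coordinate-based sort interleaves the two columns ...
naive = [b[4] for b in sort_top_down(blocks)]
# ... while sorting each column on its own keeps the reading order.
columns = [b[4] for b in sort_by_column(blocks, split_x=300)]

print(naive)
print(columns)
```

With a real page, one possible approach along the same lines would be to extract each column separately by passing a rectangle via the clip parameter of page.get_text(), assuming the column boundaries are known in advance.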
Please be more specific: what does randomly mean? These are multi-page documents, so a page number would help.
On page 0 of test1.pdf, it makes no difference whether sort=True is used or not.
As a general comment:
There never is a guarantee that sort=True will deliver text in a sequence you like. The reason is how PDF works, not (Py)MuPDF. Every single character can be stored internally in an arbitrary sequence. For the 1090 characters of page 0 in test1.pdf, this means there are 1090! ≈ 2.1E+2839 different ways to produce the exact same page appearance. I tried to demonstrate this with the two files file1 and file2. They look the same, but file2 has its characters stored in random sequence. S…
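To get a feeling for the 1090! figure and for what "stored in random sequence" means, here is a small sketch in plain Python. It does not touch any PDF: the sample string and the (x, character) pairs are invented stand-ins for glyphs with positions, and the shuffle only imitates an arbitrary storage order.

```python
import math
import random

# Number of possible storage orders for 1090 characters.
n_chars = 1090
digits = str(math.factorial(n_chars))
print(f"{n_chars}! has {len(digits)} digits, starting with {digits[:2]}...")

# Hypothetical one-line "page": every character carries its x position.
text = "PDF stores glyphs, not reading order"
chars = [(x, ch) for x, ch in enumerate(text)]

# Same glyphs, same positions, but stored in an arbitrary order.
stored = chars[:]
random.Random(0).shuffle(stored)

# Reading in storage order versus reading by position.
as_stored = "".join(ch for _, ch in stored)
by_position = "".join(ch for _, ch in sorted(stored))

print(as_stored)
print(by_position)
```

Sorting by position recovers the line here only because the toy example has a single line and a single column. On a real page the same idea runs into the layout problems discussed in this thread.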