TEXT_DEHYPHENATE not working properly #1926
Replies: 8 comments
-
I believe the issue is that the text extraction is identifying different
-
Ah, have you confirmed this is the case here?
-
Just tested it: you are right! In this case, each line has a height of 12.74, and the distance from one line's bottom to the next line's top is 4.3. So you had the right idea: this example is not suitable for dehyphenation.
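For anyone wanting to check the same measurements on their own file, here is a minimal sketch of the arithmetic. It assumes line bounding boxes of the form (x0, y0, x1, y1) with y growing downward, which is the shape PyMuPDF reports for each line in `page.get_text("dict")`. The helper name `line_gaps` and the coordinates are made up for illustration; only the height of 12.74 and the gap of 4.3 come from the numbers above.

```python
def line_gaps(line_bboxes):
    """Return (line_height, gap_to_next_line) for each consecutive pair.

    Each bbox is (x0, y0, x1, y1) with y growing downward.
    """
    result = []
    for upper, lower in zip(line_bboxes, line_bboxes[1:]):
        height = upper[3] - upper[1]   # bottom minus top of the upper line
        gap = lower[1] - upper[3]      # top of next line minus bottom of this one
        result.append((round(height, 2), round(gap, 2)))
    return result


# Hypothetical coordinates chosen to match the reported height and gap.
bboxes = [
    (72.0, 100.0, 500.0, 112.74),
    (72.0, 117.04, 500.0, 129.78),
]
pairs = line_gaps(bboxes)
for height, gap in pairs:
    print(f"height={height}, gap={gap}, gap/height={gap / height:.2f}")
```

A gap of roughly a third of the line height is apparently enough here for each line to end up in its own block, which is why the flag has nothing to join.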
-
Based on the insight presented by your example, we will insert a comment in the documentation.
-
I'll be sure to update https://pymupdf.readthedocs.io/en/latest/vars.html?highlight=dehyphenate#TEXT_DEHYPHENATE with some notes soon. Going forward, maybe we could parameterise line height (or something similar) alongside this flag, so that lines are considered part of the same block? No idea whether that is feasible.
-
I am afraid this would have to happen inside MuPDF's text page logic. Any change we might want to introduce has consequences that also apply to things like text search, not to mention that subsequent lines may not have the same inclination angle. Also, if the text is not coded in reading sequence, the whole thing breaks down anyway.
-
I think this issue has now turned into a discussion item, so let me transfer it there.
-
"We might think about increasing the threshold WRT inter-line distances - which in this case seems to be the one reason why each line lives in its own block." I think this is a wise choice, since visually the lines do seem to belong in the same block. I have written my own Python code to merge blocks where the last line of the first block and the first line of the next fit some criteria (relative vertical distance, horizontal position, etc.). This solved the issue.
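A rough sketch of that kind of merge step, for illustration only; it is not the code referred to above. It assumes blocks are given as (x0, y0, x1, y1, text) tuples, a simplified version of what `page.get_text("blocks")` returns, and that each block holds a single line as in this thread. The function name `merge_blocks` and both thresholds are hypothetical and would need tuning per document.

```python
def merge_blocks(blocks, max_gap_ratio=0.5, max_x_shift=5.0):
    """Merge vertically adjacent blocks and join hyphenated line ends.

    blocks: iterable of (x0, y0, x1, y1, text), y growing downward.
    max_gap_ratio: largest allowed gap, relative to the next block's height.
    max_x_shift: largest allowed difference between left edges.
    """
    merged = []
    for x0, y0, x1, y1, text in sorted(blocks, key=lambda b: b[1]):
        if merged:
            px0, py0, px1, py1, ptext = merged[-1]
            gap = y0 - py1
            close = -1.0 <= gap <= max_gap_ratio * (y1 - y0)
            aligned = abs(x0 - px0) <= max_x_shift
            if close and aligned:
                if ptext.endswith("-"):
                    # Naive dehyphenation: also joins genuine compounds.
                    joined = ptext[:-1] + text
                else:
                    joined = ptext + " " + text
                merged[-1] = (min(px0, x0), py0, max(px1, x1), y1, joined)
                continue
        merged.append((x0, y0, x1, y1, text))
    return merged


# Hypothetical input: two lines 4.3 apart, then a distant third block.
blocks = [
    (72.0, 100.0, 500.0, 112.74, "This is a hyphen-"),
    (72.0, 117.04, 500.0, 129.78, "ated word."),
    (72.0, 200.0, 500.0, 212.74, "New paragraph."),
]
result = merge_blocks(blocks)
for block in result:
    print(block[4])
```

Doing this outside the library leaves text search and the block structure MuPDF reports untouched, at the cost of relying on the blocks arriving in something close to reading order.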
-
Bug report
Running text extraction with TEXT_DEHYPHENATE does not produce the expected behaviour for the following pdf: issue_one_page.pdf. (But it does work correctly on other pages...)
To reproduce, run the following code on the pdf issue_one_page.pdf.
This gives