Extract text from region in reading order #1451
-
Some very short background: I am trying to create a
Now to get to the point: I would like to be able to select text regions in reading order (over multiple lines). Checking out the
However, checking out the
Here follow two screenshots of the functionality I would like to see, the first is of Emacs with pdf-tools using the current
Thank you!
Replies: 8 comments 3 replies
-
There have been a number of text processing enhancements lately. For your purposes, let
-
Thank you for your quick reply. Your solution works, but it is not ideal for the following two (or three) reasons:
Here follows a screenshot of some active area using your trick (I did not yet bother to remove the 'items' from the left and the right, as I do not see how to use the result anyway, i.e. create the single annotation using multiple rectangles for markup).
-
Maybe we talked a little past each other: Where the red dots are
-
So the effort reduces to finding the A and B coordinates and, potentially, the left and right borders of the page text columns that the highlight should be restricted to. BTW: the above works the same way for underline, strikethrough and squiggly annotations.
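As a rough sketch of the "works the same way" remark: to my knowledge, PyMuPDF's `add_highlight_annot`, `add_underline_annot`, `add_strikeout_annot` and `add_squiggly_annot` all accept `start`, `stop` and `clip` keyword arguments. The helper name `add_markup` and the `MARKUP_METHODS` table below are made up for illustration and are not part of the library.

```python
# Hypothetical helper (not part of PyMuPDF): choose the markup type by name.
# Assumption: `page` is a PyMuPDF Page and each method listed here accepts
# the keyword arguments start, stop and clip.
MARKUP_METHODS = {
    "highlight": "add_highlight_annot",
    "underline": "add_underline_annot",
    "strikeout": "add_strikeout_annot",
    "squiggly": "add_squiggly_annot",
}


def add_markup(page, kind, start, stop, clip=None):
    """Create one markup annotation running from point `start` to point `stop`.

    `clip` is an optional rectangle (for example the text column) that the
    markup should be restricted to.
    """
    try:
        method_name = MARKUP_METHODS[kind]
    except KeyError:
        raise ValueError(f"unknown markup kind: {kind!r}") from None
    # Look up the bound method on the page and forward the three arguments.
    return getattr(page, method_name)(start=start, stop=stop, clip=clip)
```

With a real page this would be called as, say, `add_markup(page, "underline", A, B, clip=column_rect)`, where `column_rect` is whatever rectangle you determined for the text column.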
-
I am going to move this thread to the
-
Well, somehow the user selection must be communicated to your app. If you control the GUI, then you should be able to get this info. You could instruct the user to mark the start and, potentially as a separate interaction, the end of the region to mark. Each could be just one character. There is a new rectangle method
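One possible way to turn two user clicks into the A and B points is sketched below. It assumes the GUI delivers both clicks in page coordinates and that `words` has the shape returned by `page.get_text("words")`, i.e. tuples `(x0, y0, x1, y1, text, block_no, line_no, word_no)`. The function names are hypothetical, and ordering by block/line/word number only follows extraction order, which need not match the visual reading order on every page.

```python
def snap_to_word(words, point):
    """Return the word tuple whose rectangle lies closest to `point`."""
    px, py = point

    def squared_distance(word):
        x0, y0, x1, y1 = word[:4]
        # Distance is zero when the point lies inside the word rectangle.
        dx = max(x0 - px, 0, px - x1)
        dy = max(y0 - py, 0, py - y1)
        return dx * dx + dy * dy

    return min(words, key=squared_distance)


def selection_endpoints(words, click_a, click_b):
    """Map two clicks to (A, B): top-left of the first word, bottom-right of the last."""
    first = snap_to_word(words, click_a)
    last = snap_to_word(words, click_b)
    # Swap if the user clicked the end of the region before the start.
    if first[5:8] > last[5:8]:
        first, last = last, first
    return (first[0], first[1]), (last[2], last[3])


# Tiny made-up page: two words on the first line, one on the second.
sample_words = [
    (10, 10, 40, 20, "Hello", 0, 0, 0),
    (45, 10, 80, 20, "world", 0, 0, 1),
    (10, 25, 50, 35, "second", 0, 1, 0),
]
A, B = selection_endpoints(sample_words, (48, 30), (12, 12))
print(A, B)  # (10, 10) (50, 35)
```

The resulting points could then be passed as `start=A, stop=B` to the annotation call.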
-
Well, I don't know what happens in detail on your side, but this is my approach, which works just fine:
>>> import fitz
>>> doc=fitz.open("v110-changes.pdf")
>>> page=doc[0]
>>> rl1 = page.search_for("pixmaps coming from")
>>> len(rl1)
1
>>> A = rl1[0].tl
>>> rl2=page.search_for("needs to be checked.")
>>> len(rl2)
1
>>> B = rl2[0].br
>>> page.add_highlight_annot(start=A, stop=B)
'Highlight' annotation on page 0 of v110-changes.pdf
>>> doc.ez_save("x.pdf")
[Screenshots: the page before and after adding the highlight]
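For reuse outside the interactive prompt, the same steps could be wrapped roughly as follows. This is only a sketch: `single_hit` and `highlight_between` are hypothetical names, and the strict "exactly one hit" check mirrors the `len(...)` checks in the session. Note that `search_for` may return several rectangles for one occurrence that wraps across lines, in which case the check would need relaxing.

```python
def single_hit(hits, needle):
    """Return the only rectangle in `hits`; raise if it is missing or ambiguous."""
    if len(hits) != 1:
        raise ValueError(
            f"expected exactly one hit for {needle!r}, got {len(hits)}"
        )
    return hits[0]


def highlight_between(pdf_path, first_phrase, last_phrase, out_path, page_no=0):
    """Highlight everything from `first_phrase` to `last_phrase` on one page."""
    import fitz  # PyMuPDF; imported here so single_hit works without it

    doc = fitz.open(pdf_path)
    page = doc[page_no]
    # A is the top-left of the first phrase, B the bottom-right of the last.
    A = single_hit(page.search_for(first_phrase), first_phrase).tl
    B = single_hit(page.search_for(last_phrase), last_phrase).br
    page.add_highlight_annot(start=A, stop=B)
    doc.ez_save(out_path)
    doc.close()
```

The session above would then correspond to `highlight_between("v110-changes.pdf", "pixmaps coming from", "needs to be checked.", "x.pdf")`.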
-
If it fails only for some PDFs, then chances are that they contain unclear / sloppy "geometry changes", as explained in the recipes chapter.