Extract links while extracting text? #3206
Replies: 2 comments 14 replies
Links have nothing to do with text. Admittedly, the picture is blurred because a link rectangle usually covers text. But that remains a coincidence: a link can just as well cover an image, or nothing at all. With this approach, chances are that not every link in the links list will be matched up with fitting text ...
Something like this:

    rect = link["from"]  # the link "hot area"
    h = rect.height * 0.1  # 10% of the link rectangle's height
    smaller = rect + (0, h, 0, -h)  # same width, but 20% less height (10% trimmed at top and bottom)
    text = page.get_textbox(smaller)

The hope is that no "foreign" characters are included any longer, while all relevant link text still is.
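
To connect this to the HTML output asked about, here is a minimal sketch rather than a definitive solution. It assumes `page` is a PyMuPDF page, so that `page.get_links()` returns dictionaries with `"from"` and `"uri"` keys and `page.get_textbox()` accepts a rectangle given as four numbers. The helper names `shrink_vertically` and `links_as_html` are made up for illustration, and falling back to the URL when no text is found is my own choice.

```python
from html import escape


def shrink_vertically(rect, fraction=0.1):
    """Trim `fraction` of the height from the top and from the bottom."""
    x0, y0, x1, y1 = rect  # works for a PyMuPDF Rect or a plain 4-tuple
    h = (y1 - y0) * fraction
    return (x0, y0 + h, x1, y1 - h)


def links_as_html(page, fraction=0.1):
    """Return one <a> element per URI link on the page (hypothetical helper)."""
    anchors = []
    for link in page.get_links():
        uri = link.get("uri")
        if not uri:  # skip internal jumps and other non-URI links
            continue
        text = page.get_textbox(shrink_vertically(link["from"], fraction))
        text = " ".join(text.split())  # collapse line breaks and runs of spaces
        # A link may cover an image or nothing at all, so fall back to the URL.
        label = escape(text or uri)
        anchors.append(f'<a href="{escape(uri, quote=True)}">{label}</a>')
    return anchors
```

With a real document this would be called on each page, for example `links_as_html(doc[0])`. Note that it only builds the anchors; placing them at the right position inside the extracted text would still require matching each link rectangle against the text spans, which is not shown here.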
I currently have a script that extracts text from a PDF file and converts it into HTML, but I'd also like to extract links.
I know that you can extract links from a page using first_link, but how do you output the links while extracting the text, so that both are output at the same time? The goal is to output the link and its text as an HTML hyperlink:
<a href="url">link text</a>