page.get_text('words') giving extra text #1682
kvrameshreddy asked this question in Looking for help. Answered by JorjMcKie.
Hi @JorjMcKie, I have a PDF, and when I extract text from it, the result contains words that are not there in the PDF. Can you help me solve this? I am not sure what is happening with this PDF file.
This is the file I am using. The words in the marked region do not appear in the PDF file.
Answered by JorjMcKie, Apr 20, 2022
This is how the text is coded in the PDF itself, for whatever weird reason. Maybe to hold some invisible notes.
You would have spotted this by looking at the words' coordinates: the extra words lie outside the page rectangle.
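To check those coordinates yourself, something like the following sketch may help. It assumes the usual `get_text("words")` item layout `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The helper names `words_outside_page` and `report_hidden_words` are made up for illustration and are not part of PyMuPDF.

```python
def words_outside_page(words, page_rect):
    """Return the word tuples whose bounding box is not fully inside page_rect.

    words: items shaped like PyMuPDF's get_text("words") output, i.e.
           (x0, y0, x1, y1, word, block_no, line_no, word_no).
    page_rect: (x0, y0, x1, y1) of the page, e.g. tuple(page.rect).
    """
    px0, py0, px1, py1 = page_rect
    outside = []
    for w in words:
        x0, y0, x1, y1 = w[:4]
        fully_inside = px0 <= x0 and py0 <= y0 and x1 <= px1 and y1 <= py1
        if not fully_inside:
            outside.append(w)
    return outside


def report_hidden_words(path):
    """Print every word of the first page that lies outside the page rectangle."""
    import fitz  # PyMuPDF; imported here so the helper above stays dependency-free

    doc = fitz.open(path)
    page = doc[0]
    for w in words_outside_page(page.get_text("words"), tuple(page.rect)):
        print(w[4], w[:4])  # the word, then its bounding box
```

Calling `report_hidden_words` with the path of the problem file should list the words that extraction returns but that never show up on the rendered page.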
The only way to fix this is to specify the page rectangle as the clip:
page.get_text("words", clip=page.rect)
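Applied to every page of a document, the same call could be wrapped roughly as below. This is only a sketch: `visible_words` and `visible_words_by_page` are hypothetical helper names, and the only PyMuPDF calls used are `fitz.open`, page iteration, `page.rect`, and `page.get_text` with its `clip` argument.

```python
def visible_words(page):
    """Extract words from a page, restricted to the page's own rectangle."""
    return page.get_text("words", clip=page.rect)


def visible_words_by_page(path):
    """Return one list of word tuples per page, each clipped to its page."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        return [visible_words(page) for page in doc]
```

Each inner list holds the word tuples of one page, so text placed outside the page rectangle should no longer show up in the result.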
Here is the PDF source code as proof: