Extraction of text #2626
-
`page = doc[0]` I used the above code to extract text from page 1 of the PDF below. BNL-76953-2006-CP is visually present only once, but when extracting the text spans I could see it more than once. Can you please let me know the reason?
Replies: 4 comments 15 replies
-
Clicking on your link doesn't do anything, so I cannot look at the file.
-
32542.pdf
-
Actually, if I were to make such a PDF, I would create
I would never ever write standard text underneath such a field, as has happened here! What purpose does that have?! But as usual in PDF, Murphy's Law applies: what is possible will happen sooner or later.
-
Can you let me know what the rectangles on page 7 of the PDF below represent? Are they graphics?
No, they are not graphics, but so-called inline images: all image information is part of the page's `/Contents` object. Because they have no xref, you don't see them using `page.get_images()`. But PyMuPDF doesn't let you down!
Information about all page images: `page.get_image_info(...)`. To extract them, use text extraction: `page.get_text("dict", ...)`.
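A minimal sketch of the two calls mentioned above, in case it helps. The helper names `image_blocks` and `save_page_images`, the output file naming, and the reading of `xref == 0` as "inline image" are my own assumptions for illustration, not something taken from the PDF in question.

```python
from pathlib import Path


def image_blocks(page_dict):
    """Return the image blocks (type 1) of a page.get_text("dict") result."""
    return [b for b in page_dict.get("blocks", []) if b.get("type") == 1]


def save_page_images(pdf_path, page_number, out_dir="."):
    """Write every image shown on one page to out_dir; return the file paths."""
    import fitz  # PyMuPDF; imported here so image_blocks() works without it

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        page = doc[page_number]  # page numbers are 0-based
        # get_image_info() lists all images on the page, inline ones included.
        for info in page.get_image_info(xrefs=True):
            # Assumption: an xref of 0 indicates an inline image.
            kind = "inline" if info["xref"] == 0 else f"xref {info['xref']}"
            print(f"image {info['number']}: {kind}, bbox={info['bbox']}")
        # The "dict" output carries the binary data of each image block.
        for i, block in enumerate(image_blocks(page.get_text("dict"))):
            target = out_dir / f"page{page_number}_img{i}.{block['ext']}"
            target.write_bytes(block["image"])
            written.append(target)
    return written
```

For page 7 of a file this would be called as `save_page_images("input.pdf", 6)`, where `input.pdf` is a placeholder name.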