Extracting image from page if it's the only thing on the page #815
-
Interesting question!
>>> doc.getPageFontList(page.number)
[(3, 'ttf', 'Type0', 'GlyphLessFont', 'f-0-0', 'Identity-H')]
I have a scanned example that looks like this:
>>> doc.get_page_images(0)
[(14, 0, 2718, 3221, 8, 'DeviceRGB', '', 'BG', 'JPXDecode')]
>>> pprint(doc.get_page_fonts(0))
[(40, 'n/a', 'TrueType', 'TimesNewRomanPSMT', 'F_0', 'WinAnsiEncoding'),
(41, 'n/a', 'TrueType', 'TimesNewRomanPS-BoldMT', 'F_1', 'WinAnsiEncoding'),
(42, 'n/a', 'TrueType', 'TimesNewRomanPS-ItalicMT', 'F_2', 'WinAnsiEncoding')]
>>> doc[0].rect
Rect(0.0, 0.0, 652.4500122070312, 773.0479736328125)
>>> doc[0].getImageBbox("BG")
Rect(0.0, -0.00201416015625, 652.4500122070312, 773.0479736328125)
>>> doc[0].rect in doc[0].getImageBbox("BG")
True
So the OCR tool was not Tesseract.
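The containment test in the session above compares floats exactly. As a minimal sketch, the same idea can be written with a small tolerance, in case an image bbox falls short of the page rectangle by rounding noise (that situation is an assumption, it was not observed in this thread). The function name `covers_page` and the parameter `tol` are hypothetical; rectangles are plain `(x0, y0, x1, y1)` tuples.

```python
def covers_page(page_rect, image_bbox, tol=0.01):
    """Return True if image_bbox covers page_rect up to `tol` points per edge.

    Both arguments are (x0, y0, x1, y1) sequences of floats.
    """
    px0, py0, px1, py1 = page_rect
    ix0, iy0, ix1, iy1 = image_bbox
    return (
        ix0 <= px0 + tol      # image starts at or left of the page's left edge
        and iy0 <= py0 + tol  # ... and at or above the top edge
        and ix1 >= px1 - tol  # image reaches the right edge
        and iy1 >= py1 - tol  # ... and the bottom edge
    )


page_rect = (0.0, 0.0, 652.4500122070312, 773.0479736328125)
bg_bbox = (0.0, -0.00201416015625, 652.4500122070312, 773.0479736328125)
print(covers_page(page_rect, bg_bbox))
```

With PyMuPDF, the two arguments could presumably be obtained as `tuple(doc[0].rect)` and `tuple(doc[0].getImageBbox("BG"))`.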
-
No, this one means: no links, no annots, no widgets/fields.
-
This can actually only happen if there is at least one image covering the whole page, right?
A trickier check can add some safety against this condition:
>>> pprint(doc.get_page_images(0))
[(1291, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode')]
This is page 0 of the PyMuPDF PDF manual. The following "find" checks locate where the display command for image "Im1" occurs in the page contents. One can then check whether more things are displayed later on, i.e. on top of image "Im1" (remember: we are talking about a full-page image only, which is not the case in our example):
cont = doc[0].read_contents()  # the full concatenated contents (bytes!)
pos = cont.find(b"/Im1 Do")  # position of the image display command, -1 if absent
# check whether any text objects or more display commands follow:
if pos >= 0 and (cont.find(b"BT", pos) >= 0 or cont.find(b" Do", pos + 7) >= 0):
    print("relevant objects follow image", "Im1")
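The same scan can be wrapped in a function. This is only a sketch: the name `objects_follow_image` is hypothetical, and it matches `BT` and `Do` as whitespace-delimited tokens with a regular expression instead of plain `find`. It remains a heuristic, since those byte sequences could also occur inside string literals or inline image data, and an image shown from within a form XObject would not be found at all.

```python
import re


def objects_follow_image(cont: bytes, image_name: str) -> bool:
    """Heuristic: True if a text object or another 'Do' follows the image.

    `cont` is the page's concatenated content stream as bytes.
    Raises ValueError if no "/<image_name> Do" command is present.
    """
    # Locate the display command, allowing any whitespace between the tokens.
    show = re.search(
        rb"/" + re.escape(image_name.encode()) + rb"\s+Do(?!\S)", cont
    )
    if show is None:
        raise ValueError(f"image {image_name!r} is not displayed in this stream")
    rest = cont[show.end():]
    # "BT" opens a text object; "Do" paints another XObject (image or form).
    return re.search(rb"(?<!\S)(?:BT|Do)(?!\S)", rest) is not None


only_image = b"q 652.45 0 0 773.05 0 0 cm /BG Do Q"
with_text = b"q 652.45 0 0 773.05 0 0 cm /BG Do Q BT /F_0 12 Tf (hi) Tj ET"
print(objects_follow_image(only_image, "BG"), objects_follow_image(with_text, "BG"))
```

With a real document the first argument would be `doc[0].read_contents()`, as in the snippet above.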
-
I'd like a function like
does_this_page_contain_only_an_embedded_image()
or really
would_I_lose_anything_if_I_extract_one_bitmap_from_this_page().
It's quite common for a PDF to contain scanned pages, e.g. one JPEG per page.
In this case it's easy to use PyMuPDF to extract this image and save it to disk.
But if there is anything else on the page, I'd rather render the page to a bitmap (which is also easy in PyMuPDF).
Roughly:
I tried to write this in https://gitlab.com/plom/plom/-/blob/v0.5.13/plom/scan/scansToImages.py#L106
But it fails on at least this example:
So I gave up for now and always render to bitmap, which is safe but certainly not ideal in case 1 above.
There must be something about checking/counting xrefs but I'm not knowledgeable enough about PDF to trust myself to write it.
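A cheap first filter along these lines might look as follows. This is a sketch under assumptions, not a complete answer: the function name `looks_like_single_image_page` is hypothetical, `page` is expected to behave like a PyMuPDF `Page`, and the check is deliberately conservative (any font listed for the page counts as "something else", even an invisible OCR text layer). It does not verify that the image covers the whole page or that nothing is drawn on top of it.

```python
def looks_like_single_image_page(page) -> bool:
    """Cheap pre-check: exactly one image and no fonts, links, annots or widgets.

    `page` is assumed to offer get_images(), get_fonts(), get_links(),
    annots() and widgets(), as a PyMuPDF Page does.
    """
    if len(page.get_images()) != 1:
        return False  # zero or several images: not the simple scanned case
    if page.get_fonts():
        return False  # text on the page (possibly OCR) would be lost
    if page.get_links():
        return False
    # annots() and widgets() are iterated lazily; one item is enough to decline
    if any(True for _ in page.annots()):
        return False
    if any(True for _ in page.widgets()):
        return False
    return True
```

If this returns True and the further checks pass, the image could be pulled out via its xref (the first entry of the image tuple) with `doc.extract_image(xref)`; otherwise rendering with `page.get_pixmap()` stays the safe fallback.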