extracting image from page if it's the only thing on the page #815

cbm755 · 2021-01-09T04:54:50Z

cbm755
Jan 9, 2021

I'd like a function like does_this_page_contain_only_an_embedded_image() or really would_I_lose_anything_if_I_extract_one_bitmap_from_this_page().

It's quite common for a PDF to contain scanned pages, e.g., one JPEG per page.
In this case it's easy to use PyMuPDF to extract this image and save it to disk.
But if there is anything else on the page, I'd rather render the page to a bitmap (which is also easy in PyMuPDF).

Roughly:

is there any text? return False
are there any annotations? return False
etc, etc.
Is there just one image? return True

I tried to write this in https://gitlab.com/plom/plom/-/blob/v0.5.13/plom/scan/scansToImages.py#L106

But it fails on at least this example:

a page that has a single image but also non-text strokes (specifically handwriting).

So I gave up for now and always render to a bitmap, which is safe but certainly not ideal in case 1 above.

There must be something about checking/counting xrefs but I'm not knowledgeable enough about PDF to trust myself to write it.

JorjMcKie · 2021-01-09T09:51:46Z

JorjMcKie
Jan 9, 2021
Maintainer

Interesting question!

If a page is the result of scanning, then page.rect in page.getImageBbox() should hold for at least one of its images. Depending on the scanner, though, you may still see two images in that case, both of (at least) page size, and recreating the page's appearance then means merging those two images ... tedious! And faithful results are only achievable with Pillow or similar ...
In the scan case you may still have (hidden) text, e.g. when the PDF has been OCRed, so the existence of text is no safe indicator. But if Tesseract was the OCR tool, you will see the following: the GlyphLessFont is a safe indicator for a scanned, Tesseract-OCRed page.

>>> doc.getPageFontList(page.number)
[(3, 'ttf', 'Type0', 'GlyphLessFont', 'f-0-0', 'Identity-H')]
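A minimal sketch of how that font indicator could be tested, assuming the list has the shape shown above, with the base font name at index 3. The helper name `looks_tesseract_ocred` is hypothetical, and requiring *every* font to be GlyphLessFont is an assumption made here to stay on the cautious side:

```python
def looks_tesseract_ocred(fonts):
    """Return True if the page uses fonts and all of them are 'GlyphLessFont'.

    `fonts` is a list of tuples shaped like the output shown above:
    (xref, ext, type, basefont, name, encoding).
    """
    basefonts = [font[3] for font in fonts]
    # endswith() also accepts a subset-prefixed name such as "ABCDEF+GlyphLessFont"
    return bool(basefonts) and all(
        name.endswith("GlyphLessFont") for name in basefonts
    )
```

With PyMuPDF this would be called on the font list of the page, e.g. `looks_tesseract_ocred(doc.get_page_fonts(page.number))`.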

There is page.annot_xrefs(), which is a list of triples (xref, type, name) for each annotation, link or field on the page. Whether or not that list is empty already tells you a lot.
There is page.getDrawings(), a list of dictionaries which represent elementary PDF draw commands on the page. An empty list is also a good indicator.

I have a scanned example, looking like this:

>>> doc.get_page_images(0)
[(14, 0, 2718, 3221, 8, 'DeviceRGB', '', 'BG', 'JPXDecode')]
>>> pprint(doc.get_page_fonts(0))
[(40, 'n/a', 'TrueType', 'TimesNewRomanPSMT', 'F_0', 'WinAnsiEncoding'),
 (41, 'n/a', 'TrueType', 'TimesNewRomanPS-BoldMT', 'F_1', 'WinAnsiEncoding'),
 (42, 'n/a', 'TrueType', 'TimesNewRomanPS-ItalicMT', 'F_2', 'WinAnsiEncoding')]
>>> doc[0].rect
Rect(0.0, 0.0, 652.4500122070312, 773.0479736328125)
>>> doc[0].getImageBbox("BG")
Rect(0.0, -0.00201416015625, 652.4500122070312, 773.0479736328125)
>>> doc[0].rect in doc[0].getImageBbox("BG")
True

So the OCR tool was not Tesseract.
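The containment test above passes although the image bbox starts at y = -0.002, i.e. slightly outside the page. A bbox that falls short of the page edge by a similar rounding error would fail it, so a small tolerance may be useful. A sketch under that assumption; `covers_page` and the default `tol` are hypothetical, and both arguments are treated as plain (x0, y0, x1, y1) sequences:

```python
def covers_page(page_rect, image_bbox, tol=0.01):
    """Return True if image_bbox covers page_rect, allowing `tol` points of slack."""
    px0, py0, px1, py1 = page_rect
    ix0, iy0, ix1, iy1 = image_bbox
    return (
        ix0 <= px0 + tol
        and iy0 <= py0 + tol
        and ix1 >= px1 - tol
        and iy1 >= py1 - tol
    )
```

For the example page this would be `covers_page(doc[0].rect, doc[0].getImageBbox("BG"))`.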

3 replies

cbm755 Jan 9, 2021
Author

My priorities:

False negatives (rendering when we could in theory have extracted the bitmap): no big deal, just some loss of quality.
False positives (extracting the bitmap when we should have rendered): a serious crime, since we miss content.

That is, I should err on the side of rendering, not extraction...

Maybe exceptions can be added later if we can detect that (only) OCR was done (e.g., if common iOS/Android scanning tools are doing OCR).

cbm755 Jan 9, 2021
Author

page.annot_xrefs() which is a list of triples (xref, type, name) for each annotation, link or field on the page. If that list is empty or not ...

Does this mean that annot_xrefs() being empty means there are no strokes on the page?

vicent4no Feb 10, 2024

Hey everyone. I am interested in this topic because of a task I am trying to achieve. Sorry for asking in a really old thread, but there is not much information about this available on the internet. Thanks for your time, by the way.

One question:

But depending on the scanner, you still may see two images in that case, both of (at least) page size and the recreation of the page appearance involves amalgamation of those two images ... tedious! And faithful results only achievable with Pillow or so ...

Why would we need to use Pillow here? At least in theory, I could loop over every page and call page.get_images(), then check each image's size (height and width) against page.mediabox, looking for at most two images that are equal to or bigger than the page.

This assumes that the PDF is well formed and that these references exist in the first place (I mean, that the information can be obtained using page.get_images()).
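One caveat on such a comparison: the width and height reported by page.get_images() are pixel counts, whereas page.mediabox is measured in points (1/72 inch), so the two cannot be compared directly. What can be compared with the page is the rectangle where the image is placed (the image bbox). A small sketch relating the two, using the numbers of the scanned example earlier in this thread; the helper `effective_dpi` is hypothetical:

```python
def effective_dpi(pixel_width, pixel_height, bbox):
    """Return the (horizontal, vertical) resolution of an image placed at `bbox`.

    `bbox` is (x0, y0, x1, y1) in points; 72 points make one inch.
    """
    x0, y0, x1, y1 = bbox
    return (pixel_width * 72 / (x1 - x0), pixel_height * 72 / (y1 - y0))


# image "BG" from the example: 2718 x 3221 pixels, placed over the whole page
dpi_x, dpi_y = effective_dpi(
    2718, 3221, (0.0, -0.00201416015625, 652.4500122070312, 773.0479736328125)
)
print(round(dpi_x), round(dpi_y))
```

Here the image has far more pixels than the page has points simply because it was scanned at a much higher resolution than 72 dpi, which says nothing about whether it covers the page.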

JorjMcKie · 2021-01-09T19:34:58Z

JorjMcKie
Jan 9, 2021
Maintainer

Does this mean that annot_xrefs() being empty means there are no strokes on the page?

No, this one means: no links, no annots, no widgets/fields.

page.getDrawings() == [] means no drawings.

0 replies

JorjMcKie · 2021-01-09T20:13:46Z

JorjMcKie
Jan 9, 2021
Maintainer

false positives

This can actually only happen if there is at least one image covering the whole page, right?
If there is such an image, then a few more checks could be used to reduce that risk:

assert page.getDrawings() == []
assert page.annot_xrefs() == []
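Putting the checks discussed so far together, a conservative test might look like the sketch below. It is not a vetted implementation: the function name `safe_to_extract` and the tolerance are hypothetical, the methods are spelled with PyMuPDF's current snake_case names (get_drawings, get_image_bbox) instead of the older camelCase ones used above, and any text at all, including a hidden OCR layer, is treated as a reason to render.

```python
def safe_to_extract(page, tol=0.01):
    """Return True only if extracting the page's single image should lose nothing.

    `page` is assumed to be a PyMuPDF Page. Any doubt returns False, so the
    caller falls back to rendering the page.
    """
    images = page.get_images()
    if len(images) != 1:
        return False  # no image, or several that would have to be merged
    if page.annot_xrefs():
        return False  # links, annotations or form fields are present
    if page.get_drawings():
        return False  # vector graphics, e.g. handwriting strokes
    if page.get_text().strip():
        return False  # visible or hidden (OCR) text
    name = images[0][7]  # reference name of the image, e.g. "BG"
    ix0, iy0, ix1, iy1 = page.get_image_bbox(name)
    px0, py0, px1, py1 = page.rect
    # the image must cover the whole page, up to `tol` points of rounding
    return (
        ix0 <= px0 + tol
        and iy0 <= py0 + tol
        and ix1 >= px1 - tol
        and iy1 >= py1 - tol
    )
```

A caller would extract the image when this returns True and render the page otherwise, which keeps mistakes on the "false negative" side.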

A trickier check may add further protection against this condition:
Like everything else, an image is displayed by a certain command in the page's (concatenated) /Contents: /name Do, where name is the entry at index 7 (counting from 0) of an item of doc.get_page_images().
Example:

>>> pprint(doc.get_page_images(0))
[(1291, 0, 261, 115, 8, 'DeviceRGB', '', 'Im1', 'DCTDecode')]

This is page 0 of the PyMuPDF PDF manual. The following "find" locates where the display of image "Im1" occurs. One could then check whether more things are displayed later on, i.e. on top of image "Im1" (remember: we are talking about full-page images only, which is not the case in our example):

cont = doc[0].read_contents()  # read the full concatenated contents (bytes!)
pos = cont.find(b"/Im1 Do")  # position of the image display command, -1 if absent
# check whether any text objects or further display commands follow the image
# (bytes.find() returns -1 when nothing is found, so compare against that):
if pos >= 0 and (cont.find(b"BT", pos) >= 0 or cont.find(b" Do", pos + 7) >= 0):
    print("relevant objects follow image", "Im1")
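A plain substring search can also hit the bytes "BT" inside unrelated data, and it does not notice path painting such as strokes. A slightly stricter variant, offered only as a sketch, compares whole whitespace-separated tokens against a list of painting operators. The names `PAINT_OPS` and `painted_after_image` are hypothetical, and the tokenizer is deliberately naive (it splits on whitespace only), so it can report false alarms, for example on single letters inside text strings. That errs toward rendering.

```python
# Operators that paint something on the page: text objects, XObjects,
# path stroking/filling, shadings and inline images.
PAINT_OPS = {b"BT", b"Do", b"S", b"s", b"f", b"F", b"f*",
             b"B", b"B*", b"b", b"b*", b"sh", b"BI"}


def painted_after_image(contents, name):
    """Return True if anything may be painted after the first '/name Do'.

    Also returns True when the image command is not found, so that any
    doubt leads to rendering the page.
    """
    tokens = contents.split()  # naive: splits on whitespace only
    target = b"/" + name.encode()
    for i in range(len(tokens) - 1):
        if tokens[i] == target and tokens[i + 1] == b"Do":
            return any(tok in PAINT_OPS for tok in tokens[i + 2:])
    return True
```

With PyMuPDF this would be called as `painted_after_image(page.read_contents(), "Im1")`.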

0 replies
