Get bounding box of an image FAST #908

mohammadmjn · 2021-02-19T00:48:32Z

Is there a way to get an image's bbox, via its xref or some other method, that is faster than page.get_image_bbox(image)? page.get_image_bbox(image) is too slow for my use case: a vector PDF containing a lot of small raster images (more than 500). I need each image's bbox so I can compare its size with the page size and decide whether the whole page is raster or vector. I also checked the xref, but it gives the image's actual pixel size, and in most cases that width and height are larger than the size at which the image appears in the PDF (the width and height obtained from the bbox). Here is my current code, in which I want to replace page.get_image_bbox(image) with a solution based on the xref or another faster alternative:

import fitz  # PyMuPDF


def find_images_bbox(file_path):
    doc = fitz.open(file_path)
    page = doc[0]
    # full=True is required so each entry can be passed to get_image_bbox()
    image_list = doc.get_page_images(0, full=True)
    for i, image in enumerate(image_list):
        image_bbox = page.get_image_bbox(image)
        print('image {} Bbox: {}'.format(i, image_bbox))
    doc.close()

JorjMcKie · 2021-02-19T07:21:18Z

Maintainer

You don't seem to need the xref at all, do you? Or any detail on how the page appearance references the image?
If I get you right, all you need are bbox coordinates of raster images actually shown on the page.

If this is true, I recommend using text extraction, although that may not be the obvious choice:
There is a performance-oriented variant that delivers text blocks, in which every image is represented by a line of text containing image metadata:

pprint([b for b in page.get_text("blocks") if b[-1] == 1])  # take only image blocks
[(344.25,
  88.93597412109375,
  540.0,
  175.18597412109375,
  '<image: DeviceRGB, width 261, height 115, bpc 8>',
  0,
  1)]

An image block is represented by a 1 as last item. The first 4 items of each block represent the bbox of the text block, in our case the bbox of the image.
Here is a runtime comparison for a page with 218 images:

In [8]: %timeit imgs=[b for b in page.get_text("blocks") if b[-1] == 1]
22.4 ms ± 245 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]: images = doc.get_page_images(1,full=True)
In [10]: %timeit  imgs=[page.get_image_bbox(i) for i in images]
2.46 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So you have 22.4 milliseconds versus 2.46 seconds.
Is that fast enough for you?
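
For the original goal (deciding whether a page is essentially one raster image), the block filtering above could be wrapped roughly as follows. This is only a sketch: the helper names and the 0.9 coverage threshold are assumptions for illustration, not part of PyMuPDF. The only PyMuPDF features relied on are page.get_text("blocks") and page.rect; bboxes are not clipped to the page.

```python
def image_bboxes(blocks):
    """Keep only image blocks (last item == 1) and return their bboxes.

    `blocks` is the list returned by page.get_text("blocks"); the first
    four items of each block are x0, y0, x1, y1.
    """
    return [tuple(b[:4]) for b in blocks if b[-1] == 1]


def covers_page(bbox, page_rect, threshold=0.9):
    """True if bbox area is at least `threshold` of the page area.

    `threshold` is an assumed cut-off; tune it for your documents.
    """
    x0, y0, x1, y1 = bbox
    px0, py0, px1, py1 = page_rect
    page_area = (px1 - px0) * (py1 - py0)
    if page_area <= 0:
        return False
    image_area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    return image_area / page_area >= threshold


def page_is_raster(page, threshold=0.9):
    """Hypothetical helper: does any single image cover (almost) the page?"""
    boxes = image_bboxes(page.get_text("blocks"))
    return any(covers_page(b, tuple(page.rect), threshold) for b in boxes)
```

With an open document, page_is_raster(doc[0]) would then give the per-page decision without calling page.get_image_bbox() at all.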

12 replies

JorjMcKie Apr 26, 2021
Maintainer

Look, I know nothing about your PDF, nor about what you are using for dragging things around. This type of wild guessing leads to nowhere. Please be more specific: let me see your PDF, tell me what software you are using to move page objects, ...

JorjMcKie Apr 26, 2021
Maintainer

Ah, sorry - just saw your e-mail.
I will look at it.

JorjMcKie Apr 26, 2021
Maintainer

@priyamharsh14 - wow, a very unusual case:
On page 2 there is an annotation:

>>> page=doc[1]
>>> annots=[a for a in page.annots()]
>>> annots
['Stamp' annotation on page 1 of AashutoshSingal.pdf]
>>> annot=annots[0]
>>> annot
'Stamp' annotation on page 1 of AashutoshSingal.pdf
>>> annot.info
{'content': '', 'name': '#clipboard', 'title': 'aashu', 'creationDate': "D:20200422045223+05'30'", 'modDate': "D:20200505151835+05'30'", 'subject': 'Stamp', 'id': 'c3676c0a-da19-40ff-9577-151702d035b1'}

So the creator used a stamp annotation to display something. Looking at the source of the annotation's appearance stream:

>>> print(annot._getAP().decode())
847 0 0 174 0 0 cm
/Icon Do

So this annotation displays an image called Icon. This is also the reason why that image is not in the list of the page's images.
PyMuPDF does not support this constellation.
But using low-level functions you can still track it down:
An annotation is an object that points to another object controlling the details of its appearance, found as object AP/N:

>>> print(doc.xref_object(annot.xref))
<<
  /AP <<
    /N 7 0 R  % <=== this is the AP/N
  >>
  /C [ .898026 .133331 .215683 ]
  /CreationDate (D:20200422045223+05'30')
  /F 4
  /M (D:20200505151835+05'30')
  /NM (c3676c0a-da19-40ff-9577-151702d035b1)
  /Name /#23clipboard
  /P 1 0 R
  /Popup 17 0 R
  /Rect [ .00149536 -.0598755 595.277 122.228 ]
  /Subj (Stamp)
  /Subtype /Stamp
  /T (aashu)
  /Type /Annot
>>
>>> # let us look at the xref of AP/N:
>>> print(doc.xref_object(7))
<<
  /BBox [ 0 0 847 174 ]
  /FormType 1
  /Length 28
  /Matrix [ 1 0 0 1 0 0 ]
  /Resources <<
    /ProcSet [ /PDF ]
    /XObject <<
      /Icon 8 0 R  % <=== there you are: an image!
    >>
  >>
  /Subtype /Form
  /Type /XObject
>>
>>> # look at Icon's xref 8:
>>> print(doc.xref_object(8))
<<
  /BitsPerComponent 8
  /ColorSpace /DeviceRGB
  /Filter [ /FlateDecode ]
  /Height 174
  /Length 10925
  /SMask 9 0 R
  /Subtype /Image
  /Width 847
>>

And accordingly, you can extract that image via doc.extract_image(8). This is a PNG with transparency.

JorjMcKie Apr 26, 2021
Maintainer

The bounding box of that image is equal to the annotation's rectangle in this case.
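
The manual walk shown above (annotation, then AP/N, then the XObject entries of that form) could be automated roughly like this. It is a sketch under assumptions: it parses the text returned by doc.xref_object() with a regular expression, it assumes /N is a plain indirect reference as in this file, and the function names are made up for illustration.

```python
import re

# Matches entries such as "/Icon 8 0 R" and captures the key and the xref.
_REF = re.compile(r"/([^\s/<>\[\]()]+)\s+(\d+)\s+\d+\s+R\b")


def indirect_refs(source):
    """Map each key that points to an indirect object to its xref number."""
    return {name: int(xref) for name, xref in _REF.findall(source)}


def images_shown_by_annot(doc, annot_xref):
    """Return (name, xref) pairs of images referenced by the annot's AP/N.

    `doc` only needs an xref_object(xref) method, as fitz.Document has.
    """
    ap_n = indirect_refs(doc.xref_object(annot_xref)).get("N")
    if ap_n is None:  # no appearance stream, or /N is not a simple reference
        return []
    found = []
    for name, xref in indirect_refs(doc.xref_object(ap_n)).items():
        # Keep only referenced objects that are images.
        if re.search(r"/Subtype\s*/Image\b", doc.xref_object(xref)):
            found.append((name, xref))
    return found
```

For the file discussed here, images_shown_by_annot(doc, annot.xref) should report the Icon image at xref 8, which can then be passed to doc.extract_image(); annot.rect serves as its bbox.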

priyamharsh14 Apr 27, 2021

@JorjMcKie Thank you so much.

JorjMcKie · 2021-02-19T08:47:53Z

Maintainer

The reason we have such an apparent functional overlap here is that text extraction works for all document types, not just PDFs.
So, from extracted text of any output variant, it is not possible to refer back to PDF-specific definitions.

The doc.get_page_images() method and friends are for PDF only. The method page.get_image_bbox() is also for PDF only. I built it for cases where a reference between the page object definition and the page appearance definition is needed.

2 replies

mohammadmjn Feb 19, 2021
Author

@JorjMcKie The method you mentioned (using page.get_text) is far faster than page.get_image_bbox(). But I noticed an issue with it: doc.get_page_images(0, full=True) gave me 549 raster images, while page.get_text gave me 1603 image blocks (with 1 as last item). I dumped the xref table of the PDF file using the following snippet and found that only 549 objects with /Subtype /Image, which marks raster images, exist in the page. Why does the method you mentioned give me so many more images than the other one?

xref_len = doc.xref_length()
text = ""
for xref in range(1, xref_len):
    text += 'object **{}** (stream: **{}**)'.format(xref, doc.is_stream(xref))
    text += '\n{}\n'.format(doc.xref_object(xref, compressed=False))
with open('xref_text.txt', 'w') as f:
    f.write(text)

Results:
image list length: 549
get_image_bbox execution time: 86.62995505332947 seconds

image list length: 1603
get_text execution time: 0.15800905227661133 seconds

JorjMcKie Feb 19, 2021
Maintainer

Why the method you mentioned gives me a lot more images than the other ones?

get_page_images(...) only lists what is contained in the page's object definition. This is sometimes even more than what the page actually displays (I don't want to explain here when this may happen).

get_text(...) in addition is able to detect stuff that only lives in the page's /Contents object(s), for example inline images. So this is not an issue at all, but even better for your purpose.

And although you have 3 times more images, the whole thing is still about 550 times faster (86.63 s versus 0.158 s).
