Match the results of page.get_image_info() and page.get_images() #1659

ayusonkj · 2022-03-31T01:56:49Z

ayusonkj
Mar 31, 2022

Hi,

I've been working on extracting the individual images from a PDF file so that I can store them in a remote store after identifying each one.
I've been using the results of page.get_images() to identify and extract the individual images on a single page, and that works.

However, I now also need each image's rotation. The information I need for that, the transformation matrix, can be extracted using page.get_image_info() or page.get_text('dict').
These are the values I was able to extract using the methods above:

{'bbox': (-1.1367249488830566,
          -1.0975341796875,
          596.3571166992188,
          842.9071655273438),
 'bpc': 8,
 'colorspace': 3,
 'cs-name': 'ICCBased(RGB,sRGB IEC61966-2.1)',
 'height': 1689,
 'number': 0,
 'size': 120177,
 'transform': (597.4938354492188,
               0.0,
               -0.0,
               844.0046997070312,
               -1.1367249488830566,
               -1.0975341796875),
 'width': 1195,
 'xres': 96,
 'yres': 96}

I am able to calculate the angle, rounded to the nearest integer, using the c and d values of the transform matrix. However, I don't see any identifier, such as an image name, that I can use to relate the results of page.get_image_info() to the results of page.get_images().

Is there any way to match the entries returned by page.get_images() with the entries returned by page.get_image_info()?

For now, I am storing the results of page.get_image_info() in a list and, while enumerating the images from page.get_images(), matching the two by list index.

JorjMcKie · 2022-03-31T11:58:27Z

JorjMcKie
Mar 31, 2022
Maintainer

The two methods work with completely different approaches. get_images() only works on PDFs, whereas get_image_info() works for all document types, just like get_text(), on which it is based.
The sets of images they report are not equal in general. I discuss the background in detail in the documentation.

To support this matching, get_image_info() accepts the xrefs parameter. If it is True, then image["xref"] can be used to locate the corresponding item in get_images().
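A minimal sketch of that matching, assuming the first element of each get_images() item is the image xref and that get_image_info(xrefs=True) reports an xref of 0 for images it cannot resolve (inline images, for example). The helper names match_by_xref and load_and_match are hypothetical, and `import fitz` is the PyMuPDF import name in use at the time of this thread.

```python
from collections import defaultdict


def match_by_xref(images, image_info):
    """Pair each get_images() item with its get_image_info() entries.

    images:     list of tuples from page.get_images(); item[0] is the xref.
    image_info: list of dicts from page.get_image_info(xrefs=True).
    Returns a list of (item, [info, ...]) pairs. An image shown several
    times on the page yields several info dicts; one not shown yields [].
    """
    info_by_xref = defaultdict(list)
    for info in image_info:
        xref = info.get("xref", 0)
        if xref:  # assumed: 0 means no xref could be determined
            info_by_xref[xref].append(info)
    return [(item, info_by_xref.get(item[0], [])) for item in images]


def load_and_match(path, page_number=0):
    """Hypothetical convenience wrapper around the two PyMuPDF calls."""
    import fitz  # PyMuPDF

    with fitz.open(path) as doc:
        page = doc[page_number]
        images = page.get_images(full=True)
        image_info = page.get_image_info(xrefs=True)
    return match_by_xref(images, image_info)
```

Keying on the xref instead of the list index avoids mismatches when the two methods report different sets of images.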

But you can also use page.get_image_rects(item, transform=True), passing one of the items from get_images(), to get a list of the image's locations on the page, including the transformation matrix.
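The second approach could be sketched as follows. It assumes that get_image_rects() accepts the image xref in place of the full item and that, with transform=True, each returned entry is a (rectangle, matrix) pair. The helper name image_placements is hypothetical.

```python
def image_placements(page):
    """Map each image xref on a page to its placements.

    page is expected to be a PyMuPDF Page. Each placement is returned
    as (bbox, transform), both converted to plain tuples so the result
    stays usable after the document is closed.
    """
    placements = {}
    for item in page.get_images(full=True):
        xref = item[0]  # first element of a get_images() item is the xref
        placements[xref] = [
            (tuple(rect), tuple(matrix))
            for rect, matrix in page.get_image_rects(xref, transform=True)
        ]
    return placements


# Usage sketch ("input.pdf" is a placeholder file name):
#   import fitz  # PyMuPDF
#   doc = fitz.open("input.pdf")
#   print(image_placements(doc[0]))
```

Because this starts from get_images(), every placement is already tied to the item used for extraction, so no separate matching step is needed.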


ayusonkj · 2022-04-01T00:40:20Z

ayusonkj
Apr 1, 2022
Author

OMG!!! You are right!!! I feel stupid now that I hadn't realized I could use this. Thanks for the help.


JorjMcKie Apr 1, 2022
Maintainer

Rest assured: I know this feeling all too well 😉.
