after crop page, then get_images() #2140

yonglee7015 · 2022-12-19T03:23:34Z

yonglee7015
Dec 19, 2022

Hi, I just want to get the images within a specific region of a page. So I crop the page and then call get_images, but it still returns all images on the page.

page.set_cropbox(fitz.Rect(100, 100, 550, 700))
image_list = page.get_images(full=True)

How can I get only the images within a specific region of a page?

Answered by JorjMcKie

JorjMcKie · 2022-12-19T07:26:19Z

JorjMcKie
Dec 19, 2022
Maintainer

page.get_images() may even list images that are not at all used by the page - not to mention what you are asking.

This is because that method looks at PDF definitions only and does not inspect the page's appearance source code.

To at least restrict the above list to images that are actually referenced by this page anywhere on the MediaBox, do a page.clean_contents() first.

To restrict that list to visible images (CropBox) do this:

imglist = page.get_images()
visibles = [item for item in imglist if page.get_image_rects(item[0])[0] in page.cropbox]

get_image_rects walks through the page's appearance instructions to determine each bbox of one image (given by its xref item[0]) on the page. In the above I am looking at the first such bbox only to see whether it is in the CropBox.
More precise code would have to go over the potentially several such rectangles, and could even include images that are at least partially visible by checking for non-empty rectangle intersections.
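
A minimal sketch of that more precise check, under stated assumptions: the helper names rects_overlap and visible_images are hypothetical, rectangles are treated as plain (x0, y0, x1, y1) sequences, and, as in the snippet above, the image bboxes are compared against page.cropbox. An image that the page lists but never draws yields an empty rectangle list and is simply skipped, which also avoids the IndexError that [0] would raise in that case.

```python
def rects_overlap(a, b):
    """Return True if rectangles a and b, each given as (x0, y0, x1, y1),
    share a non-empty area. Touching edges do not count as overlap."""
    x0 = max(a[0], b[0])
    y0 = max(a[1], b[1])
    x1 = min(a[2], b[2])
    y1 = min(a[3], b[3])
    return x0 < x1 and y0 < y1


def visible_images(page):
    """Return the get_images() items that have at least one bbox
    overlapping the CropBox.

    `page` is expected to behave like a PyMuPDF Page: it must offer
    get_images(), get_image_rects(xref) and a cropbox attribute.
    """
    crop = tuple(page.cropbox)
    visibles = []
    for item in page.get_images():
        # One image can be drawn several times; an unused image gives [].
        rects = page.get_image_rects(item[0])
        if any(rects_overlap(tuple(r), crop) for r in rects):
            visibles.append(item)
    return visibles
```

To require full containment instead of partial visibility, the overlap test could be swapped for the "in page.cropbox" check used above, applied to every rectangle rather than only the first.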

2 replies

yonglee7015 Dec 19, 2022
Author

Thank you so much, but it does not seem to do what I want. I tested it in my code.

yonglee7015 Dec 19, 2022
Author

Anyway, I just want to extract the images at a fixed position in the PDF. Filtering by rect does not give me the image I need.

JorjMcKie · 2022-12-19T10:23:16Z

JorjMcKie
Dec 19, 2022
Maintainer

There must be some misconception. Can you let me have the PDF and the page number in question?

1 reply

yonglee7015 Dec 19, 2022
Author

OK, I just sent the PDF to your email.

JorjMcKie · 2022-12-19T13:07:52Z

JorjMcKie
Dec 19, 2022
Maintainer

As I suspected, you have been hiding major information items!
Your pages are rotated by 90°! As documented, however, all coordinates that are read or written are (or, respectively, must be) in unrotated coordinates.
The PDF CropBox itself is always in unrotated coordinates (in contrast to page.rect), so you need unrotated coords to set it.
So if you extract image bboxes on a page, they will appear as if the page had not been rotated. These are also the coordinates you need for setting the CropBox. If you do that and use this Rect(109.0, 241.76995849609375, 529.0, 550.219970703125), things will work fine.
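
If a rectangle was measured on the page as it is displayed (that is, after rotation), it has to be converted before it can be compared with image bboxes or used for the CropBox. A minimal sketch, assuming PyMuPDF's page.derotation_matrix property and a fitz.Rect argument; the helper name to_unrotated is hypothetical:

```python
def to_unrotated(page, rect_as_displayed):
    """Map a rectangle given in rotated (as displayed) coordinates
    to the unrotated coordinates of the page.

    `page` is expected to be a PyMuPDF Page and `rect_as_displayed`
    a fitz.Rect; multiplying a Rect by a Matrix returns the
    transformed rectangle.
    """
    return rect_as_displayed * page.derotation_matrix
```

For a page with rotation 0 this should return the rectangle unchanged, so the helper can be applied without first checking page.rotation.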

11 replies

yonglee7015 Dec 20, 2022
Author

Hi Jorj X. McKie, thanks for your help. I finally found that pdfplumber can solve my problem, because pdfplumber can export a cropped page to an image directly. So I set a fixed crop bbox on the page,
and then it always exports the image I want.

!apt install imagemagick

import pdfplumber

pdf = pdfplumber.open('/content/015-23C110164.pdf')
currentpage = pdf.pages[0]
currentpage_crop = currentpage.crop((120,130,700,500))
image_obj = currentpage_crop.to_image(resolution=300)
image_obj.save('dddd.png', format="PNG")

JorjMcKie Dec 20, 2022
Maintainer

Ah, it is only now that I understand what you actually wanted!

With the current solution you do not actually extract the underlying image of the PDF:
You are simply taking a "photo" of a certain page area.
That would have been even easier in PyMuPDF, if we had been communicating better:

pix = page.get_pixmap(dpi=300, clip=(120,130,700,500))
pix.save("dddd.png")

yonglee7015 Dec 20, 2022
Author

Sorry for my poor English; I could not express my meaning clearly. The image rects are so dynamic that this way should be the best.

JorjMcKie Dec 20, 2022
Maintainer

That's ok. BTW I am sure that PyMuPDF is faster here, too.

What put me on the wrong track was the talk about changing the cropbox. But this was not at all what you wanted: you needed a subimage of the full page, like a photo.

yonglee7015 Dec 21, 2022
Author

Thanks. It is faster and simpler. I need to read the PyMuPDF documentation carefully first.
