page.get_images() may even list images that the page does not use at all, let alone answer what you are asking.

This is because that method looks at PDF definitions only and does not inspect the page's appearance source code.

To at least restrict the above list to images that are actually referenced by this page anywhere on the MediaBox, do a page.clean_contents() first.
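A minimal sketch of that step, to make the effect visible: it lists the images before and after page.clean_contents() and reports which entries disappeared. The function name images_before_and_after_clean is made up for illustration; only get_images() and clean_contents() are PyMuPDF calls. Note that clean_contents() rewrites the page in memory, so work on a copy if that matters.

```python
def images_before_and_after_clean(page):
    """Return (images still listed after cleaning, images that were dropped).

    `page` is assumed to be a PyMuPDF Page object.
    """
    before = page.get_images()
    page.clean_contents()  # modifies the in-memory page
    after = page.get_images()
    # Compare by xref (item[0]) rather than by whole tuple.
    kept_xrefs = {item[0] for item in after}
    dropped = [item for item in before if item[0] not in kept_xrefs]
    return after, dropped

# Usage (assumes PyMuPDF is installed and "input.pdf" is a placeholder name):
#   import pymupdf
#   doc = pymupdf.open("input.pdf")
#   after, dropped = images_before_and_after_clean(doc[0])
#   print(len(after), "referenced;", len(dropped), "listed but unused")
```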

To restrict that list to visible images (CropBox) do this:

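imglist = page.get_images()
# get_image_rects() returns [] for an image the page never draws, so guard before indexing
visibles = [item for item in imglist if (rects := page.get_image_rects(item[0])) and rects[0] in page.cropbox]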

get_image_rects walks through the page's appearance instructions to determine every bbox at which one image (given by its xref, item[0]) is shown on the page. In…
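Since one image can be placed several times, it may help to look at all of its rectangles and not only the first. The following is a rough sketch under that assumption, not part of the original answer: classify_images is a hypothetical helper that sorts image xrefs into fully inside the CropBox, partly overlapping it, and not shown there at all. Like the snippet above, it compares the rectangles directly against page.cropbox and assumes both are in the same coordinate system.

```python
def classify_images(page):
    """Sort image xrefs by how they relate to the page's CropBox.

    `page` is assumed to be a PyMuPDF Page object.
    Returns three lists of xrefs: (inside, partial, outside).
    """
    crop = page.cropbox
    inside, partial, outside = [], [], []
    for item in page.get_images():
        xref = item[0]
        rects = page.get_image_rects(xref)  # one Rect per placement, [] if never drawn
        if any(r in crop for r in rects):
            inside.append(xref)    # at least one placement fully within the CropBox
        elif any(r.intersects(crop) for r in rects):
            partial.append(xref)   # only overlapping placements
        else:
            outside.append(xref)   # never drawn, or drawn only outside the CropBox
    return inside, partial, outside

# Usage (assumes PyMuPDF is installed and "input.pdf" is a placeholder name):
#   import pymupdf
#   page = pymupdf.open("input.pdf")[0]
#   inside, partial, outside = classify_images(page)
```

Whether a partly overlapping image counts as "visible" depends on what you need, which is why it is kept as a separate group here.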

Answer selected by yonglee7015