Extract image with annotation? #2406

TunaFFish · 2023-05-12T12:10:29Z

TunaFFish
May 12, 2023

Hello, amazing PDF library, thanks so much 👍

For studying I am looking for a way to make a summary of my annotations.
I found an example on stackoverflow on how to extract highlighted text, amazing how PyMuPDF can do this task.
I am a Python newbie and just DIY programming so bear with me..

Apart from PDF_ANNOT_HIGHLIGHT I tried PDF_ANNOT_UNDERLINE and PDF_ANNOT_TEXT for Sticky Notes with annot.info["content"]. I even got the color with annot.colors["stroke"] - just fantastic!

What's next on my list is to extract certain images - by annotation.
Is this possible? Maybe annotate a freehand shape around the images or write some text on it to identify which one?
My annotating device is a 10 year old iPad with Adobe Acrobat installed; the only annot tools apart from Highlight, Underline and Strikethrough would be Text or Freehand (Pencil Tool).

Can you give me a tip on how to do this?
I already played around with doc.get_page_images() and then doc.extract_image(). I just need to pass the xref somehow. I think I need to find the position (coordinates?) of such an annotation to point to the specific image and get its metadata?

Any help appreciated, cheers!

EDIT: I just realized I am in the UTILITIES sub, sorry. Please move this up to the PYMUPDF discussions, thank you.

Answered by JorjMcKie


JorjMcKie · 2023-05-12T12:44:40Z

JorjMcKie
May 12, 2023
Maintainer

We are pleased you like PyMuPDF!

You can determine images shown on a page in various ways. The handiest one probably is page.get_image_info(xrefs=True), which returns a list of dictionaries. Each dict describes one image with a lot of meta information, including the bbox it occupies on the page. So you can select the image contained in / covered by your marker annot (which always has the attribute rect).
If you use the xrefs option as indicated, each dict also contains the "xref" of the image, which you can use for extraction.
Note however, that - irritatingly perhaps - not all images have an xref! In this case, the number is zero.

You can still extract it - just not by xref number. Please come back for more advice if that happens.
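
To illustrate the selection step, here is a minimal sketch that picks the image whose bbox overlaps an annotation rectangle the most. Choosing by overlap area is my own assumption (it covers both "text written on the image" and "freehand shape drawn around the image"); the helper names and the sample data are hypothetical, and the dicts only mimic the "bbox" and "xref" keys returned by page.get_image_info(xrefs=True).

```python
def overlap_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) rectangles."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def image_under_annotation(image_infos, annot_rect):
    """Return the image info dict that overlaps annot_rect most, or None."""
    best = max(
        image_infos,
        key=lambda info: overlap_area(info["bbox"], annot_rect),
        default=None,
    )
    if best is None or overlap_area(best["bbox"], annot_rect) == 0:
        return None  # no image touches the annotation
    return best


# Hypothetical sample data shaped like page.get_image_info(xrefs=True) output.
sample_infos = [
    {"xref": 12, "bbox": (50, 50, 250, 200)},
    {"xref": 0, "bbox": (50, 300, 250, 450)},  # xref 0: image without an xref
]
hit = image_under_annotation(sample_infos, (100, 80, 150, 100))
if hit is not None and hit["xref"] > 0:
    print("extract via doc.extract_image({})".format(hit["xref"]))
```

In a real script, image_infos would come from page.get_image_info(xrefs=True) and annot_rect from annot.rect; a fitz.Rect can be indexed like a tuple, so it should work here unchanged.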

Another potential hiccup, or rather a thing to watch out for:
If you extract an image via doc.extract_image(xref), make sure to also look at the "smask" key in the returned dictionary. If it is > 0, then it is the xref of the image mask, which takes care of transparency. In that case you must put together the original image yourself by joining the two pixmaps (base image, mask). Please look up the documentation for the details.
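
A rough sketch of the mask handling described above, assuming a PyMuPDF version that offers the fitz.Pixmap(base, mask) constructor and Pixmap.tobytes(). The function name extract_with_mask is hypothetical, and the sketch ignores special cases such as a mask whose size differs from the base image or a base image that already has an alpha channel.

```python
def extract_with_mask(doc, xref):
    """Return (image_bytes, extension) for an image, merging its mask if any."""
    img = doc.extract_image(xref)
    smask = img.get("smask", 0)
    if smask == 0:
        # No transparency mask: the extracted bytes are the complete image.
        return img["image"], img["ext"]

    # Imported here so the no-mask path works without touching pixmaps.
    import fitz  # PyMuPDF

    base = fitz.Pixmap(img["image"])                       # base image
    mask = fitz.Pixmap(doc.extract_image(smask)["image"])  # transparency mask
    combined = fitz.Pixmap(base, mask)                     # base + alpha
    # PNG keeps the alpha channel, so the result is always returned as PNG.
    return combined.tobytes("png"), "png"
```

The returned bytes can be written to a file named with the returned extension, in the same way as the plain doc.extract_image() result.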

0 replies

TunaFFish · 2023-05-12T14:30:48Z

TunaFFish
May 12, 2023
Author

Thanks so much for the help, much appreciated!
Without taking into account yet your warnings about missing xrefs and smasks, I got it working like this:

import fitz  # PyMuPDF


def handle_page(page, doc):
    annot = page.first_annot
    while annot:
        # PDF_ANNOT_FREE_TEXT (type 2): free text written on top of an image.
        # The text "save" marks the image to be saved to disk.
        if annot.type[0] == fitz.PDF_ANNOT_FREE_TEXT:
            annot_rect = annot.rect
            # check every image on the page, not only the first one
            for info in page.get_image_info(xrefs=True):
                image_bbox = fitz.Rect(info["bbox"])
                image_xref = info["xref"]
                # skip images that do not contain the annotation rectangle
                if not image_bbox.contains(annot_rect):
                    continue
                # one more check: the free text must read "save"
                if annot.info["content"] == "save":
                    print("image to extract: {}".format(image_xref))
                    img_data_dict = doc.extract_image(image_xref)
                    # write out
                    with open(f"image-{image_xref}.{img_data_dict['ext']}", "wb") as img_out:
                        img_out.write(img_data_dict["image"])
                else:
                    print("Free Text is on top of the image, but the content is not 'save'")

        annot = annot.next

Thanks again! 🙏😎

1 reply

JorjMcKie May 12, 2023
Maintainer

Good stuff!
As a performance optimization I would recommend though to only do page.get_image_info() once per page - as opposed to multiple times if you have multiple annotations on a page (which may not happen in your case at all).

Also there exists an iterator over a page's annotations, which you can use like this:

for annot in page.annots(types=[2]):
    ...

This saves you a few code lines.
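
Put together, the two suggestions could look roughly like the sketch below. The function name xrefs_marked_for_saving and the keyword parameter are hypothetical; the containment test is written out by hand and relies on annot.rect and the "bbox" entries unpacking into four coordinates.

```python
def xrefs_marked_for_saving(page, keyword="save"):
    """Return xrefs of images carrying a free-text annotation reading `keyword`."""
    # Query the image list once per page, not once per annotation.
    image_infos = page.get_image_info(xrefs=True)
    marked = []
    # types=[2] restricts the iterator to PDF_ANNOT_FREE_TEXT annotations.
    for annot in page.annots(types=[2]):
        if annot.info["content"] != keyword:
            continue
        ax0, ay0, ax1, ay1 = annot.rect
        for info in image_infos:
            ix0, iy0, ix1, iy1 = info["bbox"]
            # Keep the image if its bbox fully contains the annotation rectangle.
            if ix0 <= ax0 and iy0 <= ay0 and ax1 <= ix1 and ay1 <= iy1:
                marked.append(info["xref"])
    return marked
```

Each returned xref that is greater than zero can then be passed to doc.extract_image(); a zero means the image has no xref and needs different handling.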
