Callout labels are missing from extracted images #2784

arashrad · 2023-11-05T22:49:47Z

Hello everyone,

I'm currently working on extracting images from a PDF file, specifically a car user manual. I've run into an issue where some of the extracted images are missing their "callout labels." These labels do not show up as text, as images, or as masks.

A sample PDF page and the extracted image are attached.
I'd appreciate any insights or suggestions on what might be causing this issue and how to go about resolving it.
Thank you in advance for your help.

single_page.pdf

Answered by JorjMcKie

JorjMcKie · 2023-11-06T08:34:56Z

Maintainer

In your example, the small text-like items are vector graphics, drawn to look like numbers or symbols from the ZapfDingbats font.
Although you can extract vector graphics with PyMuPDF using page.get_drawings() (and also redraw them somewhere else), that is probably not the solution you are looking for. Presumably you need an image that includes those symbols.

To confirm that this is your situation, do this:

import fitz  # PyMuPDF

doc = fitz.open("single_page.pdf")  # the attached sample file
page = doc[0]
page.get_images()  # the image has xref 808
# -> [(808, 0, 697, 481, 8, 'DeviceGray', '', 'X17', 'FlateDecode')]
imgbbox = page.get_image_rects(808)  # where the image is shown on the page
imgbbox
# -> [Rect(58.20399856567383, 110.16300964355469, 392.99700927734375, 341.1340026855469)]

# check whether there are drawings inside the bbox of the image
subp = [p for p in page.get_drawings() if p["rect"] in imgbbox[0]]
len(subp)  # indeed: 57 vector graphics inside
# -> 57
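
As a rough sketch, the same check could be applied to every image on a page rather than to one known xref. The function name `count_overlay_drawings` is hypothetical and not part of PyMuPDF; it only relies on `page.get_images()`, `page.get_image_rects()` and `page.get_drawings()` as used above, and it assumes that "inside" means fully contained in one of the image's display rectangles.

```python
def count_overlay_drawings(page):
    """Map each image xref on `page` to the number of vector drawings
    lying fully inside one of that image's display rectangles.

    `page` is expected to be a PyMuPDF page object (assumption).
    """
    drawings = page.get_drawings()  # extract the vector graphics once
    counts = {}
    for img in page.get_images():
        xref = img[0]  # the first item of each entry is the image xref
        rects = page.get_image_rects(xref)  # an image may be shown more than once
        counts[xref] = sum(
            1 for d in drawings if any(d["rect"] in r for r in rects)
        )
    return counts
```

A non-zero count for an xref suggests that extracting that image alone will lose whatever is drawn on top of it. Drawings that only partly overlap the image are not counted by this containment test, so a looser overlap test may be preferable for labels that sit on the image border.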

The simplest solution is to make a pixmap of the part of the page that contains the image:

pix = page.get_pixmap(clip=imgbbox[0], dpi=300)  # make part page picture at desired dpi
pix.save("x.jpg")

This gives you an image of that page region with the callout labels included.
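
If several images on a page need the same treatment, the clip-and-render step could be wrapped in a loop. This is only a sketch: the function name `save_image_regions`, the `out_prefix` parameter and the file naming scheme are made up for illustration, and it assumes a PyMuPDF version in which `get_pixmap()` accepts the `dpi` argument, as in the call above.

```python
def save_image_regions(page, out_prefix="region", dpi=300):
    """Render the page area of every image (the image plus anything drawn
    over it) to a PNG file and return the list of file names written.

    `page` is expected to be a PyMuPDF page object (assumption).
    """
    written = []
    for img in page.get_images():
        xref = img[0]  # image xref
        for i, rect in enumerate(page.get_image_rects(xref)):
            # rasterize only this rectangle of the page
            pix = page.get_pixmap(clip=rect, dpi=dpi)
            name = f"{out_prefix}-{xref}-{i}.png"  # hypothetical naming scheme
            pix.save(name)
            written.append(name)
    return written
```

Because this renders the page rather than extracting the stored image, the result is resampled at the chosen dpi and includes everything the page shows inside that rectangle, not just the image and its labels.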

1 reply

arashrad Nov 6, 2023
Author

Awesome, thanks for your help Jorj! :)
