Callout labels are missing from extracted images #2784
Hello everyone, I'm currently working on extracting images from a PDF file, specifically a car user manual. I've encountered an issue where some of the extracted images are missing their "callout labels." These labels appear neither as text nor as images or masks. Samples of the PDF page and the extracted image are attached.
Replies: 1 comment 1 reply
In your example, the little text-like things are vector graphics. They are built in a way to look like numbers or symbols from the ZapfDingbats font. To find out whether you have this situation, do this:

    >>> page = doc[0]
    >>> page.get_images()  # image has xref 808
    [(808, 0, 697, 481, 8, 'DeviceGray', '', 'X17', 'FlateDecode')]
    >>> imgbbox = page.get_image_rects(808)
    >>> imgbbox
    [Rect(58.20399856567383, 110.16300964355469, 392.99700927734375, 341.1340026855469)]
    >>> # check if there are drawings inside bbox of image
    >>> subp = [p for p in page.get_drawings() if p["rect"] in imgbbox[0]]
    >>> len(subp)  # indeed: 57 vector graphics inside
    57

The simplest way is to make a pixmap of the part of the page that contains the image:

    >>> pix = page.get_pixmap(clip=imgbbox[0], dpi=300)  # render that part of the page at the desired dpi
    >>> pix.save("x.jpg")
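For anyone who wants to apply this to every image on a page, here is a minimal sketch of the same idea as a script. The names `rect_contains` and `render_images_with_overlays`, the output file naming, and the PNG format are my own assumptions for illustration, not part of PyMuPDF. The containment test is a plain coordinate comparison and may differ slightly from PyMuPDF's own `in` operator for empty rectangles.

```python
def rect_contains(outer, inner):
    """Return True if rectangle `inner` lies entirely inside `outer`.

    Both arguments are (x0, y0, x1, y1) sequences; fitz.Rect objects
    can be unpacked the same way.
    """
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def render_images_with_overlays(pdf_path, page_number=0, dpi=300, prefix="img"):
    """Render each image area on a page that has vector drawings on top of it.

    Hypothetical helper: returns the list of file names it wrote.
    """
    import fitz  # PyMuPDF; imported here so rect_contains works without it

    saved = []
    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        drawings = page.get_drawings()
        for item in page.get_images():
            xref = item[0]  # first entry of each item is the image xref
            for n, bbox in enumerate(page.get_image_rects(xref)):
                inside = [d for d in drawings if rect_contains(bbox, d["rect"])]
                if not inside:
                    continue  # no drawings over this image; plain extraction suffices
                # Render the page area so the drawings are part of the picture.
                pix = page.get_pixmap(clip=bbox, dpi=dpi)
                name = f"{prefix}-{xref}-{n}.png"
                pix.save(name)
                saved.append(name)
    return saved
```

Calling `render_images_with_overlays("manual.pdf")` (the file name is a placeholder) would write one PNG per affected image on the first page. Rendering the clip picks up everything drawn in that area, including any text that overlaps the image.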
While you can extract vector graphics with PyMuPDF (and also redraw them somewhere else) using page.get_drawings(), that is probably not the solution you are looking for. Presumably you need an image that includes those symbols.