Recovery Image in PDF bug #2492

ericosmic · 2023-06-25T06:19:32Z

OS: Ubuntu
fitz version: ('1.20.2', '1.20.3', '20220813000001')

Bug description: The PDF file contains some images, but the data extracted for each image contains only color, not the words shown in it. The PDF is attached below:
zf1.pdf

Answered by JorjMcKie

JorjMcKie · 2023-06-25T09:00:21Z

Maintainer

Page 0 contains 4 images, each with a soft mask (SMask) whose resolution differs from that of its base image.
PyMuPDF does not support combining an image with a mask of a different size.

In [1]: import fitz
In [2]: doc=fitz.open("zf1.pdf")
In [3]: page=doc[0]
In [4]: page.get_images()
Out[4]:
[(28, 29, 2, 2, 1, 'Indexed', '', 'Image28', ''),
 (39, 40, 2, 2, 1, 'Indexed', '', 'Image39', ''),
 (41, 42, 2, 2, 1, 'Indexed', '', 'Image41', ''),
 (43, 44, 2, 2, 1, 'Indexed', '', 'Image43', '')]
In [5]: print(doc.xref_object(28)) # base image
<<
  /Type /XObject
  /Subtype /Image
  /Width 2  # <===
  /Height 2 # <===
  /ColorSpace [ /Indexed /DeviceRGB 1 <FF0000FFFFFF> ]
  /BitsPerComponent 1
  /Interpolate false
  /SMask 29 0 R
  /Length 2
>>
In [6]: print(doc.xref_object(29))  # SMask
<<
  /Type /XObject
  /Subtype /Image
  /Width 3186 # <===
  /Height 401 # <===
  /ColorSpace /DeviceGray
  /BitsPerComponent 1
  /Filter /FlateDecode
  /Length 10321
>>
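
The same comparison can be automated for every image on a page. This is only a rough sketch: the helper names image_size and mismatched_masks are made up for illustration, and the regular expressions assume /Width and /Height appear as plain integers in the object text, as in the dumps above. It uses only page.get_images() and doc.xref_object() from the session above, and relies on the second item of each get_images() entry being the SMask xref (0 if there is none).

```python
import re


def image_size(obj_text):
    """Parse (width, height) from the text of an image object dictionary."""
    width = re.search(r"/Width\s+(\d+)", obj_text)
    height = re.search(r"/Height\s+(\d+)", obj_text)
    if width is None or height is None:
        raise ValueError("no /Width or /Height entry found")
    return int(width.group(1)), int(height.group(1))


def mismatched_masks(doc, page):
    """Return (xref, size, smask_xref, smask_size) for images whose mask size differs."""
    result = []
    for item in page.get_images():
        xref, smask = item[0], item[1]
        if smask == 0:  # this image has no soft mask
            continue
        size = image_size(doc.xref_object(xref))
        mask_size = image_size(doc.xref_object(smask))
        if size != mask_size:
            result.append((xref, size, smask, mask_size))
    return result
```

Called as mismatched_masks(doc, page) in the session above, it should report image 28 as 2 x 2 against its mask 29 at 3186 x 401, matching the two dumps.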

In this case, PyMuPDF offers no way to combine the base image with its SMask.
You can, however, extract the mask itself (for instance xref 29) as a separate image and save it as a PNG, which will then show the text that is visible on the page.

ericosmic Jun 30, 2023
Author

Thanks for your reply. It is a big help.
