pixmap must be grayscale or rgb to write as png #1880

devpro9219 · 2020-03-18T16:02:33Z

devpro9219
Mar 18, 2020

Hi. I tried to follow this guide:
https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/extract-imga.py

When I try to read all images from the PDF, I get this error:

mupdf: pixmap must be grayscale or rgb to write as png
Traceback (most recent call last):
  File "image.py", line 97, in <module>
    imgdata = pix.getPNGData()
  File "/home/qwe/ocr-env/lib/python3.6/site-packages/fitz/fitz.py", line 4170, in getPNGData
    barray = self._getImageData(1)
  File "/home/qwe/ocr-env/lib/python3.6/site-packages/fitz/fitz.py", line 4151, in _getImageData
    return _fitz.Pixmap__getImageData(self, format)
RuntimeError: pixmap must be grayscale or rgb to write as png

Do you have any idea what is wrong, or can you help me?
I have attached the PDF file that I tried.
Thank you.
BCBSMI EOB.pdf


JorjMcKie · 2020-03-18T17:45:15Z

JorjMcKie
Mar 18, 2020
Maintainer

There are obviously images in the PDF with more than 3 color components. You must either store those in some CMYK image format (PNG won't work) or convert them to RGB first. I'm travelling at the moment, so please have a look at the example extract-imga.py, where this is done, too.


devpro9219 · 2020-03-18T18:27:35Z

devpro9219
Mar 18, 2020
Author

Hi, I've already run https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/extract-imga.py and it shows the error above.

Please advise.
Thanks


JorjMcKie · 2020-03-18T19:11:38Z

JorjMcKie
Mar 18, 2020
Maintainer

Ok, I am back home now. Let me check your PDF.


devpro9219 · 2020-03-18T19:26:06Z

devpro9219
Mar 18, 2020
Author

Thank you.


JorjMcKie · 2020-03-18T19:29:35Z

JorjMcKie
Mar 18, 2020
Maintainer

Aha, resolved the issue:

The latest PyMuPDF also accepts the ICC color system. Therefore, colorspaces may be present that have the right number of color components but are still neither DeviceGray nor DeviceRGB. This required an adjustment of extract-imga.py. Here is an update:
extract-imga.zip
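
A minimal sketch of the idea described above, not the content of the updated extract-imga.py. It assumes a recent PyMuPDF, where `Pixmap.tobytes("png")` takes the place of the `getPNGData()` call seen in the traceback. The names `png_safe` and `image_as_png` are made up for illustration. The decision is made on the colorspace name rather than on the number of components, and anything that is not plain gray or RGB is converted to RGB before writing PNG.

```python
# Colorspace names that PNG output accepts directly.
PNG_SAFE_COLORSPACES = ("DeviceGray", "DeviceRGB")


def png_safe(colorspace_name):
    """Return True if a pixmap with this colorspace name can be written as PNG as-is."""
    return colorspace_name in PNG_SAFE_COLORSPACES


def image_as_png(doc, xref):
    """Return PNG bytes for the image at `xref`, converting to RGB when needed."""
    import fitz  # PyMuPDF; imported here so png_safe() is usable without it

    pix = fitz.Pixmap(doc, xref)
    # Pixmaps without a colorspace (e.g. stencil masks) are passed through unchanged.
    if pix.colorspace is not None and not png_safe(pix.colorspace.name):
        pix = fitz.Pixmap(fitz.csRGB, pix)  # CMYK, ICC-based, ... -> RGB
    return pix.tobytes("png")
```

With an open document, `image_as_png(doc, xref)` would be called once per image xref taken from the page's image list.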


devpro9219 · 2020-03-18T19:46:04Z

devpro9219
Mar 18, 2020
Author

Thank you. It seems to work.
However, I have one more question: it doesn't seem to handle the masked areas in the PDF.
Also, some images were not extracted. Do you have any sample that handles those cases with your library?
Thank you.


devpro9219 · 2020-03-18T19:47:12Z

devpro9219
Mar 18, 2020
Author

This is the part that the library can't detect.
Please advise.
Thank you.


JorjMcKie · 2020-03-18T20:59:47Z

JorjMcKie
Mar 18, 2020
Maintainer

Your PDF is a complex example! You almost have to rewrite a PDF viewer for a full analysis. Here are my findings:

The images extracted by the script are the images which really exist - there are no other ones.
The red rectangle in screenshot 3 is not an image but a so-called XObject (a page embedded from another PDF). I have added a script which is able to extract those things ...
Other elements that look graphical are not images but drawings generated by the PDF-internal mini-language (similar to PostScript). The red frame in your last post is such a non-image. There is no way to extract these things (except, as mentioned, by writing a PDF viewer-like script).

Here is a more advanced text extraction script, which should extract the text in the correct reading sequence:
doubles.zip

Here is a script to extract XObjects (like the red box on page 2):
xobj-extract.zip


devpro9219 · 2020-03-18T21:35:10Z

devpro9219
Mar 18, 2020
Author

Hi. Thanks for your advice. However, I need to get exactly the same image as shown in the PDF.
Here is my question.

How can I get the position, width and height of an image on the PDF page?
For example, in attachment 2, the PDF shows only a part of the image.
However, I get the full document image (attached).
Also, the color of the computer image is not the same as in the PDF.

Is there any way to detect the mask or effect applied to each image?
Thanks


devpro9219 · 2020-03-18T21:59:12Z

devpro9219
Mar 18, 2020
Author

I've also read the full documentation, but I couldn't figure out how to get the x, y, width and height information for an image.
Is it possible?
Thanks!


JorjMcKie · 2020-03-19T09:07:21Z

JorjMcKie
Mar 19, 2020
Maintainer

I don't know your full motivation behind all this. But here are a few hints that may help:

import fitz  # PyMuPDF

doc = fitz.open("BCB...")
page = doc[1]  # page 2 (page numbers are 0-based)
imglist = doc.getPageImageList(1, True)  # full image list of that page
bbox = page.getImageBbox(imglist[0])  # bounding box of img-46.png on the page
# just to demonstrate we do have it:
page.addRectAnnot(bbox)  # gives this:
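
As a rough sketch of how the bounding box answers the x, y, width, height question: the snippet below uses the snake_case method names of newer PyMuPDF releases (`Page.get_images`, `Page.get_image_bbox`) instead of the v1.16 names above. The helper names `rect_to_xywh` and `image_boxes` are hypothetical. Coordinates are in PDF points on the unrotated page.

```python
def rect_to_xywh(rect):
    """Convert an (x0, y0, x1, y1) rectangle into (x, y, width, height)."""
    x0, y0, x1, y1 = tuple(rect)
    return (x0, y0, x1 - x0, y1 - y0)


def image_boxes(pdf_path, page_number):
    """Return {xref: (x, y, width, height)} for the images listed on one page."""
    import fitz  # PyMuPDF; imported here so rect_to_xywh() is usable without it

    doc = fitz.open(pdf_path)
    page = doc[page_number]  # 0-based page number
    boxes = {}
    for item in page.get_images(full=True):
        bbox = page.get_image_bbox(item)  # where the page displays this image
        boxes[item[0]] = rect_to_xywh(bbox)  # item[0] is the image xref
    doc.close()
    return boxes
```

Note that the box describes where the image is placed on the page, not which part of it remains visible after clipping or masking.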

When we extract the images, the mask is automatically detected and applied! My own extraction produced this. Not 100% the same colors, but good. The difference is probably caused by the conversion to RGB.

You can also try to not convert to RGB in that script. Use an image format which supports CMYK like PAM or Photoshop image (PSD):

...
if pix.colorspace.name not in (fitz.csGRAY.name, fitz.csRGB.name):
    pix.writeImage("xxxx.pam")
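
The same idea written out a little further, as a sketch under assumptions: it uses `Pixmap.save()` from newer PyMuPDF releases in place of `writeImage()`, assumes the image has a colorspace (stencil masks do not), and the helper names `extension_for` and `save_unconverted` are invented for this example.

```python
def extension_for(colorspace_name):
    """Pick an output extension: PNG for plain gray/RGB, PAM for everything else."""
    if colorspace_name in ("DeviceGray", "DeviceRGB"):
        return "png"
    return "pam"  # PAM can hold CMYK, so no conversion to RGB is needed


def save_unconverted(doc, xref, stem):
    """Save the image at `xref` without colorspace conversion; return the filename."""
    import fitz  # PyMuPDF; imported here so extension_for() is usable without it

    pix = fitz.Pixmap(doc, xref)
    filename = f"{stem}.{extension_for(pix.colorspace.name)}"
    pix.save(filename)  # output format is inferred from the extension
    return filename
```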

But in your case this does not work either. If you use the MuPDF command line tool mutool extract <infile.pdf> (which extracts images and fonts), the results are even worse.

So I guess that is what you can get from me ...

Note: you should use PyMuPDF v1.16.13! It contains ICC color support and your problem image needs this for optimal rendering.


JorjMcKie · 2020-03-22T12:15:38Z

JorjMcKie
Mar 22, 2020
Maintainer

@devpro9219 - Assuming your questions were answered.
Please do not hesitate to re-open or open new issues.


aleem75321 · 2022-08-18T10:29:07Z

aleem75321
Aug 18, 2022

Hi, this script is working fine for me, but it extracts the images in gray. In my PDF file all images are in CMYK format. Can you help me solve this?


JorjMcKie · 2022-08-18T10:52:35Z

JorjMcKie
Aug 18, 2022
Maintainer

> Hi, this script is working fine for me, but it extracts the images in gray. In my PDF file all images are in CMYK format. Can you help me solve this?

You must find a file format that supports CMYK. A select few are directly supported by Pixmap.save(). If none of them suits you, check the Pillow documentation for one that does, and then use Pixmap.pil_save(...) with the right parameters. Again, please consult the Pillow documentation to choose the right parameters in place of "...".

1 reply

JorjMcKie Aug 18, 2022
Maintainer

You also do not need a pixmap as an intermediate step if you already have the image file or its binary content via doc.extract_image(xref).
In that case, simply use Pillow directly.
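
A hedged sketch of that Pillow route. It assumes Pillow is installed and that TIFF is an acceptable CMYK-capable target; the thread does not name a format, so that choice and the helper names `target_format` and `save_embedded_image` are assumptions made for this example.

```python
import io


def target_format(mode):
    """Pick a Pillow output format: TIFF keeps CMYK, PNG for the usual gray/RGB modes."""
    return "TIFF" if mode == "CMYK" else "PNG"


def save_embedded_image(doc, xref, stem):
    """Save the embedded image at `xref` via Pillow, keeping CMYK data as CMYK."""
    from PIL import Image  # Pillow; imported here so target_format() is usable without it

    info = doc.extract_image(xref)  # dict; the raw image bytes are under "image"
    img = Image.open(io.BytesIO(info["image"]))
    fmt = target_format(img.mode)
    filename = f"{stem}.{fmt.lower()}"
    img.save(filename, format=fmt)
    return filename
```

If the embedded file format is already what you want, writing `info["image"]` to disk with the extension given in `info["ext"]` avoids re-encoding altogether.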
