Can I convert from pix to PIL without saving to disk? #1678

alejandrofm · 2019-07-17T17:26:35Z

Hi! I am currently doing this to extract images from PDFs and then feeding those to tesseract-OCR. It seems that I'm wasting time, losing quality and not optimizing the code when I write a PNG and read the same PNG back in the next line. Is there any way to feed a pixmap directly to a PIL Image object?

            pix = page.getPixmap(matrix=mat, alpha=False)  # render page to an image
            saving_name = filepath.split('\\')[-1].replace('pdf', '') # extract name from file
            pix.writePNG("page{}{}.png".format(saving_name, page.number))  # store image as a PNG
            imagen = Image.open("page{}{}.png".format(saving_name, page.number))  # open image
            imagen.show() #show image

Thank you!

JorjMcKie (Maintainer) · 2019-07-17T20:45:44Z

Sure there is!
There are even several ways to avoid writing to disk:

1. Convert the pixmap to an encoded image held in a bytes object: data = pix.getImageData("format"). Several "format" values are available, trading the size of data against creation speed. Then open it as a PIL image with img = Image.open(io.BytesIO(data), ...).
2. Create the PIL image directly from the pixmap's raw samples: img = Image.frombytes(mode, [pix.width, pix.height], pix.samples). In your case, mode should probably be "RGB".
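
The two options above can be sketched roughly as follows. This is a minimal illustration rather than code from the thread: the helper names are made up, and a small stand-in object replaces a real pixmap so the snippet runs without a PDF. With PyMuPDF, pix would come from page.getPixmap(...) and the encoded bytes from pix.getImageData("png").

```python
import io
from types import SimpleNamespace

from PIL import Image


def pil_from_encoded(data):
    """Option 1: decode an encoded image (e.g. PNG bytes) held in memory."""
    img = Image.open(io.BytesIO(data))
    img.load()  # force decoding now, while the buffer is still around
    return img


def pil_from_samples(pix, mode="RGB"):
    """Option 2: build the image straight from the raw pixel buffer."""
    return Image.frombytes(mode, (pix.width, pix.height), pix.samples)


# Hypothetical stand-in for a pixmap: 4 x 2 pixels, all pure red, no alpha.
fake_pix = SimpleNamespace(width=4, height=2, samples=bytes([255, 0, 0] * 8))

img2 = pil_from_samples(fake_pix)

# Stand-in for pix.getImageData("png"): encode the same pixels with Pillow.
buf = io.BytesIO()
img2.save(buf, format="PNG")
img1 = pil_from_encoded(buf.getvalue())
```

Neither route touches the file system. Option 2 skips the encode/decode round trip, which is presumably why it tends to be the faster of the two.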

To improve the OCR detection, you can render the PDF page at higher quality by using a matrix in Page.getPixmap(). E.g. matrix = fitz.Matrix(2,2) will increase resolution by a factor of 2 in each (x, y) direction, thus yielding a pixmap with 4 times as many pixels.
But if the PDF pages consist of images (which I assume they do), this may have only a limited effect on OCR recognition. If, on the other hand, the page contains normal text, there is no need to use OCR at all ...
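
The "factor of 2 in each direction, 4 times the pixels" arithmetic can be checked with a few lines of plain Python. This is only a sketch: the function name and the US Letter page size are illustrative choices, and it assumes that at zoom 1 one PDF point maps to one pixel.

```python
def rendered_size(page_width_pt, page_height_pt, zoom):
    """Approximate pixmap size in pixels for a uniform zoom factor."""
    return round(page_width_pt * zoom), round(page_height_pt * zoom)


# US Letter page, 612 x 792 points (an illustrative choice).
w1, h1 = rendered_size(612, 792, 1)
w2, h2 = rendered_size(612, 792, 2)

# Doubling both dimensions quadruples the pixel count.
pixel_ratio = (w2 * h2) / (w1 * h1)
print((w1, h1), (w2, h2), pixel_ratio)
```

Memory use and OCR processing time grow with the pixel count, so the zoom factor is worth tuning rather than maximizing.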

1 reply

Aspace2create Jan 12, 2023

line 146, in keyPressEvent
data=pxm.getImageData("PNG")
AttributeError: 'QPixmap' object has no attribute 'getImageData'

alejandrofm (Author) · 2019-07-17T21:07:20Z

Thank you @JorjMcKie, it's a single image per page and I'm already using matrix 2,2 with great results.
Do you recommend either option (1 or 2), or any particular parameters, for the conversion to PIL?
Thanks again!

JorjMcKie (Maintainer) · 2019-07-17T21:47:22Z

Option 2 is about one third faster than option 1 with a similar memory footprint ...
I just measured 10 loops, best of 3: 42.3 ms per loop (option 2) versus 10 loops, best of 3: 64.5 ms per loop (option 1) for a large pixmap with a samples size of about 23 MB.

Option 1 is important for feeding image data to some packages, most notably tkinter. In the latter case you would use pix.getImageData("ppm") as input to tk.PhotoImage in a Python version independent manner.
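
For readers unfamiliar with the "ppm" format mentioned above: binary PPM (P6) is simply a short text header followed by the raw RGB bytes, which makes it cheap to produce. The sketch below builds such data by hand from raw samples to show the layout. The function name is made up, and this is not how PyMuPDF generates its output.

```python
def samples_to_ppm(width, height, samples):
    """Wrap raw 8-bit RGB samples (no alpha) in a binary PPM (P6) container."""
    if len(samples) != width * height * 3:
        raise ValueError("expected width * height * 3 bytes of RGB data")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(samples)


# Two pixels side by side: one red, one blue.
ppm = samples_to_ppm(2, 1, bytes([255, 0, 0, 0, 0, 255]))
```

In a tkinter program, bytes of this kind would be handed to tk.PhotoImage(data=...) once a Tk root window exists; that step is left out here because it needs a display.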

alejandrofm (Author) · 2019-07-18T18:45:57Z

Thank you A LOT @JorjMcKie, it works great. Could I run into problems if the source image is grayscale or black and white? Will it "scale" to RGB?
Thanks!

JorjMcKie (Maintainer) · 2019-07-18T21:42:38Z

You are welcome!
No, you can freely choose the resulting colorspace when using getPixmap. With fitz.csGRAY the pixmap will of course be smaller (only one third the size of RGB), which may be entirely sufficient for OCR.
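
The "one third of RGB" figure follows from the bytes per pixel: 3 for RGB, 1 for grayscale (both without alpha). A small sketch of the Pillow side, using hypothetical stand-in objects instead of real pixmaps so that it runs without a PDF:

```python
from types import SimpleNamespace

from PIL import Image

width, height = 4, 2

# Stand-ins for pixmaps rendered as RGB (n=3) and as grayscale (n=1).
rgb_pix = SimpleNamespace(width=width, height=height, n=3,
                          samples=bytes([200] * (width * height * 3)))
gray_pix = SimpleNamespace(width=width, height=height, n=1,
                           samples=bytes([200] * (width * height)))

# Grayscale needs one byte per pixel instead of three.
size_ratio = len(gray_pix.samples) / len(rgb_pix.samples)

# Pillow's 8-bit grayscale mode is "L"; the buffer length must match the mode.
gray_img = Image.frombytes("L", (gray_pix.width, gray_pix.height), gray_pix.samples)

# If a later step insists on RGB, Pillow can expand the single channel.
rgb_again = gray_img.convert("RGB")
```

The mode string has to agree with the pixmap's colorspace: passing "RGB" for a grayscale buffer makes Image.frombytes raise an error because there is not enough data.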

alejandrofm (Author) · 2019-07-19T13:34:00Z

I'm VERY happy with the result, currently using fitz.csGRAY and mode R. With your help I improved the following:

Avoid creating a separate file for each page and writing it to disk.
Avoid reading the file from disk in the next line of code.
Lower size of the file I'm working on.

Thank you a lot!

JorjMcKie (Maintainer) · 2019-07-19T14:35:33Z

close issue?

rozeappletree · 2021-05-16T11:29:07Z

@alejandrofm

I'm VERY happy with the result, currently using fitz.csGRAY and mode R. With your help I improved the following:

Avoid creating a separate file for each page and writing it to disk.

Avoid reading the file from disk in the next line of code.

Lower size of the file I'm working on.

Thank you a lot!

Can you please give a code snippet for fitz.csGRAY and mode R?

JorjMcKie (Maintainer) · 2021-05-16T11:58:11Z

@rakesh4real @alejandrofm -
Just wanted to let you know that recent PyMuPDF versions have new methods that support PIL / Pillow directly, so you no longer need to fiddle around with pixmap attributes to achieve this:

Pixmap.pillowSave(...) saves a pixmap to an image file using Pillow. The arguments are the same as for Image.save(...) in Pillow. This makes it possible to output JPEG images, for example, or any other format supported by Pillow.
Note that this method will be renamed to Pixmap.pil_save() in version 1.18.14.
Similarly, Pixmap.pillowData(...) returns a bytes object using Pillow. It otherwise supports the same arguments.
Again, it will be renamed in version 1.18.14, to Pixmap.pil_tobytes().
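
To make the idea concrete, here is a rough sketch of what a "pixmap to encoded bytes via Pillow" helper amounts to. It is not PyMuPDF's actual implementation: the function name is made up, and a stand-in object replaces a real pixmap so the snippet runs without a PDF.

```python
import io
from types import SimpleNamespace

from PIL import Image


def pixmap_to_encoded_bytes(pix, mode="RGB", **save_kwargs):
    """Encode raw pixmap samples with Pillow and return the result as bytes."""
    img = Image.frombytes(mode, (pix.width, pix.height), pix.samples)
    buf = io.BytesIO()
    img.save(buf, **save_kwargs)  # same keyword arguments as Image.save
    return buf.getvalue()


# Hypothetical stand-in for an 8 x 8 RGB pixmap without alpha.
fake_pix = SimpleNamespace(width=8, height=8, samples=bytes([128, 64, 32] * 64))

jpeg_bytes = pixmap_to_encoded_bytes(fake_pix, format="JPEG", quality=85)
```

With a real pixmap and a sufficiently recent PyMuPDF, the methods described above should make this helper unnecessary, for example pix.pil_tobytes(format="JPEG") for bytes or pix.pil_save("page.jpg") for a file.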

canklot · 2022-04-18T09:30:02Z

For those who want a numpy array rather than a PIL.Image:

import numpy as np
import PIL.Image
img = PIL.Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
arr = np.array(img)

JorjMcKie (Maintainer) · 2022-04-18T11:17:35Z

@canklot - there is an even more direct and much faster way for ndarrays: see here.
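
The link target is not preserved here, so the following is only an assumption about what a more direct route can look like: viewing the sample buffer with np.frombuffer and reshaping it, which avoids the detour through a PIL image. The function name is made up and a stand-in object replaces a real pixmap.

```python
from types import SimpleNamespace

import numpy as np


def pixmap_to_ndarray(pix):
    """View raw pixmap samples as a (height, width, channels) uint8 array."""
    flat = np.frombuffer(pix.samples, dtype=np.uint8)
    return flat.reshape(pix.height, pix.width, pix.n)


# Hypothetical stand-in for a 4 x 2 RGB pixmap (n = 3 bytes per pixel).
fake_pix = SimpleNamespace(width=4, height=2, n=3, samples=bytes([255, 0, 0] * 8))

arr = pixmap_to_ndarray(fake_pix)
```

Because np.frombuffer does not copy, the resulting array is read-only when the buffer is a bytes object; call arr.copy() if the pixels need to be modified.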
