Perf problem while inserting a lot of tif images #2154

jlb6907 · 2022-12-30T20:25:03Z

jlb6907
Dec 30, 2022

Hi,

I want to test the performance of PyMuPDF when inserting a lot of TIFF images with G4 compression into a new PDF file.

So I wrote a small Python script to compare an old PDF library (QuickPDF) with PyMuPDF by inserting 334 TIFF images into a PDF file: my script.

The result :
MUPDF : 334 images in 33.27 seconds - Pdf File Size = 15790274 bytes
QUICKPDF : 334 images in 0.55 seconds - Pdf File Size = 15699678 bytes

I work with Python 3.10.4 / pymupdf-1.21.1 / Windows 11

How can I get similar performance with PyMuPDF?

Answered by JorjMcKie

Jan 1, 2023

JorjMcKie · 2022-12-30T21:28:48Z

JorjMcKie
Dec 30, 2022
Maintainer

You are using convert_to_pdf() for the image files. This is probably unfortunate in this situation. Try the following loop instead:

for filename in os.listdir(directory):
    filepath = os.path.join(directory, filename)
    if pathlib.Path(filepath).suffix==".tif":
        imgbytes = pathlib.Path(filepath).read_bytes()
        prof = fitz.image_profile(imgbytes)
        nb = nb + 1
        pagePdf = doc.new_page(width=prof["width"], height=prof["height"])
        pagePdf.insert_image(pagePdf.rect, stream=imgbytes)

I cannot confirm the improvement myself without the images.

1 reply

JorjMcKie Dec 30, 2022
Maintainer

Please report any news on this here. I have just tested the above revised code with 320 JPEG images with a total size of 96 MB and got an overall duration of 1.4 seconds.

jlb6907 · 2022-12-31T05:41:59Z

jlb6907
Dec 31, 2022
Author

With your code :

MUPDF : 334 images in 19.99 seconds - Pdf File Size = 227550174 bytes
QUICKPDF : 334 images in 0.55 seconds - Pdf File Size = 15699678 bytes

The file is too big (about 14x the QuickPDF size).

TIFF with Group 4 compression is a common use case for creating a PDF with bitonal images (e.g. a B&W scanned book).
QuickPDF inserts TIFF images with G4 compression into the PDF file without modification, and it is very fast.

1 reply

JorjMcKie Dec 31, 2022
Maintainer

Hm, ok.
As for the file size, this should be solved by using the right compression options when saving.
As for the image insertion itself, could you let me have a few example TIFFs, please, so I can take a deeper look?
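
As a rough illustration of what "the right compression options" could look like: Document.save() accepts garbage and deflate parameters, and to my understanding ez_save() is the same call with garbage=3 and deflate=True as its defaults. The helper name save_compact is hypothetical, and doc is assumed to be a fitz.Document.

```python
def save_compact(doc, path):
    """Save a PyMuPDF document with stream compression switched on.

    `doc` is assumed to be a fitz.Document; this helper is only a sketch.
    """
    # garbage=3 removes unused and duplicate objects,
    # deflate=True compresses streams that are still uncompressed
    doc.save(path, garbage=3, deflate=True)


# usage (requires PyMuPDF):
# import fitz
# doc = fitz.open()
# doc.new_page()
# save_compact(doc, "out.pdf")
```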

jlb6907 · 2022-12-31T09:47:45Z

jlb6907
Dec 31, 2022
Author

A zip file containing tif files (with Group 4 compression) from a public book : link

Version 2 of my python script :

MUPDF (convert_to_pdf and show_pdf_page) : 439 images in 63.44 seconds - Pdf File Size = 25597042 bytes
MUPDF (insert_image) : 439 images in 40.34 seconds - Pdf File Size = 479175944 bytes
QUICKPDF : 439 images in 1.52 seconds - Pdf File Size = 25477869 bytes

1 reply

JorjMcKie Dec 31, 2022
Maintainer

Thank you for the image files! On my machine running this script:

import fitz
import time
import os, pathlib

t0 = time.perf_counter()
imgfiles = os.listdir()
doc = fitz.open()
for f in imgfiles:
    if not f.endswith(".tif"):
        continue
    ibytes = pathlib.Path(f).read_bytes()
    prof = fitz.image_profile(ibytes)
    page = doc.new_page(width=prof["width"], height=prof["height"])
    page.insert_image(page.rect, stream=ibytes)
t1 = time.perf_counter()
doc.ez_save(__file__.replace(".py", ".pdf"))
t2 = time.perf_counter()
print(f"total {round(t2-t0,2)} seconds")
print(f"PDF create {round(t1-t0,2)} seconds")

has these runtimes:

total 34.1 seconds
PDF create 23.78 seconds

and creates a file of size 25,430,588 bytes.
I am using Win 11, Python 3.11, PyMuPDF 1.21.1, on a machine with an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz and 16 GB memory.

Let me see whether there are more optimization options.

jlb6907 · 2022-12-31T12:25:01Z

jlb6907
Dec 31, 2022
Author

My results so far:

Method                                       Time (s)   Size (bytes)
PyMuPDF (convert_to_pdf and show_pdf_page)   64.65      25 597 042
PyMuPDF (insert_image)                       44.74      479 175 944
PyMuPDF (insert_image and ez_save)           43.2       25 430 588
QuickPDF                                     1.68       25 477 869

QuickPDF is still 25x faster than PyMuPDF, for my use case.
It's a little strange !

2 replies

JorjMcKie Dec 31, 2022
Maintainer

It's a little strange !

Yes, that is one way to look at it. MuPDF simply has no specialized code to take advantage of this TIFF compression type. I am sure it can be added; it is just a matter of giving the topic attention.
It may be worthwhile to talk to my MuPDF colleagues and see what they say.

JorjMcKie Dec 31, 2022
Maintainer

Can you find out how the images are defined in the QuickPDF case?
I mean a printout of one image definition. If xref is an image xref, could you run print(doc.xref_object(xref))?
Would like to try something ...

jlb6907 · 2022-12-31T14:45:23Z

jlb6907
Dec 31, 2022
Author

Would like to try something ...

Yes, no problem. The PDF file generated with QuickPDF is here
Thanks a lot for your help.

0 replies

jlb6907 · 2022-12-31T15:06:55Z

jlb6907
Dec 31, 2022
Author

I will do more tests next week.

TIFF with G4 compression is a native image format in a PDF file, so perhaps the images are just incorporated "as is" in the output file.

1 reply

JorjMcKie Dec 31, 2022
Maintainer

TIFF with G4 compression is a native image format in a PDF file, so perhaps the images are just incorporated "as is" in the output file.

I don't think so. I tried it out with PyMuPDF's low-level functions. For example, image441.tif has a size of 20098 bytes, but as an inserted image its size is 19876 bytes, in both the MuPDF and the QuickPDF case.

JorjMcKie · 2023-01-01T12:24:37Z

JorjMcKie
Jan 1, 2023
Maintainer

Well, here is a hacky approach which does the job in 0.1 seconds.
Your suspicion that the image binary in the PDF is a substring of the TIFF file content is confirmed:
we need to strip off the first 8 and the last 215 bytes of the file content to get the PDF stuff.
The basic idea is to anticipate a later MuPDF optimization by preparing the image xrefs as required using PyMuPDF's low-level code, and then to use the insert_image() variant which refers to an existing image xref instead of a new image:

import fitz
import time
import os, pathlib

# image object definition without /Filter etc.
xref_templ = """<</Type /XObject /Subtype /Image /Width &width /Height &height /BitsPerComponent 1 
/ColorSpace /DeviceGray /Decode [1 0]>>"""

# the filter specifics for Group4
dec_parms = "<</K -1 /Columns &width /Rows &height /BlackIs1 true>>"

img_xrefs = {}  # stores xref with (width, height)
lx = 223 - 8  # = 215, the number of trailing bytes to drop from the file content
t0 = time.perf_counter()
imgfiles = os.listdir()

doc = fitz.open()
for f in imgfiles:
    if not f.endswith(".tif"):
        continue
    ibytes = pathlib.Path(f).read_bytes()  # read image file content
    reduced = ibytes[8 : len(ibytes) - lx]  # the part stored in the PDF
    prof = fitz.image_profile(ibytes)  # determine image profile
    w = prof["width"]
    h = prof["height"]
    objstr = xref_templ.replace("&width", str(w)).replace("&height", str(h))  # update obj def
    xref = doc.get_new_xref()  # get a new xref
    img_xrefs[xref] = (w, h)  # store image info
    doc.update_object(xref, objstr)  # define image object
    doc.update_stream(xref, reduced, compress=False)  # attach image binary (no extra compression!)
    doc.xref_set_key(xref, "Filter", "/CCITTFaxDecode")  # only NOW can add filter info
    doc.xref_set_key(  # add more filter info
        xref,
        "DecodeParms",
        dec_parms.replace("&width", str(w)).replace("&height", str(h)),
    )

# now images are in the PDF. Make pages referring to them
for xref, (w, h) in img_xrefs.items():
    page = doc.new_page(width=w, height=h)
    page.insert_image(page.rect, xref=xref)  # this is a highspeed insertion variant

t1 = time.perf_counter()
doc.save(__file__.replace(".py", ".pdf"))
t2 = time.perf_counter()
print(f"total {round(t2-t0,2)} seconds")
print(f"PDF create {round(t1-t0,2)} seconds")
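
To make the template step in the script above easier to follow, here is a small sketch that builds the same two strings with f-strings instead of the &width / &height placeholders. The function name g4_image_object and the page dimensions are hypothetical; the dictionary keys are the ones used in the script.

```python
def g4_image_object(width: int, height: int) -> tuple[str, str]:
    """Return (object definition, DecodeParms) for a Group 4 image."""
    # image object definition, still without /Filter and /DecodeParms
    obj = (
        "<</Type /XObject /Subtype /Image "
        f"/Width {width} /Height {height} "
        "/BitsPerComponent 1 /ColorSpace /DeviceGray /Decode [1 0]>>"
    )
    # filter parameters: K -1 selects pure two-dimensional (Group 4) coding
    parms = f"<</K -1 /Columns {width} /Rows {height} /BlackIs1 true>>"
    return obj, parms


obj, parms = g4_image_object(2480, 3508)  # example dimensions only
print(obj)    # would be passed to doc.update_object(xref, obj)
print(parms)  # would be passed to doc.xref_set_key(xref, "DecodeParms", parms)
```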

3 replies

JorjMcKie Jan 1, 2023
Maintainer

You should be able to confirm the measurements.
If so, then the revised approach is 15 times faster than QuickPDF.

jlb6907 Jan 1, 2023
Author

My final results:

Method                                       Time (s)   Size (bytes)
PyMuPDF (convert_to_pdf and show_pdf_page)   64.65      25 597 042
PyMuPDF (insert_image)                       44.74      479 175 944
PyMuPDF (insert_image and ez_save)           43.2       25 430 588
PyMuPDF (hacky)                              0.23       25 446 159
QuickPDF                                     1.68       25 477 869

PyMuPDF (hacky) is 7x faster than QuickPDF.

So, PyMuPDF (hacky) is definitely the best option for me !

Thanks for everything.
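
The ratios quoted in this thread can be reproduced from the measured times with a few lines of Python. The dictionary below only restates three timings from the results table; the variable names are illustrative.

```python
# measured times in seconds, copied from the results table
timings = {
    "PyMuPDF (insert_image and ez_save)": 43.2,
    "PyMuPDF (hacky)": 0.23,
    "QuickPDF": 1.68,
}

baseline = timings["QuickPDF"]
for name, seconds in timings.items():
    # values above 1 are slower than QuickPDF, values below 1 are faster
    print(f"{name}: {seconds / baseline:.2f}x the QuickPDF time")

# how many times faster the hacky variant is than QuickPDF
speedup = baseline / timings["PyMuPDF (hacky)"]
print(f"hacky speedup over QuickPDF: {speedup:.1f}x")
```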

JorjMcKie Jan 1, 2023
Maintainer

Pleased to see that it works for you.
As always with "hacky" solutions, there remains an uneasy feeling: how can I make sure not to apply this when the input does not conform to the assumptions, especially Group 4 compression?
For this it may help to check / confirm input files with Pillow like this:

>>> from PIL import Image
>>> Image.open("image003.tif").info
{'compression': 'group4', 'dpi': (300.0, 300.0)}
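
A Pillow-free way to test the assumptions is to read the TIFF directory directly. The following is only a sketch under stated assumptions: classic (non-BigTIFF) files, a single strip, and no handling of FillOrder or PhotometricInterpretation. The function name read_single_strip_g4 is hypothetical; the tag numbers are those of the TIFF 6.0 specification. It returns the actual offset and length of the compressed data, which could stand in for the hard-coded 8 and 215.

```python
import struct

# TIFF tag numbers (TIFF 6.0 specification)
TAG_WIDTH, TAG_HEIGHT, TAG_COMPRESSION = 256, 257, 259
TAG_STRIP_OFFSETS, TAG_STRIP_BYTE_COUNTS = 273, 279
COMPRESSION_GROUP4 = 4


def read_single_strip_g4(data: bytes) -> dict:
    """Locate the Group 4 data of a single-strip TIFF, else raise ValueError."""
    if data[:2] == b"II":
        order = "<"  # little-endian
    elif data[:2] == b"MM":
        order = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack_from(order + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    (n_entries,) = struct.unpack_from(order + "H", data, ifd_offset)
    tags = {}
    for i in range(n_entries):
        pos = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, typ, count = struct.unpack_from(order + "HHI", data, pos)
        if count != 1 or typ not in (3, 4):
            continue  # only single SHORT (3) or LONG (4) values are needed
        fmt = "H" if typ == 3 else "I"
        (tags[tag],) = struct.unpack_from(order + fmt, data, pos + 8)

    if tags.get(TAG_COMPRESSION) != COMPRESSION_GROUP4:
        raise ValueError("not Group 4 compressed")
    needed = (TAG_WIDTH, TAG_HEIGHT, TAG_STRIP_OFFSETS, TAG_STRIP_BYTE_COUNTS)
    if any(t not in tags for t in needed):
        raise ValueError("missing tag, or image has more than one strip")
    return {
        "width": tags[TAG_WIDTH],
        "height": tags[TAG_HEIGHT],
        "offset": tags[TAG_STRIP_OFFSETS],
        "length": tags[TAG_STRIP_BYTE_COUNTS],
    }
```

With the result in info, the slice of the file content would become data[info["offset"] : info["offset"] + info["length"]], and files that raise ValueError could fall back to the regular insert_image(stream=...) path.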

jlb6907 · 2023-01-01T18:33:26Z

jlb6907
Jan 1, 2023
Author

I have put all the hacky code into a reusable function:

import fitz, os, pathlib
from tkinter import filedialog
from PIL import Image

def insert_tiff_with_group4_compression(page, rect, tiff_path, width, height):
    xref_templ = """<</Type /XObject /Subtype /Image /Width &width /Height &height /BitsPerComponent 1 /ColorSpace /DeviceGray /Decode [1 0]>>"""
    dec_parms = "<</K -1 /Columns &width /Rows &height /BlackIs1 true>>"
    lx = 223 - 8  # = 215, the number of trailing bytes to drop from the file content
    imgbytes = pathlib.Path(tiff_path).read_bytes()
    reduced = imgbytes[8 : len(imgbytes) - lx]  # the part stored in the PDF
    objstr = xref_templ.replace("&width", str(width)).replace("&height", str(height))  # update obj def
    doc = page.parent
    xref = doc.get_new_xref()  # get a new xref
    doc.update_object(xref, objstr)  # define image object
    doc.update_stream(xref, reduced, compress=False)  # attach image binary (no extra compression!)
    doc.xref_set_key(xref, "Filter", "/CCITTFaxDecode")  # NOW add filter info
    doc.xref_set_key(  # add more filter info
        xref,
        "DecodeParms",
        dec_parms.replace("&width", str(width)).replace("&height", str(height)),
    )
    page.insert_image(rect, xref=xref)

directory = filedialog.askdirectory(title="Select images directory with tiff g4")
doc = fitz.open() 
for filename in os.listdir(directory):
    tiff_path = os.path.join(directory, filename)
    if pathlib.Path(tiff_path).suffix==".tif":
        with Image.open(tiff_path) as tiff:
            if tiff.info.get("compression") == "group4":  # .get() avoids a KeyError if the key is absent
                page = doc.new_page(width=tiff.width, height=tiff.height)
                insert_tiff_with_group4_compression(page, page.rect, tiff_path, tiff.width, tiff.height)
pdf_path = directory + "/test_mupdf_insert_image_hacky.pdf" 
doc.save(pdf_path)

4 replies

jlb6907 Jan 1, 2023
Author

Is it possible to add a "insert_tiff_with_group4_compression" method into the "Page" Class in PyMuPDF ?

JorjMcKie Jan 1, 2023
Maintainer

In principle yes, but it would be ugly, because the base library MuPDF should support this optimization natively.
For example, support for the CCITTFaxDecode filter was added not long ago.
Let me talk to the MuPDF developers and see what they have to say.

jlb6907 Jan 11, 2023
Author

Let me talk to the MuPDF developers and see what they have to say.

Any news from the MuPDF developers on this optimization ?

JorjMcKie Jan 11, 2023
Maintainer

No, please give it time. I will notify you.
