Perf problem while inserting a lot of tif images #2154
-
Hi, I want to test the performance of PyMuPDF when inserting a lot of TIFF images with G4 compression into a new PDF file. So I wrote a small Python script to compare an old PDF library (QuickPDF) with PyMuPDF by inserting 334 TIFF images into a PDF file: my script.
The result:
I work with Python 3.10.4 / pymupdf-1.21.1 / Windows 11.
How can I get similar performance with PyMuPDF?
-
You are using
for filename in os.listdir(directory):
    filepath = os.path.join(directory, filename)
    if pathlib.Path(filepath).suffix==".tif":
        imgbytes = pathlib.Path(filepath).read_bytes()
        prof = fitz.image_profile(imgbytes)
        nb = nb + 1
        pagePdf = doc.new_page(width=prof["width"], height=prof["height"])
        pagePdf.insert_image(pagePdf.rect, stream=imgbytes)
Cannot confirm the improvement myself - without the images.
-
With your code:
MUPDF: 334 images in 19.99 seconds - PDF file size = 227550174 bytes
The file size is too big (14x). TIFF with Group 4 compression is a common use case for creating a PDF with bitonal images (B&W scanned book).
-
A zip file containing TIFF files (with Group 4 compression) from a public book: link
Version 2 of my Python script:
-
My results so far:
QuickPDF is still 25x faster than PyMuPDF for my use case.
-
Yes, no problem. The PDF file generated with QuickPDF is here
-
I will do more tests next week. TIFF with G4 compression is a native image format in a PDF file, so perhaps the images are just incorporated "as is" in the output file.
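One rough way to check that guess is to count the filter names that appear in the raw bytes of the QuickPDF output. This is only a sketch using the standard library: the function name and the file name in the usage comment are made up for illustration, and a plain byte search can be fooled by the same bytes occurring inside binary stream data. If /CCITTFaxDecode shows up about once per page, the Group 4 data was most likely stored without re-encoding.

```python
from pathlib import Path


def count_filter_names(pdf_bytes: bytes) -> dict:
    """Count how often some common stream filter names occur in raw PDF bytes.

    Heuristic only: this is a byte search, not a PDF parser.
    """
    names = (b"/CCITTFaxDecode", b"/FlateDecode", b"/DCTDecode")
    return {name.decode("ascii"): pdf_bytes.count(name) for name in names}


# Usage (hypothetical file name):
# print(count_filter_names(Path("quickpdf_output.pdf").read_bytes()))
```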
-
Well here is a hacky approach, which does the job in 0.1 seconds.
import fitz
import time
import os, pathlib

# image object definition without /Filter etc.
xref_templ = """<</Type /XObject /Subtype /Image /Width &width /Height &height /BitsPerComponent 1
/ColorSpace /DeviceGray /Decode [1 0]>>"""
# the filter specifics for Group4
dec_parms = "<</K -1 /Columns &width /Rows &height /BlackIs1 true>>"
img_xrefs = {}  # stores xref with (width, height)
lx = 223 - 8  # must subtract this from overall imgfile length
t0 = time.perf_counter()
imgfiles = os.listdir()
doc = fitz.open()
for f in imgfiles:
    if not f.endswith(".tif"):
        continue
    ibytes = pathlib.Path(f).read_bytes()  # read image file content
    reduced = ibytes[8 : len(ibytes) - lx]  # the part stored in the PDF
    prof = fitz.image_profile(ibytes)  # determine image profile
    w = prof["width"]
    h = prof["height"]
    objstr = xref_templ.replace("&width", str(w)).replace("&height", str(h))  # update obj def
    xref = doc.get_new_xref()  # get a new xref
    img_xrefs[xref] = (w, h)  # store image info
    doc.update_object(xref, objstr)  # define image object
    doc.update_stream(xref, reduced, compress=False)  # attach image binary (no extra compression!)
    doc.xref_set_key(xref, "Filter", "/CCITTFaxDecode")  # only NOW can add filter info
    doc.xref_set_key(  # add more filter info
        xref,
        "DecodeParms",
        dec_parms.replace("&width", str(w)).replace("&height", str(h)),
    )
# now images are in the PDF. Make pages referring to them
for xref, (w, h) in img_xrefs.items():
    page = doc.new_page(width=w, height=h)
    page.insert_image(page.rect, xref=xref)  # this is a highspeed insertion variant
t1 = time.perf_counter()
doc.save(__file__.replace(".py", ".pdf"))
t2 = time.perf_counter()
print(f"total {round(t2-t0,2)} seconds")
print(f"PDF create {round(t1-t0,2)} seconds")
-
I have pushed all the hacky code into a reusable function:
import fitz, os, pathlib
from tkinter import filedialog
from PIL import Image

def insert_tiff_with_group4_compression(page, rect, tiff_path, width, height):
    xref_templ = """<</Type /XObject /Subtype /Image /Width &width /Height &height /BitsPerComponent 1 /ColorSpace /DeviceGray /Decode [1 0]>>"""
    dec_parms = "<</K -1 /Columns &width /Rows &height /BlackIs1 true>>"
    lx = 223 - 8  # must subtract this from overall imgfile length
    imgbytes = pathlib.Path(tiff_path).read_bytes()
    reduced = imgbytes[8 : len(imgbytes) - lx]  # the part stored in the PDF
    objstr = xref_templ.replace("&width", str(width)).replace("&height", str(height))  # update obj def
    doc = page.parent
    xref = doc.get_new_xref()  # get a new xref
    doc.update_object(xref, objstr)  # define image object
    doc.update_stream(xref, reduced, compress=False)  # attach image binary (no extra compression!)
    doc.xref_set_key(xref, "Filter", "/CCITTFaxDecode")  # NOW add filter info
    doc.xref_set_key(  # add more filter info
        xref,
        "DecodeParms",
        dec_parms.replace("&width", str(width)).replace("&height", str(height)),
    )
    page.insert_image(rect, xref=xref)

directory = filedialog.askdirectory(title="Select images directory with tiff g4")
doc = fitz.open()
for filename in os.listdir(directory):
    tiff_path = os.path.join(directory, filename)
    if pathlib.Path(tiff_path).suffix==".tif":
        with Image.open(tiff_path) as tiff:
            if tiff.info["compression"]=="group4":
                page = doc.new_page(width=tiff.width, height=tiff.height)
                insert_tiff_with_group4_compression(page, page.rect, tiff_path, tiff.width, tiff.height)
pdf_path = directory + "/test_mupdf_insert_image_hacky.pdf"
doc.save(pdf_path)
Your suspicion that the image binary in the PDF is a substring of the TIFF file content is confirmed: we need to strip off the first 8 and the last 215 bytes of the file content to get the PDF stuff.
The basic idea is to anticipate a later MuPDF optimization by preparing the image xrefs as required using PyMuPDF's low-level code, then use the insert_image() format which refers to an existing image xref instead of a new image.
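The fixed byte counts (8 at the start, 215 at the end) fit the files in this thread, but other TIFF writers may lay out the file differently. A more general way to locate the compressed data is to read the StripOffsets and StripByteCounts tags from the first TIFF directory. The following is only a sketch of that idea using the standard library: the function name is made up, it assumes a single-strip, single-page file, and it ignores tags such as FillOrder and PhotometricInterpretation that may also affect the PDF decode parameters.

```python
import struct

# Tag numbers from the TIFF 6.0 specification
TAG_WIDTH, TAG_HEIGHT, TAG_COMPRESSION = 256, 257, 259
TAG_STRIP_OFFSETS, TAG_STRIP_BYTE_COUNTS = 273, 279
WANTED = {TAG_WIDTH, TAG_HEIGHT, TAG_COMPRESSION,
          TAG_STRIP_OFFSETS, TAG_STRIP_BYTE_COUNTS}
# TIFF field types handled here: 3 = SHORT, 4 = LONG -> (struct code, size)
FIELD_TYPES = {3: ("H", 2), 4: ("I", 4)}


def extract_g4_strip(data: bytes):
    """Return (width, height, strip_bytes) for a single-strip Group 4 TIFF.

    Hypothetical helper: only the first image directory is read.
    """
    endian = {b"II": "<", b"MM": ">"}[data[:2]]  # byte order mark
    magic, ifd_pos = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("not a TIFF file")
    (n_entries,) = struct.unpack_from(endian + "H", data, ifd_pos)
    tags = {}
    for i in range(n_entries):
        entry = ifd_pos + 2 + 12 * i  # each directory entry is 12 bytes
        tag, ftype, count = struct.unpack_from(endian + "HHI", data, entry)
        if tag not in WANTED:
            continue
        code, size = FIELD_TYPES[ftype]
        pos = entry + 8  # values of up to 4 bytes are stored inline
        if size * count > 4:  # otherwise the field holds an offset
            (pos,) = struct.unpack_from(endian + "I", data, pos)
        tags[tag] = struct.unpack_from(f"{endian}{count}{code}", data, pos)
    if tags[TAG_COMPRESSION][0] != 4:
        raise ValueError("not Group 4 compressed")
    if len(tags[TAG_STRIP_OFFSETS]) != 1:
        raise ValueError("multi-strip TIFF is not handled by this sketch")
    start = tags[TAG_STRIP_OFFSETS][0]
    length = tags[TAG_STRIP_BYTE_COUNTS][0]
    return tags[TAG_WIDTH][0], tags[TAG_HEIGHT][0], data[start:start + length]
```

The returned bytes could then take the place of the reduced slice passed to update_stream(), and the returned width and height could replace the image_profile() lookup.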