get_text('block') is much slower on jp2 images #3082

VeryLazyBoy · 2024-01-22T18:11:42Z

Description of the bug

Given a PDF consisting of a single JP2 image, get_text('blocks') is much slower than on PDFs made from PNG or JPEG images.

jp2 speed: 0.6 seconds
png speed: 0.00019 seconds
jpeg speed: 0.00012 seconds

How to reproduce the bug

import fitz
import time

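def image_pdf_speed(image_file):
    # Wrap the image in a one-page PDF, then time text extraction on that page.
    img_doc = fitz.Document(image_file)
    pdf_bytes = img_doc.convert_to_pdf()
    pdf_doc = fitz.Document(stream=pdf_bytes)
    st = time.perf_counter()  # monotonic clock, better suited to short intervals
    pdf_doc[0].get_text('blocks')
    print(time.perf_counter() - st)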


jp2_file = 'debug.jp2'
print('JPEG 2000 speed:')
image_pdf_speed(jp2_file) # 0.6 seconds

png_file = 'debug.png'
print('PNG speed:')
image_pdf_speed(png_file) # 0.00019 seconds

jpeg_file = 'debug.jpeg'
print('JPEG speed:')
image_pdf_speed(jpeg_file) # 0.00012 seconds

Here are the images I used to test
images.zip

PyMuPDF version

1.23.8 or earlier.

I noticed that since 1.23.9, get_text('blocks') no longer returns image blocks. For newer versions, the speed difference should be checked with get_text('dict').
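To compare output variants on newer versions, a small timing helper along these lines may be useful. It is only a sketch: time_extraction is a hypothetical name, and it assumes nothing about the page beyond a get_text(option, **kwargs) method. It returns every measurement rather than the best one, on the assumption that decoded images may be cached after the first call, so the first value is probably the one to compare.

```python
import time


def time_extraction(page, option, repeats=3, **kwargs):
    """Time page.get_text(option, **kwargs) and return all timings in seconds.

    Hypothetical helper, not part of PyMuPDF. The first entry is the
    "cold" run; later runs may be faster if the library caches images.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        page.get_text(option, **kwargs)
        timings.append(time.perf_counter() - start)
    return timings


# Usage (requires PyMuPDF; 'debug.jp2' is the file name from the report):
#   import fitz
#   pdf = fitz.Document(stream=fitz.Document('debug.jp2').convert_to_pdf())
#   print(time_extraction(pdf[0], 'dict'))
```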

Operating system

MacOS

Python version

3.8

JorjMcKie · 2024-01-22T22:34:08Z

Maintainer

Firstly, I don't regard this as a bug, as data are correctly provided in each of the cases.
Secondly, especially for the "blocks" output variant, you are explicitly not interested in the image itself, but only in its location (bbox). One could certainly argue about whether there are faster ways to get this limited information.

Recommending the "dict" output variant is surely the wrong direction, because it explicitly says that you want the image binary itself, which entails resource consumption in terms of decompression effort and memory usage for the extracted image.
You can still access image information with the "blocks" output, simply by specifying flags=fitz.TEXTFLAGS_DICT.
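As a rough illustration of the "blocks" route, the sketch below filters the returned tuples down to image bounding boxes. The helper name image_bboxes_from_blocks is made up for this example. It relies on the documented layout of a "blocks" item, (x0, y0, x1, y1, text, block_no, block_type), where block_type 1 marks an image block. The flags value is passed in by the caller so that the helper itself does not need to import PyMuPDF.

```python
IMAGE_BLOCK = 1  # block_type of image blocks in "blocks" output (0 = text)


def image_bboxes_from_blocks(page, flags):
    """Return (x0, y0, x1, y1) for every image block on the page.

    Hypothetical helper: `page` is expected to behave like a PyMuPDF Page,
    and `flags` should include image extraction, e.g. fitz.TEXTFLAGS_DICT.
    """
    blocks = page.get_text("blocks", flags=flags)
    return [tuple(b[:4]) for b in blocks if b[6] == IMAGE_BLOCK]


# Usage (requires PyMuPDF; "some.pdf" is a placeholder file name):
#   import fitz
#   page = fitz.open("some.pdf")[0]
#   print(image_bboxes_from_blocks(page, fitz.TEXTFLAGS_DICT))
```

Note that this still asks the text extraction to handle the images, so it is not expected to avoid the decoding cost discussed in this thread.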

But in general, PyMuPDF's access to image information is relayed by MuPDF, which in turn uses third-party libraries for its image handling.

So overall I would say these access times are what they are. You have multiple choices: access only the bbox and other meta-information, additionally load the image itself, or, most importantly, exclude image information altogether when you are interested in text only.

BTW, loading the PNG image binary requires around 50% of the time needed for the JP2.

It is a little unclear to me what your actual problem is. Maybe you can let us know, so we can better help you choose the best path.

VeryLazyBoy · 2024-01-23T01:05:21Z

Author

Thank you for your explanation. I initially didn't have much knowledge about JP2 images, so I mistakenly assumed that all image formats should have similar loading times. Currently, I need to obtain the position information of all the images on this page. Is there a faster method that doesn't require decompressing JP2 images? I tried using page.get_images, but it doesn't seem to directly return the position of the images on the page.

JorjMcKie · 2024-01-23T09:11:53Z

Maintainer

I understand your intention better now - thanks for the background.

I suggest using Page.get_image_info.
Default parameters give the fastest response.

What it does:
It walks through the page's appearance command stream to identify every image invocation. Each image is represented by its meta information in a dictionary. The bounding box is included, the image binary is not.
The dictionaries appear in invocation sequence.

Included is every invocation of every image. So depending on how the page is constructed, the same image may occur multiple times. Use hashes=True to identify equal images. This may incur increased response time because the MD5 computation is based on the image binary (temporarily loaded in that case).
The list contains images reachable via an xref as well as so-called "inline" images: the ones only known by this page.
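A minimal sketch of how this could be used follows. The helper names are invented for the example. It assumes only what is described above plus the documented dictionary keys "bbox" and, with hashes=True, "digest".

```python
from collections import defaultdict


def image_bboxes(page):
    """Bounding boxes of all image invocations, in invocation order.

    Uses default parameters, i.e. no hashes, which is the fastest variant.
    """
    return [tuple(info["bbox"]) for info in page.get_image_info()]


def bboxes_by_digest(page):
    """Group bounding boxes by MD5 digest so that repeated images collapse.

    hashes=True loads each image binary temporarily, so this is slower.
    """
    groups = defaultdict(list)
    for info in page.get_image_info(hashes=True):
        groups[info["digest"]].append(tuple(info["bbox"]))
    return dict(groups)


# Usage (requires PyMuPDF; "some.pdf" is a placeholder file name):
#   import fitz
#   page = fitz.open("some.pdf")[0]
#   print(image_bboxes(page))
```

If only positions are needed, image_bboxes should be enough. The digest grouping is worth its extra cost only when the same image may be shown several times on the page.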
