get_text('block') is much slower on jp2 images #3082
Replies: 3 comments
Firstly, I don't regard this as a bug, as the data are provided correctly in each of the cases. Recommending the "dict" output variant is certainly the wrong direction, because it explicitly says that you want the image binary itself, which entails resource consumption in terms of decompression effort and memory usage for the extracted image. In general, PyMuPDF's access to image information is relayed by MuPDF, which in turn uses third-party libraries for its image handling. So overall I would say these access times are what they are. You have multiple choices: access only the bbox and other meta-information, additionally load the image itself, or, most importantly, exclude image information altogether when you are interested in text only. BTW, loading the PNG image binary requires around 50% of the time needed for the JP2. It is a little unclear to me what your actual problem is. Maybe you can let us know, so we can better help you choose the best path.
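To illustrate the difference between text-only extraction and extraction that also loads image binaries, here is a minimal sketch that is not taken from this thread. It assumes PyMuPDF is importable as fitz and that the TEXTFLAGS_DICT and TEXT_PRESERVE_IMAGES constants behave as documented; summarize_blocks, compare_extractions and pdf_path are hypothetical names used only for illustration.

```python
import time


def summarize_blocks(blocks):
    """Count text and image blocks in the "blocks" list of a "dict" extraction.

    In PyMuPDF's "dict" output, a block with "type" 0 is text and "type" 1 is an image.
    """
    n_text = sum(1 for b in blocks if b["type"] == 0)
    n_images = sum(1 for b in blocks if b["type"] == 1)
    return n_text, n_images


def compare_extractions(pdf_path):
    """Time "dict" extraction on page 0 with and without image data (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so summarize_blocks works without it

    with fitz.open(pdf_path) as doc:
        page = doc[0]

        # Clear the image flag: MuPDF is asked not to extract images at all.
        text_flags = fitz.TEXTFLAGS_DICT & ~fitz.TEXT_PRESERVE_IMAGES
        t0 = time.perf_counter()
        text_only = page.get_text("dict", flags=text_flags)["blocks"]
        t1 = time.perf_counter()

        # Default "dict" flags: image blocks include the image binary.
        with_images = page.get_text("dict")["blocks"]
        t2 = time.perf_counter()

    return {
        "text_only": (summarize_blocks(text_only), t1 - t0),
        "with_images": (summarize_blocks(with_images), t2 - t1),
    }
```

Calling compare_extractions on one of the test PDFs should show whether the time is spent on the image rather than on the text; the actual numbers depend on the file and the machine.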
Thank you for your explanation. I initially didn't know much about JP2 images, so I mistakenly assumed that all image formats would have similar loading times. Currently, I need to obtain the position information of all the images on this page. Is there a faster method that doesn't require decompressing JP2 images? I tried using
I understand your intention better now, thanks for the background. I suggest using Page.get_image_info. What it does: it reports every invocation of every image on the page. So depending on how the page is constructed, the same image may occur multiple times. Use
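As a minimal sketch of this suggestion (not code from this thread): the snippet below collects the bounding boxes of all image invocations on a page via Page.get_image_info. It assumes PyMuPDF is importable as fitz and that each returned item is a dict with a "bbox" key, plus an "xref" key when xrefs=True is passed; image_positions, page_image_positions, pdf_path and page_number are hypothetical names.

```python
def image_positions(image_info):
    """Group bounding boxes by image xref.

    `image_info` is a list of dicts as returned by Page.get_image_info(xrefs=True).
    The same image may be drawn several times, so each xref maps to a list of bboxes.
    Items without an "xref" key are grouped under 0 (an assumption of this sketch).
    """
    positions = {}
    for item in image_info:
        xref = item.get("xref", 0)
        positions.setdefault(xref, []).append(tuple(item["bbox"]))
    return positions


def page_image_positions(pdf_path, page_number=0):
    """Return image positions for one page of a PDF (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so image_positions works without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        # xrefs=True additionally reports which image object each invocation uses.
        return image_positions(page.get_image_info(xrefs=True))
```

Grouping by xref makes repeated invocations of the same image visible: one key with several bounding boxes.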
Description of the bug
Given a PDF made of one jp2 image, get_text('block') is slower than with PDFs made of png or jpg images.
How to reproduce the bug
Here are the images I used to test
images.zip
PyMuPDF version
1.23.8 or earlier.
Operating system
MacOS
Python version
3.8