Getting pixmap shape without loading pixmap in memory #1894

PasaOpasen · 2022-08-22T12:23:16Z

PasaOpasen
Aug 22, 2022

Is there a way to get a page's size in pixels without loading the page into memory? Right now I'm using this code:

matrix = fitz.Matrix(dpi/72, dpi/72)
pix = doc.load_page(p).get_pixmap(matrix=matrix)
return (pix.width, pix.height)

but it is slow for documents with many pages and uses too much memory. Can I get the pixel size of the future image without actually constructing it?

Answered by JorjMcKie

Aug 23, 2022

JorjMcKie · 2022-08-22T12:56:08Z

JorjMcKie
Aug 22, 2022
Maintainer

A typical Discussions item - no issue.

1 reply

JorjMcKie Aug 22, 2022
Maintainer

... and even less a bug!

JorjMcKie · 2022-08-22T13:04:28Z

JorjMcKie
Aug 22, 2022
Maintainer

As to your question:
No, there is not. But you can look at the page object definition, inspect what the various rectangle values say, and calculate the expected pixmap size yourself:

xref=doc.page_xref(pno)  # will not load the page
mediabox = doc.xref_get_key(xref, "MediaBox")
cropbox =  doc.xref_get_key(xref, "CropBox")

The return values are either ("null", "null") if the respective key is not there, or something like ("array", "[0 0 595 842]").
From that you arrive at, e.g., cropbox = pix.irect = fitz.Rect(0, 0, 595, 842).
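As a rough sketch of the parsing step described above: the helper below turns the tuple returned by `doc.xref_get_key(xref, "MediaBox")` into four floats. The names `parse_box` and `box_size` are hypothetical (they are not part of PyMuPDF), and the code assumes the box is stored as a plain array of numbers, which may not hold for every PDF.

```python
def parse_box(key_value):
    """Parse a (type, value) tuple such as ("array", "[0 0 595 842]").

    Returns (x0, y0, x1, y1) as floats, or None if the key is missing
    or is not a plain array of four numbers.
    """
    kind, text = key_value
    if kind != "array":
        return None  # e.g. ("null", "null") when the key is absent
    try:
        numbers = [float(tok) for tok in text.strip("[] \n").split()]
    except ValueError:
        return None  # entries that are not plain numbers
    if len(numbers) != 4:
        return None
    return tuple(numbers)


def box_size(box):
    """Width and height in points (1/72 inch) of a parsed box."""
    x0, y0, x1, y1 = box
    return abs(x1 - x0), abs(y1 - y0)
```

With PyMuPDF this might be used as `parse_box(doc.xref_get_key(doc.page_xref(pno), "CropBox"))`, falling back to "MediaBox" when the result is None.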

0 replies

JorjMcKie · 2022-08-22T13:06:45Z

JorjMcKie
Aug 22, 2022
Maintainer

Sizes for other dpi values can be computed via round(width / 72 * dpi) etc.
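A minimal sketch of that scaling, using only the formula above. The function name `pixel_size` is hypothetical, and the result may differ by a pixel from what `get_pixmap` actually produces, since the library's own rounding of the transformed rectangle is not necessarily `round()`.

```python
def pixel_size(width_pt, height_pt, dpi=72):
    """Estimate the pixmap shape for a page of the given size in points.

    PDF units are 1/72 inch, so the scale factor is dpi / 72.
    """
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


# An A4-sized box [0 0 595 842] at the default resolution and at 150 dpi
print(pixel_size(595, 842))       # (595, 842)
print(pixel_size(595, 842, 150))  # (1240, 1754)
```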

7 replies

JorjMcKie Aug 22, 2022
Maintainer

Well, then I would say too bad. BTW, loading the page (without creating the pixmap) is also very fast. Then use page.cropbox and go with that. That should work correctly.

JorjMcKie Aug 23, 2022
Maintainer

Forgot to mention: the Pixmap.size property is the value width * height * n + value. Where n = Pixmap.n (n = 3 for RGB) and value is Pixmap.size - len(Pixmap.samples_mv) - normally a value of 88.
Maybe that explains your differences.
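To make the arithmetic concrete, here is a small sketch of that formula. The function name `estimate_pixmap_bytes` is hypothetical, and the default overhead of 88 bytes is simply the "normal" value quoted above, not something guaranteed across versions.

```python
def estimate_pixmap_bytes(width, height, n=3, overhead=88):
    """Estimate Pixmap.size as width * height * n + overhead.

    width, height: pixmap dimensions in pixels (not page points)
    n: bytes per pixel (3 for RGB, 4 for RGB with alpha)
    overhead: fixed bytes on top of the samples; 88 is an assumption
    """
    return width * height * n + overhead


print(estimate_pixmap_bytes(595, 842))       # 1503058
print(estimate_pixmap_bytes(595, 842, n=4))  # 2004048
```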

PasaOpasen Aug 23, 2022
Author

> Forgot to mention: the Pixmap.size property is the value width * height * n + value. Where n = Pixmap.n (n = 3 for RGB) and value is Pixmap.size - len(Pixmap.samples_mv) - normally a value of 88. Maybe that explains your differences.

but it doesn't change Pixmap shape?

JorjMcKie Aug 23, 2022
Maintainer

> but it doesn't change Pixmap shape?

No, how would it / could it do that? Of course I referred to width = Pixmap.w - not the page.rect.width.

JorjMcKie Aug 23, 2022
Maintainer

Another thing to know: both PDF rectangles, MediaBox and CropBox, are inheritable. So both may be missing for any given page. If both are omitted, default specifications will exist further up in the page tree hierarchy of the PDF. In extreme cases these values may be stored just once in the PDF catalog. Or there could be values for arbitrary ranges of pages.
To be safe, you may still want to load the page and use page.cropbox for the above calculation. This should always be correct.
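For anyone who still wants to avoid loading the page, a sketch of following the inheritance chain might look like the code below. The function name `find_inherited_box` and the `max_depth` guard are hypothetical. The code assumes that `doc.xref_get_key(xref, "Parent")` returns a tuple like ("xref", "12 0 R") for the parent node, and that the box itself is stored directly as an array; neither is verified here for all files.

```python
def find_inherited_box(doc, xref, key="MediaBox", max_depth=64):
    """Walk up the /Parent chain until `key` is found.

    Returns the raw array string (e.g. "[0 0 595 842]") or None.
    `doc` only needs an xref_get_key(xref, key) method.
    """
    for _ in range(max_depth):  # guard against malformed, cyclic trees
        kind, value = doc.xref_get_key(xref, key)
        if kind == "array":
            return value
        kind, value = doc.xref_get_key(xref, "Parent")
        if kind != "xref":
            return None  # reached the top without finding the key
        xref = int(value.split()[0])  # "12 0 R" -> 12
    return None
```

A call could start from `doc.page_xref(pno)`. Loading the page and reading page.cropbox remains the safer route, as noted above.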

JorjMcKie · 2022-08-23T13:28:02Z

JorjMcKie
Aug 23, 2022
Maintainer

I made a comparison of the 3 methods:
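```python
import time
import fitz  # PyMuPDF

doc = fitz.open("adobe.pdf")

def test1():
    # Method 1: load each page, but do not render it.
    t0 = time.perf_counter()
    for page in doc:
        irect = page.rect.irect
        pixsize = irect.width * irect.height * 3 + 88
    t1 = time.perf_counter()
    return t1 - t0

def test2():
    # Method 2: read the MediaBox from the page object, no page load.
    t0 = time.perf_counter()
    for i in range(doc.page_count):
        xref = doc.page_xref(i)
        mb = doc.xref_get_key(xref, "MediaBox")
        if mb[0] != "array":
            raise ValueError("no mediabox for page", i)
        # "[x0 y0 x1 y1]" -> floats (the values arrive as strings)
        x0, y0, x1, y1 = map(float, mb[1][1:-1].split())
        w, h = x1 - x0, y1 - y0
        pixsize = w * h * 3 + 88
    t1 = time.perf_counter()
    return t1 - t0

def test3():
    # Method 3: render every page and ask the pixmap for its size.
    t0 = time.perf_counter()
    for page in doc:
        pix = page.get_pixmap()
        size = pix.size
    t1 = time.perf_counter()
    return t1 - t0
```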


> Note that sizes for non-default DPI values need not be computed separately: the result just has to be multiplied by `(DPI/72)**2` when desired.

The results are interesting:
1. 0.123 sec
2. 0.004 sec
3. 2.295 sec

So the method that loads the page but creates no pixmap (method 1) is about 18 times faster than the pixmap method (method 3), and method 2 is more than 570 times faster.

2 replies

JorjMcKie Aug 23, 2022
Maintainer

The Adobe manual has 1310 pages ...

PasaOpasen Aug 23, 2022
Author

Thank u!
