Unexpected Image extraction #2124

buptyyf · 2022-12-14T14:13:34Z

buptyyf
Dec 14, 2022

Describe the bug (mandatory)

I am working on image extraction from PDF files. PyMuPDF produces some PNG images with inverted colors and some that are mirrored along the x axis, while other images come out as expected.

To Reproduce (mandatory)

This is test pdf:
Test.pdf

My test code is:

import io
import fitz
from PIL import Image
path = 'Test.pdf'
doc = fitz.open(path, filetype="pdf")

img_index = 0
page_count = doc.page_count
if page_count:

    for page_no in range(page_count):
        # get_text() replaces the deprecated camelCase alias getText()
        blocks = doc[page_no].get_text('dict')['blocks']
        for ind, block in enumerate(blocks):
            if block['type'] == 1:  # type 1 = image block
                try:
                    image = Image.open(io.BytesIO(block['image']))
                    # pass the file name so Pillow opens and closes the file itself
                    image.save("{}.{}".format(img_index, block['ext']))
                    img_index += 1
                except Exception as e:
                    print(e)

I also tested the PyMuPDF-Utilities example https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/examples/extract-images/extract.py to extract the images, and it gives the same result.

The result is:

This image should read "1961", but I get this.

This image should have black text on a white background, but I get inverted colors.

This image meets my expectation.

JorjMcKie · 2022-12-14T18:16:20Z

JorjMcKie
Dec 14, 2022
Maintainer

As is mentioned in the documentation:
If you extract images via Page.get_text(), then you will not get transparency information. Only base images are extracted.
Whatever your expectations are: don't be disappointed if extracting this way.

But you actually are lucky: your 2100+ images seem to have no transparency, and with only 6 exceptions they are all extracted correctly!
If they seem to be flipped or rotated, then this means they are stored that way - and the page creator knew what he had to do to show them properly.
You can find out what he did by looking at the "transform" matrix in each image block. The documentation will tell you how to interpret that matrix.
Example page 284, image block 0:

>>> blocks[0].keys()
dict_keys(['number', 'type', 'bbox', 'width', 'height', 'ext', 'colorspace', 'xres', 'yres', 'bpc', 'transform', 'size', 'image'])
>>> blocks[0]["bpc"]
1
>>> blocks[0]["transform"]
(3.0, 0.0, -0.0, -1.8461999893188477, 294.5830078125, 414.6283874511719)

This shows:

- the transform matrix.d is < 0, therefore there is an up-down flip
- the bpc = 1, therefore this image is black & white. It is one of the 6 mentioned exceptions. To store it in a more familiar way, perform a color inversion like this:

pix = fitz.Pixmap(blocks[0]["image"])
pix.invert_irect()
pix.save("inverted.png")

1 reply

buptyyf Dec 15, 2022
Author

Thanks, but this does not solve my problems. I still have the following questions:

1. Up-down flip problem.
"the transform matrix.d is < 0, therefore an up-down flip"
I want to flip the image back so that it looks normal. pix.invert_irect() only inverts the colors. What should I do?
2. get_text() can't extract image transparency info.
I learned from your answer to use the doc.extract_image API to extract images with transparency info. But it does not work for these images: the "image" value returned by doc.extract_image() is the same as the "image" value of the page.get_text() block.
My understanding is that once I extract the transparency of an image, the image will display normally. I don't know whether my understanding is right.
3. How can I map an image block from get_text() one to one to an item of the page.get_images() list?
page.get_images() list items have an xref, but image blocks from get_text() have no xref property.

JorjMcKie · 2022-12-14T18:20:42Z

JorjMcKie
Dec 14, 2022
Maintainer

You may have questions, so I am converting this to a "Discussions" item.

0 replies

JorjMcKie · 2022-12-15T10:32:01Z

JorjMcKie
Dec 15, 2022
Maintainer

How can I map an image block from get_text() one to one to an item of the page.get_images() list?
page.get_images() list items have an xref, but image blocks from get_text() have no xref property.

A 1:1 correspondence between page.get_images() and the image blocks in page.get_text() cannot exist / be guaranteed, see the documentation here.

To recover images in the same way as a page shows them, you must investigate the transform matrix and use a package like Pillow to back-transform the extracted image accordingly.
I have once written a little function matprop, that does this investigation. Here is a ZIP of it:
matrix_property.zip
Let's look at page 8 of your test file and see how to use it:

import io

import fitz
from PIL import Image
from matrix_property import matprop  # helper from matrix_property.zip

doc = fitz.open("Test.pdf")
page = doc[8]
# keep only the image blocks (type 1)
imgb = [b for b in page.get_text("dict")["blocks"] if b["type"] == 1]
for i, b in enumerate(imgb):
    image0 = Image.open(io.BytesIO(b["image"]))
    mat = fitz.Matrix(b["transform"])
    if matprop(mat)[0] == 2:  # this is an up-down flip!
        image1 = image0.transpose(Image.Transpose.FLIP_TOP_BOTTOM)
        image1.save(f"img{i}.{b['ext']}")

This will produce the following images:

Depending on the returns of function matprop, select the right Image.Transpose function to achieve your result.
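The matprop helper is only available in the ZIP above, so its return values are not reproduced here. As a rough, self-contained sketch of the same idea, the hypothetical function below (pillow_flips is an illustrative name, not part of PyMuPDF or of matprop) looks only at the signs of the a and d entries of a block's "transform" tuple and names the Pillow Image.Transpose members that would undo the flips. It assumes the matrix has no rotation or shear (b and c are zero) and refuses to guess otherwise. Treating a < 0 as a left-right mirror is an assumption made by analogy with the d < 0 case described earlier in the thread.

```python
def pillow_flips(transform):
    """Return names of Pillow Image.Transpose members that undo simple flips.

    `transform` is the 6-tuple (a, b, c, d, e, f) of an image block.
    Hypothetical helper: handles only matrices without rotation or shear.
    """
    a, b, c, d, _e, _f = transform
    if b != 0 or c != 0:
        raise ValueError("rotation or shear present; flips alone are not enough")
    flips = []
    if a < 0:  # assumption: negative x scale means a left-right mirror
        flips.append("FLIP_LEFT_RIGHT")
    if d < 0:  # negative y scale: up-down flip, as noted in the thread
        flips.append("FLIP_TOP_BOTTOM")
    return flips


# transform of image block 0 on page 284, quoted earlier in the thread
block_transform = (3.0, 0.0, -0.0, -1.8461999893188477, 294.5830078125, 414.6283874511719)
flips = pillow_flips(block_transform)
```

Each returned name can then be applied to a Pillow image with image.transpose(getattr(Image.Transpose, name)), which needs a Pillow version that provides Image.Transpose.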

5 replies

buptyyf Dec 15, 2022
Author

Thanks a lot! But I can't find the answer to problem No. 2.

get_text() can't extract image transparency info.
I learned from #1919 to use the doc.extract_image API to extract images with transparency info. But it does not work for these images: the "image" value returned by doc.extract_image() is the same as the "image" value of the page.get_text() block.
My understanding is that once I extract the transparency of an image, the image will display normally. I don't know whether my understanding is right.

JorjMcKie Dec 15, 2022
Maintainer

My understanding is that, when I extract the transparency of image, these images will display normally. I don't know whether my understanding is right.

No, this is a misconception. All the images in your test file are opaque, meaning they do not have an alpha channel.
Therefore your extraction via get_text() was successful.
If you use the PDF-specific approach Page.get_images() (remember: get_text works for all supported document types, not only PDF), then you will be given the image's xref in position 0 of each item and the xref of the transparency mask in position 1. If the value in position 1 is > 0, then the image given in position 0 must be modified:

1. add an alpha channel
2. put the mask bytes into the new alpha channel
3. then proceed with the resulting pixmap

To find out the transform matrix in this situation again, use Page.get_image_rects(xref, transform=True), and proceed like above.
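A minimal sketch of the three steps above, assuming `doc` is an open PDF and `item` is one entry of page.get_images(). The function names mask_xref and pixmap_with_alpha are illustrative only. The fitz.Pixmap(pix, mask) constructor performs both "add an alpha channel" and "fill it from the mask" in one call; the check for an existing alpha channel on the base image follows the fix given further down in this thread.

```python
def mask_xref(item):
    """Return the transparency-mask xref of a Page.get_images() item (0 = none)."""
    return item[1]


def pixmap_with_alpha(doc, item):
    """Build a pixmap for one get_images() item, using its mask as alpha."""
    import fitz  # PyMuPDF; imported lazily so mask_xref() works without it

    pix = fitz.Pixmap(doc, item[0])  # base image, xref in position 0
    smask = mask_xref(item)
    if smask > 0:
        mask = fitz.Pixmap(doc, smask)  # the mask is stored as its own image
        if pix.alpha:  # the base image must not carry an alpha channel itself
            pix = fitz.Pixmap(pix, 0)
        pix = fitz.Pixmap(pix, mask)  # base image plus mask as alpha channel
    return pix
```

Calling pixmap_with_alpha(doc, item) for every item in page.get_images() yields pixmaps that can then be saved or processed further.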

buptyyf Dec 15, 2022
Author

All the images in your test file are opaque, meaning they do not have an alpha channel. Therefore your extraction via get_text() was successful.

I don't understand why these images have a black background when extracted by get_text(), but a white background in the PDF file. The smask of all these images is 0.

I know I can use pix.invert_irect() to handle these six images. But how can I detect which images should be inverted in other PDF files? Does it mean that I should invert an image whenever its bpc is 1?

JorjMcKie Dec 15, 2022
Maintainer

bpc=1 is the only case where this is necessary.
Otherwise it is a bug, which you should report please.

buptyyf Dec 16, 2022
Author

Both this image and this image have the same property bpc = 1, but only the second one needs its colors inverted. How can I detect which image should be inverted?

buptyyf · 2022-12-16T04:26:53Z

buptyyf
Dec 16, 2022
Author

Another question about image extraction.

CB_Aj05vi5wT4PZ6bp6cf.pdf

I extracted the first image on page 0 in two ways.

First I used the get_text() API to extract the image content directly. The extracted image has a black background, but it has a white background in the PDF.

Then I checked this image as follows:

>>> doc.extract_image(14)
{'ext': 'jpeg', 'smask': 15, 'width': 1244, 'height': 514, 'colorspace': 3, 'bpc': 8, 'xres': 96, 'yres': 96, 'cs-name': 'DeviceRGB', 'image': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00`\x00`\x00\x00...(\xa2\x80\n(\xa2\x80?\xff\xd9'}
>>> doc.extract_image(15)
{'ext': 'png', 'smask': 0, 'width': 1244, 'height': 514, 'colorspace': 1, 'bpc': 8, 'xres': 96, 'yres': 96, 'cs-name': 'DeviceGray', 'image': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x04\xdc...\x00\x00IEND\xaeB`\x82'}
>>> pix14 = fitz.Pixmap(doc, 14)
>>> pix15 = fitz.Pixmap(doc, 15)
>>> pix = fitz.Pixmap(pix14, pix15)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/yifei/.pyenv/versions/3.10.7/envs/wrpdf_proj/lib/python3.10/site-packages/fitz/fitz.py", line 6820, in __init__
    _fitz.Pixmap_swiginit(self, _fitz.new_Pixmap(*args))
RuntimeError: color pixmap must not have an alpha channel

I followed your method, but got the error above.

In addition, the bpc of this image is 8, not 1, so I did not use pix.invert_irect() to invert it.

0 replies

JorjMcKie · 2022-12-16T08:34:27Z

JorjMcKie
Dec 16, 2022
Maintainer

This is a base image with an SMask (= alpha channel) stored. So the get_text() method will fail for it.
You then correctly extracted both, base image and SMask. But the PDF creator stored the base image with a (presumably useless) alpha channel - that's what the error message complains about.

You can remove the alpha channel from the base image before you add the SMask pixmap:

if pix14.alpha:  # check if alpha channel there
    pix14 = fitz.Pixmap(pix14, 0)  # remove alpha channel
pix = fitz.Pixmap(pix14, pix15)

3 replies

buptyyf Dec 16, 2022
Author

Amazing! You can solve everything. Thanks a lot!
The question above also needs your help: all of these images have smask = 0, but some need to be inverted and others don't.

JorjMcKie Dec 16, 2022
Maintainer

smask = 0 does not determine whether or not to invert the pixmap - an smask > 0 only says that we need an alpha channel.
But if img["bpc"] == 1 (and especially if also img["cs-name"] == "None"), the inversion is (probably - there may be issues) required.
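The rule of thumb above could be written down as a small check. This is only a sketch of the heuristic as stated, not a guaranteed test: inversion_hint is a hypothetical helper, and `img` is assumed to be a dict like the one returned by doc.extract_image(), with "bpc" and "cs-name" keys. A bpc of 1 without "cs-name" equal to "None" is rated as merely possible, since two bpc = 1 images in this thread behaved differently.

```python
def inversion_hint(img):
    """Rate how likely a black-and-white image needs color inversion.

    `img` is a dict with "bpc" and "cs-name" keys, as returned by
    Document.extract_image(). Returns "likely", "possible" or "no".
    Hypothetical helper encoding the heuristic from this thread.
    """
    if img.get("bpc") != 1:
        return "no"  # only 1-bit images are candidates
    if img.get("cs-name") == "None":
        return "likely"  # the stronger signal mentioned above
    return "possible"  # bpc == 1 alone: check the result visually


# the two images examined earlier both have bpc == 8
hint_rgb = inversion_hint({"bpc": 8, "cs-name": "DeviceRGB"})
hint_gray = inversion_hint({"bpc": 8, "cs-name": "DeviceGray"})
```

For a "likely" or "possible" result, the inversion itself would be done with pix.invert_irect() as shown earlier, ideally after checking the image by eye.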

buptyyf Dec 16, 2022
Author

Okay!
img["cs-name"] = "None" is a key point I missed.

Thanks again!!!
