Fitz - Error when processing images from PDF to HTML #1389

@carlaTV

Describe the bug (mandatory)

When page.get_text("html") is applied to some PDFs that contain images, the returned HTML contains the image turned upside down, plus an additional black image of the same size as the original behind it. This does not happen with other documents.

To Reproduce (mandatory)

I attach two documents: document_OK.pdf and document_ERROR.pdf.

In order to reproduce the error, please run the following code:

```python
import fitz  # PyMuPDF

file_ok = 'document_OK.pdf'
file_error = 'document_ERROR.pdf'

doc_ok = fitz.open(file_ok)
doc_error = fitz.open(file_error)

# Concatenate the HTML output of every page of the working document.
html_text_ok = ''
for page in doc_ok:
    html_text_ok += page.get_text("html")

# Same for the document that shows the problem.
html_text_error = ''
for page in doc_error:
    html_text_error += page.get_text("html")
```

Expected behavior (optional)

The expected behavior is the one obtained with document_OK.pdf (see screenshot): a single, correctly oriented image. Its base64 data is in the file image_base64_OK.txt and the generated HTML is in the file document_OK.txt (uploading HTML files is not permitted, so I attach it as txt).

document_OK.txt
image_base64_OK.txt

However, the image in document_ERROR.pdf is converted to two images (see screenshot):

  • A black rectangle.
  • The image in the PDF, turned upside down.

I also attach the base64 files of both images and the HTML file generated by fitz, document_ERROR.txt (again, I attach it as txt although it is an HTML file).

document_ERROR.txt
image_ERROR1.txt
image_ERROR2.txt
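
To check how many images end up in the HTML output and to look at each one separately, the embedded image data can be pulled out of the HTML string. This is only a minimal sketch using the standard library: the helper name `extract_embedded_images` is hypothetical, and it assumes that the images are embedded as `src="data:image/<type>;base64,<payload>"` attributes, which may not match the exact output of every version.

```python
import base64
import re

# Assumption: embedded images appear as data URIs of the form
# src="data:image/<type>;base64,<payload>" in the HTML output.
DATA_URI = re.compile(r'src="data:image/(\w+);base64,([^"]+)"')


def extract_embedded_images(html):
    """Return a list of (image_type, raw_bytes), one entry per data-URI image."""
    images = []
    for image_type, payload in DATA_URI.findall(html):
        # Remove any whitespace in case the payload is wrapped over several lines.
        payload = "".join(payload.split())
        images.append((image_type, base64.b64decode(payload)))
    return images
```

Calling `extract_embedded_images(html_text_error)` on the output of the reproduction code should return one entry per embedded image, so the black rectangle and the flipped image described above would be expected to show up as two separate entries that can each be written to a file and inspected.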

Screenshots (optional)

Screenshots added in the previous section.

Your configuration (mandatory)

Operating system: macOS 12.0.1

Python and PyMuPDF versions:
3.8.12 (default, Oct 22 2021, 18:39:35)
[Clang 13.0.0 (clang-1300.0.29.3)]
darwin

PyMuPDF 1.19.1: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-10-23 00:00:01.
Built for Python 3.8 on darwin (64-bit).

Additional context (optional)

Of course, I am aware that the problem might come from the PDF itself, but I don't know why the image is recognized as two images, nor why one of them is turned upside down.
I would really appreciate it if you could give me some insight into how to solve this. Thanks a lot!

Labels

upstream bug (bug outside this package)