Description
Please provide all mandatory information!
Describe the bug (mandatory)
When `page.get_text("html")` is applied to some PDFs that contain images, the output contains the image turned upside down, plus a black image of the same size as the original placed behind it. This does not happen with other documents.
To Reproduce (mandatory)
I attach two documents:
- document_OK.pdf: the document where the error does not happen
- document_ERROR.pdf: the document where the error happens
In order to reproduce the error, please run the following code:
```python
import fitz  # PyMuPDF

file_ok = 'document_OK.pdf'
file_error = 'document_ERROR.pdf'

doc_ok = fitz.open(file_ok)
doc_error = fitz.open(file_error)

# Concatenate the HTML output of every page of the working document.
html_text_ok = ''
for page in doc_ok:
    html_text_ok += page.get_text("html")

# Do the same for the document that shows the problem.
html_text_error = ''
for page in doc_error:
    html_text_error += page.get_text('html')
```
Expected behavior (optional)
The expected behavior is the following:

For document_OK.pdf, the output contains the image in base64 (see image_base64_OK.txt) and the HTML in document_OK.txt (uploading HTML files is not permitted, so I attach it as a .txt file).
document_OK.txt
image_base64_OK.txt
However, the image in document_ERROR.pdf is converted to two images:

- A black rectangle.
- The image in the pdf turned upside down.
I also attach the base64 files of both images and the HTML file generated by fitz, document_ERROR.txt (again attached as .txt although it is an HTML file).
document_ERROR.txt
image_ERROR1.txt
image_ERROR2.txt
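For anyone who wants to check the number of embedded images without opening the attachments, here is a minimal sketch that pulls the base64 images out of the generated HTML string. It assumes each image appears as an `<img>` tag whose `src` attribute holds a `data:image/...;base64,` URI; I have not confirmed that this is the exact layout in every version. `extract_embedded_images` is a hypothetical helper name, not part of PyMuPDF.

```python
import base64
import re

# Assumed layout: src="data:image/<type>;base64,<payload>" inside the HTML string.
DATA_URI_RE = re.compile(
    r'src="data:image/([a-zA-Z0-9.+-]+);base64,([A-Za-z0-9+/=\s]+)"'
)


def extract_embedded_images(html):
    """Return a list of (image_type, raw_bytes), one per base64 data URI found."""
    images = []
    for match in DATA_URI_RE.finditer(html):
        image_type, payload = match.groups()
        # b64decode ignores whitespace in the payload by default.
        images.append((image_type, base64.b64decode(payload)))
    return images
```

With the strings from the reproduction script, comparing `len(extract_embedded_images(html_text_ok))` with `len(extract_embedded_images(html_text_error))` shows how many images each document produced, and each `raw_bytes` value can be written to a file to view the image.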
Screenshots (optional)
Screenshots added in the previous section.
Your configuration (mandatory)
Operating system: MacOS 12.0.1
Python and PyMuPDF versions:
3.8.12 (default, Oct 22 2021, 18:39:35)
[Clang 13.0.0 (clang-1300.0.29.3)]
darwin
PyMuPDF 1.19.1: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-10-23 00:00:01.
Built for Python 3.8 on darwin (64-bit).
Additional context (optional)
Of course, I am aware that the problem might come from the PDF itself, but I don't know why the image is recognized as two images, one of them turned upside down.
I would really appreciate any insight into how to solve this. Thanks a lot!