Cannot open broken documents #1991

IvanDesuo · 2022-10-24T23:38:47Z

IvanDesuo
Oct 24, 2022

Please provide all mandatory information!

Describe the bug (mandatory)

I have developed a Python script using PyMuPDF to extract information from medical PDFs and organize the data the way I want, with graphs and so on, in bulk, in a for loop. It opens all documents in the folder (using fitz.open), extracts text from a given page, cleans the text, tokenizes it, and builds Excel sheets and graphs with the target data. It works well, but I'm seeing something strange when I use the script with a new kind of document. The new document is based on the one the script was developed for, but it has fewer pages and less information.

When I run the code I get the error "cannot open broken document". But the document is based on the previous one, which works correctly (a PDF generated in MS Word from a docx). What defines a broken document? How can I verify that it is indeed broken? I can provide sample documents if they are needed to verify the error.

Furthermore, once I get this error and try to reuse the script with the previous documents, the script stops working and I get the same error even with the documents that worked before. I need to restart the computer or unzip the package in a new location. It is as if the new documents break the script permanently.

To Reproduce (mandatory)

Traceback (most recent call last):
  File "C:\Users\desuo\Desktop\Laudo Miner v1.0\Laudo Miner v1.0\pdf_laudo_miner_v1_for_executable.py", line 51, in <module>
    with fitz.open(os.path.join(path, files)) as pdf_file:
  File "C:\Python39\lib\site-packages\fitz\fitz.py", line 3876, in __init__
    _fitz.Document_swiginit(self, _fitz.new_Document(filename, stream, filetype, rect, width, height, fontsize))
fitz.fitz.FileDataError: cannot open broken document

Expected behavior (optional)

I would expect it to work as intended.

Screenshots (optional)

This is the structure of the script. In the input_pdf folder, I put all the documents I want to be mined. graphs is where frequency graphs of tokens are generated.

Your configuration (mandatory)

Windows 10 (64 bits)
Python version 3.9
PyMuPDF version 1.20.2 using pip install

"PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.
Version date: 2022-08-13 00:00:01.
Built for Python 3.9 on win32 (64-bit)."

Additional context (optional)

Thanks for your work on this library.

JorjMcKie · 2022-10-25T08:07:55Z

JorjMcKie
Oct 25, 2022
Maintainer

Please send me one example PDF that you cannot open. If possible please also send me the docx Word document from which that PDF has been created (presumably via the MS Word export function).
I am doing this type of thing myself and never encountered issues.

To do more investigation yourself:

# open your problem file
doc = fitz.open("file.pdf")
# after the exception do this:
print(fitz.TOOLS.mupdf_warnings())
# and inspect the output

The output of the function shows a collection of error and warning messages issued (mostly) by the underlying MuPDF open routines.

0 replies

IvanDesuo · 2022-10-25T12:51:18Z

IvanDesuo
Oct 25, 2022
Author

I did some more investigation, and I suspect it may indeed be related to the Word-to-PDF conversion.

When I open the "broken" file using fitz outside of the for loop, it works and no exception is raised. I can also extract the target page:

doc = fitz.open("Vox l name.pdf")
text = doc.get_page_text(1)
print(fitz.TOOLS.mupdf_warnings())
print(text)

In this example I extracted text from page 2 of a "broken" PDF, but the printout also contained text from page 3. Could that be caused by section or page breaks in the Word document? There are page breaks in it.

Out of curiosity here is the for loop code I'm using that is not working:

`path = "./input_pdf"
doc_number = 1

lis = []
for files in os.listdir(path):
    with fitz.open(os.path.join(path, files)) as pdf_file:
        text_by_page = pdf_file.get_page_text(1)
        lis.append(text_by_page)
    data = pd.DataFrame({"text": lis})
    doc_number += 1
`
Why does it work in isolation but not in a for loop?

To provide the documents for you to look at, I had to anonymize the patient name and other details, so I had to change the documents. After making these changes and converting to PDF on my computer, I tested them and the script worked as intended. The "broken" PDFs were generated on another computer. Strangely, some are 300 KB and others up to 3 MB, even though they have similar content. I'm attaching the altered files here. They work, but they might still provide some clues about what is happening. I still want to understand what is wrong with these PDFs to prevent the same error in the future.

Vox l masc.docx
Vox l masc.pdf

Thanks for your attention.

0 replies

JorjMcKie · 2022-10-25T14:24:57Z

JorjMcKie
Oct 25, 2022
Maintainer

Why does it work in isolation but not in a for loop?

Obviously, the only remaining explanation now is some problem in your code, or something else in your setup.

I again recommend printing mupdf_warnings() whenever an open raises an exception. This may help you understand what PyMuPDF is actually trying to open. Who knows what has happened to the file on its way to where your script lives?

I reviewed the two files you attached, and no problem whatsoever shows up.

So for the time being, your issue cannot be reproduced.
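
One way to follow the advice above inside a loop is to record exactly which path was handed to the opener each time it fails. The following is only a sketch: `open_each` and its parameters are hypothetical names, not part of PyMuPDF, and the opener is passed in so the helper does not depend on any particular library.

```python
import os


def open_each(folder, opener, on_error=print):
    """Try to open every entry in `folder` and report the ones that fail.

    `opener` is any callable that takes a path (for example fitz.open).
    Returns a list of (path, error_text) tuples for the failures.
    """
    failures = []
    for name in sorted(os.listdir(folder)):
        full_path = os.path.join(folder, name)
        try:
            doc = opener(full_path)
        except Exception as exc:
            # Keep the exact path so the offending file can be inspected later.
            failures.append((full_path, repr(exc)))
            on_error(f"failed: {full_path!r} -> {exc!r}")
            continue
        # Close the document if the returned object supports it.
        close = getattr(doc, "close", None)
        if callable(close):
            close()
    return failures
```

With PyMuPDF this might be called as `open_each("./input_pdf", fitz.open)`, printing `fitz.TOOLS.mupdf_warnings()` next to each reported path.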

0 replies

IvanDesuo · 2022-10-25T14:37:47Z

IvanDesuo
Oct 25, 2022
Author

This is what I get in mupdf_warnings() printing:

`>>> print(fitz.TOOLS.mupdf_warnings())
cannot recognize version marker
trying to repair broken xref
repairing PDF document
no objects found

`

0 replies

JorjMcKie · 2022-10-25T14:50:09Z

JorjMcKie
Oct 25, 2022
Maintainer

Aha!
Now it is definite: That file is no PDF!
It is no bug of PyMuPDF.
I will transfer this issue to the "Discussions" tab and help you further investigate.

3 replies

IvanDesuo Oct 25, 2022
Author

Wow! Maybe an error during the Word conversion? On my machine it shows up as a PDF file. Strange!

Thx for the help!

JorjMcKie Oct 25, 2022
Maintainer

At least it is now extremely improbable that we have a (Py-)MuPDF issue. Otherwise, we'll open a new issue.

On the next open exception, do this please:

import pathlib
import fitz

filename = "file.pdf"  # placeholder: the path that fails to open
try:
    doc = fitz.open(filename)
except Exception:
    print(fitz.TOOLS.mupdf_warnings())
    pth = pathlib.Path(filename)
    buffer = pth.read_bytes()
    print(buffer[:200])  # shows the first 200 bytes of the file
    # save a copy of the file elsewhere for later investigation
    with open("test.pdf", "wb") as out:
        out.write(buffer)

Then please send me the printout, and try to open "test.pdf" offline and see what happens.
If the open now works, you have a problem in your script.
If not, there is a problem in how the file is delivered to your script.

JorjMcKie Oct 25, 2022
Maintainer

What you are looking at are just Windows Explorer views of your files. That is no proof that they are valid PDF files. Windows only looks at the file extension (.pdf) and assumes the file is a PDF.
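
A quick check that does not rely on the file extension is to look for the `%PDF-` version marker near the start of the file. This is a minimal sketch, not a validator: `looks_like_pdf` is a hypothetical helper, and the 1024-byte probe window is an assumption based on how leniently common readers treat the header position.

```python
def looks_like_pdf(path, probe=1024):
    """Return True if the '%PDF-' marker occurs in the first `probe` bytes.

    This only checks the version marker; a file can pass this test and
    still be damaged further in.
    """
    with open(path, "rb") as fh:
        head = fh.read(probe)
    return b"%PDF-" in head
```

A file that fails this check is a likely candidate for the "cannot recognize version marker" warning, whatever its extension says.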

JorjMcKie · 2022-10-25T15:11:27Z

JorjMcKie
Oct 25, 2022
Maintainer

This is what a valid PDF version marker looks like (version 1.5 in this case):

5 replies

IvanDesuo Oct 25, 2022
Author

As I mentioned, when I try to open it in isolation, outside the for loop, it works, so the exception is never raised. I removed the try/except to get the printout anyway.

Here is the printout:

b'%PDF-1.3\n%\xe2\xe3\xcf\xd3\r\n2 0 obj\n<<\n/Length 17132\n>>\nstream\r\nq\n1 i \n11.999 666.681 221.76 163.32 re\nW n\n/GS1 gs\nq\n223.56 0 0 163.56 11.1214 666.6813 cm\n/Im1 Do\nQ\nQ\nq\n1 i \n11.999 503.121 221.76 163.56 re\nW n\n1 '

And yes, I was able to open it offline as well.

That makes me think something in the for loop is causing the issue. I don't understand what it could be, because the loop works fine with other files and also works after I convert the docx files to PDF on my computer.

I'm not attached to the code and just want it to work, so I'm attaching the app in case you have the time or interest to check it out. I still think it is more than a script error. I also included a folder with the "broken" PDFs. The input folder contains two working PDFs that are basically the same as the broken ones, but generated on my PC. If you move the "broken" PDFs to the input folder, the code should stop working. I'm far from an expert coder and am just taking baby steps here, so the code is probably not well optimized, but it works for what I need.

Miner v1.0.zip

Thanks again!

JorjMcKie Oct 25, 2022
Maintainer

The code looks harmless to me. I also tried the files in broken ... and they did work!

IvanDesuo Oct 25, 2022
Author

Oh well, lol... Probably something in my configuration then? It makes no sense to me, and this is what I was afraid of. Something strange is going on here; I will need to try to isolate it further.

Thanks for your help!

IvanDesuo Oct 25, 2022
Author

Just a heads-up: it was indeed a flaw in the script, specifically in the for loop. I don't know exactly how or why, but it seems to be related to a desktop.ini file: the script was picking it up and trying to open it as a PDF, or something like that. It seems to be solved with a new for loop:

for files in os.listdir(path):
    if fnmatch.fnmatch(files, '*.pdf'):
        with fitz.open(os.path.join(path, files)) as pdf_file:
            text_by_page = pdf_file.get_page_text(-2)
            lis.append(text_by_page)
        data = pd.DataFrame({"text": lis})
        doc_number += 1

and further in the code I also changed another line:

file_names = [os.path.join(f) for f in os.listdir(path) if fnmatch.fnmatch(f, '*.pdf')]

So yes, nothing related to PyMuPDF.
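
The same filtering can also be sketched with pathlib, which makes it easy to skip directories and to match the extension case-insensitively on every platform. `list_pdfs` is a hypothetical helper name, not something from the script or from PyMuPDF.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the regular files in `folder` whose suffix is .pdf.

    The suffix comparison is case-insensitive, so "REPORT.PDF" is
    included, while hidden helper files such as desktop.ini are not.
    """
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path can be passed straight to `fitz.open(...)`. Filtering by name only keeps stray files out of the loop; it does not prove that a remaining file is a valid PDF.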

Thanks so much for your help

JorjMcKie Oct 25, 2022
Maintainer

Aha! Thanks for the information.
