Cannot open broken documents #1991
-
Please provide all mandatory information!
Describe the bug (mandatory)
I have developed a Python script using PyMuPDF to extract information from medical PDFs and organize the data the way I want, with graphs and so on, in bulk, in a for loop. It opens every document in the folder (using fitz.open), extracts the text from a given page, cleans the text, tokenizes it, and builds Excel sheets and graphs with the target data.
It works well. However, I am facing something strange when I try to use the script with a new kind of document (based on the one the script was developed for), which has fewer pages and less information. When I run the code I get the error "cannot open broken document". But the document is based on the previous one, which works correctly (a PDF generated in MS Word from a docx). What defines a broken document? How can I verify that it is indeed broken? I can provide sample documents if they are needed to verify the error.
Furthermore, once I get this error and simply try to reuse the script with the previous documents, the script stops working and I get the same error, even with documents that worked before. I need to restart the computer or unzip the package in a new location. It is as if the new documents spoil the script permanently.
To Reproduce (mandatory)
Traceback (most recent call last):
Expected behavior (optional)
I would expect it to work as intended.
Screenshots (optional)
This is the structure of the script. In the input_pdf folder, I put all the documents I want mined. graphs is where the frequency graphs of tokens are generated.
Your configuration (mandatory)
PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.
Additional context (optional)
Add any other context about the problem here.
Thanks for your work on this library.
Replies: 6 comments 8 replies
-
Please send me one example PDF that you cannot open. If possible, please also send me the docx Word document from which that PDF was created (presumably via the MS Word export function).
To investigate further yourself:
import fitz
# open your problem file
doc = fitz.open("file.pdf")
# after the exception do this:
print(fitz.TOOLS.mupdf_warnings())
# and inspect the output
The output of that function is a collection of the error and warning messages issued (mostly) by the underlying MuPDF open routines.
-
I did some more investigation, and I suppose it may indeed be related to the Word/PDF conversion:
In this example I extracted text from page 2 of a "broken" PDF, but the printed output contained text from page 3 as well. Could this be caused by section or page breaks in the Word document? There are page breaks in it.
Out of curiosity, here is the for loop code I'm using that is not working:
`path = "./input_pdf"`
`lis = []`
Vox l masc.docx
Thanks for your attention.
-
Obviously, the only remaining explanation now is some problem in your code, or something else in your setup. I recommend again printing mupdf_warnings() whenever an open raises an exception. This may help you understand what it is that PyMuPDF is actually trying to open. Who knows what has happened to the file on its way to where your script lives? I reviewed the two files you attached, and no problem whatsoever shows up. So for the time being, your issue cannot be reproduced.
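The recommendation above can be sketched as a loop that keeps going after a failed open and records what MuPDF reported for each file. This is only an illustration: the function name `open_all`, the `*.pdf` filter, and the `open_fn` / `warnings_fn` parameters are assumptions introduced here, not part of PyMuPDF. By default they resolve to `fitz.open` and `fitz.TOOLS.mupdf_warnings`.

```python
from pathlib import Path


def open_all(folder, open_fn=None, warnings_fn=None):
    """Try to open every *.pdf in `folder`; return (opened, failures).

    `open_fn` and `warnings_fn` are hypothetical hooks of this sketch.
    They default to PyMuPDF's fitz.open and fitz.TOOLS.mupdf_warnings and
    exist only so the loop can be exercised without PyMuPDF installed.
    """
    if open_fn is None or warnings_fn is None:
        import fitz  # PyMuPDF, imported lazily

        open_fn = open_fn or fitz.open
        warnings_fn = warnings_fn or fitz.TOOLS.mupdf_warnings

    opened, failures = [], []
    for path in sorted(Path(folder).glob("*.pdf")):
        try:
            doc = open_fn(str(path))
        except Exception as exc:  # broad on purpose: record it and continue
            failures.append(
                {
                    "path": str(path),
                    "size": path.stat().st_size,  # 0 bytes is a common culprit
                    "error": str(exc),
                    "warnings": warnings_fn(),  # messages stored by MuPDF
                }
            )
            continue
        opened.append(str(path))
        doc.close()
    return opened, failures
```

Calling `open_all("./input_pdf")` and printing each entry of `failures` shows exactly which path was handed to the open call, how large the file was, and which messages MuPDF stored for it, which is usually enough to tell a damaged file from a wrong path or a non-PDF file in the folder.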
-
This is what I get when printing mupdf_warnings():
`>>> print(fitz.TOOLS.mupdf_warnings())`
-
Aha!
-
This is what a valid PDF version marker looks like (version 1.5 in this case):
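For reference, a PDF file normally starts with a header of the form `%PDF-1.5`. A minimal way to check this without PyMuPDF is to read the first bytes of the file and look for that marker. The helper name `pdf_header_version` and the 1024-byte search window are assumptions of this sketch. It only inspects the header and says nothing about whether the rest of the file is intact.

```python
import re

# "%PDF-" followed by a major.minor version, e.g. %PDF-1.5
_HEADER = re.compile(rb"%PDF-(\d\.\d)")


def pdf_header_version(path, window=1024):
    """Return the version from the %PDF-x.y marker as a string, or None.

    Hypothetical helper: only the first `window` bytes are searched.
    """
    with open(path, "rb") as fh:
        head = fh.read(window)
    match = _HEADER.search(head)
    return match.group(1).decode("ascii") if match else None
```

A result of `None` for a file in the input folder suggests that the file is empty, truncated, or not a PDF at all (for example a docx that was saved with a `.pdf` extension).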