Please provide all mandatory information!

Describe the bug (mandatory)
I have a flock of PDFs that have the following attributes:

To Reproduce (mandatory)
Run PDF2Text.py. The text it produces has nothing to do with the text in the PDF.

Expected behavior (optional)
Get the PDF text.

Screenshots (optional)
If applicable, add screenshots to help explain your problem.

Your configuration (mandatory)
print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.

Attached the example PDF.
Replies: 5 comments
I am sorry, but this is a MuPDF problem ... if it is a problem at all! The file uses a so-called Type3 font and contains no information for back-translating glyphs to Unicode values.
Thank you for the prompt reply! So with this kind of file, is it possible to use the font replacement scripts? I've tried OCRing it, but many random characters appear where lines are close to the letters, although the text output is about 90% correct.
Unfortunately not, because font replacement also internally uses text extraction 😒.
I'll try that as soon as possible!
I am going to convert this to a Discussions item, so you may add any further observation to it later on.
I am sorry, but this is a MuPDF problem ... if it is a problem at all! The file uses a so-called Type3 font and contains no information for back-translating glyphs to Unicode values.
The /Encoding object (xref 107) would have to contain this info, which it doesn't.
The only thing you can do is OCR-ing ...
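For anyone who wants to check their own files for the same situation, here is a minimal sketch that lists the Type3 fonts on each page and, as a fallback, runs OCR through PyMuPDF. It assumes the tuple layout returned by `page.get_fonts()` (xref, ext, type, basefont, name, encoding) and that `page.get_textpage_ocr()` is available, which requires a working Tesseract installation with its language data. The function names and the default `dpi` value are illustrative choices, not part of PyMuPDF.

```python
def type3_fonts(font_entries):
    """Return the font entries whose type is Type3.

    Each entry is assumed to follow the layout of page.get_fonts():
    (xref, ext, type, basefont, name, encoding, ...).
    """
    return [entry for entry in font_entries if entry[2] == "Type3"]


def report_type3(path):
    """Print every Type3 font found in the PDF at `path`, page by page."""
    import fitz  # PyMuPDF; imported lazily so type3_fonts() has no dependency

    with fitz.open(path) as doc:
        for page in doc:
            for xref, _ext, _type, _basefont, name, encoding, *_ in type3_fonts(
                page.get_fonts()
            ):
                print(
                    f"page {page.number}: xref {xref}, "
                    f"font {name!r}, encoding {encoding!r}"
                )


def ocr_text(path, language="eng", dpi=300):
    """Return one OCR'ed text string per page.

    Assumption: Tesseract and its language data are installed and can be
    found by PyMuPDF. full=True OCRs the whole rendered page instead of
    relying on the (unusable) embedded text.
    """
    import fitz

    texts = []
    with fitz.open(path) as doc:
        for page in doc:
            textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
            texts.append(page.get_text(textpage=textpage))
    return texts
```

Calling `report_type3("example.pdf")` (a placeholder file name) shows which pages depend on Type3 fonts. If those fonts carry no usable encoding information, `ocr_text("example.pdf")` is the fallback, with the OCR quality caveats already mentioned in this thread.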