Replies: 6 comments
-
Confirming the issue. Also happens with the base library. I am forwarding this to the base library's issue system, ok? As per your question:
-
For your information: this is the issue entered on MuPDF's bug tracker.
-
I looked a bit deeper into the problem.
-
I did some more analysis of your case. Now comes the point:
... the following CMAP is being used:
This change (made for both fonts in use) removes your problem: all text is extracted as expected. So why does other text extraction software produce correct results in this situation? It seems that this is simply the result of guesswork. The original, full Liberation Serif font files do map glyph number 3 to space. I haven't received a note from MuPDF's issue system yet, so I guess we will have to wait for a definite answer from them.
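To illustrate the reasoning, here is a toy sketch of a glyph-to-Unicode lookup. It is not MuPDF's actual code: the function name `decode_glyphs` and all glyph ids except "glyph 3 is the space" are made up for the example. It only shows how a missing mapping entry can turn every space into U+FFFD, and how adding that one entry restores the text.

```python
REPLACEMENT = "\uFFFD"  # Unicode replacement character (decimal 65533)


def decode_glyphs(glyph_ids, to_unicode):
    """Translate glyph ids to text; unmapped glyphs become U+FFFD."""
    return "".join(to_unicode.get(gid, REPLACEMENT) for gid in glyph_ids)


# Hypothetical mapping: these glyph ids are invented for illustration.
# The entry for glyph 3 (the space) is deliberately missing.
broken_cmap = {55: "T", 75: "h", 72: "e", 38: "C"}
glyphs = [55, 75, 72, 3, 38]

# Without an entry for glyph 3, the space cannot be translated.
broken = decode_glyphs(glyphs, broken_cmap)

# Adding the single missing entry yields readable text again.
fixed = decode_glyphs(glyphs, {**broken_cmap, 3: " "})
```

A tool that "guesses" would presumably fall back to some other source, such as the glyph's position in the original font, instead of returning U+FFFD.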
-
Thanks for the details @JorjMcKie! The document itself is not important. I just grabbed a bunch of random documents and reported the two most noticeable issues (in case it might help make (py)mupdf better). That said, as someone not intimately familiar with Unicode and fonts, your analysis was very educational!
-
I am going to move this to "Discussions" - as we have clarified, there is no bug (except in the PDF itself).
-
When using `page.get_text_blocks` on a specific document (attached), every single space becomes a question mark (U+FFFD, decimal 65533), e.g. `"The�Count�of�Monte�Cristo\n"`. I'm aware that this is how mupdf/pymupdf denotes glyphs it cannot understand, but it's odd that the same document can be read fine with Apple's Preview and Google's Chrome/pdfium.
Attached: 797The-Count-of-Monte-Cristo.pdf
To Reproduce (mandatory)
Your configuration (mandatory)
Additional question
Does (py)mupdf have any sort of 'drop invalid characters' option? Ideally we'd drop both �'s and others (e.g. split surrogates from #2608, or Private Use ones such as U+10FC31).
Of course, I can do `string.replace('\uFFFD', ' ')`, but that messes with the `page.get_text_words` result, plus I'd need to compile a list of all 'bad characters', which seems wrong.