Text Extraction from PDF Results in Garbled Characters #3801
Replies: 3 comments 2 replies
-
This is not a bug; it goes back to properties / deficiencies of the font(s) used. In addition, PyMuPDF's default extraction flags use the glyph number instead of the Unicode [...] then the Unicode's value is [...]. But as you report: when other extractors also deliver crap, then we are simply out of luck!
-
The only way out I see is using OCR ...
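Before falling back to OCR, it can help to check whether the extracted text really is unusable. The following is only a rough sketch using the Python standard library: the helper names `garbled_ratio` and `needs_ocr` and the 0.2 threshold are hypothetical choices for illustration, not part of PyMuPDF. It counts replacement characters (U+FFFD), control characters, private-use and unassigned code points, which often show up when a font lacks a usable Unicode mapping. It will not catch text that decodes to valid but wrong letters.

```python
import unicodedata

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode replacement character

# Unicode general categories treated as suspicious (an assumption):
# Co = private use, Cc = control, Cn = unassigned
SUSPICIOUS_CATEGORIES = ("Co", "Cc", "Cn")


def garbled_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look undecodable."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = sum(
        1
        for c in chars
        if c == REPLACEMENT_CHAR
        or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return suspicious / len(chars)


def needs_ocr(text: str, threshold: float = 0.2) -> bool:
    """Heuristic: suggest OCR when too many characters look undecodable.

    The threshold is a hypothetical default; tune it on your own documents.
    """
    return garbled_ratio(text) > threshold
```

The text passed in would be whatever your extractor returned for a page. Pages flagged by `needs_ocr` could then be sent through an OCR step, while clean pages keep the faster direct extraction.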
-
All data in a PDF, including object definitions and binary content, is available for low-level access and update.
-
Description of the bug
When extracting text from the attached PDF, the output contains garbled characters. Additionally, when I tried copying & pasting the content from other PDF viewers, similar issues occurred.
I'm unsure if this is related to encoding settings or if there is a way to correct this behavior. Any guidance or potential fixes would be appreciated.
How to reproduce the bug
The PDFs are attached:
king arthur.pdf
The Phantom of the Opera.pdf
Code Sample:
Output:
The extracted content includes cid values such as:
PyMuPDF version
1.24.9
Operating system
macOS
Python version
3.12