This is not unusual!
Please remember that PDF is a file format meant primarily for displaying data, and only to a lesser extent for extracting it.
So a font can display its characters perfectly well without supporting extraction of the underlying text.
For extraction, a translation table (usually the data in the font's /ToUnicode object) is used, which delivers the original Unicode code point that caused the character's appearance in the PDF.
This table may be missing, incorrect, or incomplete. In those cases you will see the Unicode replacement character 0xFFFD (U+FFFD), typically displayed as a question mark inside a black diamond.
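
To make the mechanism concrete, here is a toy sketch in plain Python. It is not how any PDF library actually implements the lookup; the function `decode_glyphs`, the glyph codes, and the table contents are all made up for illustration. It only shows why a gap in the translation table ends up as U+FFFD in the extracted text.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_glyphs(glyph_codes, to_unicode):
    """Translate font-internal glyph codes to text.

    `to_unicode` stands in for a /ToUnicode-like table (hypothetical data).
    Any code without an entry becomes U+FFFD.
    """
    return "".join(to_unicode.get(code, REPLACEMENT) for code in glyph_codes)


# Hypothetical table: codes 1 to 3 are mapped, code 4 has no entry.
to_unicode = {1: "P", 2: "D", 3: "F"}
glyph_codes = [1, 2, 3, 4]

text = decode_glyphs(glyph_codes, to_unicode)
print(repr(text))
```

All four glyphs could still be drawn correctly on the page, because drawing only needs the glyph code, not the Unicode value.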

Nothing can be done about this situation, except using OCR as described in this example script.
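
Before falling back to OCR, it can help to check how badly a page is affected. The following is a minimal sketch in plain Python, assuming you already have the extracted text of a page as a string from whatever extraction call you use. The helper names `replacement_ratio` and `needs_ocr` and the 10% threshold are my own assumptions, not part of any library.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replacement_ratio(text):
    """Return the share of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT for ch in visible) / len(visible)


def needs_ocr(text, threshold=0.1):
    """Flag a page for OCR when too many characters failed to translate.

    The default threshold is an arbitrary assumption; tune it for your files.
    """
    return replacement_ratio(text) > threshold


print(needs_ocr("PDF\ufffd"))          # one of four characters is U+FFFD
print(needs_ocr("plain, clean text"))  # no U+FFFD at all
```

Note that a page with no extractable text at all yields a ratio of 0.0 here, so image-only pages would need a separate check.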

Answer selected by tangent2018