Question about � by page.get_text() #2401
Hello, when I run the code below, I get text like '�����������'.
When I run the code below, I get the image instead. The text in the green box is "上海增值税电子普通发票" (Shanghai VAT electronic ordinary invoice).
Q1: Why does get_text return "�" while the pixmap shows the correct characters? What is the difference between them? Is "�" perhaps the code used when decoding a character fails?
Q2: Is there any method to get the raw data (perhaps bytes) so that I can decode such text with my own custom decoder?
PDF data here. Thank you!
Replies: 1 comment
This is not unusual!
Please remember that PDF is a file format primarily meant for viewing data, and only to a lesser extent for extracting it.
So it is perfectly possible for a font to display characters correctly without supporting extraction of the written text.
For extraction, a translation table (usually the data in object `/ToUnicode`) is used, which delivers the original Unicode number that caused the character's appearance in the PDF. This table may be missing (or be incorrect or incomplete). In those cases you will see the error Unicode `0xFFFD`, displayed as the black question mark.
Nothing can be done about this situation, except using OCR as described in this example script.
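To make the OCR suggestion a bit more concrete, here is a minimal sketch (not the linked example script) of how one might detect pages dominated by `0xFFFD` and only then fall back to OCR. The helper names `replacement_ratio` and `extract_with_ocr_fallback`, the 20% threshold, and the `chi_sim` language code are assumptions made for illustration; the OCR path also assumes Tesseract and the matching language data are installed.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, returned by get_text() for glyphs without a Unicode mapping


def replacement_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT for ch in visible) / len(visible)


def extract_with_ocr_fallback(path, threshold=0.2, language="chi_sim", dpi=300):
    """Extract text per page, re-reading a page via OCR if too much of it is U+FFFD.

    threshold, language and dpi are illustrative defaults, not recommendations.
    """
    import fitz  # PyMuPDF; imported lazily so replacement_ratio() works without it

    results = []
    with fitz.open(path) as doc:
        for page in doc:
            text = page.get_text()
            if replacement_ratio(text) > threshold:
                # OCR the rendered page image instead of trusting the font tables.
                textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
                text = page.get_text(textpage=textpage)
            results.append(text)
    return results
```

Checking the ratio first keeps the normal (fast) extraction for pages whose fonts do carry a usable `/ToUnicode` table, and pays the OCR cost only where extraction has clearly failed.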