Question about � by page.get_text() #2401

tangent2018 · 2023-05-11T03:35:05Z

Hello,

When I run the code below, the extracted text comes back as '��'.

>>> import fitz  # PyMuPDF
>>> doc = fitz.open(filename=pdf_path)
>>> page = doc[0]
>>> print(page.get_text(option='dict'))
{'width': 595.35,
 'height': 396.9,
 'blocks': [{'number': 0,
   'type': 0,
   'bbox': (193.16273498535156,
    19.852397918701172,
    402.1956787109375,
    38.855403900146484),
   'lines': [{'spans': [{'size': 19.003005981445312,
       'flags': 12,
       'font': 'BWSimKai',
       'color': 10244642,
       'ascender': 0.859375,
       'descender': -0.140625,
       'text': '�����������',
       'origin': (193.16273498535156, 36.18310546875),
       'bbox': (193.16273498535156,
        19.852397918701172,
        402.1956787109375,
        38.855403900146484)}],
     'wmode': 0,
     'dir': (1.0, 0.0),
     'bbox': (193.16273498535156,
      19.852397918701172,
      402.1956787109375,
      38.855403900146484)}]},
...}
>>> print(page.get_fonts())
[(5, 'ttf', 'Type0', 'BWSimKai', 'JF1', 'Identity-H'), (12, 'n/a', 'Type0', 'SimSun', 'JF2', 'UniGB-UCS2-H'), (15, 'n/a', 'TrueType', 'CourierNew', 'JF3', 'WinAnsiEncoding')]

If I render the page with the code below, the image comes out correctly: the text "上海增值税电子普通发票" appears in the green box.

>>> trans = fitz.Matrix(2, 2).prerotate(0)
>>> pm = page.get_pixmap(matrix=trans, alpha=False)

image here

Q1: Why does get_text return "�" while the pixmap shows the correct characters? What is the difference between the two? Is "�" perhaps the code emitted when decoding a character fails?

Q2: Is there any way to get the raw data (perhaps bytes) so that I can decode this text with my own custom decoder?

pdf data here
20230423092505d66c17d3fa77473e81839dd829197931.pdf

Thank you!

Answered by JorjMcKie

JorjMcKie (Maintainer) · 2023-05-11T11:50:11Z

This is not unusual!
Please remember that PDF is a file format primarily meant for viewing data, and only to a lesser extent for extracting it.
So it is perfectly possible for a font to display characters correctly without supporting extraction of the written text.
For extraction, a translation table (usually the data in the /ToUnicode object) is used, which delivers the original Unicode number that caused the character's appearance in the PDF.
This table may be missing (or be incorrect or incomplete). In those cases you will see the Unicode replacement character 0xFFFD (U+FFFD), displayed as the black question mark.
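
To make the above concrete, here is a minimal sketch of how one might check a page for this condition. It assumes PyMuPDF's `Document.xref_get_key()` reports a key type of "null" for a missing dictionary key and that `Page.get_fonts()` yields tuples starting with the font xref, as in the output shown in the question. The helper names `fonts_without_tounicode` and `replacement_ratio` are made up for illustration and are not part of PyMuPDF.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the character that shows up when no Unicode mapping is found


def fonts_without_tounicode(doc, page):
    """Return (xref, basefont, encoding) for fonts on `page` whose font object
    has no /ToUnicode key. `doc` and `page` are expected to be a PyMuPDF
    Document and Page (hypothetical helper, not a PyMuPDF function)."""
    missing = []
    for xref, _ext, _type, basefont, _name, encoding, *_ in page.get_fonts():
        kind, _value = doc.xref_get_key(xref, "ToUnicode")
        if kind == "null":  # assumption: "null" means the key is absent
            missing.append((xref, basefont, encoding))
    return missing


def replacement_ratio(page_dict):
    """Fraction of characters in a page.get_text("dict") result that are U+FFFD."""
    total = bad = 0
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line["spans"]:
                text = span["text"]
                total += len(text)
                bad += text.count(REPLACEMENT)
    return bad / total if total else 0.0


# Usage sketch (requires PyMuPDF; pdf_path is a placeholder):
#   import fitz
#   doc = fitz.open(pdf_path)
#   page = doc[0]
#   print(fonts_without_tounicode(doc, page))
#   print(replacement_ratio(page.get_text("dict")))
```

A missing /ToUnicode entry is only a hint, not proof: a font that uses a predefined encoding may still extract fine without one, and a font that has the table can still yield U+FFFD if the table is incomplete. The ratio of replacement characters in the extracted text is the more direct symptom.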

Nothing can be done about this situation, except using OCR as described in this example script.
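
As a rough illustration of the OCR route, the sketch below falls back to whole-page OCR when too much of the extracted text is U+FFFD. This is a simpler variant than the linked example script and is not a copy of it. It assumes a PyMuPDF version that provides `Page.get_textpage_ocr()`, a working Tesseract installation, and Tesseract language data for simplified Chinese ("chi_sim"). The function name, the 10% threshold, and the 300 dpi setting are arbitrary choices for illustration.

```python
REPLACEMENT = "\ufffd"  # U+FFFD marks characters that could not be mapped to Unicode


def extract_text_with_ocr_fallback(page, language="chi_sim", dpi=300, max_bad_ratio=0.1):
    """Return the page text, re-extracting via OCR if it is mostly unreadable.

    `page` is expected to be a PyMuPDF Page. The threshold and defaults are
    illustrative assumptions, not recommended values.
    """
    text = page.get_text()
    if not text:
        return text  # nothing extracted; this sketch does not OCR empty pages
    bad_ratio = text.count(REPLACEMENT) / len(text)
    if bad_ratio <= max_bad_ratio:
        return text  # normal extraction looks usable
    # full=True asks for OCR of the whole rendered page instead of images only
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=True)
    return page.get_text(textpage=textpage)


# Usage sketch (requires PyMuPDF and Tesseract; pdf_path is a placeholder):
#   import fitz
#   doc = fitz.open(pdf_path)
#   print(extract_text_with_ocr_fallback(doc[0]))
```

OCR output is a best-effort reading of the rendered page, so the result can contain recognition errors that normal extraction would not.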
