Can't view the corresponding text from font file #2517

CodePythonFollow · 2023-07-05T14:44:48Z

I extracted the fonts using the method below:

fonts = page.get_fonts(full=True)

for font in fonts:
    res = file.extract_font(font[0])
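The loop above overwrites `res` on every pass, so nothing is kept for inspection. A minimal sketch of writing each extracted font to disk follows, assuming `extract_font()` returns a `(basename, ext, type, content)` tuple and reports `"n/a"` as the extension when no font file is embedded. The helper name `save_page_fonts` and the file naming scheme are hypothetical, not part of PyMuPDF.

```python
from pathlib import Path


def save_page_fonts(doc, page, out_dir):
    """Write every extractable font used on `page` into `out_dir`.

    `doc` and `page` are assumed to be a PyMuPDF Document and Page.
    Returns the list of paths that were written.
    """
    saved = []
    for font in page.get_fonts(full=True):
        xref = font[0]  # first item of each entry is the font's xref
        basename, ext, _font_type, content = doc.extract_font(xref)
        if ext == "n/a" or not content:
            continue  # nothing embedded for this font, so nothing to save
        # Hypothetical naming scheme: the xref keeps file names unique.
        path = Path(out_dir) / f"{basename}-{xref}.{ext}"
        path.write_bytes(content)
        saved.append(path)
    return saved
```

The saved files can then be opened in a font viewer. Whether a viewer shows the Chinese glyphs depends on the tables the embedded (often subsetted) font carries, not on what the PDF page displays.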

I use the following method to obtain text information

# flags_decomposer() must be defined beforehand, e.g. the helper from the
# PyMuPDF recipe on analyzing font characteristics
blocks = page.get_text("dict", flags=11)["blocks"]
for b in blocks:  # iterate through the text blocks
    for l in b["lines"]:  # iterate through the text lines
        for s in l["spans"]:  # iterate through the text spans
            print("")  # blank line between spans
            font_properties = "Font: '%s' (%s), size %g, color #%06x" % (
                s["font"],  # font name
                flags_decomposer(s["flags"]),  # readable font flags
                s["size"],  # font size
                s["color"],  # font color
            )
            print("Text: '%s'" % s["text"])  # text of the span
            print(f"box: {s['bbox']}")  # bounding box of the span
            print(font_properties)

out:

I can't find the corresponding text in the extracted font (font type: CID).

The text "重要提示" ("important notice") is not in the font file.

Did I do something wrong?

JorjMcKie · 2023-07-05T15:02:25Z

Maintainer

This is not a bug, but a "Discussions" item. Let me transfer it first.

2 replies

ousia Jul 16, 2023

Do you really think this is an announcement? I think Q&A would make more sense here (I am writing this after having mischaracterized my own previous discussion as an announcement).

JorjMcKie Aug 4, 2023
Maintainer

Of course you are right!

JorjMcKie · 2023-07-05T15:22:10Z

Maintainer

A font may or may not provide the information needed to translate a visible glyph on a page back to the Unicode character that was used to generate it. Providing this back-translation information is voluntary.

This information may be missing completely: then no text extraction is possible at all and you will only see the invalid Unicode symbol U+FFFD, '�' (white question mark on a black diamond-shaped background).
This information may be incomplete: some characters have been left out and therefore appear as �.
This information may be given, but wrong on purpose: this can be used to prevent text extraction. Because no � appears, a program may not realize there is an extraction problem (although it is extracting garbage) and will not invoke OCR, as it otherwise would.
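The first two cases can be spotted by counting U+FFFD in the extracted text. The sketch below is only a heuristic: the function names and the 10% threshold are assumptions chosen for illustration, and it cannot detect the third case, where the mapping is deliberately wrong and no � appears.

```python
REPLACEMENT_CHAR = "\ufffd"  # what extraction yields when no Unicode is known


def replacement_ratio(text: str) -> float:
    """Return the share of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch == REPLACEMENT_CHAR for ch in visible) / len(visible)


def needs_ocr(text: str, threshold: float = 0.1) -> bool:
    """Guess whether extracted text is too damaged to use as-is.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return replacement_ratio(text) > threshold
```

A caller could pass the result of `page.get_text()` to `needs_ocr()` and fall back to OCR only for pages where it returns `True`.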

0 replies

CodePythonFollow · 2023-07-05T15:29:45Z

Author

I have encountered the '�' character you mentioned. In the case above, however, the Chinese text is displayed normally, but in the font file I can only see the digits and not the Chinese characters.
Are you talking about incomplete font files?

The � character looks like this:

1 reply

JorjMcKie Jul 6, 2023
Maintainer

Are you talking about incomplete font files?

Yes, exactly. It could be that the font in question only has /ToUnicode entries for digits and maybe always returns the space character code for all other glyphs.

Regarding "incomplete" fonts: the /ToUnicode information is optional for a font. No rule prescribes that a font must provide this information at all. Please recall that PDF was originally created to display information for human readers, not as a data store. Things like text extraction, image extraction, etc. came later and, as I wrote, are not necessarily reliable.
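Whether a font object carries a /ToUnicode entry at all can be checked from its PDF dictionary. The following is a minimal sketch, assuming `doc` and `page` are a PyMuPDF Document and Page and that `xref_get_key()` returns `("null", "null")` for an absent key. The helper name `fonts_without_tounicode` is hypothetical.

```python
def fonts_without_tounicode(doc, page):
    """List (xref, basefont, type) for fonts on `page` lacking /ToUnicode."""
    missing = []
    for font in page.get_fonts(full=True):
        # Each entry starts with (xref, ext, type, basefont, ...).
        xref, _ext, font_type, basefont = font[:4]
        key_type, _value = doc.xref_get_key(xref, "ToUnicode")
        if key_type == "null":  # the key is not in the font dictionary
            missing.append((xref, basefont, font_type))
    return missing
```

This only tests for presence. A font that does have /ToUnicode may still map just a few glyphs, as in the digits-only case described above, and a font without it may still extract correctly when it uses a standard encoding.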

Answer selected by JorjMcKie

CodePythonFollow · 2023-07-06T14:32:49Z

Author

Thank you very much.

0 replies
