font problem #1638

hellohr11 · 2022-03-10T21:09:28Z

Thank you for working on pymupdf. It is very useful!

I am working on extracting text information from a PDF. I use page.get_text('dict')['blocks'] to get the bbox and font information. However, the text I extract looks like '��'. The font names are things like Generic1-Regular or Generic7-Regular. I tried fitz.Font('Generic1-Regular') followed by font.unicode_to_glyph_name(), but that doesn't work. Do you have any idea why this happens and how I could solve it?

Thanks!
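
For readers following along, the block/line/span layout returned by page.get_text('dict') can be walked as in the sketch below to see which fonts produce undecodable text. This is not part of the original post: the helper name fonts_with_undecodable_text is hypothetical, the sample dictionary is hand-built so the code runs without a PDF, and it assumes failed characters come back as U+FFFD, as in the output quoted above.

```python
from collections import defaultdict

REPLACEMENT_CHAR = "\ufffd"  # the character shown as '�' in the extracted text


def fonts_with_undecodable_text(page_dict):
    """Map font name -> list of (bbox, text) for spans containing U+FFFD.

    `page_dict` is expected to have the shape of page.get_text("dict").
    """
    hits = defaultdict(list)
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                if REPLACEMENT_CHAR in span.get("text", ""):
                    hits[span["font"]].append((span["bbox"], span["text"]))
    return dict(hits)


# Hand-built sample (an assumption for illustration, not real PDF output).
sample = {
    "blocks": [
        {"type": 1, "bbox": (0, 0, 50, 50)},  # image block, no text
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"font": "Generic1-Regular", "bbox": (10, 10, 40, 22),
                         "text": "\ufffd\ufffd"},
                        {"font": "Helvetica", "bbox": (10, 30, 40, 42),
                         "text": "ok"},
                    ]
                }
            ],
        },
    ]
}

result = fonts_with_undecodable_text(sample)
print(list(result))
```

With a real document, the argument would be page.get_text("dict") for each page; grouping by font makes it easy to see whether only the GenericN-Regular fonts are affected.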

Answered by JorjMcKie

JorjMcKie · 2022-03-10T21:15:56Z

Maintainer

MuPDF does not support all fonts; take that as a preliminary remark.
Then there are fonts designed to resist text extraction.
And there are cases where text exists only as part of an image, or appears only as elementary drawing operations (for example, a capital "D" drawn as a "|" followed by a left-open semicircle).
In any of these cases you have to fall back to OCRing the page.
Please have a look at the example scripts in the Utilities repository.
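
The fallback described above could be sketched roughly as follows. This is only an illustration and not one of the scripts from the Utilities repository: the helper names needs_ocr and extract_text_with_ocr_fallback and the 10% threshold are assumptions, and Page.get_textpage_ocr needs a Tesseract installation that PyMuPDF can find.

```python
REPLACEMENT_CHAR = "\ufffd"  # what undecodable characters are extracted as


def needs_ocr(text, max_bad_ratio=0.1):
    """Return True if the text is empty or largely undecodable.

    The 10% default threshold is an arbitrary assumption for this sketch.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing extractable: text may be images or drawings
    bad = sum(ch == REPLACEMENT_CHAR for ch in visible)
    return bad / len(visible) > max_bad_ratio


def extract_text_with_ocr_fallback(pdf_path, dpi=300, language="eng"):
    """Yield (page_number, text), OCRing pages whose text looks unusable."""
    import fitz  # PyMuPDF; imported lazily so needs_ocr() works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text("text")
            if needs_ocr(text):
                # full=True OCRs the whole rendered page, not only its images.
                textpage = page.get_textpage_ocr(
                    dpi=dpi, language=language, full=True
                )
                text = page.get_text("text", textpage=textpage)
            yield page.number, text
```

A call such as `for number, text in extract_text_with_ocr_fallback("input.pdf"): ...` (with a placeholder file name) would then return normally extracted text where it is usable and OCR output elsewhere. Checking per page keeps the slower OCR step limited to pages that need it.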

2 replies

hellohr11 Mar 10, 2022
Author

Thank you for your answer. I was able to extract the CID files from the PDF. Would these be helpful for decoding?

JorjMcKie Mar 11, 2022
Maintainer

No, missing font support will not change ...
