font problem #1638
Hi @JorjMcKie, thank you for working on PyMuPDF. It is very useful! I am now working on extracting text information from a PDF. I use page.get_text('dict')['blocks'] to get the bbox and font information. However, the text I extract looks like '���������������'. The font names are things like Generic1-Regular or Generic7-Regular. I tried fitz.Font('Generic1-Regular') followed by font.unicode_to_glyph_name(), but that doesn't work. Do you have any idea why this happens and how I could solve it? Thanks!
Replies: 1 comment 2 replies
As a preliminary statement: MuPDF does not support all fonts.
And then there are fonts designed to be immunized against text extraction.
And then there are cases where text exists only as part of an image, or where text appears only as elementary drawing operations (like a capital "D" being drawn as a "|" followed by a left-open semi-circle, etc.).
In any of these cases you have to fall back to OCRing the page.
Please have a look at the example scripts in the Utilities repository.
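
For illustration, here is a minimal sketch of that fallback (not the author's script, and not taken from the Utilities repository). It counts how many extracted characters are the Unicode replacement character U+FFFD (the '�' shown in the question) and OCRs the page when too many are, or when the page yields no text at all. The helper names and the 0.3 threshold are hypothetical choices. The OCR step assumes a PyMuPDF version that provides Page.get_textpage_ocr() and a working Tesseract installation.

```python
REPLACEMENT_CHAR = "\ufffd"  # the character MuPDF emits when it cannot map a glyph to Unicode


def count_chars(page_dict):
    """Return (bad, total) over all non-whitespace characters in a
    dictionary shaped like the result of page.get_text("dict")."""
    bad = total = 0
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                for ch in span.get("text", ""):
                    if ch.isspace():
                        continue
                    total += 1
                    if ch == REPLACEMENT_CHAR:
                        bad += 1
    return bad, total


def garbled_ratio(page_dict):
    """Fraction of non-whitespace characters that are U+FFFD (0.0 if no text)."""
    bad, total = count_chars(page_dict)
    return bad / total if total else 0.0


def needs_ocr(page_dict, threshold=0.3):
    """Hypothetical heuristic: OCR if the page has no extractable text
    (text may live in images or drawings) or if too much of it is garbled."""
    bad, total = count_chars(page_dict)
    return total == 0 or bad / total >= threshold


def extract_text_with_ocr_fallback(pdf_path, page_number=0, threshold=0.3):
    """Return page text, using Tesseract OCR when normal extraction fails."""
    import fitz  # PyMuPDF, imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    if not needs_ocr(page.get_text("dict"), threshold):
        return page.get_text("text")
    # full=True OCRs the whole page instead of only its images.
    textpage = page.get_textpage_ocr(language="eng", dpi=300, full=True)
    return page.get_text("text", textpage=textpage)
```

Calling extract_text_with_ocr_fallback("input.pdf") would then return either the normally extracted text or the OCR result for the first page ("input.pdf" is a placeholder file name). OCR output will not contain the original font names, so the font information from the 'dict' output is lost for such pages.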