Font name encoding issues and recurrence in page.get_fonts(). #1934

yufc2002 · 2022-09-21T07:52:05Z

Describe the bug

I want to match the fonts of the text spans returned by page.get_texttrace() with the fonts returned by page.get_fonts().

This is the sample PDF
6.pdf

import fitz
from pprint import pprint

fitz.TOOLS.set_subset_fontnames(on=True)

pdf_file = '6.pdf'
doc = fitz.open(pdf_file)

text_fonts = []        # distinct font names seen in the text trace
embedded_fonts = []    # one get_fonts() tuple per distinct xref
xref_visited = set()

for page in doc:
    for font in page.get_fonts():
        # each tuple is (xref, ext, type, basefont, name, encoding)
        xref = font[0]
        if xref in xref_visited:
            continue
        embedded_fonts.append(font)
        xref_visited.add(xref)

    for text in page.get_texttrace():
        font = text['font']
        if font not in text_fonts:
            text_fonts.append(font)

pprint(embedded_fonts)
pprint(text_fonts)

The embedded fonts are:

[(20, 'ttf', 'Type0', 'TNXQRC+SimSun', 'C2_0', 'Identity-H'),
 (21, 'n/a', 'TrueType', 'TimesNewRomanPSMT', 'TT0', 'WinAnsiEncoding'),
 (22, 'n/a', 'TrueType', 'TimesNewRomanPS-ItalicMT', 'TT1', 'WinAnsiEncoding'),
 (23, 'n/a', 'TrueType', 'TimesNewRomanPS-BoldMT', 'TT2', 'WinAnsiEncoding'),
 (24, 'ttf', 'TrueType', 'YOZFJS+Calibri', 'TT3', 'WinAnsiEncoding'),
 (57, 'n/a', 'Type1', 'Times-Roman', 'TP1', 'WinAnsiEncoding'),
 (58, 'n/a', 'Type0', 'STSong-Light', 'TP2', 'GBK-EUC-H'),
 (59, 'ttf', 'Type0', 'BCDIEE+Cambria Math', 'F9', 'Identity-H'),
 (60, 'n/a', 'TrueType', 'Times New Roman', 'F8', 'WinAnsiEncoding'),
 (61, 'ttf', 'Type0', 'Times New Roman', 'F7', 'Identity-H'),
 (62, 'ttf', 'Type0', 'BCDHEE+å®\x8bä½\x93', 'F6', 'Identity-H'),
 (63, 'ttf', 'TrueType', 'BCDGEE+æ¥·ä½\x93', 'F5', 'WinAnsiEncoding'),
 (64, 'ttf', 'Type0', 'BCDFEE+æ¥·ä½\x93', 'F4', 'Identity-H'),
 (65, 'n/a', 'TrueType', 'Times New Roman,Bold', 'F3', 'WinAnsiEncoding'),
 (66, 'ttf', 'Type0', 'Times New Roman,Bold', 'F2', 'Identity-H'),
 (67, 'ttf', 'Type0', 'BCDEEE+é»\x91ä½\x93', 'F1', 'Identity-H'),
 (77, 'ttf', 'TrueType', 'BCDKEE+å®\x8bä½\x93', 'F11', 'WinAnsiEncoding'),
 (78, 'ttf', 'Type0', 'BCDJEE+Wingdings', 'F10', 'Identity-H'),
 (79, 'ttf', 'Type0', 'Times New Roman,Italic', 'F12', 'Identity-H'),
 (85, 'n/a', 'TrueType', 'Times New Roman,Italic', 'F13', 'WinAnsiEncoding'),
 (130, 'ttf', 'TrueType', 'BCDLEE+Arabic Transparent', 'F14', 'WinAnsiEncoding'),
 (107, 'n/a', 'TrueType', 'Arial', 'F16', 'WinAnsiEncoding'),
 (108, 'ttf', 'TrueType', 'BCDMEE+å¾®è½¯é\x9b\x85é»\x91', 'F15', 'WinAnsiEncoding'),
 (117, 'ttf', 'Type0', 'BCDNEE+ä»¿å®\x8b', 'F17', 'Identity-H')]

The fonts in page.get_texttrace() are:

['TNXQRC+SimSun',
 'TimesNewRomanPSMT',
 'TimesNewRomanPS-ItalicMT',
 'TimesNewRomanPS-BoldMT',
 'YOZFJS+Calibri',
 'BCDEEE+黑体',
 'Times New Roman,Bold',
 'BCDFEE+楷体',
 'BCDGEE+楷体',
 'BCDHEE+宋体',
 'Times New Roman',
 'BCDIEE+Cambria Math',
 'STSong-Light',
 'Times-Roman',
 'BCDJEE+Wingdings',
 'BCDKEE+宋体',
 'Times New Roman,Italic',
 'BCDLEE+Arabic Transparent',
 'BCDMEE+微软雅黑',
 'Arial',
 'BCDNEE+仿宋']

The Chinese font names in page.get_fonts() have an encoding issue.
Also, Times New Roman occurs several times, notably as xrefs 60 and 61. I am aware that these are different fonts, but how should I match them with the text fonts named 'Times New Roman'?

Is it possible to access the font xref of the text in page.get_texttrace()?

Thank you!

My configuration (mandatory)

Python 3.8.8
PyMuPDF 1.20.0: Python bindings for the MuPDF 1.20.1 library.

JorjMcKie (Maintainer) · 2022-09-21T11:24:14Z

> Is it possible to access the font xref of the text in page.get_texttrace()?

No, this is a low-level, high-speed method for easy access to single characters and their glyphs. It must not be overloaded with access to other information. Also remember that all text extractions and searches work for all document types, not only for PDFs. So PDF xrefs would either make no sense within these methods at all, or would require code that slows things down significantly, just for corner-case purposes.

> The Chinese font names in page.get_fonts() have an encoding issue.

No, the font names in get_fonts() are taken directly from the PDF object definition, where each non-Latin character is encoded in the PDF manner (please see the PDF documentation).
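
To illustrate the point above: the garbled names in the question look like UTF-8 bytes that were turned into one character per byte (that is, read as Latin-1). This is an inference from the sample output, not documented PyMuPDF behavior. Under that assumption, a minimal sketch of a repair step could look like the following. `repair_font_name` is a hypothetical helper, not part of PyMuPDF.

```python
def repair_font_name(name: str) -> str:
    """Undo 'UTF-8 bytes read as Latin-1' garbling.

    Assumption: every character in a garbled name stands for one raw byte.
    If the name cannot be converted that way, it is returned unchanged.
    """
    try:
        return name.encode("latin-1").decode("utf-8")
    except UnicodeError:  # covers both encode and decode failures
        return name


# A name copied from the get_fonts() output in the question.
print(repair_font_name("BCDHEE+å®\x8bä½\x93"))   # BCDHEE+宋体
# Plain ASCII names pass through unchanged.
print(repair_font_name("Times New Roman,Bold"))  # Times New Roman,Bold
```

After this step, the repaired base font names can be compared directly with the `font` values from page.get_texttrace().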

> Also, Times New Roman occurs several times, notably as xrefs 60 and 61. I am aware that these are different fonts, but how should I match them with the text fonts named 'Times New Roman'?

You can request the full font name to be returned by the text extraction functions via fitz.TOOLS.set_subset_fontnames(True). Subsetted font names will then be shown in full, including the subset identifier prefix "ABCDEF+".
This may also help you match the xref, because it is highly improbable that different fonts will have the same subsetting prefix.
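
The matching idea can be sketched as a lookup from base font name to xref. The sample tuples are copied from the question's output. `xrefs_by_basefont` and `candidate_xrefs` are hypothetical helper names, and the sketch assumes the names on both sides are already comparable. A name with a subset prefix usually resolves to one xref, while an unsubsetted name such as 'Times New Roman' can remain ambiguous.

```python
from collections import defaultdict

# A few tuples from the page.get_fonts() output in the question:
# (xref, ext, type, basefont, name, encoding)
sample_fonts = [
    (24, "ttf", "TrueType", "YOZFJS+Calibri", "TT3", "WinAnsiEncoding"),
    (60, "n/a", "TrueType", "Times New Roman", "F8", "WinAnsiEncoding"),
    (61, "ttf", "Type0", "Times New Roman", "F7", "Identity-H"),
    (78, "ttf", "Type0", "BCDJEE+Wingdings", "F10", "Identity-H"),
]


def xrefs_by_basefont(fonts):
    """Map each base font name to the list of xrefs that carry it."""
    index = defaultdict(list)
    for xref, _ext, _type, basefont, _name, _encoding in fonts:
        if xref not in index[basefont]:
            index[basefont].append(xref)
    return dict(index)


def candidate_xrefs(font_name, index):
    """Return all xrefs whose base font equals font_name (possibly none)."""
    return index.get(font_name, [])


index = xrefs_by_basefont(sample_fonts)
for name in ["YOZFJS+Calibri", "Times New Roman", "Arial"]:
    print(name, candidate_xrefs(name, index))
# YOZFJS+Calibri [24]
# Times New Roman [60, 61]
# Arial []
```

Where more than one xref comes back, the name alone cannot decide between them, so some other criterion would be needed.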

yufc2002 Sep 22, 2022
Author

Thanks for your reply.
