Is there a way to recognize equation in pdf? #763

HeroadZ · 2020-12-11T07:54:43Z

HeroadZ
Dec 11, 2020

The equation has a special representation in pdf. Is it possible to ignore it when extracting text since it is meaningless?

I have checked the format of several equations.

from pprint import pprint

blocks = page.getText("dict", flags=0)["blocks"]  # page: a page of the opened PDF
pprint(blocks)

It seems that the font of some characters in an equation starts with "CMMI", and flags is 6, while for plain text flags is 4. Is there any explanation for flags and font?

Answered by JorjMcKie

JorjMcKie · 2020-12-11T08:27:09Z

JorjMcKie
Dec 11, 2020
Maintainer

Is there any explanation for flags and font?

Look in the documentation here for an explanation of flags.
Concerning fonts: the span dictionary just contains the fontname. If the font is an embedded subset, then (in PDF) it starts with 6 arbitrary upper case letters followed by "+", followed by the fontname from which the subset was created, e.g. VIBWDS+NimbusSanL-Bold.
Because VIBWDS is totally meaningless and only serves as a unique identifier for some technical reason, it is omitted and only NimbusSanL-Bold is contained in the dict.
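
The subset prefix described above follows a fixed pattern, so it is easy to strip yourself if you ever get font names from a source that keeps it. A minimal sketch in plain Python; `strip_subset_prefix` is a hypothetical helper name, not part of PyMuPDF:

```python
import re

# Six upper-case ASCII letters followed by "+" mark an embedded font subset.
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def strip_subset_prefix(fontname: str) -> str:
    """Return the font name without a leading subset tag, if one is present."""
    return _SUBSET_PREFIX.sub("", fontname)


print(strip_subset_prefix("VIBWDS+NimbusSanL-Bold"))  # NimbusSanL-Bold
print(strip_subset_prefix("CMMI10"))                  # CMMI10 (no prefix, unchanged)
```

Names that do not match the pattern are returned unchanged, so the helper is safe to apply to every span.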

0 replies

JorjMcKie · 2020-12-11T08:33:02Z

JorjMcKie
Dec 11, 2020
Maintainer

I am not sure where "CMMI" came from in your question. Maybe purely arbitrary in the example you have been looking at.
To be more explicit about flags: 6 = 0b110, which corresponds to "italic, serifed".
Not even monospaced. So I would say, it is not very typical ...
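
To make the bit arithmetic concrete, here is a small sketch that turns a span's flags integer into readable words. It assumes the bit assignments given in the PyMuPDF documentation for span flags (bit 0 superscript, bit 1 italic, bit 2 serifed, bit 3 monospaced, bit 4 bold); `decode_flags` is a hypothetical name chosen for this example:

```python
def decode_flags(flags: int) -> str:
    """Describe the font properties encoded in a span's "flags" value."""
    parts = []
    if flags & 1:   # bit 0
        parts.append("superscript")
    if flags & 2:   # bit 1
        parts.append("italic")
    # bit 2: serifed, otherwise sans-serifed
    parts.append("serifed" if flags & 4 else "sans")
    # bit 3: monospaced, otherwise proportional
    parts.append("monospaced" if flags & 8 else "proportional")
    if flags & 16:  # bit 4
        parts.append("bold")
    return ", ".join(parts)


print(decode_flags(6))  # italic, serifed, proportional
print(decode_flags(4))  # serifed, proportional
```

So the only difference between the two values in the question is the italic bit.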

2 replies

HeroadZ Dec 11, 2020
Author

If I understand you correctly, you mean that there is no standard format for equations in PDF, right? PDF just stores equations as normal characters in some common style like "italic", without using a special font like "Cambria Math" in MS Word. In that case, I have to recognize them in a new way.

JorjMcKie Dec 11, 2020
Maintainer

Yes, exactly!
In PDF, text is just text. The PDF specification contains nothing to distinguish different kinds of text. Equations are also text and can be coded in any font: italic or normal, mono-spaced or proportional, serifed or sans-serifed.
Also note that the equation symbol appears in program code listings a lot - PyMuPDF.pdf is full of such examples.

So I would say that you have to develop your own way of recognizing equations ... and whatever you develop may not work with the next PDF example.

JorjMcKie · 2020-12-11T08:42:17Z

JorjMcKie
Dec 11, 2020
Maintainer

The equation has a special representation in pdf.

I am not at all sure that this is true!

Is it possible to ignore it when extracting text since it is meaningless?

That would be your responsibility as a programmer. PyMuPDF is not in a position to judge the meaning of extracted text. This obviously is a matter of context - PyMuPDF is not in the semantics business.

1 reply

HeroadZ Dec 11, 2020
Author

I see. Thank you very much. I think I need to find a new way to solve this problem :D

MasterYip · 2023-07-19T08:24:38Z

MasterYip
Jul 19, 2023

I've found something useful by asking GPT here. Hope this is useful.
For example:
{
    "size": 9.962599754333496,
    "flags": 6,
    "font": "CMMI10",
    "color": 0,
    "ascender": 0.75,
    "descender": -0.25,
    "text": " n",
    "origin": [423.5783996582031, 342.6888122558594],
    "bbox": [
        423.5783996582031,
        335.21685791015625,
        434.1545715332031,
        345.1794738769531
    ]
}
The font name "CMMI10" refers to a specific typeface in the Computer Modern font family. This font family was created by Donald Knuth and is commonly used in typesetting documents, especially with the LaTeX typesetting system.
In this context, the font name "CMMI10" indicates that the text "n" is rendered in Computer Modern Math Italic at a design size of 10 points. It is a math italic font, designed specifically for mathematical symbols and expressions.
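
Building on that observation, one possible heuristic is to drop every span whose font name starts with a Computer Modern math prefix. This is only a sketch under a strong assumption: the PDF was produced by LaTeX with its default math fonts (CMMI math italic, CMSY math symbols, CMEX math extension). As noted earlier in the thread, nothing in PDF guarantees this, so it may fail on the next document. The names `MATH_FONT_PREFIXES`, `is_math_span` and `text_without_math` are hypothetical, and the sample data is hand-built in the shape of `page.getText("dict")["blocks"]`:

```python
# Assumption: equations are set in Computer Modern math fonts (LaTeX default).
MATH_FONT_PREFIXES = ("CMMI", "CMSY", "CMEX")


def is_math_span(span: dict) -> bool:
    """Guess whether a span belongs to an equation from its font name alone."""
    return span["font"].startswith(MATH_FONT_PREFIXES)


def text_without_math(blocks: list) -> str:
    """Join the text of all spans that do not look like math, line by line."""
    lines_out = []
    for block in blocks:
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            kept = [s["text"] for s in line["spans"] if not is_math_span(s)]
            if kept:
                lines_out.append("".join(kept))
    return "\n".join(lines_out)


# Hand-built sample; real input would come from page.getText("dict")["blocks"].
sample_blocks = [
    {
        "type": 0,
        "lines": [
            {
                "spans": [
                    {"font": "CMR10", "flags": 4, "text": "for every"},
                    {"font": "CMMI10", "flags": 6, "text": " n"},
                    {"font": "CMR10", "flags": 4, "text": " we have"},
                ]
            }
        ],
    }
]

print(text_without_math(sample_blocks))  # for every we have
```

Filtering on the font name rather than on `flags` avoids discarding ordinary italic text, which also has the italic bit set.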

0 replies
