Is there a way to recognize equation in pdf? #763
-
The equation has a special representation in pdf. Is it possible to ignore it when extracting text since it is meaningless? I have checked the format of several equations:

    blocks = page.getText("dict", flags=0)["blocks"]
    pprint(blocks)

It seems that the font of some characters in an equation starts with "CMMI", and flags is 6, while for plain text it is 4. Is there any explanation for flags and font?
Replies: 4 comments 3 replies
-
-
I am not sure where "CMMI" came from in your question. Maybe it is purely arbitrary in the example you have been looking at.
-
I am not at all sure that this is true!
That would be your responsibility as a programmer. PyMuPDF cannot take a position on the meaning of extracted text. This obviously is a matter of context - PyMuPDF is not in the semantics business.
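As a rough illustration of what such context-dependent filtering could look like on the caller's side, here is a minimal sketch. It assumes the "dict" output structure (blocks, then lines, then spans) and assumes that math spans can be spotted by a font name prefix such as "CMMI". The prefix list and the function name `text_without_math_spans` are hypothetical choices for this example, not anything PyMuPDF provides, and the heuristic will miss equations set in other fonts.

```python
# Assumption: these Computer Modern math font prefixes mark equation spans.
# This is a document-specific heuristic, not a PyMuPDF feature.
MATH_FONT_PREFIXES = ("CMMI", "CMSY", "CMEX")


def text_without_math_spans(blocks, math_prefixes=MATH_FONT_PREFIXES):
    """Join span texts line by line, skipping spans in a 'math' font."""
    lines_out = []
    for block in blocks:
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            kept = [
                span["text"]
                for span in line["spans"]
                if not span["font"].startswith(math_prefixes)
            ]
            if kept:
                lines_out.append("".join(kept))
    return "\n".join(lines_out)


# Hand-made sample that mimics the shape of page.getText("dict")["blocks"].
sample_blocks = [
    {
        "type": 0,
        "lines": [
            {
                "spans": [
                    {"text": "The energy is ", "font": "CMR10", "flags": 4},
                    {"text": "E", "font": "CMMI10", "flags": 6},
                    {"text": " in joules.", "font": "CMR10", "flags": 4},
                ]
            }
        ],
    },
    {"type": 1},  # image block without text lines
]

print(text_without_math_spans(sample_blocks))
```

Whether dropping such spans is the right thing to do depends entirely on the document, which is why this decision stays with the caller.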
-
I've found something useful by asking GPT here. Hope this is useful.
Look in the documentation here for an explanation of `flags`.

Concerning fonts: the span dictionary just contains the font name. If the font is an embedded subset, then (in PDF) its name starts with 6 arbitrary upper-case letters followed by "+", followed by the name of the font from which the subset was created, e.g. `VIBWDS+NimbusSanL-Bold`. Because `VIBWDS` is totally meaningless and only serves as a unique identifier for some technical reason, it is omitted and only `NimbusSanL-Bold` is contained in the dict.
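To make the two points above concrete, here is a minimal sketch in plain Python. It assumes the bit layout of the span `flags` value as I understand it from the PyMuPDF documentation (bit 0 superscript, bit 1 italic, bit 2 serifed, bit 3 monospaced, bit 4 bold), so please verify it against the documentation for your version. The helper names `decode_flags` and `strip_subset_prefix` are made up for this example and are not part of PyMuPDF.

```python
import re

# Assumed meaning of each bit in a span's "flags" value.
FLAG_BITS = {
    0: "superscript",
    1: "italic",
    2: "serifed",
    3: "monospaced",
    4: "bold",
}


def decode_flags(flags):
    """Return the names of all font properties whose bit is set."""
    return [name for bit, name in FLAG_BITS.items() if flags & (1 << bit)]


# A subset tag is six upper-case letters followed by "+".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def strip_subset_prefix(pdf_fontname):
    """Drop the subset tag, mirroring what the span dictionary reports."""
    return SUBSET_PREFIX.sub("", pdf_fontname)


print(decode_flags(4))
print(decode_flags(6))
print(strip_subset_prefix("VIBWDS+NimbusSanL-Bold"))
```

Under that assumed bit layout, the values from the question decode as serifed (4) and italic plus serifed (6), so the difference only says that the equation characters are set in an italic font.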