In PDF, things like headers, footers, section headlines, or tables are terra incognita: the format itself does not know about them. While specifications exist for expressing such higher-order information, they are often not used.

So you will mostly just have plain text, and it is left to your wits to find those structures. There is no library that can do this for you.
You have to look at things like text position, font properties (bold, italic), font size, text color, etc.
In your case it looks like you could successfully check for bold and all-caps text. Finding such properties is no problem with PyMuPDF.
A simple iteration like this already brings you close:

for page in doc:
    for block in page.get_text("dict",flags=fitz.TEX…
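
The snippet above is cut off in this view, so the following is only a rough sketch of the bold and all-caps check described in the answer, not the original code. It assumes the usual layout of the dictionary returned by `page.get_text("dict")` (blocks, then lines, then spans) and that bit 4 of a span's `flags` marks bold text. The names `heading_candidates` and `BOLD` are made up for illustration.

```python
BOLD = 1 << 4  # assumption: bit 4 of a span's "flags" marks bold text


def heading_candidates(page_dict):
    """Return (text, font size) for every bold, all-caps span.

    page_dict is expected to look like the result of
    page.get_text("dict"): blocks -> lines -> spans.
    """
    found = []
    for block in page_dict.get("blocks", []):
        # image blocks carry no "lines" key, so they are skipped here
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                text = span["text"].strip()
                is_bold = bool(span["flags"] & BOLD)
                # str.isupper() is False when the text has no cased letters,
                # so pure numbers or punctuation are not reported
                if text and is_bold and text.isupper():
                    found.append((text, span["size"]))
    return found
```

With PyMuPDF this would be called once per page, e.g. `heading_candidates(page.get_text("dict"))` inside the `for page in doc:` loop. Whether bold plus all-caps is enough depends on the document; font size or text position may be needed as additional criteria.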

Answer selected by ytiam