-
Hello, I need to write a Python script that loops through a bunch of PDFs, each containing one or more articles, and grabs every article's title and publication date to build a table of contents. The fonts can differ from file to file, but I notice the font sizes are the same in each one. Can PyMuPDF do this? Thank you.
-
Glad it works for you.
Because the result of |
Yes, this can be done:
Extract the text with sufficient information: page.get_text("dict", ...). The result is a dictionary of nested dictionaries, described in the PyMuPDF documentation. The dicts at the lowest hierarchy level (the spans) contain the font size, the text itself and some other font attributes.
If you know the desired font size, take the text, the page number and text position and store this information in a list.
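A minimal sketch of this collection step, assuming the known size is passed in as a number. The function names (collect_headings, scan_pdf), the tolerance parameter and the shape of the stored items are made up for illustration; only page.get_text("dict") and the "blocks" / "lines" / "spans" keys come from PyMuPDF itself.

```python
def collect_headings(page_dict, page_number, size, tol=0.1):
    """Return the spans of one page whose font size is within `tol` of `size`.

    `page_dict` is the result of page.get_text("dict").
    """
    hits = []
    for block in page_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                text = span["text"].strip()
                if text and abs(span["size"] - size) <= tol:
                    hits.append({
                        "text": text,
                        "page": page_number,          # 0-based page number
                        "bbox": tuple(span["bbox"]),  # position on the page
                    })
    return hits


def scan_pdf(path, size, tol=0.1):
    """Collect matching spans from every page of one PDF (hypothetical helper)."""
    try:
        import pymupdf            # module name in recent PyMuPDF versions
    except ImportError:
        import fitz as pymupdf    # module name in older versions
    entries = []
    with pymupdf.open(path) as doc:
        for page in doc:
            found = collect_headings(page.get_text("dict"), page.number, size, tol)
            for item in found:
                item["file"] = str(path)
            entries.extend(found)
    return entries
```

Titles and dates with different sizes could be gathered by calling scan_pdf once per size. A title that is split over several spans or lines will show up as several items and would need to be joined afterwards.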
When done iterating over the applicable PDFs, you can make a new PDF with a page onto which you write one text line for each of the previously created list items.
Each of the written lines can be overlaid with a hyperlink pointing to the respective PDF + page from where the information was previously extra…
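The last two steps could look roughly like the sketch below. It assumes the collected items are dicts with the keys "text", "file" and "page" (0-based); that shape, the helper names place_lines and build_toc_pdf, the A4 page size and the margins are all assumptions for illustration. The links use PyMuPDF's LINK_GOTOR kind, which points to a page in another PDF file.

```python
from pathlib import Path


def place_lines(count, page_height=842, margin=72, line_height=16):
    """Return (toc_page_index, baseline_y) for each of `count` text lines."""
    per_page = int((page_height - 2 * margin) // line_height)
    return [(i // per_page, margin + (i % per_page + 1) * line_height)
            for i in range(count)]


def build_toc_pdf(entries, out_path, fontsize=11):
    """Write one linked line per entry into a new PDF (hypothetical helper)."""
    try:
        import pymupdf            # module name in recent PyMuPDF versions
    except ImportError:
        import fitz as pymupdf    # module name in older versions
    toc = pymupdf.open()          # new, empty PDF
    for entry, (page_index, y) in zip(entries, place_lines(len(entries))):
        # Add TOC pages on demand when the current one is full.
        while toc.page_count <= page_index:
            toc.new_page(width=595, height=842)
        page = toc[page_index]
        label = f"{entry['text']}  ({Path(entry['file']).name}, p. {entry['page'] + 1})"
        page.insert_text((72, y), label, fontsize=fontsize)
        # Cover the written line with a link rectangle of matching width.
        width = pymupdf.get_text_length(label, fontsize=fontsize)
        page.insert_link({
            "kind": pymupdf.LINK_GOTOR,          # go to a page in another file
            "from": pymupdf.Rect(72, y - fontsize, 72 + width, y + 3),
            "file": entry["file"],
            "page": entry["page"],
            "to": pymupdf.Point(0, 0),
        })
    toc.save(out_path)
```

Whether such a link opens the target file depends on the PDF viewer and on the stored file path still being valid, so relative paths next to the TOC file are probably the safer choice.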