Grab all strings of a given size? #2912

Shohreh · 2023-12-19T14:09:54Z

Shohreh
Dec 19, 2023

Hello,

I need to write a script in Python to loop through a bunch of PDFs, each containing one or more articles, and grab its title + date published to build a Table of Contents.

Fonts could be different each time, but I notice they are the same size in each.

Can PyMuPDF do this?

Thank you.

JorjMcKie · 2023-12-19T15:01:27Z

JorjMcKie
Dec 19, 2023
Maintainer

Yes, this can be done:
Extract the text with sufficient detail: page.get_text("dict", ...). The result is a dictionary of nested dictionaries, described in the PyMuPDF documentation.
The dictionaries at the lowest level of the hierarchy (the spans) contain the font size, the text itself and several other font attributes.
If you know the desired font size, take the text, the page number and the text position, and store this information in a list.
When you are done iterating over the applicable PDFs, you can make a new PDF and write one text line to a page for each item in that list.
Each written line can be overlaid with a hyperlink pointing to the PDF and page from which the information was extracted.
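The steps above could be sketched roughly as follows. This is only an illustration, not the maintainer's code: the function names, the 0.1 tolerance, the page margins and the output file name are assumptions made for the example. The PyMuPDF import is kept inside the functions that need it, so the filtering helper can be tried on a plain dictionary without opening a PDF.

```python
def matching_spans(page_dict, wanted_size, tol=0.1):
    """Yield (text, bbox) for every span whose font size is close to wanted_size."""
    for block in page_dict.get("blocks", []):
        # image blocks have no "lines" key, so fall back to an empty list
        for line in block.get("lines", []):
            for span in line["spans"]:
                if abs(span["size"] - wanted_size) <= tol:
                    yield span["text"], span["bbox"]


def collect_entries(pdf_paths, wanted_size, tol=0.1):
    """Return a list of (text, path, page_number, bbox) across all given PDFs."""
    import fitz  # PyMuPDF

    entries = []
    for path in pdf_paths:
        with fitz.open(path) as doc:
            for page in doc:
                page_dict = page.get_text("dict")
                for text, bbox in matching_spans(page_dict, wanted_size, tol):
                    entries.append((text, path, page.number, bbox))
    return entries


def write_toc(entries, out_path="toc.pdf"):
    """Write one line per entry and overlay it with a link to the source page."""
    import fitz  # PyMuPDF

    if not entries:
        return  # nothing to write; a PDF without pages cannot be saved
    toc = fitz.open()  # new, empty PDF
    page = None
    y = 0
    for text, path, pno, _bbox in entries:
        # start a new page at the beginning and whenever the current one is full
        if page is None or y > page.rect.height - 72:
            page = toc.new_page()
            y = 72
        label = f"{text.strip()} ({path}, page {pno + 1})"
        page.insert_text((72, y), label, fontsize=11)
        # rectangle roughly covering the line just written
        link_rect = fitz.Rect(72, y - 11, page.rect.width - 72, y + 3)
        page.insert_link({
            "kind": fitz.LINK_GOTOR,  # link to a page in another PDF
            "from": link_rect,
            "file": path,
            "page": pno,
            "to": fitz.Point(0, 0),
        })
        y += 16
    toc.save(out_path)
```

A call might look like write_toc(collect_entries(["a.pdf", "b.pdf"], 19.5)), where the file names are placeholders. Whether a viewer follows links into other files depends on the viewer and on the relative paths stored in the link.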

Shohreh Dec 21, 2023
Author

Thanks much. This script works fine to grab the parts ("spans", apparently) I'm looking for, although I had to specify a size of 8.25 while LibreOffice displays 8.2.

import re
import fitz  # PyMuPDF

pattern = re.compile(r" 20\d{2}$")  # only keep spans that end with a year
title = ""
doc = fitz.open("blah.pdf")
page = doc[0]
# read page text as a dictionary, suppressing extra spaces in CJK fonts
blocks = page.get_text("dict", flags=11)["blocks"]  # blocks are the top hierarchy
for b in blocks:  # iterate through the text blocks
    for l in b["lines"]:  # iterate through the text lines
        for s in l["spans"]:  # iterate through the text spans
            if s["size"] == 19.5:
                # title: join into a single line
                title += s["text"] + " "
            if s["size"] == 8.25:
                # date: only keep if the text ends with a year such as " 2023"
                if pattern.search(s["text"]):
                    print("Date: %s" % s["text"])
print("Title:", title.strip())
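On the 8.25 versus 8.2 mismatch: the span "size" value is a float, so an exact == test only matches when you know the precise stored value. One plausible explanation (an assumption, not something confirmed in this thread) is that the viewer rounds the number it displays. A small helper like the one below, with a made-up name, can list the sizes actually present on a page so the right value can be picked.

```python
from collections import Counter


def size_histogram(page_dict):
    """Count characters per font size (rounded to 2 decimals) on one page."""
    counts = Counter()
    for block in page_dict.get("blocks", []):
        # image blocks have no "lines" key, so fall back to an empty list
        for line in block.get("lines", []):
            for span in line["spans"]:
                counts[round(span["size"], 2)] += len(span["text"])
    return counts


# Usage, assuming PyMuPDF is installed and 'blah.pdf' exists:
#   import fitz
#   page = fitz.open("blah.pdf")[0]
#   for size, n in sorted(size_histogram(page.get_text("dict")).items()):
#       print(size, n)
```

Comparing with a small tolerance, for example abs(s["size"] - 8.25) < 0.05, is usually more robust than == once the real size is known.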

Out of curiosity, what does this syntax do (the trailing ["blocks"])?
blocks = page.get_text("dict", flags=11)["blocks"]

JorjMcKie · 2023-12-21T07:08:31Z

JorjMcKie
Dec 21, 2023
Maintainer

Glad it works for you.

Out of curiosity, what does this syntax do (the trailing ["blocks"])?

Because the result of get_text("dict") looks like {"width": n, "height": m, "blocks": [...]}. Width and height are usually the dimensions of the page, but not necessarily so.
The dictionary actually describes the content of a TextPage. This structure abstracts from the document file type, i.e. it looks the same for all supported document types: EPUB, MOBI, XPS, PDF, etc.
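To make the trailing ["blocks"] concrete, here is a hand-written dictionary with the same overall shape. The values and the font name are invented for illustration, and a real result contains more keys per block, line and span.

```python
# A made-up stand-in for what page.get_text("dict") returns
page_dict = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {
            "type": 0,  # 0 marks a text block
            "lines": [
                {
                    "spans": [
                        {"size": 19.5, "font": "hypothetical-font", "text": "A title"},
                    ]
                }
            ],
        }
    ],
}

# The trailing ["blocks"] is an ordinary dictionary lookup on the returned value,
# i.e. the same as storing the result first and indexing it afterwards.
blocks = page_dict["blocks"]
first_span = blocks[0]["lines"][0]["spans"][0]
print(first_span["size"], first_span["text"])
```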

Shohreh Dec 21, 2023
Author

Thank you.
