Get text from pdf page excluding page number #3992

vignesh0710 · 2024-10-26T20:31:07Z

I'm trying to extract the text of a PDF page while excluding the page number in the bottom-right corner.

Code:

import fitz  # PyMuPDF, version 1.23.1

def get_clip(fitz_page, margin=25):
    """Return the page rectangle with a strip of `margin` points cut off the bottom."""
    page_rect = fitz_page.rect
    return fitz.Rect(page_rect.x0, page_rect.y0, page_rect.x1, page_rect.y1 - margin)

doc = fitz.open("f.pdf")
page = doc[1]  # second page (index is 0-based)
clip_rect = get_clip(page)
text = page.get_text(clip=clip_rect)  # only text inside clip_rect is returned

This works in some cases, but it often removes more text than just the page number.

Is there a better way to remove the page number when getting text from a page?

JorjMcKie (Maintainer) · 2024-10-28T12:31:27Z

No one can know where the PDF creator has decided to put headers, footers, etc., including page numbers. From the PDF's perspective, all of this is just text.
You have no option other than writing code that investigates the situation.
For example, extract the text via the "blocks" variant using sort=True. Then look at the string part of the last block tuple. If it contains something that "looks like" a page number, use the block's coordinates for removal / exclusion.
Something like this:

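blocks = page.get_text("blocks", sort=True)  # tuples: (x0, y0, x1, y1, text, block_no, block_type)
# item 4 is the text of the block; the "Page" test is only an example, adjust as needed
if blocks and "Page" in blocks[-1][4]:
    blocks = blocks[:-1]  # ignore the last block
text = "\n".join(b[4] for b in blocks)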
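
A slightly fuller sketch of the same idea, in case it helps. It works on plain tuples shaped like the output of page.get_text("blocks", sort=True), so it can be tried without a PDF at hand. The function name strip_page_number, the regular expression, and the footer_fraction parameter are all assumptions made for illustration and are not part of PyMuPDF. What counts as "looks like a page number" will differ from document to document, so treat the pattern as a starting point only.

```python
import re

# Hypothetical pattern, adjust to your documents. It accepts strings such as
# "3", "Page 3", "3 / 10", "Page 3 of 10" and "- 3 -".
PAGE_NUMBER_RE = re.compile(
    r"^\W*(?:page\s*)?\d{1,4}(?:\s*(?:/|of)\s*\d{1,4})?\W*$",
    re.IGNORECASE,
)

def strip_page_number(blocks, page_height, footer_fraction=0.1):
    """Drop a trailing text block that looks like a page number in the footer zone.

    blocks: tuples (x0, y0, x1, y1, text, block_no, block_type), as returned
            by page.get_text("blocks", sort=True).
    page_height: height of the page in points.
    footer_fraction: share of the page height treated as footer (an assumption).
    Returns (text, removed_bbox); removed_bbox is None if nothing was dropped.
    """
    text_blocks = [b for b in blocks if b[6] == 0]  # 0 = text block, 1 = image block
    removed_bbox = None
    if text_blocks:
        x0, y0, x1, y1, text = text_blocks[-1][:5]
        in_footer = y0 >= page_height * (1 - footer_fraction)
        if in_footer and PAGE_NUMBER_RE.match(text.strip()):
            removed_bbox = (x0, y0, x1, y1)  # coordinates of the excluded block
            text_blocks = text_blocks[:-1]
    return "\n".join(b[4] for b in text_blocks), removed_bbox
```

With a real page this would be called as strip_page_number(page.get_text("blocks", sort=True), page.rect.height). Checking both the position and the content of the last block is meant to avoid the problem with a fixed margin: body text that happens to sit low on the page is kept, and only a short, number-like block in the footer zone is dropped.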
