That PDF handles geometry changes sloppily in its page content. Cleaning each page right after opening the file, before doing anything else with the page, solves the problem:

import fitz  # PyMuPDF

pdf1 = fitz.open("8.pdf")
for page1 in pdf1:
    page1.clean_contents()  # <=== do this before anything else with the page
    shape = page1.new_shape()
    words = page1.get_text("words")  # each item starts with (x0, y0, x1, y1)
    for w in words:
        shape.draw_rect(w[:4])  # one rectangle per word bounding box
    # semi-transparent yellow fill for all rectangles drawn so far
    shape.finish(fill=(1, 1, 0), fill_opacity=0.3)
    shape.commit()  # write the drawings to the page
    print(f"Added {len(words)} rectangles on page {page1.number}.")
pdf1.save("8_out1.pdf", garbage=3, deflate=True)
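
One plausible reading of "sloppy handling of geometry changes" is that the page's content stream changes the coordinate system and never restores it, so anything appended afterwards is drawn in the shifted system. Whether that is exactly what is wrong with this particular file is an assumption. The toy model below is not PyMuPDF code: the operator names mirror PDF's `q`, `Q` and `cm`, while the `rect` operator and the helpers `concat`, `apply` and `run` are made up purely for illustration.

```python
# Toy model of a PDF content stream (NOT PyMuPDF): operators act on a
# current transformation matrix (CTM) with a save/restore stack.

IDENTITY = (1, 0, 0, 1, 0, 0)

def concat(m, n):
    """Matrix product m * n for PDF-style matrices (a, b, c, d, e, f)."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def apply(m, x, y):
    """Map the point (x, y) through matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

def run(stream):
    """Return where each hypothetical 'rect' corner ends up on the page."""
    ctm, stack, placed = IDENTITY, [], []
    for op, arg in stream:
        if op == "q":        # save graphics state
            stack.append(ctm)
        elif op == "Q":      # restore graphics state
            ctm = stack.pop()
        elif op == "cm":     # change the coordinate system
            ctm = concat(arg, ctm)
        elif op == "rect":   # stand-in for any drawing command
            placed.append(apply(ctm, *arg))
    return placed

# Existing page content shifts the origin and never restores it.
original = [("cm", (1, 0, 0, 1, 50, 100))]
# Drawing appended later, meant to land at (10, 20).
appended = [("rect", (10, 20))]

misplaced = run(original + appended)
# Enclosing the old content in q ... Q isolates its geometry change.
wrapped = [("q", None)] + original + [("Q", None)]
correct = run(wrapped + appended)

print(misplaced, correct)
```

In this model the appended rectangle lands at the intended position only when the earlier content is enclosed in a save/restore pair, which is the kind of repair a cleanup pass over the content stream can provide.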

Answer selected by kvrameshreddy