Pymupdf grouping same text of different pages in different text_blocks #2899

vignesh0710 · 2023-12-15T04:46:56Z

Description of the bug

Goal: highlight the differences between two PDF pages with PyMuPDF (version 1.22.1).

I am trying to compare two PDF pages, p1 and p2, and highlight the differences in p1.

Algorithm:

1. Get text_blocks with bounding_box from each_page
2. Compare text_blocks of p1 with p2
3. For every text_block that differs, use its bounding_box to highlight the difference

Code:

def get_text_blocks(page):
    # keep the extracted texts and their bounding boxes in separate lists
    # (the original reused one list for both, so it grew while being iterated)
    texts = []
    bboxes = []
    for block in page.get_text_blocks():
        # appending the bounding box of the block
        bboxes.append(block[0:4])
        # appending the text from the block
        texts.append(block[4])
    return texts, bboxes

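Difference code (p1 and p2 are the two pages):

import fitz  # PyMuPDF

texts_p1, bboxes_p1 = get_text_blocks(p1)
texts_p2, _ = get_text_blocks(p2)
# highlight every block of p1 whose text does not occur in p2
for text, bbox in zip(texts_p1, bboxes_p1):
    if text not in texts_p2:
        # the bounding box of the differing block
        rect = fitz.Rect(bbox)
        annot = p1.add_highlight_annot(rect)
        annot.update()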

This works. In some cases, however, identical content is grouped into different text blocks on the two pages, so the comparison highlights text that has not actually changed.

Example:

p1:

block_1: line1, line2
block_2: line3

p2:

block_1: line1, line2, line3

Although the same three consecutive lines (line1, line2, line3) are present on both pages p1 and p2, they are flagged as different because they are grouped into different blocks.
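To make the mismatch concrete, here is a minimal sketch in plain Python. It assumes that the text of each block holds its lines separated by newlines; the helper name `block_lines` and the sample strings are made up for illustration. Flattening the blocks into lines removes the dependence on how the lines were grouped:

```python
def block_lines(block_texts):
    """Split each block's text into stripped, non-empty lines."""
    lines = []
    for text in block_texts:
        lines.extend(ln.strip() for ln in text.splitlines() if ln.strip())
    return lines


# Same three lines, grouped differently (illustrative data only)
p1_blocks = ["line1\nline2\n", "line3\n"]
p2_blocks = ["line1\nline2\nline3\n"]

print(p1_blocks == p2_blocks)                            # False
print(block_lines(p1_blocks) == block_lines(p2_blocks))  # True
```

This only shows why a block-level comparison reports a difference here; it does not keep the bounding boxes needed for highlighting.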

I also tried extracting the text with get_text and comparing line by line, but that did not work either.

Any suggestions on how to fix this would be helpful.

How to reproduce the bug

explained above

PyMuPDF version

1.23.5 or earlier

Operating system

Windows

Python version

3.8

JorjMcKie · 2023-12-15T07:54:09Z

Maintainer

Have you tried sorting the text blocks before comparing them?
Your problem sounds like a difference in sequence only.
The simplest way to sort the blocks is this: blocks = page.get_text("blocks", sort=True). This sorts by ascending bottom coordinate, then by left coordinate.

If that is still too coarse-grained, you can try the same on a word or line level:
words = page.get_text("words", sort=True). This extraction has the additional advantage that white space differences are ignored.
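A word-level comparison could look roughly like the sketch below. It is only an illustration: the function name `unmatched_word_boxes` is made up, and using `difflib.SequenceMatcher` is my assumption about how to align the two word sequences, not something PyMuPDF provides. Each word item is assumed to be a tuple `(x0, y0, x1, y1, text, block_no, line_no, word_no)`, as returned by the "words" extraction.

```python
from difflib import SequenceMatcher


def unmatched_word_boxes(words1, words2):
    """Return the bboxes of words in words1 that have no counterpart in words2.

    Only the word text (index 4) is compared, so block and line numbers
    do not influence the result.
    """
    texts1 = [w[4] for w in words1]
    texts2 = [w[4] for w in words2]
    matcher = SequenceMatcher(None, texts1, texts2, autojunk=False)
    boxes = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" and "delete" mark words present only in words1
        if tag in ("replace", "delete"):
            boxes.extend(tuple(w[:4]) for w in words1[i1:i2])
    return boxes


# Hypothetical usage with two PyMuPDF pages p1 and p2 (requires "import fitz"):
#   w1 = p1.get_text("words", sort=True)
#   w2 = p2.get_text("words", sort=True)
#   for box in unmatched_word_boxes(w1, w2):
#       p1.add_highlight_annot(fitz.Rect(box))
```

Because only the word strings are compared, lines that were split into different blocks on the two pages should no longer be reported as differences.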

Lines can be extracted and sorted like this:

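lines = []
# gather the lines of every text block on the page
for b in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
    lines.extend(b["lines"])
# sort by bottom coordinate, then by left coordinate
lines.sort(key=lambda line: (line["bbox"][3], line["bbox"][0]))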
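Once both pages have been reduced to such line lists, the comparison itself might be sketched as follows. The helper names `line_text` and `boxes_of_lines_missing_in` are hypothetical. Each line is assumed to be a dict with a "bbox" and a list of "spans" that each carry a "text" key, as in the "dict" extraction. Membership is tested against a set of line texts, so order and repeated lines are ignored, which may or may not suit your documents.

```python
def line_text(line):
    """Join the span texts of one line dict and strip surrounding whitespace."""
    return "".join(span["text"] for span in line["spans"]).strip()


def boxes_of_lines_missing_in(lines1, lines2):
    """Return bboxes of non-empty lines in lines1 whose text is absent from lines2."""
    texts2 = {line_text(line) for line in lines2}
    boxes = []
    for line in lines1:
        text = line_text(line)
        if text and text not in texts2:
            boxes.append(tuple(line["bbox"]))
    return boxes
```

Each returned bbox can then be turned into a fitz.Rect and passed to add_highlight_annot on p1, as in the original code.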
