Pymupdf grouping same text of different pages in different text_blocks #2899
Unanswered
vignesh0710 asked this question in Q&A
Replies: 1 comment
-
Did you try sorting the text blocks before comparing them? If that is still too coarse-grained, you can try the same on a word or line level. Lines can be extracted and sorted like this:
lines = []
for b in page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)["blocks"]:
    lines.extend(b["lines"])
# sort by bottom edge (y1), then left edge (x0)
lines.sort(key=lambda l: (l["bbox"][3], l["bbox"][0]))
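As a rough sketch of how the sorted lines could then be compared independently of block grouping: the helper names `sorted_line_texts` and `lines_only_in_first` are hypothetical, the inputs are assumed to be the dictionaries returned by `page.get_text("dict", flags=fitz.TEXTFLAGS_TEXT)` for each page, and `difflib` is just one possible choice for the comparison, not something this thread prescribes.

```python
from difflib import SequenceMatcher


def sorted_line_texts(page_dict):
    """Flatten all blocks of a "dict" extraction into sorted (text, bbox) pairs."""
    lines = []
    for block in page_dict["blocks"]:
        lines.extend(block.get("lines", []))  # image blocks carry no "lines"
    # Sort by bottom edge, then left edge, so block membership no longer matters.
    lines.sort(key=lambda l: (l["bbox"][3], l["bbox"][0]))
    return [
        ("".join(span["text"] for span in l["spans"]).strip(), tuple(l["bbox"]))
        for l in lines
    ]


def lines_only_in_first(dict1, dict2):
    """Return (text, bbox) for lines of page 1 with no exact match on page 2."""
    a = sorted_line_texts(dict1)
    b = sorted_line_texts(dict2)
    matcher = SequenceMatcher(
        None, [text for text, _ in a], [text for text, _ in b], autojunk=False
    )
    changed = []
    for tag, i1, i2, _, _ in matcher.get_opcodes():
        # "replace" and "delete" cover page-1 lines that page 2 lacks.
        if tag in ("replace", "delete"):
            changed.extend(a[i1:i2])
    return changed
```

Each returned bbox could then be highlighted on p1, for example with `p1.add_highlight_annot(fitz.Rect(bbox))`. The sketch assumes exact string matches, so a line that wraps differently on the two pages would still be reported.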
-
Description of the bug
Goal: highlight the differences between 2 PDF pages with PyMuPDF (version 1.22.1).
Trying to compare 2 PDF pages, p1 and p2, and highlight the differences in p1.
Algorithm:
Code:
Difference pseudocode:
This works. But in certain cases, though the contents are identical, they get grouped into different text blocks, so the comparison highlights the wrong text.
Example:
p1:
p2:
Though the identical 3 lines (back-to-back), line1, line2, line3, are present in both pages p1 and p2, they get flagged because the blocks are different.
Also tried the get_text and compare line by line approach; it is not working.
Any suggestions on how to fix this would be helpful.
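One way to take block grouping out of the picture entirely is to compare at word level and ignore the block and line numbers. The following is only a sketch under assumptions: `words_only_in_first` is a hypothetical helper, its inputs are assumed to be the tuples returned by `page.get_text("words")`, which have the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, and it does a plain multiset comparison that ignores word order.

```python
from collections import Counter


def words_only_in_first(words1, words2):
    """Return word tuples from page 1 whose text has no unused counterpart on page 2."""
    # Count how often each word text occurs on page 2; block_no and line_no are ignored.
    remaining = Counter(w[4] for w in words2)
    unmatched = []
    # Walk page 1 by bottom edge, then left edge, so results come out in a stable order.
    for w in sorted(words1, key=lambda w: (w[3], w[0])):
        if remaining[w[4]] > 0:
            remaining[w[4]] -= 1  # consume one matching occurrence
        else:
            unmatched.append(w)
    return unmatched
```

The first four items of each returned tuple form the rectangle that could be highlighted on p1. Because order is ignored, text that merely moved to another position on the page would not be reported by this approach.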
How to reproduce the bug
explained above
PyMuPDF version
1.23.5 or earlier
Operating system
Windows
Python version
3.8