page.get_textbox extracting text not in rect #3067
Clip-driven text extraction has to decide whether, and to what extent, to include characters that only partly overlap the clip area. In PyMuPDF, the decision was made to include every character that overlaps the clip in any way.

There is some influence on how a character's bounding box is computed: depending on the font, a certain amount of "empty" space above and below the visible part of the character is included in the bbox. The PDF creator is responsible for choosing inter-line distances when writing text. If he does not use …

If you are not satisfied with these decisions or options, the only alternative is to decide yourself, on a character-by-character basis, whether each character should be included or not. There is currently no global setting that requests strict inclusion.
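The per-character decision described above could look roughly like the sketch below. It assumes the character-level output of `page.get_text("rawdict")`, where every character carries its own `bbox`, and keeps only characters whose bbox lies mostly inside the clip. The names `overlap_ratio`, `strict_text`, `textbox_strict` and the `min_ratio` threshold are made up for this illustration and are not part of PyMuPDF.

```python
def overlap_ratio(char_bbox, clip):
    """Return the fraction of the character bbox area that lies inside clip."""
    x0, y0, x1, y1 = char_bbox
    cx0, cy0, cx1, cy1 = clip
    area = (x1 - x0) * (y1 - y0)
    if area <= 0:
        return 0.0
    # Width and height of the intersection rectangle (non-positive if disjoint).
    w = min(x1, cx1) - max(x0, cx0)
    h = min(y1, cy1) - max(y0, cy0)
    if w <= 0 or h <= 0:
        return 0.0
    return (w * h) / area


def strict_text(rawdict, clip, min_ratio=0.5):
    """Join the characters of a "rawdict" result that are mostly inside clip.

    min_ratio is a hypothetical threshold: 0.5 keeps a character only if at
    least half of its bbox area falls inside the clip rectangle.
    """
    lines_out = []
    for block in rawdict.get("blocks", []):
        # Image blocks carry no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            kept = [
                ch["c"]
                for span in line["spans"]
                for ch in span["chars"]
                if overlap_ratio(ch["bbox"], clip) >= min_ratio
            ]
            if kept:
                lines_out.append("".join(kept))
    return "\n".join(lines_out)


def textbox_strict(page, clip, min_ratio=0.5):
    """Hypothetical stand-in for page.get_textbox() with stricter inclusion."""
    # The clip still pre-filters with PyMuPDF's "any overlap" rule;
    # the ratio test afterwards drops characters that barely touch it.
    raw = page.get_text("rawdict", clip=clip)
    return strict_text(raw, clip, min_ratio)
```

With the script from the report, this would be called as `textbox_strict(page, box4[0])`. Because character bboxes include some empty space above and below the glyph, the threshold will probably need tuning per document; 0.5 is only a starting point.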
Description of the bug
I am extracting text inside a given bounding box. When I use page.get_textbox(rect=bbox), I also get text that lies just above or below the bounding box (bbox). This can be verified by drawing the bbox on the PDF page.
When I use page.get_text(clip=bbox), it works in some cases, but in other cases it still captures some text from outside the box. The PDF used to reproduce this example is:
text_extraction_box.pdf
How to reproduce the bug
Code to reproduce the bug:
import fitz  # PyMuPDF

def draw_bbox_on_page(page, bboxes, color="green", show=False):
    ...  # body not included in the post
def show_page(page, clip="full"):
    ...  # body not included in the post
doc = fitz.open("text_extraction_box.pdf")
page = doc[0]
box1 = [[69.46566714662494, 335.91371154785156, 141.06473759242468, 341.3014272054036]]
box2 = [[74.4209976196289, 324.5425109863281, 136.3159942626953, 335.91371154785156]]
box3 = [[69.46566714662494, 341.3014272054036, 141.06473759242468, 346.7720184326172]]
box4 = [[69.46566714662494, 358.16371154785156, 141.06473759242468, 363.5532582600911]]
draw_bbox_on_page(page, box4, show=True)
print(page.get_text(clip=box4[0]))#, flags=fitz.TEXT_INHIBIT_SPACES))
print(page.get_textbox(rect=box4[0]))
The output of the first print statement is: "y \n y"
The output of the second print statement is: "Sync\Async"
The output of the first print statement with flags=fitz.TEXT_INHIBIT_SPACES is: "Sync/Async"
The other boxes show similar problems. What is the reason behind this? Is it that text gets captured even if only a small part of it lies inside the bbox? If so, can we control this, for example by extracting text only if it is "significantly" inside the bbox? Also, why do .get_text() and .get_textbox() produce different output?
PyMuPDF version
1.23.8 or earlier
Operating system
Windows
Python version
3.11