Description of the bug
I am extracting text from a given bounding box. When I use page.get_textbox(rect=bbox), I also get text that lies just above or below the bounding box (bbox). This can be verified by drawing the bbox on the PDF page and inspecting it visually.
When I use page.get_text(clip=bbox), it works in some cases, but in other cases it still captures some text from outside the bbox. The PDF used to reproduce this example is:
text_extraction_box.pdf
How to reproduce the bug
Code to reproduce the bug:
import fitz  # PyMuPDF
import matplotlib.pyplot as plt
import numpy as np

def draw_bbox_on_page(page, bboxes, color="green", show=False):
    # Outline each bbox on the page so it can be checked visually.
    for bbox in bboxes:
        page.draw_rect(bbox, color=fitz.pdfcolor[color], width=0.5)
    if show:
        show_page(page)

def show_page(page, clip="full"):
    # Render the page (or a clipped region) and display it in PDF point coordinates.
    DPI = 200
    clip = page.bound() if clip == "full" else clip
    pix = page.get_pixmap(dpi=DPI, clip=clip)
    image = np.ndarray([pix.h, pix.w, 3], dtype=np.uint8, buffer=pix.samples_mv)  # HWC
    plt.figure(dpi=DPI)
    plt.title('pdf page')
    _ = plt.imshow(image, extent=(0, pix.w*72/DPI, pix.h*72/DPI, 0))
    plt.show()  # needed when running outside a notebook

doc = fitz.open("text_extraction_box.pdf")
page = doc[0]
box1 = [[69.46566714662494, 335.91371154785156, 141.06473759242468, 341.3014272054036]]
box2 = [[74.4209976196289, 324.5425109863281, 136.3159942626953, 335.91371154785156]]
box3 = [[69.46566714662494, 341.3014272054036, 141.06473759242468, 346.7720184326172]]
box4 = [[69.46566714662494, 358.16371154785156, 141.06473759242468, 363.5532582600911]]
draw_bbox_on_page(page, box4, show=True)
print(page.get_text(clip=box4[0]))  # also tried with flags=fitz.TEXT_INHIBIT_SPACES
print(page.get_textbox(rect=box4[0]))
The output of the first print statement is: "y \n y"
The output of the second print statement is: "Sync\Async"
The output of the first print statement with flags=fitz.TEXT_INHIBIT_SPACES is: "Sync/Async"
The other box inputs show similar problems. What is the reason for this? Is text captured whenever even a small part of it falls inside the bbox? If so, can this be controlled, for example by extracting text only if it is "significantly" inside the bbox? Also, why do .get_text() and .get_textbox() return different output?
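To illustrate what I mean by "significantly" inside, here is a minimal sketch of the kind of filter I have in mind. It assumes the extra text comes from words whose bounding boxes only partially overlap the bbox, which I have not confirmed. The helper names overlap_fraction and words_mostly_inside and the 0.5 threshold are my own and not part of PyMuPDF, and the word coordinates in the example are made up for illustration.

```python
def overlap_fraction(word_box, clip_box):
    """Return the fraction of the word's area that lies inside clip_box."""
    wx0, wy0, wx1, wy1 = word_box
    cx0, cy0, cx1, cy1 = clip_box
    inter_w = min(wx1, cx1) - max(wx0, cx0)
    inter_h = min(wy1, cy1) - max(wy0, cy0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # no overlap at all
    word_area = (wx1 - wx0) * (wy1 - wy0)
    return (inter_w * inter_h) / word_area if word_area > 0 else 0.0


def words_mostly_inside(words, clip_box, min_fraction=0.5):
    """Keep the text of words that overlap clip_box by at least min_fraction.

    `words` uses the tuple layout of page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    """
    return [w[4] for w in words if overlap_fraction(w[:4], clip_box) >= min_fraction]


# Synthetic words (hypothetical coordinates, not taken from the attached PDF).
words = [
    (70.0, 352.0, 100.0, 359.0, "above", 0, 0, 0),       # only grazes the top edge
    (70.0, 358.5, 140.0, 363.5, "Sync/Async", 0, 1, 0),  # fully inside
]
box4 = (69.47, 358.16, 141.06, 363.55)
print(words_mostly_inside(words, box4))  # ['Sync/Async']
```

On a real page the same filter could presumably be fed with page.get_text("words"), but a built-in option for this would be preferable to post-filtering by hand.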
PyMuPDF version
1.23.8 or earlier
Operating system
Windows
Python version
3.11