page.get_textbox extracting text not in rect #3066

@theshahshow

Description of the bug

I am extracting text from a given bounding box. When I use page.get_textbox(rect=bbox), I get text that lies just above or below the bounding box (bbox); this can be verified by drawing the bbox on the PDF page.

When I use page.get_text(clip=bbox), it works in some cases, but in others it still captures text from outside the box. The PDF used to reproduce this is:
text_extraction_box.pdf

How to reproduce the bug

Code to reproduce the bug:

import fitz  # PyMuPDF
import matplotlib.pyplot as plt
import numpy as np


def draw_bbox_on_page(page, bboxes, color="green", show=False):
    # Outline each bounding box on the page so it can be checked visually.
    for bbox in bboxes:
        page.draw_rect(bbox, color=fitz.pdfcolor[color], width=0.5)

    if show:
        show_page(page)


def show_page(page, clip="full"):
    DPI = 200

    # Render the whole page unless a clip rectangle is given.
    clip = page.bound() if clip == "full" else clip
    pix = page.get_pixmap(dpi=DPI, clip=clip)
    image = np.ndarray([pix.h, pix.w, 3], dtype=np.uint8, buffer=pix.samples_mv)  # HWC

    plt.figure(dpi=DPI)
    plt.title('pdf page')
    # Scale the axes from pixels back to PDF points (72 per inch).
    _ = plt.imshow(image, extent=(0, pix.w*72/DPI, pix.h*72/DPI, 0))


doc = fitz.open("text_extraction_box.pdf")
page = doc[0]

# Each box is (x0, y0, x1, y1) in PDF points, wrapped in a list for draw_bbox_on_page.
box1 = [[69.46566714662494, 335.91371154785156, 141.06473759242468, 341.3014272054036]]
box2 = [[74.4209976196289, 324.5425109863281, 136.3159942626953, 335.91371154785156]]
box3 = [[69.46566714662494, 341.3014272054036, 141.06473759242468, 346.7720184326172]]
box4 = [[69.46566714662494, 358.16371154785156, 141.06473759242468, 363.5532582600911]]

draw_bbox_on_page(page, box4, show=True)
print(page.get_text(clip=box4[0]))  # , flags=fitz.TEXT_INHIBIT_SPACES))
print(page.get_textbox(rect=box4[0]))

The output of the first print statement is: "y \n y"
The output of the second print statement is: "Sync\Async"
The output of the first print statement with flags=fitz.TEXT_INHIBIT_SPACES is: "Sync/Async"

The other boxes show similar problems. What is the reason behind this? Is it because text gets captured even if only a small part of it falls inside the bbox? If so, can this be controlled, so that text is extracted only if it is "significantly" inside the bbox? Also, why do .get_text() and .get_textbox() produce different output?
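
To make "significantly inside" concrete, here is a rough sketch of the kind of filter I have in mind. It assumes the input is a list of word tuples in the layout returned by page.get_text("words"), i.e. (x0, y0, x1, y1, text, block_no, line_no, word_no). The helper names, the 0.5 threshold, and the sample words and coordinates are all made up for illustration; they are not part of PyMuPDF and not taken from the attached PDF.

```python
def overlap_fraction(word_box, clip):
    """Return the fraction of word_box's area that lies inside clip."""
    x0, y0, x1, y1 = word_box
    cx0, cy0, cx1, cy1 = clip
    # Width and height of the intersection rectangle (negative if disjoint).
    inter_w = min(x1, cx1) - max(x0, cx0)
    inter_h = min(y1, cy1) - max(y0, cy0)
    area = (x1 - x0) * (y1 - y0)
    if inter_w <= 0 or inter_h <= 0 or area <= 0:
        return 0.0
    return (inter_w * inter_h) / area


def words_mostly_inside(words, clip, min_fraction=0.5):
    """Keep the text of words whose box overlaps clip by at least min_fraction."""
    return [w[4] for w in words if overlap_fraction(w[:4], clip) >= min_fraction]


# Hypothetical word tuples; real ones would come from page.get_text("words").
sample_words = [
    (80.0, 352.0, 120.0, 360.0, "above", 0, 0, 0),   # mostly above the clip
    (80.0, 358.5, 110.0, 363.0, "inside", 0, 1, 0),  # fully inside the clip
    (80.0, 362.5, 120.0, 370.0, "below", 0, 2, 0),   # mostly below the clip
]
clip = (69.47, 358.16, 141.06, 363.55)  # box4 from above, rounded

print(words_mostly_inside(sample_words, clip))
```

With the default threshold only the word that lies fully inside the box is kept; lowering min_fraction lets through words that merely touch the box.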

PyMuPDF version

1.23.8 or earlier

Operating system

Windows

Python version

3.11
