page.get_textbox extracting text not in rect

### Description of the bug

I am extracting text from a given bounding box. With page.get_textbox(rect=bbox), I get text that lies just above or below the bounding box (bbox); this can be verified by drawing the bbox on the PDF page.

With page.get_text(clip=bbox), the result is correct in some cases, but in others it still captures text outside the box. The PDF used to reproduce this is:
[text_extraction_box.pdf](https://github.com/pymupdf/PyMuPDF/files/13986252/text_extraction_box.pdf)


### How to reproduce the bug

Code to reproduce the bug:

import fitz  # PyMuPDF
import matplotlib.pyplot as plt
import numpy as np

def draw_bbox_on_page(page, bboxes, color="green", show=False):

    for bbox in bboxes:
        page.draw_rect(bbox, color=fitz.pdfcolor[color], width=0.5)
    
    if show:
        show_page(page)

def show_page(page, clip="full"):

    DPI=200

    clip = page.bound() if clip == "full" else clip
    pix = page.get_pixmap(dpi=DPI, clip=clip)
    image = np.ndarray([pix.h, pix.w, 3], dtype=np.uint8, buffer=pix.samples_mv) # HWC

    plt.figure(dpi=DPI)
    plt.title('pdf page')
    _ = plt.imshow(image, extent=(0, pix.w*72/DPI, pix.h*72/DPI, 0))



doc = fitz.open("text_extraction_box.pdf")
page = doc[0]


box1 = [[69.46566714662494, 335.91371154785156, 141.06473759242468, 341.3014272054036]]
box2 = [[74.4209976196289, 324.5425109863281, 136.3159942626953, 335.91371154785156]]
box3 = [[69.46566714662494, 341.3014272054036, 141.06473759242468, 346.7720184326172]]
box4 = [[69.46566714662494, 358.16371154785156, 141.06473759242468, 363.5532582600911]]


draw_bbox_on_page(page, box4, show=True)
print(page.get_text(clip=box4[0]))#, flags=fitz.TEXT_INHIBIT_SPACES))
print(page.get_textbox(rect=box4[0]))

The output of the first print statement is: "y \n y"
The output of the second print statement is: "Sync\Async"
The output of the first print statement with flags=fitz.TEXT_INHIBIT_SPACES is: "Sync/Async"
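
As a side check, the box geometry can be computed directly from the coordinates above. The sketch below makes no PyMuPDF calls, and the `height` helper is only there for illustration. It shows how tall each box is and whether neighbouring boxes share an edge. If the character boxes of a text line are taller than these boxes (an assumption I have not verified for this PDF), that line would necessarily cross a box boundary.

```python
# Boxes from the reproduction code, as (x0, y0, x1, y1) in points.
boxes = {
    "box1": (69.46566714662494, 335.91371154785156, 141.06473759242468, 341.3014272054036),
    "box2": (74.4209976196289, 324.5425109863281, 136.3159942626953, 335.91371154785156),
    "box3": (69.46566714662494, 341.3014272054036, 141.06473759242468, 346.7720184326172),
    "box4": (69.46566714662494, 358.16371154785156, 141.06473759242468, 363.5532582600911),
}


def height(box):
    """Vertical extent of an (x0, y0, x1, y1) box (illustrative helper)."""
    return box[3] - box[1]


for name, box in boxes.items():
    print(f"{name}: height = {height(box):.2f} pt")

# Check whether vertically neighbouring boxes share an edge.
box2_touches_box1 = boxes["box2"][3] == boxes["box1"][1]
box1_touches_box3 = boxes["box1"][3] == boxes["box3"][1]
print("box2 bottom == box1 top:", box2_touches_box1)
print("box1 bottom == box3 top:", box1_touches_box3)
```

box1, box3 and box4 come out at roughly 5.4 pt tall, and box2, box1 and box3 are stacked edge to edge.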

The other boxes show similar problems. What is the reason for this? Is text captured even if only a small part of it falls inside the bbox? If so, can this be controlled, for example by extracting text only if it is "significantly" inside the bbox? Also, why do .get_text() and .get_textbox() produce different output?
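
To make "significantly inside" concrete, here is a minimal sketch of the kind of filter I have in mind. It is not an existing PyMuPDF option: `overlap_fraction`, `words_mostly_inside` and the 0.5 threshold are hypothetical names and values chosen for illustration. It assumes word tuples shaped like those returned by page.get_text("words"), i.e. (x0, y0, x1, y1, word, block_no, line_no, word_no), and keeps a word only if a sufficient share of its box area lies inside the rectangle. The sample tuples are made up.

```python
def overlap_fraction(word_box, rect):
    """Fraction of word_box's area that lies inside rect.

    Both arguments are (x0, y0, x1, y1) sequences.
    """
    wx0, wy0, wx1, wy1 = word_box
    rx0, ry0, rx1, ry1 = rect
    inter_w = max(0.0, min(wx1, rx1) - max(wx0, rx0))
    inter_h = max(0.0, min(wy1, ry1) - max(wy0, ry0))
    area = (wx1 - wx0) * (wy1 - wy0)
    return (inter_w * inter_h) / area if area > 0 else 0.0


def words_mostly_inside(words, rect, min_fraction=0.5):
    """Return the text of words whose overlap with rect is at least min_fraction."""
    return [w[4] for w in words if overlap_fraction(w[:4], rect) >= min_fraction]


# Made-up word tuples in the get_text("words") layout, for illustration only.
sample_words = [
    (70.0, 350.0, 100.0, 360.0, "above", 0, 0, 0),   # only grazes the top of rect
    (70.0, 358.5, 100.0, 363.5, "inside", 0, 1, 0),  # fully inside rect
]
rect = (69.0, 358.0, 141.0, 363.5)

kept = words_mostly_inside(sample_words, rect)
print(kept)
```

On a real page this would be called as words_mostly_inside(page.get_text("words"), box4[0]); the threshold would need tuning per document.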

### PyMuPDF version

1.23.8 or earlier

### Operating system

Windows

### Python version

3.11
