page.get_textbox extracting text not in rect #3067

theshahshow · 2024-01-19T08:21:48Z

Description of the bug

I am extracting text given a bounding box. When using page.get_textbox(rect=bbox), I get text that lies just above or below the bounding box (bbox); this can be verified by drawing the bbox on the PDF page.

When I use page.get_text(clip=bbox), it works in some cases, but in other cases it still captures some text from outside the box. The PDF used to reproduce this example is:
text_extraction_box.pdf

How to reproduce the bug

Code to reproduce the bug:

import fitz  # PyMuPDF
import matplotlib.pyplot as plt
import numpy as np

def draw_bbox_on_page(page, bboxes, color="green", show=False):
    # Outline every bbox on the page so it can be checked visually
    for bbox in bboxes:
        page.draw_rect(bbox, color=fitz.pdfcolor[color], width=0.5)

    if show:
        show_page(page)

def show_page(page, clip="full"):
    DPI = 200

    # Render the whole page unless a clip rectangle is given
    clip = page.bound() if clip == "full" else clip
    pix = page.get_pixmap(dpi=DPI, clip=clip)
    image = np.ndarray([pix.h, pix.w, 3], dtype=np.uint8, buffer=pix.samples_mv)  # HWC

    plt.figure(dpi=DPI)
    plt.title('pdf page')
    # Scale the axes from pixels back to PDF points (72 per inch)
    _ = plt.imshow(image, extent=(0, pix.w*72/DPI, pix.h*72/DPI, 0))

doc = fitz.open("text_extraction_box.pdf")
page = doc[0]

box1 = [[69.46566714662494, 335.91371154785156, 141.06473759242468, 341.3014272054036]]
box2 = [[74.4209976196289, 324.5425109863281, 136.3159942626953, 335.91371154785156]]
box3 = [[69.46566714662494, 341.3014272054036, 141.06473759242468, 346.7720184326172]]
box4 = [[69.46566714662494, 358.16371154785156, 141.06473759242468, 363.5532582600911]]

draw_bbox_on_page(page, box4, show=True)
print(page.get_text(clip=box4[0]))  # , flags=fitz.TEXT_INHIBIT_SPACES))
print(page.get_textbox(rect=box4[0]))

The output of the 1st print statement is: "y \n y"
The output of the 2nd print statement is: "Sync\Async"
The output of the 1st print statement with flags=fitz.TEXT_INHIBIT_SPACES is: "Sync/Async"

The other box inputs show similar problems. What is the reason behind this? Is it because text gets captured even if only a small part of it falls inside the bbox? If yes, can we control this by saying "only extract text that is significantly inside the bbox"? Also, why do .get_text() and .get_textbox() give different output?

PyMuPDF version

1.23.8 or earlier

Operating system

Windows

Python version

3.11

JorjMcKie · 2024-01-19T09:47:54Z

Maintainer

Clip-driven text extraction has to decide whether, and to what extent, to include characters that only partly overlap the clip area. In PyMuPDF, the decision has been made to include every character that overlaps the clip in any way.

How a character's bounding box is computed also plays a role:

Depending on the font, a certain amount of "empty" space above and below the visible part of the character is included in the bbox.
For example, in Helvetica the character bbox height is 37.4% larger than the visible part.
Setting the global parameter fitz.TOOLS.set_small_glyph_heights(True) ignores those 37.4% and delivers heights that equal the visible part, both in text searches and in text extractions.

The PDF creator is responsible for choosing inter-line distances when writing text. If the creator uses a value smaller than 1.374 * fontsize (baseline to baseline) for Helvetica, the character bboxes of adjacent lines will overlap, even if that is not visible. As long as the spacing is not too tight, PyMuPDF can cope with this when said global parameter is set.
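The overlap described here can be put into numbers. The following is only a sketch of the arithmetic, not PyMuPDF code: line_bbox_overlap is a hypothetical helper, and its default factor 1.374 is the Helvetica value quoted above. Other fonts need their own factor.

```python
def line_bbox_overlap(fontsize, baseline_distance, height_factor=1.374):
    """Vertical overlap in points between the character bboxes of two
    consecutive text lines.

    fontsize          -- font size in points
    baseline_distance -- distance from one baseline to the next, in points
    height_factor     -- bbox height as a multiple of the font size
                         (assumption: 1.374 for Helvetica, as quoted above)
    """
    bbox_height = height_factor * fontsize
    # The lines overlap by whatever part of the bbox height does not fit
    # between two consecutive baselines; never report a negative overlap.
    return max(0.0, bbox_height - baseline_distance)
```

For 10 pt Helvetica written with a 12 pt baseline distance, line_bbox_overlap(10, 12) gives roughly 1.74 pt, so a clip drawn tightly around one line still touches the character bboxes of its neighbours. With a baseline distance of 14 pt the result is 0.0 and no overlap occurs.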

If you are not satisfied with these decisions or options, then there is no way other than deciding yourself, on a by-character basis, whether a character should be included or not. There is currently no way to globally opt for strict inclusion.
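Deciding on a by-character basis could look roughly like the sketch below. It works on the dictionary returned by page.get_text("rawdict"), in which every character carries a "bbox" and a "c" entry. The helper names overlap_fraction and text_inside and the min_fraction threshold are illustrative assumptions, not part of PyMuPDF.

```python
def overlap_fraction(char_bbox, clip):
    """Fraction of the character bbox area that lies inside clip."""
    cx0, cy0, cx1, cy1 = char_bbox
    x0, y0, x1, y1 = clip
    # Width and height of the intersection rectangle (non-positive if disjoint)
    w = min(cx1, x1) - max(cx0, x0)
    h = min(cy1, y1) - max(cy0, y0)
    area = (cx1 - cx0) * (cy1 - cy0)
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def text_inside(rawdict, clip, min_fraction=0.5):
    """Keep only characters whose bbox lies at least min_fraction inside clip.

    rawdict is expected to have the layout of page.get_text("rawdict"):
    blocks -> lines -> spans -> chars. Surviving characters are joined
    per line, and lines are separated by newlines.
    """
    lines_out = []
    for block in rawdict.get("blocks", []):
        # Image blocks carry no "lines" entry and are skipped
        for line in block.get("lines", []):
            chars = [
                ch["c"]
                for span in line["spans"]
                for ch in span["chars"]
                if overlap_fraction(ch["bbox"], clip) >= min_fraction
            ]
            if chars:
                lines_out.append("".join(chars))
    return "\n".join(lines_out)
```

A call such as text_inside(page.get_text("rawdict"), bbox, min_fraction=0.5) then plays the role of "only extract text that is significantly inside the bbox"; raising min_fraction towards 1.0 approaches strict inclusion. The right threshold depends on the font and the line spacing of the document, so it has to be tuned per use case.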

3 replies

theshahshow Jan 19, 2024
Author

Do you think the following solution will work in general when using page.get_textbox(rect=bbox)?

We can create a new bbox whose height is cut down by 37.4% from the top and from the bottom, and use this new bbox in get_textbox:
cell_height = bbox[3] - bbox[1]
bbox_new = (bbox[0], bbox[1] + 0.374 * cell_height, bbox[2], bbox[3] - 0.374 * cell_height)

I tried this and it works for this table at least. I am guessing that for each font there is some percentage by which the char bbox is larger, so we could detect the font of the text inside a bbox and then reduce the bbox size by that much.

JorjMcKie Jan 19, 2024
Maintainer

A font has two values, font.ascender and font.descender, which control this. The "standard"/"natural" (my terminology) line height is (font.ascender - font.descender) * fontsize (descender is negative).
Here are some values for the Base14 fonts (Helvetica, Times-Roman, Courier):

>>> import fitz
>>> helv = fitz.Font("helv"); helv.ascender - helv.descender
1.3740000426769257
>>> tiro = fitz.Font("tiro"); tiro.ascender - tiro.descender
1.3339999616146088
>>> cour = fitz.Font("cour"); cour.ascender - cour.descender
1.2489999830722809

jsormaz Dec 17, 2025

JorjMcKie wrote: "Clip-driven text extraction has to decide whether, and to what extent, to include characters that only partly overlap the clip area. In PyMuPDF, the decision has been made to include every character that overlaps the clip in any way."

Am I interpreting this correctly? The docs seem to contradict this statement:

https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_text

clip (rect-like) – restrict the extraction to this rectangle. If None (default), the visible part of the page is taken. Any content (text, images) that is not fully contained in clip will be completely omitted. To avoid clipping altogether use clip=pymupdf.INFINITE_RECT(). Only then the extraction will contain all items. This parameter has no effect on options “html”, “xhtml” and “xml”.
