After inserting text with page.insert_htmlbox, how can I extract that text with get_text so that the place where it was inserted is respected? #3283

808Code · 2024-03-19T17:37:14Z

I did
page.insert_htmlbox(widget.rect, f"{index}", overlay=True)

because I wanted to place a number wherever the PDF has an input field.

Visual of what I did: [image]

But when I call page.get_text(), none of the numbers appear where I inserted them.

What I wanted (desired output):

.......Purpose of Loan? 0 Purchase 1 Construction 2 Refinance 3 Other ......

What it gives me:

....Purpose of Loan? Purchase Construction Refinance Other ....

and at the end of the text, all the numbers from 0 up to the total number of input fields are listed.

How can I get my desired output?

Thank you for any help.

Answered by JorjMcKie

JorjMcKie (Maintainer) · 2024-03-19T21:33:12Z

All text extraction and text search variants support the clip parameter, a rect_like.
So if you call page.get_text(clip=rect), where rect is the rectangle you used with insert_htmlbox, you should receive the inserted text.
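
A minimal sketch of the clip approach, for illustration only. The helper name inserted_texts is hypothetical; it assumes page is a PyMuPDF Page and that rects holds the same rectangles that were passed to insert_htmlbox (for example the widget rectangles).

```python
def inserted_texts(page, rects):
    """Return the text found inside each rectangle, in the order given.

    `page` is assumed to be a PyMuPDF Page; `rects` is an iterable of
    rect_like objects (e.g. the widget rectangles used for insertion).
    """
    return [page.get_text("text", clip=rect).strip() for rect in rects]


# Possible usage (assumes PyMuPDF is installed and "form.pdf" exists):
#   import pymupdf
#   doc = pymupdf.open("form.pdf")
#   page = doc[0]
#   rects = [w.rect for w in page.widgets()]
#   print(inserted_texts(page, rects))
```

This returns each inserted snippet by itself; it does not place it among the surrounding page text.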

808Code Mar 20, 2024
Author

Yes, when I call page.get_text(clip=widget.rect) I get the text associated with the widget. But I want the text to be extracted the same way get_text works with clip=None, i.e. in the same order as it appears in the PDF, with each widget's text next to the normal PDF text it sits beside.

JorjMcKie (Maintainer) · 2024-03-20T15:41:27Z

> Yes, when I call page.get_text(clip=widget.rect) I get the text associated with the widget. But I want the text to be extracted the same way get_text works with clip=None, i.e. in the same order as it appears in the PDF, with each widget's text next to the normal PDF text it sits beside.

This cannot work!
All new content in a PDF can only be appended to the old content, not inserted in the middle of things by some miracle. This is a PDF peculiarity, not a PyMuPDF restriction.
The only way you have is to sort the extracted text in a suitable way.
There is a sort parameter in get_text() which behaves slightly differently depending on the output option.
In your case, however, as you are probably inserting "words" in some way, you probably have to use get_text("words", sort=True) and then process the single words, re-synthesizing lines etc.
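
One way the "re-synthesizing lines" step could look, as a rough sketch rather than the library's own method. It assumes the usual word tuple layout returned by get_text("words"), namely (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper names and the y_tolerance value are illustrative assumptions and would need tuning for a real document.

```python
def words_to_lines(words, y_tolerance=3.0):
    """Group word tuples into visual lines and return them as strings.

    Words whose vertical centres lie within `y_tolerance` points of the
    first word of a line are treated as belonging to that line.
    """
    # Order by vertical centre first, then by left edge.
    ordered = sorted(words, key=lambda w: ((w[1] + w[3]) / 2, w[0]))
    lines = []  # each entry: (reference y centre, list of word tuples)
    for w in ordered:
        y_mid = (w[1] + w[3]) / 2
        if lines and abs(y_mid - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(w)
        else:
            lines.append((y_mid, [w]))
    # Within each line, read the words from left to right.
    return [
        " ".join(w[4] for w in sorted(ws, key=lambda w: w[0]))
        for _, ws in lines
    ]


def page_text_in_reading_order(page, y_tolerance=3.0):
    """Extract words from a PyMuPDF Page and rebuild the lines by position."""
    words = page.get_text("words", sort=True)
    return "\n".join(words_to_lines(words, y_tolerance))
```

Because the grouping uses only coordinates, it does not matter that the inserted numbers come last in the page's content: a number placed at a widget's rectangle lands between its neighbours on the same visual line. Multi-column layouts would need additional handling.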

808Code Mar 21, 2024
Author

Oh, okay. Thank you.
