Searching and highlighting a paragraph #1817

dineshzende · 2022-07-15T07:40:47Z

Discussed in #1815

Originally posted by dineshzende July 15, 2022
Hello Authors!
Let me first thank you for your wonderful creation. It has helped us and reduced our effort.
I am new to this library. I want to search for a paragraph and highlight it, but I am only able to search for and highlight text that sits on one line. If the search text is spread over multiple lines, no results are returned.

I tried inserting '\n' and '\r' in the search string, but it didn't work.
Can you please help me understand how to search for text that is spread over multiple lines (a paragraph)?

Answered by JorjMcKie

JorjMcKie (Maintainer) · 2022-07-15T11:40:56Z

There can be a number of reasons for this.
First of all, the string you are searching for should not be excessively long. Three or four lines are probably OK ... if no other problem gets in the way.

Second, your PDF does not necessarily store its text in reading sequence, i.e. in the sequence in which you are accustomed to reading and in which the page appears to show it.
MuPDF's search mechanism does not reorder text into any specific sequence before it starts searching. Instead, it parses the text as it is physically stored, and of course will not find a phrase whose character sequence is scrambled in some way.
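
To check whether a page is affected by this, one option is to compare the stored word order with a simple top-to-bottom, left-to-right sort. The following is only a rough diagnostic sketch: the helper names `stored_in_reading_order` and `check_page` are made up for illustration, and the sort key (bottom coordinate, then left coordinate) assumes a single-column page whose words share exactly the same baseline within a line.

```python
def stored_in_reading_order(words):
    """Return True if the words already appear top-to-bottom, left-to-right.

    `words` is a list of tuples starting with (x0, y0, x1, y1, text, ...),
    the shape of the items returned by page.get_text("words") in PyMuPDF.
    """
    words = list(words)
    # Sort by bottom edge (y1), then by left edge (x0).
    return words == sorted(words, key=lambda w: (w[3], w[0]))


def check_page(path, page_number=0):
    """Hypothetical helper: run the check on one page of a PDF file."""
    import fitz  # PyMuPDF, imported lazily so the pure helper works without it

    with fitz.open(path) as doc:
        # No sort argument: words come back in the order stored in the file.
        words = doc[page_number].get_text("words")
    return stored_in_reading_order(words)
```

If the check returns False, the text is probably not stored in reading order and a multi-line search may fail for that reason. A True result does not prove the opposite, since the heuristic ignores columns, tables and slightly uneven baselines.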

Third, some PDF creators want to save file size and simulate special text effects (like boldness or shadowing) instead of embedding the corresponding font. For example, you can simulate bold by writing the text twice, the second time with some tiny offset. This saves the file size of the font's bold version. When you extract (or search), you will then encounter e.g. "tthhiiss" or "thisthis" instead of "this", and MuPDF will hence either not find "this" or find it twice.
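
A quick way to spot this kind of simulated bold is to look for the two patterns mentioned above in the extracted words. The helpers below are a hypothetical, pure-Python sketch (the function names are not part of PyMuPDF). They will also flag legitimate words such as "oo" or "bonbon", so treat the result as a hint that needs a manual look, not as a fix.

```python
def collapse_doubled_chars(word):
    """Turn "tthhiiss" into "this"; return other words unchanged."""
    if len(word) >= 2 and len(word) % 2 == 0:
        pairs = range(0, len(word), 2)
        # Every character must be immediately followed by a copy of itself.
        if all(word[i] == word[i + 1] for i in pairs):
            return word[::2]
    return word


def collapse_repeated_word(word):
    """Turn "thisthis" into "this"; return other words unchanged."""
    half = len(word) // 2
    if len(word) >= 2 and len(word) % 2 == 0 and word[:half] == word[half:]:
        return word[:half]
    return word
```

If many extracted words on a page change under one of these helpers, the page probably writes its text twice, and searching for the plain phrase is unlikely to behave as expected.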

So if you have a long phrase (presumably for the purpose of highlighting it later), try searching in separate steps: first for the start of the phrase, then for the end.
Text highlighting supports specifying a start and an end point.
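
The two-step approach could look roughly like the sketch below. It relies on `page.search_for()` and on the `start` and `stop` arguments of `page.add_highlight_annot()`. The function name `highlight_between` is made up for illustration, and the sketch assumes that each phrase occurs only once on the page, so it simply takes the first hit of each.

```python
def highlight_between(page, start_phrase, end_phrase):
    """Highlight everything from start_phrase to end_phrase on one page.

    `page` is expected to be a PyMuPDF page. Returns the new annotation,
    or None if either phrase was not found.
    """
    starts = page.search_for(start_phrase)  # list of rectangles
    ends = page.search_for(end_phrase)
    if not starts or not ends:
        return None
    # Assumption: the first hit of each phrase is the one we want.
    # Marking begins at the top-left of the start hit and stops at the
    # bottom-right of the end hit.
    return page.add_highlight_annot(start=starts[0].tl, stop=ends[0].br)
```

With PyMuPDF installed, this could be called as `highlight_between(doc[0], "first few words", "last few words")` on a document opened with `fitz.open(...)`, followed by `doc.save(...)` to a new file. Keeping both phrases short, ideally within one line each, avoids the multi-line search problem for the phrases themselves.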

3 replies

dineshzende Jul 15, 2022
Author

Thank you very much @JorjMcKie for your time and the answer. I am trying the last step you suggested. The only problem is that the first or last phrase of the paragraph may occur several times in the PDF, which may result in a set of wrong offsets.
I am still figuring out how to work around this.
Although PyMuPDF is a great option for me, is there any alternative library that can solve this issue?

Thank you very much once again.

JorjMcKie Jul 15, 2022
Maintainer

> is there any alternative library that can solve this issue?

I am not aware of any, if you are looking for a programmatic interface. I also believe the difficulties you describe cannot be resolved by anything except your own wits 😎.
E.g. subselect by position on the page: if the start and stop phrases A and B occur more than once, you will have several A, B, A, B, ... each with its associated rect. Sort them and associate the right B with its A ...
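
One possible way to do this pairing is sketched below. It is only an illustration under a few assumptions: rectangles are given as (x0, y0, x1, y1) sequences (PyMuPDF `Rect` objects can be indexed this way), the page has a single column so that sorting by top edge and then left edge approximates reading order, and each B is paired with the most recent unpaired A before it. The name `pair_occurrences` is hypothetical.

```python
def pair_occurrences(starts, ends):
    """Pair each end rectangle with the closest preceding start rectangle.

    Returns a list of (start_rect, end_rect) tuples in page order.
    """
    # Tag start hits with 0 and end hits with 1, keyed by (top, left).
    events = [(r[1], r[0], 0, r) for r in starts]
    events += [(r[1], r[0], 1, r) for r in ends]
    events.sort(key=lambda e: e[:3])  # never compares the rects themselves

    pairs = []
    pending = None  # most recent start that has no end yet
    for _, _, kind, rect in events:
        if kind == 0:
            pending = rect
        elif pending is not None:
            pairs.append((pending, rect))
            pending = None
    return pairs
```

Each resulting pair can then be used as the start and end point of one highlight. Extracting the text between the two rectangles and comparing it with the expected paragraph would be a reasonable extra check against false pairs.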

dineshzende Jul 16, 2022
Author

Thanks for the reply. If there are multiple occurrences, I am thinking of pairing them and handling them in code.
