Remove some blocks and then use standard PyMuPDF get_text functionality? #2079

stevesimmons · 2022-11-22T11:44:45Z

stevesimmons
Nov 22, 2022

Is your feature request related to a problem? Please describe.
I want to extract text from PDFs where some pages have text overlays that need to be ignored.

Describe the solution you'd like
I would like to scan the blocks on a page, delete (or otherwise mark as "ignore") the ones I don't want, and then use page.get_text(clip=rect) on the target regions for the remaining blocks.

Describe alternatives you've considered
Alternatively, stream the elements of a page into a new empty page, filtering out the blocks I don't need, and then do the text extraction on the filtered page.

Additional context
Any suggestions would be most welcome. I would like to solve 100% of this problem with PyMuPDF because it is such a great tool.

Answered by JorjMcKie

JorjMcKie · 2022-11-22T12:18:55Z

JorjMcKie
Nov 22, 2022
Maintainer

The most challenging part is how you want to identify the undesired text. If you can do that, no additional technique is required: simply ignore it when you encounter it while extracting text.
But maybe I am missing some aspect of your requirement.

When you do page.get_text("dict", ...) you will be given all text with all available meta-information: font name, font properties, font size, text color, position information, and writing direction.
What else do you need for identification?
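
To make the meta-information above concrete, here is a minimal sketch that flattens the result of page.get_text("dict") into one record per span. The helper name `iter_spans` and the file name in the usage comment are illustrative assumptions, not part of PyMuPDF; the key names are the ones PyMuPDF documents for its "dict" output (text blocks have type 0, image blocks have type 1 and no "lines").

```python
def iter_spans(page_dict):
    """Yield one flat record per text span of a get_text("dict") result.

    `iter_spans` is a hypothetical helper, not a PyMuPDF function.
    """
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks (type 1)
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                yield {
                    "text": span["text"],
                    "font": span["font"],        # font name
                    "size": span["size"],        # font size in points
                    "flags": span["flags"],      # font properties (bit field)
                    "color": span["color"],      # sRGB integer
                    "bbox": tuple(span["bbox"]),  # position of the span
                    "dir": tuple(line["dir"]),   # writing direction of the line
                }


# Possible usage with a real file (the file name is an assumption):
# import fitz  # PyMuPDF
# page = fitz.open("input.pdf")[0]
# for rec in iter_spans(page.get_text("dict")):
#     print(rec["size"], rec["font"], repr(rec["text"]))
```

Printing the records for a sample page is usually enough to see which property separates the unwanted text from the rest.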

JorjMcKie · 2022-11-22T12:27:11Z

JorjMcKie
Nov 22, 2022
Maintainer

Looking again carefully, I am not sure what you technically mean by text "overlays". Do you just mean text overlapping other text?

stevesimmons · 2022-11-22T14:28:36Z

stevesimmons
Nov 22, 2022
Author

Ah, good suggestion! I am too used to page.get_text(clip=...) magically bringing my text back in a single string.

I'll look at the dict option now and use the other properties to filter out the text I don't want. The bits I don't want have a different font size, so irrespective of the clip window I will still be able to filter them out.

Thank you
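
For readers landing here later, the font-size filter described above might look roughly like the sketch below. It assumes the unwanted text is known by its font size; the helper name `text_without_size`, the tolerance value, and the sizes and file name in the usage comment are illustrative assumptions, not anything confirmed in this thread.

```python
def text_without_size(page_dict, unwanted_size, tol=0.1):
    """Rebuild a plain string from a get_text("dict") result, dropping
    every span whose font size is within `tol` of `unwanted_size`.

    `text_without_size` is a hypothetical helper, not a PyMuPDF function.
    """
    lines_out = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            kept = [
                span["text"]
                for span in line["spans"]
                if abs(span["size"] - unwanted_size) > tol
            ]
            text = "".join(kept)
            if text.strip():  # drop lines that held only unwanted text
                lines_out.append(text)
    return "\n".join(lines_out)


# Possible usage (file name, clip rectangle and size are assumptions):
# import fitz  # PyMuPDF
# page = fitz.open("input.pdf")[0]
# rect = fitz.Rect(50, 100, 550, 400)
# print(text_without_size(page.get_text("dict", clip=rect), unwanted_size=7))
```

Because the filtering happens on the extracted dict, the page itself is left unmodified, and the same clip rectangle can still be passed to get_text as before.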
