Remove some blocks and then use standard PyMuPDF get_text functionality? #2079
Is your feature request related to a problem? Please describe. Describe the solution you'd like Describe alternatives you've considered Additional context
Replies: 3 comments
Looking at this again carefully, I am not sure what you technically mean by text "overlays". Do you just mean text overlapping other text?
Ah, good suggestion! I am too used to I'll look at the Thank you
The most challenging part is how you want to identify undesired text. If you can do that, no additional technique is required: simply ignore it on encounter while extracting text.
But maybe I am missing some aspect of your requirement.
When you do
page.get_text("dict", ...)
you will be given all text with all available meta-information: font name, font properties, font size, text color, position information, and writing direction. What else do you need for identification?
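To illustrate the "ignore it on encounter" idea, here is a minimal sketch that walks the blocks / lines / spans structure returned by page.get_text("dict") and rebuilds the text while skipping unwanted spans. The helper names text_without_unwanted and looks_like_overlay are hypothetical, and the rule itself (very large font or non-horizontal writing direction) is only an assumed example; the real criterion depends on what your unwanted text looks like.

```python
def text_without_unwanted(page_dict, is_unwanted):
    """Rebuild text from a page.get_text("dict") result, skipping unwanted spans.

    is_unwanted(span, line) is a caller-supplied predicate (hypothetical name).
    """
    out_lines = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 = text block; image blocks have no lines
            continue
        for line in block.get("lines", []):
            kept = [
                span["text"]
                for span in line.get("spans", [])
                if not is_unwanted(span, line)
            ]
            text = "".join(kept)
            if text.strip():  # drop lines that became empty after filtering
                out_lines.append(text)
    return "\n".join(out_lines)


def looks_like_overlay(span, line):
    """Assumed example rule: huge font size or a non-horizontal writing direction."""
    _cosine, sine = line["dir"]  # (1.0, 0.0) for ordinary horizontal text
    return span["size"] >= 40 or abs(sine) > 1e-3


# Possible usage with PyMuPDF ("input.pdf" is a placeholder file name):
#   import fitz
#   doc = fitz.open("input.pdf")
#   print(text_without_unwanted(doc[0].get_text("dict"), looks_like_overlay))
```

Other span keys such as "font", "flags", "color" and "bbox" can be used in the predicate in the same way.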