The hardest part is deciding how to identify the undesired text. Once you can do that, no additional technique is required: simply skip that text whenever you encounter it during extraction.
But maybe I am missing some aspect of your requirement.

When you call page.get_text("dict", ...), you get all the text together with all available meta-information: font name, font properties, font size, text color, position information, and writing direction.
What else do you need for identification?
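
To make this concrete, here is a minimal sketch of the "ignore it on encounter" idea. The names is_unwanted, wanted_text and wanted_text_from_pdf are hypothetical helpers, and the rule itself (font size below 6 pt, or a non-horizontal writing direction) is only an assumed example of what "undesired" could mean; replace it with whatever actually distinguishes your text.

```python
def is_unwanted(span, line):
    """Hypothetical rule (an assumption): adapt it to your own documents."""
    dx, dy = line["dir"]                  # writing direction as (cosine, sine)
    rotated = abs(dx - 1.0) > 1e-3 or abs(dy) > 1e-3  # e.g. a diagonal watermark
    tiny = span["size"] < 6.0             # e.g. fine print
    return tiny or rotated


def wanted_text(page_dict, unwanted=is_unwanted):
    """Collect text from a page.get_text("dict") result, skipping unwanted spans."""
    out = []
    for block in page_dict["blocks"]:
        if block["type"] != 0:            # 0 = text block, 1 = image block
            continue
        for line in block["lines"]:
            kept = [s["text"] for s in line["spans"] if not unwanted(s, line)]
            if kept:
                out.append("".join(kept))
    return "\n".join(out)


def wanted_text_from_pdf(path):
    """Apply wanted_text to every page of the PDF at `path`."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it
    with fitz.open(path) as doc:
        return "\n\n".join(wanted_text(page.get_text("dict")) for page in doc)
```

Each span also carries "font", "flags", "color" and "bbox", so the same predicate could just as well test the font name, the text color, or whether the span lies inside a given rectangle.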

Answer selected by JorjMcKie
Category
Q&A
2 participants

This discussion was converted from issue #2077 on November 23, 2022 07:41.