Using redaction annotations for removing all text from a page is an efficient and safe method - what are your concerns here?

Finding out which type of situation you are actually facing is challenging, however. PyMuPDF can help you, but there will never be 100% certainty because of the endless variety of possibilities in PDF. Here are a few ways that may at least approximately lead to the right conclusions.

  1. If the list returned by page.get_bboxlog() contains items of the form ("ignore-text", ...), then the page has hidden text, which was probably generated by some OCR engine.
  2. If a page contains text written with Tesseract's GlyphLessFont, then obviously the page was OCRed with Tesseract. This text is also de…
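
The two checks above could be sketched roughly as follows. This is only an illustration, not a definitive detector: the helper names (has_ignored_text, uses_glyphless_font, ocr_hints) are made up for this example, and it assumes that page.get_bboxlog() yields (type, rect) items and that page.get_fonts() yields tuples whose fourth element is the base font name. The helpers take plain lists so they can be tried without a PDF at hand.

```python
from typing import Iterable, Sequence


def has_ignored_text(bboxlog: Iterable[Sequence]) -> bool:
    """Return True if any bboxlog item has the type "ignore-text".

    Assumption: each item looks like (type, rect), as returned by
    page.get_bboxlog().
    """
    return any(item[0] == "ignore-text" for item in bboxlog)


def uses_glyphless_font(fonts: Iterable[Sequence]) -> bool:
    """Return True if any font's base name mentions GlyphLessFont.

    Assumption: each entry looks like
    (xref, ext, type, basefont, name, encoding), as returned by
    page.get_fonts(). A substring test is used so that a subset
    prefix in front of the font name does not hide it.
    """
    return any("GlyphLessFont" in str(entry[3]) for entry in fonts)


def ocr_hints(bboxlog: Iterable[Sequence], fonts: Iterable[Sequence]) -> dict:
    """Combine both heuristics; neither of them is conclusive proof."""
    return {
        "hidden_text": has_ignored_text(bboxlog),
        "tesseract_font": uses_glyphless_font(fonts),
    }


# Hand-made sample data shaped like PyMuPDF's return values.
sample_bboxlog = [
    ("fill-image", (0.0, 0.0, 595.0, 842.0)),
    ("ignore-text", (72.0, 72.0, 300.0, 90.0)),
]
sample_fonts = [(12, "n/a", "Type0", "GlyphLessFont", "f-0-0", "Identity-H")]

hints = ocr_hints(sample_bboxlog, sample_fonts)

# With a real document (assumes PyMuPDF is installed; "scan.pdf" is a
# placeholder file name):
#   import pymupdf
#   page = pymupdf.open("scan.pdf")[0]
#   hints = ocr_hints(page.get_bboxlog(), page.get_fonts())
```

A page for which both hints are true is a strong candidate for "scanned image plus Tesseract OCR layer", but a page for which both are false may still have been OCRed by some other engine.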
