Removing text from prior OCR #2466
We work with a lot of PDFs that originate outside the organisation, and these often arrive with very poor OCR from some other (unknown) tool. I'm looking to remove all existing OCR text from individual pages before using Textract to generate a new invisible text layer for those pages. It seems like the way to do this is to call add_redact_annot(page.rect) and then apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE), but I'm wondering if there's another way that's more efficient and/or safer.
A secondary problem is that many of our docs are mixed: some pages come from Word or other tools that provide their own invisible text, others were clearly OCR'd previously, and still others are image-only. I sure wish there was an easy way to identify previous OCR output so I'd have a clear flag for which pages require refreshing.
Replies: 1 comment 1 reply
Using redaction annotations to remove all text from a page is an efficient and safe method. What are your concerns here?
Finding out which type of situation you are actually facing is challenging, however. PyMuPDF can help, but there will never be 100% certainty because of the endless variety of possibilities in PDF. Here are a few ways that may lead, at least approximately, to the right conclusions.
- If page.get_bboxlog() contains items ("ignore-text", ...), then there exists hidden text which probably was generated by some OCR engine.
- If the page uses the font GlyphLessFont, then obviously the page was OCRed with Tesseract. This text is also de…