Removing text from prior OCR #2466

djoltes · 2023-06-12T16:26:40Z

We work with a lot of PDFs that originate outside the organisation, and these often arrive with very poor OCR from some other (unknown) tool. I'm looking to remove all existing OCR text from individual pages before using Textract to generate a new invisible text 'layer' for these pages. It seems the way to do this is to call add_redact_annot(page.rect) and then apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE), but I'm wondering if there's another way that's more efficient and/or safer.

A secondary problem is that many of our docs are mixed: some pages come from Word or other tools that provide their own invisible text, others were clearly OCR'd previously, and still others are image-only. I sure wish there was an easy way to identify previous OCR output so I'd have a clear flag for which pages require refreshing.

Answered by JorjMcKie

JorjMcKie (Maintainer) · 2023-06-12T19:06:57Z

Using redaction annotations for removing all text from a page is an efficient and safe method - what are your concerns here?

Finding out which situation you are actually facing is, however, challenging for sure. PyMuPDF can help you, but there will never be 100% certainty because of the endless different possibilities in PDF. Here are a few ways that may lead, at least approximately, to the right conclusions.

If the list page.get_bboxlog() contains items ("ignore-text", ...), then hidden text exists, which was probably generated by some OCR engine.
If a page contains text written with Tesseract's GlyphLessFont, then the page was obviously OCRed with Tesseract. This text is also detectable as "ignore-text" (hidden text).
Some OCR engines identify themselves in the PDF's metadata dictionary.
If a page is completely covered by one (or at most two) image(s), and any text on the page occurs earlier than the covering image(s) in the list page.get_bboxlog(), then the text will almost certainly be invisible and might therefore have been generated by one of the more exotic OCR engines. The qualifier "almost certainly" refers to the case of transparent images, which cover text but keep it readable.
A page may have no text and no image and yet not be empty: it may contain vector graphics which "simulate" text by drawing the components of each letter, for example "A" drawn like "/-", "o" as a small circle, etc. Here, only heuristic checks can lead to the conclusion that the page should be OCRed, after which the OCR output should be checked for an interpretable result.
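The first, second and fourth indicators above could be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive detector: the function names and the 0.95 coverage threshold are hypothetical, only a single covering image is considered (not the two-image case), and bbox log coordinates are assumed to be comparable with page.rect, which may not hold for rotated pages. The functions only call page.get_bboxlog(), page.get_fonts() and page.rect.

```python
COVERAGE_THRESHOLD = 0.95  # hypothetical cut-off for "image covers the page"


def has_hidden_text(page):
    """True if the bbox log reports text that is drawn invisibly."""
    return any(item[0] == "ignore-text" for item in page.get_bboxlog())


def uses_glyphless_font(page):
    """True if the page references a font named like Tesseract's GlyphLessFont."""
    # get_fonts() items: (xref, ext, type, basefont, name, encoding, ...)
    return any("GlyphLessFont" in font[3] for font in page.get_fonts())


def _coverage(bbox, rect):
    """Fraction of the page rectangle overlapped by bbox."""
    x0, y0, x1, y1 = bbox
    width = min(x1, rect.x1) - max(x0, rect.x0)
    height = min(y1, rect.y1) - max(y0, rect.y0)
    page_area = (rect.x1 - rect.x0) * (rect.y1 - rect.y0)
    if width <= 0 or height <= 0 or page_area <= 0:
        return 0.0
    return (width * height) / page_area


def text_drawn_before_covering_image(page):
    """True if all painted text precedes a page-covering image in the log."""
    text_idx, image_idx = [], []
    for i, item in enumerate(page.get_bboxlog()):
        kind, bbox = item[0], item[1]
        if kind in ("fill-text", "stroke-text"):
            text_idx.append(i)
        elif kind == "fill-image" and _coverage(bbox, page.rect) >= COVERAGE_THRESHOLD:
            image_idx.append(i)
    # Text is (almost certainly) hidden if the last covering image comes after it.
    return bool(text_idx) and bool(image_idx) and max(text_idx) < max(image_idx)


def ocr_hints(page):
    """Collect the indicators into one dict for flagging a page."""
    return {
        "hidden_text": has_hidden_text(page),
        "glyphless_font": uses_glyphless_font(page),
        "text_under_image": text_drawn_before_covering_image(page),
    }
```

Calling ocr_hints(page) for each page of a document opened with fitz.open(...) gives a per-page dictionary of flags. None of them is conclusive on its own, which matches the caveat that there will never be 100% certainty.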

So wrapping up:
Give up hoping for a clear and foolproof way; you will have to get involved with technical detail beyond your liking.

1 reply

djoltes Jun 12, 2023
Author

Wasn't really concerned; it was more of a best-practices query. What I mentioned works well (I was just testing it today), so I'm just being thorough by asking.

For flagging of OCR/not OCR pages, I'm also looking at text length (obvious), whether the font list is empty, and other possible indicators.
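A rough sketch of such coarse per-page indicators, assuming the usual PyMuPDF page methods get_text(), get_fonts(), get_images() and get_drawings(). The function name page_content_hints and the MIN_TEXT_CHARS threshold are hypothetical and would need tuning against real documents.

```python
MIN_TEXT_CHARS = 20  # hypothetical threshold for "effectively no text"


def page_content_hints(page):
    """Summarise text length, font list and image/drawing presence for a page."""
    text = page.get_text().strip()
    fonts = page.get_fonts()
    images = page.get_images()
    drawings = page.get_drawings()
    little_text = len(text) < MIN_TEXT_CHARS
    return {
        "text_chars": len(text),
        "no_fonts": not fonts,
        "image_count": len(images),
        "drawing_count": len(drawings),
        # Image-only scan: (almost) no text, no fonts, at least one image.
        "likely_image_only": little_text and not fonts and bool(images),
        # Text "simulated" by vector graphics: nothing but drawings on the page.
        "likely_vector_only": little_text and not images and bool(drawings),
    }
```

A page flagged as likely_image_only or likely_vector_only is a candidate for OCR; pages with text and fonts still need the hidden-text checks from the answer above to decide whether that text came from an earlier OCR pass.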

Yes, we have a very challenging situation, and I'm perfectly happy working through various indicators to identify and flag pages that require re-OCR. What I find amusing is that what I've developed part-time over the last 2-3 months, using PyMuPDF and a variety of open-source and/or commercial NER and other services, appears to be better than most, if not all, commercial solutions, which generally rely on managed, known document formats and layouts and use X/Y coordinates rather than actual content.
