Redacting text while preserving layout #2195

djoltes · 2023-01-26T19:33:39Z

Need some ideas; I'm relatively new to PyMuPDF but have a requirement to perform redaction on files that may or may not contain image pages. Redacting text pages is trivial and I have it working well, but some files contain images of invoices or other documents, and I need to preserve the existing layout, so I can't just grab the text layer.

Has anyone developed a process that will detect 'image' pages, then route that page into a handler to OCR the page and return the full output so redaction can be performed against the extracted text? I looked at the 'ocrpages.py' sample that uses ocrmypdf, but keep getting errors during extraction -- I have both Tesseract and GSview installed (Windows 2019 server) and they've been added to the PATH, and can't figure out why I get the 'cannot find the file specified' errors.

Text Length: 1

Scanning contents: 0%| | 0/1 [00:00<?, ?page/s]
Scanning contents: 100%|##########| 1/1 [00:00<00:00, 51.13page/s]

OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s][tesseract] lots of diacritics - possibly poor OCR

OCR: 50%|##### | 0.5/1.0 [00:05<00:05, 10.43s/page]
OCR: 100%|##########| 1.0/1.0 [00:05<00:00, 5.22s/page]

Recompressing JPEGs: 0image [00:00, ?image/s]
Recompressing JPEGs: 0image [00:00, ?image/s]

Deflating JPEGs: 0%| | 0/1 [00:00<?, ?image/s]
Deflating JPEGs: 100%|##########| 1/1 [00:00<00:00, 50.02image/s]

[WinError 2] The system cannot find the file specified
[WinError 2] The system cannot find the file specified

Any leads appreciated...

JorjMcKie · 2023-01-27T12:31:35Z

JorjMcKie
Jan 27, 2023
Maintainer

There are several indicators suggesting that a page "is" an image and has no text:

1. There exists at least one image covering the complete page rectangle.
2. page.get_text() delivers an empty string.
3. If the font GlyphLessFont is used on your page, then you know that the page was OCRed using Tesseract.

Check 1. like this:

imgxrefs = [item[0] for item in page.get_images()]
rects = []  # list of all rectangles on page covered by some image
for xref in imgxrefs:
    rects.extend(page.get_image_rects(xref))
for r in rects:
    if page.rect in r:
        print("page covered by an image")
        break
    # to accept a page that is only "roughly" covered, use this test instead:
    # if abs(page.rect - r) < some_threshold:   # like some_threshold = 0.1 or so

Check 3. like this:

ocred = [item[3] for item in page.get_fonts() if "GlyphLessFont" in item[3]] != []

If false, then the page was not OCRed (with Tesseract at least).
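
The three indicators can also be collected in one helper. This is only a sketch: the names rect_covers and classify_page and the tolerance tol (in points) are made up for illustration, and the coverage test compares plain coordinate tuples instead of using the Rect operators shown above. It relies on page.rect, page.get_images(), page.get_image_rects(), page.get_text() and page.get_fonts() behaving as in the snippets above.

```python
def rect_covers(outer, inner, tol=1.0):
    """True if rectangle `outer` contains `inner`, allowing `tol` points of slack.

    Both rectangles are (x0, y0, x1, y1) sequences.
    """
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return (
        ox0 <= ix0 + tol
        and oy0 <= iy0 + tol
        and ox1 >= ix1 - tol
        and oy1 >= iy1 - tol
    )


def classify_page(page, tol=1.0):
    """Return the three indicators for one page as a dict of booleans."""
    page_rect = tuple(page.rect)
    image_rects = []
    for item in page.get_images():
        # item[0] is the image xref
        image_rects.extend(page.get_image_rects(item[0]))
    return {
        # indicator 1: some image covers (roughly) the whole page
        "full_page_image": any(
            rect_covers(tuple(r), page_rect, tol) for r in image_rects
        ),
        # indicator 2: no extractable text
        "no_text": page.get_text().strip() == "",
        # indicator 3: Tesseract's invisible OCR font is present
        "tesseract_ocr": any(
            "GlyphLessFont" in item[3] for item in page.get_fonts()
        ),
    }
```

A page with full_page_image and no_text both true, and tesseract_ocr false, would be a candidate for OCR before redaction.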

But please consider this:
If you redact an OCRed page, then the text will be redacted for sure, but the image also needs to be modified to remove the text part inside the image. For this, the image is extracted and white / empty rectangles are put into it at all places where there is redacted text.
This procedure takes a relatively long time, so make sure to do this only once per page: add redaction annotations for all text pieces to remove, then execute page.apply_redactions().
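
A minimal sketch of that "mark everything first, apply once" pattern follows. The function name redact_terms and the list of search terms are assumptions for illustration; page.search_for(), page.add_redact_annot() and page.apply_redactions() are the PyMuPDF calls involved.

```python
def redact_terms(page, terms):
    """Add a redaction annotation for every hit, then apply them in one pass.

    Returns the number of annotations that were added.
    """
    marked = 0
    for term in terms:
        for rect in page.search_for(term):
            # white fill, so the redacted area blends into a scanned page
            page.add_redact_annot(rect, fill=(1, 1, 1))
            marked += 1
    if marked:
        # the expensive step (it may rewrite images): call it only once per page
        page.apply_redactions()
    return marked
```

Calling it as redact_terms(page, ["John Smith", "555-123-4567"]) would mark all occurrences of both strings and trigger a single apply_redactions() call.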

12 replies

JorjMcKie Jan 27, 2023
Maintainer

Ah, I believe I have it now.
I think you have two options:

1. When you have created that PDF with the OCRed pixmap, nothing prevents you from redacting its single page right away there. Then copy it over to your original PDF. For this copying, you must use target_pdf.insert_pdf(ocr_pdf, start_at=page.number+1). This will make the OCRed page immediately follow the current page. Then do target_pdf.delete_page(page.number) to remove the current page - this lets the OCRed page take over its place.
2. If you do not really need the image page to contain OCRed text, but simply want to clean the text from the image, you can still make a redaction annotation for it and let the image be modified. You can use the same rectangle that you computed for the ocr_pdf.

As you have observed: re-assigning the Python page variable to something else has no influence.

djoltes Jan 27, 2023
Author

Ah, that's what I was looking for...and yes, it makes sense that you can't just replace the object. That said, I get an error from the library on the delete attempt.

if len(pageContent) <= 1:
    # This means we need to OCR the page first...
    try:
        pageExtract = ocr_the_page(page)
        doc.insert_pdf(pageExtract, start_at=page.number+1)
        doc.delete_page(page.number)
        continue

Traceback (most recent call last):
File "C:\Analytics\workspace\WeeklyInjury\src\PDFredact.py", line 158, in
doc.delete_page(page.number)
File "C:\Program Files\python_310\lib\site-packages\fitz\fitz.py", line 5281, in delete_page
while pno < 0:
TypeError: '<' not supported between instances of 'NoneType' and 'int'

JorjMcKie Jan 27, 2023
Maintainer

doc.delete_page(page.number)

Ah! Because pages were inserted (from wherever), all existing page objects are invalidated - as a precaution against page numbers becoming invalid. I tricked myself out here 😉.
So you must save the number of the current page into some variable before insert_pdf() which you then use here.
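
A small sketch of that fix, assuming ocr_pdf is the one-page OCRed document; the helper name replace_with_ocr_page is made up for illustration:

```python
def replace_with_ocr_page(doc, page, ocr_pdf):
    """Swap `page` in `doc` for the single page of `ocr_pdf`.

    Returns a fresh page object for the replaced position.
    """
    pno = page.number  # save it now: insert_pdf() invalidates `page`
    doc.insert_pdf(ocr_pdf, start_at=pno + 1)  # OCRed page follows the current one
    doc.delete_page(pno)  # the OCRed page moves into position pno
    return doc[pno]  # do not reuse the old `page` variable
```

Inside a loop over the document, the returned page should replace the stale page variable before any further work on that position.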

djoltes Jan 27, 2023
Author

LOL, I hate when that happens! Okay, this seems to be working -- at least I'm now getting the OCR'd texts in the output PDF, so thanks very much for the chat.

This is all part of a project to redact texts automatically as much as possible -- currently I have handlers for phone numbers, vehicle IDs (VINs), email addresses, and I'm using NER tools from spaCy to detect proper names. We'll see what the accuracy level turns out to be...
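
For illustration only, pattern-based handlers of that kind could look roughly like the sketch below. The patterns are simplified assumptions and not the ones from this project: US-style phone numbers with separators, 17-character VINs without I, O and Q, and a loose email pattern. The name find_sensitive is hypothetical as well.

```python
import re

# Simplified, assumed patterns; real data will need tuning.
PATTERNS = {
    "phone": re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b"),
    "vin": re.compile(r"\b[A-HJ-NPR-Z0-9]{17}\b"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def find_sensitive(text):
    """Return all matches per category found in the given text."""
    return {name: rx.findall(text) for name, rx in PATTERNS.items()}
```

Each returned string could then be passed to page.search_for() to obtain the rectangles to redact.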

JorjMcKie Jan 27, 2023
Maintainer

LOL, I hate when that happens!

😎 How much more would you hate to see your Python interpreter crashing because of referencing a page that no longer exists ...
