Redacting text while preserving layout #2195
Replies: 1 comment 12 replies
-
There are several indicators suggesting that a page "is" an image and has no text:
Check 1. like this:
imgxrefs = [item[0] for item in page.get_images()]
rects = []  # list of all rectangles on page covered by some image
for xref in imgxrefs:
    rects.extend(page.get_image_rects(xref))
for r in rects:
    if page.rect in r:
        print("page covered by an image")
        break
    # to check if a page is at least "roughly" covered, check this:
    if abs(page.rect - r) < some_threshold:  # like some_threshold = 0.1 or so
        print("page covered by an image")
        break
Check 3. like this:
ocred = [item[3] for item in page.get_fonts() if "GlyphLessFont" in item[3]] != []
If false, then the page was not OCRed (with Tesseract at least). But please consider this:
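For experimenting with the "roughly covered" idea from Check 1 without a PDF at hand, here is a pure-Python sketch that works on plain (x0, y0, x1, y1) tuples. In PyMuPDF such tuples could be obtained with tuple(page.rect) and from the rectangles returned by page.get_image_rects(xref). The helper names coverage_ratio and is_image_page and the 0.95 cutoff are assumptions made for illustration; they are not part of PyMuPDF.

```python
def coverage_ratio(page_rect, image_rect):
    """Fraction of the page area that lies inside image_rect (0.0 to 1.0)."""
    px0, py0, px1, py1 = page_rect
    ix0, iy0, ix1, iy1 = image_rect
    # Intersection of the two rectangles; negative extents mean no overlap.
    width = max(0.0, min(px1, ix1) - max(px0, ix0))
    height = max(0.0, min(py1, iy1) - max(py0, iy0))
    page_area = (px1 - px0) * (py1 - py0)
    return (width * height) / page_area if page_area > 0 else 0.0


def is_image_page(page_rect, image_rects, min_ratio=0.95):
    """True if any single image covers at least min_ratio of the page.

    min_ratio is a hypothetical cutoff; tune it for your own scans.
    """
    return any(coverage_ratio(page_rect, r) >= min_ratio for r in image_rects)


# Example: an A4-sized page (595 x 842 points).
page_rect = (0, 0, 595, 842)
print(is_image_page(page_rect, [(5, 5, 590, 837)]))    # slightly inset scan -> True
print(is_image_page(page_rect, [(50, 50, 200, 120)]))  # small logo only -> False
```

A ratio-based test like this tolerates scans that are placed with a small margin, which the strict containment test (page.rect in r) would reject.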
-
Need some ideas; I'm relatively new to PyMuPDF but have a requirement to perform redaction on files that may or may not contain image pages. Redacting text pages is trivial and I have it working well, but some files contain images of invoices or other documents, and I need to preserve the existing layout, so I can't just grab the text layer.
Has anyone developed a process that will detect 'image' pages, then route each such page into a handler that OCRs it and returns the full output, so redaction can be performed against the extracted text? I looked at the 'ocrpages.py' sample that uses ocrmypdf, but I keep getting errors during extraction. I have both Tesseract and GSview installed (Windows Server 2019) and they've been added to the PATH, and I can't figure out why I get the 'cannot find the file specified' errors.
Text Length: 1
Scanning contents: 0%| | 0/1 [00:00<?, ?page/s]
Scanning contents: 100%|##########| 1/1 [00:00<00:00, 51.13page/s]
OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s][tesseract] lots of diacritics - possibly poor OCR
OCR: 50%|##### | 0.5/1.0 [00:05<00:05, 10.43s/page]
OCR: 100%|##########| 1.0/1.0 [00:05<00:00, 5.22s/page]
Recompressing JPEGs: 0image [00:00, ?image/s]
Recompressing JPEGs: 0image [00:00, ?image/s]
Deflating JPEGs: 0%| | 0/1 [00:00<?, ?image/s]
Deflating JPEGs: 100%|##########| 1/1 [00:00<00:00, 50.02image/s]
[WinError 2] The system cannot find the file specified
[WinError 2] The system cannot find the file specified
Any leads appreciated...
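On the [WinError 2] messages: Windows reports this error when a file or executable that a process tries to open or launch cannot be found. One cheap thing to rule out is whether the Python process itself sees the external tools on its PATH, since a service or IDE may run with a different PATH than an interactive shell. Below is a minimal sketch using the standard library's shutil.which. The helper name find_tools and the list of tool names are assumptions (gswin64c is the usual name of the 64-bit Ghostscript console executable on Windows), so substitute whatever the script actually invokes.

```python
import shutil


def find_tools(names):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


# Assumed tool names; adjust to the executables your script really calls.
for name, path in find_tools(["tesseract", "ocrmypdf", "gswin64c"]).items():
    print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If a name is reported as not found here but works in a command prompt, the PATH seen by the Python process is the first thing to look at. If everything resolves, the missing file is more likely an input or output path passed to one of the tools.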