How to preserve page structure (tables, layout) while doing OCR #4681
-
Hi @ranjith-3330, could you please post this discussion on our forum page here: https://forum.pymupdf.com and include your PDF if possible?
-
Hi team,
For regular (text-based, non-image) PDFs, I am able to extract the table content and redact whatever information I need, and it works fine.
Now we have scanned PDFs.
I used OCR to extract the information:
full_tp = page.get_textpage_ocr(
    flags=0,
    dpi=300,
    full=True,
    language='eng',
    tessdata=r"C:\Users\3330\AppData\Local\Programs\Tesseract-OCR\tessdata",
)
After doing OCR, the layout is completely lost.
How can I preserve the layout, such as table structure, while doing OCR, so that I can iterate over the table and redact the rows that are not needed?
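One possible way to recover row structure is to take the OCR words with their coordinates and cluster them by vertical position. The sketch below is only an illustration, not a confirmed PyMuPDF recipe: it assumes the input tuples have the layout returned by `page.get_text("words", textpage=full_tp)`, and the helper name `group_words_into_rows` and the tolerance `y_tol` are hypothetical. The sample coordinates are made up.

```python
from typing import List, Tuple

# Assumed word tuple layout, as produced by page.get_text("words"):
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def group_words_into_rows(words: List[Word], y_tol: float = 3.0) -> List[List[Word]]:
    """Cluster words whose vertical centers lie within y_tol of the row's mean center."""
    rows: List[List[Word]] = []
    centers: List[float] = []  # running mean of the vertical center of each row
    for w in sorted(words, key=lambda w: (w[1] + w[3]) / 2):
        cy = (w[1] + w[3]) / 2
        if rows and abs(cy - centers[-1]) <= y_tol:
            rows[-1].append(w)
            # update the running mean with the new word's center
            centers[-1] += (cy - centers[-1]) / len(rows[-1])
        else:
            rows.append([w])
            centers.append(cy)
    # order the words of each row from left to right
    return [sorted(r, key=lambda w: w[0]) for r in rows]


# Made-up sample data standing in for OCR output
sample = [
    (300.0, 101.0, 360.0, 112.0, "5,555,390.00", 0, 0, 0),
    (50.0, 100.0, 120.0, 111.0, "0060584486", 0, 0, 1),
    (50.0, 120.0, 95.0, 131.0, "Partner", 0, 1, 0),
    (100.0, 121.0, 180.0, 132.0, "Acknowledged", 0, 1, 1),
]
rows = group_words_into_rows(sample)
for row in rows:
    print([w[4] for w in row])
```

The tolerance would likely need tuning per document, since OCR baselines on skewed scans can drift by more than a few points.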
Redaction is also not applied correctly after applying OCR:
search_terms = ["Partner Acknowledged", "0060584486", "5,555,390.00", "0.00"]
for term in search_terms:
    rectangles = full_tp.search(term)  # find all occurrences of the term
    if rectangles:
        print(f"Found '{term}' at positions: {rectangles}")
        for rect in rectangles:
            # Mark the area for redaction
            page.add_redact_annot(rect)
    else:
        print(f"'{term}' not found on this page.")
page.apply_redactions()
For regular PDFs, everything works fine.
How can I handle the following cases when using OCR?
1) Preserving structure (tables, layout)
2) Applying redaction correctly
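For the second case, one option is to redact whole rows rather than individual search hits. The sketch below is an assumption-laden illustration: it takes rows as lists of word tuples in the `page.get_text("words")` layout (how the rows are built is left open), and the helper names `row_box` and `boxes_for_rows_to_redact` and the `pad` margin are hypothetical. Each resulting box could then be wrapped in a PyMuPDF `Rect` and passed to `page.add_redact_annot(...)` before calling `page.apply_redactions()`.

```python
from typing import List, Sequence, Tuple

# Assumed word tuple layout: (x0, y0, x1, y1, text, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]
Box = Tuple[float, float, float, float]


def row_box(row: Sequence[Word], pad: float = 1.0) -> Box:
    """Union of all word boxes in a row, grown by pad on every side."""
    return (
        min(w[0] for w in row) - pad,
        min(w[1] for w in row) - pad,
        max(w[2] for w in row) + pad,
        max(w[3] for w in row) + pad,
    )


def boxes_for_rows_to_redact(
    rows: Sequence[Sequence[Word]], terms: Sequence[str], pad: float = 1.0
) -> List[Box]:
    """Return one box per row whose joined text contains any search term."""
    boxes: List[Box] = []
    for row in rows:
        text = " ".join(w[4] for w in row)
        if any(term in text for term in terms):
            boxes.append(row_box(row, pad))
    return boxes


# Made-up rows standing in for OCR output
sample_rows = [
    [
        (50.0, 100.0, 120.0, 111.0, "0060584486", 0, 0, 0),
        (300.0, 101.0, 360.0, 112.0, "5,555,390.00", 0, 0, 1),
    ],
    [
        (50.0, 120.0, 95.0, 131.0, "Partner", 0, 1, 0),
        (100.0, 121.0, 180.0, 132.0, "Acknowledged", 0, 1, 1),
    ],
]
boxes = boxes_for_rows_to_redact(sample_rows, ["Partner Acknowledged"])
print(boxes)
```

Note that the match is a plain substring test, so a short term such as "0.00" also matches inside "5,555,390.00"; comparing whole words instead would avoid that if it is not wanted.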