This repository was archived by the owner on Feb 16, 2023. It is now read-only.
Possible to include both tesseract OCR & existing document content #124
rknightion
started this conversation in
Feature Requests
Replies: 1 comment 4 replies
If you've left PAPERLESS_OCR_MODE at its default value
Am I missing something?
I'm wondering if it would be possible to show both the tesseract OCR content view and a tab/section for any content already embedded in the PDF?
For example, my scanner OCRs documents quite well, and often produces better results than tesseract/ocrmypdf. But I also have a lot of documents without any built-in text data.
Conversely, some emailed PDFs have the raw text embedded and 100% accurate, so OCRing them is going to produce a sub-optimal result in comparison. It's fine imo to still OCR them, but having the option to view the original text in the UI would be helpful.
A third potential use is where a document (government forms are a good example) has random/unnecessary embedded text entries that aren't actually shown on the page. For these documents, the embedded file content is less useful than the OCR output would be.
To be clear, I'm not necessarily talking about choosing what ends up in the PDF/A archive version (that can stay as is), but about the ability to view the existing text content/layer of an original PDF in the UI below the tesseract OCR data.