Possible to include both tesseract OCR & existing document content #124

rknightion · 2020-12-11T12:33:34Z

rknightion
Dec 11, 2020

I'm wondering if it would be possible to include both the tesseract OCR content view as well as a tab/section for any content already embedded in the PDF?
For example, my scanner OCRs documents quite well, and often produces better results than tesseract/ocrmypdf. But I also have a lot of documents without any built-in text data.

Conversely, some emailed PDFs have the raw text data embedded and 100% accurate, so OCRing them is going to produce a sub-optimal result in comparison. It's fine imo to still OCR them, but having the option to view the original text in the UI would be helpful.

A third potential use is where a document (government forms are a good example of this) has random/unnecessary embedded text entries that are not actually shown on the page. For these documents, the embedded content is less useful than what OCR would produce.

To be clear, I'm not necessarily discussing what ends up in the PDF/A archive version (that can stay as is), but rather the ability to view the existing text content/layer of an original PDF in the UI below the tesseract OCR data.

jonaswinkler · 2020-12-11T12:52:27Z

jonaswinkler
Dec 11, 2020
Maintainer

If you've left PAPERLESS_OCR_MODE at its default value skip, paperless handles PDF documents in the following way:

If a document does not have any embedded text, paperless will always attempt OCR, no matter what configuration is used. Document content is required by paperless to operate properly.
If a document already has text, paperless will reuse that text. This is the case for digital documents, and scanned documents for which OCR has already been performed.
If a document contains text on some pages only (for whatever reason), paperless will attempt OCR on the pages with missing text and reuse the text from pages with text.
In addition to that, the configuration option skip_noarchive skips PDF/A generation entirely if text is present in a document.
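The rules above can be sketched roughly as follows. This is only an illustration, not paperless code: `Page`, `plan_ocr`, and `needs_archive` are hypothetical names, and the sketch assumes that "text is present" means at least one page has embedded text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    # Text layer already embedded in the PDF page; empty if there is none.
    embedded_text: str


def plan_ocr(pages: List[Page], mode: str = "skip") -> List[str]:
    """Return one action per page: 'reuse' the embedded text or run 'ocr'."""
    actions = []
    for page in pages:
        has_text = bool(page.embedded_text.strip())
        # In skip and skip_noarchive, pages that already have text are reused.
        # Pages without text are always OCR'ed, whatever the mode.
        if has_text and mode in ("skip", "skip_noarchive"):
            actions.append("reuse")
        else:
            actions.append("ocr")
    return actions


def needs_archive(pages: List[Page], mode: str = "skip") -> bool:
    """Assumption: skip_noarchive drops PDF/A generation if any page has text."""
    any_text = any(page.embedded_text.strip() for page in pages)
    return not (mode == "skip_noarchive" and any_text)
```

For a two-page document where only the first page has text, `plan_ocr` in skip mode reuses page one and OCRs page two.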

Am I missing something?

4 replies

rknightion Dec 11, 2020
Author

In essence, what I'm suggesting is an option where, if a document has embedded text, paperless still OCRs it.
In the UI there's currently a box for document content, but it isn't always clear whether that content was already embedded in the document or was generated via OCR.
My idea is to have 2 boxes. One that shows any embedded document content, and another that shows the OCR'd content.

In essence, it's similar to using the "redo" option for PAPERLESS_OCR_MODE, but rather than discarding the existing document text, paperless would keep both the original embedded text and the OCR'd text (with the OCR'd one going into the archived file and the original stored in the DB, or vice versa). For my use case I'm less concerned about the archived file and more about being able to view both sets of text in the UI or, at some point, the API.
Some other DMSs function that way. A good example off the top of my head is Mayan, which (by default) OCRs everything, stores both the OCR'd content and the existing embedded content in the DB, and has a view for both.

I hope that explains it more?

jonaswinkler Dec 11, 2020
Maintainer

I see your point now. Need to think about that. Will report back when I'm done.

rknightion Dec 11, 2020
Author

Cool! - Thanks.

One of the main use cases for this, more generally: we often tell users they have the option of OCRing if the "scanner's detected text is inaccurate", yet the user has no easy way to see in the UI whether it's inaccurate compared to the tesseract OCR version. They'd have to import it with skip, check it, then manually re-run the OCR and check that result. A simple view showing both can help people make that informed decision for each document more easily.
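To illustrate the kind of comparison I mean, here is a rough sketch using Python's standard difflib module. The function name `compare_texts` is made up for this example and nothing here is part of paperless; it just takes the two strings a "both texts" view would have at hand.

```python
import difflib
from typing import List, Tuple


def compare_texts(embedded: str, ocr: str) -> Tuple[float, List[str]]:
    """Return a similarity ratio in [0, 1] and a line-based unified diff."""
    ratio = difflib.SequenceMatcher(None, embedded, ocr).ratio()
    diff = list(
        difflib.unified_diff(
            embedded.splitlines(),
            ocr.splitlines(),
            fromfile="embedded",
            tofile="ocr",
            lineterm="",
        )
    )
    return ratio, diff
```

A ratio close to 1 with an empty diff would suggest the two versions agree; a lower ratio points at the lines worth looking at before deciding which text to keep.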
Side note: thanks for all your hard work. paperless-ng really is awesome!

jonaswinkler Dec 11, 2020
Maintainer

Alright, so it's getting rather smoky in here.

I don't exactly know how other systems do it; I've only used them for short periods of time. I believe that performing OCR on every document isn't exactly required, especially for many digitally produced documents, and it makes the consumption process unnecessarily long.

Paperless performs a couple actions with the document immediately after consumption, such as automatic matching of metadata for which rules have been defined, and adding the content to the search index. With both content fields available and populated, I'm not exactly sure which to use.

So, regarding the issue at hand, which is deciding which version (extracted or OCR'ed) of the content to keep:

There's no need to add another database field. The text of the original document is stored in that document, and accessing that is relatively fast.
I could add a new tab/field/popup/something that shows the text of the original. The content data field will always contain what's been produced by OCRmyPDF. If OCR_MODE=skip, this will almost always be the same as the text from the original document, if available, since OCRmyPDF just copies the pages verbatim in that mode, if text is present. I suppose this will be confusing to users.
For users using skip (which is the majority) and scanners with built-in OCR, we could do the following. Add some menu option to have OCR performed afterwards. This is a long-running task and users will not get any notification about task completion, as of now. After that's done, the content field will be updated with the results from the OCR process, and they can decide if that's better than before or not. Provide some button to revert that.
For users using redo or force, they could use the same revert button to use the original text instead of the OCR'ed one.
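Roughly, that flow could look like the sketch below. Everything in it is hypothetical: `Document`, `rerun_ocr`, `revert_to_original`, and the two injected callables stand in for whatever would actually run OCR and read the text layer of the original file. The point is only that no second database field is needed, since the original text can be re-read from the original document on revert.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Document:
    original_path: str  # path to the unmodified original file
    content: str        # the single content field used for search/matching


def rerun_ocr(doc: Document, run_ocr: Callable[[str], str]) -> None:
    """Menu action: OCR the original again and overwrite the content field."""
    doc.content = run_ocr(doc.original_path)


def revert_to_original(doc: Document, extract_text: Callable[[str], str]) -> None:
    """Revert button: re-read the text embedded in the original document."""
    doc.content = extract_text(doc.original_path)
```

In a real setup the OCR call would be a long-running background task rather than a direct function call.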

I don't exactly know if all that is required. For instance, I am pretty much aware that tesseract produces much better results than the software provided by my scanner, so I've just turned off the OCR option of my scanner and that's it.

The impact of this would be marginally better searching due to content of better quality. Is that worth it?
