consumer error on scanned pdf #465

msrv · 2021-01-28T22:14:56Z


Hi Jonas,

Thank you for providing paperless-ng! It is an awesome project. However, the consumer of my paperless-ng installation (1.0, Docker, Debian 10) fails to OCR a scanned PDF.

System:
Operating System: Debian GNU/Linux 10 (buster) Kernel: Linux 4.19.0-13-amd64 Architecture: x86-64

Docker:
Docker version 20.10.2, build 2291f61

Docker-compose:
docker-compose version 1.27.4, build 40524192

Steps to reproduce the problem:

Scan document (Brother ADS-1700W)
PDF (5 MB) goes to \IP\scan
\IP\scan resides on my NAS and is mounted to /media/scan via fstab; uid and gid are the same as the Docker user specified in docker-compose.env
OCR task fails

Expected behavior:

Successful OCR of the document

Error log of failed task:

`: Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 176, in parse
ocrmypdf.ocr(**ocr_args)
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 326, in ocr
return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 368, in run_pipeline
validate_pdfinfo_options(context)
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 193, in validate_pdfinfo_options
raise InputFileError()
ocrmypdf.exceptions.InputFileError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 179, in try_consume_file
document_parser.parse(self.path, mime_type, self.filename)
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 193, in parse
raise ParseError(e)
documents.parsers.ParseError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file
override_tag_ids=override_tag_ids)
File "/usr/src/paperless/src/documents/consumer.py", line 196, in try_consume_file
raise ConsumerError(e)
documents.consumer.ConsumerError`

Error log of webserver when running docker-compose up:

webserver_1 | ERROR 2021-01-28 22:58:34,107 _pipeline This PDF has a user fillable form. --redo-ocr is not currently possible on such files.
webserver_1 | ERROR 2021-01-28 22:58:34,114 loggers Error while consuming document Scan_1_28012021_003254.pdf:
webserver_1 | 22:58:34 [Q] ERROR Failed [Scan_1_28012021_003254.pdf] - : Traceback (most recent call last):
webserver_1 | File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 176, in parse
webserver_1 | ocrmypdf.ocr(**ocr_args)
webserver_1 | File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 326, in ocr
webserver_1 | return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
webserver_1 | File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 368, in run_pipeline
webserver_1 | validate_pdfinfo_options(context)
webserver_1 | File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 193, in validate_pdfinfo_options
webserver_1 | raise InputFileError()
webserver_1 | ocrmypdf.exceptions.InputFileError
webserver_1 |
webserver_1 | During handling of the above exception, another exception occurred:
webserver_1 |
webserver_1 | Traceback (most recent call last):
webserver_1 | File "/usr/src/paperless/src/documents/consumer.py", line 179, in try_consume_file
webserver_1 | document_parser.parse(self.path, mime_type, self.filename)
webserver_1 | File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 193, in parse
webserver_1 | raise ParseError(e)
webserver_1 | documents.parsers.ParseError
webserver_1 |
webserver_1 | During handling of the above exception, another exception occurred:
webserver_1 |
webserver_1 | Traceback (most recent call last):
webserver_1 | File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
webserver_1 | res = f(*task["args"], **task["kwargs"])
webserver_1 | File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file
webserver_1 | override_tag_ids=override_tag_ids)
webserver_1 | File "/usr/src/paperless/src/documents/consumer.py", line 196, in try_consume_file
webserver_1 | raise ConsumerError(e)
webserver_1 | documents.consumer.ConsumerError

I noticed the line: ERROR 2021-01-28 22:58:34,107 _pipeline This PDF has a user fillable form. --redo-ocr is not currently possible on such files.

The PDF definitely has no user-fillable form; it is a non-OCR scan of a printed document.

Steps I tried:

Rebuilding the image --> no success
Permissions --> okay, ls -la shows that uid and gid are correct
OCR options --> no difference: skip, redo-ocr, or omitting the option completely all produce the same error

Do you have any ideas on this?

Edit: Manually adding the PDF via the uploader throws the same error.

jonaswinkler · 2021-01-28T22:43:47Z

Maintainer

This is an issue with the underlying OCR library not supporting this particular file. Regarding the fillable form: PDF is a pretty wild format, and some hidden elements in that document might appear as if they are forms.
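To see whether a scan carries such a hidden element, one crude check is to look for an `/AcroForm` key in the raw file. This is only a sketch: it assumes the "user fillable form" message relates to an AcroForm entry in the PDF, and the helper names `may_declare_form` and `check_file` are made up for illustration. It will miss keys stored in compressed object streams, so a negative result proves nothing.

```python
from pathlib import Path


def may_declare_form(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes mention an /AcroForm key.

    Heuristic only (assumption, not how any OCR library decides):
    the key can be hidden inside compressed object streams, and a
    match does not mean the form has visible, fillable fields.
    """
    return b"/AcroForm" in pdf_bytes


def check_file(path: str) -> bool:
    """Convenience wrapper that reads the file from disk first."""
    return may_declare_form(Path(path).read_bytes())
```

A scanner that writes an empty form dictionary into every file would make this return True even though the document shows no form at all.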

Please open the logs, and set the filter to debug. There should be a related line that says "Calling OCRmyPDF with ..." or similar. Please post that.

This is also part of #246. I'm figuring out a better workflow to support more file types.

Edit:

If you want to help me figure out #246 a little more, set PAPERLESS_OCR_MODE=force in your configuration file, restart, and try adding that file again. Please report back if that works. This will disregard any weird PDF content, convert everything into images, and add selectable text on top. This mode is (should be) compatible with every PDF file. However, the resulting file might look blurry when zoomed in and might be somewhat larger.

Would you rather have paperless:

fail on invalid files, or
fall back to "force" for invalid files, and produce files that may be much larger (due to conversion of every page to images)?
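The second option could look roughly like the sketch below. It is not the paperless-ng implementation: `ocr_with_fallback` is a hypothetical helper, and the OCR function and the exception type are passed in as parameters so the logic can be read (and tried) without OCRmyPDF installed. The keyword names `redo_ocr`, `skip_text` and `force_ocr` are OCRmyPDF options.

```python
from typing import Any, Callable, Dict, Type


def ocr_with_fallback(
    run_ocr: Callable[..., Any],
    args: Dict[str, Any],
    recoverable: Type[BaseException],
) -> str:
    """Run OCR with the configured arguments; on a recoverable
    error, retry once with force_ocr. Returns which mode succeeded."""
    try:
        run_ocr(**args)
        return "as-configured"
    except recoverable:
        # Drop the mode flags that conflict with force_ocr, keep the rest.
        fallback = {
            key: value
            for key, value in args.items()
            if key not in ("redo_ocr", "skip_text")
        }
        fallback["force_ocr"] = True
        run_ocr(**fallback)  # a second failure propagates to the caller
        return "force"
```

With OCRmyPDF one would presumably pass `ocrmypdf.ocr` as `run_ocr` and `ocrmypdf.exceptions.InputFileError` (the exception in the traceback above) as `recoverable`. Returning the mode lets the caller log that the archived file was rasterised and may be larger.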

2 replies

msrv Jan 31, 2021
Author

Hi Jonas and thanks for your reply :)

As requested, here's the debug log when using "redo" in the OCR settings.

31.01.21, 20:39 ERROR Error while consuming document Scan_1_28012021_003254.pdf:
31.01.21, 20:39 DEBUG Deleting directory /tmp/paperless/paperless-8n5lt6am
31.01.21, 20:39 DEBUG Encountered an error: . Trying to use text from original.
31.01.21, 20:39 DEBUG Calling OCRmyPDF with {'input_file': '/usr/src/paperless/src/../consume/Scan_1_28012021_003254.pdf', 'output_file': '/tmp/paperless/paperless-8n5lt6am/archive.pdf', 'use_threads': True, 'jobs': 3, 'language': 'deu', 'output_type': 'pdfa', 'progress_bar': False, 'clean': True, 'pages': '1-2', 'redo_ocr': True}
31.01.21, 20:39 DEBUG Parsing Scan_1_28012021_003254.pdf...
31.01.21, 20:39 DEBUG Parser: RasterisedDocumentParser
31.01.21, 20:39 DEBUG Detected mime type: application/pdf
31.01.21, 20:39 INFO Consuming Scan_1_28012021_003254.pdf
31.01.21, 20:39 INFO Polling directory for changes: /usr/src/paperless/src/../consume

Here is the debug log when using "force" in the OCR settings:

31.01.21, 20:43 INFO Document 2019-06-01 Praxis Scan_1_28012021_003254 consumption finished
31.01.21, 20:43 DEBUG Deleting directory /tmp/paperless/paperless-2g7_dxxx
31.01.21, 20:43 DEBUG Deleting file /usr/src/paperless/src/../consume/Scan_1_28012021_003254.pdf
31.01.21, 20:43 DEBUG Assigning correspondent Praxis to 2019-06-01 Scan_1_28012021_003254
31.01.21, 20:43 DEBUG Assigning Correspondent Praxis to document 2019-06-01 Scan_1_28012021_003254 because it contains this word: praxis
31.01.21, 20:43 DEBUG Saving record to database
31.01.21, 20:43 DEBUG Execute: optipng -silent -o5 /tmp/paperless/paperless-2g7_dxxx/convert.png -out /tmp/paperless/paperless-2g7_dxxx/thumb_optipng.png
31.01.21, 20:43 DEBUG Execute: convert -density 300 -scale 500x5000> -alpha remove -strip -auto-orient /usr/src/paperless/src/../consume/Scan_1_28012021_003254.pdf[0] /tmp/paperless/paperless-2g7_dxxx/convert.png
31.01.21, 20:43 DEBUG Generating thumbnail for Scan_1_28012021_003254.pdf...
31.01.21, 20:42 DEBUG Calling OCRmyPDF with {'input_file': '/usr/src/paperless/src/../consume/Scan_1_28012021_003254.pdf', 'output_file': '/tmp/paperless/paperless-2g7_dxxx/archive.pdf', 'use_threads': True, 'jobs': 3, 'language': 'deu', 'output_type': 'pdfa', 'progress_bar': False, 'clean': True, 'pages': '1-2', 'force_ocr': True}
31.01.21, 20:42 DEBUG Parsing Scan_1_28012021_003254.pdf...
31.01.21, 20:42 DEBUG Parser: RasterisedDocumentParser
31.01.21, 20:42 DEBUG Detected mime type: application/pdf
31.01.21, 20:42 INFO Consuming Scan_1_28012021_003254.pdf
31.01.21, 20:42 INFO Polling directory for changes: /usr/src/paperless/src/../consume

And then the document is OCRd successfully but looks blurry.
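To compare the two runs, the argument dictionaries can be pulled out of the "Calling OCRmyPDF with ..." lines and diffed. The following is a small sketch, assuming the text after the marker is always a valid Python literal as in the logs above; `parse_ocr_args` and `diff_args` are illustrative names, not part of paperless-ng.

```python
import ast
from typing import Any, Dict, Iterable, Tuple

MARKER = "Calling OCRmyPDF with "


def parse_ocr_args(log_line: str) -> Dict[str, Any]:
    """Extract the argument dict printed after MARKER in a debug line."""
    _, _, literal = log_line.partition(MARKER)
    if not literal:
        raise ValueError("line does not contain an OCRmyPDF call")
    # literal_eval only accepts Python literals, so nothing is executed.
    return ast.literal_eval(literal.strip())


def diff_args(
    first: Dict[str, Any],
    second: Dict[str, Any],
    ignore: Iterable[str] = ("input_file", "output_file"),
) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key to (value in first, value in second).

    A key missing on one side is reported as None for that side.
    Path arguments are ignored by default because the temporary
    directory changes on every run.
    """
    keys = (set(first) | set(second)) - set(ignore)
    return {
        key: (first.get(key), second.get(key))
        for key in sorted(keys)
        if first.get(key) != second.get(key)
    }
```

Applied to the two log lines above, the only remaining difference is `redo_ocr` in the first run versus `force_ocr` in the second.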

I guess it depends on what happens with the file. Does paperless-ng store an archived, unchanged version of the file when I select "force"? If so, it would be best to default to "skip" and fall back to "force" when that fails.

Thank you!

jonaswinkler Jan 31, 2021
Maintainer

> I guess it depends on what happens with the file. Does Paperless-ng store an archived and unchanged version of the file, when I select "force"? Then it would be best if it defaults to "skip" and when it fails to "force".

--force renders every page as an image, which is why the result looks blurry. However, this approach works with every PDF document, and that's why I'm considering adding it as a fallback option.

Original documents are always stored unmodified. Normally you don't need to access these files all that often, but if you do, there's a dropdown button on the document details page next to the download button. I might need to make that more visible.

I'll also need to write some actual user documentation that explains all these features.
