This is an issue with the underlying OCR library, which does not support this particular file. Regarding the fillable form: PDF is a pretty wild format, and some hidden elements in that document might look like form fields. Please open the logs and set the filter to debug. There should be a related line that says "Calling OCRmyPDF with ..." or similar. Please post that. This is also part of #246; I'm figuring out a better workflow to support more file types. Edit: If you want to help me figure out #246 a little more, set Would you rather have paperless:
-
Hi Jonas,
thank you for providing paperless-ng! It is an awesome project. However, the consumer of my paperless-ng installation (1.0, Docker, Debian 10) fails to OCR a scanned PDF.
System:
Operating System: Debian GNU/Linux 10 (buster)
Kernel: Linux 4.19.0-13-amd64
Architecture: x86-64
Docker:
Docker version 20.10.2, build 2291f61
Docker-compose:
docker-compose version 1.27.4, build 40524192
Steps to reproduce the problem:
Expected behavior:
Successful OCR of the document
Error log of failed task:
`: Traceback (most recent call last):
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 176, in parse
ocrmypdf.ocr(**ocr_args)
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 326, in ocr
return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 368, in run_pipeline
validate_pdfinfo_options(context)
File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 193, in validate_pdfinfo_options
raise InputFileError()
ocrmypdf.exceptions.InputFileError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/paperless/src/documents/consumer.py", line 179, in try_consume_file
document_parser.parse(self.path, mime_type, self.filename)
File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 193, in parse
raise ParseError(e)
documents.parsers.ParseError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
res = f(*task["args"], **task["kwargs"])
File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file
override_tag_ids=override_tag_ids)
File "/usr/src/paperless/src/documents/consumer.py", line 196, in try_consume_file
raise ConsumerError(e)
documents.consumer.ConsumerError`
Error log of webserver when running docker-compose up:
webserver_1 | ERROR 2021-01-28 22:58:34,107 _pipeline This PDF has a user fillable form. --redo-ocr is not currently possible on such files.
webserver_1 | ERROR 2021-01-28 22:58:34,114 loggers Error while consuming document Scan_1_28012021_003254.pdf:
webserver_1 | 22:58:34 [Q] ERROR Failed [Scan_1_28012021_003254.pdf] - : Traceback (most recent call last):
webserver_1 | File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 176, in parse
webserver_1 | ocrmypdf.ocr(**ocr_args)
webserver_1 | File "/usr/local/lib/python3.7/site-packages/ocrmypdf/api.py", line 326, in ocr
webserver_1 | return run_pipeline(options=options, plugin_manager=plugin_manager, api=True)
webserver_1 | File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_sync.py", line 368, in run_pipeline
webserver_1 | validate_pdfinfo_options(context)
webserver_1 | File "/usr/local/lib/python3.7/site-packages/ocrmypdf/_pipeline.py", line 193, in validate_pdfinfo_options
webserver_1 | raise InputFileError()
webserver_1 | ocrmypdf.exceptions.InputFileError
webserver_1 |
webserver_1 | During handling of the above exception, another exception occurred:
webserver_1 |
webserver_1 | Traceback (most recent call last):
webserver_1 | File "/usr/src/paperless/src/documents/consumer.py", line 179, in try_consume_file
webserver_1 | document_parser.parse(self.path, mime_type, self.filename)
webserver_1 | File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 193, in parse
webserver_1 | raise ParseError(e)
webserver_1 | documents.parsers.ParseError
webserver_1 |
webserver_1 | During handling of the above exception, another exception occurred:
webserver_1 |
webserver_1 | Traceback (most recent call last):
webserver_1 | File "/usr/local/lib/python3.7/site-packages/django_q/cluster.py", line 436, in worker
webserver_1 | res = f(*task["args"], **task["kwargs"])
webserver_1 | File "/usr/src/paperless/src/documents/tasks.py", line 73, in consume_file
webserver_1 | override_tag_ids=override_tag_ids)
webserver_1 | File "/usr/src/paperless/src/documents/consumer.py", line 196, in try_consume_file
webserver_1 | raise ConsumerError(e)
webserver_1 | documents.consumer.ConsumerError
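Most of this output is traceback noise, so a small filter can help when digging through a longer capture. The following is only a rough sketch: it assumes the log was saved as plain text (for example from `docker-compose logs`), strips the `webserver_1 | ` style prefix, and keeps lines containing `ERROR` or the `Calling OCRmyPDF` debug message mentioned in the reply. The function name `interesting_lines` and the default search strings are made up for illustration.

```python
import re

# Matches the "service_1 | " prefix that docker-compose puts in front of each line.
PREFIX = re.compile(r"^\s*[\w.-]+\s*\|\s?")


def interesting_lines(log_text, needles=("ERROR", "Calling OCRmyPDF")):
    """Return log lines (compose prefix stripped) containing any of the needles."""
    hits = []
    for raw in log_text.splitlines():
        line = PREFIX.sub("", raw, count=1).rstrip()
        if any(needle in line for needle in needles):
            hits.append(line)
    return hits


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python filter_log.py compose.log
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            for line in interesting_lines(fh.read()):
                print(line)
```

Note that the `Calling OCRmyPDF` line will only show up if debug logging was active when the document was consumed.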
I noticed the line:
ERROR 2021-01-28 22:58:34,107 _pipeline This PDF has a user fillable form. --redo-ocr is not currently possible on such files.
The PDF definitely has no user-fillable form; it is a non-OCR scan of a printed document.
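A file can carry form-related structures without showing any visible form fields. In the PDF format, interactive forms are declared through an `/AcroForm` entry in the document catalog, so one rough way to check a scan is to search its raw bytes for that key. The sketch below is a heuristic only, not a statement about how OCRmyPDF performs its own check: a hit suggests a form dictionary is present (possibly an empty one), while a miss is not conclusive because the catalog may be stored in a compressed object stream. The helper name `mentions_acroform` is hypothetical.

```python
def mentions_acroform(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes contain the /AcroForm key.

    False is not conclusive, since the key may be hidden inside a
    compressed object stream that a plain byte search cannot see.
    """
    return b"/AcroForm" in pdf_bytes


if __name__ == "__main__":
    import sys

    # Usage: python check_form.py Scan_1_28012021_003254.pdf
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            print(path, mentions_acroform(fh.read()))
```

If this prints `True` for the scan, the scanner software probably wrote a form dictionary into the file, which would be consistent with the error message above.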
Steps I tried:
Do you have any ideas on this?
Edit: Manually adding the PDF via the uploader throws the same error.