Ghostscript pdf not recognised #1948

grego1981 · 2022-10-03T11:06:16Z

grego1981
Oct 3, 2022

Please provide all mandatory information!

Describe the bug (mandatory)

I have a batch of PDFs with the following attributes:
Producer: GPL Ghostscript 9.15
PDF Version: 1.4
PDF2Text.py from the utilities does not recognise their text and returns garbage data.

To Reproduce (mandatory)

Run PDF2Text.py on the attached PDF. The produced text has nothing to do with the PDF's content.

Expected behavior (optional)

Get the pdf text

Screenshots (optional)

If applicable, add screenshots to help explain your problem.

Your configuration (mandatory)

Operating system: Ubuntu 20.04
Python version: 3.8.10
PyMuPDF version: 1.20.2, installation method: wheel

print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)
3.8.10 (default, Jun 22 2022, 20:18:18)
[GCC 9.4.0]
linux

PyMuPDF 1.20.2: Python bindings for the MuPDF 1.20.3 library.
Version date: 2022-08-13 00:00:01.
Built for Python 3.8 on linux (64-bit).

Attached the example pdf
example.pdf

Answered by JorjMcKie

JorjMcKie · 2022-10-03T11:41:02Z

JorjMcKie
Oct 3, 2022
Maintainer

I am sorry, but this is a MuPDF problem ... if it is a problem at all! The file uses a so-called Type3 font and contains no information for translating glyphs back to Unicode values.
The /Encoding object (xref 107) would have to contain this information, which it doesn't.
The only thing you can do is OCR the pages ...
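
For anyone who wants to check their own files for the same situation, here is a minimal sketch, not the maintainer's method. It assumes `page.get_fonts()` returns tuples of the form (xref, ext, type, basefont, name, encoding) and that a missing dictionary key is reported by `doc.xref_get_key()` as "null". The names `type3_fonts` and `report_type3` are made up for this example.

```python
from typing import Iterable, List, Tuple

# Assumed shape of one entry of page.get_fonts():
# (xref, ext, type, basefont, name, encoding)
FontEntry = Tuple[int, str, str, str, str, str]


def type3_fonts(entries: Iterable[FontEntry]) -> List[FontEntry]:
    """Return only the entries whose font type is Type3."""
    return [e for e in entries if e[2] == "Type3"]


def report_type3(path: str) -> None:
    """Print each page's Type3 fonts and whether they have a /ToUnicode entry."""
    import fitz  # PyMuPDF; imported lazily so type3_fonts works without it

    doc = fitz.open(path)
    for page in doc:
        for entry in type3_fonts(page.get_fonts()):
            xref = entry[0]
            # xref_get_key returns ("null", "null") when the key is absent
            kind, _ = doc.xref_get_key(xref, "ToUnicode")
            print(page.number, entry[4], "ToUnicode present:", kind != "null")
```

Calling `report_type3("example.pdf")` lists the Type3 fonts per page. A Type3 font without /ToUnicode and without usable glyph names in its /Encoding is a strong hint that plain text extraction will return garbage for that file.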

grego1981 · 2022-10-03T12:20:45Z

grego1981
Oct 3, 2022
Author

Thank you for the prompt reply! So, for this kind of file, is it possible to use the font replacement scripts? I've tried OCRing it, but many random characters appear where lines are close to the letters, although the text output is about 90% correct.

JorjMcKie · 2022-10-03T12:54:40Z

JorjMcKie
Oct 3, 2022
Maintainer

Unfortunately not, because font replacement also uses text extraction internally 😒.
You may be able to improve the OCR quality by using a higher resolution; this works in ocrmypdf as well as in PyMuPDF. Both are based on Tesseract anyway.
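
A rough sketch of the higher-resolution route in PyMuPDF, under some assumptions: Tesseract's language data is installed and can be found (for example via TESSDATA_PREFIX), 300 dpi is only a starting value to experiment with, and `dpi_to_zoom` and `ocr_page_text` are names invented for this example. The page is rendered to a pixmap at the chosen resolution, the pixmap is OCRed into a one-page PDF, and the text is read back from that PDF.

```python
def dpi_to_zoom(dpi: float) -> float:
    """PDF user space has 72 units per inch, so the zoom factor is dpi / 72."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / 72.0


def ocr_page_text(path: str, page_number: int = 0, dpi: float = 300,
                  language: str = "eng") -> str:
    """Render one page at the given resolution and return its OCRed text."""
    import fitz  # PyMuPDF; imported lazily so dpi_to_zoom works without it

    zoom = dpi_to_zoom(dpi)
    doc = fitz.open(path)
    page = doc[page_number]
    # Render the page as an image at the requested resolution
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    # Let Tesseract build a one-page PDF that carries an OCR text layer
    ocr_pdf = pix.pdfocr_tobytes(language=language)
    ocr_doc = fitz.open("pdf", ocr_pdf)
    return ocr_doc[0].get_text()
```

Comparing, say, `ocr_page_text("example.pdf", dpi=150)` with `ocr_page_text("example.pdf", dpi=400)` shows whether resolution makes a difference for the stray characters near ruling lines.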

grego1981 · 2022-10-03T13:26:56Z

grego1981
Oct 3, 2022
Author

I'll try that as soon as possible!

JorjMcKie · 2022-10-05T10:30:44Z

JorjMcKie
Oct 5, 2022
Maintainer

I am going to convert this to a Discussions item, so you may add any further observation to it later on.
