Change Visibility of OCR'd pdf text layer #3537

mikejokic · 2024-05-31T05:47:22Z

Is your feature request related to a problem? Please describe.

I have OCR'd an image to generate a text layer over the image. This text layer is invisible in the PDF. I then use Ghostscript to remove the image and vector data, keeping only the text layer, to further reduce file size while keeping the page's textual structure intact.

TestOCR.pdf - OCR'd image as pdf

TestOCR_textonly.pdf - image and vector data removed using Ghostscript with -dFILTERIMAGE -dFILTERVECTOR. Highlighting over this "blank" PDF shows that the text layer is still there.

TestOCR.pdf

TestOCR_textonly.pdf
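
For reference, the Ghostscript step described above could be scripted roughly as follows. This is a minimal sketch, not the exact command used for the attached files: the function names are hypothetical, the executable is assumed to be `gs` on the PATH (on Windows it is typically `gswin64c`), and `pdfwrite` is assumed as the output device.

```python
import subprocess
from typing import List


def build_gs_command(src: str, dst: str, gs: str = "gs") -> List[str]:
    """Return a Ghostscript command that rewrites src without images and vector graphics."""
    return [
        gs,
        "-o", dst,             # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",   # write a PDF again (assumed output device)
        "-dFILTERIMAGE",       # drop raster images
        "-dFILTERVECTOR",      # drop vector drawings
        src,
    ]


def strip_images_and_vectors(src: str, dst: str, gs: str = "gs") -> None:
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_gs_command(src, dst, gs), check=True)
```

Calling `strip_images_and_vectors("TestOCR.pdf", "TestOCR_textonly.pdf")` would then produce the text-only file, provided Ghostscript is installed.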

Describe the solution you'd like

Make this text layer visible in TestOCR_textonly.pdf. I want the OCR'd text to be visible following the same structural layout as the input.
Can I change the render mode or color for all the text in this pdf to be visible?
My pipeline will eventually deal with very large PDF files, so I would like the solution to be performant as well.

@JorjMcKie I have tried your solutions for changing text font color found here but to no avail. Would really appreciate any support.

JorjMcKie (Maintainer) · 2024-05-31T09:06:46Z

OCR-ed text can be made invisible in a number of different ways.
Choosing some color (like white-on-white or a black cat in the night) is not among these alternatives,
so changing the text color is the wrong idea.

Sometimes the text is written in the "background", such that the image covers it.
Most often, though, OCR-ed text is stored with the PDF attribute "hidden", so removing the image will still not make it visible.

You can locate the respective PDF command `3 Tr` and remove or change it in a hacky way.
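
A rough sketch of that hack, for PDFs whose hidden text uses a font that actually has glyphs: rewrite render mode 3 (invisible) to mode 0 (fill) in each page content stream. The helper names `unhide_text` and `unhide_pdf` are made up for illustration. The regex is deliberately naive: it would also alter a literal "3 Tr" inside a text string, and it does not look into Form XObjects. PyMuPDF is assumed to be importable as `pymupdf` (older releases use `fitz`).

```python
import re

# "3 Tr" sets PDF text render mode 3 (neither fill nor stroke, i.e. invisible).
# The lookarounds keep the pattern from matching inside "13 Tr" or a longer operator name.
HIDDEN_TR = re.compile(rb"(?<![\w.+-])3(\s+)Tr(?![A-Za-z0-9])")


def unhide_text(stream: bytes) -> bytes:
    """Switch every render mode 3 to mode 0 (fill) in one content stream."""
    return HIDDEN_TR.sub(rb"0\g<1>Tr", stream)


def unhide_pdf(src: str, dst: str) -> None:
    """Apply unhide_text to all page content streams (requires PyMuPDF)."""
    import pymupdf  # imported lazily so unhide_text stays usable without it

    doc = pymupdf.open(src)
    for page in doc:
        for xref in page.get_contents():   # xrefs of the page's content streams
            old = doc.xref_stream(xref)    # decompressed stream bytes
            new = unhide_text(old)
            if new != old:
                doc.update_stream(xref, new)
    doc.save(dst, garbage=3, deflate=True)
    doc.close()
```
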
In your case, however, OCR was obviously done with Tesseract. Its `GlyphLessFont` means exactly that: it is a font for which no visible representation exists, in other words, there exist no glyphs. So all the previous hacks will still not lead to visible text!

The only way I see is this approach:

1. Remove the image(s) etc.
2. Replace `GlyphLessFont` by Courier. `GlyphLessFont` is a mono-spaced font, so Courier is a possible / good choice. You can use the font replacement script here.
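
As an illustration only, and explicitly not the font replacement script mentioned above, the same effect can be approximated with a different technique: extract each word with its bounding box and redraw it in Courier on a fresh page. The sketch assumes PyMuPDF is importable as `pymupdf`, unrotated pages, and text that the built-in Courier font can encode. The function names and the 0.2 descender share used for the baseline are assumptions. The font size relies on every Courier glyph being 0.6 em wide.

```python
COURIER_ADVANCE = 0.6  # every Courier glyph is 600/1000 em wide


def fit_courier_size(text: str, box_width: float) -> float:
    """Font size at which text set in Courier is exactly box_width wide."""
    if not text:
        return 0.0
    return box_width / (COURIER_ADVANCE * len(text))


def rebuild_visible(src: str, dst: str) -> None:
    """Redraw every extracted word in Courier on new, empty pages (requires PyMuPDF)."""
    import pymupdf  # imported lazily so fit_courier_size stays usable without it

    src_doc = pymupdf.open(src)
    out = pymupdf.open()  # new empty PDF
    for page in src_doc:
        new_page = out.new_page(width=page.rect.width, height=page.rect.height)
        # Each "words" item starts with x0, y0, x1, y1, word
        for x0, y0, x1, y1, word, *_ in page.get_text("words"):
            size = fit_courier_size(word, x1 - x0)
            if size <= 0:
                continue
            # Rough baseline: lift the text off the box bottom by an assumed descender share
            new_page.insert_text((x0, y1 - 0.2 * size), word,
                                 fontname="cour", fontsize=size)
    out.save(dst, garbage=3, deflate=True)
    out.close()
    src_doc.close()
```

Because this writes new pages instead of editing the existing ones, images and vector graphics are dropped as a side effect. Whether it is fast enough for very large files would need to be measured.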

This is what comes out in your test case:

Pretty ugly ... 🤷‍♂️
