Adding text layer to a scanned PDF #775

andrei-volkau · 2020-12-17T16:52:01Z

andrei-volkau
Dec 17, 2020

It would be good to know how to add a text layer to a scanned PDF.

Let's consider the following document as an example.
test.pdf

Raw JSON response of Amazon Textract API call:
pdf-response.json.zip

JorjMcKie · 2020-12-17T17:21:45Z

JorjMcKie
Dec 17, 2020
Maintainer

Without really knowing the internals of the attached JSON info, let's assume it contains all the required text.
Then simply create a TextWriter object and append the text piece by piece:

import fitz  # PyMuPDF

tw = fitz.TextWriter(page.rect)  # need the intended page's size here
# for each text piece (a word, a string, a character, ... everything goes)
tw.append(
    pos,  # the insertion point
    text,  # the text to insert
    font=font,  # a fitz.Font(...) object
    fontsize=fontsize,
)
# ... repeat the above with arbitrary other fonts / fontsizes, when done:
# write the whole text writer as hidden (render mode 3) text;
# other keyword arguments can follow. Older PyMuPDF versions call this writeText.
tw.write_text(page, render_mode=3)
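As a rough sketch of how the Textract response could feed the steps above: Textract reports each WORD block with a bounding box expressed as ratios of the page width and height, so the boxes must be scaled to page points before calling append. The helper names textract_words_to_spans and add_hidden_text are hypothetical, and using the box height as font size and the box bottom as baseline are simplifying assumptions (page rotation is ignored). write_text is the current name of the writeText method shown above.

```python
def textract_words_to_spans(blocks, page_width, page_height):
    """Turn Textract WORD blocks into (pos, text, fontsize) tuples in page points."""
    spans = []
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue  # skip PAGE, LINE and other block types
        box = block["Geometry"]["BoundingBox"]  # ratios of page width / height
        left = box["Left"] * page_width
        bottom = (box["Top"] + box["Height"]) * page_height
        height = box["Height"] * page_height
        # Assumption: box bottom approximates the baseline, box height the font size.
        spans.append(((left, bottom), block["Text"], height))
    return spans


def add_hidden_text(page, spans, fontname="helv"):
    """Write the spans onto a PyMuPDF page as invisible text (render mode 3)."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    font = fitz.Font(fontname)
    tw = fitz.TextWriter(page.rect)
    for pos, text, fontsize in spans:
        tw.append(pos, text, font=font, fontsize=fontsize)
    tw.write_text(page, render_mode=3)
```

A typical call would be add_hidden_text(page, textract_words_to_spans(response["Blocks"], page.rect.width, page.rect.height)), followed by saving the document. For a multi-page response, the blocks would first need to be filtered by their page.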

1 reply

sjscotti Jun 4, 2022

Hi, I have a question.
Since some OCR systems segment text into blocks, lines, and words (e.g., OCR-D, which also has repos on GitHub), is there a way to include that block and line information along with the word information in the hidden text? This information can help identify reading order or separate the text of different articles in complex layouts such as newspapers. Since block number, line number, and word number can be extracted from a PDF using page.get_text("words"), I would think there is a way to set those parameters when writing a PDF too.

JorjMcKie · 2022-06-06T13:30:35Z

JorjMcKie
Jun 6, 2022
Maintainer

@sjscotti - not quite clear what you mean:
If you do page.get_text(...) you will get all the same information whether OCR or not - including blocks, lines, ...
Including output from other packages is a niche request for PyMuPDF. You can develop something yourself to integrate that with the PyMuPDF output.

1 reply

sjscotti Jun 7, 2022

I implemented something much like you described above where I am using a series of TextWriter.append commands to place each individual word on an image-only PDF. Since I am only supplying the

    pos,  # the insertion point
    text,  # the text to insert
    font=font,  # a fitz.Font(...) object
    fontsize=fontsize,

info, as done in your example, when I later reopen the PDF and run Page.get_text("words"), it returns the bounding box and text of each word along with its block, line, and word numbers. But how does it know what to use for block, line, and word number, since I never supplied those when creating the text on the page? I found out that it does create them. However, I would have grouped them differently, to correspond to what the OCR program I am using detected.

JorjMcKie · 2022-06-08T04:24:39Z

JorjMcKie
Jun 8, 2022
Maintainer

@sjscotti

But how does it know what to use for block, line, and word number

This information is generated automatically by MuPDF heuristics, when a page is read.
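Since the block, line, and word numbers are derived at read time, one way to get the OCR program's grouping back is to regroup the extracted words yourself. The following is only a sketch under assumptions: group_words_by_region is a hypothetical helper, the regions mapping (region id to rectangle, in page coordinates) is assumed to come from the OCR tool's own output, and a word is assigned to the first region containing its center. The word tuples have the shape returned by page.get_text("words").

```python
from collections import defaultdict


def group_words_by_region(words, regions):
    """Assign each word to the first OCR region whose rectangle contains its center.

    words:   items shaped like page.get_text("words") output,
             (x0, y0, x1, y1, text, block_no, line_no, word_no)
    regions: hypothetical mapping of region id -> (x0, y0, x1, y1)
    """
    grouped = defaultdict(list)
    for x0, y0, x1, y1, text, *_ in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # center of the word box
        for region_id, (rx0, ry0, rx1, ry1) in regions.items():
            if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
                grouped[region_id].append(text)
                break
        else:
            grouped[None].append(text)  # word lies outside every OCR region
    return dict(grouped)
```

The MuPDF-generated block_no, line_no, and word_no are ignored here on purpose. They could still be used as a tie-breaker for ordering words inside a region.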

0 replies
