How handle text fragment in pdf? #3547

Muhammadraafat1 · 2024-06-04T16:36:44Z

My code works well on one file (call it pdf1), but when I applied it to another (pdf2) it did not work well, specifically when redacting the address.
For example, we should replace this text: word="610 E Griffin Pkwy Suite D Mission, TX", replacement="310 W Palmirs Blvd Suite F Antonio, NM".
the output of pdf1:

the output of pdf2:

When I tried to find out why it works only in pdf1, I printed page.text().
In pdf2 I found:
Address
1:
610
E
Griffin
Pkwy
Suite D
Mission,
TX
78572

But in pdf1:
Address 1:
610 E Griffin Pkwy Suite D Mission, TX
78572
So, is this the problem or not? And if it is, how do I deal with PDFs that have text in this format?

Answered by JorjMcKie

JorjMcKie · 2024-06-05T11:51:35Z

Maintainer

I think I already tried to explain this to you in another Discussions post:
You are obviously dealing with OCR'ed pages. So you are not looking at actual text, but at images!
When you search for or extract text, you get only the information that your OCR engine was able to detect.

This is always error-prone!

The text rectangles may not exactly match the corresponding text in the image (for whatever reason); dirt or skewed scanning may have confused the logic. The same is true for drawings: the OCR engine may think a drawing is text, or, conversely, your redaction / text insertion may destroy text borders that you actually wish to retain, and so on.

So depending on the specific situation on a page, OCR may deliver one line in one case or multiple lines / words in another case, where we as humans immediately understand that the intention is the same in both cases.
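As a rough illustration of that point, the two extractions quoted in the question differ only in where the line breaks fall. A minimal sketch, assuming the extracted text is available as a plain string (the helper name `normalize` is made up for this example):

```python
import re

def normalize(text: str) -> str:
    """Collapse every run of whitespace (including line breaks) to one space."""
    return re.sub(r"\s+", " ", text).strip()

# Extracted text as quoted in the question
pdf1_text = "Address 1:\n610 E Griffin Pkwy Suite D Mission, TX\n78572"
pdf2_text = "Address\n1:\n610\nE\nGriffin\nPkwy\nSuite D\nMission,\nTX\n78572"

target = "610 E Griffin Pkwy Suite D Mission, TX"

# After normalization both pages yield the same string
same = normalize(pdf1_text) == normalize(pdf2_text)

# The target is only found in pdf2 once the line breaks are collapsed
found_raw = target in pdf2_text
found_normalized = target in normalize(pdf2_text)
```

This only tells you that the phrase is present. It does not tell you where on the page it is, which is what a redaction needs.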

You simply have to develop code that can cope with these problems.
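One possible way to cope, sketched under assumptions: instead of searching for the whole phrase, match it word by word against the word list and collect one rectangle per word. The tuple layout below is assumed to be the one PyMuPDF's `page.get_text("words")` returns, `(x0, y0, x1, y1, word, block_no, line_no, word_no)`; the function name `find_phrase_rects` and all coordinates are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed word tuple layout: (x0, y0, x1, y1, word, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]
Rect = Tuple[float, float, float, float]

def find_phrase_rects(words: Sequence[Word], phrase: str) -> List[List[Rect]]:
    """For each occurrence of `phrase`, return the rectangles of its words.

    Matching ignores line and block boundaries, so a phrase that OCR split
    over several lines is still found.
    """
    tokens = phrase.split()
    hits: List[List[Rect]] = []
    if not tokens:
        return hits
    texts = [w[4] for w in words]
    for i in range(len(texts) - len(tokens) + 1):
        if texts[i:i + len(tokens)] == tokens:
            hits.append([tuple(w[:4]) for w in words[i:i + len(tokens)]])
    return hits

# Made-up coordinates imitating the fragmented pdf2 layout from the question
sample_words = [
    (10, 10, 60, 20, "Address", 0, 0, 0),
    (10, 22, 20, 32, "1:", 0, 1, 0),
    (10, 34, 30, 44, "610", 0, 2, 0),
    (10, 46, 18, 56, "E", 0, 3, 0),
    (10, 58, 50, 68, "Griffin", 0, 4, 0),
    (10, 70, 40, 80, "Pkwy", 0, 5, 0),
    (10, 82, 40, 92, "Suite", 0, 6, 0),
    (44, 82, 52, 92, "D", 0, 6, 1),
    (10, 94, 60, 104, "Mission,", 0, 7, 0),
    (10, 106, 26, 116, "TX", 0, 8, 0),
    (10, 118, 45, 128, "78572", 0, 9, 0),
]

hits = find_phrase_rects(sample_words, "610 E Griffin Pkwy Suite D Mission, TX")
```

Each rectangle in a hit could then be redacted separately, for example with `page.add_redact_annot(rect)` followed by `page.apply_redactions()` if the library is PyMuPDF, and the replacement text inserted once. Exact token matching will still miss words that the OCR engine misread, so some tolerance for near matches may be needed on top of this.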
