How to handle text fragments in a PDF? #3547
Replies: 1 comment
I think I tried to explain this to you in another Discussions post already: You are obviously dealing with OCR'ed pages. So you are not looking at actual text, but at images!
When you search for or extract text, you only get the information that your OCR engine was able to detect.
This is always error-prone!
The text rectangles may not exactly match the corresponding image text (for whatever reason), and dirt or skewed scanning may have confused the logic. The same is true for drawings: the OCR engine may think they are text. Conversely, your redaction / text insertion may destroy text borders that you actually wish to retain, and so on.
So depending on the specific situation on a page, OCR may deliver one line in one case and multiple lines / words in another, where we as humans immediately understand that the intention is the same in both cases. You simply have to develop code that can cope with these problems.
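One way to cope with OCR returning the same text as a single line on one page and as separate words on another is to regroup the fragments yourself before searching or redacting. The following is only a minimal sketch of that idea, not a method prescribed in this thread: the function name `group_words_into_lines`, the `y_tolerance` parameter, and its default of 3 points are assumptions made for illustration, and the input is assumed to be plain `(x0, y0, x1, y1, text)` tuples.

```python
from typing import List, Tuple

# Hypothetical word record: bounding box plus the recognized text.
Word = Tuple[float, float, float, float, str]  # x0, y0, x1, y1, text


def group_words_into_lines(words: List[Word], y_tolerance: float = 3.0) -> List[str]:
    """Join word fragments whose vertical centers lie close together.

    y_tolerance is an assumed value; it has to be tuned to the scan
    quality and font size of the actual documents.
    """
    lines = []  # each entry: [center y of the line's first word, list of words]
    for w in sorted(words, key=lambda w: (w[1] + w[3]) / 2):
        center = (w[1] + w[3]) / 2
        if lines and abs(center - lines[-1][0]) <= y_tolerance:
            # Close enough to the current line: treat it as the same line.
            lines[-1][1].append(w)
        else:
            # Too far below the current line: start a new one.
            lines.append([center, [w]])
    # Within each line, order the fragments from left to right.
    return [
        " ".join(w[4] for w in sorted(ws, key=lambda w: w[0]))
        for _, ws in lines
    ]


# Slightly skewed fragments of one visual line, plus a second line.
sample = [
    (10, 100, 40, 112, "Invoice"),
    (45, 101.5, 80, 113, "No."),
    (85, 99, 120, 111, "12345"),
    (10, 130, 50, 142, "Total"),
]
print(group_words_into_lines(sample))
```

In PyMuPDF, `page.get_text("words")` returns tuples whose first five items have this layout, so something like `[w[:5] for w in page.get_text("words")]` could serve as input. The tolerance-based grouping is a deliberate simplification: it tolerates small skew, but it will not handle strongly rotated scans or multi-column layouts.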