Duplicated text #2319
Describe the bug (mandatory)
Some of the text read from this PDF is duplicated on a single page (the first page). Visually, the text appears only once on the page. Also, when you send the document through the Adobe PDF -> Doc converter, it does not produce any duplicated text on the first page.
Unfortunately, the text is not exactly duplicated, so a workaround that deduplicates by checking for exact span duplicates is not possible. Possibly related issues: #379 and #218. This is one of many spans/blocks that are duplicated in the attached PDF.
To Reproduce (mandatory)
Your configuration (mandatory)
This is no bug: as you have observed yourself, the text is stored like that in the file.
Took a quick look at the first page: MuPDF does recognize "bold simulation", which is the duplication of same text in (almost) the same position directly following each other. While this is certainly helpful, it cannot cover arbitrary craziness: So it is left to the wits of the developer to find a solution. Something coming into my mind is:
So, one year later, how did you resolve the problem?
Took a quick look at the first page:
As you wrote, there are blocks which simply are exact duplicates of other blocks, while yet others occupy the same place but contain longer text which still starts with the shorter text written at the same position. Interestingly, they also do not necessarily follow each other directly, e.g. block 3 'RESERVE BANK OF VANUATU\n' and block 21 'RESERVE BANK OF VANUATU \n' (note the extra space before the line break).
MuPDF does recognize "bold simulation", which is the duplication of the same text in (almost) the same position directly following each other. While this is certainly helpful, it cannot cover arbitrary craziness:
The duplication in this case is on…
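As a rough illustration of the observation above (blocks that start at the same position, where one text is a whitespace-insensitive prefix of the other), a filter along the following lines might serve as a starting point. This is only a sketch and not anything PyMuPDF provides: the helper names `normalize` and `dedupe_blocks` and the tolerance `tol` are hypothetical, and the coordinates in the sample data are made up. It assumes block tuples shaped like the items returned by `page.get_text("blocks")`, i.e. `(x0, y0, x1, y1, text, block_no, block_type)`.

```python
def normalize(text):
    """Collapse all whitespace so 'A \\n' and 'A\\n' compare equal."""
    return " ".join(text.split())


def dedupe_blocks(blocks, tol=3.0):
    """Drop blocks that repeat (a prefix of) another block at the same origin.

    blocks: tuples shaped like page.get_text("blocks") items:
            (x0, y0, x1, y1, text, block_no, block_type)
    tol:    hypothetical tolerance in points for "same position";
            it needs tuning per document.
    """
    kept = []
    for block in blocks:
        x0, y0, text = block[0], block[1], normalize(block[4])
        if not text:
            # Nothing to compare (e.g. empty text): keep as-is.
            kept.append(block)
            continue
        for i, other in enumerate(kept):
            otext = normalize(other[4])
            if not otext:
                continue
            same_origin = (abs(x0 - other[0]) <= tol
                           and abs(y0 - other[1]) <= tol)
            if not same_origin:
                continue
            if otext.startswith(text):
                # An equal or longer variant is already kept.
                break
            if text.startswith(otext):
                # Current block is the longer variant: replace the kept one.
                kept[i] = block
                break
        else:
            kept.append(block)
    return kept


# Sample data: texts taken from the thread, coordinates invented.
sample = [
    (72.0, 100.0, 300.0, 115.0, "RESERVE BANK OF VANUATU\n", 3, 0),
    (72.0, 200.0, 300.0, 215.0, "Other text\n", 4, 0),
    (72.4, 100.2, 305.0, 115.0, "RESERVE BANK OF VANUATU \n", 21, 0),
]
unique = dedupe_blocks(sample)
```

With PyMuPDF, the input would come from something like `page.get_text("blocks")`. Comparing only the top-left corner rather than the whole rectangle is a deliberate choice here, since a longer variant of the same text will usually have a wider bounding box. Whether this is strict enough for a given file is something to verify visually.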