Your basic problem is that our text extraction does not deliver the inter-character spacing information to you:
In PDF, each character can be positioned individually. Sometimes the characters themselves "decide" their distance to the predecessor; sometimes a modifier is added that shifts the current character slightly to the left or right.
Other differences stem from how justified text is implemented: if the words here are significantly far apart, they form separate spans; otherwise (distance not large enough), MuPDF leaves them in the same span.
Etc.
Whatever algorithm you choose, it will not be perfect or failsafe this way.
You probably have to …
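
To make the point about character distances concrete, here is a minimal sketch (not the approach recommended in this answer, and not part of the library) that measures the horizontal gap between neighbouring characters and regroups them with a threshold. It assumes a span shaped like the ones in PyMuPDF's `rawdict` extraction format, i.e. a dict with a `chars` list whose items carry `c` (the character) and `bbox` (`x0, y0, x1, y1`). The helper names `char_gaps` and `split_on_gaps`, the `min_gap` parameter, and the demo data are hypothetical. It also assumes horizontal left-to-right text, and it only looks at the resulting character boxes, not at the spacing modifiers in the PDF itself.

```python
from typing import Dict, List


def char_gaps(span: Dict) -> List[float]:
    """Return the horizontal gap between each character and its predecessor.

    Assumes span["chars"] is a list of dicts with a "bbox" of (x0, y0, x1, y1),
    ordered left to right on one horizontal line.
    """
    chars = span.get("chars", [])
    gaps = []
    for prev, cur in zip(chars, chars[1:]):
        # Left edge of the current character minus right edge of the previous one.
        gaps.append(cur["bbox"][0] - prev["bbox"][2])
    return gaps


def split_on_gaps(span: Dict, min_gap: float) -> List[str]:
    """Regroup a span's characters, starting a new piece when a gap exceeds min_gap.

    min_gap is a hypothetical, hand-tuned threshold; no single value is
    reliable for every document.
    """
    chars = span.get("chars", [])
    if not chars:
        return []
    pieces = [chars[0]["c"]]
    for gap, ch in zip(char_gaps(span), chars[1:]):
        if gap > min_gap:
            pieces.append(ch["c"])  # gap is wide: start a new piece
        else:
            pieces[-1] += ch["c"]   # gap is narrow: extend the current piece
    return pieces


# Hand-made demo span (invented coordinates, not taken from a real PDF).
demo_span = {
    "chars": [
        {"c": "a", "bbox": (0.0, 0.0, 5.0, 10.0)},
        {"c": "b", "bbox": (5.0, 0.0, 10.0, 10.0)},
        {"c": "c", "bbox": (14.0, 0.0, 19.0, 10.0)},
    ]
}

print(char_gaps(demo_span))
print(split_on_gaps(demo_span, min_gap=2.0))
```

With PyMuPDF, spans of this shape should be reachable through `page.get_text("rawdict")`, nested as `blocks`, then `lines`, then `spans`, then `chars`. Any fixed threshold remains a heuristic: as noted above, justified text and per-character shifts mean that a value which works for one page can fail on the next.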

Answer selected by gokhanbektas