Hyphens in extracted text seem to be broken. #911
Replies: 1 comment
Seems like the problem stems from the handling of special hyphen characters: if the hyphen's Unicode is … Also, if I switch the backend to …
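To check whether a given PDF is affected, it can help to list which hyphen-like code points actually appear in the extracted text. The snippet below is only a minimal sketch using the Python standard library: the `HYPHEN_LIKE` set and both helper names are illustrative assumptions, not part of docling, and the set is not necessarily the code point mentioned in the reply above. It post-processes strings and does not change how docling parses the PDF.

```python
import unicodedata

# Illustrative, non-exhaustive set of hyphen-like code points (an assumption,
# not taken from docling or from the reply above).
HYPHEN_LIKE = {
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
    "\u2012",  # FIGURE DASH
    "\u2013",  # EN DASH
    "\u2212",  # MINUS SIGN
}


def find_hyphen_like(text):
    """Return (index, code point, Unicode name) for each hyphen-like character."""
    hits = []
    for i, ch in enumerate(text):
        if ch == "-" or ch in HYPHEN_LIKE:
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits


def normalize_hyphens(text):
    """Replace every character in HYPHEN_LIKE with ASCII HYPHEN-MINUS."""
    return "".join("-" if ch in HYPHEN_LIKE else ch for ch in text)


sample = "state\u2010of\u2010the-art"
for hit in find_hyphen_like(sample):
    print(hit)
print(normalize_hyphens(sample))
```

Running `find_hyphen_like` on the raw text from another extractor and on the exported JSON text should show whether the source PDF uses a non-ASCII hyphen at the positions where the output is mangled.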
Hi, I've been experimenting with docling and found some weird behaviour when exporting a PDF document to JSON.
Here's my code:
It exports the data to JSON without a problem, but the extracted text gets mangled when it contains hyphens. For example, if the raw text in the PDF is
the respective exported text in the json file is the following:
For some reason, hyphens '-' are removed and batched together at random places. I'm not sure whether this is intended behaviour, a bug, or a broken PDF file; any pointers to what might be happening are appreciated.
P.S. When I use pdfplumber to extract the text, it doesn't mangle hyphens like this.