Handling of Manuscript/Line Numbers #1998
Unanswered
adambuttrick asked this question in Q&A
Replies: 0 comments
From my understanding, Docling was trained on a large corpus of arXiv preprints and similar works. Despite this, it seems to struggle with (or perhaps has learned to reproduce) the line numbers that are often present in these manuscripts, interspersing them throughout the extracted text or introducing odd line-break patterns as a result. Has anyone else encountered this, or identified settings that mitigate it? Currently, I'm using a post-processing normalization step that naively detects and removes line numbers by looking for sequences of integers dispersed throughout a block of text, but this is obviously problematic in certain contexts (e.g. math papers) and does little to mitigate the line-break issue.
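For reference, the kind of naive heuristic described above could look roughly like the sketch below. It is an illustration only, not the exact code in use and not part of Docling: the function name `strip_line_numbers`, the `min_run` threshold, and the token regex are all assumptions made for the example. It treats standalone integers as line-number candidates and removes them only when they form a run of consecutive values (1, 2, 3, ...), leaving other integers in place.

```python
import re

# Standalone integers of up to 4 digits, not glued to words or punctuation.
# (Assumption: line numbers appear as bare tokens surrounded by whitespace.)
INT_TOKEN = re.compile(r"(?<![\w.,-])\d{1,4}(?![\w.,-])")


def strip_line_numbers(text: str, min_run: int = 4) -> str:
    """Remove standalone integers that form a run of consecutive values.

    Hypothetical helper: a run is a chain n, n+1, n+2, ... found by scanning
    forward through the text; chains shorter than `min_run` are left alone.
    """
    tokens = list(INT_TOKEN.finditer(text))
    drop = set()
    for i, start in enumerate(tokens):
        if i in drop:
            continue
        chain = [i]
        expected = int(start.group()) + 1
        # Greedily look ahead for the next value in the sequence,
        # skipping over unrelated integers such as years or counts.
        for j in range(i + 1, len(tokens)):
            if j not in drop and int(tokens[j].group()) == expected:
                chain.append(j)
                expected += 1
        if len(chain) >= min_run:
            drop.update(chain)
    # Delete from the end so earlier character offsets stay valid.
    for i in sorted(drop, reverse=True):
        m = tokens[i]
        text = text[:m.start()] + text[m.end():]
    # Collapse the double spaces left behind; newlines are untouched.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


sample = (
    "1 We study the 2 extraction of text 3 from preprints in 2019 "
    "4 and report results 5 on 12 samples."
)
print(strip_line_numbers(sample))
```

The weaknesses mentioned above show up directly in this sketch: any genuine run of consecutive bare integers in the body text (common in math papers) would be stripped too, and nothing here repairs the line breaks that the numbers introduce.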