Handling of Manuscript/Line Numbers #1998

adambuttrick · 2025-07-27T20:24:51Z

From my understanding, Docling was trained on a large corpus of arXiv preprints and similar works. Despite this, it seems to struggle with (or perhaps has learned) the line numbers that are often present in these manuscripts, interspersing them throughout the extracted text or introducing odd line-break patterns as a result. Has anyone else encountered this or identified settings that mitigate it? Currently, I'm applying post-processing normalization that naively detects and removes them by looking for sequences of integers dispersed throughout a block of text, but this is obviously problematic in certain contexts (e.g. math papers) and does little to mitigate the line-break issue.
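For anyone wanting to reproduce the workaround, here is a minimal sketch of that kind of naive post-processing. It is plain Python and does not use any Docling API; the function name `strip_line_numbers` and the `min_run` threshold are hypothetical, chosen for illustration. It assumes manuscript line numbers show up as standalone integers that increase by exactly one, and it only removes them when at least `min_run` of them appear in sequence.

```python
import re

# Hypothetical helper, not part of Docling. Matches standalone integers of up
# to four digits that are not glued to words, decimals, or hyphenated ranges.
_INT_TOKEN = re.compile(r"(?<![\w.,-])\d{1,4}(?![\w.,-])")


def strip_line_numbers(text: str, min_run: int = 3) -> str:
    """Remove integers that form a consecutive ascending run (n, n+1, ...)."""
    matches = list(_INT_TOKEN.finditer(text))
    drop = []  # (start, end) spans of tokens to delete
    i = 0
    while i < len(matches):
        j = i
        # Extend the run while the next integer is exactly the previous + 1.
        while (j + 1 < len(matches)
               and int(matches[j + 1].group()) == int(matches[j].group()) + 1):
            j += 1
        if j - i + 1 >= min_run:
            drop.extend(m.span() for m in matches[i:j + 1])
        i = j + 1
    # Delete from the end so that earlier offsets stay valid.
    for start, end in reversed(drop):
        # Also consume the spaces or tabs that followed the number.
        while end < len(text) and text[end] in " \t":
            end += 1
        text = text[:start] + text[end:]
    return text


sample = "12 We propose a method 13 that handles 14 noisy input with 3 layers."
cleaned = strip_line_numbers(sample)
```

The run requirement keeps isolated integers such as the `3` in "3 layers", but it still produces false positives wherever the body text itself contains consecutive integers (e.g. "3 by 4 with 5 rows"), and it leaves the stray line breaks untouched.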

Replies: 0 comments
