Hyphens in extracted text seem to be broken. #911

PopcornPaws · 2025-02-07T11:38:00Z

Hi, I've been experimenting with docling and found some odd behaviour when exporting a PDF document to JSON.

Here's my code:

from pathlib import Path
from docling.document_converter import DocumentConverter

path = "../some/path/to/file.pdf"
converter = DocumentConverter()
doc = converter.convert(path).document
doc.save_as_json(Path("exported.json"))

It exports the data to JSON without a problem, but the extracted text gets mangled when it contains hyphens. For example, if the raw text in the PDF is

for the purpose of this Document, commercial-off-the-shelf, off-the-shelf and modified-off-the-shelf software for which evidence of use is available

the corresponding exported text in the JSON file is the following:

{
    "metadata_stuff": "...",
    "orig": "for the purpose of this Document, commercial off the shelf, ---off the shelf --and modified off the shelf software for which evidence of use is available ---",
    "text": "for the purpose of this Document, commercial off the shelf, ---off the shelf --and modified off the shelf software for which evidence of use is available ---"
}

For some reason, the hyphens '-' are removed from where they belong and batched together at seemingly random places. I'm not sure whether this is intended behaviour, a bug, or a broken PDF file. Any pointers to what might be happening are appreciated.

P.S. when I'm using pdfplumber to extract text, it doesn't mangle hyphens like this.

PopcornPaws · 2025-02-07T13:29:08Z

Seems like the problem stems from handling special hyphen characters:

If the hyphen's Unicode code point is U+2010 (HYPHEN), parsing produces the result described above. If it is U+002D (HYPHEN-MINUS {hyphen, dash; minus sign}), it is parsed correctly.
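
To check which hyphen-like characters a piece of text actually contains, something along these lines may help. It is a minimal sketch using only the standard library; the helper names `describe_dashes` and `normalize_hyphens` and the choice of code points to fold into U+002D are assumptions for illustration, not part of docling. It only helps with text where the hyphens are still in place (for example text obtained from another extractor); it cannot repair output in which they have already been displaced.

```python
import unicodedata

# Hyphen-like code points folded into U+002D. This selection is an assumption;
# extend or shrink it to suit the documents at hand.
HYPHEN_LIKE = {
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
}
_TABLE = str.maketrans(HYPHEN_LIKE)


def describe_dashes(text):
    """Return (code point, Unicode name) pairs for every distinct
    dash-punctuation character (general category Pd) found in text."""
    found = {ch for ch in text if unicodedata.category(ch) == "Pd"}
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in sorted(found)]


def normalize_hyphens(text):
    """Replace the hyphen-like characters listed above with HYPHEN-MINUS."""
    return text.translate(_TABLE)


sample = "commercial\u2010off\u2010the\u2010shelf and off-the-shelf"
print(describe_dashes(sample))
print(normalize_hyphens(sample))
```

Running `describe_dashes` on text copied out of the PDF shows whether it uses U+2010, U+002D, or a mix of both.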

Also, if I switch the backend to PyPdfiumDocumentBackend, all hyphens are parsed correctly (with an added whitespace character around them here and there).
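
For anyone wanting to try the same workaround, switching the backend can be sketched roughly as follows. This assumes a docling version in which `PdfFormatOption` accepts a `backend` argument and `PyPdfiumDocumentBackend` lives in `docling.backend.pypdfium2_backend`; module paths may differ between releases. The function name `build_pdfium_converter` is made up for this example.

```python
def build_pdfium_converter():
    """Build a DocumentConverter that parses PDFs with the pypdfium2 backend.

    The imports sit inside the function so that this sketch can be pasted
    into a module without importing docling at load time.
    """
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    return DocumentConverter(
        format_options={
            # Override only the PDF backend; other formats keep their defaults.
            InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend),
        }
    )


# Usage, with the same placeholder path as in the original post:
# converter = build_pdfium_converter()
# doc = converter.convert("../some/path/to/file.pdf").document
```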
