Labels
- bug: Something isn't working
- vtt: issues related to the WebVTT backend
Description
Bug
The current backend parser for WebVTT files lacks some features. In addition, the transformation into DoclingDocument could be improved to help users separate the cue text from the cue metadata.
- Cue blocks with text like `P&L` fail to parse, since `&` is forbidden according to the specs (https://www.w3.org/TR/webvtt1/#webvtt-cue-text-span).
  - We could relax this constraint.
- The voice annotation (the speaker) is parsed in Docling by adding it to the cue span text as a prefix. E.g. `<v Narrator>Welcome</v>` becomes `Narrator: Welcome`.
  - We could put this as a label of the text item, to avoid mixing cue text and cue metadata.
- Cue text spans with mixed formatted text are parsed into Docling inline groups, but spaces are ignored, so it is not possible to reproduce the correct spacing of the text without formatting.
- The language annotation "language:en-US" is not parsed (a warning message is sent).
- REGION and STYLE blocks are not addressed and trigger warnings.
- The WebVTT cue class span is not addressed.
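To make the expected behaviour for the cue text gaps above concrete, here is a minimal standalone sketch. It is not Docling code: `ParsedCue` and `parse_cue_text` are hypothetical names used only for illustration. It assumes the voice, language, and class annotations should be collected as metadata, that tags are stripped without touching the surrounding whitespace, and that a bare `&` is tolerated instead of rejected. Timestamp tags and REGION/STYLE blocks are out of scope for this sketch.

```python
import re
from dataclasses import dataclass, field

# Matches a named cue span tag such as <v Narrator>, <c.loud>, <lang en-US>, </b>.
# Timestamp tags like <00:00:05.000> are intentionally not handled here.
TAG_RE = re.compile(r"<(/?)([a-zA-Z]+)((?:\.[^\s.>]+)*)(?:[ \t]+([^>]*))?>")

# The character references defined for WebVTT cue text.
ESCAPES = {
    "&amp;": "&",
    "&lt;": "<",
    "&gt;": ">",
    "&nbsp;": "\u00a0",
    "&lrm;": "\u200e",
    "&rlm;": "\u200f",
}
ESCAPE_RE = re.compile("|".join(re.escape(key) for key in ESCAPES))


@dataclass
class ParsedCue:
    """Hypothetical container: plain cue text plus the metadata found in tags."""

    text: str = ""
    voices: list[str] = field(default_factory=list)
    languages: list[str] = field(default_factory=list)
    classes: list[str] = field(default_factory=list)


def parse_cue_text(payload: str) -> ParsedCue:
    """Split a cue payload into plain text and annotation metadata."""
    cue = ParsedCue()
    parts: list[str] = []
    pos = 0
    for match in TAG_RE.finditer(payload):
        # Keep the text between tags verbatim so that spacing is preserved.
        parts.append(payload[pos:match.start()])
        pos = match.end()
        closing, name, dotted, annotation = match.groups()
        if closing:
            continue
        # Classes are attached to a tag as ".name" suffixes, e.g. <c.loud>.
        cue.classes.extend(c for c in dotted.split(".") if c)
        if name == "v" and annotation:
            cue.voices.append(annotation.strip())
        elif name == "lang" and annotation:
            cue.languages.append(annotation.strip())
    parts.append(payload[pos:])
    raw = "".join(parts)
    # Relaxed handling: known escapes are decoded, a bare "&" is left as-is.
    cue.text = ESCAPE_RE.sub(lambda m: ESCAPES[m.group(0)], raw)
    return cue
```

With this sketch, `parse_cue_text("<v Narrator>Welcome</v>")` yields the text `Welcome` with `Narrator` kept separately in `voices`, and `parse_cue_text("P&L")` returns the text unchanged rather than failing.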
Steps to reproduce
Check the following script that illustrates the gaps and shows the expected parsed text without metadata and formatting.
Docling version
Docling version: 2.58.0
Docling Core version: 2.48.4
Docling IBM Models version: 3.9.1
Docling Parse version: 4.7.0
Python: cpython-312 (3.12.10)
Platform: macOS-14.7.1-arm64-arm-64bit
Python version
Python 3.12.10