Unsupported features in WebVTT parser #2525

@ceberam

Description

Bug

The current backend parser for WebVTT files lacks some features. In addition, the transformation into DoclingDocument could be improved to help users separate the cue text from the cue metadata.

  • Cue blocks with text like P&L fail to parse, since an unescaped & is forbidden according to the spec (https://www.w3.org/TR/webvtt1/#webvtt-cue-text-span).
    • We could relax this constraint.
  • The voice annotation (the speaker) is parsed in Docling by adding it to the cue span text as a prefix. E.g. <v Narrator>Welcome</v> becomes Narrator: Welcome
    • We could store it as a label of the text item instead, to avoid mixing cue text and cue metadata.
  • Cue text spans with mixed formatting are parsed into Docling inline groups, but the spaces between spans are ignored, so the correct spacing of the unformatted text cannot be reproduced.
  • The language annotation "language:en-US" is not parsed (a warning is emitted).
  • REGION and STYLE blocks are not handled and trigger warnings.
  • The WebVTT cue class span is not handled.
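
To make the expected behavior concrete, here is a minimal sketch of how a cue payload could be split into plain text and speaker metadata. It is not Docling's implementation: `ParsedCue` and `parse_cue_payload` are hypothetical names, the sketch uses only the standard library, and it assumes a relaxed reading of the spec in which a bare & is kept as-is. It only decodes a small subset of character references and ignores the language, REGION, and STYLE items.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Any cue span tag: <v Narrator>, <c.loud>, <b>, </i>, <00:00:01.000>, ...
_TAG = re.compile(r"</?[A-Za-z0-9][^>]*>")
# Voice start tag, with optional classes; captures the speaker name.
_VOICE = re.compile(r"<v(?:\.[^\s>]+)?[ \t]+([^>]+)>")
# Only a small subset of character references is decoded in this sketch.
_ENTITIES = {"amp": "&", "lt": "<", "gt": ">",
             "nbsp": "\u00a0", "lrm": "\u200e", "rlm": "\u200f"}
_ENTITY = re.compile(r"&(amp|lt|gt|nbsp|lrm|rlm);")


@dataclass
class ParsedCue:  # hypothetical container, not a Docling type
    text: str              # cue text without tags or speaker prefix
    voice: Optional[str]   # speaker from the first <v ...> tag, if any


def parse_cue_payload(payload: str) -> ParsedCue:
    """Separate cue text from cue metadata (hypothetical helper)."""
    match = _VOICE.search(payload)
    voice = match.group(1).strip() if match else None
    # Remove tags only, so whitespace between formatted spans survives.
    plain = _TAG.sub("", payload)
    # Decode known references; a bare "&" (as in "P&L") is left untouched.
    plain = _ENTITY.sub(lambda m: _ENTITIES[m.group(1)], plain)
    return ParsedCue(text=plain, voice=voice)
```

With this sketch, `parse_cue_payload("<v Narrator>Welcome</v>")` yields the text `Welcome` with the voice `Narrator` kept separately, and `parse_cue_payload("P&L rose <b>sharply</b> in <i>Q3</i>").text` yields `P&L rose sharply in Q3` with the spacing intact.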

Steps to reproduce

Check the following script that illustrates the gaps and shows the expected parsed text without metadata and formatting.

test_process_docling_vtt.py

Docling version

Docling version: 2.58.0
Docling Core version: 2.48.4
Docling IBM Models version: 3.9.1
Docling Parse version: 4.7.0
Python: cpython-312 (3.12.10)
Platform: macOS-14.7.1-arm64-arm-64bit

Python version

Python 3.12.10

Labels

bug (Something isn't working), vtt (issues related to the WebVTT backend)
