Commit 1c74a9b
feat: Implementation of HTML backend with headless browser (#2969)
- Implementation of HTML backend that (optionally) uses headless browser (via Playwright) to materialize HTML pages into images, and add provenances with bboxes to all elements in the converted docling document.
- Conversion preserves reading order given by HTML DOM tree
- Added support for HTML "input" fields: checkboxes, radiobuttons, text inputs, etc.
- Added support to Key-Value convention in HTML (i.e. elements with id "key1" and "key1_value1" will be paired as key-values, see test cases as examples)
- Heuristic that glues independent inline HTML elements with single-character text in them into larger text blocks
- Support for inline styling (bold, italic, etc.)
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>1 parent 90d6dd4 commit 1c74a9b
File tree
134 files changed
+40503
-7572
lines changed- .github/workflows
- docling
- backend
- datamodel
- docs
- examples
- getting_started
- tests/data
- groundtruth/docling_v2
- html
- webp/groundtruth/docling_v2
Some content is hidden
Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.
134 files changed
+40503
-7572
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
20 | 20 | | |
21 | 21 | | |
22 | 22 | | |
23 | | - | |
| 23 | + | |
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
| |||
0 commit comments