feat: Implementation of HTML backend with headless browser #2969
maxmnemonic merged 31 commits into main
✅ DCO Check Passed: Thanks @maxmnemonic, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updates: Wonderful, this rule succeeded. When test data is updated, we require two reviewers
Codecov Report
✅ All modified and coverable lines are covered by tests.
Force-pushed from 4ea086e to 7e10a9d
Documentation Updates
1 document(s) were updated by changes in this PR: What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -128,6 +128,68 @@
---
+### HTML
+- **Pipeline/Backend**: `SimplePipeline` + `HTMLDocumentBackend`
+- **Installation Requirements**: HTML rendering (with headless browser support) requires the `htmlrender` extra: `pip install docling[htmlrender]`. This installs Playwright and related dependencies.
+- **Key Options** (`HTMLBackendOptions`):
+ - `render_page` (bool, default: False): Enable headless browser rendering to capture page images and element bounding boxes
+ - `render_page_width` (int, default: 794): Render page width in CSS pixels (A4 @ 96 DPI)
+ - `render_page_height` (int, default: 1123): Render page height in CSS pixels (A4 @ 96 DPI)
+ - `render_page_orientation` (Literal["portrait", "landscape"], default: "portrait"): Page orientation
+ - `render_print_media` (bool, default: True): Use print media emulation when rendering
+ - `render_wait_until` (Literal["load", "domcontentloaded", "networkidle"], default: "networkidle"): Playwright wait condition before extracting DOM
+ - `render_wait_ms` (int, default: 0): Extra delay in milliseconds after load
+ - `render_device_scale` (float, default: 1.0): Device scale factor for rendering
+ - `page_padding` (int, default: 0): Padding in CSS pixels applied to HTML body before rendering
+ - `render_full_page` (bool, default: False): Capture a single full-height page image instead of paginating
+ - `render_dpi` (int, default: 96): DPI used for page images created from rendering
+ - `fetch_images` (bool, default: False): Fetch and embed images from the HTML
+ - `enable_local_fetch` (bool): Enable fetching resources from the local filesystem
+ - `source_uri` (Path or str): Base URI for resolving relative paths in HTML
+- **Processing**:
+ - Reading order is preserved from the HTML DOM tree
+ - Supports HTML form elements: checkboxes, radio buttons, text inputs, and other input fields
+ - Supports key-value pair conventions where HTML elements with matching IDs (e.g., "key1" and "key1_value1") are automatically paired as key-value relationships
+ - When `render_page=True`, uses Playwright headless browser to materialize HTML pages into images
+ - Adds provenances with bounding boxes to all elements in the converted document when rendering is enabled
+ - Can handle local file paths and remote URLs
+ - Heuristic that glues independent inline HTML elements with single-character text into larger text blocks
+ - Support for inline styling (bold, italic, etc.)
+- **Usage Example**:
+
+```python
+from pathlib import Path
+from docling.datamodel.backend_options import HTMLBackendOptions
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, HTMLFormatOption
+
+html_options = HTMLBackendOptions(
+ render_page=True,
+ render_page_width=794,
+ render_page_height=1123,
+ render_device_scale=2.0,
+ render_page_orientation="portrait",
+ render_print_media=True,
+ render_wait_until="networkidle",
+ render_wait_ms=500,
+ render_full_page=True,
+ render_dpi=144,
+ page_padding=16,
+ fetch_images=True,
+)
+
+converter = DocumentConverter(
+ format_options={
+ InputFormat.HTML: HTMLFormatOption(backend_options=html_options)
+ }
+)
+
+result = converter.convert("path/to/file.html")
+doc = result.document
+```
+
+---
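The key-value ID convention listed under Processing above ("key1" paired with "key1_value1") can be illustrated with a small standalone sketch. This is not the backend's code: the helper name `pair_ids` and the `_value<number>` regular expression are assumptions made for illustration, and the actual matching rules in `HTMLDocumentBackend` may differ.

```python
import re
from collections import defaultdict

# Assumed pattern: a value ID is "<key id>_value" followed by optional digits.
VALUE_ID = re.compile(r"^(?P<key>.+)_value\d*$")


def pair_ids(element_ids):
    """Group value IDs under the key ID they extend (hypothetical helper)."""
    known = set(element_ids)
    pairs = defaultdict(list)
    for element_id in element_ids:
        match = VALUE_ID.match(element_id)
        # Only pair when the key element itself is present in the document.
        if match and match.group("key") in known:
            pairs[match.group("key")].append(element_id)
    return dict(pairs)


ids = ["key1", "key1_value1", "key2", "key2_value1", "key2_value2", "title"]
print(pair_ids(ids))
```

IDs that match neither role, such as `title` here, are left unpaired, and a value ID whose key element is missing is ignored.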
+
### LaTeX
- **Pipeline/Backend**: `SimplePipeline` + `LatexDocumentBackend`
- **Key Options** (`LatexBackendOptions`):
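As a quick check on the HTML defaults in the diff above, the `render_page_width` and `render_page_height` values (794 and 1123, described as A4 @ 96 DPI) follow from the ISO 216 A4 dimensions of 210 mm × 297 mm. The helper below is a hypothetical illustration of that arithmetic, not part of docling.

```python
MM_PER_INCH = 25.4


def mm_to_css_px(mm: float, dpi: float = 96.0) -> int:
    """Convert a length in millimetres to whole CSS pixels at the given DPI."""
    return round(mm / MM_PER_INCH * dpi)


# A4 is 210 mm x 297 mm (ISO 216).
a4_width_px = mm_to_css_px(210)   # 793.70... rounds to 794
a4_height_px = mm_to_css_px(297)  # 1122.52... rounds to 1123
print(a4_width_px, a4_height_px)  # prints "794 1123"
```

The same helper can be used to derive a viewport for another paper size before passing the result to `render_page_width` and `render_page_height`.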
…ight) to materialize HTML pages into images, and add provenances with bboxes to all elements in the converted docling document Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…TML backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…er values, and restricting key-value only for the ones that satisfy scope if there are such. Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…e and render scale compute, and an example on how to run html_backend with rendering Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
… Example that uses multi-processing for conversion; Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…when inside key-value pair Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…ding order inside the field_item Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…ents such as text Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…compute checkbox bboxes, improved handling of single-character inline groups Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
… overflowing viewport, removal of empty inline groups and elements with negative bounding boxes Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…n the html_backend Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…eatures Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…, p, summary. Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Force-pushed from a2caaf2 to afb89b4
ceberam
left a comment
@maxmnemonic just looking at the parsing output, it looks much neater, with less noise and with visual elements that get lost in the current implementation. However, looking at the Wiki duck page, I have seen a pattern that seems strange (see below).
tests/data/groundtruth/docling_v2/html_inline_group_in_table_cell.html.json
tests/data/groundtruth/docling_v2/html_rich_table_cells.html.json
…inks and inline code blocks Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
ceberam
left a comment
Great!
Just to help code maintenance, some docstrings and code consolidation would be nice afterwards.
Checklist: