
feat: Implementation of HTML backend with headless browser #2969

Merged
maxmnemonic merged 31 commits into main from dev/html_backend_rendered
Mar 24, 2026

Conversation

@maxmnemonic
Member

@maxmnemonic maxmnemonic commented Feb 9, 2026

  • Implements an HTML backend that optionally uses a headless browser (via Playwright) to render HTML pages into images and adds provenance with bounding boxes to all elements in the converted Docling document.
  • Conversion preserves the reading order given by the HTML DOM tree.
  • Adds support for HTML "input" fields: checkboxes, radio buttons, text inputs, etc.
  • Adds support for a key-value convention in HTML (i.e., elements with ids "key1" and "key1_value1" are paired as key-value; see the test cases for examples).
  • Adds a heuristic that glues independent inline HTML elements containing single-character text into larger text blocks.
  • Adds support for inline styling (bold, italic, etc.).
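
To illustrate the key-value id convention from the list above, here is a minimal sketch in plain Python. It is not the backend's code: the function name `pair_key_values`, the input shape (element id mapped to element text), and the exact matching rule (`<key id>_value<suffix>`) are assumptions inferred only from the "key1" / "key1_value1" example.

```python
import re
from collections import defaultdict

# Assumed naming rule, inferred from the example ids in the PR description:
# a value id looks like "<key id>_value<suffix>". The real backend may differ.
VALUE_ID = re.compile(r"^(?P<key>.+)_value\w*$")


def pair_key_values(texts_by_id: dict[str, str]) -> dict[str, list[str]]:
    """Group value texts under the text of the key element they refer to."""
    pairs: dict[str, list[str]] = defaultdict(list)
    for element_id, text in texts_by_id.items():
        match = VALUE_ID.match(element_id)
        # Only pair when the referenced key element actually exists.
        if match and match.group("key") in texts_by_id:
            pairs[texts_by_id[match.group("key")]].append(text)
    return dict(pairs)


# Hypothetical element ids and texts, as they might be collected from a form.
example = {
    "key1": "Name",
    "key1_value1": "Ada Lovelace",
    "key2": "Languages",
    "key2_value1": "English",
    "key2_value2": "French",
    "footer": "unrelated text",
}
paired = pair_key_values(example)
```

With this rule a key may carry several values, and elements whose ids do not follow the convention are left unpaired.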

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

@maxmnemonic maxmnemonic self-assigned this Feb 9, 2026
@github-actions
Contributor

github-actions bot commented Feb 9, 2026

DCO Check Passed

Thanks @maxmnemonic, all your commits are properly signed off. 🎉

@maxmnemonic maxmnemonic added the html issue related to html backend label Feb 9, 2026
@mergify

mergify bot commented Feb 9, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
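
The title rule above can be tried locally with Python's `re` module. This is a sketch under the assumption that the `~=` operator behaves like a regular-expression search; the helper name `is_conventional` is made up for illustration.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


# The title of this pull request satisfies the rule.
pr_title_ok = is_conventional(
    "feat: Implementation of HTML backend with headless browser"
)
# An optional scope and a breaking-change marker are accepted too.
scoped_ok = is_conventional("fix(html)!: handle empty inline groups")
# A title without a recognized type prefix is rejected.
plain_rejected = not is_conventional("Implementation of HTML backend")
```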

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

@codecov

codecov bot commented Feb 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@maxmnemonic maxmnemonic requested a review from vagenas March 16, 2026 12:41
@maxmnemonic maxmnemonic force-pushed the dev/html_backend_rendered branch 2 times, most recently from 4ea086e to 7e10a9d Compare March 16, 2026 13:47
@maxmnemonic maxmnemonic marked this pull request as ready for review March 16, 2026 13:51
@dosubot

dosubot bot commented Mar 16, 2026

Documentation Updates

1 document(s) were updated by changes in this PR:

What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -128,6 +128,68 @@
 
 ---
 
+### HTML
+- **Pipeline/Backend**: `SimplePipeline` + `HTMLDocumentBackend`
+- **Installation Requirements**: HTML rendering (with headless browser support) requires the `htmlrender` extra: `pip install docling[htmlrender]`. This installs Playwright and related dependencies.
+- **Key Options** (`HTMLBackendOptions`):
+    - `render_page` (bool, default: False): Enable headless browser rendering to capture page images and element bounding boxes
+    - `render_page_width` (int, default: 794): Render page width in CSS pixels (A4 @ 96 DPI)
+    - `render_page_height` (int, default: 1123): Render page height in CSS pixels (A4 @ 96 DPI)
+    - `render_page_orientation` (Literal["portrait", "landscape"], default: "portrait"): Page orientation
+    - `render_print_media` (bool, default: True): Use print media emulation when rendering
+    - `render_wait_until` (Literal["load", "domcontentloaded", "networkidle"], default: "networkidle"): Playwright wait condition before extracting DOM
+    - `render_wait_ms` (int, default: 0): Extra delay in milliseconds after load
+    - `render_device_scale` (float, default: 1.0): Device scale factor for rendering
+    - `page_padding` (int, default: 0): Padding in CSS pixels applied to HTML body before rendering
+    - `render_full_page` (bool, default: False): Capture a single full-height page image instead of paginating
+    - `render_dpi` (int, default: 96): DPI used for page images created from rendering
+    - `fetch_images` (bool, default: False): Fetch and embed images from the HTML
+    - `enable_local_fetch` (bool): Enable fetching resources from the local filesystem
+    - `source_uri` (Path or str): Base URI for resolving relative paths in HTML
+- **Processing**:
+    - Reading order is preserved from the HTML DOM tree
+    - Supports HTML form elements: checkboxes, radio buttons, text inputs, and other input fields
+    - Supports key-value pair conventions where HTML elements with matching IDs (e.g., "key1" and "key1_value1") are automatically paired as key-value relationships
+    - When `render_page=True`, uses Playwright headless browser to materialize HTML pages into images
+    - Adds provenances with bounding boxes to all elements in the converted document when rendering is enabled
+    - Can handle local file paths and remote URLs
+    - Heuristic that glues independent inline HTML elements with single-character text into larger text blocks
+    - Support for inline styling (bold, italic, etc.)
+- **Usage Example**:
+
+```python
+from pathlib import Path
+from docling.datamodel.backend_options import HTMLBackendOptions
+from docling.datamodel.base_models import InputFormat
+from docling.document_converter import DocumentConverter, HTMLFormatOption
+
+html_options = HTMLBackendOptions(
+    render_page=True,
+    render_page_width=794,
+    render_page_height=1123,
+    render_device_scale=2.0,
+    render_page_orientation="portrait",
+    render_print_media=True,
+    render_wait_until="networkidle",
+    render_wait_ms=500,
+    render_full_page=True,
+    render_dpi=144,
+    page_padding=16,
+    fetch_images=True,
+)
+
+converter = DocumentConverter(
+    format_options={
+        InputFormat.HTML: HTMLFormatOption(backend_options=html_options)
+    }
+)
+
+result = converter.convert("path/to/file.html")
+doc = result.document
+```
+
+---
+
 ### LaTeX
 - **Pipeline/Backend**: `SimplePipeline` + `LatexDocumentBackend`
 - **Key Options** (`LatexBackendOptions`):
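
The HTML options earlier in this diff describe the default render size (794 x 1123) as A4 at 96 DPI. The arithmetic can be checked with a short sketch. It assumes the defaults come from rounding the standard A4 dimensions (210 mm x 297 mm) to whole CSS pixels; the helper `mm_to_css_px` is hypothetical and not part of docling.

```python
MM_PER_INCH = 25.4


def mm_to_css_px(mm: float, dpi: int = 96) -> int:
    """Convert a length in millimetres to whole pixels at the given DPI."""
    return round(mm / MM_PER_INCH * dpi)


# A4 is 210 mm x 297 mm.
a4_width_px = mm_to_css_px(210)   # 210 / 25.4 * 96 = 793.70...
a4_height_px = mm_to_css_px(297)  # 297 / 25.4 * 96 = 1122.52...

# Landscape simply swaps the two dimensions.
landscape_size = (a4_height_px, a4_width_px)
```

Under the same assumption, a different paper size or DPI can be converted the same way before being passed as `render_page_width` and `render_page_height`.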


Maksym Lysak added 16 commits March 23, 2026 09:43
…ight) to materialize HTML pages into images, and add provenances with bboxes to all elements in the converted docling document

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…TML backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…er values, and restricting key-value only for the ones that satisfy scope if there are such.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…e and render scale compute, and an example on how to run html_backend with rendering

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
… Example that uses multi-processing for conversion;

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…when inside key-value pair

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…ding order inside the field_item

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…ents such as text

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Maksym Lysak added 11 commits March 23, 2026 09:43
…compute checkbox bboxes, improved handling of single-character inline groups

Signed-off-by Maksym Lysak <mly@zurich.ibm.com>

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
… overflowing viewport, removal of empty inline groups and elements with negative bounding boxes

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…n the html_backend

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…eatures

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
…, p, summary.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
@maxmnemonic maxmnemonic force-pushed the dev/html_backend_rendered branch from a2caaf2 to afb89b4 Compare March 23, 2026 08:44
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Member

@ceberam ceberam left a comment

@maxmnemonic just looking at the parsing output, it looks much neater, with less noise and with visual elements that get lost in the current implementation. However, looking at the Wiki duck page, I have seen a pattern that seems strange (see below).

Member

@vagenas vagenas left a comment

Check comments for details.

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
@maxmnemonic maxmnemonic requested a review from ceberam March 24, 2026 09:27
…inks and inline code blocks

Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
@maxmnemonic maxmnemonic requested a review from vagenas March 24, 2026 09:50
@maxmnemonic maxmnemonic added the enhancement New feature or request label Mar 24, 2026
Signed-off-by: Maksym Lysak <mly@zurich.ibm.com>
Member

@ceberam ceberam left a comment

Great!
Just to help code maintenance, some docstrings and code consolidation would be nice afterwards.

@maxmnemonic maxmnemonic merged commit 1c74a9b into main Mar 24, 2026
27 checks passed
@maxmnemonic maxmnemonic deleted the dev/html_backend_rendered branch March 24, 2026 13:29

Labels

enhancement: New feature or request
html: issue related to html backend

5 participants