Skip to content

Commit 1c74a9b

Browse files
maxmnemonicMaksym Lysak
andauthored
feat: Implementation of HTML backend with headless browser (#2969)
- Implementation of HTML backend that (optionally) uses headless browser (via Playwright) to materialize HTML pages into images, and add provenances with bboxes to all elements in the converted docling document. - Conversion preserves reading order given by HTML DOM tree - Added support for HTML "input" fields: checkboxes, radiobuttons, text inputs, etc. - Added support to Key-Value convention in HTML (i.e. elements with id "key1" and "key1_value1" will be paired as key-values, see test cases as examples) - Heuristic that glues independent inline HTML elements with single-character text in them into larger text blocks - Support for inline styling (bold, italic, etc.) Signed-off-by: Maksym Lysak <mly@zurich.ibm.com> Co-authored-by: Maksym Lysak <mly@zurich.ibm.com>
1 parent 90d6dd4 commit 1c74a9b

File tree

134 files changed

+40503
-7572
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

134 files changed

+40503
-7572
lines changed

.github/workflows/checks.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@ env:
2020
tests/test_asr_pipeline.py
2121
tests/test_threaded_pipeline.py
2222
PYTEST_TO_SKIP: |-
23-
EXAMPLES_TO_SKIP: '^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|minimal_asr_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|suryaocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model|granitedocling_repetition_stopping|mlx_whisper_example|gpu_standard_pipeline|gpu_vlm_pipeline|demo_layout_vlm|post_process_ocr_with_vlm)\.py$|xbrl_conversion\.ipynb$'
23+
EXAMPLES_TO_SKIP: '^(batch_convert|compare_vlm_models|minimal|minimal_vlm_pipeline|minimal_asr_pipeline|export_multimodal|custom_convert|develop_picture_enrichment|rapidocr_with_custom_models|suryaocr_with_custom_models|offline_convert|pictures_description|pictures_description_api|vlm_pipeline_api_model|granitedocling_repetition_stopping|mlx_whisper_example|gpu_standard_pipeline|gpu_vlm_pipeline|demo_layout_vlm|post_process_ocr_with_vlm|run_with_formats_html_rendered|run_with_formats_html_rendered_mp)\.py$|xbrl_conversion\.ipynb$'
2424

2525
jobs:
2626
lint:

0 commit comments

Comments
 (0)