# feat: Layout + VLM model with layout prompt #2244
base: main
## Conversation
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit — Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
✅ DCO Check Passed — Thanks @cau-git, all your commits are properly signed off. 🎉
Force-pushed from 0dbdd1f to 72007b9.
Codecov Report: ❌ Patch coverage is …
Force-pushed from a63ebb1 to 0a39e2f.
Force-pushed from d572cf4 to 6ea00bb.
@dosu review
The experimental VLM pipeline is implemented in the main codebase (not a separate experimental module) and supports producing DocTags as well as injecting layout information into prompts for vision-language models. The pipeline is flexible: prompts can be customized per page, and multiple VLM backends (Transformers, MLX, vLLM, API) are supported, with output formats such as DocTags, Markdown, and HTML. Layout prompt injection is handled by constructing prompts that can include box locations or other layout model outputs at runtime.

Documentation and usage examples have been updated and are available, including scripts for minimal pipelines, model comparisons, API usage, and GPU acceleration. These examples demonstrate how to configure the pipeline, inject layout into prompts, and export results in various formats. Recent PRs confirm that these examples and docs are maintained and integrated into the CI workflow for visibility and testing (source, source, source). Automated test coverage for this experimental feature is not clearly documented, but the code is modular, maintainable, and well documented for its experimental status. For concrete usage, see the bundled example scripts.

To reply, just mention @dosu.
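For readers who want a feel for what "injecting layout into prompts" means in practice, here is a minimal, self-contained sketch. The helper and type names (`make_layout_prompt`, `Box`) are illustrative inventions, not docling API:

```python
# Illustrative sketch only: make_layout_prompt and Box are invented names,
# not docling API. The idea is to append predicted layout boxes to the base
# VLM prompt at runtime, as the pipeline description above outlines.
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    label: str
    left: float
    top: float
    right: float
    bottom: float


def make_layout_prompt(base_prompt: str, boxes: List[Box]) -> str:
    """Append layout boxes to the base prompt as a simple XML-like string."""
    layout_xml = "".join(
        f'<{b.label} l="{b.left:.0f}" t="{b.top:.0f}" r="{b.right:.0f}" b="{b.bottom:.0f}"/>'
        for b in boxes
    )
    return base_prompt + layout_xml


boxes = [Box("text", 56, 72, 540, 130), Box("table", 56, 150, 540, 420)]
print(make_layout_prompt("Convert this page to DocTags.", boxes))
```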
### Pull Request Overview
This PR standardizes the build_prompt method call signature across VLM models and introduces a new experimental threaded pipeline that combines layout model preprocessing with VLM processing.
- Standardizes `build_prompt()` to accept full `Page` objects instead of inconsistent `page.parsed_page` or callable checks (a signature sketch follows this list)
- Introduces `BaseVlmPageModel` as a common base class for VLM page models
- Adds experimental `ThreadedLayoutVlmPipeline` that injects layout coordinates into VLM prompts
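As context for the first bullet, a sketch of what the standardized signature looks like. The class body is illustrative and the import paths are my best guess; only the `Union[Page, SegmentedPage]` annotation is taken from the diff below:

```python
# Sketch of the standardized signature; the class and prompt body are
# illustrative, only the annotation reflects the diff in this PR.
from typing import Union

from docling.datamodel.base_models import Page
from docling_core.types.doc.page import SegmentedPage


class ExampleVlmOptions:
    def build_prompt(self, page: Union[Page, SegmentedPage]) -> str:
        # Receiving the full Page (rather than page.parsed_page) means the
        # prompt builder can reach layout predictions, images, and parsed text.
        return "Convert this page to DocTags."
```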
### Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Summary per file:
| File | Description |
|---|---|
| `docling/models/vlm_models_inline/mlx_model.py` | Simplified to use the standardized `build_prompt(page)` instead of a callable check |
| `docling/models/vlm_models_inline/hf_transformers_model.py` | Changed from `build_prompt(page.parsed_page)` to `build_prompt(page)` for consistency |
| `docling/models/api_vlm_model.py` | Refactored to inherit from `BaseVlmPageModel`, added a `process_images()` method, and standardized prompt building (with an inconsistency issue) |
| `docling/experimental/pipeline/threaded_layout_vlm_pipeline.py` | New experimental pipeline combining layout and VLM processing with layout coordinate injection |
| `docling/experimental/demo_layout_vlm.py` | Demo script for the new threaded pipeline (contains a typo in a parameter name) |
| `docling/experimental/datamodel/threaded_layout_vlm_pipeline_options.py` | Configuration options for the new pipeline |
| `docling/experimental/pipeline/__init__.py` | New package initialization |
| `docling/experimental/datamodel/__init__.py` | New package initialization |
| `docling/experimental/__init__.py` | New experimental package with docstring |
| `docling/datamodel/pipeline_options_vlm_model.py` | Updated `build_prompt()` signature to accept `Union[Page, SegmentedPage]` (with a duplicate import) |
```python
if hi_res_image is not None:
    images.append(hi_res_image)
prompt = self.vlm_options.build_prompt(
    page.parsed_page
```
Copilot AI · Nov 5, 2025
Inconsistent parameter passed to `build_prompt()`. This file calls `build_prompt(page.parsed_page)` while the other VLM models (`mlx_model.py`, `hf_transformers_model.py`) call `build_prompt(page)`. This should be changed to `self.vlm_options.build_prompt(page)` for consistency.
Suggested change:

```diff
-    page.parsed_page
+    page
```
```python
    images.append(hi_res_image)
prompt = self.vlm_options.build_prompt(
    page.parsed_page
) # ask christoph
```
Copilot AI · Nov 5, 2025
Remove or resolve the TODO comment '# ask christoph' before merging to production.
Suggested change:

```diff
-) # ask christoph
+)
```
```python
stop_reason = VlmStopReason.UNSPECIFIED
# ...
if self.vlm_options.custom_stopping_criteria: # Ask christoph
```
Copilot AI · Nov 5, 2025
Remove or resolve the TODO comment '# Ask christoph' before merging to production.
Suggested change:

```diff
-if self.vlm_options.custom_stopping_criteria: # Ask christoph
+if self.vlm_options.custom_stopping_criteria:
```
```python
    generate_page_images=True,
)

pipeline_options_classic_vlm = VlmPipelineOptions(vlm_otpions=GRANITEDOCLING_VLLM)
```
Copilot AI · Nov 5, 2025
Typo: `vlm_otpions` should be `vlm_options`.
Suggested change:

```diff
-pipeline_options_classic_vlm = VlmPipelineOptions(vlm_otpions=GRANITEDOCLING_VLLM)
+pipeline_options_classic_vlm = VlmPipelineOptions(vlm_options=GRANITEDOCLING_VLLM)
```
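A side note on why this typo is easy to miss: docling's pipeline options are Pydantic models, and with Pydantic's default `extra="ignore"` behavior an unknown keyword such as `vlm_otpions` is silently dropped, so the demo runs with default options instead of failing. A generic sketch (plain Pydantic, not docling's actual model config):

```python
# Plain-Pydantic sketch of the failure mode; docling's real model config
# may differ, this just shows why the misspelling does not raise by default.
from pydantic import BaseModel, ConfigDict, ValidationError


class Options(BaseModel):
    vlm_options: str = "default"


opts = Options(vlm_otpions="custom")  # typo is silently ignored by default
print(opts.vlm_options)  # -> "default"


class StrictOptions(BaseModel):
    model_config = ConfigDict(extra="forbid")

    vlm_options: str = "default"


try:
    StrictOptions(vlm_otpions="custom")
except ValidationError as err:
    print(err)  # extra="forbid" turns the typo into a hard error
```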
```diff
-from typing import Any, Dict, List, Literal, Optional, Union
+from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Union

 from docling_core.types.doc.page import SegmentedPage
```
Copilot AI · Nov 5, 2025
Duplicate import: `SegmentedPage` is imported both at line 4 and inside the `TYPE_CHECKING` block at line 13. The import at line 4 should be removed, since it is only used for type annotations and is already imported in the `TYPE_CHECKING` block.
Suggested change:

```diff
-from docling_core.types.doc.page import SegmentedPage
```
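For reference, the pattern the comment is asking for, as a generic sketch; the function body is illustrative, not the code under review:

```python
# Generic sketch of keeping an annotation-only import inside TYPE_CHECKING;
# the import is seen by type checkers only and never executed at runtime.
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from docling_core.types.doc.page import SegmentedPage


def build_prompt(page: "SegmentedPage") -> str:
    # Quoted annotation (or `from __future__ import annotations`) avoids
    # evaluating SegmentedPage at runtime, so no duplicate import is needed.
    return "Convert this page to DocTags."
```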
```python
layout_injection = f"{layout_xml}"

custom_prompt = base_prompt + layout_injection
print(f"Layout injection prompt: {custom_prompt}")
```
Copilot AI · Nov 5, 2025
Debug print statement should be removed or replaced with proper logging using the `_log` logger that is already defined at the module level.
Suggested change:

```diff
-print(f"Layout injection prompt: {custom_prompt}")
+_log.debug(f"Layout injection prompt: {custom_prompt}")
```
```python
_log = logging.getLogger(__name__)
```
Copilot AI · Nov 5, 2025
The global variable '_log' is not used.
Suggested change:

```diff
-_log = logging.getLogger(__name__)
```
```python
import itertools
import logging
from pathlib import Path
from typing import Iterable, List, Optional, Union, cast
```
Copilot AI · Nov 5, 2025
Import of 'Iterable' is not used.
Suggested change:

```diff
-from typing import Iterable, List, Optional, Union, cast
+from typing import List, Optional, Union, cast
```
Establishes a pipeline targeting VLMs that:

- combines layout model preprocessing with VLM processing, injecting layout coordinates into the VLM prompt;
- lives under the `docling.experimental` module path.

Checklist:
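For orientation, a rough sketch of how the experimental pipeline might be wired into a converter. The options class name `ThreadedLayoutVlmPipelineOptions` is inferred from the options file in this PR's diff, and the wiring follows docling's usual `pipeline_cls` pattern; treat both as assumptions until the PR lands:

```python
# Rough sketch, not the confirmed API: ThreadedLayoutVlmPipelineOptions is
# inferred from threaded_layout_vlm_pipeline_options.py in this PR.
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.experimental.datamodel.threaded_layout_vlm_pipeline_options import (
    ThreadedLayoutVlmPipelineOptions,
)
from docling.experimental.pipeline.threaded_layout_vlm_pipeline import (
    ThreadedLayoutVlmPipeline,
)

pipeline_options = ThreadedLayoutVlmPipelineOptions()

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedLayoutVlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("example.pdf")  # path is a placeholder
print(result.document.export_to_markdown())
```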