Conversation

@cau-git (Contributor) commented Sep 10, 2025

Establishes a pipeline targeting VLMs that:

  • Produce DocTags
  • Accept a prompt with injected layout (box locations) from a layout model

⚠️ Note: This is an experimental pipeline, hence it resides in the docling.experimental module path.

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.
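
For reference, a minimal usage sketch of the new pipeline. This is not code from the PR: the module and class names follow the files added here, while the pipeline option fields and the input path are assumptions.

# A minimal sketch, assuming the module/class names added in this PR;
# pipeline option fields and the input path are placeholders.
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.experimental.datamodel.threaded_layout_vlm_pipeline_options import (
    ThreadedLayoutVlmPipelineOptions,
)
from docling.experimental.pipeline.threaded_layout_vlm_pipeline import (
    ThreadedLayoutVlmPipeline,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedLayoutVlmPipeline,
            pipeline_options=ThreadedLayoutVlmPipelineOptions(),
        )
    }
)
result = converter.convert("example.pdf")  # placeholder input
print(result.document.export_to_markdown())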

@mergify bot commented Sep 10, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@github-actions bot commented Sep 10, 2025

DCO Check Passed

Thanks @cau-git, all your commits are properly signed off. 🎉

@cau-git force-pushed the cau/layout_vlm_pipeline branch from 0dbdd1f to 72007b9 on September 10, 2025 15:38
@codecov bot commented Sep 10, 2025

@cau-git force-pushed the cau/layout_vlm_pipeline branch from a63ebb1 to 0a39e2f on September 11, 2025 14:50
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
@ElHachem02 self-assigned this Nov 5, 2025
@ElHachem02 force-pushed the cau/layout_vlm_pipeline branch from d572cf4 to 6ea00bb on November 5, 2025 10:05
@cau-git requested a review from Copilot on November 5, 2025 13:10
@ElHachem02 (Contributor) commented:

@dosu review

@dosubot bot commented Nov 5, 2025

The experimental VLM pipeline is implemented in the main codebase (not a separate experimental module) and supports producing DocTags and injecting layout information into prompts for vision-language models. The pipeline is flexible, allowing prompt customization per page, and supports multiple VLM backends (Transformers, MLX, vLLM, API) with output formats like DocTags, Markdown, and HTML. Layout prompt injection is handled by constructing prompts that can include box locations or other layout model outputs at runtime.

Documentation and usage examples have been updated and are available, including scripts for minimal pipelines, model comparisons, API usage, and GPU acceleration. These examples demonstrate how to configure the pipeline, inject layout into prompts, and export results in various formats. Recent PRs confirm that these examples and docs are maintained and integrated into the CI workflow for visibility and testing (source, source, source).

Automated test coverage for this experimental feature is not clearly documented, but the code is modular, maintainable, and well-documented for experimental status. If you need to see concrete usage, check the example scripts like gpu_vlm_pipeline.py, minimal_vlm_pipeline.py, and vlm_pipeline_api_model.py for end-to-end demonstrations.
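
Since dosu mentions the API backend, here is a hedged sketch of wiring docling's standard VLM pipeline to an OpenAI-compatible endpoint, in the spirit of vlm_pipeline_api_model.py; the endpoint URL and model name are placeholders.

# A minimal sketch, assuming docling's ApiVlmOptions / VlmPipeline API;
# the endpoint URL and model name are placeholders.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,  # required for API-backed VLMs
    vlm_options=ApiVlmOptions(
        url="http://localhost:8000/v1/chat/completions",  # placeholder endpoint
        params={"model": "granite-docling"},  # placeholder model name
        prompt="Convert this page to DocTags.",
        response_format=ResponseFormat.DOCTAGS,
    ),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline, pipeline_options=pipeline_options
        )
    }
)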

To reply, just mention @dosu.




Copilot AI left a comment


Pull Request Overview

This PR standardizes the build_prompt method call signature across VLM models and introduces a new experimental threaded pipeline that combines layout model preprocessing with VLM processing.

  • Standardizes build_prompt() to accept full Page objects instead of inconsistent page.parsed_page or callable checks (a sketch follows this list)
  • Introduces BaseVlmPageModel as a common base class for VLM page models
  • Adds experimental ThreadedLayoutVlmPipeline that injects layout coordinates into VLM prompts
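
A sketch of what the standardized signature enables, assuming layout predictions are attached to the Page by the layout stage; the prompt wording and box serialization here are illustrative, not the pipeline's actual injection format.

# A minimal sketch; the attribute names mirror docling's datamodel,
# but the prompt wording and box serialization are assumptions.
from typing import Union

from docling.datamodel.base_models import Page
from docling_core.types.doc.page import SegmentedPage


def build_prompt(page: Union[Page, SegmentedPage]) -> str:
    base_prompt = "Convert this page to DocTags."  # placeholder base prompt
    if isinstance(page, Page) and page.predictions.layout is not None:
        # Serialize the layout model's predicted boxes into the prompt.
        boxes = "; ".join(
            f"{cluster.label}: {cluster.bbox.as_tuple()}"
            for cluster in page.predictions.layout.clusters
        )
        return f"{base_prompt} Layout: {boxes}"
    return base_prompt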

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Summary per file:

  • docling/models/vlm_models_inline/mlx_model.py: Simplified to use the standardized build_prompt(page) instead of a callable check
  • docling/models/vlm_models_inline/hf_transformers_model.py: Changed from build_prompt(page.parsed_page) to build_prompt(page) for consistency
  • docling/models/api_vlm_model.py: Refactored to inherit from BaseVlmPageModel, added a process_images() method, and standardized prompt building (with an inconsistency, noted below)
  • docling/experimental/pipeline/threaded_layout_vlm_pipeline.py: New experimental pipeline combining layout and VLM processing with layout coordinate injection
  • docling/experimental/demo_layout_vlm.py: Demo script for the new threaded pipeline (contains a typo in a parameter name)
  • docling/experimental/datamodel/threaded_layout_vlm_pipeline_options.py: Configuration options for the new pipeline
  • docling/experimental/pipeline/__init__.py: New package initialization
  • docling/experimental/datamodel/__init__.py: New package initialization
  • docling/experimental/__init__.py: New experimental package with docstring
  • docling/datamodel/pipeline_options_vlm_model.py: Updated build_prompt() signature to accept Union[Page, SegmentedPage] (with a duplicate import)


if hi_res_image is not None:
    images.append(hi_res_image)
prompt = self.vlm_options.build_prompt(
    page.parsed_page

Copilot AI Nov 5, 2025


Inconsistent parameter passed to build_prompt(). This file calls build_prompt(page.parsed_page) while the other VLM models (mlx_model.py, hf_transformers_model.py) call build_prompt(page). This should be changed to self.vlm_options.build_prompt(page) for consistency.

Suggested change
-    page.parsed_page
+    page

    images.append(hi_res_image)
prompt = self.vlm_options.build_prompt(
    page.parsed_page
) # ask christoph

Copilot AI Nov 5, 2025


Remove or resolve the TODO comment '# ask christoph' before merging to production.

Suggested change
-) # ask christoph
+)


stop_reason = VlmStopReason.UNSPECIFIED

if self.vlm_options.custom_stopping_criteria: # Ask christoph

Copilot AI Nov 5, 2025


Remove or resolve the TODO comment '# Ask christoph' before merging to production.

Suggested change
-if self.vlm_options.custom_stopping_criteria: # Ask christoph
+if self.vlm_options.custom_stopping_criteria:

    generate_page_images=True,
)

pipeline_options_classic_vlm = VlmPipelineOptions(vlm_otpions=GRANITEDOCLING_VLLM)

Copilot AI Nov 5, 2025


Corrected spelling of 'vlm_otpions' to 'vlm_options'.

Suggested change
-pipeline_options_classic_vlm = VlmPipelineOptions(vlm_otpions=GRANITEDOCLING_VLLM)
+pipeline_options_classic_vlm = VlmPipelineOptions(vlm_options=GRANITEDOCLING_VLLM)

-from typing import Any, Dict, List, Literal, Optional, Union
+from typing import TYPE_CHECKING, Any, Dict, List, Literal, Optional, Union

from docling_core.types.doc.page import SegmentedPage

Copilot AI Nov 5, 2025


Duplicate import: SegmentedPage is imported both at line 4 and inside the TYPE_CHECKING block at line 13. The import at line 4 should be removed since it's only used for type annotations and is already imported in the TYPE_CHECKING block.

Suggested change
-from docling_core.types.doc.page import SegmentedPage

layout_injection = f"{layout_xml}"

custom_prompt = base_prompt + layout_injection
print(f"Layout injection prompt: {custom_prompt}")

Copilot AI Nov 5, 2025


Debug print statement should be removed or replaced with proper logging using the _log logger that's already defined at the module level.

Suggested change
-print(f"Layout injection prompt: {custom_prompt}")
+_log.debug(f"Layout injection prompt: {custom_prompt}")

Comment on lines +49 to +51
_log = logging.getLogger(__name__)



Copilot AI Nov 5, 2025


The global variable '_log' is not used.

Suggested change
-_log = logging.getLogger(__name__)

import itertools
import logging
from pathlib import Path
from typing import Iterable, List, Optional, Union, cast

Copilot AI Nov 5, 2025


Import of 'Iterable' is not used.

Suggested change
-from typing import Iterable, List, Optional, Union, cast
+from typing import List, Optional, Union, cast
