Happy new year! I was testing Docling's PDF parsing capabilities with the OCR and VLM pipelines against scanned PDF files, but it seems the `generate_picture_images` option has no effect in the VLM pipeline. Here is my pipeline setup:

```python
vlm_options = ApiVlmOptions(
    # VLM provider setup ...
    timeout=300,
    scale=1.0,
    response_format=ResponseFormat.MARKDOWN,
    prompt="You are an expert OCR assistant. Convert the provided image to Markdown.",
)
pipeline_options = VlmPipelineOptions()
pipeline_options.enable_remote_services = True
pipeline_options.vlm_options = vlm_options
# NOTE: Not working in the VLM pipeline
pipeline_options.generate_picture_images = True
return pipeline_options
```

and the converting and saving parts:

```python
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            pipeline_cls=VlmPipeline,
        ),
    }
)
# Assemble paths
# ...
# Save artifacts
result.document.save_as_markdown(
    md_path, image_mode=ImageRefMode.REFERENCED, artifacts_dir=image_dir
)
```

Another thing: can we use the VLM only for OCR instead of parsing the whole page? Full-page VLM parsing is slow and costly. I'm satisfied with the layout detection results produced by the PDF pipeline, and it can extract images too; it's just that the results of the supported OCR engines were not as good as the VLM's. So maybe we could crop the text regions and send only those to the VLM for OCR? Thanks! 🙏
---
This is expected: in Docling's VLM pipeline, the `generate_picture_images` option only works if the VLM model's output includes explicit image references (like Markdown image tags).

If you want to keep the original images in your Markdown, the recommended workflow is to use the standard PDF pipeline for layout detection and image extraction (with `generate_picture_images=True`), then use the VLM for enrichment (like image description or OCR) as a separate step. Docling's pipelines are modular, so you can combine outputs as needed (see example).

Currently, there's no built-in way to use a VLM only for OCR on detected text regions within Docling; the VLM pipeline processes full-page images by default. To do region-based VLM OCR, you'd need to run layout detection first (with the standard pipeline), crop the text regions yourself, and then call the VLM model on those crops outside of Docling's built-in pipelines (see code reference).

If you want a sample hybrid workflow or code pattern for this, let me know! To reply, just mention @dosu.
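A minimal sketch of the standard-pipeline half of that workflow, layout detection plus image extraction, with the VLM enrichment step left out. Import paths and option names follow Docling's documented figure-export example, but verify them against your installed version:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode

pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = 2.0              # render pages at 2x resolution
pipeline_options.generate_page_images = True     # keep full-page images around
pipeline_options.generate_picture_images = True  # extract embedded pictures

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("scanned.pdf")  # hypothetical input path

# Images survive in the Markdown export because the standard pipeline produced
# them; VLM enrichment (image description or OCR) would run as a later step.
result.document.save_as_markdown(
    "out.md", image_mode=ImageRefMode.REFERENCED, artifacts_dir="out_artifacts"
)
```

The page images kept by `generate_page_images` are also what you would crop from for the region-based OCR approach discussed below.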
---
@dosu Can I extend the built-in `PdfPipelineOptions` to use a VLM for OCR on text regions? Can you give me some code examples?
---
I wrote a PoC. The WER (word error rate) is definitely lower, but the layout shifted, and the style and table structure were all lost... I gave up 🤷
---
You can't directly extend `PdfPipelineOptions` to use a VLM for OCR on detected text regions, but you can achieve this by creating a custom OCR engine that runs VLM inference on cropped regions, then registering it with Docling's OCR factory and selecting it via `ocr_options`. The standard pipeline is designed to let you swap in custom OCR logic this way (reference, reference).
Here's the high-level code pattern: for each detected text region (bounding box), crop the region from the page image, call your VLM model on the crop, and assemble the results as `TextCell` objects.
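A minimal sketch of that crop-and-transcribe core, leaving Docling's OCR-factory plumbing aside. It assumes you already have a PIL page image and pixel-space bounding boxes from layout detection, plus an OpenAI-compatible VLM endpoint; `vlm_ocr_region`, the endpoint URL, and the model name are all hypothetical:

```python
import base64
import io

from openai import OpenAI
from PIL import Image

# Hypothetical OpenAI-compatible VLM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def vlm_ocr_region(page_image: Image.Image, bbox: tuple[int, int, int, int]) -> str:
    """Crop one detected text region from the page image and OCR it with a VLM."""
    left, top, right, bottom = bbox
    crop = page_image.crop((left, top, right, bottom))
    buf = io.BytesIO()
    crop.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    response = client.chat.completions.create(
        model="my-vlm",  # hypothetical model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe the text in this image exactly. Output plain text only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return (response.choices[0].message.content or "").strip()

# Usage: run OCR per detected region, keeping each box alongside its text so the
# results can be assembled into TextCell objects (or your own structure):
# texts = [(box, vlm_ocr_region(page_img, box)) for box in text_region_boxes]
```

Note that the per-region prompt asks for plain transcription rather than Markdown; since the document structure still comes from the standard pipeline, this is one way to limit the layout and style drift that full-page VLM parsing can introduce.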