Happy new year! I was testing Docling's PDF parsing capabilities with the OCR and VLM pipelines against scanned PDF files, but it seems the `generate_picture_images` option has no effect in the VLM pipeline. Here is my pipeline setup:

```python
vlm_options = ApiVlmOptions(
    # VLM provider setup ...
    timeout=300,
    scale=1.0,
    response_format=ResponseFormat.MARKDOWN,
    prompt="You are an expert OCR assistant. Convert the provided image to Markdown.",
)
pipeline_options = VlmPipelineOptions()
pipeline_options.enable_remote_services = True
pipeline_options.vlm_options = vlm_options
# NOTE: Not working in the VLM pipeline
pipeline_options.generate_picture_images = True
return pipeline_options
```

and the converting and saving parts:

```python
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            pipeline_cls=VlmPipeline,
        ),
    }
)
# Assemble paths
# ...
# Save artifacts
result.document.save_as_markdown(
    md_path, image_mode=ImageRefMode.REFERENCED, artifacts_dir=image_dir
)
```

Another thing: can we use the VLM only for OCR instead of parsing the whole page? Full-page VLM parsing is slow and costly. I'm satisfied with the layout detection results produced by the PDF pipeline, and it can extract images too; it's just that the results of the supported OCR engines were not as good as the VLM's. So maybe we could crop the text regions and send only those to the VLM for OCR? Thanks! 🙏
---
This is expected: in Docling's VLM pipeline, the `generate_picture_images` option only works if the VLM model's output includes explicit image references (like Markdown image tags).

If you want to keep the original images in your Markdown, the recommended workflow is to use the standard PDF pipeline for layout detection and image extraction (with `generate_picture_images=True`), then use the VLM for enrichment (like image description or OCR) as a separate step. Docling's pipelines are modular, so you can combine outputs as needed (see example).

Currently, there's no built-in way to use a VLM only for OCR on detected text regions within Docling; the VLM pipeline processes full-page images by default. To do region-based VLM OCR, you'd need to run layout detection first (with the standard pipeline), crop the text regions yourself, and then call the VLM model on those crops outside of Docling's built-in pipelines (see code reference).

If you want a sample hybrid workflow or code pattern for this, let me know! To reply, just mention @dosu.
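A minimal sketch of the standard-pipeline half of that workflow, layout detection plus image extraction, with the VLM enrichment step left out. Import paths and option names follow Docling's documented figure-export example, but verify them against your installed version:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode

pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = 2.0              # render pages at 2x resolution
pipeline_options.generate_page_images = True     # keep full-page images around
pipeline_options.generate_picture_images = True  # extract embedded pictures

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("scanned.pdf")  # hypothetical input path

# Images survive in the Markdown export because the standard pipeline produced
# them; VLM enrichment (image description or OCR) would run as a later step.
result.document.save_as_markdown(
    "out.md", image_mode=ImageRefMode.REFERENCED, artifacts_dir="out_artifacts"
)
```

The page images kept by `generate_page_images` are also what you would crop from for the region-based OCR approach discussed below.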
---
@dosu Can I extend the built-in `PdfPipelineOptions` to use a VLM for OCR on text regions? Can you give me some code examples?
---
I wrote a PoC. The WER (word error rate) is definitely lower, but the layout shifted, and the style and table structure were all lost... I gave up 🤷
---
You can't directly extend `PdfPipelineOptions` to use a VLM for OCR on detected text regions, but you can achieve this by creating a custom OCR engine that runs VLM inference on cropped regions, then registering it with Docling's OCR factory and selecting it via `ocr_options`. The standard pipeline is designed to let you swap in custom OCR logic this way (reference, reference).
Here's the high-level code pattern: for each detected text region (bounding box), crop the region from the page image, call your VLM model on the crop, and assemble the results as `TextCell` objects.
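A minimal sketch of that crop-and-transcribe core, leaving Docling's OCR-factory plumbing aside. It assumes you already have a PIL page image and pixel-space bounding boxes from layout detection, plus an OpenAI-compatible VLM endpoint; `vlm_ocr_region`, the endpoint URL, and the model name are all hypothetical:

```python
import base64
import io

from openai import OpenAI
from PIL import Image

# Hypothetical OpenAI-compatible VLM endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def vlm_ocr_region(page_image: Image.Image, bbox: tuple[int, int, int, int]) -> str:
    """Crop one detected text region from the page image and OCR it with a VLM."""
    left, top, right, bottom = bbox
    crop = page_image.crop((left, top, right, bottom))
    buf = io.BytesIO()
    crop.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    response = client.chat.completions.create(
        model="my-vlm",  # hypothetical model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe the text in this image exactly. Output plain text only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return (response.choices[0].message.content or "").strip()

# Usage: run OCR per detected region, keeping each box alongside its text so the
# results can be assembled into TextCell objects (or your own structure):
# texts = [(box, vlm_ocr_region(page_img, box)) for box in text_region_boxes]
```

Note that the per-region prompt asks for plain transcription rather than Markdown; since the document structure still comes from the standard pipeline, this is one way to limit the layout and style drift that full-page VLM parsing can introduce.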