
Docling Qwen3-VL Plugin

This package integrates the cyankiwi/Qwen3-VL-4B-Thinking-AWQ-4bit vision-language model with Docling through the plugin system. It provides:

  1. OCR Engine - Extract text from scanned documents with layout-aware bounding boxes
  2. Picture Description - Generate detailed descriptions of images/figures in documents
  3. Table Structure Detection - Analyze and extract table structures from document images
  4. Layout Analysis - Detect and classify document layout elements
  5. Picture Classification - Classify images into semantic categories
  6. Code/Formula Detection - Extract code blocks and mathematical formulas

Features

  • AWQ-quantized Qwen3-VL default model for lower VRAM usage
  • Layout-aware OCR with bounding boxes via QWENVL_HTML mode
  • Picture/figure description and captioning for document enrichment
  • Table structure extraction with cell and row/column detection
  • Document layout analysis identifying headings, paragraphs, figures, etc.
  • Picture classification for semantic image categorization
  • Code and formula extraction with language detection and LaTeX conversion
  • Multilingual support (32 languages)
  • Native 256K context length
  • Chain-of-thought reasoning with the "Thinking" model variant
  • 4-bit/8-bit quantization support (reduce VRAM from ~16GB to ~5GB)

Requirements

  • NVIDIA GPU with CUDA support (required - CPU not supported)
  • Python 3.10+
  • Docling >= 2.63.0 with external plugins enabled
  • transformers >= 4.51.0
  • ~16GB VRAM (full precision) or ~5GB VRAM (4-bit quantization)

Installation

From PyPI (not yet available)

pip install docling-ocr-qwen3vl

From Source

git clone https://github.com/mayflower/docling-ocr-qwen3vl.git
cd docling-ocr-qwen3vl
pip install -e .

For quantization support (reduces VRAM from ~16GB to ~5GB):

pip install -e ".[quantization]"

For development with tests:

pip install -e ".[test,quantization]"

Quick Start

Python API

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_ocr_qwen3vl.options import Qwen3VlOcrOptions, Qwen3VlPromptMode, Qwen3VlQuantization

# Configure pipeline with Qwen3-VL OCR
opts = PdfPipelineOptions()
opts.allow_external_plugins = True
opts.do_ocr = True
opts.ocr_options = Qwen3VlOcrOptions(
    prompt_mode=Qwen3VlPromptMode.QWENVL_HTML,  # Layout-aware with bounding boxes
    quantization=Qwen3VlQuantization.INT4,       # Use 4-bit to reduce VRAM
    force_full_page_ocr=True,
)

# Convert document
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("scanned.pdf")
print(result.document.export_to_markdown())

CLI

# Basic usage
docling --allow-external-plugins --ocr-engine qwen3vl_ocr scanned.pdf

# With specific options (JSON format)
docling --allow-external-plugins \
  --ocr-engine qwen3vl_ocr \
  --ocr-options '{"prompt_mode": "qwenvl_html", "quantization": "int4"}' \
  scanned.pdf
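When scripting the CLI, it can be easier to build the --ocr-options JSON programmatically than to hand-escape it. A minimal sketch (the helper name is illustrative, not part of the plugin):

```python
import json

def build_ocr_cli_args(prompt_mode: str, quantization: str) -> list[str]:
    """Assemble the docling CLI invocation with OCR options as a JSON string."""
    ocr_options = json.dumps({
        "prompt_mode": prompt_mode,
        "quantization": quantization,
    })
    return [
        "docling",
        "--allow-external-plugins",
        "--ocr-engine", "qwen3vl_ocr",
        "--ocr-options", ocr_options,
        "scanned.pdf",
    ]

args = build_ocr_cli_args("qwenvl_html", "int4")
print(args)  # pass to subprocess.run(args) to invoke docling
```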

Prompt Modes

The plugin supports multiple prompt modes for different use cases:

| Mode | Description | Bounding Boxes |
|------|-------------|----------------|
| OCR | Extract plain text preserving reading order | No |
| MARKDOWN | Convert document to markdown format | No |
| STRUCTURED | Extract text with layout awareness | No |
| QWENVL_HTML | Layout-aware HTML with precise bounding boxes | Yes |

QWENVL_HTML Mode (Recommended for Docling)

The QWENVL_HTML mode produces HTML output with data-bbox attributes containing element coordinates. This is the recommended mode for Docling integration as it provides:

  • Accurate bounding boxes for each text element
  • Proper element type classification (headings, paragraphs, tables, etc.)
  • Normalized coordinates (0-1000 scale) that Docling converts to document coordinates

from docling_ocr_qwen3vl.options import Qwen3VlOcrOptions, Qwen3VlPromptMode

opts.ocr_options = Qwen3VlOcrOptions(
    prompt_mode=Qwen3VlPromptMode.QWENVL_HTML,
    force_full_page_ocr=True,
)

Example output format:

<h1 data-bbox="400 80 580 90">Document Title</h1>
<p data-bbox="100 120 900 150">First paragraph text...</p>
<p data-bbox="100 160 900 190">Second paragraph text...</p>
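Docling converts the 0-1000 normalized coordinates to document coordinates internally, but the mapping is straightforward to illustrate. A hypothetical sketch that parses data-bbox attributes from output like the above and rescales them to a given page size:

```python
import re

def parse_bboxes(html: str, page_w: float, page_h: float):
    """Extract (tag, text, bbox) triples from QWENVL_HTML-style output and
    rescale the 0-1000 normalized coordinates to page coordinates.
    Illustrative only; Docling performs this conversion itself."""
    elements = []
    for tag, bbox, text in re.findall(
        r'<(\w+) data-bbox="([\d ]+)">(.*?)</\1>', html
    ):
        x0, y0, x1, y1 = (int(v) for v in bbox.split())
        elements.append((tag, text, (
            x0 / 1000 * page_w, y0 / 1000 * page_h,
            x1 / 1000 * page_w, y1 / 1000 * page_h,
        )))
    return elements

html = '<h1 data-bbox="400 80 580 90">Document Title</h1>'
print(parse_bboxes(html, page_w=612, page_h=792))  # US Letter in points
```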

Quantization (Reduce VRAM Usage)

The plugin supports 4-bit and 8-bit quantization via BitsAndBytes, significantly reducing VRAM requirements:

| Mode | VRAM (approx.) | Quality |
|------|----------------|---------|
| Full precision (bf16) | ~16GB | Best |
| 8-bit (int8) | ~8GB | Very good |
| 4-bit (int4) | ~5GB | Good |
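BitsAndBytes quantization in transformers is driven by a BitsAndBytesConfig passed at model load time. A sketch of what the plugin's INT4 settings plausibly map to, mirroring the documented defaults (nf4, double quantization); this is an assumption about the plugin's internals, not its actual code:

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit (int4) configuration mirroring the plugin's documented defaults
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # alternative: "fp4"
    bnb_4bit_use_double_quant=True,   # nested quantization for extra savings
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# The config is then passed as quantization_config= when loading the model.
```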

Installation

pip install -e ".[quantization]"
# or
pip install bitsandbytes

Usage

from docling_ocr_qwen3vl.options import Qwen3VlOcrOptions, Qwen3VlQuantization

# 4-bit quantization (recommended for limited VRAM)
opts.ocr_options = Qwen3VlOcrOptions(
    quantization=Qwen3VlQuantization.INT4,
    force_full_page_ocr=True,
)

# 8-bit quantization (better quality, more VRAM)
opts.ocr_options = Qwen3VlOcrOptions(
    quantization=Qwen3VlQuantization.INT8,
    force_full_page_ocr=True,
)

Quantization Options

| Option | Default | Description |
|--------|---------|-------------|
| quantization | NONE | Quantization mode: NONE, INT8, INT4 |
| bnb_4bit_quant_type | "nf4" | 4-bit quantization type: nf4 or fp4 |
| bnb_4bit_use_double_quant | True | Nested quantization for extra memory savings |

Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| model_repo_id | "cyankiwi/Qwen3-VL-4B-Thinking-AWQ-4bit" | Hugging Face model identifier |
| device | "cuda" | Torch device (must be CUDA) |
| dtype | "auto" | Model dtype (auto, bfloat16, float16, float32) |
| max_new_tokens | 4096 | Maximum tokens to generate |
| temperature | 0.6 | Sampling temperature |
| top_p | 0.95 | Nucleus sampling probability |
| top_k | 20 | Top-k sampling parameter |
| do_sample | True | Enable stochastic decoding |
| prompt_mode | OCR | Prompt strategy (OCR, MARKDOWN, STRUCTURED, QWENVL_HTML) |
| attn_implementation | "flash_attention_2" | Attention backend |
| page_scale | 2.0 | PDF rasterization scale factor |

Docling Integration

Step 1: Install Both Packages

Both Docling and this plugin must be in the same Python environment:

pip install -U docling
pip install -e ".[quantization]"  # or pip install docling-ocr-qwen3vl

Step 2: Enable External Plugins

External plugins must be explicitly enabled for security:

Python:

opts = PdfPipelineOptions()
opts.allow_external_plugins = True

CLI:

docling --allow-external-plugins ...

Step 3: Configure OCR Options

Python (recommended configuration):

from docling_ocr_qwen3vl.options import Qwen3VlOcrOptions, Qwen3VlPromptMode, Qwen3VlQuantization

opts.do_ocr = True
opts.ocr_options = Qwen3VlOcrOptions(
    prompt_mode=Qwen3VlPromptMode.QWENVL_HTML,  # Best for layout analysis
    quantization=Qwen3VlQuantization.INT4,       # Reduce VRAM usage
    force_full_page_ocr=True,                    # OCR entire page
)

CLI:

docling --allow-external-plugins \
  --ocr-engine qwen3vl_ocr \
  --ocr-options '{"prompt_mode": "qwenvl_html", "quantization": "int4", "force_full_page_ocr": true}' \
  scanned.pdf

Picture Description (Image Enrichment)

This plugin also provides a picture description model for Docling's enrichment pipeline. This uses Qwen3-VL to generate detailed descriptions of images, figures, and diagrams found in documents.

Python API

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_ocr_qwen3vl.options import Qwen3VlPictureDescriptionOptions, Qwen3VlQuantization

# Configure pipeline with picture description
opts = PdfPipelineOptions()
opts.allow_external_plugins = True
opts.do_picture_description = True
opts.generate_picture_images = True  # Required! Without this, images won't be passed to the model
opts.picture_description_options = Qwen3VlPictureDescriptionOptions(
    quantization=Qwen3VlQuantization.INT4,  # Reduce VRAM usage
)

# Convert document - images will be enriched with descriptions
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("document.pdf")

# Access picture descriptions
for item in result.document.iterate_items():
    if hasattr(item, 'annotations') and item.annotations:
        for ann in item.annotations:
            if hasattr(ann, 'content'):
                print(f"Picture description: {ann.content}")

Important: You must set generate_picture_images = True for picture description to work. Without this, Docling won't generate the image data needed by the model, and you'll see <!-- image --> placeholders instead of descriptions.

CLI

docling --allow-external-plugins \
  --picture-description-engine qwen3vl \
  --picture-description-options '{"quantization": "int4"}' \
  --generate-picture-images \
  document.pdf

Note: The --generate-picture-images flag is required for picture description to work via CLI.

Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| model_repo_id | "cyankiwi/Qwen3-VL-4B-Thinking-AWQ-4bit" | Hugging Face model identifier |
| prompt | (detailed description prompt) | Custom prompt for image description |
| max_new_tokens | 512 | Maximum tokens for description |
| quantization | NONE | Quantization mode: NONE, INT8, INT4 |
| temperature | 0.6 | Sampling temperature |
| do_sample | True | Enable stochastic decoding |

Custom Prompts

You can customize the description prompt:

from docling_ocr_qwen3vl.options import Qwen3VlPictureDescriptionOptions

opts.picture_description_options = Qwen3VlPictureDescriptionOptions(
    prompt="Describe this chart or diagram. Focus on the data being presented, axes labels, and key insights.",
    quantization=Qwen3VlQuantization.INT4,
)

Table Structure Detection

The plugin provides table structure detection using Qwen3-VL to analyze and extract table structures from document images, including cell boundaries, row/column spans, and content.

Python API

from docling_ocr_qwen3vl.options import Qwen3VlTableStructureOptions, Qwen3VlQuantization

opts = PdfPipelineOptions()
opts.allow_external_plugins = True
opts.table_structure_options = Qwen3VlTableStructureOptions(
    quantization=Qwen3VlQuantization.INT4,
    do_cell_matching=True,  # Match cells back to PDF text
)

Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| model_repo_id | "cyankiwi/Qwen3-VL-4B-Thinking-AWQ-4bit" | Hugging Face model identifier |
| max_new_tokens | 4096 | Maximum tokens for table extraction |
| quantization | NONE | Quantization mode: NONE, INT8, INT4 |
| do_cell_matching | True | Match predicted cells back to PDF text cells |

Layout Analysis

The plugin provides layout analysis using Qwen3-VL to detect and classify document elements like headings, paragraphs, figures, tables, and more.

Python API

from docling_ocr_qwen3vl.options import Qwen3VlLayoutOptions, Qwen3VlQuantization

opts = PdfPipelineOptions()
opts.allow_external_plugins = True
opts.layout_options = Qwen3VlLayoutOptions(
    quantization=Qwen3VlQuantization.INT4,
)

Detected Element Types

The layout model can detect and classify the following document elements:

  • Headings (title, section headers)
  • Paragraphs (text blocks)
  • Tables
  • Figures/Images
  • Lists (ordered and unordered)
  • Captions
  • Footnotes
  • Headers/Footers
  • Page numbers
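In QWENVL_HTML output, these element types surface as HTML tags, which can be folded into the categories above. A purely illustrative mapping (hypothetical; the plugin's internal label mapping may differ):

```python
# Hypothetical tag-to-category mapping for QWENVL_HTML output.
TAG_TO_CATEGORY = {
    "h1": "heading", "h2": "heading", "h3": "heading",
    "p": "paragraph",
    "table": "table",
    "img": "figure", "figure": "figure",
    "ul": "list", "ol": "list",
    "figcaption": "caption",
    "header": "header", "footer": "footer",
}

def classify_tag(tag: str) -> str:
    """Map an HTML tag to a layout category, defaulting to paragraph."""
    return TAG_TO_CATEGORY.get(tag.lower(), "paragraph")

print(classify_tag("H2"))     # heading
print(classify_tag("table"))  # table
```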

Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| model_repo_id | "cyankiwi/Qwen3-VL-4B-Thinking-AWQ-4bit" | Hugging Face model identifier |
| max_new_tokens | 4096 | Maximum tokens for layout analysis |
| quantization | NONE | Quantization mode: NONE, INT8, INT4 |
| keep_empty_clusters | False | Keep clusters without text cells |
| skip_cell_assignment | False | Skip cell-to-cluster assignment |

Picture Classification

The plugin provides picture classification to categorize images into semantic categories like photos, diagrams, charts, logos, etc.

Python API

from docling_ocr_qwen3vl.options import Qwen3VlPictureClassifierOptions, Qwen3VlQuantization

opts = PdfPipelineOptions()
opts.allow_external_plugins = True
opts.picture_classifier_options = Qwen3VlPictureClassifierOptions(
    quantization=Qwen3VlQuantization.INT4,
)

Supported Categories

The classifier can identify:

  • Photos/Photographs
  • Charts (bar, line, pie, etc.)
  • Diagrams (flowcharts, architecture, etc.)
  • Logos
  • Icons
  • Screenshots
  • Illustrations
  • Maps
  • Technical drawings

Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| model_repo_id | "cyankiwi/Qwen3-VL-4B-Thinking-AWQ-4bit" | Hugging Face model identifier |
| max_new_tokens | 256 | Maximum tokens for classification |
| quantization | NONE | Quantization mode: NONE, INT8, INT4 |

Code and Formula Detection

The plugin provides code and formula detection to extract programming code blocks and mathematical formulas from document images.

Python API

from docling_ocr_qwen3vl.options import Qwen3VlCodeFormulaOptions, Qwen3VlQuantization

opts = PdfPipelineOptions()
opts.allow_external_plugins = True
opts.code_formula_options = Qwen3VlCodeFormulaOptions(
    quantization=Qwen3VlQuantization.INT4,
    do_code_enrichment=True,    # Enable code extraction
    do_formula_enrichment=True,  # Enable formula extraction
)

Supported Languages

The code detector can identify and extract code in:

  • Python, JavaScript, TypeScript
  • Java, C, C++, C#
  • Go, Rust, Ruby, PHP
  • Swift, Kotlin, SQL
  • Bash/Shell scripts
  • HTML, CSS, JSON, YAML, XML

Formula Extraction

Mathematical formulas are extracted and converted to LaTeX format for easy rendering and processing.
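Extracted LaTeX is model-generated, so a downstream sanity check before rendering can be useful. A hypothetical post-processing helper (not part of the plugin) that rejects formulas with unbalanced braces or \left/\right pairs:

```python
def is_balanced_latex(formula: str) -> bool:
    """Cheap validity check for extracted LaTeX: curly braces must nest
    correctly and \\left must pair with \\right. A heuristic, not a parser."""
    depth = 0
    for ch in formula:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:       # closing brace before any opener
                return False
    return depth == 0 and formula.count(r"\left") == formula.count(r"\right")

print(is_balanced_latex(r"\frac{a}{b}"))  # True
print(is_balanced_latex(r"\frac{a}{b"))   # False
```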

Configuration Options

| Option | Default | Description |
|--------|---------|-------------|
| model_repo_id | "cyankiwi/Qwen3-VL-4B-Thinking-AWQ-4bit" | Hugging Face model identifier |
| max_new_tokens | 2048 | Maximum tokens for extraction |
| quantization | NONE | Quantization mode: NONE, INT8, INT4 |
| do_code_enrichment | True | Enable code block detection |
| do_formula_enrichment | True | Enable formula detection |

Plugin Compatibility

This plugin provides 6 model implementations, but only 4 are currently usable via external plugins due to limitations in docling's factory system:

| Feature | Plugin Class | Status | Notes |
|---------|--------------|--------|-------|
| OCR | Qwen3VlOcrModel | Pluggable | Works via ocr_engines entry point |
| Picture Description | Qwen3VlPictureDescriptionModel | Pluggable | Works via picture_description entry point |
| Table Structure | Qwen3VlTableStructureModel | Pluggable | Works via table_structure_engines entry point |
| Layout Analysis | Qwen3VlLayoutModel | Pluggable | Works via layout_engines entry point |
| Picture Classification | Qwen3VlPictureClassifierModel | Not pluggable | Docling hardcodes PictureClassifierModel |
| Code/Formula Detection | Qwen3VlCodeFormulaModel | Not pluggable | Docling hardcodes EquationCodeEnrichmentModel |

The Picture Classification and Code/Formula Detection models are fully implemented in this plugin, but they cannot be used until Docling adds factory-based plugin support for these model types.

Combining Multiple Features

You can use multiple Qwen3-VL features together for comprehensive document processing:

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_ocr_qwen3vl.options import (
    Qwen3VlOcrOptions, Qwen3VlPromptMode,
    Qwen3VlPictureDescriptionOptions,
    Qwen3VlTableStructureOptions,
    Qwen3VlLayoutOptions,
    Qwen3VlQuantization
)

opts = PdfPipelineOptions()
opts.allow_external_plugins = True

# Enable OCR
opts.do_ocr = True
opts.ocr_options = Qwen3VlOcrOptions(
    prompt_mode=Qwen3VlPromptMode.QWENVL_HTML,
    quantization=Qwen3VlQuantization.INT4,
    force_full_page_ocr=True,
)

# Enable picture description
opts.do_picture_description = True
opts.generate_picture_images = True  # Required for picture description!
opts.picture_description_options = Qwen3VlPictureDescriptionOptions(
    quantization=Qwen3VlQuantization.INT4,
)

# Enable table structure
opts.do_table_structure = True
opts.table_structure_options = Qwen3VlTableStructureOptions(
    quantization=Qwen3VlQuantization.INT4,
)

# Enable layout analysis (Qwen3-VL)
opts.layout_options = Qwen3VlLayoutOptions(
    quantization=Qwen3VlQuantization.INT4,
)

# Create converter and process document
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
)
result = converter.convert("document.pdf")
print(result.document.export_to_markdown())

Important Notes:

  • Set generate_picture_images = True when using picture description (required for the model to receive image data)
  • All models share the same underlying Qwen3-VL model weights, so VRAM usage is efficient when using multiple features
  • Picture Classification and Code/Formula are not yet pluggable in docling (see Plugin Compatibility section above)

Docling-Serve Integration

Docling-serve is the HTTP API server for Docling. To use this plugin with docling-serve:

Step 1: Install in the Same Environment

# Install docling-serve
pip install docling-serve

# Install this plugin
pip install docling-ocr-qwen3vl[quantization]
# or from source:
pip install -e ".[quantization]"

Step 2: Enable External Plugins

Set the environment variable before starting the server:

export DOCLING_SERVE_ALLOW_EXTERNAL_PLUGINS=1

Step 3: Start the Server

# Start docling-serve
docling-serve run

# Or with uvicorn directly
uvicorn docling_serve.app:app --host 0.0.0.0 --port 5000

Step 4: Make API Requests

Use the /convert endpoint with OCR options in the request body:

curl -X POST "http://localhost:5000/convert" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@scanned.pdf" \
  -F 'options={
    "pdf_pipeline_options": {
      "do_ocr": true,
      "ocr_options": {
        "kind": "qwen3vl_ocr",
        "prompt_mode": "qwenvl_html",
        "quantization": "int4",
        "force_full_page_ocr": true
      }
    }
  }'
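The same request can be issued from Python. A sketch that builds the options form field (field layout taken from the curl example above; sending it with the requests library is an assumption about your client setup):

```python
import json

def build_convert_options(prompt_mode: str = "qwenvl_html",
                          quantization: str = "int4") -> dict:
    """Build the `options` form field for docling-serve's /convert endpoint."""
    return {
        "pdf_pipeline_options": {
            "do_ocr": True,
            "ocr_options": {
                "kind": "qwen3vl_ocr",
                "prompt_mode": prompt_mode,
                "quantization": quantization,
                "force_full_page_ocr": True,
            },
        }
    }

payload = json.dumps(build_convert_options())
# e.g. with requests (assumed installed):
# requests.post("http://localhost:5000/convert",
#               files={"file": open("scanned.pdf", "rb")},
#               data={"options": payload})
print(payload)
```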

Picture Description via API

curl -X POST "http://localhost:5000/convert" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@document.pdf" \
  -F 'options={
    "pdf_pipeline_options": {
      "do_picture_description": true,
      "picture_description_options": {
        "kind": "qwen3vl",
        "quantization": "int4"
      }
    }
  }'

Docker Deployment

For Docker deployments, ensure the plugin is installed in the container and the environment variable is set:

FROM python:3.11-slim

# Install CUDA runtime (required)
# ... CUDA installation steps ...

# Install dependencies
RUN pip install docling-serve docling-ocr-qwen3vl[quantization]

# Enable external plugins
ENV DOCLING_SERVE_ALLOW_EXTERNAL_PLUGINS=1

# Expose port
EXPOSE 5000

# Run server
CMD ["docling-serve", "run", "--host", "0.0.0.0", "--port", "5000"]

For GPU support with docker-compose:

version: '3.8'
services:
  docling:
    build: .
    ports:
      - "5000:5000"
    environment:
      - DOCLING_SERVE_ALLOW_EXTERNAL_PLUGINS=1
      - CUDA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Attention Backends

Qwen3-VL defaults to flash_attention_2 for best performance. If flash-attn is not installed, the plugin automatically falls back to eager attention.

# Default: flash_attention_2 (requires flash-attn package)
Qwen3VlOcrOptions(attn_implementation="flash_attention_2")

# Fallback: eager attention (no extra dependencies)
Qwen3VlOcrOptions(attn_implementation="eager")

To install flash-attn:

pip install flash-attn --no-build-isolation

Multi-GPU Support

For systems with multiple GPUs, you can specify which GPU to use:

# Use specific GPU
Qwen3VlOcrOptions(device="cuda:1")

Or use environment variable:

CUDA_VISIBLE_DEVICES=1 docling --allow-external-plugins --ocr-engine qwen3vl_ocr doc.pdf

Troubleshooting

CUDA Out of Memory

If you encounter CUDA OOM errors:

  1. Use quantization (reduces VRAM from ~16GB to ~5GB):

    Qwen3VlOcrOptions(quantization=Qwen3VlQuantization.INT4)
  2. Reduce max_new_tokens:

    Qwen3VlOcrOptions(max_new_tokens=2048)
  3. Reduce page_scale (lower resolution OCR):

    Qwen3VlOcrOptions(page_scale=1.5)

flash-attn Not Available

If you see warnings about flash-attn, the plugin will fall back to eager attention automatically. To install flash-attn:

pip install flash-attn --no-build-isolation

Plugin Not Found

Ensure both packages are in the same Python environment and external plugins are enabled:

opts.allow_external_plugins = True  # Required!

Picture Descriptions Show <!-- image -->

If picture descriptions show <!-- image --> placeholders instead of actual descriptions, you need to enable image generation:

opts.do_picture_description = True
opts.generate_picture_images = True  # This is required!
opts.picture_description_options = Qwen3VlPictureDescriptionOptions(...)

Without generate_picture_images = True, Docling won't pass image data to the picture description model.

Slow First Inference

The first inference is slow because the model needs to be loaded. Subsequent inferences are much faster. The model is cached in memory after first load.
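The caching behavior can be illustrated with a toy stand-in for the loader (the real plugin keeps the loaded model in memory; this sketch only demonstrates the pattern):

```python
from functools import lru_cache

load_count = 0  # tracks how many times the expensive load actually runs

@lru_cache(maxsize=1)
def get_model(repo_id: str) -> str:
    """Toy stand-in for model loading: the slow load runs once per repo_id,
    later calls reuse the cached instance."""
    global load_count
    load_count += 1
    return f"model({repo_id})"  # real code would load weights here

get_model("cyankiwi/Qwen3-VL-4B-Thinking-AWQ-4bit")  # slow: loads the model
get_model("cyankiwi/Qwen3-VL-4B-Thinking-AWQ-4bit")  # fast: cache hit
print(load_count)  # 1
```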

Development

# Clone repository
git clone https://github.com/mayflower/docling-ocr-qwen3vl.git
cd docling-ocr-qwen3vl

# Install in development mode
pip install -e ".[test,quantization]"

# Run tests
pytest

# Run GPU test
python scripts/gpu_test.py

License

MIT
