Commit 0f9ec2c

Merge pull request #21 from fabianofranz/docling-conversion-tutorials
Add Docling conversion tutorials
2 parents d96f286 + 5dcd845 commit 0f9ec2c

5 files changed

+278
-0
lines changed

docs/docling-conversion/README.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# Docling Conversion Tutorials

A collection of tutorials and techniques for document processing and parsing using [Docling](https://docling-project.github.io/docling/) and related tools.

This repository provides a curated set of opinionated _conversion profiles_ that, in our experience, address some of the common problems users face when preparing and parsing documents for AI ingestion. Our goal is to help users achieve great results without needing to dive deep into all of Docling’s options and features.

Most examples focus on PDF to Markdown parsing, but they can be easily adapted to other input and output formats.

## Installation

Please refer to the [documentation](https://docling-project.github.io/docling/installation/) for official installation instructions. In most cases, the Docling CLI can be installed with:

```bash
$ pip install docling
```

If you prefer a web UI, [Docling Serve](https://github.com/docling-project/docling-serve) can be used with the same options available in the CLI. Install and run it with:

```bash
$ pip install "docling-serve[ui]"
$ docling-serve run --enable-ui
```

## Document parsing

### Standard settings

Most of the settings in the example below are already the **defaults** and produce good, fast results for most documents. Images are embedded in the output document as Base64, and OCR is applied only to bitmap content. The example also selects the newer `dlparse_v4` PDF backend.

```bash
$ docling /path/to/document.pdf \
  --to md \
  --pdf-backend dlparse_v4 \
  --image-export-mode embedded \
  --ocr \
  --ocr-engine easyocr \
  --table-mode accurate
```

A Python version of this conversion technique is available in [standard_settings.py](./standard_settings.py).

### Force OCR

Depending on how a PDF document was structured when it was created, the backend might not be able to parse its layers and contents effectively. This can happen even in documents that appear to contain pure text. In these cases, **forcing OCR** on the entire document usually produces better results.

```bash
$ docling /path/to/document.pdf \
  --to md \
  --pdf-backend dlparse_v4 \
  --image-export-mode embedded \
  --force-ocr \
  --ocr-engine easyocr \
  --table-mode accurate
```

A Python version of this conversion technique is available in [force_ocr.py](./force_ocr.py).

### Enrichment

Documents with many **code blocks**, **images**, or **formulas** can be parsed using _enriched_ conversion pipelines, which run additional models tailored to these types of content. Note that these options may increase processing time.

#### Code blocks and formulas

For documents heavy on **code blocks** and **formulas**, use:

```bash
$ docling /path/to/document.pdf \
  --to md \
  --pdf-backend dlparse_v4 \
  --image-export-mode embedded \
  --ocr \
  --ocr-engine easyocr \
  --table-mode accurate \
  --device auto \
  --enrich-code \
  --enrich-formula
```
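
The Python example shipped with this commit (`enrichment.py`) enables only the picture enrichments and leaves `do_code_enrichment` and `do_formula_enrichment` set to `False`. As a minimal sketch of the code-and-formula variant, the snippet below flips those flags on an options object. The attribute names are taken from `enrichment.py`; the `apply_profile` helper, the `CODE_AND_FORMULA_PROFILE` dictionary, and the `SimpleNamespace` stand-in (used so the sketch runs without Docling installed) are assumptions made for illustration.

```python
from types import SimpleNamespace

# Attribute names as they appear in enrichment.py from this commit.
CODE_AND_FORMULA_PROFILE = {
    "do_code_enrichment": True,
    "do_formula_enrichment": True,
    "do_picture_classification": False,
    "do_picture_description": False,
}


def apply_profile(options, profile):
    """Set each profile flag on `options`, rejecting unknown attribute names."""
    for name, value in profile.items():
        if not hasattr(options, name):
            raise AttributeError(f"pipeline options have no attribute {name!r}")
        setattr(options, name, value)
    return options


# Stand-in for Docling's PdfPipelineOptions (hypothetical, for illustration only).
demo_options = SimpleNamespace(
    do_code_enrichment=False,
    do_formula_enrichment=False,
    do_picture_classification=False,
    do_picture_description=False,
)
apply_profile(demo_options, CODE_AND_FORMULA_PROFILE)
print(demo_options.do_code_enrichment, demo_options.do_formula_enrichment)
```

With Docling installed, passing a `PdfPipelineOptions()` instance instead of the stand-in should behave the same way, and the rest of the script can follow `enrichment.py`.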

#### Image classification and description

For documents heavy on **images**, the _picture classification_ step identifies the classes of pictures found in the document (chart types, flow diagrams, logos, signatures, and so on), while the _picture description_ step annotates (captions) pictures using a vision model.

```bash
$ docling /path/to/document.pdf \
  --to md \
  --pdf-backend dlparse_v4 \
  --image-export-mode embedded \
  --ocr \
  --ocr-engine easyocr \
  --table-mode accurate \
  --device auto \
  --enrich-picture-classes \
  --enrich-picture-description
```

A Python version of this conversion technique is available in [enrichment.py](./enrichment.py).
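
The standard, force-OCR, and enrichment profiles above share most of their flags and differ only in the OCR switch and the `--enrich-*` options. The sketch below assembles those command lines from one place, which may help when scripting several profiles. It uses only flags that appear in this README; the `docling_command` helper and its parameter names are hypothetical, and the flag order differs slightly from the examples above.

```python
import shlex

# Flags shared by the standard, force-OCR and enrichment profiles in this README.
BASE_FLAGS = [
    "--to", "md",
    "--pdf-backend", "dlparse_v4",
    "--image-export-mode", "embedded",
    "--ocr-engine", "easyocr",
    "--table-mode", "accurate",
]

# Enrichment flags exactly as written in the examples above.
ENRICH_FLAGS = {
    "code": "--enrich-code",
    "formula": "--enrich-formula",
    "picture-classes": "--enrich-picture-classes",
    "picture-description": "--enrich-picture-description",
}


def docling_command(path, force_ocr=False, enrich=()):
    """Return the argv list for one of the README profiles (hypothetical helper)."""
    args = ["docling", path, *BASE_FLAGS]
    # The two OCR profiles differ only in this switch.
    args.append("--force-ocr" if force_ocr else "--ocr")
    if enrich:
        # The enrichment examples also pass `--device auto`.
        args += ["--device", "auto"]
        args += [ENRICH_FLAGS[name] for name in enrich]
    return args


# Print a shell-quoted command line for the code-and-formula profile.
print(shlex.join(docling_command("/path/to/document.pdf", enrich=("code", "formula"))))
```

The returned list can be handed to `subprocess.run` directly, which avoids shell quoting issues with paths that contain spaces.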

### VLM

Docling supports VLMs (Visual Language Models), which can be a good choice when the previous conversion profiles didn't produce good results. The Docling team provides [SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview), a small and fast model specifically targeted at document conversion. Other models can also be used, like [Granite Vision](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview).

```bash
$ docling /path/to/document.pdf \
  --to md \
  --pipeline vlm \
  --vlm-model smoldocling \
  --device auto \
  --table-mode accurate
```

A Python version of this conversion technique is available in [vlm.py](./vlm.py).

Note that using a VLM significantly increases processing time, so running it on a GPU is strongly recommended. If you know the accelerator device of the machine running the conversion, consider passing either `cuda` or `mps` to the `--device` option.
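
When the target machine is not known in advance, a script can make a rough guess at the `--device` value. The sketch below is a heuristic built on the standard library only: it assumes that an NVIDIA driver installation puts `nvidia-smi` on the `PATH` and that Apple Silicon Macs report `Darwin`/`arm64`. Neither check proves that the installed ML runtime can actually use the accelerator, so the function falls back to `auto`. The `guess_device` name is hypothetical.

```python
import platform
import shutil


def guess_device():
    """Best-effort guess for the --device value; falls back to "auto"."""
    # Assumption: an NVIDIA driver install puts `nvidia-smi` on PATH.
    if shutil.which("nvidia-smi"):
        return "cuda"
    # Assumption: Apple Silicon Macs report Darwin/arm64 and support MPS.
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mps"
    # Otherwise let Docling pick the device itself.
    return "auto"


print(f"--device {guess_device()}")
```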
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
"""Docling example for PDF conversion with image description and classification"""

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
from docling_core.types.doc import (
    ImageRefMode,
    PictureClassificationData,
)

source = "https://raw.githubusercontent.com/docling-project/docling/refs/heads/main/tests/data/pdf/picture_classification.pdf"  # Path or URL to PDF

# Standard OCR and table settings, plus the two picture enrichment steps.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions()
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.generate_page_images = True
pipeline_options.do_picture_classification = True
pipeline_options.do_picture_description = True
pipeline_options.do_formula_enrichment = False
pipeline_options.do_code_enrichment = False
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV4DocumentBackend,
        )
    }
)

result = converter.convert(source)
md = result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)

print(md)

# Print each picture annotation's provenance and any predicted classes.
for picture in result.document.pictures:
    for annotation in picture.annotations:
        print(annotation.provenance)
        if isinstance(annotation, PictureClassificationData):
            for predicted_class in annotation.predicted_classes:
                print(
                    f"{predicted_class.class_name} with {predicted_class.confidence} confidence"
                )
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
"""Docling example for PDF conversion with OCR"""

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
from docling_core.types.doc import ImageRefMode

source = "https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/026-latex-multicolumn/multicolumn.pdf"  # Path or URL to PDF

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions()
pipeline_options.ocr_options.force_full_page_ocr = True
pipeline_options.generate_picture_images = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV4DocumentBackend,
        )
    }
)

result = converter.convert(source)
md = result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)

print(md)
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
"""Docling example for PDF conversion with most settings at their defaults"""

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    EasyOcrOptions,
    PdfPipelineOptions,
    TableFormerMode,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.backend.docling_parse_v4_backend import DoclingParseV4DocumentBackend
from docling_core.types.doc import ImageRefMode

source = "https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/001-trivial/minimal-document.pdf"  # Path or URL to PDF

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions()
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=4, device=AcceleratorDevice.AUTO
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV4DocumentBackend,
        )
    }
)

result = converter.convert(source)
md = result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)

print(md)

docs/docling-conversion/vlm.py

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
"""Docling example for PDF conversion with VLM pipeline"""

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
    smoldocling_vlm_conversion_options,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline
from docling_core.types.doc import ImageRefMode

source = "https://raw.githubusercontent.com/py-pdf/sample-files/refs/heads/main/026-latex-multicolumn/multicolumn.pdf"  # Path or URL to PDF

pipeline_options = VlmPipelineOptions()
pipeline_options.vlm_options = smoldocling_vlm_conversion_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            pipeline_cls=VlmPipeline,
        )
    }
)

result = converter.convert(source)
md = result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)

print(md)
