
Commit ff351fd

docs: Describe examples (#2262)
* Update .py examples with clearer guidance, update out of date imports and calls
* Fix minimal.py string error, fix ruff format error
* Fix more CI issues

Signed-off-by: Mingxuan Zhao <[email protected]>
1 parent 0e95171 commit ff351fd

21 files changed (+608, -85 lines)

docs/examples/batch_convert.py

Lines changed: 45 additions & 1 deletion
@@ -1,3 +1,33 @@
+"""
+Batch convert multiple PDF files and export results in several formats.
+
+What this example does
+- Loads a small set of sample PDFs.
+- Runs the Docling PDF pipeline once per file.
+- Writes outputs to `scratch/` in multiple formats (JSON, HTML, Markdown, text, doctags, YAML).
+
+Prerequisites
+- Install Docling and dependencies as described in the repository README.
+- Ensure you can import `docling` from your Python environment.
+- YAML export requires `PyYAML` (`pip install pyyaml`).
+
+Input documents
+- By default, this example uses a few PDFs from `tests/data/pdf/` in the repo.
+- If you cloned without test data, or want to use your own files, edit
+  `input_doc_paths` below to point to PDFs on your machine.
+
+Output formats (controlled by flags)
+- `USE_V2 = True` enables the current Docling document exports (recommended).
+- `USE_LEGACY = False` keeps legacy Deep Search exports disabled.
+  You can set it to `True` if you need legacy formats for compatibility tests.
+
+Notes
+- Set `pipeline_options.generate_page_images = True` to include page images in HTML.
+- The script logs conversion progress and raises if any documents fail.
+- This example shows both helper methods like `save_as_*` and lower-level
+  `export_to_*` + manual file writes; outputs may overlap intentionally.
+"""
 import json
 import logging
 import time
@@ -15,6 +45,9 @@
 
 _log = logging.getLogger(__name__)
 
+# Export toggles:
+# - USE_V2 controls modern Docling document exports.
+# - USE_LEGACY enables legacy Deep Search exports for comparison or migration.
 USE_V2 = True
 USE_LEGACY = False
 
@@ -35,6 +68,9 @@ def export_documents(
         doc_filename = conv_res.input.file.stem
 
         if USE_V2:
+            # Recommended modern Docling exports. These helpers mirror the
+            # lower-level "export_to_*" methods used below, but handle
+            # common details like image handling.
             conv_res.document.save_as_json(
                 output_dir / f"{doc_filename}.json",
                 image_mode=ImageRefMode.PLACEHOLDER,
@@ -121,6 +157,9 @@ def export_documents(
 def main():
     logging.basicConfig(level=logging.INFO)
 
+    # Location of sample PDFs used by this example. If your checkout does not
+    # include test data, change `data_folder` or point `input_doc_paths` to
+    # your own files.
     data_folder = Path(__file__).parent / "../../tests/data"
     input_doc_paths = [
         data_folder / "pdf/2206.01062.pdf",
@@ -139,6 +178,8 @@ def main():
     # settings.debug.visualize_tables = True
     # settings.debug.visualize_cells = True
 
+    # Configure the PDF pipeline. Enabling page image generation improves HTML
+    # previews (embedded images) but adds processing time.
     pipeline_options = PdfPipelineOptions()
     pipeline_options.generate_page_images = True
 
@@ -152,11 +193,14 @@ def main():
 
     start_time = time.time()
 
+    # Convert all inputs. Set `raises_on_error=False` to keep processing other
+    # files even if one fails; errors are summarized after the run.
     conv_results = doc_converter.convert_all(
         input_doc_paths,
         raises_on_error=False,  # to let conversion run through all and examine results at the end
     )
-    success_count, partial_success_count, failure_count = export_documents(
+    # Write outputs to ./scratch and log a summary.
+    _success_count, _partial_success_count, failure_count = export_documents(
         conv_results, output_dir=Path("scratch")
     )
 
docs/examples/compare_vlm_models.py

Lines changed: 28 additions & 4 deletions
@@ -1,8 +1,28 @@
-# Compare VLM models
-# ==================
+# %% [markdown]
+# Compare different VLM models by running the VLM pipeline and timing outputs.
 #
-# This example runs the VLM pipeline with different vision-language models.
-# Their runtime as well output quality is compared.
+# What this example does
+# - Iterates through a list of VLM model configurations and converts the same file.
+# - Prints per-page generation times and saves JSON/MD/HTML to `scratch/`.
+# - Summarizes total inference time and pages processed in a table.
+#
+# Requirements
+# - Install `tabulate` for pretty printing (`pip install tabulate`).
+#
+# Prerequisites
+# - Install Docling with VLM extras. Ensure models can be downloaded or are available.
+#
+# How to run
+# - From the repo root: `python docs/examples/compare_vlm_models.py`.
+# - Results are saved to `scratch/` with filenames including the model and framework.
+#
+# Notes
+# - MLX models are skipped automatically on non-macOS platforms.
+# - On CUDA systems, you can enable flash_attention_2 (see commented lines).
+# - Running multiple VLMs can be GPU/CPU intensive and time-consuming; ensure
+#   enough VRAM/system RAM and close other memory-heavy apps.
+
+# %%
 
 import json
 import sys
@@ -31,6 +51,8 @@
 
 
 def convert(sources: list[Path], converter: DocumentConverter):
+    # Note: this helper assumes a single-item `sources` list. It returns after
+    # processing the first source to keep runtime/output focused.
     model_id = pipeline_options.vlm_options.repo_id.replace("/", "_")
     framework = pipeline_options.vlm_options.inference_framework
     for source in sources:
@@ -61,6 +83,8 @@ def convert(sources: list[Path], converter: DocumentConverter):
 
     print("===== Final output of the converted document =======")
 
+    # Manual export for illustration. Below, `save_as_json()` writes the same
+    # JSON again; kept intentionally to show both approaches.
     with (out_path / f"{fname}.json").open("w") as fp:
         fp.write(json.dumps(res.document.export_to_dict()))
 
docs/examples/custom_convert.py

Lines changed: 50 additions & 13 deletions
@@ -1,3 +1,39 @@
+# %% [markdown]
+# Customize PDF conversion by toggling OCR/backends and pipeline options.
+#
+# What this example does
+# - Shows several alternative configurations for the Docling PDF pipeline.
+# - Lets you try OCR engines (EasyOCR, Tesseract, system OCR) or no OCR.
+# - Converts a single sample PDF and exports results to `scratch/`.
+#
+# Prerequisites
+# - Install Docling and its optional OCR backends per the docs.
+# - Ensure you can import `docling` from your Python environment.
+#
+# How to run
+# - From the repository root, run: `python docs/examples/custom_convert.py`.
+# - Outputs are written under `scratch/` next to where you run the script.
+#
+# Choosing a configuration
+# - Only one configuration block should be active at a time.
+# - Uncomment exactly one of the sections below to experiment.
+# - The file ships with "Docling Parse with EasyOCR" enabled as a sensible default.
+# - If you uncomment a backend or OCR option that is not imported above, also
+#   import its class, e.g.:
+#   - `from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend`
+#   - `from docling.datamodel.pipeline_options import TesseractOcrOptions, TesseractCliOcrOptions, OcrMacOptions`
+#
+# Input document
+# - Defaults to a single PDF from `tests/data/pdf/` in the repo.
+# - If you don't have the test data, update `input_doc_path` to a local PDF.
+#
+# Notes
+# - EasyOCR language: adjust `pipeline_options.ocr_options.lang` (e.g., ["en"], ["es"], ["en", "de"]).
+# - Accelerators: tune `AcceleratorOptions` to select CPU/GPU or threads.
+# - Exports: JSON, plain text, Markdown, and doctags are saved in `scratch/`.
+
+# %%
 
 import json
 import logging
 import time
@@ -21,9 +57,8 @@ def main():
 
     ###########################################################################
 
-    # The following sections contain a combination of PipelineOptions
-    # and PDF Backends for various configurations.
-    # Uncomment one section at the time to see the differences in the output.
+    # The sections below demo combinations of PdfPipelineOptions and backends.
+    # Tip: Uncomment exactly one section at a time to compare outputs.
 
     # PyPdfium without EasyOCR
    # --------------------
@@ -68,8 +103,10 @@ def main():
    # }
    # )
 
-    # Docling Parse with EasyOCR
-    # ----------------------
+    # Docling Parse with EasyOCR (default)
+    # -------------------------------
+    # Enables OCR and table structure with EasyOCR, using automatic device
+    # selection via AcceleratorOptions. Adjust languages as needed.
     pipeline_options = PdfPipelineOptions()
     pipeline_options.do_ocr = True
     pipeline_options.do_table_structure = True
@@ -86,7 +123,7 @@ def main():
     )
 
     # Docling Parse with EasyOCR (CPU only)
-    # ----------------------
+    # -------------------------------------
    # pipeline_options = PdfPipelineOptions()
    # pipeline_options.do_ocr = True
    # pipeline_options.ocr_options.use_gpu = False  # <-- set this.
@@ -100,7 +137,7 @@ def main():
    # )
 
     # Docling Parse with Tesseract
-    # ----------------------
+    # ----------------------------
    # pipeline_options = PdfPipelineOptions()
    # pipeline_options.do_ocr = True
    # pipeline_options.do_table_structure = True
@@ -114,7 +151,7 @@ def main():
    # )
 
     # Docling Parse with Tesseract CLI
-    # ----------------------
+    # --------------------------------
    # pipeline_options = PdfPipelineOptions()
    # pipeline_options.do_ocr = True
    # pipeline_options.do_table_structure = True
@@ -127,8 +164,8 @@ def main():
    # }
    # )
 
-    # Docling Parse with ocrmac(Mac only)
-    # ----------------------
+    # Docling Parse with ocrmac (macOS only)
+    # --------------------------------------
    # pipeline_options = PdfPipelineOptions()
    # pipeline_options.do_ocr = True
    # pipeline_options.do_table_structure = True
@@ -154,13 +191,13 @@ def main():
     output_dir.mkdir(parents=True, exist_ok=True)
     doc_filename = conv_result.input.file.stem
 
-    # Export Deep Search document JSON format:
+    # Export Docling document JSON format:
     with (output_dir / f"{doc_filename}.json").open("w", encoding="utf-8") as fp:
         fp.write(json.dumps(conv_result.document.export_to_dict()))
 
-    # Export Text format:
+    # Export Text format (plain text via Markdown export):
     with (output_dir / f"{doc_filename}.txt").open("w", encoding="utf-8") as fp:
-        fp.write(conv_result.document.export_to_text())
+        fp.write(conv_result.document.export_to_markdown(strict_text=True))
 
     # Export Markdown format:
     with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
docs/examples/develop_formula_understanding.py

Lines changed: 20 additions & 3 deletions
@@ -1,6 +1,21 @@
-# WARNING
-# This example demonstrates only how to develop a new enrichment model.
-# It does not run the actual formula understanding model.
+# %% [markdown]
+# Developing an enrichment model example (formula understanding: scaffold only).
+#
+# What this example does
+# - Shows how to define pipeline options, an enrichment model, and extend a pipeline.
+# - Displays cropped images of formula items and yields them back unchanged.
+#
+# Important
+# - This is a development scaffold; it does not run a real formula understanding model.
+#
+# How to run
+# - From the repo root: `python docs/examples/develop_formula_understanding.py`.
+#
+# Notes
+# - Set `do_formula_understanding=True` to enable the example enrichment stage.
+# - Extends `StandardPdfPipeline` and keeps the backend when enrichment is enabled.
+
+# %%
 
 import logging
 from collections.abc import Iterable
@@ -42,6 +57,8 @@ def __call__(
             return
 
         for enrich_element in element_batch:
+            # Opens a window for each cropped formula image; comment this out when
+            # running headless or processing many items to avoid blocking spam.
            enrich_element.image.show()
 
            yield enrich_element.item

docs/examples/develop_picture_enrichment.py

Lines changed: 19 additions & 4 deletions
@@ -1,6 +1,21 @@
-# WARNING
-# This example demonstrates only how to develop a new enrichment model.
-# It does not run the actual picture classifier model.
+# %% [markdown]
+# Developing a picture enrichment model (classifier scaffold only).
+#
+# What this example does
+# - Demonstrates how to implement an enrichment model that annotates pictures.
+# - Adds a dummy PictureClassificationData entry to each PictureItem.
+#
+# Important
+# - This is a scaffold for development; it does not run a real classifier.
+#
+# How to run
+# - From the repo root: `python docs/examples/develop_picture_enrichment.py`.
+#
+# Notes
+# - Enables picture image generation and sets `images_scale` to improve crops.
+# - Extends `StandardPdfPipeline` with a custom enrichment stage.
+
+# %%
 
 import logging
 from collections.abc import Iterable
@@ -43,7 +58,7 @@ def __call__(
            assert isinstance(element, PictureItem)
 
            # uncomment this to interactively visualize the image
-           # element.get_image(doc).show()
+           # element.get_image(doc).show()  # may block; avoid in headless runs
 
            element.annotations.append(
                PictureClassificationData(
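Both develop_* examples share one shape: an enrichment stage that receives a batch of items, annotates (or just inspects) each, and yields it back unchanged otherwise. Stripped of Docling types, the scaffold looks roughly like this (all names here are illustrative, not Docling's API):

```python
from collections.abc import Iterable
from dataclasses import dataclass, field


@dataclass
class Item:
    label: str
    annotations: list = field(default_factory=list)


class DummyEnrichmentModel:
    # Mirrors the examples' contract: consume a batch, yield every item back,
    # optionally attaching annotations along the way.
    def is_processable(self, item: Item) -> bool:
        return item.label == "picture"

    def __call__(self, element_batch: Iterable[Item]) -> Iterable[Item]:
        for item in element_batch:
            if self.is_processable(item):
                item.annotations.append("dummy-classification")
            yield item
```

Because the model is a generator, the pipeline can stream batches through it without materializing the whole document's items at once.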

docs/examples/enrich_doclingdocument.py

Lines changed: 24 additions & 3 deletions
@@ -1,6 +1,26 @@
-## Enrich DoclingDocument
-# This example allows to run Docling enrichment models on documents which have been already converted
-# and stored as serialized DoclingDocument JSON files.
+# %% [markdown]
+# Enrich an existing DoclingDocument JSON with a custom model (post-conversion).
+#
+# What this example does
+# - Loads a previously converted DoclingDocument from JSON (no reconversion).
+# - Uses a backend to crop images for items and runs an enrichment model in batches.
+# - Prints a few example annotations to stdout.
+#
+# Prerequisites
+# - A DoclingDocument JSON produced by another conversion (path configured below).
+# - Install Docling and dependencies for the chosen enrichment model.
+# - Ensure the JSON and the referenced PDF match (same document/version), so
+#   provenance bounding boxes line up for accurate cropping.
+#
+# How to run
+# - From the repo root: `python docs/examples/enrich_doclingdocument.py`.
+# - Adjust `input_doc_path` and `input_pdf_path` if your data is elsewhere.
+#
+# Notes
+# - `BATCH_SIZE` controls how many elements are passed to the model at once.
+# - `prepare_element()` crops context around elements based on the model's expansion.
+
+# %%
 
 ### Load modules
 
@@ -24,6 +44,7 @@
 ### Define batch size used for processing
 
 BATCH_SIZE = 4
+# Trade-off: larger batches improve throughput but increase memory usage.
 
 ### From DocItem to the model inputs
 # The following function is responsible for taking an item and applying the required pre-processing for the model.
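The `BATCH_SIZE` mechanic — feeding the model fixed-size groups of elements — is plain iterator chunking. A sketch using only `itertools`, independent of Docling:

```python
from itertools import islice


def batched(iterable, batch_size):
    # Yield lists of up to `batch_size` items, as the enrichment loop does
    # with BATCH_SIZE = 4; the final batch may be smaller.
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch
```

Each yielded batch would then be handed to the enrichment model in one call, which is where the throughput-versus-memory trade-off noted above comes from.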
