| 1 | +# Docling Conversion Tutorials |
| 2 | + |
| 3 | +A collection of tutorials and techniques for document processing and parsing using [Docling](https://docling-project.github.io/docling/) and related tools. |
| 4 | + |
| 5 | +This repository provides a curated set of opinionated _conversion profiles_ that, based on our experience, effectively address some of the common problems users face when preparing and parsing documents for AI ingestion. Our goal is to help users achieve great results without needing to dive deep into all of Docling’s options and features. |
| 6 | + |
| 7 | +Most examples focus on PDF to Markdown parsing, but can be easily adapted for other input and output formats. |
| 8 | + |
| 9 | +## Installation |
| 10 | + |
| 11 | +Please refer to the [documentation](https://docling-project.github.io/docling/installation/) for official installation instructions, but in most cases, the Docling CLI can be easily installed with: |
| 12 | + |
| 13 | +```bash |
| 14 | +$ pip install docling |
| 15 | +``` |
| 16 | + |
| 17 | +If you prefer a web UI, [Docling Serve](https://github.com/docling-project/docling-serve) can be used with the same options available in the CLI. Install and run it with: |
| 18 | + |
| 19 | +```bash |
| 20 | +$ pip install "docling-serve[ui]" |
| 21 | +$ docling-serve run --enable-ui |
| 22 | +``` |
| 23 | + |
| 24 | +## Document parsing |
| 25 | + |
| 26 | +### Standard settings |
| 27 | + |
| 28 | +Most of the settings in the example below are already the **defaults** and produce good results quickly for most documents. Images are embedded in the output document as Base64, and OCR is applied only to bitmap content. The example also selects the newer `dlparse_v4` PDF backend. |
| 29 | + |
| 30 | +```bash |
| 31 | +$ docling /path/to/document.pdf \ |
| 32 | + --to md \ |
| 33 | + --pdf-backend dlparse_v4 \ |
| 34 | + --image-export-mode embedded \ |
| 35 | + --ocr \ |
| 36 | + --ocr-engine easyocr \ |
| 37 | + --table-mode accurate |
| 38 | +``` |
| 39 | + |
| 40 | +A Python version of this conversion technique is available in [standard_settings.py](./standard_settings.py). |
| 41 | + |
| 42 | +### Force OCR |
| 43 | + |
| 44 | +Depending on how a PDF document was created, the backend might not be able to parse its layers and contents effectively. This can happen even with documents that appear to contain only text. In these cases, **forcing OCR** on the entire document usually produces better results. |
| 45 | + |
| 46 | +```bash |
| 47 | +$ docling /path/to/document.pdf \ |
| 48 | + --to md \ |
| 49 | + --pdf-backend dlparse_v4 \ |
| 50 | + --image-export-mode embedded \ |
| 51 | + --force-ocr \ |
| 52 | + --ocr-engine easyocr \ |
| 53 | + --table-mode accurate |
| 54 | +``` |
| 55 | + |
| 56 | +A Python version of this conversion technique is available in [force_ocr.py](./force_ocr.py). |
| 57 | + |
| 58 | +### Enrichment |
| 59 | + |
| 60 | +Documents with many **code blocks**, **images**, or **formulas** can be parsed using _enriched_ conversion pipelines, which run additional models tailored to these types of content. Note that these options may increase processing time. |
| 61 | + |
| 62 | +#### Code blocks and formulas |
| 63 | + |
| 64 | +For documents heavy on **code blocks** and **formulas**, use: |
| 65 | + |
| 66 | +```bash |
| 67 | +$ docling /path/to/document.pdf \ |
| 68 | + --to md \ |
| 69 | + --pdf-backend dlparse_v4 \ |
| 70 | + --image-export-mode embedded \ |
| 71 | + --ocr \ |
| 72 | + --ocr-engine easyocr \ |
| 73 | + --table-mode accurate \ |
| 74 | + --device auto \ |
| 75 | + --enrich-code \ |
| 76 | + --enrich-formula |
| 77 | +``` |
| 78 | + |
| 79 | +#### Image classification and description |
| 80 | + |
| 81 | +For documents heavy on **images**, the _picture classification_ step identifies the class of each picture found in the document (chart types, flow diagrams, logos, signatures, and so on), while the _picture description_ step annotates (captions) pictures using a vision model. |
| 82 | + |
| 83 | +```bash |
| 84 | +$ docling /path/to/document.pdf \ |
| 85 | + --to md \ |
| 86 | + --pdf-backend dlparse_v4 \ |
| 87 | + --image-export-mode embedded \ |
| 88 | + --ocr \ |
| 89 | + --ocr-engine easyocr \ |
| 90 | + --table-mode accurate \ |
| 91 | + --device auto \ |
| 92 | + --enrich-picture-classes \ |
| 93 | + --enrich-picture-description |
| 94 | +``` |
| 95 | + |
| 96 | +A Python version of this conversion technique is available in [enrichment.py](./enrichment.py). |
| 97 | + |
| 98 | +### VLM |
| 99 | + |
| 100 | +Docling supports the use of VLMs (Visual Language Models), which can be a good choice in cases where the previous conversion profiles didn't produce good results. The Docling team provides [SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview), a small and fast model specifically targeted at document conversion. Other models can also be used, like [Granite Vision](https://huggingface.co/ibm-granite/granite-vision-3.1-2b-preview). |
| 101 | + |
| 102 | +```bash |
| 103 | +$ docling /path/to/document.pdf \ |
| 104 | + --to md \ |
| 105 | + --pipeline vlm \ |
| 106 | + --vlm-model smoldocling \ |
| 107 | + --device auto \ |
| 108 | + --table-mode accurate |
| 109 | +``` |
| 110 | + |
| 111 | +A Python version of this conversion technique is available in [vlm.py](./vlm.py). |
| 112 | + |
| 113 | +Note that using a VLM significantly increases processing time, so running it on a GPU is strongly recommended. If you know which accelerator is available on the machine running the conversion, consider passing `cuda` or `mps` explicitly in the `--device` option instead of `auto`. |
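
The conversion profiles above differ only in a handful of CLI flags, so they can be collected in one place and selected by name. The following is a minimal sketch of that idea, not part of Docling itself: the `PROFILES` table, the profile names, and the `build_docling_command` helper are hypothetical, while the flags are copied from the examples in this README. The `--device` flag is added only when a device is given; pass `device="auto"` to reproduce the enrichment and VLM examples.

```python
import shlex

# Flags shared by the standard-pipeline examples in this README.
BASE_FLAGS = [
    "--to", "md",
    "--pdf-backend", "dlparse_v4",
    "--image-export-mode", "embedded",
    "--ocr-engine", "easyocr",
    "--table-mode", "accurate",
]

# Hypothetical profile names; the flags are taken from the examples above.
PROFILES = {
    "standard": BASE_FLAGS + ["--ocr"],
    "force_ocr": BASE_FLAGS + ["--force-ocr"],
    "code_formula": BASE_FLAGS + ["--ocr", "--enrich-code", "--enrich-formula"],
    "pictures": BASE_FLAGS + [
        "--ocr",
        "--enrich-picture-classes",
        "--enrich-picture-description",
    ],
    "vlm": [
        "--to", "md",
        "--pipeline", "vlm",
        "--vlm-model", "smoldocling",
        "--table-mode", "accurate",
    ],
}


def build_docling_command(profile, source, device=None):
    """Return the docling CLI invocation for a profile as an argument list."""
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    cmd = ["docling", source, *PROFILES[profile]]
    if device is not None:
        # e.g. "auto", "cuda" or "mps", as discussed in the VLM section.
        cmd += ["--device", device]
    return cmd


if __name__ == "__main__":
    # Print the command in a form that can be pasted into a shell.
    cmd = build_docling_command("vlm", "/path/to/document.pdf", device="cuda")
    print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` on a machine where the `docling` CLI is installed.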