Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
✅ DCO Check Passed
Thanks @cau-git, all your commits are properly signed off. 🎉
Force-pushed from 67c74b6 to b200c65.
Related Documentation
4 document(s) may need updating based on files changed in this PR:
Docling
Can I use a custom OCR model in Docling, and how do I set its path in the pipeline options?
@@ -1,4 +1,4 @@
-Yes, you can use a custom OCR model in Docling, but you must configure it according to the OCR engine you choose (EasyOCR, Tesseract, or RapidOCR). Each engine has its own options class where you can specify custom model or binary paths:
+Yes, you can use a custom OCR model in Docling, but you must configure it according to the OCR engine you choose (EasyOCR, Tesseract, RapidOCR, or Nemotron OCR). Each engine has its own options class where you can specify custom model or binary paths:
- **EasyOCR:** Set `model_storage_directory` in `EasyOcrOptions` to your custom model directory.
```python
@@ -23,10 +23,32 @@
)
```
- **RapidOCR:** Set `det_model_path`, `cls_model_path`, `rec_model_path`, etc., in `RapidOcrOptions` to your custom model files.
+- **Nemotron OCR:** Use `artifacts_path` at the pipeline level in `PdfPipelineOptions` to point to pre-downloaded checkpoints. Nemotron OCR is configured using `NemotronOcrOptions`.
+ ```python
+ from docling.datamodel.pipeline_options import PdfPipelineOptions, NemotronOcrOptions
+ pipeline_options = PdfPipelineOptions(
+ do_ocr=True,
+ ocr_options=NemotronOcrOptions(force_full_page_ocr=True),
+ artifacts_path="/path/to/pre-downloaded/nemotron/artifacts"
+ )
+ ```
+ **Requirements:** Linux x86_64, Python 3.12, and CUDA 13.x.
+
+ **Installation:**
+ ```bash
+ pip install "docling[nemotron-ocr]" \
+ --extra-index-url https://download.pytorch.org/whl/cu130 \
+ --index-strategy unsafe-best-match
+ ```
+
+ **Pre-download models:**
+ ```bash
+ docling-tools models download nemotron_ocr
+ ```
You then pass `pipeline_options` to your `DocumentConverter` as usual.
-If you want to use a completely custom OCR engine (not EasyOCR, Tesseract, or RapidOCR), you need to implement a plugin following Docling's plugin system. See [this example and discussion](https://github.com/docling-project/docling/issues/1502) for details.
+If you want to use a completely custom OCR engine (not EasyOCR, Tesseract, RapidOCR, or Nemotron OCR), you need to implement a plugin following Docling's plugin system. See [this example and discussion](https://github.com/docling-project/docling/issues/1502) for details.
For all engines, if your custom model files are not in the default locations, ensure your directory structure matches what the engine expects, and set the `DOCLING_SERVE_ARTIFACTS_PATH` environment variable to the parent directory containing all model subfolders if you want Docling to use local models for offline use. See [this guide](https://github.com/docling-project/docling/discussions/2724) for details.
What are the best OCR options for extracting complex financial tables from PDFs using Docling, including external alternatives and plugin integration?
@@ -1,6 +1,8 @@
For extracting complex financial tables from PDFs with Docling, the best OCR options are:
-## 1. EasyOCR (Built into Docling)
+## 1. OCR Engines Built into Docling
+
+### EasyOCR
- **EasyOCR** is significantly more accurate than RapidOCR for financial documents with numeric tables, especially in Portuguese.
- Recommended configuration:
```python
@@ -23,6 +25,35 @@
```
- EasyOCR is slower (~9 s/page) than RapidOCR (~6 s/page), but delivers better accuracy on numeric tables.
- [Reference: Docling community discussion](https://github.com/docling-project/docling/discussions/1401#discussioncomment-12859424).
+
+### Nemotron OCR (NVIDIA)
+- **Nemotron OCR** is an NVIDIA OCR engine available in Docling, configured with `NemotronOcrOptions`.
+- **Platform requirements:**
+ - Supported only on Linux x86_64
+ - Requires Python 3.12
+ - Requires CUDA 13.x (GPU acceleration)
+- **Installation:**
+ ```bash
+ pip install "docling[nemotron-ocr]" \
+ --extra-index-url https://download.pytorch.org/whl/cu130 \
+ --index-strategy unsafe-best-match
+ ```
+- **Configurable granularity levels** via the `merge_level` parameter (word, sentence, paragraph):
+ ```python
+ from docling.datamodel.base_models import InputFormat
+ from docling.datamodel.pipeline_options import PdfPipelineOptions, NemotronOcrOptions
+ from docling.document_converter import DocumentConverter, PdfFormatOption
+
+ pipeline_options = PdfPipelineOptions()
+ pipeline_options.do_ocr = True
+ pipeline_options.ocr_options = NemotronOcrOptions(
+ merge_level="word" # options: "word", "sentence", "paragraph"
+ )
+
+ doc_converter = DocumentConverter(
+ format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
+ )
+ ```
## 2. External OCR via Plugins
- Docling supports external OCR integration through its plugin system, allowing the use of services such as Azure Document Intelligence or AWS Textract.
@@ -61,4 +92,4 @@
)
```
-**Summary:** For maximum accuracy on financial tables, use EasyOCR (built in), external plugins (such as Azure Document Intelligence), or VLMs supported by Docling. To integrate external OCR, use Docling's plugin system or post-process the table regions.
+**Summary:** For maximum accuracy on financial tables, use EasyOCR or Nemotron OCR (built in), external plugins (such as Azure Document Intelligence), or VLMs supported by Docling. To integrate external OCR, use Docling's plugin system or post-process the table regions.
What are all the pipelines that exist in Docling, including their purposes, selection criteria, and how they handle scanned documents?
@@ -8,7 +8,7 @@
- **Manual Only:** No (default for relevant formats).
- **Strengths:** Always provides bounding boxes, well-tested, faster due to threading.
- **Limitations:** Traditional OCR may struggle with complex layouts.
- - **Scanned Documents:** This is the default pipeline for scanned documents, using OCR (EasyOCR by default, with options for Tesseract or RapidOCR).
+ - **Scanned Documents:** This is the default pipeline for scanned documents, using OCR (EasyOCR by default, with options for Tesseract, RapidOCR, or Nemotron OCR). Note: Nemotron OCR requires Linux x86_64, Python 3.12, and CUDA 13.x (GPU acceleration).
2. **SimplePipeline**
- **Purpose:** Direct conversion for structured formats.
What are the detailed pipeline options and processing behaviors for PDF, DOCX, PPTX, and XLSX files in the Python SDK?
@@ -6,7 +6,7 @@
- `pdf_backend`: Allowed values: `pypdfium2`, `docling_parse`, `dlparse_v1`, `dlparse_v2`, `dlparse_v4` (default: `docling_parse`)
- `do_ocr` (default True): Use OCR
- `force_ocr`: Replace existing text with OCR-generated text
- - `ocr_engine`, `ocr_lang`: OCR engine and language options
+ - `ocr_engine`, `ocr_lang`: OCR engine and language options. Available OCR engines include `EasyOcrOptions`, `TesseractOcrOptions`, `TesseractCliOcrOptions`, `OcrMacOptions`, `RapidOcrOptions`, and `NemotronOcrOptions` (see OCR Options section below for details)
- `image_export_mode`: `placeholder`, `embedded`, `referenced`
- `do_table_structure`, `table_mode`, `table_cell_matching`: Table extraction options (see Table Structure Models section below for details on TableFormer V1 and V2)
- `do_code_enrichment`, `do_formula_enrichment`: Code/formula recognition
@@ -54,6 +54,54 @@
- **CLI Model Download**: `docling models download --model tableformerv2`
**Usage Note**: The `table_structure_custom_config` option in `PdfPipelineOptions` can be used to specify custom model configurations for either TableFormer V1 or V2.
+
+---
+
+### OCR Options
+
+Docling supports multiple OCR engines for text extraction from PDF documents. Each engine has specific configuration options and platform requirements:
+
+#### EasyOcrOptions
+
+- **Engine**: [EasyOCR](https://github.com/JaidedAI/EasyOCR)
+- **Installation**: `pip install easyocr` or `pip install "docling[easyocr]"`
+- **Platform Support**: Cross-platform
+
+#### TesseractOcrOptions
+
+- **Engine**: Tesseract via Python bindings
+- **Installation**: System dependency required
+- **Platform Support**: Cross-platform
+
+#### TesseractCliOcrOptions
+
+- **Engine**: Tesseract via command-line interface
+- **Installation**: System dependency required
+- **Platform Support**: Cross-platform
+
+#### OcrMacOptions
+
+- **Engine**: macOS native OCR
+- **Platform Support**: macOS only
+
+#### RapidOcrOptions
+
+- **Engine**: [RapidOCR](https://github.com/RapidAI/RapidOCR)
+- **Installation**: `pip install "docling[rapidocr]"`
+- **Platform Support**: Cross-platform
+
+#### NemotronOcrOptions
+
+- **Engine**: [NVIDIA Nemotron OCR](https://huggingface.co/nvidia/nemotron-ocr-v1) - GPU-based OCR engine
+- **Installation**: `pip install "docling[nemotron-ocr]" --extra-index-url https://download.pytorch.org/whl/cu130 --index-strategy unsafe-best-match`
+- **Platform Support**: Linux x86_64 only
+- **Requirements**:
+ - Python 3.12
+ - CUDA 13.x
+- **Key Options**:
+ - `merge_level`: Controls the granularity of OCR output. Options are `"word"` (default), `"sentence"`, or `"paragraph"`. The `"word"` level maps most directly to Docling OCR cells.
+ - `lang`: Reserved for interface compatibility. Nemotron OCR does not expose runtime language selection through its public API.
+- **Notes**: The `lang` parameter is kept for interface compatibility but is not functional with Nemotron OCR.
---
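The platform requirements listed for `NemotronOcrOptions` can be checked before selecting the engine. The following is a minimal sketch using only the Python standard library; the function name `nemotron_platform_ok` is hypothetical and not part of Docling, and the CUDA 13.x requirement is deliberately not checked here.

```python
import platform
import sys


def nemotron_platform_ok(system=None, machine=None, version=None):
    """Check the stated OS, architecture, and Python requirements.

    Hypothetical helper, not a Docling API. CUDA is not inspected.
    Arguments default to the values of the running interpreter.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    version = tuple(version or sys.version_info[:2])
    return system == "Linux" and machine == "x86_64" and version == (3, 12)


if __name__ == "__main__":
    print("Nemotron OCR platform requirements met:", nemotron_platform_ok())
```

Passing the values as arguments keeps the check easy to exercise on machines that do not meet the requirements.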
Two observations. Auto OCR
Need for
What do you mean by "not having the extra"? Without the extra, this will not work. It needs to install special versions of PyTorch, which we generally do not want to downgrade to (not even on Linux x86_64 with Python 3.12).
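One way to make the dependency on the extra explicit is to check for the optional package before choosing the engine. This is a minimal sketch, assuming the extra provides an importable module; the module name `nemotron_ocr` and the helper `has_module` are assumptions for illustration and are not confirmed by this PR.

```python
from importlib.util import find_spec


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(name) is not None


# Hypothetical module name; the package installed by the extra may differ.
use_nemotron = has_module("nemotron_ocr")
print("Nemotron OCR extra available:", use_nemotron)
```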
Below is the current MD output of the DocLayNet paper (2206.01062.pdf), which exposes many small transcription errors. We need to understand why they occur.
Summary
Adds NVIDIA Nemotron OCR as a supported OCR backend in Docling.
What changed
- `NemotronOcrModel` and `NemotronOcrOptions`
- `nemotron-ocr` extra, including CUDA 13-specific `torch`/`torchvision` pins
Checklist: