Commit f8cf4b0

feat: Add NVIDIA nemotron-ocr as supported backend
Signed-off-by: Christoph Auer <cau@zurich.ibm.com>
1 parent ce49923 commit f8cf4b0

10 files changed

+366
-11
lines changed

.github/workflows/checks.yml

Lines changed: 5 additions & 5 deletions
@@ -50,7 +50,7 @@ jobs:
             pre-commit|${{ env.PY }}|
 
       - name: Install Python Dependencies
-        run: uv sync --frozen --all-extras
+        run: uv sync --frozen --extra easyocr --extra tesserocr --extra ocrmac --extra rapidocr --extra vlm --extra asr --extra xbrl --extra remote-serving
 
       - name: Check style
         run: |
@@ -92,7 +92,7 @@ jobs:
           python-version: ${{ matrix.python-version }}
 
       - name: Install Python Dependencies
-        run: uv sync --frozen --all-extras
+        run: uv sync --frozen --extra easyocr --extra tesserocr --extra ocrmac --extra rapidocr --extra vlm --extra asr --extra xbrl --extra remote-serving
 
       - name: Cache Models
         uses: actions/cache@v5
@@ -159,7 +159,7 @@ jobs:
           python-version: ${{ matrix.python-version }}
 
       - name: Install Python Dependencies
-        run: uv sync --frozen --all-extras
+        run: uv sync --frozen --extra easyocr --extra tesserocr --extra ocrmac --extra rapidocr --extra vlm --extra asr --extra xbrl --extra remote-serving
 
       - name: Cache Models
         uses: actions/cache@v5
@@ -231,7 +231,7 @@ jobs:
           python-version: ${{ matrix.python-version }}
 
       - name: Install Python Dependencies
-        run: uv sync --frozen --all-extras
+        run: uv sync --frozen --extra easyocr --extra tesserocr --extra ocrmac --extra rapidocr --extra vlm --extra asr --extra xbrl --extra remote-serving
 
       - name: Cache Models
         uses: actions/cache@v5
@@ -410,7 +410,7 @@ jobs:
           enable-cache: true
 
       - name: Install dependencies
-        run: uv sync --all-extras
+        run: uv sync --extra easyocr --extra tesserocr --extra ocrmac --extra rapidocr --extra vlm --extra asr --extra xbrl --extra remote-serving
 
       - name: Build package
         run: uv build

.github/workflows/pypi.yml

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ jobs:
           python-version: ${{ matrix.python-version }}
           enable-cache: true
       - name: Install dependencies
-        run: uv sync --all-extras
+        run: uv sync --extra easyocr --extra tesserocr --extra ocrmac --extra rapidocr --extra vlm --extra asr --extra xbrl --extra remote-serving
       - name: Build package
         run: uv build
       - name: Publish distribution 📦 to PyPI

docling/datamodel/pipeline_options.py

Lines changed: 45 additions & 1 deletion
@@ -158,7 +158,8 @@ class OcrOptions(BaseOptions):
     See Also:
         `OcrAutoOptions`: Automatic engine selection based on availability.
         `EasyOcrOptions`, `TesseractCliOcrOptions`, `TesseractOcrOptions`,
-        `RapidOcrOptions`, `OcrMacOptions`: Engine-specific configurations.
+        `RapidOcrOptions`, `OcrMacOptions`, `NemotronOcrOptions`: Engine-specific
+        configurations.
     """
 
     lang: Annotated[
@@ -322,6 +323,49 @@ class RapidOcrOptions(OcrOptions):
     )
 
 
+class NemotronOcrOptions(OcrOptions):
+    """Configuration for NVIDIA Nemotron OCR.
+
+    Notes:
+        Nemotron OCR does not expose runtime language selection through its public
+        API. The `lang` field is kept only for compatibility with the shared OCR
+        options interface.
+    """
+
+    kind: ClassVar[Literal["nemotron-ocr"]] = "nemotron-ocr"
+    lang: Annotated[
+        list[str],
+        Field(
+            description=(
+                "Reserved for interface compatibility. Nemotron OCR does not expose "
+                "runtime language selection through its public API."
+            )
+        ),
+    ] = []
+    model_dir: Annotated[
+        Optional[Path],
+        Field(
+            description=(
+                "Optional directory containing the Nemotron OCR checkpoint files "
+                "(`detector.pth`, `recognizer.pth`, `relational.pth`, `charset.txt`). "
+                "If omitted, the upstream package downloads them from Hugging Face."
+            )
+        ),
+    ] = None
+    merge_level: Annotated[
+        Literal["word", "sentence", "paragraph"],
+        Field(
+            description=(
+                "Granularity requested from Nemotron OCR. `word` is the default "
+                "because it maps most directly to Docling OCR cells."
+            )
+        ),
+    ] = "word"
+    model_config = ConfigDict(
+        extra="forbid",
+    )
+
+
 class EasyOcrOptions(OcrOptions):
     """Configuration for EasyOCR engine."""
 
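
The `model_dir` field above names four checkpoint files that a local directory is expected to contain. As a minimal sketch, a pre-flight check could confirm they are present before the path is handed to `NemotronOcrOptions`. The helper `missing_checkpoint_files` and the constant `REQUIRED_CHECKPOINT_FILES` are hypothetical names introduced here for illustration; they are not part of Docling or of the `nemotron_ocr` package.

```python
from pathlib import Path

# File names copied from the `model_dir` field description in this commit.
REQUIRED_CHECKPOINT_FILES = (
    "detector.pth",
    "recognizer.pth",
    "relational.pth",
    "charset.txt",
)


def missing_checkpoint_files(model_dir: Path) -> list[str]:
    """Return the required file names that are absent from `model_dir`.

    Hypothetical helper: it only checks that each file exists, not that
    its contents are a valid checkpoint.
    """
    return [
        name
        for name in REQUIRED_CHECKPOINT_FILES
        if not (model_dir / name).is_file()
    ]
```

An empty result would suggest the directory is complete; a non-empty one lists what still needs to be copied or downloaded.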

docling/models/plugins/defaults.py

Lines changed: 2 additions & 0 deletions
@@ -1,6 +1,7 @@
 def ocr_engines():
     from docling.models.stages.ocr.auto_ocr_model import OcrAutoModel
     from docling.models.stages.ocr.easyocr_model import EasyOcrModel
+    from docling.models.stages.ocr.nemotron_ocr_model import NemotronOcrModel
     from docling.models.stages.ocr.ocr_mac_model import OcrMacModel
     from docling.models.stages.ocr.rapid_ocr_model import RapidOcrModel
     from docling.models.stages.ocr.tesseract_ocr_cli_model import TesseractOcrCliModel
@@ -10,6 +11,7 @@ def ocr_engines():
         "ocr_engines": [
             OcrAutoModel,
             EasyOcrModel,
+            NemotronOcrModel,
             OcrMacModel,
             RapidOcrModel,
             TesseractOcrModel,

docling/models/stages/ocr/nemotron_ocr_model.py

Lines changed: 208 additions & 0 deletions
@@ -0,0 +1,208 @@
+import logging
+import platform
+import sys
+from collections.abc import Iterable, Sequence
+from pathlib import Path
+from typing import Optional, Type, TypedDict, cast
+
+import numpy
+from docling_core.types.doc import BoundingBox, CoordOrigin
+from docling_core.types.doc.page import BoundingRectangle, TextCell
+
+from docling.datamodel.accelerator_options import AcceleratorOptions
+from docling.datamodel.base_models import Page
+from docling.datamodel.document import ConversionResult
+from docling.datamodel.pipeline_options import (
+    NemotronOcrOptions,
+    OcrOptions,
+)
+from docling.datamodel.settings import settings
+from docling.models.base_ocr_model import BaseOcrModel
+from docling.utils.accelerator_utils import decide_device
+from docling.utils.profiling import TimeRecorder
+
+_log = logging.getLogger(__name__)
+
+
+class NemotronOcrPrediction(TypedDict):
+    """Exact prediction schema returned by `nemotron_ocr` 1.0.1."""
+
+    text: str
+    confidence: float
+    left: float
+    upper: float
+    right: float
+    lower: float
+
+
+class NemotronOcrModel(BaseOcrModel):
+    def __init__(
+        self,
+        enabled: bool,
+        artifacts_path: Optional[Path],
+        options: NemotronOcrOptions,
+        accelerator_options: AcceleratorOptions,
+    ):
+        super().__init__(
+            enabled=enabled,
+            artifacts_path=artifacts_path,
+            options=options,
+            accelerator_options=accelerator_options,
+        )
+        self.options: NemotronOcrOptions
+        self.scale = 3  # multiplier for 72 dpi == 216 dpi.
+
+        if self.enabled:
+            self._validate_runtime(accelerator_options=accelerator_options)
+
+            try:
+                from nemotron_ocr.inference.pipeline import NemotronOCR
+            except ImportError as exc:
+                raise ImportError(
+                    "Nemotron OCR is not installed. Install the optional dependency "
+                    'via `pip install "docling[nemotron-ocr]"` on Linux x86_64 with '
+                    "Python 3.12 and CUDA 13.x."
+                ) from exc
+
+            model_dir = (
+                str(self.options.model_dir)
+                if self.options.model_dir is not None
+                else None
+            )
+            self.reader = NemotronOCR(model_dir=model_dir)
+
+    @staticmethod
+    def _fail_runtime(message: str) -> None:
+        _log.warning(message)
+        raise RuntimeError(message)
+
+    @classmethod
+    def _validate_runtime(cls, accelerator_options: AcceleratorOptions) -> None:
+        if sys.platform != "linux":
+            cls._fail_runtime("Nemotron OCR is only supported on Linux.")
+
+        if platform.machine() != "x86_64":
+            cls._fail_runtime("Nemotron OCR is only supported on x86_64 machines.")
+
+        if sys.version_info[:2] != (3, 12):
+            cls._fail_runtime("Nemotron OCR requires Python 3.12.")
+
+        requested_device = decide_device(accelerator_options.device)
+        if not requested_device.startswith("cuda"):
+            cls._fail_runtime(
+                "Nemotron OCR requires a CUDA accelerator. Set "
+                "`pipeline_options.accelerator_options.device` to CUDA or AUTO on a "
+                "CUDA-enabled machine."
+            )
+
+        import torch
+
+        if not torch.cuda.is_available():
+            cls._fail_runtime(
+                "Nemotron OCR requires CUDA at initialization time, but "
+                "`torch.cuda.is_available()` is false."
+            )
+
+        cuda_version = torch.version.cuda
+        if cuda_version is None or not cuda_version.startswith("13."):
+            cls._fail_runtime(
+                "Nemotron OCR requires CUDA 13.x, but the current PyTorch runtime "
+                f"reports CUDA {cuda_version!r}."
+            )
+
+    @staticmethod
+    def _prediction_to_cell(
+        prediction: NemotronOcrPrediction,
+        index: int,
+        ocr_rect: BoundingBox,
+        image_width: int,
+        image_height: int,
+        scale: int,
+    ) -> TextCell:
+        # `nemotron_ocr` 1.0.1 returns normalized `left/right` and an inverted
+        # pair `lower/upper`, where `lower` is the top Y and `upper` is the
+        # bottom Y in image coordinates.
+        left = (prediction["left"] * image_width) / scale + ocr_rect.l
+        top = (prediction["lower"] * image_height) / scale + ocr_rect.t
+        right = (prediction["right"] * image_width) / scale + ocr_rect.l
+        bottom = (prediction["upper"] * image_height) / scale + ocr_rect.t
+        text = prediction["text"]
+
+        return TextCell(
+            index=index,
+            text=text,
+            orig=text,
+            from_ocr=True,
+            confidence=float(prediction["confidence"]),
+            rect=BoundingRectangle.from_bounding_box(
+                BoundingBox(
+                    l=left,
+                    t=top,
+                    r=right,
+                    b=bottom,
+                    coord_origin=CoordOrigin.TOPLEFT,
+                )
+            ),
+        )
+
+    def __call__(
+        self, conv_res: ConversionResult, page_batch: Iterable[Page]
+    ) -> Iterable[Page]:
+        if not self.enabled:
+            yield from page_batch
+            return
+
+        for page in page_batch:
+            assert page._backend is not None
+            if not page._backend.is_valid():
+                yield page
+            else:
+                with TimeRecorder(conv_res, "ocr"):
+                    ocr_rects = self.get_ocr_rects(page)
+
+                    all_ocr_cells = []
+                    for ocr_rect in ocr_rects:
+                        if ocr_rect.area() == 0:
+                            continue
+
+                        high_res_image = page._backend.get_page_image(
+                            scale=self.scale, cropbox=ocr_rect
+                        )
+                        image_width, image_height = high_res_image.size
+                        image_array = numpy.array(high_res_image)
+
+                        raw_predictions = cast(
+                            Sequence[NemotronOcrPrediction],
+                            self.reader(
+                                image_array,
+                                merge_level=self.options.merge_level,
+                                visualize=False,
+                            ),
+                        )
+
+                        del high_res_image
+                        del image_array
+
+                        cells = [
+                            self._prediction_to_cell(
+                                prediction=prediction,
+                                index=index,
+                                ocr_rect=ocr_rect,
+                                image_width=image_width,
+                                image_height=image_height,
+                                scale=self.scale,
+                            )
+                            for index, prediction in enumerate(raw_predictions)
+                        ]
+                        all_ocr_cells.extend(cells)
+
+                    self.post_process_cells(all_ocr_cells, page)
+
+                if settings.debug.visualize_ocr:
+                    self.draw_ocr_rects_and_cells(conv_res, page, ocr_rects)
+
+                yield page
+
+    @classmethod
+    def get_options_type(cls) -> Type[OcrOptions]:
+        return NemotronOcrOptions
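
The coordinate handling in `_prediction_to_cell` is the least obvious part of this file: predictions are normalized to the rendered crop, the crop is rendered at `scale` times the page resolution, and the `lower`/`upper` keys are inverted relative to their names. The following stdlib-only sketch reproduces just that arithmetic so it can be checked by hand. The names `Prediction` and `to_page_bbox` are hypothetical and exist only for this illustration; plain floats stand in for Docling's `BoundingBox`.

```python
from typing import TypedDict


class Prediction(TypedDict):
    """Subset of the prediction fields that define the bounding box."""

    left: float
    upper: float
    right: float
    lower: float


def to_page_bbox(
    prediction: Prediction,
    image_width: int,
    image_height: int,
    scale: int,
    rect_left: float,
    rect_top: float,
) -> tuple[float, float, float, float]:
    """Map normalized crop coordinates to (l, t, r, b) in page coordinates.

    Mirrors the arithmetic in the diff above: multiply by the image size to
    get pixels, divide by `scale` to get page units, then shift by the
    top-left corner of the OCR rectangle. Note that `lower` is the TOP edge
    and `upper` is the BOTTOM edge.
    """
    left = prediction["left"] * image_width / scale + rect_left
    top = prediction["lower"] * image_height / scale + rect_top
    right = prediction["right"] * image_width / scale + rect_left
    bottom = prediction["upper"] * image_height / scale + rect_top
    return left, top, right, bottom


# A 100 x 50 crop whose top-left corner sits at (20, 40) on the page,
# rendered at scale 3, yields a 300 x 150 pixel image.
example = Prediction(left=0.25, upper=0.75, right=0.75, lower=0.5)
print(to_page_bbox(example, 300, 150, 3, rect_left=20.0, rect_top=40.0))
# → (45.0, 65.0, 95.0, 77.5)
```

With a top-left origin, the resulting `top` (65.0) is smaller than `bottom` (77.5), which is the ordering the `CoordOrigin.TOPLEFT` box in the diff expects.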

docs/examples/full_page_ocr.py

Lines changed: 4 additions & 1 deletion
@@ -30,6 +30,7 @@
 
 from docling.datamodel.base_models import InputFormat
 from docling.datamodel.pipeline_options import (
+    NemotronOcrOptions,
     PdfPipelineOptions,
     TableStructureOptions,
     TesseractCliOcrOptions,
@@ -49,8 +50,10 @@ def main():
     )
 
     # Any of the OCR options can be used: EasyOcrOptions, TesseractOcrOptions,
-    # TesseractCliOcrOptions, OcrMacOptions (macOS only), RapidOcrOptions
+    # TesseractCliOcrOptions, OcrMacOptions (macOS only), RapidOcrOptions,
+    # NemotronOcrOptions (Linux x86_64, Python 3.12, CUDA 13.x only)
     # ocr_options = EasyOcrOptions(force_full_page_ocr=True)
+    # ocr_options = NemotronOcrOptions(force_full_page_ocr=True)
     # ocr_options = TesseractOcrOptions(force_full_page_ocr=True)
     # ocr_options = OcrMacOptions(force_full_page_ocr=True)
     # ocr_options = RapidOcrOptions(force_full_page_ocr=True)

docs/getting_started/installation.md

Lines changed: 14 additions & 1 deletion
@@ -53,6 +53,7 @@ The following table summarizes the extras available in the `docling` package. Th
 | `asr` | Installs dependencies for running the ASR pipeline. |
 | `vlm` | Installs dependencies for running the VLM pipeline. |
 | `easyocr` | Installs the [EasyOCR](https://github.com/JaidedAI/EasyOCR) OCR engine. |
+| `nemotron-ocr` | Installs NVIDIA Nemotron OCR. Supported only on Linux x86_64 with Python 3.12 and CUDA 13.x. |
 | `tesserocr` | Installs the Tesseract binding for using it as OCR engine. |
 | `ocrmac` | Installs the OcrMac OCR engine. |
 | `rapidocr` | Installs the [RapidOCR](https://github.com/RapidAI/RapidOCR) OCR engine with [onnxruntime](https://github.com/microsoft/onnxruntime/) backend. |
@@ -67,6 +68,7 @@ the following engines.
 | Engine | Installation | Usage |
 | ------ | ------------ | ----- |
 | [EasyOCR](https://github.com/JaidedAI/EasyOCR) | `easyocr` extra or via `pip install easyocr`. | `EasyOcrOptions` |
+| [Nemotron OCR](https://huggingface.co/nvidia/nemotron-ocr-v1) | `nemotron-ocr` extra. Supported only on Linux x86_64 with Python 3.12 and CUDA 13.x. | `NemotronOcrOptions` |
 | Tesseract | System dependency. See description for Tesseract and Tesserocr below. | `TesseractOcrOptions` |
 | Tesseract CLI | System dependency. See description below. | `TesseractCliOcrOptions` |
 | OcrMac | System dependency. See description below. | `OcrMacOptions` |
@@ -141,5 +143,16 @@ doc_converter = DocumentConverter(
 To develop Docling features, bugfixes etc., install as follows from your local clone's root dir:
 
 ```bash
-uv sync --all-extras
+uv sync \
+  --extra asr \
+  --extra easyocr \
+  --extra ocrmac \
+  --extra rapidocr \
+  --extra remote-serving \
+  --extra tesserocr \
+  --extra vlm \
+  --extra xbrl
 ```
+
+The `nemotron-ocr` extra is intentionally excluded from the default development
+setup because it is only usable on Linux x86_64 with Python 3.12 and CUDA 13.x.
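
The platform constraint stated above (Linux, x86_64, Python 3.12, CUDA 13.x) can be probed before installing the extra. The sketch below is a stdlib-only approximation of the checks this commit adds in `_validate_runtime`, with two deliberate differences: it collects every unmet requirement instead of raising on the first, and it takes the CUDA version as a plain argument rather than reading `torch.version.cuda`. The function name `nemotron_runtime_problems` is hypothetical and not part of Docling.

```python
import platform
import sys
from typing import Optional


def nemotron_runtime_problems(
    sys_platform: str,
    machine: str,
    python_version: tuple[int, int],
    cuda_version: Optional[str],
) -> list[str]:
    """Return a message for each requirement the given environment misses."""
    problems = []
    if sys_platform != "linux":
        problems.append("requires Linux")
    if machine != "x86_64":
        problems.append("requires an x86_64 machine")
    if python_version != (3, 12):
        problems.append("requires Python 3.12")
    # Same prefix test as the commit: only CUDA 13.x versions are accepted.
    if cuda_version is None or not cuda_version.startswith("13."):
        problems.append("requires CUDA 13.x")
    return problems


if __name__ == "__main__":
    # The CUDA version is passed as None because detecting it needs PyTorch,
    # which this sketch does not import. Substitute the real value if known.
    found = nemotron_runtime_problems(
        sys.platform, platform.machine(), sys.version_info[:2], None
    )
    print(found or "no problems found")
```

Passing the environment in as arguments keeps the function easy to exercise for platforms other than the one it runs on.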
