Reset OCR engine to tesseract by alimaredia · Pull Request #78 · opendatahub-io/data-processing

alimaredia · 2026-01-08T18:44:02Z

Description

Set the OCR engine to Tesseract. Tesseract OCR relies on TesseractCliOcrOptions being set, not TesseractOcrOptions as was done previously.

Since quay.io/amaredia/aipcc-docling-image does not have all of the required Tesseract RPM packages installed, the default base image must be reverted to quay.io/fabianofranz/docling-ubi9:2.54.0.

How Has This Been Tested?

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

Chores
- Default OCR engine updated from easyocr to tesseract_cli across standard workflows and pipeline inputs.
- Standard model package refined to exclude easyocr; standard downloads now include layout and tableformer only.
- Base/container images upgraded to newer UBI9 Python and updated Docling images for improved runtime compatibility.


coderabbitai · 2026-01-08T18:44:13Z

Walkthrough

Default OCR engine changed from "easyocr" to "tesseract_cli"; the standard model download no longer includes easyocr (now layout + tableformer); base image constants updated to Red Hat UBI and FabianoFranz docling images.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Model download config: `kubeflow-pipelines/common/components.py`, `kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml`, `kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml` | Removed easyocr from the standard download set: `with_easyocr` changed from `True` → `False`; standard path now downloads layout and tableformer only; VL(M) paths adjusted to explicit model flags. |
| Base image constants: `kubeflow-pipelines/common/constants.py` | Updated defaults: `PYTHON_BASE_IMAGE` → `registry.access.redhat.com/ubi9/python-311:9.6-1755074620`; `DOCLING_BASE_IMAGE` → `quay.io/fabianofranz/docling-ubi9:2.54.0`. |
| OCR engine defaults & docs: `kubeflow-pipelines/docling-standard/standard_components.py`, `kubeflow-pipelines/docling-standard/standard_convert_pipeline.py`, `kubeflow-pipelines/docling-standard/README.md` | Changed default `ocr_engine` / `docling_ocr_engine` from `"easyocr"` → `"tesseract_cli"` in function signatures, pipeline parameters, and README. |
| Compiled pipeline manifests / wiring: `kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml`, `kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml` | Updated public input defaults for `docling_ocr_engine` to `tesseract_cli`; replaced container image references across many executors to the new UBI/docling images; adjusted model-selection flags and parameter wiring to reflect easyocr removal. |

Sequence Diagram(s)

(omitted — changes are defaults/config and manifest updates without new multi-component control flow)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibble bytes and hop through code,

Tesseract now leads the OCR road.
Easy gone, layout and table stay,
New images trotted in today—
A joyful hop for every load.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Reset OCR engine to tesseract' directly reflects the main change across the pull request, which systematically updates OCR engine defaults from 'easyocr' to 'tesseract_cli' across multiple files and updates related base images. |


coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

kubeflow-pipelines/docling-standard/standard_components.py (1)

119-125: Add missing "auto" OCR engine to ocr_engine_map.

The OcrEngine enum in docling v2.54.0 includes an AUTO engine option that is not present in the ocr_engine_map. The code validates ocr_engine strings against all OcrEngine enum values (line 112), but the map only handles 5 of 6 engines. This creates a gap: if someone passes ocr_engine="auto", the validation passes but line 145's lookup will fail.

Add the AUTO engine mapping. Note that AUTO may require special handling—verify what the corresponding options class should be (if any).

Confirmed: TesseractCliOcrOptions is the correct class for tesseract_cli; all other mappings are correct.
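
The gap described in this comment can be reproduced with a small self-contained sketch. Everything below is a local stand-in written for illustration: the `OcrEngine` enum and the options classes are not imported from docling, the member list is an assumption based on the engines this review mentions, and `missing_engine_mappings` is a hypothetical helper.

```python
from enum import Enum


class OcrEngine(str, Enum):
    """Local stand-in for docling's OcrEngine enum (member list is assumed)."""

    AUTO = "auto"
    EASYOCR = "easyocr"
    TESSERACT_CLI = "tesseract_cli"
    TESSERACT = "tesseract"
    OCRMAC = "ocrmac"
    RAPIDOCR = "rapidocr"


# Placeholder option classes; the real ones are provided by docling.
class EasyOcrOptions: ...
class TesseractCliOcrOptions: ...
class TesseractOcrOptions: ...
class OcrMacOptions: ...
class RapidOcrOptions: ...


# Mirrors the shape of ocr_engine_map: five engines, no "auto" entry.
ocr_engine_map = {
    "easyocr": EasyOcrOptions,
    "tesseract_cli": TesseractCliOcrOptions,
    "tesseract": TesseractOcrOptions,
    "ocrmac": OcrMacOptions,
    "rapidocr": RapidOcrOptions,
}


def missing_engine_mappings(enum_cls, engine_map):
    """Return enum values that would pass validation but have no map entry."""
    return sorted(e.value for e in enum_cls if e.value not in engine_map)


print(missing_engine_mappings(OcrEngine, ocr_engine_map))
```

With these stand-in definitions the helper reports `['auto']`. A check of this kind could run as a unit test so that a new enum member in a future docling release is flagged before it reaches the lookup.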

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37b97a0 and 8949d12.

📒 Files selected for processing (6)

kubeflow-pipelines/common/components.py
kubeflow-pipelines/common/constants.py
kubeflow-pipelines/docling-standard/README.md
kubeflow-pipelines/docling-standard/standard_components.py
kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....

🔇 Additional comments (10)

kubeflow-pipelines/docling-standard/README.md (1)

21-21: LGTM! Documentation correctly reflects the OCR engine change.

The documentation update aligns with the code changes switching the default OCR engine to tesseract_cli.

kubeflow-pipelines/common/components.py (2)

178-178: LGTM! Docstring updated to reflect model changes.

The documentation correctly indicates that only layout and tableformer models are downloaded for the standard pipeline, consistent with removing easyocr.

196-204: Tesseract support confirmed in base image—no additional model downloads required.

The change to with_easyocr=False is correct. The base image quay.io/fabianofranz/docling-ubi9:2.54.0 includes all necessary Tesseract components (tesseract, tesseract-devel, tesseract-langpack-eng, tesseract-osd, leptonica-devel) installed as system packages, and the language data (tessdata) is pre-downloaded at /usr/share/tesseract/tessdata/. The with_easyocr parameter only controls EasyOCR model downloads via download_models(); Tesseract models are part of the base image and do not require separate downloads through this function. No additional action needed.

kubeflow-pipelines/docling-standard/standard_components.py (1)

27-27: No action required – "tesseract_cli" is a valid OcrEngine enum value in docling.

The default value "tesseract_cli" at line 27 is confirmed as a valid option in the docling library's OcrEngine enum, and it will pass the validation logic at lines 111-116. The ocr_engine_map at lines 119-125 correctly maps this value to TesseractCliOcrOptions.

kubeflow-pipelines/common/constants.py (2)

7-9: No action needed—Tesseract availability in the image is documented in the commit message.

The commit "Reset OCR engine to tesseract" (8949d12) explicitly states: "Since quay.io/amaredia/aipcc-docling-image does not have all of the required tesseract RPM packages installed the default base image must be reverted back to quay.io/fabianofranz/docling-ubi9:2.54.0." The pipelines are configured to use tesseract_cli as the OCR engine (as documented in the README), confirming the image choice aligns with the dependency requirements.

4-6: Verify that the Python base image is accessible and contains necessary dependencies.

The base image has been changed to Red Hat UBI9 Python 3.11 (registry.access.redhat.com/ubi9/python-311:9.6-1755074620). The image is publicly accessible from the unauthenticated Red Hat registry and is appropriate for OpenShift AI deployments. However, verify that all required system dependencies are present, particularly for the standard pipeline which requires tesserocr (which has external OCR library dependencies like Tesseract and Leptonica).

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4)

160-164: OCR engine default change looks good.

The default OCR engine is correctly updated to tesseract_cli and the corresponding TesseractCliOcrOptions class is already present in the import list (line 570) and ocr_engine_map (line 593). The change is consistent with the component's input definition here and the root-level pipeline input at line 975.

688-692: Correctly disabling EasyOCR model downloads for standard pipeline.

Adding with_easyocr=False is consistent with the switch to tesseract_cli as the default OCR engine. This avoids unnecessary downloads of EasyOCR models, reducing startup time and storage requirements.

386-386: UBI9 Python image is appropriate for lightweight PDF tasks.

These executors (create-pdf-splits and import-pdfs) perform simple file operations (listing, splitting, and downloading PDFs) without requiring Docling or OCR capabilities, making the base UBI9 Python image a suitable and efficient choice compared to the heavier Docling image used by other executors in the pipeline.

500-500: Verify the docling-ubi9:2.54.0 image contains Tesseract packages.

The image tag exists and is accessible at quay.io, but Tesseract presence cannot be confirmed from available sources. No Tesseract references appear in the codebase, and without access to the image's Dockerfile or build configuration, the claim that this image contains the required Tesseract CLI packages for the tesseract_cli OCR engine remains unverified. Confirm this image was built with Tesseract support before deploying.

Also applies to: 637-637, 714-714
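
One way to settle this is to probe for the Tesseract CLI from inside the container. The sketch below is an assumption about how such a probe might look, not something taken from the PR: `tesseract_languages` is a hypothetical helper that locates the binary with `shutil.which` and asks it for its installed languages via `tesseract --list-langs`.

```python
import shutil
import subprocess
from typing import List


def tesseract_languages(binary: str = "tesseract") -> List[str]:
    """Return the languages reported by `--list-langs`, or [] if unavailable."""
    path = shutil.which(binary)
    if path is None:
        # The CLI is not on PATH, so the tesseract_cli engine cannot work.
        return []
    try:
        result = subprocess.run(
            [path, "--list-langs"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    # Some releases print the listing to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    # The first line is a header; the remaining lines are language codes.
    return [line.strip() for line in lines[1:] if line.strip()]


if __name__ == "__main__":
    langs = tesseract_languages()
    print("tesseract languages:", langs if langs else "none found")
```

Running this inside the candidate image would show whether the binary and language data such as `eng` are present; an empty list suggests the image is not suitable for `tesseract_cli`.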

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py

shruthis4

The VLM pipeline needs to be recompiled; one of the CI checks failed, indicating this: https://github.com/opendatahub-io/data-processing/actions/runs/20827889344/job/59833926066?pr=78
Also, push these changes to a branch on the odh repo and trigger the workflow: Run workflow -> set "Pull request number or branch name" to your odh branch name -> Run workflow.
Then add the link to that workflow run to this PR.

Set OCR engine to be tesseract OCR. Tesseract OCR relies on the TesseractCliOcrOptions being set instead of TesseractOcrOptions the way it was set previously.

Since quay.io/amaredia/aipcc-docling-image does not have all of the required tesseract RPM packages installed the default base image must be reverted back to quay.io/fabianofranz/docling-ubi9:2.54.0.

Signed-off-by: Ali Maredia <amaredia@redhat.com>

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In @kubeflow-pipelines/common/constants.py:
- Around line 7-8: Update the DOCLING_BASE_IMAGE tag to match the pinned docling
library version: change the DOCLING_BASE_IMAGE default value from
"quay.io/fabianofranz/docling-ubi9:2.54.0" to
"quay.io/fabianofranz/docling-ubi9:2.57.0" (or verify the existing image already
contains docling 2.57.0 and required Tesseract deps); modify the
DOCLING_BASE_IMAGE constant so it aligns with the docling == 2.57.0 entries in
requirements.txt to avoid version mismatches.

🧹 Nitpick comments (1)

kubeflow-pipelines/docling-standard/standard_components.py (1)
145-148: Consider adding a warning when OCR options cannot be configured.

When ocr=True but ocr_engine is not in ocr_engine_map (e.g., if allow_external_plugins=True and an external engine is specified), the pipeline silently proceeds without setting ocr_options. This could lead to unexpected behavior where OCR is enabled but using a default engine configuration instead of the user-specified one.
💡 Suggested improvement
     pipeline_options.do_ocr = ocr
     if ocr and ocr_engine in ocr_engine_map:
         OcrOptionsClass = ocr_engine_map[ocr_engine]
         ocr_options_instance = OcrOptionsClass(force_full_page_ocr=force_ocr)
         pipeline_options.ocr_options = ocr_options_instance
+    elif ocr and ocr_engine not in ocr_engine_map:
+        print(
+            f"docling-standard-convert: Warning - OCR engine '{ocr_engine}' not in built-in map, using default OCR options",
+            flush=True,
+        )
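
The fall-through behavior can also be exercised in isolation. The following is a minimal sketch under stated assumptions: `TesseractCliOcrOptions` is a local dataclass standing in for the docling class, the map is reduced to one entry, and `resolve_ocr_options` is a hypothetical helper rather than a function from the pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TesseractCliOcrOptions:
    """Local stand-in; the real options class is provided by docling."""

    force_full_page_ocr: bool = False


# Reduced map for illustration; the component's map has more entries.
ocr_engine_map = {"tesseract_cli": TesseractCliOcrOptions}


def resolve_ocr_options(
    ocr: bool, ocr_engine: str, force_ocr: bool
) -> Optional[TesseractCliOcrOptions]:
    """Return an options instance, or None when OCR is off or the engine is unmapped."""
    if not ocr:
        return None
    options_cls = ocr_engine_map.get(ocr_engine)
    if options_cls is None:
        # Surface the fall-through instead of silently keeping the defaults.
        print(
            f"Warning - OCR engine '{ocr_engine}' not in built-in map, "
            "using default OCR options",
            flush=True,
        )
        return None
    return options_cls(force_full_page_ocr=force_ocr)
```

Returning `None` for an unmapped engine keeps the caller's default options untouched while still leaving a trace in the logs, which is the behavior the suggested diff aims for.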

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8949d12 and 35193f9.

📒 Files selected for processing (7)

kubeflow-pipelines/common/components.py
kubeflow-pipelines/common/constants.py
kubeflow-pipelines/docling-standard/README.md
kubeflow-pipelines/docling-standard/standard_components.py
kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml

🚧 Files skipped from review as they are similar to previous changes (3)

kubeflow-pipelines/common/components.py
kubeflow-pipelines/docling-standard/README.md
kubeflow-pipelines/docling-standard/standard_convert_pipeline.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
GitHub Check: Summary

🔇 Additional comments (7)

kubeflow-pipelines/docling-standard/standard_components.py (2)

27-27: LGTM: OCR engine default updated to tesseract_cli.

The default change aligns with the PR objective to use Tesseract OCR with the CLI-based options class (TesseractCliOcrOptions).

119-125: LGTM: OCR engine mapping is correct.

The ocr_engine_map correctly maps engine string identifiers to their corresponding options classes, with tesseract_cli mapped to TesseractCliOcrOptions as required for the Tesseract CLI.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (2)

304-304: LGTM: Image references consistently updated.

All executor container images have been updated to use the new base images:

registry.access.redhat.com/ubi9/python-311:9.6-1755074620 for Python-only components

quay.io/fabianofranz/docling-ubi9:2.54.0 for Docling components

These align with the constants defined in constants.py.

Also applies to: 418-418, 550-550, 627-627, 721-721

601-614: LGTM: EasyOCR removed from model downloads.

The with_easyocr=False flag is correctly set across all pipeline types, aligning with the switch to Tesseract CLI which doesn't require separate model downloads.

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (3)

17-17: LGTM: OCR engine defaults consistently updated to tesseract_cli.

The docling_ocr_engine default value has been updated in:

Pipeline header comments (Line 17)

Component input definitions (Line 161)

Root pipeline input definitions (Line 975)

All are consistent with the source code changes in standard_components.py.

Also applies to: 161-161, 975-975

386-386: LGTM: Container images updated consistently.

All executor images are updated to match the constants in constants.py, ensuring the pipeline uses images with the required Tesseract RPM packages.

Also applies to: 500-500, 637-637, 714-714, 808-808

688-691: LGTM: Model download configuration aligned with OCR engine change.

The standard pipeline correctly sets with_easyocr=False since the pipeline now defaults to tesseract_cli which uses system-installed Tesseract binaries rather than downloaded models.

coderabbitai · 2026-01-13T20:58:15Z

kubeflow-pipelines/common/constants.py

 DOCLING_BASE_IMAGE = os.getenv(
-    "DOCLING_BASE_IMAGE", "quay.io/amaredia/aipcc-docling-image"
+    "DOCLING_BASE_IMAGE", "quay.io/fabianofranz/docling-ubi9:2.54.0"
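
The hunk above uses the `os.getenv` pattern, so the default image can be overridden without editing `constants.py`. A minimal sketch of that lookup follows; `docling_base_image` is a hypothetical helper that accepts the environment as a parameter so it can be tried without mutating `os.environ`.

```python
import os
from typing import Mapping, Optional

# Default taken from the diff above.
DEFAULT_DOCLING_BASE_IMAGE = "quay.io/fabianofranz/docling-ubi9:2.54.0"


def docling_base_image(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Docling image, preferring the DOCLING_BASE_IMAGE variable."""
    source = os.environ if env is None else env
    return source.get("DOCLING_BASE_IMAGE", DEFAULT_DOCLING_BASE_IMAGE)


# No override: falls back to the default from the diff.
print(docling_base_image({}))
# Override: the value from the environment wins.
print(docling_base_image({"DOCLING_BASE_IMAGE": "quay.io/amaredia/aipcc-docling-image"}))
```

Because the compiled manifests embed image references, the variable presumably has to be set when the pipeline is compiled rather than when it runs.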


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for docling version references in the repository
rg -n 'docling' -i --type=py | head -30

Repository: opendatahub-io/data-processing

Length of output: 3221

🏁 Script executed:

# Check for requirements files that might specify the docling version
fd -e 'requirements.txt' -e 'setup.py' -e 'pyproject.toml' -e 'setup.cfg' | xargs rg 'docling' -i

Repository: opendatahub-io/data-processing

Length of output: 50386

🏁 Script executed:

# Check constants.py for any related version information
cat -n kubeflow-pipelines/common/constants.py | head -20

Repository: opendatahub-io/data-processing

Length of output: 435

Align container image version with pinned library version.

The DOCLING_BASE_IMAGE tag is pinned to 2.54.0, but requirements.txt in both docling-standard and docling-vlm specifies docling == 2.57.0. Update the image tag to 2.57.0 or verify that the image includes the required version and dependencies for Tesseract compatibility.
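
A mismatch like this can be caught mechanically. The sketch below is one possible check, assuming the image tag is meant to equal the docling version and that the pin is written as `docling == X.Y.Z`; the helper names are hypothetical and image references by digest are not handled.

```python
import re
from typing import Optional


def image_tag(image_ref: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it has no tag."""
    # Only inspect the last path segment so a registry port is not mistaken for a tag.
    last_segment = image_ref.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return None
    return last_segment.rsplit(":", 1)[1]


def pinned_version(requirements_text: str, package: str) -> Optional[str]:
    """Return the version from a 'package == X.Y.Z' line, or None if absent."""
    pattern = rf"^\s*{re.escape(package)}\s*==\s*([\w.]+)"
    match = re.search(pattern, requirements_text, flags=re.MULTILINE | re.IGNORECASE)
    return match.group(1) if match else None


def versions_aligned(image_ref: str, requirements_text: str, package: str = "docling") -> bool:
    """True only when the image tag and the pinned package version are equal."""
    tag = image_tag(image_ref)
    return tag is not None and tag == pinned_version(requirements_text, package)


# Values quoted in this review: image tag 2.54.0, requirements pin 2.57.0.
print(versions_aligned("quay.io/fabianofranz/docling-ubi9:2.54.0", "docling == 2.57.0\n"))
```

For the values quoted in this comment the check returns `False`. Whether the tag actually tracks the installed docling version is itself an assumption that would need to be confirmed against the image.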


RobuRishabh

lgtm

alimaredia · 2026-01-13T21:25:25Z

@shruthis4: https://github.com/opendatahub-io/data-processing/actions/runs/20972284333

mergify · 2026-01-15T15:20:42Z

🎉 Auto-merged successfully!

✅ All reviewers approved: 2
✅ CI checks passed: All

Approved by:

alimaredia requested a review from shruthis4 January 8, 2026 18:44

alimaredia requested a review from a team as a code owner January 8, 2026 18:44

coderabbitai bot reviewed Jan 8, 2026


shruthis4 requested changes Jan 13, 2026


alimaredia force-pushed the readd-tesseract-ocr branch from 8949d12 to 35193f9 Compare January 13, 2026 20:52

coderabbitai bot reviewed Jan 13, 2026


RobuRishabh approved these changes Jan 13, 2026


shruthis4 approved these changes Jan 15, 2026


mergify bot merged commit d5586ff into opendatahub-io:main Jan 15, 2026
7 of 9 checks passed

mergify[bot] merged 1 commit into opendatahub-io:main from alimaredia:readd-tesseract-ocr
