Reset OCR engine to tesseract #78

Merged
mergify[bot] merged 1 commit into opendatahub-io:main from alimaredia:readd-tesseract-ocr
Jan 15, 2026
Conversation

@alimaredia (Contributor) commented Jan 8, 2026

Description

Set the OCR engine to Tesseract OCR. Tesseract OCR relies on TesseractCliOcrOptions being set, rather than TesseractOcrOptions as it was previously.

Since quay.io/amaredia/aipcc-docling-image does not have all of the required Tesseract RPM packages installed, the default base image must be reverted to quay.io/fabianofranz/docling-ubi9:2.54.0.

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • Chores
    • Default OCR engine updated from easyocr to tesseract_cli across standard workflows and pipeline inputs.
    • Standard model package refined to exclude easyocr; standard downloads now include layout and tableformer only.
    • Base/container images upgraded to newer UBI9 Python and updated Docling images for improved runtime compatibility.

@alimaredia alimaredia requested a review from shruthis4 January 8, 2026 18:44
@alimaredia alimaredia requested a review from a team as a code owner January 8, 2026 18:44
coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

Default OCR engine changed from "easyocr" to "tesseract_cli"; the standard model download no longer includes easyocr (now layout + tableformer); base image constants updated to Red Hat UBI and FabianoFranz docling images.

Changes

Cohort: Model download config
File(s): kubeflow-pipelines/common/components.py, kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml, kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
Summary: Removed easyocr from the standard download set: with_easyocr changed from True → False; standard path now downloads layout and tableformer only; VL(M) paths adjusted to explicit model flags.

Cohort: Base image constants
File(s): kubeflow-pipelines/common/constants.py
Summary: Updated defaults: PYTHON_BASE_IMAGE → registry.access.redhat.com/ubi9/python-311:9.6-1755074620; DOCLING_BASE_IMAGE → quay.io/fabianofranz/docling-ubi9:2.54.0.

Cohort: OCR engine defaults & docs
File(s): kubeflow-pipelines/docling-standard/standard_components.py, kubeflow-pipelines/docling-standard/standard_convert_pipeline.py, kubeflow-pipelines/docling-standard/README.md
Summary: Changed default ocr_engine / docling_ocr_engine from "easyocr" → "tesseract_cli" in function signatures, pipeline parameters, and README.

Cohort: Compiled pipeline manifests / wiring
File(s): kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml, kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
Summary: Updated public input defaults for docling_ocr_engine to tesseract_cli; replaced container image references across many executors to the new UBI/docling images; adjusted model-selection flags and parameter wiring to reflect easyocr removal.
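
As a rough illustration of how environment-overridable image defaults behave, here is a minimal sketch. The default values are the ones listed above. The helper name `resolve_base_images` is hypothetical, and reading PYTHON_BASE_IMAGE from an environment variable of the same name is an assumption (only the DOCLING_BASE_IMAGE lookup via `os.getenv` is visible in this PR's diff).

```python
import os

# Defaults as listed in the walkthrough above.
DEFAULT_PYTHON_BASE_IMAGE = "registry.access.redhat.com/ubi9/python-311:9.6-1755074620"
DEFAULT_DOCLING_BASE_IMAGE = "quay.io/fabianofranz/docling-ubi9:2.54.0"


def resolve_base_images(env=None):
    """Return the image references to use, preferring environment overrides.

    Hypothetical helper: `env` defaults to the process environment and can be
    replaced with a plain dict for testing.
    """
    env = os.environ if env is None else env
    return {
        # Assumption: PYTHON_BASE_IMAGE is overridable the same way as
        # DOCLING_BASE_IMAGE.
        "PYTHON_BASE_IMAGE": env.get("PYTHON_BASE_IMAGE", DEFAULT_PYTHON_BASE_IMAGE),
        "DOCLING_BASE_IMAGE": env.get("DOCLING_BASE_IMAGE", DEFAULT_DOCLING_BASE_IMAGE),
    }


if __name__ == "__main__":
    for name, image in resolve_base_images().items():
        print(f"{name}={image}")
```

With no override set, the Docling executors would resolve to the fabianofranz image; exporting DOCLING_BASE_IMAGE before compiling would substitute a different one.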

Sequence Diagram(s)

(omitted — changes are defaults/config and manifest updates without new multi-component control flow)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibble bytes and hop through code,

Tesseract now leads the OCR road.
Easy gone, layout and table stay,
New images trotted in today—
A joyful hop for every load.

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 66.67%, which is insufficient. The required threshold is 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
Check name: Description Check
Status: ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.

Check name: Title check
Status: ✅ Passed
Explanation: The title 'Reset OCR engine to tesseract' directly reflects the main change across the pull request, which systematically updates OCR engine defaults from 'easyocr' to 'tesseract_cli' across multiple files and updates related base images.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
kubeflow-pipelines/docling-standard/standard_components.py (1)

119-125: Add missing "auto" OCR engine to ocr_engine_map.

The OcrEngine enum in docling v2.54.0 includes an AUTO engine option that is not present in the ocr_engine_map. The code validates ocr_engine strings against all OcrEngine enum values (line 112), but the map only handles 5 of 6 engines. This creates a gap: if someone passes ocr_engine="auto", the validation passes but line 145's lookup will fail.

Add the AUTO engine mapping. Note that AUTO may require special handling—verify what the corresponding options class should be (if any).

Confirmed: TesseractCliOcrOptions is the correct class for tesseract_cli; all other mappings are correct.
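
The gap described here can be checked mechanically. Below is a minimal sketch that lists engine names accepted by enum-based validation but absent from the options map. The `OcrEngine` enum and the options classes are stand-ins so the snippet runs without docling installed; the member values other than "auto", "easyocr", and "tesseract_cli" are assumptions, and `unmapped_engines` is a hypothetical helper.

```python
from enum import Enum


# Stand-in for docling's OcrEngine enum (values partly assumed).
class OcrEngine(str, Enum):
    AUTO = "auto"
    EASYOCR = "easyocr"
    TESSERACT_CLI = "tesseract_cli"
    TESSERACT = "tesseract"
    OCRMAC = "ocrmac"
    RAPIDOCR = "rapidocr"


# Stand-in options classes; the real ones live in docling.
class EasyOcrOptions: ...
class TesseractCliOcrOptions: ...
class TesseractOcrOptions: ...
class OcrMacOptions: ...
class RapidOcrOptions: ...


# Five of the six engines are mapped, mirroring the situation in the review.
ocr_engine_map = {
    "easyocr": EasyOcrOptions,
    "tesseract_cli": TesseractCliOcrOptions,
    "tesseract": TesseractOcrOptions,
    "ocrmac": OcrMacOptions,
    "rapidocr": RapidOcrOptions,
}


def unmapped_engines(engine_enum, engine_map):
    """Return enum values that would pass validation but have no options class."""
    return sorted(e.value for e in engine_enum if e.value not in engine_map)


print(unmapped_engines(OcrEngine, ocr_engine_map))  # ['auto']
```

A check like this could run as a unit test so that a new enum member in a future docling release is flagged as soon as the dependency is bumped.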

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 37b97a0 and 8949d12.

📒 Files selected for processing (6)
  • kubeflow-pipelines/common/components.py
  • kubeflow-pipelines/common/constants.py
  • kubeflow-pipelines/docling-standard/README.md
  • kubeflow-pipelines/docling-standard/standard_components.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
🔇 Additional comments (10)
kubeflow-pipelines/docling-standard/README.md (1)

21-21: LGTM! Documentation correctly reflects the OCR engine change.

The documentation update aligns with the code changes switching the default OCR engine to tesseract_cli.

kubeflow-pipelines/common/components.py (2)

178-178: LGTM! Docstring updated to reflect model changes.

The documentation correctly indicates that only layout and tableformer models are downloaded for the standard pipeline, consistent with removing easyocr.


196-204: Tesseract support confirmed in base image—no additional model downloads required.

The change to with_easyocr=False is correct. The base image quay.io/fabianofranz/docling-ubi9:2.54.0 includes all necessary Tesseract components (tesseract, tesseract-devel, tesseract-langpack-eng, tesseract-osd, leptonica-devel) installed as system packages, and the language data (tessdata) is pre-downloaded at /usr/share/tesseract/tessdata/. The with_easyocr parameter only controls EasyOCR model downloads via download_models(); Tesseract models are part of the base image and do not require separate downloads through this function. No additional action needed.

kubeflow-pipelines/docling-standard/standard_components.py (1)

27-27: No action required – "tesseract_cli" is a valid OcrEngine enum value in docling.

The default value "tesseract_cli" at line 27 is confirmed as a valid option in the docling library's OcrEngine enum, and it will pass the validation logic at lines 111-116. The ocr_engine_map at lines 119-125 correctly maps this value to TesseractCliOcrOptions.

kubeflow-pipelines/common/constants.py (2)

7-9: No action needed—Tesseract availability in the image is documented in the commit message.

The commit "Reset OCR engine to tesseract" (8949d12) explicitly states: "Since quay.io/amaredia/aipcc-docling-image does not have all of the required tesseract RPM packages installed the default base image must be reverted back to quay.io/fabianofranz/docling-ubi9:2.54.0." The pipelines are configured to use tesseract_cli as the OCR engine (as documented in the README), confirming the image choice aligns with the dependency requirements.


4-6: Verify that the Python base image is accessible and contains necessary dependencies.

The base image has been changed to Red Hat UBI9 Python 3.11 (registry.access.redhat.com/ubi9/python-311:9.6-1755074620). The image is publicly accessible from the unauthenticated Red Hat registry and is appropriate for OpenShift AI deployments. However, verify that all required system dependencies are present, particularly for the standard pipeline which requires tesserocr (which has external OCR library dependencies like Tesseract and Leptonica).

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4)

160-164: OCR engine default change looks good.

The default OCR engine is correctly updated to tesseract_cli and the corresponding TesseractCliOcrOptions class is already present in the import list (line 570) and ocr_engine_map (line 593). The change is consistent with the component's input definition here and the root-level pipeline input at line 975.


688-692: Correctly disabling EasyOCR model downloads for standard pipeline.

Adding with_easyocr=False is consistent with the switch to tesseract_cli as the default OCR engine. This avoids unnecessary downloads of EasyOCR models, reducing startup time and storage requirements.


386-386: UBI9 Python image is appropriate for lightweight PDF tasks.

These executors (create-pdf-splits and import-pdfs) perform simple file operations (listing, splitting, and downloading PDFs) without requiring Docling or OCR capabilities, making the base UBI9 Python image a suitable and efficient choice compared to the heavier Docling image used by other executors in the pipeline.


500-500: Verify the docling-ubi9:2.54.0 image contains Tesseract packages.

The image tag exists and is accessible at quay.io, but Tesseract presence cannot be confirmed from available sources. No Tesseract references appear in the codebase, and without access to the image's Dockerfile or build configuration, the claim that this image contains the required Tesseract CLI packages for the tesseract_cli OCR engine remains unverified. Confirm this image was built with Tesseract support before deploying.

Also applies to: 637-637, 714-714
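
One way to perform this verification is to run a small probe inside a container started from the image. The sketch below uses only the Python standard library; the function name `tesseract_status` is hypothetical, and the tessdata path is the one quoted in the earlier review comment, not something confirmed against the image itself.

```python
import os
import shutil


def tesseract_status(tessdata_dir="/usr/share/tesseract/tessdata"):
    """Report whether the tesseract CLI and its language data are present.

    Returns the resolved binary path (or None) and the sorted list of
    *.traineddata files found in `tessdata_dir`.
    """
    binary = shutil.which("tesseract")
    traineddata = []
    if os.path.isdir(tessdata_dir):
        traineddata = sorted(
            name for name in os.listdir(tessdata_dir) if name.endswith(".traineddata")
        )
    return {"binary": binary, "traineddata": traineddata}


if __name__ == "__main__":
    status = tesseract_status()
    print("tesseract binary:", status["binary"] or "not found on PATH")
    print("language data:", ", ".join(status["traineddata"]) or "none found")
```

A missing binary or an empty language-data list would suggest the image cannot serve the tesseract_cli engine as configured.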

@shruthis4 (Contributor) left a comment

  1. The VLM pipeline needs to be recompiled; one of the CI checks failed, indicating this: https://github.com/opendatahub-io/data-processing/actions/runs/20827889344/job/59833926066?pr=78
  2. Also, push these changes to a branch on the odh repo and trigger the workflow: Run workflow → set "Pull request number or branch name" to your odh branch name → Run workflow
  3. Add the link to the workflow run from the step above to the PR

Set OCR engine to be tesseract OCR. Tesseract OCR relies on the
TesseractCliOcrOptions being set instead of TesseractOcrOptions the way it
was set previously.

Since quay.io/amaredia/aipcc-docling-image does not have all of the required
tesseract RPM packages installed the default base image must be reverted back
to quay.io/fabianofranz/docling-ubi9:2.54.0.

Signed-off-by: Ali Maredia <amaredia@redhat.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @kubeflow-pipelines/common/constants.py:
- Around line 7-8: Update the DOCLING_BASE_IMAGE tag to match the pinned docling
library version: change the DOCLING_BASE_IMAGE default value from
"quay.io/fabianofranz/docling-ubi9:2.54.0" to
"quay.io/fabianofranz/docling-ubi9:2.57.0" (or verify the existing image already
contains docling 2.57.0 and required Tesseract deps); modify the
DOCLING_BASE_IMAGE constant so it aligns with the docling == 2.57.0 entries in
requirements.txt to avoid version mismatches.
🧹 Nitpick comments (1)
kubeflow-pipelines/docling-standard/standard_components.py (1)

145-148: Consider adding a warning when OCR options cannot be configured.

When ocr=True but ocr_engine is not in ocr_engine_map (e.g., if allow_external_plugins=True and an external engine is specified), the pipeline silently proceeds without setting ocr_options. This could lead to unexpected behavior where OCR is enabled but using a default engine configuration instead of the user-specified one.

💡 Suggested improvement
     pipeline_options.do_ocr = ocr
     if ocr and ocr_engine in ocr_engine_map:
         OcrOptionsClass = ocr_engine_map[ocr_engine]
         ocr_options_instance = OcrOptionsClass(force_full_page_ocr=force_ocr)
         pipeline_options.ocr_options = ocr_options_instance
+    elif ocr and ocr_engine not in ocr_engine_map:
+        print(
+            f"docling-standard-convert: Warning - OCR engine '{ocr_engine}' not in built-in map, using default OCR options",
+            flush=True,
+        )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8949d12 and 35193f9.

📒 Files selected for processing (7)
  • kubeflow-pipelines/common/components.py
  • kubeflow-pipelines/common/constants.py
  • kubeflow-pipelines/docling-standard/README.md
  • kubeflow-pipelines/docling-standard/standard_components.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • kubeflow-pipelines/common/components.py
  • kubeflow-pipelines/docling-standard/README.md
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: Summary
🔇 Additional comments (7)
kubeflow-pipelines/docling-standard/standard_components.py (2)

27-27: LGTM: OCR engine default updated to tesseract_cli.

The default change aligns with the PR objective to use Tesseract OCR with the CLI-based options class (TesseractCliOcrOptions).


119-125: LGTM: OCR engine mapping is correct.

The ocr_engine_map correctly maps engine string identifiers to their corresponding options classes, with tesseract_cli mapped to TesseractCliOcrOptions as required for the Tesseract CLI.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (2)

304-304: LGTM: Image references consistently updated.

All executor container images have been updated to use the new base images:

  • registry.access.redhat.com/ubi9/python-311:9.6-1755074620 for Python-only components
  • quay.io/fabianofranz/docling-ubi9:2.54.0 for Docling components

These align with the constants defined in constants.py.

Also applies to: 418-418, 550-550, 627-627, 721-721


601-614: LGTM: EasyOCR removed from model downloads.

The with_easyocr=False flag is correctly set across all pipeline types, aligning with the switch to Tesseract CLI which doesn't require separate model downloads.

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (3)

17-17: LGTM: OCR engine defaults consistently updated to tesseract_cli.

The docling_ocr_engine default value has been updated in:

  • Pipeline header comments (Line 17)
  • Component input definitions (Line 161)
  • Root pipeline input definitions (Line 975)

All are consistent with the source code changes in standard_components.py.

Also applies to: 161-161, 975-975


386-386: LGTM: Container images updated consistently.

All executor images are updated to match the constants in constants.py, ensuring the pipeline uses images with the required Tesseract RPM packages.

Also applies to: 500-500, 637-637, 714-714, 808-808


688-691: LGTM: Model download configuration aligned with OCR engine change.

The standard pipeline correctly sets with_easyocr=False since the pipeline now defaults to tesseract_cli which uses system-installed Tesseract binaries rather than downloaded models.

Comment on lines 7 to +8
 DOCLING_BASE_IMAGE = os.getenv(
-    "DOCLING_BASE_IMAGE", "quay.io/amaredia/aipcc-docling-image"
+    "DOCLING_BASE_IMAGE", "quay.io/fabianofranz/docling-ubi9:2.54.0"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for docling version references in the repository
rg -n 'docling' -i --type=py | head -30

Repository: opendatahub-io/data-processing

Length of output: 3221


🏁 Script executed:

# Check for requirements files that might specify the docling version
fd -e 'requirements.txt' -e 'setup.py' -e 'pyproject.toml' -e 'setup.cfg' | xargs rg 'docling' -i

Repository: opendatahub-io/data-processing

Length of output: 50386


🏁 Script executed:

# Check constants.py for any related version information
cat -n kubeflow-pipelines/common/constants.py | head -20

Repository: opendatahub-io/data-processing

Length of output: 435


Align container image version with pinned library version.

The DOCLING_BASE_IMAGE tag is pinned to 2.54.0, but requirements.txt in both docling-standard and docling-vlm specifies docling == 2.57.0. Update the image tag to 2.57.0 or verify that the image includes the required version and dependencies for Tesseract compatibility.
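
A mismatch like this can be detected with a small comparison, sketched below. It rests on the assumption that the image tag is meant to track the docling library version, which the review implies but the PR does not state. The helper names `image_tag` and `pinned_version` are hypothetical, and the sketch does not handle digest references (`@sha256:...`).

```python
import re


def image_tag(image_ref):
    """Return the tag of an image reference, or None if it has no tag."""
    last = image_ref.rsplit("/", 1)[-1]
    return last.split(":", 1)[1] if ":" in last else None


def pinned_version(requirements_text, package):
    """Return the version from a `package == X` pin, or None if absent."""
    pattern = rf"^\s*{re.escape(package)}\s*==\s*([^\s;#]+)"
    match = re.search(pattern, requirements_text, flags=re.MULTILINE)
    return match.group(1) if match else None


# Values quoted in the review comment above.
image = "quay.io/fabianofranz/docling-ubi9:2.54.0"
requirements = "docling == 2.57.0\n"

tag = image_tag(image)
pin = pinned_version(requirements, "docling")
print(tag, pin, tag == pin)  # 2.54.0 2.57.0 False
```

Equal strings would only show that the tag and the pin agree; confirming what the image actually ships still requires inspecting the image.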


@RobuRishabh (Contributor) left a comment

lgtm

@mergify mergify bot merged commit d5586ff into opendatahub-io:main Jan 15, 2026
7 of 9 checks passed
mergify bot commented Jan 15, 2026

🎉 Auto-merged successfully!

All reviewers approved: 2
CI checks passed: All

Approved by:
