Reset OCR engine to tesseract#78
Conversation
📝 WalkthroughWalkthroughDefault OCR engine changed from "easyocr" to "tesseract_cli"; the standard model download no longer includes easyocr (now layout + tableformer); base image constants updated to Red Hat UBI and FabianoFranz docling images. Changes
Sequence Diagram(s)(omitted — changes are defaults/config and manifest updates without new multi-component control flow) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
kubeflow-pipelines/docling-standard/standard_components.py (1)
119-125: Add missing "auto" OCR engine to ocr_engine_map.The OcrEngine enum in docling v2.54.0 includes an AUTO engine option that is not present in the ocr_engine_map. The code validates ocr_engine strings against all OcrEngine enum values (line 112), but the map only handles 5 of 6 engines. This creates a gap: if someone passes
ocr_engine="auto", the validation passes but line 145's lookup will fail.Add the AUTO engine mapping. Note that AUTO may require special handling—verify what the corresponding options class should be (if any).
Confirmed: TesseractCliOcrOptions is the correct class for tesseract_cli; all other mappings are correct.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
kubeflow-pipelines/common/components.pykubeflow-pipelines/common/constants.pykubeflow-pipelines/docling-standard/README.mdkubeflow-pipelines/docling-standard/standard_components.pykubeflow-pipelines/docling-standard/standard_convert_pipeline.pykubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
🔇 Additional comments (10)
kubeflow-pipelines/docling-standard/README.md (1)
21-21: LGTM! Documentation correctly reflects the OCR engine change.The documentation update aligns with the code changes switching the default OCR engine to
tesseract_cli.kubeflow-pipelines/common/components.py (2)
178-178: LGTM! Docstring updated to reflect model changes.The documentation correctly indicates that only layout and tableformer models are downloaded for the standard pipeline, consistent with removing easyocr.
196-204: Tesseract support confirmed in base image—no additional model downloads required.The change to
with_easyocr=Falseis correct. The base imagequay.io/fabianofranz/docling-ubi9:2.54.0includes all necessary Tesseract components (tesseract, tesseract-devel, tesseract-langpack-eng, tesseract-osd, leptonica-devel) installed as system packages, and the language data (tessdata) is pre-downloaded at/usr/share/tesseract/tessdata/. Thewith_easyocrparameter only controls EasyOCR model downloads viadownload_models(); Tesseract models are part of the base image and do not require separate downloads through this function. No additional action needed.kubeflow-pipelines/docling-standard/standard_components.py (1)
27-27: No action required – "tesseract_cli" is a valid OcrEngine enum value in docling.The default value
"tesseract_cli"at line 27 is confirmed as a valid option in the docling library'sOcrEngineenum, and it will pass the validation logic at lines 111-116. Theocr_engine_mapat lines 119-125 correctly maps this value toTesseractCliOcrOptions.kubeflow-pipelines/common/constants.py (2)
7-9: No action needed—Tesseract availability in the image is documented in the commit message.The commit "Reset OCR engine to tesseract" (8949d12) explicitly states: "Since quay.io/amaredia/aipcc-docling-image does not have all of the required tesseract RPM packages installed the default base image must be reverted back to quay.io/fabianofranz/docling-ubi9:2.54.0." The pipelines are configured to use
tesseract_clias the OCR engine (as documented in the README), confirming the image choice aligns with the dependency requirements.
4-6: Verify that the Python base image is accessible and contains necessary dependencies.The base image has been changed to Red Hat UBI9 Python 3.11 (
registry.access.redhat.com/ubi9/python-311:9.6-1755074620). The image is publicly accessible from the unauthenticated Red Hat registry and is appropriate for OpenShift AI deployments. However, verify that all required system dependencies are present, particularly for the standard pipeline which requirestesserocr(which has external OCR library dependencies like Tesseract and Leptonica).kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4)
160-164: OCR engine default change looks good.The default OCR engine is correctly updated to
tesseract_cliand the correspondingTesseractCliOcrOptionsclass is already present in the import list (line 570) andocr_engine_map(line 593). The change is consistent with the component's input definition here and the root-level pipeline input at line 975.
688-692: Correctly disabling EasyOCR model downloads for standard pipeline.Adding
with_easyocr=Falseis consistent with the switch totesseract_clias the default OCR engine. This avoids unnecessary downloads of EasyOCR models, reducing startup time and storage requirements.
386-386: UBI9 Python image is appropriate for lightweight PDF tasks.These executors (
create-pdf-splitsandimport-pdfs) perform simple file operations (listing, splitting, and downloading PDFs) without requiring Docling or OCR capabilities, making the base UBI9 Python image a suitable and efficient choice compared to the heavier Docling image used by other executors in the pipeline.
500-500: Verify thedocling-ubi9:2.54.0image contains Tesseract packages.The image tag exists and is accessible at quay.io, but Tesseract presence cannot be confirmed from available sources. No Tesseract references appear in the codebase, and without access to the image's Dockerfile or build configuration, the claim that this image contains the required Tesseract CLI packages for the
tesseract_cliOCR engine remains unverified. Confirm this image was built with Tesseract support before deploying.Also applies to: 637-637, 714-714
shruthis4
left a comment
There was a problem hiding this comment.
- The vlm pipeline needs to be recompiled, one of the CI checks failed indicating this:https://github.com/opendatahub-io/data-processing/actions/runs/20827889344/job/59833926066?pr=78
- Also, push these changes to a branch on the odh repo and trigger the workflow ->Run workflow->set "Pull request number or branch name" to your odh branch name-> Run workflow
- List the link to the workflow run from the above step to the PR
Set OCR engine to be tesseract OCR. Tesseract OCR relies on the TesseractCliOcrOptions being set instead of TesseractOcrOptions the way it was set previously. Since quay.io/amaredia/aipcc-docling-image does not have all of the required tesseract RPM packages installed the default base image must be reverted back to quay.io/fabianofranz/docling-ubi9:2.54.0. Signed-off-by: Ali Maredia <amaredia@redhat.com>
8949d12 to
35193f9
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @kubeflow-pipelines/common/constants.py:
- Around line 7-8: Update the DOCLING_BASE_IMAGE tag to match the pinned docling
library version: change the DOCLING_BASE_IMAGE default value from
"quay.io/fabianofranz/docling-ubi9:2.54.0" to
"quay.io/fabianofranz/docling-ubi9:2.57.0" (or verify the existing image already
contains docling 2.57.0 and required Tesseract deps); modify the
DOCLING_BASE_IMAGE constant so it aligns with the docling == 2.57.0 entries in
requirements.txt to avoid version mismatches.
🧹 Nitpick comments (1)
kubeflow-pipelines/docling-standard/standard_components.py (1)
145-148: Consider adding a warning when OCR options cannot be configured.When
ocr=Truebutocr_engineis not inocr_engine_map(e.g., ifallow_external_plugins=Trueand an external engine is specified), the pipeline silently proceeds without settingocr_options. This could lead to unexpected behavior where OCR is enabled but using a default engine configuration instead of the user-specified one.💡 Suggested improvement
pipeline_options.do_ocr = ocr if ocr and ocr_engine in ocr_engine_map: OcrOptionsClass = ocr_engine_map[ocr_engine] ocr_options_instance = OcrOptionsClass(force_full_page_ocr=force_ocr) pipeline_options.ocr_options = ocr_options_instance + elif ocr and ocr_engine not in ocr_engine_map: + print( + f"docling-standard-convert: Warning - OCR engine '{ocr_engine}' not in built-in map, using default OCR options", + flush=True, + )
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
kubeflow-pipelines/common/components.pykubeflow-pipelines/common/constants.pykubeflow-pipelines/docling-standard/README.mdkubeflow-pipelines/docling-standard/standard_components.pykubeflow-pipelines/docling-standard/standard_convert_pipeline.pykubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yamlkubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- kubeflow-pipelines/common/components.py
- kubeflow-pipelines/docling-standard/README.md
- kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: Summary
🔇 Additional comments (7)
kubeflow-pipelines/docling-standard/standard_components.py (2)
27-27: LGTM: OCR engine default updated totesseract_cli.The default change aligns with the PR objective to use Tesseract OCR with the CLI-based options class (
TesseractCliOcrOptions).
119-125: LGTM: OCR engine mapping is correct.The
ocr_engine_mapcorrectly maps engine string identifiers to their corresponding options classes, withtesseract_climapped toTesseractCliOcrOptionsas required for the Tesseract CLI.kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (2)
304-304: LGTM: Image references consistently updated.All executor container images have been updated to use the new base images:
registry.access.redhat.com/ubi9/python-311:9.6-1755074620for Python-only componentsquay.io/fabianofranz/docling-ubi9:2.54.0for Docling componentsThese align with the constants defined in
constants.py.Also applies to: 418-418, 550-550, 627-627, 721-721
601-614: LGTM: EasyOCR removed from model downloads.The
with_easyocr=Falseflag is correctly set across all pipeline types, aligning with the switch to Tesseract CLI which doesn't require separate model downloads.kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (3)
17-17: LGTM: OCR engine defaults consistently updated totesseract_cli.The
docling_ocr_enginedefault value has been updated in:
- Pipeline header comments (Line 17)
- Component input definitions (Line 161)
- Root pipeline input definitions (Line 975)
All are consistent with the source code changes in
standard_components.py.Also applies to: 161-161, 975-975
386-386: LGTM: Container images updated consistently.All executor images are updated to match the constants in
constants.py, ensuring the pipeline uses images with the required Tesseract RPM packages.Also applies to: 500-500, 637-637, 714-714, 808-808
688-691: LGTM: Model download configuration aligned with OCR engine change.The standard pipeline correctly sets
with_easyocr=Falsesince the pipeline now defaults totesseract_cliwhich uses system-installed Tesseract binaries rather than downloaded models.
| DOCLING_BASE_IMAGE = os.getenv( | ||
| "DOCLING_BASE_IMAGE", "quay.io/amaredia/aipcc-docling-image" | ||
| "DOCLING_BASE_IMAGE", "quay.io/fabianofranz/docling-ubi9:2.54.0" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# Search for docling version references in the repository
rg -n 'docling' -i --type=py | head -30Repository: opendatahub-io/data-processing
Length of output: 3221
🏁 Script executed:
# Check for requirements files that might specify the docling version
fd -e 'requirements.txt' -e 'setup.py' -e 'pyproject.toml' -e 'setup.cfg' | xargs rg 'docling' -iRepository: opendatahub-io/data-processing
Length of output: 50386
🏁 Script executed:
# Check constants.py for any related version information
cat -n kubeflow-pipelines/common/constants.py | head -20Repository: opendatahub-io/data-processing
Length of output: 435
Align container image version with pinned library version.
The DOCLING_BASE_IMAGE tag is pinned to 2.54.0, but requirements.txt in both docling-standard and docling-vlm specifies docling == 2.57.0. Update the image tag to 2.57.0 or verify that the image includes the required version and dependencies for Tesseract compatibility.
🤖 Prompt for AI Agents
In @kubeflow-pipelines/common/constants.py around lines 7 - 8, Update the
DOCLING_BASE_IMAGE tag to match the pinned docling library version: change the
DOCLING_BASE_IMAGE default value from "quay.io/fabianofranz/docling-ubi9:2.54.0"
to "quay.io/fabianofranz/docling-ubi9:2.57.0" (or verify the existing image
already contains docling 2.57.0 and required Tesseract deps); modify the
DOCLING_BASE_IMAGE constant so it aligns with the docling == 2.57.0 entries in
requirements.txt to avoid version mismatches.
|
🎉 Auto-merged successfully! ✅ All reviewers approved: 2 Approved by: |
Description
Set OCR engine to be tesseract OCR. Tesseract OCR relies on the TesseractCliOcrOptions being set instead of TesseractOcrOptions the way it was set previously.
Since quay.io/amaredia/aipcc-docling-image does not have all of the required tesseract RPM packages installed the default base image must be reverted back to quay.io/fabianofranz/docling-ubi9:2.54.0.
How Has This Been Tested?
Merge criteria:
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.