refactor OCR engine configuration to use dependency pattern #77

Merged
AdemBoukhris457 merged 3 commits into main from refactor/ocr_engine_initialization
Nov 9, 2025

@AdemBoukhris457
Owner

Summary

Refactors OCR engine configuration in StructuredPDFParser and EnhancedPDFParser to use a dependency injection pattern. Instead of exposing individual OCR parameters on the parser classes, users now initialize an OCR engine externally and pass it to the parser as a dependency.

Motivation

  • Clearer API: Users explicitly choose and configure one OCR backend at a time
  • Prevent mixed configurations: Avoids confusion from mixing PyTesseract and PaddleOCR parameters
  • Reusability: OCR engines can be created once and reused across multiple parsers
  • Separation of concerns: OCR configuration is separated from parser logic

Changes

Code Changes

  • Remove individual OCR parameters (ocr_lang, ocr_psm, ocr_oem, ocr_extra_config, paddleocr_*) from parser constructors
  • Add single ocr_engine parameter accepting PytesseractOCREngine or PaddleOCREngine instances
  • Maintain backward compatibility for the default case: create a default PytesseractOCREngine when ocr_engine=None
  • Update CLI to instantiate OCR engines before passing to parsers
  • Update UI components (Gradio apps) to create OCR engine instances
  • Update all parser instantiation points across codebase
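
The constructor change listed above can be sketched roughly as follows. This is a minimal illustration, not Doctra's actual code: both engine classes and SketchParser are stand-ins defined here so the snippet runs on its own. Only the lang, psm, and oem names come from the migration guide; the default values and the type check are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class PytesseractOCREngine:
    """Stand-in engine; field names follow the migration guide."""
    lang: str = "eng"  # placeholder default, not confirmed by the PR
    psm: int = 6       # placeholder default, not confirmed by the PR
    oem: int = 3       # placeholder default, not confirmed by the PR


@dataclass
class PaddleOCREngine:
    """Stand-in engine; its real options are not shown in this PR."""


OCREngine = Union[PytesseractOCREngine, PaddleOCREngine]


class SketchParser:
    """Hypothetical parser that receives its OCR engine as a dependency."""

    def __init__(self, ocr_engine: Optional[OCREngine] = None):
        if ocr_engine is None:
            # Fallback keeps the no-argument constructor working.
            ocr_engine = PytesseractOCREngine()
        elif not isinstance(ocr_engine, (PytesseractOCREngine, PaddleOCREngine)):
            # Assumed validation: reject anything that is not an engine instance.
            raise TypeError("ocr_engine must be an OCR engine instance or None")
        self.ocr_engine = ocr_engine
```

With this shape the parser no longer needs to know any backend-specific option names; it only holds a reference to whichever engine it was given.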

Documentation Changes

  • Update README with new OCR engine examples
  • Update all documentation files (user guides, API reference, examples)
  • Add examples showing OCR engine reuse across parsers
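
Engine reuse across parsers might look like the sketch below. The classes are hypothetical stand-ins (CountingEngine, StructuredSketch, EnhancedSketch), not Doctra's real classes; the counter exists only to make it visible that a single engine instance is shared.

```python
class CountingEngine:
    """Hypothetical stand-in for an OCR engine that is costly to set up."""

    instances_created = 0

    def __init__(self, lang: str = "eng"):
        CountingEngine.instances_created += 1
        self.lang = lang


class StructuredSketch:
    """Stand-in for a parser that accepts an engine as a dependency."""

    def __init__(self, ocr_engine: CountingEngine):
        self.ocr_engine = ocr_engine


class EnhancedSketch:
    """Second stand-in parser that accepts the same kind of engine."""

    def __init__(self, ocr_engine: CountingEngine):
        self.ocr_engine = ocr_engine


# Create the engine once, then hand the same object to both parsers.
shared_engine = CountingEngine(lang="eng")
structured = StructuredSketch(ocr_engine=shared_engine)
enhanced = EnhancedSketch(ocr_engine=shared_engine)
```

Because both parsers hold the same object, any setup cost the engine has is paid once rather than once per parser.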

Migration Guide

Before:

parser = StructuredPDFParser(
    ocr_lang="eng",
    ocr_psm=6,
    ocr_oem=3
)

After:

from doctra.engines.ocr import PytesseractOCREngine

tesseract_ocr = PytesseractOCREngine(lang="eng", psm=6, oem=3)
parser = StructuredPDFParser(ocr_engine=tesseract_ocr)
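
For call sites that build parser arguments as a dictionary, the renaming shown above (ocr_lang to lang, ocr_psm to psm, ocr_oem to oem) can be applied mechanically. The helper below is a hypothetical migration aid, not part of Doctra. It leaves out ocr_extra_config and the paddleocr_* options because their new names are not shown in this PR, and the dpi key is only a placeholder for an unrelated parser option.

```python
from typing import Any, Dict, Tuple

# Old parser keyword -> new engine keyword, as shown in the migration guide.
LEGACY_TO_ENGINE = {"ocr_lang": "lang", "ocr_psm": "psm", "ocr_oem": "oem"}


def split_legacy_ocr_kwargs(
    kwargs: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate legacy OCR keywords (renamed) from the remaining parser keywords."""
    engine_kwargs: Dict[str, Any] = {}
    parser_kwargs: Dict[str, Any] = {}
    for key, value in kwargs.items():
        if key in LEGACY_TO_ENGINE:
            engine_kwargs[LEGACY_TO_ENGINE[key]] = value
        else:
            parser_kwargs[key] = value
    return engine_kwargs, parser_kwargs


# "dpi" is a placeholder for any non-OCR parser option.
old_call = {"ocr_lang": "eng", "ocr_psm": 6, "ocr_oem": 3, "dpi": 200}
engine_kwargs, parser_kwargs = split_legacy_ocr_kwargs(old_call)
```

The resulting engine_kwargs can then be passed as PytesseractOCREngine(**engine_kwargs), and the engine handed to the parser through ocr_engine together with parser_kwargs.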

Default behavior (backward compatible):

parser = StructuredPDFParser()  # Still works, uses default PyTesseract

…rameter accepting pre-initialized PytesseractOCREngine or PaddleOCREngine instances. This improves API clarity and allows reusing OCR engines across parsers.

- Remove ocr_lang, ocr_psm, ocr_oem, ocr_extra_config, paddleocr_* parameters
- Add ocr_engine parameter with default PyTesseract fallback
- Update CLI, UI, and all parser instantiation points
Update examples to show initializing OCR engines externally and passing them to parsers, replacing individual OCR parameter examples.
Update all documentation files to show OCR engine initialization and dependency pattern instead of individual OCR parameters. Includes updates to user guides, API reference, quick start, and examples.
@AdemBoukhris457 AdemBoukhris457 self-assigned this Nov 9, 2025
@AdemBoukhris457 AdemBoukhris457 added the documentation, enhancement, and refactor labels Nov 9, 2025
@AdemBoukhris457 AdemBoukhris457 merged commit af743aa into main Nov 9, 2025
1 check passed
@AdemBoukhris457 AdemBoukhris457 deleted the refactor/ocr_engine_initialization branch November 9, 2025 13:55
@AdemBoukhris457 AdemBoukhris457 changed the title from "Refactor OCR engine configuration to use dependency pattern" to "refactor OCR engine configuration to use dependency pattern" Nov 9, 2025