refactor OCR engine configuration to use dependency pattern by AdemBoukhris457 · Pull Request #77 · AdemBoukhris457/Doctra

AdemBoukhris457 · 2025-11-09T13:55:35Z

Summary

Refactors OCR engine configuration in StructuredPDFParser and EnhancedPDFParser to use a dependency pattern. Instead of exposing individual OCR parameters directly in parser classes, users now initialize OCR engines externally and pass them as dependencies.

Motivation

Clearer API: Users explicitly choose and configure one OCR backend at a time
Prevent mixed configurations: Avoids confusion from mixing PyTesseract and PaddleOCR parameters
Reusability: OCR engines can be created once and reused across multiple parsers
Separation of concerns: OCR configuration is separated from parser logic
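
The first two points can be illustrated with a minimal sketch. It does not use Doctra's classes: `StubTesseractEngine`, `StubPaddleEngine`, `build_engine`, the `device` option, and the default values are all hypothetical stand-ins. The idea is that once each backend owns its options, an option belonging to another backend fails at construction time instead of being silently ignored.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StubTesseractEngine:
    """Stand-in for a Tesseract-backed engine (not Doctra's class)."""
    lang: str = "eng"
    psm: int = 4
    oem: int = 3


@dataclass(frozen=True)
class StubPaddleEngine:
    """Stand-in for a PaddleOCR-backed engine; `device` is a made-up option."""
    device: str = "cpu"


BACKENDS = {"tesseract": StubTesseractEngine, "paddle": StubPaddleEngine}


def build_engine(backend, **options):
    """Build exactly one engine and reject options it does not define."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown OCR backend: {backend!r}")
    engine_cls = BACKENDS[backend]
    allowed = {f.name for f in fields(engine_cls)}
    unknown = sorted(set(options) - allowed)
    if unknown:
        raise TypeError(f"{backend} engine does not accept: {', '.join(unknown)}")
    return engine_cls(**options)


engine = build_engine("tesseract", lang="eng", psm=6)
```

Calling `build_engine("tesseract", device="gpu")` raises `TypeError` in this sketch, which is the kind of mixed configuration a flat parameter list cannot catch.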

Changes

Code Changes

Remove individual OCR parameters (ocr_lang, ocr_psm, ocr_oem, ocr_extra_config, paddleocr_*) from parser constructors
Add single ocr_engine parameter accepting PytesseractOCREngine or PaddleOCREngine instances
Maintain backward compatibility by creating a default PytesseractOCREngine when ocr_engine=None
Update CLI to instantiate OCR engines before passing to parsers
Update UI components (Gradio apps) to create OCR engine instances
Update all parser instantiation points across codebase
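
A rough sketch of what a constructor following these changes might look like. This is not the code from the PR: `SketchParser`, the two stub engines, the `recognize` method, and the type check are assumptions made for illustration. Only the `ocr_engine` parameter name and the fall-back-to-Tesseract behavior come from the description above.

```python
class StubTesseractEngine:
    """Stand-in engine; the real PytesseractOCREngine is not used here."""

    def __init__(self, lang="eng", psm=4, oem=3):
        self.lang, self.psm, self.oem = lang, psm, oem

    def recognize(self, image):
        # A real engine would run OCR; the stub only reports its settings.
        return f"<text via tesseract lang={self.lang} psm={self.psm}>"


class StubPaddleEngine:
    """Stand-in engine; `device` is a hypothetical option."""

    def __init__(self, device="cpu"):
        self.device = device

    def recognize(self, image):
        return f"<text via paddle device={self.device}>"


class SketchParser:
    """Parser that receives its OCR engine instead of building it from flags."""

    def __init__(self, ocr_engine=None):
        if ocr_engine is None:
            # Default fallback, so callers that pass nothing keep working.
            ocr_engine = StubTesseractEngine()
        elif not isinstance(ocr_engine, (StubTesseractEngine, StubPaddleEngine)):
            raise TypeError(
                f"ocr_engine must be an engine instance, got {type(ocr_engine).__name__}"
            )
        self.ocr_engine = ocr_engine

    def parse(self, image):
        # The parser only calls the engine; it holds no OCR settings itself.
        return self.ocr_engine.recognize(image)
```

The parser no longer needs to know which options each backend supports, which is where the separation of concerns shows up.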

Documentation Changes

Update README with new OCR engine examples
Update all documentation files (user guides, API reference, examples)
Add examples showing OCR engine reuse across parsers
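
Engine reuse might look roughly like the following. The classes are hypothetical stand-ins rather than Doctra's parsers or engines, and the instance counter exists only to make the sharing visible.

```python
class StubEngine:
    """Stand-in OCR engine that counts how many times it is constructed."""

    instances_created = 0

    def __init__(self, lang="eng"):
        # Imagine model loading or other costly setup happening here.
        StubEngine.instances_created += 1
        self.lang = lang


class StructuredSketchParser:
    """Stand-in for a parser that accepts an engine as a dependency."""

    def __init__(self, ocr_engine):
        self.ocr_engine = ocr_engine


class EnhancedSketchParser:
    """Second stand-in parser, sharing the same constructor contract."""

    def __init__(self, ocr_engine):
        self.ocr_engine = ocr_engine


# Create the engine once, then hand the same instance to both parsers.
shared_engine = StubEngine(lang="eng")
structured = StructuredSketchParser(ocr_engine=shared_engine)
enhanced = EnhancedSketchParser(ocr_engine=shared_engine)
```

Both parsers hold a reference to the same object, so any setup cost in the engine's constructor is paid once.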

Migration Guide

Before:

parser = StructuredPDFParser(
    ocr_lang="eng",
    ocr_psm=6,
    ocr_oem=3
)

After:

from doctra.engines.ocr import PytesseractOCREngine

tesseract_ocr = PytesseractOCREngine(lang="eng", psm=6, oem=3)
parser = StructuredPDFParser(ocr_engine=tesseract_ocr)

Default behavior (backward compatible):

parser = StructuredPDFParser()  # Still works, uses default PyTesseract
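
For call sites that build parser arguments as a dictionary, a small helper along these lines could ease the migration. It is a hypothetical sketch, not part of Doctra: `split_legacy_kwargs` and the `dpi` key are invented for illustration. Only the `ocr_lang` → `lang`, `ocr_psm` → `psm`, and `ocr_oem` → `oem` renames are taken from the Before/After example above. The equivalents of `ocr_extra_config` and `paddleocr_*` are not shown in this PR description, so they are left unmapped.

```python
# Renames taken from the Before/After example in the migration guide.
LEGACY_TO_ENGINE = {"ocr_lang": "lang", "ocr_psm": "psm", "ocr_oem": "oem"}


def split_legacy_kwargs(kwargs):
    """Split old flat parser kwargs into (engine_kwargs, parser_kwargs).

    The input dictionary is not modified.
    """
    engine_kwargs = {}
    parser_kwargs = {}
    for key, value in kwargs.items():
        if key in LEGACY_TO_ENGINE:
            engine_kwargs[LEGACY_TO_ENGINE[key]] = value
        else:
            parser_kwargs[key] = value
    return engine_kwargs, parser_kwargs


# "dpi" is a placeholder for any non-OCR parser argument.
old_style = {"ocr_lang": "eng", "ocr_psm": 6, "ocr_oem": 3, "dpi": 200}
engine_kwargs, parser_kwargs = split_legacy_kwargs(old_style)
# engine_kwargs == {"lang": "eng", "psm": 6, "oem": 3}
# parser_kwargs == {"dpi": 200}
```

The resulting `engine_kwargs` could then be passed to `PytesseractOCREngine(**engine_kwargs)`, and the engine handed to the parser through `ocr_engine=`.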

…rameter accepting pre-initialized PytesseractOCREngine or PaddleOCREngine instances. This improves API clarity and allows reusing OCR engines across parsers.

- Remove ocr_lang, ocr_psm, ocr_oem, ocr_extra_config, paddleocr_* parameters
- Add ocr_engine parameter with default PyTesseract fallback
- Update CLI, UI, and all parser instantiation points

AdemBoukhris457 added 3 commits November 9, 2025 14:50

docs: update README for OCR engine dependency pattern

dca389c

Update examples to show initializing OCR engines externally and passing them to parsers, replacing individual OCR parameter examples.

docs: update documentation for OCR engine dependency pattern

c28bb90

Update all documentation files to show OCR engine initialization and dependency pattern instead of individual OCR parameters. Includes updates to user guides, API reference, quick start, and examples.

AdemBoukhris457 self-assigned this Nov 9, 2025

AdemBoukhris457 added the documentation, enhancement, and refactor labels Nov 9, 2025

AdemBoukhris457 merged commit af743aa into main Nov 9, 2025
1 check passed

AdemBoukhris457 deleted the refactor/ocr_engine_initialization branch November 9, 2025 13:55

AdemBoukhris457 changed the title ~~Refactor OCR engine configuration to use dependency pattern~~ refactor OCR engine configuration to use dependency pattern Nov 9, 2025

AdemBoukhris457 mentioned this pull request Nov 9, 2025

release: prepare v0.8.0 - Dependency Pattern Refactoring #79

Merged
