feat(parser): add optional PaddleOCR backend #199

Open
SaqlainXoas wants to merge 4 commits into HKUDS:main from SaqlainXoas:feat/paddleocr-parser

@SaqlainXoas

Description

This PR adds an optional PaddleOCR parser backend to RAG-Anything while keeping default MinerU/Docling behavior unchanged.

The change is intentionally minimal and focused on:

  • adding parser="paddleocr" support,
  • preserving optional dependency behavior (lazy imports),
  • keeping output compatible with existing content_list processing,
  • updating docs/config/examples,
  • adding CI-safe tests.
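The lazy-import behavior listed above can be sketched as follows. This is a minimal illustration, not the PR's code: the class name LazyOCRBackend and its load() method are hypothetical, and only the standard-library importlib.import_module call is relied on.

```python
import importlib


class LazyOCRBackend:
    """Hypothetical wrapper that defers importing an optional package."""

    def __init__(self, module_name: str = "paddleocr"):
        self.module_name = module_name
        self._module = None  # nothing is imported at construction time

    def load(self):
        """Import the optional module on first use and cache it."""
        if self._module is None:
            try:
                self._module = importlib.import_module(self.module_name)
            except ImportError as exc:
                # Re-raise with an actionable message (wording is an assumption).
                raise ImportError(
                    f"Optional dependency '{self.module_name}' is not installed; "
                    "install the matching extra to use this backend."
                ) from exc
        return self._module
```

With this shape, constructing the backend never touches the optional package, so MinerU and Docling users are unaffected when paddleocr is absent.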

Related Issues

Refs #178

Changes Made

  • Added PaddleOCRParser in raganything/parser.py with:
    • lazy paddleocr import (no import-time hard dependency),
    • support for both ocr(...) and predict(...) call styles,
    • PDF OCR path using pypdfium2 page rendering,
    • normalized text-block content_list output including page_idx.
  • Added centralized parser registry/factory:
    • SUPPORTED_PARSERS = ("mineru", "docling", "paddleocr")
    • get_parser(parser_type)
  • Switched parser selection wiring to the shared factory in:
    • raganything/raganything.py
    • raganything/processor.py
    • raganything/batch_parser.py
    • parser CLI (raganything/parser.py)
  • Updated optional dependencies:
    • pyproject.toml: paddleocr extra (paddleocr, pypdfium2)
    • setup.py: matching extras_require updates
  • Updated docs/config/examples:
    • README.md
    • docs/batch_processing.md
    • env.example
    • raganything/config.py
    • examples/raganything_example.py
    • examples/batch_dry_run_example.py
  • Added tests:
    • tests/testpaddleocr_parser.py
    • tests/testparser_wiring.py
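A registry and factory of the kind listed under "Changes Made" might look roughly like the sketch below. SUPPORTED_PARSERS and the get_parser(parser_type) name come from the description above; the stub class, the _PARSER_FACTORIES mapping, and the error message are assumptions made for illustration.

```python
SUPPORTED_PARSERS = ("mineru", "docling", "paddleocr")  # tuple from the PR description


class _StubParser:
    """Hypothetical stand-in for the real parser classes."""

    def __init__(self, name: str):
        self.name = name


# Hypothetical registry: parser name -> zero-argument constructor.
_PARSER_FACTORIES = {
    name: (lambda n=name: _StubParser(n)) for name in SUPPORTED_PARSERS
}


def get_parser(parser_type: str):
    """Return a parser instance for a supported name, or raise ValueError."""
    if parser_type not in SUPPORTED_PARSERS:
        raise ValueError(
            f"Unsupported parser '{parser_type}'. "
            f"Supported parsers: {', '.join(SUPPORTED_PARSERS)}"
        )
    return _PARSER_FACTORIES[parser_type]()
```

Routing every call site through one factory keeps the list of valid names and the error message in a single place.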

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)

Additional Notes

Local validation commands:

  • .venv/bin/python -m pytest -q tests/testpaddleocr_parser.py tests/testparser_wiring.py -> 11 passed
  • .venv/bin/ruff check raganything/parser.py raganything/batch_parser.py raganything/processor.py raganything/config.py tests/testpaddleocr_parser.py tests/testparser_wiring.py -> passed

Note: running pytest -q over the whole repository still collects the existing examples/*_test.py scripts, which expect a file_path fixture. This behavior predates this PR and is unrelated to it.

@LarFii
Collaborator

LarFii commented Feb 17, 2026

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1419979230

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

- Include paddleocr/pypdfium2 in [all] optional extra
- Keep setup.py 'all' extras consistent (add markdown deps)
- check_installation() also verifies pypdfium2 to avoid PDF runtime failures
- Add test for missing pypdfium2

@SaqlainXoas
Author

Pushed ff68d51, which adds the PaddleOCR dependencies to the [all] extra, makes check_installation() also verify pypdfium2, and adds a test.
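One way to write such a dependency check, assuming nothing about the real check_installation() beyond what is stated here, is to probe for each module without importing it. The helper name missing_modules and the required parameter are hypothetical; importlib.util.find_spec is standard library.

```python
from importlib.util import find_spec


def missing_modules(required=("paddleocr", "pypdfium2")):
    """Return the top-level module names that cannot be located."""
    return [name for name in required if find_spec(name) is None]


def check_installation(required=("paddleocr", "pypdfium2")) -> bool:
    """True only when every required module is importable (sketch only)."""
    return not missing_modules(required)
```

Checking pypdfium2 alongside paddleocr means a missing PDF renderer is reported up front rather than surfacing as a failure midway through a PDF.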

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect to github.

SaqlainXoas (Author) left a comment

Please review it.

@LarFii
Collaborator

LarFii commented Feb 18, 2026

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ff68d51517

- Preserve repeated OCR lines (no content dedup)

- Stream rendered PDF pages and close on errors

- Add regression test + fix pytest discovery (ignore examples)
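To illustrate the first item in this commit message, a normalization step that keeps repeated OCR lines might look like the sketch below. It rests on assumptions: the function name to_content_list, the input shape (one list of strings per page), the type/text keys, and the skipping of blank lines are illustrative; only page_idx is named in the PR description.

```python
def to_content_list(pages):
    """Flatten per-page OCR lines into text blocks tagged with page_idx.

    pages: iterable of lists of recognized text lines, one list per page.
    Repeated lines are kept; no content-based deduplication is applied.
    """
    content_list = []
    for page_idx, lines in enumerate(pages):
        for line in lines:
            text = line.strip()
            if not text:
                continue  # drop whitespace-only output, never repeats
            content_list.append(
                {"type": "text", "text": text, "page_idx": page_idx}
            )
    return content_list
```

Keeping duplicates matters for documents such as invoices or tables, where the same string can legitimately appear several times on a page.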
@LarFii
Collaborator

LarFii commented Feb 20, 2026

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 94fc2295be
