feat(parser): add optional PaddleOCR backend #199
SaqlainXoas wants to merge 4 commits into HKUDS:main
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1419979230
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Include paddleocr/pypdfium2 in [all] optional extra
- Keep setup.py 'all' extras consistent (add markdown deps)
- check_installation() also verifies pypdfium2 to avoid PDF runtime failures
- Add test for missing pypdfium2
Pushed ff68d51: it adds the PaddleOCR deps to [all], makes check_installation() also verify pypdfium2, and adds a test for the missing-pypdfium2 case.
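
For readers following along, here is a minimal sketch of what a dependency check covering both modules could look like. The module names `paddleocr` and `pypdfium2` come from the commit message; the helper `missing_modules`, the `finder` parameter, and the exact shape of `check_installation` are assumptions for illustration, not the code in this PR.

```python
from importlib.util import find_spec

# Module names taken from the commit message; everything else is illustrative.
REQUIRED_MODULES = ("paddleocr", "pypdfium2")


def missing_modules(required=REQUIRED_MODULES, finder=find_spec):
    """Return the required modules that cannot be located (hypothetical helper)."""
    return [name for name in required if finder(name) is None]


def check_installation(required=REQUIRED_MODULES, finder=find_spec):
    """Report success only when every required module can be found.

    Checking pypdfium2 here surfaces a missing PDF renderer up front
    instead of at the first PDF parse.
    """
    return not missing_modules(required, finder)
```

Passing `finder` as a parameter is only there so a test can simulate a missing `pypdfium2` without uninstalling anything.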
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ff68d51517
- Preserve repeated OCR lines (no content dedup)
- Stream rendered PDF pages and close on errors
- Add regression test + fix pytest discovery (ignore examples)
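
The first two points in that commit can be sketched roughly as follows. This is not the PR's implementation: `open_document` and `ocr_page` are hypothetical injected callables (in the real backend the opener would presumably be a pypdfium2 document), and the `type`/`text` keys in each entry are an assumption. Only `page_idx` and the `content_list` name come from the PR description.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple


def iter_page_lines(
    open_document: Callable[[str], Any],
    path: str,
    ocr_page: Callable[[Any], List[str]],
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (page_idx, lines) one page at a time instead of rendering all pages first."""
    doc = open_document(path)
    try:
        for page_idx in range(len(doc)):
            page = doc[page_idx]
            try:
                yield page_idx, ocr_page(page)
            finally:
                page.close()  # released even if OCR raises on this page
    finally:
        doc.close()  # released on success, on error, and on early exit


def build_content_list(pages: Iterable[Tuple[int, List[str]]]) -> List[Dict[str, Any]]:
    """Turn OCR lines into entries, keeping repeated lines (no content dedup)."""
    content_list = []
    for page_idx, lines in pages:
        for line in lines:
            text = line.strip()
            if text:  # only blank lines are dropped; duplicates are kept
                content_list.append({"type": "text", "text": text, "page_idx": page_idx})
    return content_list
```

Keeping duplicates matters for documents such as invoices or tables, where the same string can legitimately appear several times on a page.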
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 94fc2295be
Description
This PR adds an optional PaddleOCR parser backend to RAG-Anything while keeping default MinerU/Docling behavior unchanged.
The change is intentionally minimal and focused on:
- `parser="paddleocr"` support,
- `content_list` processing,
Related Issues
Refs #178
Changes Made
- `PaddleOCRParser` in `raganything/parser.py` with:
  - `paddleocr` import (no import-time hard dependency),
  - `ocr(...)` and `predict(...)` call styles,
  - `pypdfium2` page rendering,
  - `content_list` output including `page_idx`.
- `SUPPORTED_PARSERS = ("mineru", "docling", "paddleocr")`
- `get_parser(parser_type)`
- `raganything/raganything.py`
- `raganything/processor.py`
- `raganything/batch_parser.py`
- `raganything/parser.py`
- `pyproject.toml`: `paddleocr` extra (`paddleocr`, `pypdfium2`)
- `setup.py`: matching `extras_require` updates
- `README.md`
- `docs/batch_processing.md`
- `env.example`
- `raganything/config.py`
- `examples/raganything_example.py`
- `examples/batch_dry_run_example.py`
- `tests/test_paddleocr_parser.py`
- `tests/test_parser_wiring.py`
Checklist
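
To make the first few items concrete, here is a rough sketch of how a deferred import, a parser-name check, and support for both call styles might fit together. Only `SUPPORTED_PARSERS`, the `paddleocr` module name, and the `ocr(...)`/`predict(...)` method names come from the description above. The helpers `normalize_parser_type`, `load_paddleocr`, and `run_ocr`, and the choice to prefer `predict(...)` when both exist, are assumptions rather than the PR's code.

```python
import importlib

# Tuple as listed in the PR description.
SUPPORTED_PARSERS = ("mineru", "docling", "paddleocr")


def normalize_parser_type(parser_type):
    """Validate a parser name against SUPPORTED_PARSERS (hypothetical helper)."""
    name = str(parser_type).strip().lower()
    if name not in SUPPORTED_PARSERS:
        raise ValueError(
            f"Unsupported parser {parser_type!r}; expected one of {SUPPORTED_PARSERS}"
        )
    return name


def load_paddleocr(importer=importlib.import_module):
    """Import paddleocr only when the backend is used, so it stays optional."""
    try:
        return importer("paddleocr")
    except ImportError as exc:
        raise ImportError(
            "The paddleocr backend was requested but 'paddleocr' is not installed; "
            "install the optional paddleocr extra."
        ) from exc


def run_ocr(engine, image):
    """Call predict(...) when the engine offers it, otherwise fall back to ocr(...).

    The preference order is an assumption made for this sketch.
    """
    predict = getattr(engine, "predict", None)
    if callable(predict):
        return predict(image)
    return engine.ocr(image)
```

Because the import happens inside `load_paddleocr`, importing the parser module itself never fails for users who only run the other backends.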
Additional Notes
Local validation commands:
- `.venv/bin/python -m pytest -q tests/test_paddleocr_parser.py tests/test_parser_wiring.py` -> 11 passed
- `.venv/bin/ruff check raganything/parser.py raganything/batch_parser.py raganything/processor.py raganything/config.py tests/test_paddleocr_parser.py tests/test_parser_wiring.py` -> passed
Note: `pytest -q` over the whole repository still discovers existing `examples/*_test.py` scripts that expect a `file_path` fixture. This is pre-existing and unrelated to this PR.
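
One common way to keep pytest from collecting scripts like these is a `collect_ignore` list in a `conftest.py`. The sketch below assumes the file sits at the repository root next to `examples/`; it illustrates the general mechanism and is not necessarily how this repository handles it.

```python
# conftest.py (assumed to live at the repository root)

# Paths are resolved relative to this file's directory; pytest skips them
# during collection, so example scripts are not treated as tests.
collect_ignore = ["examples"]
```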