Commit 1d270ef

[New Package] Add PaddleOCR Reader for extracting text from images in PDFs (#19827)

* submit for llama-index-readers-paddle-ocr
* Add PaddleOCR reader integration
  1. Version: Changed to 0.1.0
  2. License: Changed to MIT
  3. Class Naming: PDFPaddleOCR changed to PaddleOcrReader
  4. Default Language: Changed from "ch" to "en"
  5. Arbitrary Filtering: Changed arbitrary logic like '"第", "页"...' to '"page", "of"'
  6. Testing: Added several test cases covering all methods. I ran uv run --pytest -v and all tests passed.
* ci: lint

1 parent ded12a8 commit 1d270ef

11 files changed, +5916 -0 lines changed

Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@
llama_index/_static
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
bin/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
etc/
include/
lib/
lib64/
parts/
sdist/
share/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
.ruff_cache

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints
notebooks/

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
pyvenv.cfg

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Jetbrains
.idea
modules/
*.swp

# VsCode
.vscode

# pipenv
Pipfile
Pipfile.lock

# pyright
pyrightconfig.json
/.github

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# CHANGELOG

## [0.1.0] - 2025-09-15

- Add maintainers and keywords from library.json (llamahub)

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
MIT License

Copyright (c) Michael Ip

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
GIT_ROOT ?= $(shell git rev-parse --show-toplevel)

help:	## Show all Makefile targets.
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[33m%-30s\033[0m %s\n", $$1, $$2}'

format:	## Run code autoformatters (black).
	pre-commit install
	git ls-files | xargs pre-commit run black --files

lint:	## Run linters: pre-commit (black, ruff, codespell) and mypy
	pre-commit install && git ls-files | xargs pre-commit run --show-diff-on-failure --files

test:	## Run tests via pytest.
	pytest tests

watch-docs:	## Build and watch documentation.
	sphinx-autobuild docs/ docs/_build/html --open-browser --watch $(GIT_ROOT)/llama_index/

Lines changed: 27 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,27 @@
1+
# Paddle OCR loader
2+
3+
```bash
4+
pip install llama-index-readers-paddle-ocr
5+
```
6+
7+
This loader reads the equations, symbols, and tables included in the PDF.
8+
9+
Users can input the path of the academic PDF document `file` which they want to parse. This OCR understands LaTeX math and tables.
10+
11+
## Usage
12+
13+
Here's an example usage of the PDFPaddleOCR.
14+
15+
```python
16+
from llama_index.readers.paddle_ocr import PDFPaddleOCR
17+
18+
reader = PDFPaddleOCR()
19+
20+
pdf_path = Path("/path/to/pdf")
21+
22+
documents = reader.load_data(pdf_path)
23+
```
24+
25+
## Miscellaneous
26+
27+
An `output` folder will be created with the same name as the pdf and `.mmd` extension.
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
from llama_index.readers.paddle_ocr.base import PDFPaddleOCRReader

__all__ = ["PDFPaddleOCRReader"]
