Commit dae0492

Merge pull request #73 from AdemBoukhris457/feature/paddleocr_integration
feat: Add PaddleOCR PP-OCRv5_server engine support
2 parents 7a7f8f9 + 6f40932 commit dae0492

9 files changed: +449 -38 lines changed

docs/api/parsers.md

Lines changed: 11 additions & 4 deletions
@@ -194,10 +194,17 @@ parser.parse(
 
 | Parameter | Type | Default | Description |
 |-----------|------|---------|-------------|
-| `ocr_lang` | str | "eng" | Tesseract language code |
-| `ocr_psm` | int | 4 | Page segmentation mode |
-| `ocr_oem` | int | 3 | OCR engine mode |
-| `ocr_extra_config` | str | "" | Additional Tesseract configuration |
+| `ocr_engine` | str | "pytesseract" | OCR engine to use: "pytesseract" or "paddleocr" |
+| `ocr_lang` | str | "eng" | Tesseract language code (PyTesseract only) |
+| `ocr_psm` | int | 4 | Page segmentation mode (PyTesseract only) |
+| `ocr_oem` | int | 3 | OCR engine mode (PyTesseract only) |
+| `ocr_extra_config` | str | "" | Additional Tesseract configuration (PyTesseract only) |
+| `paddleocr_use_doc_orientation_classify` | bool | False | Enable document orientation classification (PaddleOCR only) |
+| `paddleocr_use_doc_unwarping` | bool | False | Enable text image rectification (PaddleOCR only) |
+| `paddleocr_use_textline_orientation` | bool | False | Enable text line orientation classification (PaddleOCR only) |
+| `paddleocr_device` | str | "gpu" | Device for PaddleOCR: "cpu" or "gpu" (PaddleOCR only) |
+
+**Note**: When using `ocr_engine="paddleocr"`, PaddleOCR 3.0's PP-OCRv5_server model is used by default. Models are automatically downloaded on first use.
 
 ### VLM Parameters
 
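
Annotation (not part of the commit): the parameter table in this hunk marks each option as PyTesseract-only or PaddleOCR-only. The sketch below shows one way a caller could spot options that the selected engine would not use. It is a standalone illustration: `ENGINE_OPTIONS` and `ignored_options` are hypothetical names that do not exist in Doctra, and only the option names are taken from the table.

```python
# Hypothetical helper; only the option names come from the parameter table.
ENGINE_OPTIONS = {
    "pytesseract": {"ocr_lang", "ocr_psm", "ocr_oem", "ocr_extra_config"},
    "paddleocr": {
        "paddleocr_use_doc_orientation_classify",
        "paddleocr_use_doc_unwarping",
        "paddleocr_use_textline_orientation",
        "paddleocr_device",
    },
}


def ignored_options(ocr_engine: str, **options) -> list[str]:
    """Return the engine-specific options that `ocr_engine` would not use."""
    if ocr_engine not in ENGINE_OPTIONS:
        raise ValueError(f"unknown ocr_engine: {ocr_engine!r}")
    # Collect every option that belongs to some other engine.
    other_engines = set(ENGINE_OPTIONS) - {ocr_engine}
    foreign = set().union(*(ENGINE_OPTIONS[name] for name in other_engines))
    return sorted(name for name in options if name in foreign)


if __name__ == "__main__":
    print(ignored_options("paddleocr", ocr_lang="fra", paddleocr_device="cpu"))
```

A check like this could run before constructing a parser, so that a stray `ocr_lang` passed together with `ocr_engine="paddleocr"` is reported rather than silently dropped. Whether Doctra itself warns in that case is not shown in this diff.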

docs/user-guide/core-concepts.md

Lines changed: 59 additions & 6 deletions
@@ -171,18 +171,43 @@ This shows bounding boxes with colors:
 
 ## OCR Processing
 
-OCR (Optical Character Recognition) extracts text from images.
+OCR (Optical Character Recognition) extracts text from images. Doctra supports two OCR engines:
+
+### OCR Engines
+
+**PyTesseract** (default)
+: Traditional Tesseract OCR with extensive language support and fine-grained control.
+
+**PaddleOCR**
+: Advanced PP-OCRv5_server model (PaddleOCR 3.0) with superior accuracy and GPU acceleration.
 
 ### Configuration
 
+**Using PyTesseract (default):**
+
 ```python
 parser = StructuredPDFParser(
+    ocr_engine="pytesseract",  # Optional, this is the default
     ocr_lang="eng",  # Language
     ocr_psm=6,  # Page segmentation mode
     ocr_oem=3  # OCR Engine mode
 )
 ```
 
+**Using PaddleOCR:**
+
+```python
+parser = StructuredPDFParser(
+    ocr_engine="paddleocr",
+    paddleocr_device="gpu",  # Use "cpu" if no GPU available
+    paddleocr_use_doc_orientation_classify=False,
+    paddleocr_use_doc_unwarping=False,
+    paddleocr_use_textline_orientation=False
+)
+```
+
+### PyTesseract Parameters
+
 **ocr_lang**
 : Tesseract language code. Examples: `eng`, `fra`, `spa`, `deu`
 
@@ -201,22 +226,50 @@ parser = StructuredPDFParser(
 - `1`: Neural nets LSTM
 - `3`: Default (both)
 
+### PaddleOCR Parameters
+
+**paddleocr_device**
+: Processing device: `"gpu"` (default, recommended) or `"cpu"`
+
+**paddleocr_use_doc_orientation_classify**
+: Enable automatic document orientation detection (default: `False`)
+
+**paddleocr_use_doc_unwarping**
+: Enable perspective correction for scanned documents (default: `False`)
+
+**paddleocr_use_textline_orientation**
+: Enable text line orientation classification (default: `False`)
+
 ### Improving OCR Accuracy
 
-1. **Increase DPI**: Higher resolution = better text recognition
+1. **Choose PaddleOCR for complex documents**: Better accuracy on degraded or complex documents
+```python
+parser = StructuredPDFParser(
+    ocr_engine="paddleocr",
+    paddleocr_device="gpu"
+)
+```
+
+2. **Increase DPI**: Higher resolution = better text recognition
 ```python
 parser = StructuredPDFParser(dpi=300)
 ```
 
-2. **Use Image Restoration**: Enhance document quality first
+3. **Use Image Restoration**: Enhance document quality first
 ```python
 from doctra import EnhancedPDFParser
-parser = EnhancedPDFParser(use_image_restoration=True)
+parser = EnhancedPDFParser(
+    use_image_restoration=True,
+    ocr_engine="paddleocr"  # Combine for best results
+)
 ```
 
-3. **Correct Language**: Specify document language
+4. **Correct Language** (PyTesseract): Specify document language
 ```python
-parser = StructuredPDFParser(ocr_lang="fra") # French
+parser = StructuredPDFParser(
+    ocr_engine="pytesseract",
+    ocr_lang="fra"  # French
+)
 ```
 
 ## Image Restoration
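
Annotation (not part of the commit): the docs in this file default `paddleocr_device` to `"gpu"` and suggest `"cpu"` when no GPU is available. A caller that must run on both kinds of machine could pick the value at runtime. The sketch below is one possible approach, not Doctra's behavior: `pick_paddleocr_device` is a hypothetical helper, and it assumes PaddlePaddle exposes `paddle.device.is_compiled_with_cuda()` and `paddle.device.cuda.device_count()`. Any failure while probing falls back to `"cpu"`.

```python
def pick_paddleocr_device() -> str:
    """Return "gpu" when a CUDA-enabled Paddle build sees a GPU, else "cpu"."""
    try:
        import paddle  # optional dependency; may be absent on CPU-only machines
    except ImportError:
        return "cpu"
    try:
        # Assumed PaddlePaddle calls; verify against the installed version.
        if paddle.device.is_compiled_with_cuda() and paddle.device.cuda.device_count() > 0:
            return "gpu"
    except Exception:
        # Treat any probing failure as "no usable GPU".
        pass
    return "cpu"


if __name__ == "__main__":
    print(pick_paddleocr_device())
```

The result could then be passed as `paddleocr_device=pick_paddleocr_device()` alongside `ocr_engine="paddleocr"`.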

docs/user-guide/engines/ocr-engine.md

Lines changed: 112 additions & 8 deletions
@@ -4,21 +4,62 @@ Guide to text extraction using OCR in Doctra.
 
 ## Overview
 
-Doctra uses Tesseract OCR to extract text from document images. The OCR engine is highly configurable for different document types and languages.
+Doctra supports two OCR engines for text extraction:
 
-## Configuration
+1. **PyTesseract** (default) - Traditional Tesseract OCR engine with extensive language support
+2. **PaddleOCR** - Advanced PP-OCRv5_server model released in PaddleOCR 3.0, offering superior accuracy and performance
+
+You can choose between these engines based on your needs. PyTesseract is the default and works well for most use cases, while PaddleOCR provides enhanced accuracy for complex documents.
+
+## Choosing an OCR Engine
+
+### PyTesseract (Default)
+
+PyTesseract is the default OCR engine and works well for most documents. It offers extensive language support and fine-grained control.
 
 ```python
 from doctra import StructuredPDFParser
 
+# PyTesseract is the default - no need to specify
 parser = StructuredPDFParser(
     ocr_lang="eng",
     ocr_psm=6,
     ocr_oem=3
 )
+
+# Or explicitly specify it
+parser = StructuredPDFParser(
+    ocr_engine="pytesseract",
+    ocr_lang="eng",
+    ocr_psm=6,
+    ocr_oem=3
+)
+```
+
+### PaddleOCR with PP-OCRv5_server
+
+PaddleOCR provides the advanced **PP-OCRv5_server** model (default in PaddleOCR 3.0), which offers:
+
+- **Higher accuracy** for complex documents
+- **Better performance** on GPU
+- **Advanced text detection** and recognition
+- **Automatic model management** (models downloaded automatically)
+
+```python
+from doctra import StructuredPDFParser
+
+parser = StructuredPDFParser(
+    ocr_engine="paddleocr",
+    paddleocr_device="gpu",  # Use "cpu" if no GPU available
+    paddleocr_use_doc_orientation_classify=False,
+    paddleocr_use_doc_unwarping=False,
+    paddleocr_use_textline_orientation=False
+)
 ```
 
-## Parameters
+## PyTesseract Parameters
+
+These parameters are only used when `ocr_engine="pytesseract"` (or when using the default):
 
 **ocr_lang**
 : Tesseract language code
@@ -41,40 +82,103 @@ parser = StructuredPDFParser(
 - `1`: Neural nets LSTM
 - `3`: Default (both)
 
+**ocr_extra_config**
+: Additional Tesseract configuration string
+
+## PaddleOCR Parameters
+
+These parameters are only used when `ocr_engine="paddleocr"`:
+
+**paddleocr_device**
+: Device to use for OCR processing
+- `"gpu"`: Use GPU acceleration (default, recommended if available)
+- `"cpu"`: Use CPU processing
+
+**paddleocr_use_doc_orientation_classify**
+: Enable document orientation classification model (default: `False`)
+- Automatically detects and corrects document orientation
+
+**paddleocr_use_doc_unwarping**
+: Enable text image rectification model (default: `False`)
+- Corrects perspective distortion in scanned documents
+
+**paddleocr_use_textline_orientation**
+: Enable text line orientation classification model (default: `False`)
+- Handles rotated text lines
+
+**Note**: The PP-OCRv5_server model is automatically used by default in PaddleOCR 3.0. Models are automatically downloaded on first use and cached for future use.
+
 ## Improving Accuracy
 
-### 1. Increase DPI
+### 1. Choose the Right OCR Engine
+
+For complex documents or when accuracy is critical, consider using PaddleOCR:
+
+```python
+parser = StructuredPDFParser(
+    ocr_engine="paddleocr",
+    paddleocr_device="gpu"  # Use GPU for better performance
+)
+```
+
+### 2. Increase DPI
+
+Higher resolution improves text recognition for both engines:
 
 ```python
 parser = StructuredPDFParser(dpi=300)
 ```
 
-### 2. Use Image Restoration
+### 3. Use Image Restoration
+
+Enhance document quality before OCR:
 
 ```python
 from doctra import EnhancedPDFParser
 
 parser = EnhancedPDFParser(
-    use_image_restoration=True
+    use_image_restoration=True,
+    ocr_engine="paddleocr"  # Combine with PaddleOCR for best results
 )
 ```
 
-### 3. Correct Language
+### 4. Correct Language (PyTesseract)
+
+For PyTesseract, specify the document language:
 
 ```python
 parser = StructuredPDFParser(
+    ocr_engine="pytesseract",
     ocr_lang="fra"  # For French documents
 )
 ```
 
-## Multi-language Documents
+## Multi-language Documents (PyTesseract)
+
+PyTesseract supports multiple languages:
 
 ```python
 parser = StructuredPDFParser(
+    ocr_engine="pytesseract",
     ocr_lang="eng+fra+deu"  # Multiple languages
 )
 ```
 
+## When to Use Each Engine
+
+### Use PyTesseract when:
+- Working with standard documents
+- Need multi-language support
+- Want fine-grained control over OCR parameters
+- CPU-only environment
+
+### Use PaddleOCR when:
+- Dealing with complex or degraded documents
+- Need maximum accuracy
+- Have GPU available for faster processing
+- Working with Asian languages (better support)
+- Processing large batches of documents
+
 ## See Also
 
 - [Enhanced Parser](../parsers/enhanced-parser.md) - Improve OCR with restoration
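
Annotation (not part of the commit): the PyTesseract options documented in this file (`ocr_lang`, `ocr_psm`, `ocr_oem`, `ocr_extra_config`) correspond to Tesseract's `+`-joined language codes and its `--psm` / `--oem` command-line flags. The diff does not show how `PytesseractOCREngine` combines them, so the sketch below is only an assumption of what such a step might look like. `tesseract_options` is a hypothetical helper, and its defaults (`psm=4`, `oem=3`) are copied from the parameter table in `docs/api/parsers.md`.

```python
def tesseract_options(langs, psm=4, oem=3, extra_config=""):
    """Build the language string and config string in Tesseract's usual form."""
    # Tesseract accepts several languages joined with "+", e.g. "eng+fra".
    lang = "+".join(langs)
    # --psm selects the page segmentation mode, --oem the OCR engine mode.
    config = f"--psm {psm} --oem {oem} {extra_config}".strip()
    return lang, config


if __name__ == "__main__":
    print(tesseract_options(["eng", "fra", "deu"], psm=6))
```

Strings in this shape are what `pytesseract.image_to_string` takes through its `lang` and `config` arguments.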

doctra/engines/ocr/__init__.py

Lines changed: 3 additions & 2 deletions
@@ -1,4 +1,5 @@
 from .pytesseract_engine import PytesseractOCREngine
-from .api import ocr_image
+from .paddleocr_engine import PaddleOCREngine
+from .api import ocr_image, ocr_image_paddleocr
 
-__all__ = ["PytesseractOCREngine", "ocr_image"]
+__all__ = ["PytesseractOCREngine", "PaddleOCREngine", "ocr_image", "ocr_image_paddleocr"]

doctra/engines/ocr/api.py

Lines changed: 32 additions & 0 deletions
@@ -4,6 +4,7 @@
 from PIL import Image
 
 from .pytesseract_engine import PytesseractOCREngine
+from .paddleocr_engine import PaddleOCREngine
 
 
 def ocr_image(
@@ -34,3 +35,34 @@ def ocr_image(
         tesseract_cmd=tesseract_cmd, lang=lang, psm=psm, oem=oem, extra_config=extra_config
     )
     return engine.recognize(cropped_pil)
+
+
+def ocr_image_paddleocr(
+    cropped_pil: Image.Image,
+    *,
+    use_doc_orientation_classify: bool = False,
+    use_doc_unwarping: bool = False,
+    use_textline_orientation: bool = False,
+    device: str = "gpu",
+) -> str:
+    """
+    One-shot OCR: run PaddleOCR on a cropped PIL image and return text.
+
+    Convenience function that creates a PaddleOCREngine instance and
+    immediately runs OCR on the provided image. Useful for quick text extraction
+    without needing to manage engine instances.
+
+    :param cropped_pil: PIL Image object to perform OCR on
+    :param use_doc_orientation_classify: Enable document orientation classification (default: False)
+    :param use_doc_unwarping: Enable text image rectification (default: False)
+    :param use_textline_orientation: Enable text line orientation classification (default: False)
+    :param device: Device to use for OCR ("cpu" or "gpu", default: "gpu")
+    :return: Extracted text string from the image
+    """
+    engine = PaddleOCREngine(
+        use_doc_orientation_classify=use_doc_orientation_classify,
+        use_doc_unwarping=use_doc_unwarping,
+        use_textline_orientation=use_textline_orientation,
+        device=device
+    )
+    return engine.recognize(cropped_pil)
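
Annotation (not part of the commit): `ocr_image_paddleocr` constructs a new `PaddleOCREngine` on every call, which suits one-off use. If engine construction turns out to be expensive (an assumption; the constructor is not shown in this diff), a caller that processes many crops may prefer to build the engine once and reuse it. The sketch below illustrates that pattern with a stand-in class so it runs without PaddleOCR. `StubEngine`, `get_engine`, and `ocr_many` are hypothetical names, and `StubEngine` only mimics the `recognize` method used above.

```python
from functools import lru_cache


class StubEngine:
    """Stand-in for an OCR engine; counts how many times it is constructed."""

    instances = 0

    def __init__(self, device: str = "gpu"):
        StubEngine.instances += 1
        self.device = device

    def recognize(self, image) -> str:
        # A real engine would run OCR here; the stub just echoes its input.
        return f"text from {image} on {self.device}"


@lru_cache(maxsize=None)
def get_engine(device: str = "gpu") -> StubEngine:
    """Create one engine per distinct device and reuse it on later calls."""
    return StubEngine(device=device)


def ocr_many(images, device: str = "gpu") -> list[str]:
    """Run OCR over several images with a single shared engine."""
    engine = get_engine(device)
    return [engine.recognize(img) for img in images]


if __name__ == "__main__":
    print(ocr_many(["crop-1", "crop-2"], device="cpu"))
    print(StubEngine.instances)
```

With the real class, `get_engine` would return `PaddleOCREngine(...)` instead of the stub. The cache key would then need to include every constructor option that changes the engine's behavior.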
