| `paddleocr_use_textline_orientation` | bool | False | Enable text line orientation classification (PaddleOCR only) |
| `paddleocr_device` | str | "gpu" | Device for PaddleOCR: "cpu" or "gpu" (PaddleOCR only) |

**Note**: When using `ocr_engine="paddleocr"`, PaddleOCR 3.0's PP-OCRv5_server model is used by default. Models are automatically downloaded on first use.

docs/user-guide/engines/ocr-engine.md (112 additions, 8 deletions)
@@ -4,21 +4,62 @@ Guide to text extraction using OCR in Doctra.

## Overview

Doctra supports two OCR engines for text extraction:

1. **PyTesseract** (default) - Traditional Tesseract OCR engine with extensive language support
2. **PaddleOCR** - Advanced PP-OCRv5_server model released in PaddleOCR 3.0, offering superior accuracy and performance

You can choose between these engines based on your needs. PyTesseract is the default and works well for most use cases, while PaddleOCR provides enhanced accuracy for complex documents.

## Choosing an OCR Engine

### PyTesseract (Default)

PyTesseract is the default OCR engine and works well for most documents. It offers extensive language support and fine-grained control.

```python
from doctra import StructuredPDFParser

# PyTesseract is the default - no need to specify
parser = StructuredPDFParser(
    ocr_lang="eng",
    ocr_psm=6,
    ocr_oem=3
)

# Or explicitly specify it
parser = StructuredPDFParser(
    ocr_engine="pytesseract",
    ocr_lang="eng",
    ocr_psm=6,
    ocr_oem=3
)
```
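
The `ocr_psm` and `ocr_oem` values correspond to Tesseract's page segmentation mode and OCR engine mode. As a rough illustration, and assuming they end up as the standard Tesseract `--psm` and `--oem` options (how Doctra forwards them internally is not shown in this guide), a hypothetical helper could validate them and build the option string like this:

```python
def build_tesseract_config(psm: int, oem: int) -> str:
    """Build a Tesseract option string from PSM and OEM values.

    Hypothetical helper for illustration; it is not part of Doctra.
    """
    # Tesseract accepts page segmentation modes 0-13 and engine modes 0-3.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be in 0..13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be in 0..3, got {oem}")
    return f"--oem {oem} --psm {psm}"


print(build_tesseract_config(psm=6, oem=3))  # --oem 3 --psm 6
```

In Tesseract, PSM 6 treats the page as a single uniform block of text, and OEM 3 lets Tesseract pick the engine based on what is installed.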

### PaddleOCR with PP-OCRv5_server

PaddleOCR provides the advanced **PP-OCRv5_server** model (default in PaddleOCR 3.0), which offers:

- **Higher accuracy** for complex documents
- **Better performance** on GPU
- **Advanced text detection** and recognition
- **Automatic model management** (models downloaded automatically)

```python
from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    ocr_engine="paddleocr",
    paddleocr_device="gpu",  # Use "cpu" if no GPU available
    paddleocr_use_doc_orientation_classify=False,
    paddleocr_use_doc_unwarping=False,
    paddleocr_use_textline_orientation=False
)
```
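
Because `paddleocr_device` defaults to `"gpu"`, a script that also runs on machines without a GPU may want to choose the value at runtime. The sketch below is one possible heuristic, not something Doctra provides: `pick_paddleocr_device` is a hypothetical helper that only checks whether `nvidia-smi` is on the `PATH`, which does not guarantee that PaddlePaddle was installed with GPU support.

```python
import shutil
from typing import Callable, Optional


def pick_paddleocr_device(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return "gpu" if nvidia-smi is found on PATH, otherwise "cpu".

    Heuristic only (an assumption of this sketch): the presence of the
    NVIDIA tool does not prove that GPU inference will actually work.
    """
    return "gpu" if which("nvidia-smi") else "cpu"


# Keyword arguments using the parameter names from this guide.
paddle_kwargs = {
    "ocr_engine": "paddleocr",
    "paddleocr_device": pick_paddleocr_device(),
}
print(paddle_kwargs)
```

The resulting dictionary could then be unpacked into the parser, for example `StructuredPDFParser(**paddle_kwargs)`.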

## PyTesseract Parameters

These parameters are only used when `ocr_engine="pytesseract"` (or when using the default):

These parameters are only used when `ocr_engine="paddleocr"`:

**paddleocr_device**
: Device to use for OCR processing
    - `"gpu"`: Use GPU acceleration (default, recommended if available)
    - `"cpu"`: Use CPU processing

**paddleocr_use_doc_orientation_classify**
: Enable document orientation classification model (default: `False`)
    - Automatically detects and corrects document orientation

**paddleocr_use_doc_unwarping**
: Enable text image rectification model (default: `False`)
    - Corrects perspective distortion in scanned documents

**paddleocr_use_textline_orientation**
: Enable text line orientation classification model (default: `False`)
    - Handles rotated text lines

**Note**: The PP-OCRv5_server model is the default in PaddleOCR 3.0. Models are downloaded automatically on first use and cached for later runs.

## Improving Accuracy

### 1. Choose the Right OCR Engine

For complex documents or when accuracy is critical, consider using PaddleOCR:

```python
from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    ocr_engine="paddleocr",
    paddleocr_device="gpu"  # Use GPU for better performance
)
```

### 2. Increase DPI

Higher resolution improves text recognition for both engines:

```python
parser = StructuredPDFParser(dpi=300)
```
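
To get a feel for what a DPI setting implies, the sketch below computes the pixel dimensions of a rasterized page. It assumes that `dpi` controls the resolution at which each PDF page is rendered before OCR; `page_pixels` is a hypothetical helper, not part of Doctra.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel size of a page of the given size in inches rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter (8.5 x 11 in) at two common settings.
print(page_pixels(8.5, 11, 150))  # (1275, 1650)
print(page_pixels(8.5, 11, 300))  # (2550, 3300)
```

Doubling the DPI quadruples the number of pixels per page, so memory use and processing time grow along with accuracy.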

### 3. Use Image Restoration

Enhance document quality before OCR:

```python
from doctra import EnhancedPDFParser

parser = EnhancedPDFParser(
    use_image_restoration=True,
    ocr_engine="paddleocr"  # Combine with PaddleOCR for best results
)
```

### 4. Correct Language (PyTesseract)

For PyTesseract, specify the document language:

```python
from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    ocr_engine="pytesseract",
    ocr_lang="fra"  # For French documents
)
```

## Multi-language Documents (PyTesseract)

PyTesseract can recognize several languages in one document. Join the language codes with `+`:

```python
from doctra import StructuredPDFParser

parser = StructuredPDFParser(
    ocr_engine="pytesseract",
    ocr_lang="eng+fra+deu"  # Multiple languages
)
```
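
When the language list comes from configuration or user input, it can help to assemble the `ocr_lang` string programmatically. The following is a minimal sketch; `build_ocr_lang` is a hypothetical helper introduced here for illustration, and it only joins codes without checking that the matching Tesseract language data is installed.

```python
from typing import Iterable


def build_ocr_lang(codes: Iterable[str]) -> str:
    """Join Tesseract language codes with '+', dropping blanks and duplicates."""
    seen = []
    for code in codes:
        code = code.strip()
        if code and code not in seen:
            seen.append(code)
    if not seen:
        raise ValueError("at least one language code is required")
    return "+".join(seen)


print(build_ocr_lang(["eng", "fra", "deu", "eng"]))  # eng+fra+deu
```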

## When to Use Each Engine

### Use PyTesseract when:
- You are working with standard documents
- You need multi-language support
- You want fine-grained control over OCR parameters
- You are running in a CPU-only environment

### Use PaddleOCR when:
- You are dealing with complex or degraded documents
- You need maximum accuracy
- You have a GPU available for faster processing
- You are working with Asian languages (better support)
- You are processing large batches of documents
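
The guidance above can be condensed into a small decision helper. This is only a sketch under simplifying assumptions: `choose_ocr_kwargs` is a hypothetical function, it reduces the criteria to two flags, and it uses only the parameter names described in this guide.

```python
def choose_ocr_kwargs(complex_document: bool, has_gpu: bool) -> dict:
    """Map the guidance above to parser keyword arguments (illustrative only)."""
    if complex_document:
        # PaddleOCR is suggested for complex or degraded documents; fall back
        # to CPU processing when no GPU is available.
        return {
            "ocr_engine": "paddleocr",
            "paddleocr_device": "gpu" if has_gpu else "cpu",
        }
    # Standard documents: stay with the default engine.
    return {"ocr_engine": "pytesseract"}


print(choose_ocr_kwargs(complex_document=True, has_gpu=False))
# {'ocr_engine': 'paddleocr', 'paddleocr_device': 'cpu'}
```

The returned dictionary can be unpacked into a parser constructor, for example `StructuredPDFParser(**choose_ocr_kwargs(True, True))`.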

## See Also

- [Enhanced Parser](../parsers/enhanced-parser.md) - Improve OCR with restoration