-The `ocropy-clip` tool can be used to remove intrusions of neighbouring segments in regions / lines of a page.
-It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (as AlternativeImage).
+The `clip` processor can be used to remove intrusions of neighbouring segments in regions / lines of a page.
+It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (via `AlternativeImage`).
 (Use this to suppress separators and neighbouring text.)
 ```sh
 ocrd-cis-ocropy-clip \
-  --input-file-grp OCR-D-SEG-LINE \
-  --output-file-grp OCR-D-SEG-LINE-CLIP \
+  --input-file-grp OCR-D-SEG-REGION \
+  --output-file-grp OCR-D-SEG-REGION-CLIP \
   --mets mets.xml \
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-resegment
-The `ocropy-resegment` tool can be used to remove overlap between neighbouring lines of a page.
+The `resegment` processor can be used to remove overlap between neighbouring lines of a page.
 It runs a line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
-(Use this to polygonalise text lines poorly segmented, e.g. via bounding boxes.)
+(Use this to polygonalise text lines that are poorly segmented, e.g. via bounding boxes.)
 ```sh
 ocrd-cis-ocropy-resegment \
   --input-file-grp OCR-D-SEG-LINE \
   --output-file-grp OCR-D-SEG-LINE-RES \
   --mets mets.xml \
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-segment
-The `ocropy-segment` tool can be used to segment (pages or) regions of a page into lines.
-It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) TextLine elements with the resulting polygon outlines to the annotation of the output PAGE.
-(Does not detect tables or images.)
+The `segment` processor can be used to segment (pages or) regions of a page into (regions and) lines.
+It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) `TextLine` elements with the resulting polygon outlines to the annotation of the output PAGE.
+(Does _not_ detect tables.)
 ```sh
 ocrd-cis-ocropy-segment \
   --input-file-grp OCR-D-SEG-BLOCK \
   --output-file-grp OCR-D-SEG-LINE \
   --mets mets.xml \
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-deskew
-The `ocropy-deskew` tool can be used to deskew pages / regions of a page.
+The `deskew` processor can be used to deskew pages / regions of a page.
 It runs a projection profile-based skew estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
-(Does not include orientation detection.)
+(Does _not_ include orientation detection.)
 ```sh
 ocrd-cis-ocropy-deskew \
   --input-file-grp OCR-D-SEG-LINE \
   --output-file-grp OCR-D-SEG-LINE-DES \
   --mets mets.xml \
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```
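
The `--parameter` argument in the example above points to a JSON file of processor parameters. The following is a minimal sketch of creating and passing such a file. The parameter names `maxskew` and `level-of-operation` and their values are assumptions for illustration and should be checked against the processor's own description (OCR-D processors can print it with `--dump-json`). The `command -v` guard only skips the call when the processor or the METS file is missing.

```sh
#!/bin/sh
# Sketch only: the parameter names and values below are assumptions;
# verify them with `ocrd-cis-ocropy-deskew --dump-json` before use.
cat > config.json <<'EOF'
{
  "maxskew": 2.0,
  "level-of-operation": "region"
}
EOF

# Run the processor only if it is installed and a METS file is present.
if command -v ocrd-cis-ocropy-deskew >/dev/null 2>&1 && [ -f mets.xml ]; then
  ocrd-cis-ocropy-deskew \
    --input-file-grp OCR-D-SEG-LINE \
    --output-file-grp OCR-D-SEG-LINE-DES \
    --mets mets.xml \
    --parameter config.json
else
  echo "ocrd-cis-ocropy-deskew or mets.xml not found; wrote config.json only"
fi
```

Every trailing backslash matters here: without it, the shell would treat the following option line as a separate command.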

 ### ocrd-cis-ocropy-denoise
-The `ocropy-denoise` tool can be used to despeckle pages / regions / lines of a page.
-It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).
+The `denoise` processor can be used to despeckle pages / regions / lines of a page.
+It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`).
 ```sh
 ocrd-cis-ocropy-denoise \
   --input-file-grp OCR-D-SEG-LINE-DES \
   --output-file-grp OCR-D-SEG-LINE-DEN \
   --mets mets.xml \
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-binarize
-The `ocropy-binarize` tool can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
-It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
+The `binarize` processor can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
+It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
 ```sh
 ocrd-cis-ocropy-binarize \
   --input-file-grp OCR-D-SEG-LINE-DES \
   --output-file-grp OCR-D-SEG-LINE-BIN \
   --mets mets.xml \
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-dewarp
-The `ocropy-dewarp` tool can be used to dewarp text lines of a page.
-It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
+The `dewarp` processor can be used to vertically dewarp text lines of a page.
+It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as `AlternativeImage`).
 ```sh
 ocrd-cis-ocropy-dewarp \
   --input-file-grp OCR-D-SEG-LINE-BIN \
   --output-file-grp OCR-D-SEG-LINE-DEW \
   --mets mets.xml \
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-recognize
-The `ocropy-recognize` tool can be used to recognize the lines / words / glyphs of a page.
+The `recognize` processor can be used to recognize the lines / words / glyphs of a page.
 It runs LSTM optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
 ```sh
 ocrd-cis-ocropy-recognize \
   --input-file-grp OCR-D-SEG-LINE-DEW \
   --output-file-grp OCR-D-OCR-OCRO \
   --mets mets.xml \
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### Tesserocr
@@ -263,21 +305,29 @@ own models and place them into: /usr/share/tesseract-ocr/4.00/tessdata

 A decent pipeline might look like this:

-0. page-level binarization
+1. image normalization/optimization
+1. page-level binarization
 1. page-level cropping
-2. (page-level binarization)
-3. page-level deskewing
-4. (page-level dewarping)
-5. region segmentation
-6. region-level clipping
-7. (region-level deskewing)
-8. line segmentation
-9. (line-level clipping or resegmentation)
-10. line-level dewarping
-11. line-level recognition
-12. (line-level alignment and post-correction)
-
-If GT is used, steps 1, 5 and 8 can be omitted. Else if a segmentation is used in 5 and 8 which does not produce overlapping sections, steps 6 and 9 can be omitted.
+1. (page-level binarization)
+1. (page-level despeckling)
+1. page-level deskewing
+1. (page-level dewarping)
+1. region segmentation, possibly subdivided into
+   1. text/non-text separation
+   1. text region segmentation (and classification)
+   1. reading order detection
+   1. non-text region classification
+1. region-level clipping
+1. (region-level deskewing)
+1. line segmentation
+1. (line-level clipping or resegmentation)
+1. line-level dewarping
+1. line-level recognition
+1. (line-level alignment and post-correction)
+
+If GT is used, then cropping/segmentation steps can be omitted.
+
+If a segmentation is used which does not produce overlapping segments, then clipping/resegmentation can be omitted.
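
The pipeline listed above works by feeding each step's output file group into the next step as its input file group. The following is a hedged sketch of that chaining, not the full recommended pipeline: it reuses only the processors and file group names from the usage examples earlier in this README. The helpers `run_step` and `run_pipeline` and the `DRY_RUN` variable are hypothetical names introduced for illustration, and the per-step `--parameter` files are omitted even though some processors may require one.

```sh
#!/bin/sh
# Sketch: chain processors so that each step reads the previous step's
# output file group. run_step, run_pipeline and DRY_RUN are illustrative
# helpers, not part of ocrd_cis.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands (default)

run_step() {
  # $1 = processor, $2 = input file group, $3 = output file group
  if [ "$DRY_RUN" = 1 ]; then
    echo "$1 --input-file-grp $2 --output-file-grp $3 --mets mets.xml"
  else
    "$1" --input-file-grp "$2" --output-file-grp "$3" --mets mets.xml
  fi
}

run_pipeline() {
  # The && chain stops at the first failing step.
  run_step ocrd-cis-ocropy-segment   OCR-D-SEG-BLOCK    OCR-D-SEG-LINE     &&
  run_step ocrd-cis-ocropy-deskew    OCR-D-SEG-LINE     OCR-D-SEG-LINE-DES &&
  run_step ocrd-cis-ocropy-binarize  OCR-D-SEG-LINE-DES OCR-D-SEG-LINE-BIN &&
  run_step ocrd-cis-ocropy-dewarp    OCR-D-SEG-LINE-BIN OCR-D-SEG-LINE-DEW &&
  run_step ocrd-cis-ocropy-recognize OCR-D-SEG-LINE-DEW OCR-D-OCR-OCRO
}

run_pipeline
```

With the default `DRY_RUN=1` the script only prints the five commands, which makes it easy to check that the file groups line up before running anything. Setting `DRY_RUN=0` executes the processors against `mets.xml` in the current directory.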

 ## Testing
 To run a few basic tests type `make test` (`ocrd_cis` has to be
@@ -289,11 +339,11 @@ installed in order to run any tests).
 * Create a new (empty) workspace: `ocrd workspace init workspace-dir`
 * cd into `workspace-dir`
 * Add new file to workspace: `ocrd workspace add file -G group -i id