The [clip](ocrd_cis/ocropy/clip.py) processor can be used to remove intrusions of neighbouring segments in regions / lines of a page.
It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (via `AlternativeImage`).
(Use this to suppress separators and neighbouring text.)
```sh
ocrd-cis-ocropy-clip \
  -I OCR-D-SEG-REGION \
  -O OCR-D-SEG-REGION-CLIP \
  -p '{"level-of-operation": "region"}'
```

Available parameters are:
```sh
   "level-of-operation" [string - "region"]
    PAGE XML hierarchy level granularity to annotate images for
    Possible values: ["region", "line"]
   "dpi" [number - -1]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when negative
   "min_fraction" [number - 0.7]
    share of foreground pixels that must be retained by the largest label
```

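When these processors are driven from a script, the invocation shown above can be assembled programmatically. The following is a minimal sketch, assuming only the `-I`, `-O` and `-p` options that appear in this README; `build_command` is a hypothetical helper, not part of `ocrd_cis`.

```python
import json
import shlex


def build_command(processor, input_grp, output_grp, params=None):
    """Assemble the argument list for a processor call (hypothetical helper).

    Only the -I, -O and -p options shown in this README are used.
    """
    argv = [processor, "-I", input_grp, "-O", output_grp]
    if params:
        # -p takes the parameters as a JSON object passed inline as a string
        argv += ["-p", json.dumps(params)]
    return argv


argv = build_command(
    "ocrd-cis-ocropy-clip",
    "OCR-D-SEG-REGION",
    "OCR-D-SEG-REGION-CLIP",
    {"level-of-operation": "region"},
)
# shlex.join quotes the JSON so the line can be pasted into a shell
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` from within the workspace directory, which avoids shell quoting issues with the JSON parameter string.
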
### ocrd-cis-ocropy-resegment
The [resegment](ocrd_cis/ocropy/resegment.py) processor can be used to remove overlap between neighbouring lines of a page.
It runs a line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
(Use this to polygonalise text lines that are poorly segmented, e.g. via bounding boxes.)
```sh
ocrd-cis-ocropy-resegment \
  -I OCR-D-SEG-LINE \
  -O OCR-D-SEG-LINE-RES \
  -p '{"extend_margins": 3}'
```

Available parameters are:
```sh
   "dpi" [number - -1]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when negative
   "min_fraction" [number - 0.8]
    share of foreground pixels that must be retained by the largest label
   "extend_margins" [number - 3]
    number of pixels to extend the input polygons horizontally and
    vertically before intersecting
```

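To make the role of `min_fraction` more concrete, here is a simplified sketch of picking the dominant line label inside an existing polygon. It assumes that "largest extent" is measured by counting foreground pixels, and it works on plain nested lists; `pick_line_label` is a hypothetical name, and the actual processor may decide differently.

```python
from collections import Counter


def pick_line_label(labels, foreground, polygon_mask, min_fraction=0.8):
    """Return the dominant label inside the polygon, or None.

    labels, foreground and polygon_mask are equally sized 2-D lists:
    labels holds integer line labels (0 = background), the other two
    hold booleans.
    """
    counts = Counter()
    total = 0
    for label_row, fg_row, mask_row in zip(labels, foreground, polygon_mask):
        for label, is_fg, inside in zip(label_row, fg_row, mask_row):
            if is_fg and inside:
                total += 1
                if label:
                    counts[label] += 1
    if not counts:
        return None
    label, count = counts.most_common(1)[0]
    # assumption: reject the winner unless it retains enough of the foreground
    return label if count / total >= min_fraction else None


T, F = True, False
labels = [[1, 1, 1, 1], [1, 1, 2, 2], [2, 2, 2, 2]]
foreground = [[T, T, T, T], [T, T, T, F], [F, F, T, F]]
polygon_mask = [[T, T, T, T], [T, T, T, T], [F, F, F, F]]

# label 1 covers 6 of the 7 foreground pixels inside the polygon
print(pick_line_label(labels, foreground, polygon_mask))
print(pick_line_label(labels, foreground, polygon_mask, min_fraction=0.9))
```

With the default threshold the first call selects label 1; raising `min_fraction` to 0.9 makes the same input fall below the threshold, so no label is selected.
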
### ocrd-cis-ocropy-segment
The [segment](ocrd_cis/ocropy/segment.py) processor can be used to segment (pages or) regions of a page into (regions and) lines.
It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) `TextLine` elements with the resulting polygon outlines to the annotation of the output PAGE.
```sh
   ...
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when negative; when disabled and no meta-data is
    found, 300 is assumed
   "level-of-operation" [string - "region"]
    PAGE XML hierarchy level to read images from and add elements to
    Possible values: ["page", "table", "region"]
   "maxcolseps" [number - 20]
    (when operating on the page/table level) maximum number of
    white/background column separators to detect, counted piece-wise
   "maxseps" [number - 20]
    (when operating on the page/table level) number of black/foreground
    column separators to detect (and suppress), counted piece-wise
   "maximages" [number - 10]
    (when operating on the page level) maximum number of black/foreground
    very large components to detect (and suppress), counted piece-wise
   "csminheight" [number - 4]
    (when operating on the page/table level) minimum height of
    white/background or black/foreground column separators in multiples
    of scale/capheight, counted piece-wise
   "hlminwidth" [number - 10]
    (when operating on the page/table level) minimum width of
    black/foreground horizontal separators in multiples of
    scale/capheight, counted piece-wise
   "gap_height" [number - 0.01]
    (when operating on the page/table level) largest minimum pixel
    average in the horizontal or vertical profiles (across the binarized
    image) to still be regarded as a gap during recursive X-Y cut from
    lines to regions; needs to be larger when more foreground noise is
    present, reduce to avoid mistaking text for noise
   "gap_width" [number - 1.5]
    (when operating on the page/table level) smallest width in multiples
    of scale/capheight of a valley in the horizontal or vertical
    profiles (across the binarized image) to still be regarded as a gap
    during recursive X-Y cut from lines to regions; needs to be smaller
    when more foreground noise is present, increase to avoid mistaking
    inter-line as paragraph gaps and inter-word as inter-column gaps
   "overwrite_order" [boolean - true]
    (when operating on the page/table level) remove any references for
    existing TextRegion elements within the top (page/table) reading
    order; otherwise append
   "overwrite_separators" [boolean - true]
    (when operating on the page/table level) remove any existing
    SeparatorRegion elements; otherwise append
   "overwrite_regions" [boolean - true]
    (when operating on the page/table level) remove any existing
    TextRegion elements; otherwise append
   "overwrite_lines" [boolean - true]
    (when operating on the region level) remove any existing TextLine
    elements; otherwise append
   "spread" [number - 2.4]
    distance in points (pt) from the foreground to project text line (or
    text region) labels into the background for polygonal contours; if
    zero, project half a scale/capheight
```

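The interplay of `gap_height` and `gap_width` can be illustrated on a single projection profile. The sketch below is a simplification under stated assumptions: it requires every value of a valley to stay at or below `gap_height`, it covers only the gap test and not the recursion of the X-Y cut, and `find_gaps` is a hypothetical name rather than a function of the processor.

```python
def find_gaps(profile, scale, gap_height=0.01, gap_width=1.5):
    """Return (start, end) index pairs of valleys that count as gaps.

    profile: per-row or per-column mean of foreground pixels (0.0 to 1.0).
    scale: estimated capheight in pixels.
    """
    min_width = gap_width * scale
    gaps = []
    start = None
    # the trailing sentinel closes a valley that runs to the end
    for pos, value in enumerate(list(profile) + [1.0]):
        if value <= gap_height:
            if start is None:
                start = pos
        elif start is not None:
            # keep the valley only if it is wide enough
            if pos - start >= min_width:
                gaps.append((start, pos))
            start = None
    return gaps


# a narrow valley (4 px) followed by a wide, slightly noisy one (20 px)
profile = [0.3] * 5 + [0.0] * 4 + [0.4] * 5 + [0.005] * 20 + [0.2] * 3
print(find_gaps(profile, scale=10))
```

With a capheight of 10 pixels, a valley must span at least 15 pixels, so only the second valley qualifies. Lowering `gap_width` would also accept the narrow one, which is how inter-line or inter-word spaces can be mistaken for region gaps.
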
### ocrd-cis-ocropy-deskew
The [deskew](ocrd_cis/ocropy/deskew.py) processor can be used to deskew pages / regions of a page.
It runs a projection profile-based skew estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
```sh
   ...
    modulus of maximum skewing angle to detect (larger will be slower, 0
    will deactivate deskewing)
   "level-of-operation" [string - "region"]
    PAGE XML hierarchy level granularity to annotate images for
    Possible values: ["page", "region"]
```

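As an illustration of projection profile-based skew estimation in general, the sketch below tries a range of candidate angles and keeps the one that yields the sharpest row profile. It is not the processor's code: `estimate_skew`, the `step` parameter and the sign convention are assumptions of this example, and the input is a plain list of foreground pixel coordinates.

```python
import math
from collections import Counter


def estimate_skew(points, maxskew, step=0.25):
    """Return the angle (degrees) whose rotation gives the sharpest row profile.

    points: (x, y) coordinates of foreground pixels.
    maxskew: modulus of the largest angle to try; 0 skips the search.
    """
    if maxskew <= 0:
        return 0.0
    best_angle, best_score = 0.0, -1.0
    steps = int(round(maxskew / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        rad = math.radians(angle)
        sin_a, cos_a = math.sin(rad), math.cos(rad)
        # project every pixel onto the axis perpendicular to the text direction
        profile = Counter(round(y * cos_a - x * sin_a) for x, y in points)
        # aligned text lines concentrate pixels in few rows,
        # which maximises the sum of squared row counts
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# three synthetic "text lines", each one pixel thick and skewed by 1 degree
slope = math.tan(math.radians(1.0))
points = [(x, round(y0 + x * slope)) for y0 in (10, 40, 70) for x in range(200)]
print(estimate_skew(points, maxskew=2.0))
```

The number of candidate angles grows with `maxskew`, which matches the note above that larger values will be slower.
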
### ocrd-cis-ocropy-denoise
The [denoise](ocrd_cis/ocropy/denoise.py) processor can be used to despeckle pages / regions / lines of a page.
It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`).
```sh
ocrd-cis-ocropy-denoise \
  -I OCR-D-SEG-LINE-DES \
  -O OCR-D-SEG-LINE-DEN \
  -p '{"noise_maxsize": 2}'
```

Available parameters are:
```sh
   "noise_maxsize" [number - 3.0]
    maximum size in points (pt) for connected components to regard as
    noise (0 will deactivate denoising)
   "dpi" [number - -1]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when negative
   "level-of-operation" [string - "page"]
    PAGE XML hierarchy level granularity to annotate images for
    Possible values: ["page", "region", "line"]
```

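The following sketch shows the general idea of despeckling by connected component analysis, together with the conversion of a size in points to pixels (1 pt = 1/72 inch). It makes several assumptions that the text above does not confirm: components are 4-connected, "size" is taken as the larger bounding-box side, and only foreground specks are removed. Both function names are hypothetical.

```python
from collections import deque


def pt_to_px(points, dpi):
    """Convert typographic points (1/72 inch) to pixels at the given density."""
    return points * dpi / 72.0


def remove_small_components(image, max_px):
    """Return a copy of a binary image (rows of 0/1) without small specks.

    Assumption of this sketch: a component is noise when both sides of
    its bounding box are at most max_px pixels long.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y0 in range(height):
        for x0 in range(width):
            if not image[y0][x0] or seen[y0][x0]:
                continue
            # breadth-first flood fill over 4-connected neighbours
            queue, component = deque([(y0, x0)]), []
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in component]
            xs = [p[1] for p in component]
            if max(ys) - min(ys) + 1 <= max_px and max(xs) - min(xs) + 1 <= max_px:
                for y, x in component:
                    result[y][x] = 0
    return result


image = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
cleaned = remove_small_components(image, max_px=1)
print(cleaned)
print(pt_to_px(3.0, dpi=300))
```

Running the same function on the inverted image would remove small white holes in the same way. At 300 DPI, a size of 3 pt corresponds to 12.5 pixels.
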
### ocrd-cis-ocropy-binarize
The [binarize](ocrd_cis/ocropy/binarize.py) processor can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
```sh
   ...
    binarization method to use (only 'ocropy' will include deskewing and
    denoising)
    Possible values: ["none", "global", "otsu", "gauss-otsu", "ocropy"]
   "threshold" [number - 0.5]
    for the 'ocropy' and 'global' methods, black/white threshold to apply
    on the whitelevel normalized image (the larger the more/heavier
    foreground)
   "grayscale" [boolean - false]
    for the 'ocropy' method, produce grayscale-normalized instead of
    thresholded image
   "maxskew" [number - 0.0]
    modulus of maximum skewing angle (in degrees) to detect (larger will
    be slower, 0 will deactivate deskewing)
   "noise_maxsize" [number - 0]
    maximum pixel number for connected components to regard as noise (0
    will deactivate denoising)
   "level-of-operation" [string - "page"]
    PAGE XML hierarchy level granularity to annotate images for
    Possible values: ["page", "region", "line"]
```

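The effect of `threshold` can be sketched with a plain global cut on an already normalized image. This example assumes values in the range 0.0 (black) to 1.0 (white) and leaves out the whitelevel normalization itself; `threshold_image` is a hypothetical name.

```python
def threshold_image(normalized, threshold=0.5):
    """Binarize a normalized image given as rows of floats in [0, 1].

    Returns rows of 1 (foreground/black) and 0 (background/white).
    """
    return [[1 if value < threshold else 0 for value in row] for row in normalized]


rows = [[0.1, 0.45, 0.6, 0.95]]
print(threshold_image(rows, threshold=0.5))
print(threshold_image(rows, threshold=0.7))
```

Raising the threshold from 0.5 to 0.7 turns the third pixel into foreground as well, which is the "larger means heavier foreground" behaviour described for the parameter.
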
### ocrd-cis-ocropy-dewarp
The [dewarp](ocrd_cis/ocropy/dewarp.py) processor can be used to vertically dewarp text lines of a page.
It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as `AlternativeImage`).
```sh
ocrd-cis-ocropy-dewarp \
  -I OCR-D-SEG-LINE-BIN \
  -O OCR-D-SEG-LINE-DEW \
  -p '{"range": 5}'
```

Available parameters are:
```sh
   "dpi" [number - -1]
    pixel density in dots per inch (overrides any meta-data in the
    images); disabled when negative
   "range" [number - 4.0]
    maximum vertical disposition or maximum margin (will be multiplied by
    mean centerline deltas to yield pixels)
   "max_neighbour" [number - 0.05]
    maximum rate of foreground pixels intruding from neighbouring lines
    (line will not be processed above that)
```

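The idea of vertical dewarping can be shown with a much reduced stand-in for the center normalizer: estimate a center line per column and shift each column so that its center lands on the mean. This sketch does no smoothing and does not model `range` or `max_neighbour`; `center_line` and `dewarp` are hypothetical names and not the processor's functions.

```python
def center_line(image):
    """Mean row index of foreground pixels per column (None for empty columns)."""
    height, width = len(image), len(image[0])
    centers = []
    for x in range(width):
        rows = [y for y in range(height) if image[y][x]]
        centers.append(sum(rows) / len(rows) if rows else None)
    return centers


def dewarp(image):
    """Shift each column vertically so its center lands on the mean center line."""
    height, width = len(image), len(image[0])
    centers = center_line(image)
    known = [c for c in centers if c is not None]
    if not known:
        return [row[:] for row in image]
    target = sum(known) / len(known)
    result = [[0] * width for _ in range(height)]
    for x, center in enumerate(centers):
        if center is None:
            continue
        shift = round(target - center)
        for y in range(height):
            # pixels shifted beyond the image border are dropped
            if image[y][x] and 0 <= y + shift < height:
                result[y + shift][x] = 1
    return result


# a one pixel thick stroke that drifts downwards from left to right
image = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
for row in dewarp(image):
    print(row)
```

After the shift, all three pixels of the drifting stroke end up in the middle row.
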
### ocrd-cis-ocropy-recognize
The [recognize](ocrd_cis/ocropy/recognize.py) processor can be used to recognize the lines / words / glyphs of a page.
It runs LSTM optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.