Commit 9748b39

Merge pull request #55 from bertsky/order-only
segment: add param 'overwrite_order'
2 parents fa40e7e + 5ec2f2d commit 9748b39

5 files changed, +214 -54 lines changed

README.md

Lines changed: 184 additions & 41 deletions
@@ -178,7 +178,7 @@ Arguments:
  * `--mets` path to METS file in the workspace
 
 ### ocrd-cis-ocropy-train
-The `ocropy-train` tool can be used to train LSTM models.
+The [ocropy-train](ocrd_cis/ocropy/train.py) tool can be used to train LSTM models.
 It takes ground truth from the workspace and saves (image+text) snippets from the corresponding pages.
 Then a model is trained on all snippets for 1 million (or the given number of) randomized iterations from the parameter file.
 
@@ -192,95 +192,238 @@ java -jar $(ocrd-cis-data -jar) \
 ```
 
 ### ocrd-cis-ocropy-clip
-The `clip` processor can be used to remove intrusions of neighbouring segments in regions / lines of a page.
+The [clip](ocrd_cis/ocropy/clip.py) processor can be used to remove intrusions of neighbouring segments in regions / lines of a page.
 It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (via `AlternativeImage`).
 (Use this to suppress separators and neighbouring text.)
 ```sh
 ocrd-cis-ocropy-clip \
-  --input-file-grp OCR-D-SEG-REGION \
-  --output-file-grp OCR-D-SEG-REGION-CLIP \
-  --mets mets.xml
-  --parameter path/to/config.json
+  -I OCR-D-SEG-REGION \
+  -O OCR-D-SEG-REGION-CLIP \
+  -p '{"level-of-operation": "region"}'
+```
+
+Available parameters are:
+```sh
+   "level-of-operation" [string - "region"]
+    PAGE XML hierarchy level granularity to annotate images for
+    Possible values: ["region", "line"]
+   "dpi" [number - -1]
+    pixel density in dots per inch (overrides any meta-data in the
+    images); disabled when negative
+   "min_fraction" [number - 0.7]
+    share of foreground pixels that must be retained by the largest label
 ```
 
 ### ocrd-cis-ocropy-resegment
-The `resegment` processor can be used to remove overlap between neighbouring lines of a page.
+The [resegment](ocrd_cis/ocropy/resegment.py) processor can be used to remove overlap between neighbouring lines of a page.
 It runs a line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
 (Use this to polygonalise text lines that are poorly segmented, e.g. via bounding boxes.)
 ```sh
 ocrd-cis-ocropy-resegment \
-  --input-file-grp OCR-D-SEG-LINE \
-  --output-file-grp OCR-D-SEG-LINE-RES \
-  --mets mets.xml
-  --parameter path/to/config.json
+  -I OCR-D-SEG-LINE \
+  -O OCR-D-SEG-LINE-RES \
+  -p '{"extend_margins": 3}'
+```
+
+Available parameters are:
+```sh
+   "dpi" [number - -1]
+    pixel density in dots per inch (overrides any meta-data in the
+    images); disabled when negative
+   "min_fraction" [number - 0.8]
+    share of foreground pixels that must be retained by the largest label
+   "extend_margins" [number - 3]
+    number of pixels to extend the input polygons horizontally and
+    vertically before intersecting
 ```
 
 ### ocrd-cis-ocropy-segment
-The `segment` processor can be used to segment (pages or) regions of a page into (regions and) lines.
+The [segment](ocrd_cis/ocropy/segment.py) processor can be used to segment (pages or) regions of a page into (regions and) lines.
 It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) `TextLine` elements with the resulting polygon outlines to the annotation of the output PAGE.
 (Does _not_ detect tables.)
 ```sh
 ocrd-cis-ocropy-segment \
-  --input-file-grp OCR-D-SEG-BLOCK \
-  --output-file-grp OCR-D-SEG-LINE \
-  --mets mets.xml
-  --parameter path/to/config.json
+  -I OCR-D-SEG-BLOCK \
+  -O OCR-D-SEG-LINE \
+  -p '{"level-of-operation": "page", "gap_height": 0.015}'
+```
+
+Available parameters are:
+```sh
+   "dpi" [number - -1]
+    pixel density in dots per inch (overrides any meta-data in the
+    images); disabled when negative; when disabled and no meta-data is
+    found, 300 is assumed
+   "level-of-operation" [string - "region"]
+    PAGE XML hierarchy level to read images from and add elements to
+    Possible values: ["page", "table", "region"]
+   "maxcolseps" [number - 20]
+    (when operating on the page/table level) maximum number of
+    white/background column separators to detect, counted piece-wise
+   "maxseps" [number - 20]
+    (when operating on the page/table level) number of black/foreground
+    column separators to detect (and suppress), counted piece-wise
+   "maximages" [number - 10]
+    (when operating on the page level) maximum number of black/foreground
+    very large components to detect (and suppress), counted piece-wise
+   "csminheight" [number - 4]
+    (when operating on the page/table level) minimum height of
+    white/background or black/foreground column separators in multiples
+    of scale/capheight, counted piece-wise
+   "hlminwidth" [number - 10]
+    (when operating on the page/table level) minimum width of
+    black/foreground horizontal separators in multiples of
+    scale/capheight, counted piece-wise
+   "gap_height" [number - 0.01]
+    (when operating on the page/table level) largest minimum pixel
+    average in the horizontal or vertical profiles (across the binarized
+    image) to still be regarded as a gap during recursive X-Y cut from
+    lines to regions; needs to be larger when more foreground noise is
+    present, reduce to avoid mistaking text for noise
+   "gap_width" [number - 1.5]
+    (when operating on the page/table level) smallest width in multiples
+    of scale/capheight of a valley in the horizontal or vertical
+    profiles (across the binarized image) to still be regarded as a gap
+    during recursive X-Y cut from lines to regions; needs to be smaller
+    when more foreground noise is present, increase to avoid mistaking
+    inter-line as paragraph gaps and inter-word as inter-column gaps
+   "overwrite_order" [boolean - true]
+    (when operating on the page/table level) remove any references for
+    existing TextRegion elements within the top (page/table) reading
+    order; otherwise append
+   "overwrite_separators" [boolean - true]
+    (when operating on the page/table level) remove any existing
+    SeparatorRegion elements; otherwise append
+   "overwrite_regions" [boolean - true]
+    (when operating on the page/table level) remove any existing
+    TextRegion elements; otherwise append
+   "overwrite_lines" [boolean - true]
+    (when operating on the region level) remove any existing TextLine
+    elements; otherwise append
+   "spread" [number - 2.4]
+    distance in points (pt) from the foreground to project text line (or
+    text region) labels into the background for polygonal contours; if
+    zero, project half a scale/capheight
 ```
 
 ### ocrd-cis-ocropy-deskew
-The `deskew` processor can be used to deskew pages / regions of a page.
+The [deskew](ocrd_cis/ocropy/deskew.py) processor can be used to deskew pages / regions of a page.
 It runs a projection profile-based skew estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
 (Does _not_ include orientation detection.)
 ```sh
 ocrd-cis-ocropy-deskew \
-  --input-file-grp OCR-D-SEG-LINE \
-  --output-file-grp OCR-D-SEG-LINE-DES \
-  --mets mets.xml
-  --parameter path/to/config.json
+  -I OCR-D-SEG-LINE \
+  -O OCR-D-SEG-LINE-DES \
+  -p '{"level-of-operation": "page", "maxskew": 10}'
+```
+
+Available parameters are:
+```sh
+   "maxskew" [number - 5.0]
+    modulus of maximum skewing angle to detect (larger will be slower, 0
+    will deactivate deskewing)
+   "level-of-operation" [string - "region"]
+    PAGE XML hierarchy level granularity to annotate images for
+    Possible values: ["page", "region"]
 ```
 
 ### ocrd-cis-ocropy-denoise
-The `denoise` processor can be used to despeckle pages / regions / lines of a page.
+The [denoise](ocrd_cis/ocropy/denoise.py) processor can be used to despeckle pages / regions / lines of a page.
 It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`).
 ```sh
 ocrd-cis-ocropy-denoise \
-  --input-file-grp OCR-D-SEG-LINE-DES \
-  --output-file-grp OCR-D-SEG-LINE-DEN \
-  --mets mets.xml
-  --parameter path/to/config.json
+  -I OCR-D-SEG-LINE-DES \
+  -O OCR-D-SEG-LINE-DEN \
+  -p '{"noise_maxsize": 2}'
+```
+
+Available parameters are:
+```sh
+   "noise_maxsize" [number - 3.0]
+    maximum size in points (pt) for connected components to regard as
+    noise (0 will deactivate denoising)
+   "dpi" [number - -1]
+    pixel density in dots per inch (overrides any meta-data in the
+    images); disabled when negative
+   "level-of-operation" [string - "page"]
+    PAGE XML hierarchy level granularity to annotate images for
+    Possible values: ["page", "region", "line"]
 ```
 
 ### ocrd-cis-ocropy-binarize
-The `binarize` processor can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
+The [binarize](ocrd_cis/ocropy/binarize.py) processor can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
 It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
 ```sh
 ocrd-cis-ocropy-binarize \
-  --input-file-grp OCR-D-SEG-LINE-DES \
-  --output-file-grp OCR-D-SEG-LINE-BIN \
-  --mets mets.xml
-  --parameter path/to/config.json
+  -I OCR-D-SEG-LINE-DES \
+  -O OCR-D-SEG-LINE-BIN \
+  -p '{"level-of-operation": "page", "threshold": 0.7}'
+```
+
+Available parameters are:
+```sh
+   "method" [string - "ocropy"]
+    binarization method to use (only 'ocropy' will include deskewing and
+    denoising)
+    Possible values: ["none", "global", "otsu", "gauss-otsu", "ocropy"]
+   "threshold" [number - 0.5]
+    for the 'ocropy' and ' global' method, black/white threshold to apply
+    on the whitelevel normalized image (the larger the more/heavier
+    foreground)
+   "grayscale" [boolean - false]
+    for the 'ocropy' method, produce grayscale-normalized instead of
+    thresholded image
+   "maxskew" [number - 0.0]
+    modulus of maximum skewing angle (in degrees) to detect (larger will
+    be slower, 0 will deactivate deskewing)
+   "noise_maxsize" [number - 0]
+    maximum pixel number for connected components to regard as noise (0
+    will deactivate denoising)
+   "level-of-operation" [string - "page"]
+    PAGE XML hierarchy level granularity to annotate images for
+    Possible values: ["page", "region", "line"]
 ```
 
 ### ocrd-cis-ocropy-dewarp
-The `dewarp` processor can be used to vertically dewarp text lines of a page.
+The [dewarp](ocrd_cis/ocropy/dewarp.py) processor can be used to vertically dewarp text lines of a page.
 It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as `AlternativeImage`).
 ```sh
 ocrd-cis-ocropy-dewarp \
-  --input-file-grp OCR-D-SEG-LINE-BIN \
-  --output-file-grp OCR-D-SEG-LINE-DEW \
-  --mets mets.xml
-  --parameter path/to/config.json
+  -I OCR-D-SEG-LINE-BIN \
+  -O OCR-D-SEG-LINE-DEW \
+  -p '{"range": 5}'
+```
+
+Available parameters are:
+```sh
+   "dpi" [number - -1]
+    pixel density in dots per inch (overrides any meta-data in the
+    images); disabled when negative
+   "range" [number - 4.0]
+    maximum vertical disposition or maximum margin (will be multiplied by
+    mean centerline deltas to yield pixels)
+   "max_neighbour" [number - 0.05]
+    maximum rate of foreground pixels intruding from neighbouring lines
+    (line will not be processed above that)
 ```
 
 ### ocrd-cis-ocropy-recognize
-The `recognize` processor can be used to recognize the lines / words / glyphs of a page.
+The [recognize](ocrd_cis/ocropy/recognize.py) processor can be used to recognize the lines / words / glyphs of a page.
 It runs LSTM optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
 ```sh
 ocrd-cis-ocropy-recognize \
-  --input-file-grp OCR-D-SEG-LINE-DEW \
-  --output-file-grp OCR-D-OCR-OCRO \
-  --mets mets.xml
-  --parameter path/to/config.json
+  -I OCR-D-SEG-LINE-DEW \
+  -O OCR-D-OCR-OCRO \
+  -p '{"textequiv_level": "word", "model": "fraktur-jze.pyrnn"}'
+```
+
+Available parameters are:
+```sh
+   "textequiv_level" [string - "line"]
+    PAGE XML hierarchy level granularity to add the TextEquiv results to
+    Possible values: ["line", "word", "glyph"]
+   "model" [string]
+    ocropy model to apply (e.g. fraktur.pyrnn)
 ```
 
 ### Tesserocr
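
Review note on the README hunk: the new usage examples pass parameters as inline JSON via `-p` instead of a config file path. When such calls are generated from a script, serializing a dict avoids quoting mistakes. The following is only a sketch; the helper name `build_command` is hypothetical and not part of ocrd_cis or OCR-D, and only the `-I`, `-O` and `-p` flags and the parameter names are taken from the README text in this diff.

```python
import json
import shlex


def build_command(executable, input_grp, output_grp, params):
    """Assemble a processor call as an argument list (hypothetical helper)."""
    # json.dumps yields valid JSON booleans (true/false), unlike str(dict)
    return [executable, "-I", input_grp, "-O", output_grp, "-p", json.dumps(params)]


cmd = build_command(
    "ocrd-cis-ocropy-segment",
    "OCR-D-SEG-BLOCK",
    "OCR-D-SEG-LINE",
    {"level-of-operation": "page", "overwrite_order": False},
)

# shlex.join quotes the JSON argument so the line can be pasted into a shell
print(shlex.join(cmd))
```

Passing `cmd` as a list to `subprocess.run` would sidestep shell quoting entirely; the joined string is only for display.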

ocrd_cis/ocrd-tool.json

Lines changed: 6 additions & 1 deletion
@@ -1,6 +1,6 @@
 {
   "git_url": "https://github.com/cisocrgroup/ocrd_cis",
-  "version": "0.0.8",
+  "version": "0.0.10",
   "tools": {
     "ocrd-cis-ocropy-binarize": {
       "executable": "ocrd-cis-ocropy-binarize",
@@ -365,6 +365,11 @@
           "default": 1.5,
           "description": "(when operating on the page/table level) smallest width in multiples of scale/capheight of a valley in the horizontal or vertical profiles (across the binarized image) to still be regarded as a gap during recursive X-Y cut from lines to regions; needs to be smaller when more foreground noise is present, increase to avoid mistaking inter-line as paragraph gaps and inter-word as inter-column gaps"
         },
+        "overwrite_order": {
+          "type": "boolean",
+          "default": true,
+          "description": "(when operating on the page/table level) remove any references for existing TextRegion elements within the top (page/table) reading order; otherwise append"
+        },
         "overwrite_separators": {
           "type": "boolean",
           "default": true,

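Review note on the ocrd-tool.json hunk: because `overwrite_order` is declared with `"default": true`, existing workflows that do not pass the parameter will start replacing reading-order references once they upgrade. A minimal sketch of how declared defaults combine with user-supplied values follows. It is an illustration under assumptions, not OCR-D core's actual validation code; `effective_parameters` and the trimmed `TOOL_JSON` excerpt are hypothetical.

```python
import json

# Trimmed, hypothetical excerpt shaped like one tool's "parameters" block
TOOL_JSON = """
{
  "parameters": {
    "gap_width": {"type": "number", "default": 1.5},
    "overwrite_order": {"type": "boolean", "default": true},
    "overwrite_separators": {"type": "boolean", "default": true}
  }
}
"""


def effective_parameters(tool, user_params):
    """Merge declared defaults with user-supplied values (hypothetical helper)."""
    declared = tool["parameters"]
    unknown = set(user_params) - set(declared)
    if unknown:
        raise ValueError("unknown parameters: %s" % sorted(unknown))
    defaults = {name: spec["default"] for name, spec in declared.items() if "default" in spec}
    # user-supplied values take precedence over defaults
    return {**defaults, **user_params}


tool = json.loads(TOOL_JSON)
print(effective_parameters(tool, {}))
print(effective_parameters(tool, {"overwrite_order": False}))
```

Workflows that relied on the previous append behaviour would need to pass `"overwrite_order": false` explicitly.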
ocrd_cis/ocropy/common.py

Lines changed: 1 addition & 1 deletion
@@ -1300,7 +1300,7 @@ def finalize():
 lpartitions[label2])
 lpartitions[label2] = [0]
 # re-label and re-order surviving partitions
-#lpartitions = np.setdiff1d(np.unique(partitions), [0]) # without bg/sepm
+lpartitions = np.setdiff1d(np.unique(partitions), [0]) # without bg/sepm
 npartitions = len(lpartitions)
 if debug: LOG.debug(' %d sepmask partitions after filtering and merging', npartitions)
 if npartitions > 1:
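
Review note on the common.py hunk: the re-enabled line recomputes `lpartitions` as the sorted unique labels of `partitions` with the background label 0 removed, so `npartitions` counts only surviving partitions. The sketch below mimics that result in pure Python on a toy label image. It is a stand-in for `np.setdiff1d(np.unique(partitions), [0])`, not the NumPy call itself, and `surviving_labels` is a hypothetical name.

```python
def surviving_labels(partitions):
    """Sorted unique labels of a 2-D label image, excluding background 0."""
    labels = {label for row in partitions for label in row}
    return sorted(labels - {0})


# toy label image: 0 is background, label 2 was merged away earlier
partitions = [
    [0, 0, 3, 3],
    [1, 1, 0, 3],
    [1, 1, 0, 0],
]

lpartitions = surviving_labels(partitions)
npartitions = len(lpartitions)
print(lpartitions, npartitions)  # [1, 3] 2
```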

ocrd_cis/ocropy/segment.py

Lines changed: 17 additions & 8 deletions
@@ -144,14 +144,17 @@ def process(self):
 then iterate over the element hierarchy down to the requested level.
 
 Depending on ``level-of-operation``, consider existing segments:
-- if ``overwrite_separators=True`` on ``page`` level, then
-  delete any SeparatorRegions,
-- if ``overwrite_regions=True`` on ``page`` level, then
-  delete any top-level TextRegions (along with ReadingOrder),
-- if ``overwrite_regions=True`` on ``table`` level, then
-  delete any TextRegions in TableRegions (along with their OrderGroup),
-- if ``overwrite_lines=True`` on ``region`` level, then
+- If ``overwrite_separators=True`` on ``page`` level, then
+  delete any SeparatorRegions.
+- If ``overwrite_regions=True`` on ``page`` level, then
+  delete any top-level TextRegions (along with ReadingOrder).
+- If ``overwrite_regions=True`` on ``table`` level, then
+  delete any TextRegions in TableRegions (along with their OrderGroup).
+- If ``overwrite_lines=True`` on ``region`` level, then
   delete any TextLines in TextRegions.
+- If ``overwrite_order=True`` on ``page`` or ``table`` level, then
+  delete the reading order OrderedGroup entry corresponding
+  to the (page/table) segment.
 
 Next, get each element image according to the layout annotation (from
 the alternative image of the page/region, or by cropping via coordinates
@@ -206,6 +209,7 @@ def process(self):
 overwrite_lines = self.parameter['overwrite_lines']
 overwrite_regions = self.parameter['overwrite_regions']
 overwrite_separators = self.parameter['overwrite_separators']
+overwrite_order = self.parameter['overwrite_order']
 oplevel = self.parameter['level-of-operation']
 
 for (n, input_file) in enumerate(self.input_files):
@@ -289,7 +293,7 @@ def process(self):
 LOG.warning('keeping existing TextRegions in page "%s"', page_id)
 ignore.extend(regions)
 # create reading order if necessary
-if not ro:
+if not ro or overwrite_order:
     ro = ReadingOrderType()
     page.set_ReadingOrder(ro)
 rogroup = ro.get_OrderedGroup() or ro.get_UnorderedGroup()
@@ -330,6 +334,11 @@ def process(self):
 if not roelem:
     LOG.warning("Page '%s' table region '%s' is not referenced in reading order (%s)",
                 page_id, region.id, "no target to add cells to")
+elif overwrite_order:
+    # replace by empty ordered group with same (index and) ref
+    # (which can then take the cells as subregions)
+    roelem = page_subgroup_in_reading_order(roelem)
+    reading_order[region.id] = roelem
 elif isinstance(roelem, (OrderedGroupType, OrderedGroupIndexedType)):
     LOG.warning("Page '%s' table region '%s' already has an ordered group (%s)",
                 page_id, region.id, "cells will be appended")
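
Review note on the segment.py hunks: at page level, the changed condition means a fresh reading order replaces the existing one whenever `overwrite_order` is set, even if regions are kept. The sketch below isolates that decision with stand-in classes. `Page`, `ReadingOrder` and `prepare_reading_order` are hypothetical simplifications, not the generated PAGE API used by the processor.

```python
class ReadingOrder:
    """Hypothetical stand-in for the PAGE ReadingOrderType."""

    def __init__(self, refs=None):
        self.refs = list(refs or [])


class Page:
    """Hypothetical stand-in for a PAGE page element."""

    def __init__(self, reading_order=None):
        self.reading_order = reading_order


def prepare_reading_order(page, overwrite_order):
    """Return the reading order to append to, mirroring the changed condition."""
    ro = page.reading_order
    if ro is None or overwrite_order:
        # start from an empty reading order; existing references are dropped
        ro = ReadingOrder()
        page.reading_order = ro
    return ro


page = Page(ReadingOrder(["r1", "r2"]))
kept = prepare_reading_order(page, overwrite_order=False)
print(kept.refs)   # existing references survive, new ones would be appended
fresh = prepare_reading_order(page, overwrite_order=True)
print(fresh.refs)  # empty: existing references were discarded
```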

setup.py

Lines changed: 6 additions & 3 deletions
@@ -21,9 +21,12 @@
 with codecs.open('README.md', encoding='utf-8') as f:
     README = f.read()
 
+with open('./ocrd-tool.json', 'r') as f:
+    version = json.load(f)['version']
+
 setup(
     name='ocrd_cis',
-    version='0.0.9',
+    version=version,
     description='CIS OCR-D command line tools',
     long_description=README,
     long_description_content_type='text/markdown',
@@ -34,11 +37,11 @@
     packages=find_packages(),
     include_package_data=True,
     install_requires=[
-        'ocrd>=2.4.0',
+        'ocrd>=2.10.4',
         'click',
         'scipy',
         'numpy>=1.17.0',
-        'pillow>=6.2.0',
+        'pillow>=7.1.2',
         'shapely',
         'scikit-image',
         'opencv-python-headless',
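
Review note on the setup.py hunk: the package version is now read from ocrd-tool.json instead of being hard-coded, which removes the mismatch visible in this diff (0.0.8 in the JSON file versus 0.0.9 in setup.py). The hunk does not show an `import json`, so this presumably relies on one higher up in the file. A self-contained sketch of the same single-sourcing idea, using a temporary file in place of the real ocrd-tool.json and a hypothetical helper name `read_tool_version`:

```python
import json
import tempfile
from pathlib import Path


def read_tool_version(tool_json_path):
    """Return the top-level "version" field of an ocrd-tool.json file."""
    with open(tool_json_path, "r", encoding="utf-8") as f:
        return json.load(f)["version"]


with tempfile.TemporaryDirectory() as tmp:
    # stand-in file; the real one also carries the full "tools" description
    tool_json = Path(tmp) / "ocrd-tool.json"
    tool_json.write_text(json.dumps({"version": "0.0.10", "tools": {}}), encoding="utf-8")
    version = read_tool_version(tool_json)

print(version)  # 0.0.10
```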
