Commit fa40e7e

Merge pull request #48 from bertsky/segment-fixes
some fixes for recent segmentation update
2 parents f2b42d4 + 62a96f9 commit fa40e7e

5 files changed, +203 -134 lines changed
README.md

Lines changed: 121 additions & 71 deletions
@@ -1,34 +1,74 @@
 [![Language grade: Python](https://img.shields.io/lgtm/grade/python/g/cisocrgroup/ocrd_cis.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisocrgroup/ocrd_cis/context:python)
 [![Total alerts](https://img.shields.io/lgtm/alerts/g/cisocrgroup/ocrd_cis.svg?logo=lgtm&logoWidth=18)](https://lgtm.com/projects/g/cisocrgroup/ocrd_cis/alerts/)
+
+Content:
+* [ocrd_cis](#ocrd_cis)
+* [Introduction](#introduction)
+* [Installation](#installation)
+* [Profiler](#profiler)
+* [Usage](#usage)
+* [ocrd-cis-postcorrect](#ocrd-cis-postcorrect)
+* [ocrd-cis-align](#ocrd-cis-align)
+* [ocrd-cis-data](#ocrd-cis-data)
+* [Trainining](#trainining)
+* [ocrd-cis-ocropy-train](#ocrd-cis-ocropy-train)
+* [ocrd-cis-ocropy-clip](#ocrd-cis-ocropy-clip)
+* [ocrd-cis-ocropy-resegment](#ocrd-cis-ocropy-resegment)
+* [ocrd-cis-ocropy-segment](#ocrd-cis-ocropy-segment)
+* [ocrd-cis-ocropy-deskew](#ocrd-cis-ocropy-deskew)
+* [ocrd-cis-ocropy-denoise](#ocrd-cis-ocropy-denoise)
+* [ocrd-cis-ocropy-binarize](#ocrd-cis-ocropy-binarize)
+* [ocrd-cis-ocropy-dewarp](#ocrd-cis-ocropy-dewarp)
+* [ocrd-cis-ocropy-recognize](#ocrd-cis-ocropy-recognize)
+* [Tesserocr](#tesserocr)
+* [Workflow configuration](#workflow-configuration)
+* [Testing](#testing)
+* [Miscellaneous](#miscellaneous)
+* [OCR-D workspace](#ocr-d-workspace)
+* [OCR-D links](#ocr-d-links)
+
 # ocrd_cis

 [CIS](http://www.cis.lmu.de) [OCR-D](http://ocr-d.de) command line
 tools for the automatic post-correction of OCR-results.

 ## Introduction
-`ocrd_cis` contains different tools for the automatic post correction
-of OCR-results. It contains tools for the training, evaluation and
-execution of the post correction. Most of the tools are following the
-[OCR-D cli conventions](https://ocr-d.github.io/cli).
+`ocrd_cis` contains different tools for the automatic post-correction
+of OCR results. It contains tools for the training, evaluation and
+execution of the post-correction. Most of the tools are following the
+[OCR-D CLI conventions](https://ocr-d.de/en/spec/cli).

-There is a helper tool to align multiple OCR results as well as a
-version of ocropy that works with python3.
+Additionally, there is a helper tool to align multiple OCR results,
+as well as an improved version of [Ocropy](https://github.com/tmbarchive/ocropy)
+that works with Python 3 and is also wrapped for [OCR-D](https://ocr-d.de/en/spec/).

 ## Installation
-There are multiple ways to install the `ocrd_cis` tools:
-* `make install` uses `pip` to install `ocrd_cis` (see below).
-* `make install-devel` uses `pip -e` to install `ocrd_cis` (see
-below).
-* `pip install --upgrade pip ocrd_cis_dir`
-* `pip install -e --upgrade pip ocrd_cis_dir`
-
-It is possible to install `ocrd_cis` in a custom directory using
-`virtualenv`:
+There are 2 ways to install the `ocrd_cis` tools:
+* normal packaging:
+```sh
+make install # or equally: pip install -U pip .
+```
+(Installs `ocrd_cis` including its Python dependencies
+from the current directory to the Python package directory.)
+* editable mode:
+```sh
+make install-devel # or equally: pip install -e -U pip .
+```
+(Installs `ocrd_cis` including its Python dependencies
+from the current directory.)
+
+It is possible (and recommended) to install `ocrd_cis` in a custom user directory
+(instead of system-wide) by using `virtualenv` (or `venv`):
 ```sh
-python3 -m venv venv-dir
+# create venv:
+python3 -m venv venv-dir # where "venv-dir" could be any path name
+# enter venv in current shell:
 source venv-dir/bin/activate
-make install # or any other command to install ocrd_cis (see above)
-# use ocrd_cis
+# install ocrd_cis:
+make install # or any other way (see above)
+# use ocrd_cis:
+ocrd-cis-ocropy-binarize ...
+# finally, leave venv:
 deactivate
 ```

@@ -49,19 +89,21 @@ and the language configurations lie in `/etc/profiler/languages` in
 the container image.

 ## Usage
-Most tools follow the [OCR-D cli
-conventions](https://ocr-d.github.io/cli). They accept the
-`--input-file-grp`, `--output-file-grp`, `--parameter`, `--mets`,
-`--log-level` command line arguments (short and long). Some of the
-tools (most notably the alignment tool) expect a comma seperated list
-of multiple input file groups.
+Most tools follow the [OCR-D specifications](https://ocr-d.de/en/spec),
+(which makes them [OCR-D _processors_](https://ocr-d.de/en/spec/cli),)
+i.e. they accept the command-line options `--input-file-grp`, `--output-file-grp`,
+`--page-id`, `--parameter`, `--mets`, `--log-level` (each with an argument).
+Invoke with `--help` to get self-documentation.

-The [ocrd-tool.json](ocrd_cis/ocrd-tool.json) contains a schema
-description of the parameter config file for the different tools that
-accept the `--parameter` argument.
+Some of the processors (most notably the alignment tool) expect a comma-seperated list
+of multiple input file groups, or multiple output file groups.
+
+The [ocrd-tool.json](ocrd_cis/ocrd-tool.json) contains a formal
+description of all the processors along with the parameter config file
+accepted by their `--parameter` argument.

 ### ocrd-cis-postcorrect
-This command runs the post correction using a pre-trained model. If
+This processor runs the post correction using a pre-trained model. If
 additional support OCRs should be used, models for these OCR steps are
 required and must be executed and aligned beforehand (see [the test
 script](tests/run_postcorrection_test.bash) for an example).
@@ -99,7 +141,7 @@ ocrd-cis-postcorrect -I ALGN -O PC ... # post correction

 ### ocrd-cis-align
 Aligns tokens of multiple input file groups to one output file group.
-This tool is used to align the master OCR with any additional support
+This processor is used to align the master OCR with any additional support
 OCRs. It accepts a comma-separated list of input file groups, which
 it aligns in order.

@@ -150,95 +192,95 @@ java -jar $(ocrd-cis-data -jar) \
 ```

 ### ocrd-cis-ocropy-clip
-The `ocropy-clip` tool can be used to remove intrusions of neighbouring segments in regions / lines of a page.
-It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (as AlternativeImage).
+The `clip` processor can be used to remove intrusions of neighbouring segments in regions / lines of a page.
+It runs a connected component analysis on every text region / line of every PAGE in the input file group, as well as its overlapping neighbours, and for each binary object of conflict, determines whether it belongs to the neighbour, and can therefore be clipped to the background. It references the resulting segment image files in the output PAGE (via `AlternativeImage`).
 (Use this to suppress separators and neighbouring text.)
 ```sh
 ocrd-cis-ocropy-clip \
-  --input-file-grp OCR-D-SEG-LINE \
-  --output-file-grp OCR-D-SEG-LINE-CLIP \
+  --input-file-grp OCR-D-SEG-REGION \
+  --output-file-grp OCR-D-SEG-REGION-CLIP \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-resegment
-The `ocropy-resegment` tool can be used to remove overlap between neighbouring lines of a page.
+The `resegment` processor can be used to remove overlap between neighbouring lines of a page.
 It runs a line segmentation on every text region of every PAGE in the input file group, and for each line already annotated, determines the label of largest extent within the original coordinates (polygon outline) in that line, and annotates the resulting coordinates in the output PAGE.
-(Use this to polygonalise text lines poorly segmented, e.g. via bounding boxes.)
+(Use this to polygonalise text lines that are poorly segmented, e.g. via bounding boxes.)
 ```sh
 ocrd-cis-ocropy-resegment \
   --input-file-grp OCR-D-SEG-LINE \
   --output-file-grp OCR-D-SEG-LINE-RES \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-segment
-The `ocropy-segment` tool can be used to segment (pages or) regions of a page into lines.
-It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) TextLine elements with the resulting polygon outlines to the annotation of the output PAGE.
-(Does not detect tables or images.)
+The `segment` processor can be used to segment (pages or) regions of a page into (regions and) lines.
+It runs a line segmentation on every (page or) text region of every PAGE in the input file group, and adds (text regions containing) `TextLine` elements with the resulting polygon outlines to the annotation of the output PAGE.
+(Does _not_ detect tables.)
 ```sh
 ocrd-cis-ocropy-segment \
   --input-file-grp OCR-D-SEG-BLOCK \
   --output-file-grp OCR-D-SEG-LINE \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-deskew
-The `ocropy-deskew` tool can be used to deskew pages / regions of a page.
+The `deskew` processor can be used to deskew pages / regions of a page.
 It runs a projection profile-based skew estimation on every segment of every PAGE in the input file group and annotates the orientation angle in the output PAGE.
-(Does not include orientation detection.)
+(Does _not_ include orientation detection.)
 ```sh
 ocrd-cis-ocropy-deskew \
   --input-file-grp OCR-D-SEG-LINE \
   --output-file-grp OCR-D-SEG-LINE-DES \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-denoise
-The `ocropy-denoise` tool can be used to despeckle pages / regions / lines of a page.
-It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage).
+The `denoise` processor can be used to despeckle pages / regions / lines of a page.
+It runs a connected component analysis and removes small components (black or white) on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`).
 ```sh
 ocrd-cis-ocropy-denoise \
   --input-file-grp OCR-D-SEG-LINE-DES \
   --output-file-grp OCR-D-SEG-LINE-DEN \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-binarize
-The `ocropy-binarize` tool can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
-It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as AlternativeImage). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
+The `binarize` processor can be used to binarize (and optionally denoise and deskew) pages / regions / lines of a page.
+It runs the "nlbin" adaptive whitelevel thresholding on every segment of every PAGE in the input file group and references the resulting segment image files in the output PAGE (as `AlternativeImage`). (If a deskewing angle has already been annotated in a region, the tool respects that and rotates accordingly.) Images can also be produced grayscale-normalized.
 ```sh
 ocrd-cis-ocropy-binarize \
   --input-file-grp OCR-D-SEG-LINE-DES \
   --output-file-grp OCR-D-SEG-LINE-BIN \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-dewarp
-The `ocropy-dewarp` tool can be used to dewarp text lines of a page.
-It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as AlternativeImage).
+The `dewarp` processor can be used to vertically dewarp text lines of a page.
+It runs the baseline estimation and center normalizer algorithm on every line in every text region of every PAGE in the input file group and references the resulting line image files in the output PAGE (as `AlternativeImage`).
 ```sh
 ocrd-cis-ocropy-dewarp \
   --input-file-grp OCR-D-SEG-LINE-BIN \
   --output-file-grp OCR-D-SEG-LINE-DEW \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### ocrd-cis-ocropy-recognize
-The `ocropy-recognize` tool can be used to recognize the lines / words / glyphs of a page.
+The `recognize` processor can be used to recognize the lines / words / glyphs of a page.
 It runs LSTM optical character recognition on every line in every text region of every PAGE in the input file group and adds the resulting text annotation in the output PAGE.
 ```sh
 ocrd-cis-ocropy-recognize \
   --input-file-grp OCR-D-SEG-LINE-DEW \
   --output-file-grp OCR-D-OCR-OCRO \
   --mets mets.xml
-  --parameter file:///path/to/config.json
+  --parameter path/to/config.json
 ```

 ### Tesserocr
@@ -263,21 +305,29 @@ own models and place them into: /usr/share/tesseract-ocr/4.00/tessdata

 A decent pipeline might look like this:

-0. page-level binarization
+1. image normalization/optimization
+1. page-level binarization
 1. page-level cropping
-2. (page-level binarization)
-3. page-level deskewing
-4. (page-level dewarping)
-5. region segmentation
-6. region-level clipping
-7. (region-level deskewing)
-8. line segmentation
-9. (line-level clipping or resegmentation)
-10. line-level dewarping
-11. line-level recognition
-12. (line-level alignment and post-correction)
-
-If GT is used, steps 1, 5 and 8 can be omitted. Else if a segmentation is used in 5 and 8 which does not produce overlapping sections, steps 6 and 9 can be omitted.
+1. (page-level binarization)
+1. (page-level despeckling)
+1. page-level deskewing
+1. (page-level dewarping)
+1. region segmentation, possibly subdivided into
+1. text/non-text separation
+1. text region segmentation (and classification)
+1. reading order detection
+1. non-text region classification
+1. region-level clipping
+1. (region-level deskewing)
+1. line segmentation
+1. (line-level clipping or resegmentation)
+1. line-level dewarping
+1. line-level recognition
+1. (line-level alignment and post-correction)
+
+If GT is used, then cropping/segmentation steps can be omitted.
+
+If a segmentation is used which does not produce overlapping segments, then clipping/resegmentation can be omitted.

 ## Testing
 To run a few basic tests type `make test` (`ocrd_cis` has to be
@@ -289,11 +339,11 @@ installed in order to run any tests).
 * Create a new (empty) workspace: `ocrd workspace init workspace-dir`
 * cd into `workspace-dir`
 * Add new file to workspace: `ocrd workspace add file -G group -i id
-  -m mimetype`
+  -m mimetype -g pageId`

 ## OCR-D links

 - [OCR-D](https://ocr-d.github.io)
 - [Github](https://github.com/OCR-D)
 - [Project-page](http://www.ocr-d.de/)
-- [Ground-truth](http://www.ocr-d.de/sites/all/GTDaten/IndexGT.html)
+- [Ground-truth](https://ocr-d-repo.scc.kit.edu/api/v1/metastore/bagit/search)

ocrd_cis/ocropy/common.py

Lines changed: 14 additions & 16 deletions
@@ -6,7 +6,7 @@
 import numpy as np
 from scipy.ndimage import measurements, filters, interpolation, morphology
 from scipy import stats, signal
-from skimage.morphology import convex_hull_image
+#from skimage.morphology import convex_hull_image
 from PIL import Image

 from . import ocrolib
from . import ocrolib
@@ -344,7 +344,7 @@ def check_region(binary, zoom=1.0):
344344
if np.amax(binary)==np.amin(binary): return "image is blank"
345345
if np.mean(binary)<np.median(binary): return "image may be inverted"
346346
h,w = binary.shape
347-
if h<60/zoom: return "image not tall enough for a region image %s"%(binary.shape,)
347+
if h<45/zoom: return "image not tall enough for a region image %s"%(binary.shape,)
348348
if h>5000/zoom: return "image too tall for a region image %s"%(binary.shape,)
349349
if w<100/zoom: return "image too narrow for a region image %s"%(binary.shape,)
350350
if w>5000/zoom: return "image too wide for a region image %s"%(binary.shape,)
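
The height test relaxed by this hunk can be read in isolation. The following is a minimal sketch, not the project's code: `too_short` and its `min_height` parameter are hypothetical names introduced here, and only the comparison `h < threshold / zoom` is taken from the diff. It illustrates that a region 50 pixels tall was rejected under the old threshold of 60 and passes under the new threshold of 45 at `zoom=1.0`, and that a smaller `zoom` raises the effective minimum.

```python
def too_short(height, zoom=1.0, min_height=45):
    """Hypothetical mirror of the height test: True if the region would be rejected."""
    return height < min_height / zoom


# Old threshold (60): a 50 px tall region is rejected.
print(too_short(50, min_height=60))
# New threshold (45): the same region passes.
print(too_short(50, min_height=45))
# At zoom=0.5 the effective minimum doubles to 90 px, so 50 px is rejected again.
print(too_short(50, zoom=0.5))
```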
@@ -1189,7 +1189,7 @@ def lines2regions(binary, llabels,
     region label (in the order of the call chain, which is controlled
     by ``rl`` and ``bt``), covering all the line labels inside it.

-    Afterwards, for each region label, combine line labels by using
+    Afterwards, for each region label, simplify regions by using
     their convex hull polygon.

     Return a Numpy array of text region labels.
@@ -1563,17 +1563,15 @@ def finalize():
         # apply re-assignments:
         rlabels = relabel[llabels]
         DSAVE('rlabels', rlabels)
-        LOG.debug('closing %d regions component-wise', np.amax(relabel))
-        # close regions (label by label)
-        for region in np.unique(relabel):
-            if not region:
-                continue # ignore bg
-            # lines = np.setdiff1d(np.nonzero(relabel==region)[0], [0])
-            # if len(lines) < 2:
-            #     LOG.debug('region %d has only 1 line', region)
-            #     continue
-            # faster than morphological closing:
-            region_hull = convex_hull_image(rlabels==region)
-            rlabels[region_hull] = region
-        DSAVE('rlabels_closed', rlabels)
+        # FIXME: hulls can overlap, we just need simplification
+        # (but cv2.approxPolyDP is faulty and morphology costly)
+        # LOG.debug('closing %d regions component-wise', np.amax(relabel))
+        # # close regions (label by label)
+        # for region in np.unique(relabel):
+        #     if not region:
+        #         continue # ignore bg
+        #     # faster than morphological closing:
+        #     region_hull = convex_hull_image(rlabels==region)
+        #     rlabels[region_hull] = region
+        # DSAVE('rlabels_closed', rlabels)
         return rlabels
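
The FIXME in this hunk notes that convex hulls can overlap. A minimal sketch of why that matters follows. It is pure Python and does not use `skimage`: Andrew's monotone chain algorithm stands in for `convex_hull_image`, and all names (`cross`, `convex_hull`, `inside`, `region_a`, `region_b`) are hypothetical. An L-shaped region A and a small block B sitting in its concave corner share no pixel, yet every pixel of B falls inside the hull of A, so an assignment such as `rlabels[region_hull] = region` would relabel B's pixels.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            # drop points that would make a clockwise or straight turn
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]  # last point is the first point of the other half

    return half(pts) + half(reversed(pts))


def inside(hull, p):
    """True if p lies inside or on the border of a counter-clockwise convex hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


# Region A: an L-shaped set of pixels, given as (x, y) tuples.
region_a = ({(x, y) for x in range(2) for y in range(10)}
            | {(x, y) for x in range(10) for y in range(2)})
# Region B: a small block in the concave corner of the L, disjoint from A.
region_b = {(x, y) for x in (3, 4) for y in (3, 4)}

hull_a = convex_hull(region_a)
overlap = [p for p in sorted(region_b) if inside(hull_a, p)]
print(len(overlap), "of", len(region_b), "pixels of B lie inside the hull of A")
```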

ocrd_cis/ocropy/ocrolib/morph.py

Lines changed: 3 additions & 3 deletions
@@ -170,12 +170,12 @@ def find_contours(image):
     contours, _ = cv2.findContours(image.astype(uint8),
                                    cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
     # convert to y,x tuples
-    return list(zip((contour[:,0,::-1], cv2.contourArea(contour))
-                    for contour in contours))
+    return [(contour[:,0,::-1], cv2.contourArea(contour))
+            for contour in contours]

 @checks(SEGMENTATION)
 def find_label_contours(labels):
-    contours = [[]]*amax(labels)+1
+    contours = [[]]*(amax(labels)+1)
     for label in unique(labels):
         if not label:
             continue
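
Both fixes in this file come down to plain Python semantics, which the following minimal sketch reproduces without NumPy or OpenCV. The names `max_label` and `pairs` are hypothetical stand-ins for `amax(labels)` and the per-contour `(points, area)` tuples. One caveat: the sketch uses a plain `int`, for which the old expression raises `TypeError`; with a NumPy integer, as `amax` returns, the old expression may instead be broadcast into an array, which the sketch does not cover.

```python
max_label = 3  # hypothetical stand-in for amax(labels), as a plain Python int

# Old expression: `*` binds tighter than `+`, so it parses as
# ([[]] * max_label) + 1, which tries to add an int to a list.
try:
    contours = [[]] * max_label + 1
    old_error = None
except TypeError as err:
    old_error = err

# Fixed expression: one slot per label value 0..max_label.
contours = [[]] * (max_label + 1)

# Hypothetical (contour, area) pairs standing in for the OpenCV results.
pairs = [("contour_a", 1.0), ("contour_b", 2.5)]

# Old return value: zip() over a single iterable wraps every item in a 1-tuple.
old_result = list(zip(pair for pair in pairs))
# New return value: the pairs themselves.
new_result = [pair for pair in pairs]

print(old_result[0])  # (('contour_a', 1.0),)
print(new_result[0])  # ('contour_a', 1.0)
```

With the old return value, a caller unpacking `contour, area` from each item would fail, because each item is a 1-tuple rather than a pair.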
