Commit 45ec799

Merge pull request #2917 from jnzw/handwritten-english-recognition-0001
Contribute the handwritten-english-recognition-0001 model
2 parents: f67162b + 0a9d29a

14 files changed (+313, −9 lines)


data/dataset_classes/gnhk.txt

Lines changed: 94 additions & 0 deletions

New file (`@@ -0,0 +1,94 @@`): the decoding character list for GNHK, one character per line in ascending code-point order — a space, then `!` `"` `#` `$` `%` `&` `'` `(` `)` `*` `+` `,` `-` `.` `/`, the digits `0`–`9`, `:` `;` `<` `=` `>` `?` `@`, the uppercase letters `A`–`Z`, `[` `]` `^` `_`, the lowercase letters `a`–`z`, `{` `|` `}` `~`, and `£` — 94 characters in total (note that `\` and the backtick are absent).
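For context, a decoding char list like this is read one character per line into an index-to-character table for CTC decoding. A minimal loading sketch (not the demo's own loader; the path is relative to the repository root):

```python
# Strip only the trailing newline: the first entry is a single space
# and must survive the read.
with open("data/dataset_classes/gnhk.txt", encoding="utf-8") as f:
    characters = [line.rstrip("\n") for line in f]

assert len(characters) == 94  # the model's 95 output classes presumably add a CTC blank
```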

data/dataset_definitions.yml

Lines changed: 9 additions & 0 deletions
```diff
@@ -1200,6 +1200,15 @@ datasets:
     annotation: scut_ept_recognition.pickle
     dataset_meta: scut_ept_recognition.json
 
+  - name: GNHK
+    data_source: gnhk
+    annotation_conversion:
+      converter: unicode_character_recognition
+      decoding_char_file: gnhk_char_list.txt
+      annotation_file: test_img_id_gt.txt
+    annotation: gnhk_recognition.pickle
+    dataset_meta: gnhk_recognition.json
+
   - name: ADEChallengeData2016
     annotation_conversion:
       converter: ade20k
```

demos/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -27,7 +27,7 @@ The Open Model Zoo includes the following demos:
 - [Gaze Estimation C++ G-API Demo](./gaze_estimation_demo/cpp_gapi/README.md) - Face detection followed by gaze estimation, head pose estimation and facial landmarks regression. G-API version.
 - [Gesture Recognition Python\* Demo](./gesture_recognition_demo/python/README.md) - Demo application for Gesture Recognition algorithm (e.g. American Sign Language gestures), which classifies gesture actions that are being performed on input video.
 - [GPT-2 Text Prediction Python\* Demo](./gpt2_text_prediction_demo/python/README.md) - GPT-2 text prediction demo.
-- [Handwritten Text Recognition Python\* Demo](./handwritten_text_recognition_demo/python/README.md) - The demo demonstrates how to run Handwritten Japanese Recognition models and Handwritten Simplified Chinese Recognition models.
+- [Handwritten Text Recognition Python\* Demo](./handwritten_text_recognition_demo/python/README.md) - The demo demonstrates how to run Handwritten Text Recognition models for Japanese, Simplified Chinese and English.
 - [Human Pose Estimation C++ Demo](./human_pose_estimation_demo/cpp/README.md) - Human pose estimation demo.
 - [Human Pose Estimation Python\* Demo](./human_pose_estimation_demo/python/README.md) - Human pose estimation demo.
 - [Image Inpainting Python\* Demo](./image_inpainting_demo/python/README.md) - Demo application for GMCNN inpainting network.
```

demos/handwritten_text_recognition_demo/python/README.md

Lines changed: 5 additions & 3 deletions
````diff
@@ -1,7 +1,6 @@
 # Handwritten Text Recognition Demo
 
-This example demonstrates an approach to recognize handwritten Japanese and simplified Chinese text lines using OpenVINO™. For Japanese, this demo supports all the characters in datasets [Kondate](http://web.tuat.ac.jp/~nakagawa/database/en/kondate_about.html) and [Nakayosi](http://web.tuat.ac.jp/~nakagawa/database/en/about_nakayosi.html). For simplified Chinese, it supports the characters in [SCUT-EPT](https://github.com/HCIILAB/SCUT-EPT_Dataset_Release).
-
+This example demonstrates an approach to recognize handwritten Japanese, simplified Chinese, and English text lines using OpenVINO™. For Japanese, this demo supports all the characters in datasets [Kondate](http://web.tuat.ac.jp/~nakagawa/database/en/kondate_about.html) and [Nakayosi](http://web.tuat.ac.jp/~nakagawa/database/en/about_nakayosi.html). For simplified Chinese, it supports the characters in [SCUT-EPT](https://github.com/HCIILAB/SCUT-EPT_Dataset_Release). For English, it supports the characters in [GNHK](https://goodnotes.com/gnhk/).
 ## How It Works
 
 The demo workflow is the following:
@@ -29,6 +28,7 @@ omz_converter --list models.lst
 
 * handwritten-japanese-recognition-0001
 * handwritten-simplified-chinese-recognition-0001
+* handwritten-english-recognition-0001
 
 > **NOTE**: Refer to the tables [Intel's Pre-Trained Models Device Support](../../../models/intel/device_support.md) and [Public Pre-Trained Models Device Support](../../../models/public/device_support.md) for the details on models inference support at different devices.
@@ -76,9 +76,11 @@ Options:
   -tk TOP_K, --top_k TOP_K
                         Optional. Top k steps in looking up the decoded
                         character, until a designated one is found
+  -ob OUTPUT_BLOB, --output_blob OUTPUT_BLOB
+                        Optional. Name of the output layer of the model. Default is 'output'
 ```
 
-The decoding char list files provided within Open Model Zoo and for Japanese it is the `<omz_dir>/data/dataset_classes/kondate_nakayosi.txt`file, while for Simplified Chinese it is the `<omz_dir>/data/dataset_classes/scut_ept.txt` file. For example, to do inference on a CPU with the OpenVINO&trade; toolkit pre-trained `handwritten-japanese-recognition-0001` model, run the following command:
+The decoding char list files are provided within Open Model Zoo: for Japanese it is the `<omz_dir>/data/dataset_classes/kondate_nakayosi.txt` file, for Simplified Chinese it is the `<omz_dir>/data/dataset_classes/scut_ept.txt` file, and for English it is the `<omz_dir>/data/dataset_classes/gnhk.txt` file. For example, to do inference on a CPU with the OpenVINO&trade; toolkit pre-trained `handwritten-japanese-recognition-0001` model, run the following command:
 
 ```sh
 python handwritten_text_recognition_demo.py \
````
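A hypothetical invocation exercising the new `-ob` option (`-m` and `-i` are assumed here to be the demo's usual model and input flags; this diff does not show them):

```sh
python handwritten_text_recognition_demo.py \
    -m <path_to>/handwritten-english-recognition-0001.xml \
    -i <path_to_line_image> \
    -ob output
```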
(added binary file: image, 119 KB; preview not shown)

demos/handwritten_text_recognition_demo/python/handwritten_text_recognition_demo.py

Lines changed: 8 additions & 3 deletions
```diff
@@ -48,6 +48,7 @@ def build_argparser():
                       help="Path to the decoding char list file. Default is for Japanese")
     args.add_argument("-dc", "--designated_characters", type=str, default=None, help="Optional. Path to the designated character file")
     args.add_argument("-tk", "--top_k", type=int, default=20, help="Optional. Top k steps in looking up the decoded character, until a designated one is found")
+    args.add_argument("-ob", "--output_blob", type=str, default=None, help="Optional. Name of the output layer of the model. Default is None, in which case the demo will read the output name from the model, assuming there is only 1 output layer")
     return parser
 
 
@@ -77,16 +78,20 @@ def main():
     log.info('OpenVINO Inference Engine')
     log.info('\tbuild: {}'.format(get_version()))
     ie = IECore()
+    ie.set_config(config={"GPU_ENABLE_LOOP_UNROLLING": "NO", "CACHE_DIR": "./"}, device_name="GPU")
 
     # Read IR
     log.info('Reading model {}'.format(args.model))
     net = ie.read_network(args.model, os.path.splitext(args.model)[0] + ".bin")
 
     assert len(net.input_info) == 1, "Demo supports only single input topologies"
-    assert len(net.outputs) == 1, "Demo supports only single output topologies"
-
     input_blob = next(iter(net.input_info))
-    out_blob = next(iter(net.outputs))
+
+    if args.output_blob is not None:
+        out_blob = args.output_blob
+    else:
+        assert len(net.outputs) == 1, "Demo supports only single output topologies"
+        out_blob = next(iter(net.outputs))
 
     characters = get_characters(args)
     codec = CTCCodec(characters, args.designated_characters, args.top_k)
```
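For context, the resolved `out_blob` name is what later selects the prediction tensor from the inference results. A self-contained sketch under the same IECore API (model paths and device are placeholders; `output` is the logits output name documented in the model's README):

```python
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network("handwritten-english-recognition-0001.xml",
                      "handwritten-english-recognition-0001.bin")  # placeholder paths
input_blob = next(iter(net.input_info))
out_blob = "output"  # as if passed via -ob; this model has extra LSTM-state outputs
exec_net = ie.load_network(network=net, device_name="CPU")
dummy = np.zeros((1, 1, 96, 2000), dtype=np.float32)  # B, C, H, W per the model spec
preds = exec_net.infer(inputs={input_blob: dummy})[out_blob]  # shape (250, 1, 95)
```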
demos/handwritten_text_recognition_demo/python/models.lst

Lines changed: 1 addition & 0 deletions

```diff
@@ -1,3 +1,4 @@
 # This file can be used with the --list option of the model downloader.
 handwritten-japanese-recognition-0001
 handwritten-simplified-chinese-recognition-0001
+handwritten-english-recognition-0001
```
models/intel/handwritten-english-recognition-0001/README.md

Lines changed: 50 additions & 0 deletions

New file (`@@ -0,0 +1,50 @@`):
# handwritten-english-recognition-0001

## Use Case and High-Level Description

This is a network for the handwritten English text recognition scenario. It consists of a CNN followed by a Bi-LSTM, a reshape layer, and a fully connected layer.
The network is able to recognize English text consisting of characters in the [GNHK](https://goodnotes.com/gnhk/) dataset.

## Example

![](./assets/handwritten-english-recognition-0001.jpg) -> 'Picture ID. and Passport photo'

## Specification

| Metric                    | Value     |
| ------------------------- | --------- |
| GFlops                    | 1.3182    |
| MParams                   | 0.1413    |
| Accuracy on the GNHK test subset (excluding images wider than 2000 px after being resized to a height of 96 px with preserved aspect ratio) | 81.5% |
| Source framework          | PyTorch\* |

> **NOTE**: To reproduce the reported accuracy, images from the GNHK test set should be binarized using adaptive thresholding and preprocessed into single-line text images using the coordinates from the accompanying JSON annotation files of the GNHK dataset. See [preprocess_gnhk.py](./preprocess_gnhk.py).

This model adopts the [label error rate](https://dl.acm.org/doi/abs/10.1145/1143844.1143891) as its accuracy metric.

## Inputs

Grayscale image, name - `actual_input`, shape - `1, 1, 96, 2000`, format is `B, C, H, W`, where:

- `B` - batch size
- `C` - number of channels
- `H` - image height
- `W` - image width

> **NOTE**: The source image should be resized to a fixed height (such as 96) while keeping the aspect ratio; the width after resizing should be no larger than 2000, and the image should then be right-bottom padded to a width of 2000 with edge values.
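The note above, together with the preprocessing steps in the accompanying accuracy-check config (AREA resize, edge padding on the right), translates into a few lines of OpenCV/NumPy. A minimal sketch, with the image path as a placeholder:

```python
import cv2
import numpy as np

img = cv2.imread("line.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
h, w = img.shape
new_w = round(w * 96 / h)  # keep aspect ratio at target height 96
assert new_w <= 2000, "images wider than 2000 px after resizing are excluded"
img = cv2.resize(img, (new_w, 96), interpolation=cv2.INTER_AREA)
img = np.pad(img, ((0, 0), (0, 2000 - new_w)), mode="edge")  # right edge padding
blob = img[None, None].astype(np.float32)  # 1, 1, 96, 2000 (B, C, H, W)
```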
## Outputs

Name - `output`, shape - `250, 1, 95`, format is `W, B, L`, where:

- `W` - output sequence length
- `B` - batch size
- `L` - confidence distribution across the supported symbols in [GNHK](https://goodnotes.com/gnhk/)

The network output can be decoded by a CTC greedy decoder.
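A minimal sketch of such greedy decoding, under the assumption (not stated in this commit) that class 0 is the CTC blank and classes 1–94 map to the characters in `gnhk.txt`:

```python
import numpy as np

def ctc_greedy_decode(logits: np.ndarray, characters: list) -> str:
    """Collapse repeats and drop blanks; logits has shape (W, B, L) = (250, 1, 95)."""
    best = logits[:, 0, :].argmax(axis=1)  # best class index per timestep
    text, prev = [], -1
    for idx in best:
        if idx != prev and idx != 0:  # 0 assumed to be the CTC blank
            text.append(characters[idx - 1])
        prev = idx
    return "".join(text)
```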
The network also outputs 10 LSTM hidden states of shape `2, 1, 256`, which can simply be ignored.

## Legal Information

[*] Other names and brands may be claimed as the property of others.
models/intel/handwritten-english-recognition-0001/accuracy-check.yml

Lines changed: 31 additions & 0 deletions

New file (`@@ -0,0 +1,31 @@`):

```yaml
models:
  - name: handwritten-english-recognition-0001

    launchers:
      - framework: dlsdk
        adapter:
          type: ctc_greedy_search_decoder
          logits_output: output

    datasets:
      - name: GNHK
        # In order to be used by the model, images must be:
        # 1) Resized to fixed height 96, using AREA interpolation
        # 2) Padded with edge padding, right padding only
        preprocessing:
          - type: bgr_to_gray
          - type: resize
            interpolation: AREA
            aspect_ratio_scale: width
            size: 96
          - type: padding
            use_numpy: True
            numpy_pad_mode: edge
            dst_height: 96
            dst_width: 2000
            pad_type: right_bottom

        metrics:
          - type: label_level_recognition_accuracy
            reference: 0.815
```
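A config like this is consumed by the OMZ Accuracy Checker. A typical invocation might look like the following sketch (paths are placeholders; `-c`, `-m`, `-s`, and `-a` are the checker's config, models, source-data, and annotation options):

```sh
accuracy_check -c models/intel/handwritten-english-recognition-0001/accuracy-check.yml \
               -m <path_to_models> \
               -s <path_to_gnhk_data> \
               -a <path_to_annotations>
```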
(added binary file: image, 119 KB; preview not shown)
