
Commit 00d8a33

Author: Aleksei Korobeinikov

speech recognition dynamic version (#3237)

* speech recognition dynamic version
* fix flake errors
* update README and add dynamic flag
* update flag and README
* update dynamic cases

1 parent 2501256 commit 00d8a33

File tree: 4 files changed, +22 −14 lines changed


demos/speech_recognition_wav2vec_demo/python/README.md

Lines changed: 7 additions & 6 deletions

````diff
@@ -34,7 +34,7 @@ omz_converter --list models.lst
 Run the application with `-h` option to see help message.
 
 ```
-usage: speech_recognition_wav2vec_demo.py [-h] -m MODEL -i INPUT [-d DEVICE]
+usage: speech_recognition_wav2vec_demo.py [-h] -m MODEL -i INPUT [-d DEVICE] [--vocab VOCAB] [--dynamic_shape]
 
 optional arguments:
   -h, --help            Show this help message and exit.
@@ -43,11 +43,12 @@ optional arguments:
   -i INPUT, --input INPUT
                         Required. Path to an audio file in WAV PCM 16 kHz mono format.
   -d DEVICE, --device DEVICE
-                        Optional. Specify the target device to infer on, for
-                        example: CPU, GPU, HDDL, MYRIAD or HETERO. The
-                        demo will look for a suitable IE plugin for this
-                        device. Default value is CPU.
-  --vocab VOCAB         Optional. Path to an .json file with model encoding vocabulary.
+                        Optional. Specify the target device to infer on, for example: CPU, GPU, HDDL, MYRIAD or
+                        HETERO. The demo will look for a suitable IE plugin for this device. Default value is
+                        CPU.
+  --vocab VOCAB         Optional. Path to an .json file with encoding vocabulary.
+  --dynamic_shape       Optional. Using dynamic shapes for inputs and outputs of model.
 ```
 
 The typical command line is:
````
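The new `--dynamic_shape` option is a plain boolean switch alongside the existing required and optional arguments. A minimal self-contained sketch of such a parser (not the demo's actual `build_argparser`, which lives in the demo script and carries more configuration):

```python
from argparse import ArgumentParser

def build_argparser():
    # Sketch of the demo's CLI: required model and input, optional device,
    # vocabulary file, and the boolean --dynamic_shape switch added here.
    parser = ArgumentParser(description='Wav2Vec speech recognition demo (sketch)')
    parser.add_argument('-m', '--model', required=True,
                        help='Required. Path to an .xml file with a trained model.')
    parser.add_argument('-i', '--input', required=True,
                        help='Required. Path to an audio file in WAV PCM 16 kHz mono format.')
    parser.add_argument('-d', '--device', default='CPU',
                        help='Optional. Target device to infer on. Default value is CPU.')
    parser.add_argument('--vocab', help='Optional. Path to a .json file with encoding vocabulary.')
    parser.add_argument('--dynamic_shape', action='store_true',
                        help='Optional. Use dynamic shapes for model inputs.')
    return parser

args = build_argparser().parse_args(['-m', 'wav2vec2-base.xml', '-i', 'sample.wav', '--dynamic_shape'])
print(args.dynamic_shape, args.device)  # True CPU
```

With `action='store_true'` the flag defaults to `False`, so existing static-shape invocations keep working unchanged.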

demos/speech_recognition_wav2vec_demo/python/speech_recognition_wav2vec_demo.py

Lines changed: 10 additions & 5 deletions

```diff
@@ -40,6 +40,8 @@ def build_argparser():
                            "CPU, GPU, HDDL, MYRIAD or HETERO. "
                            "The demo will look for a suitable IE plugin for this device. Default value is CPU.")
     parser.add_argument('--vocab', help='Optional. Path to an .json file with encoding vocabulary.')
+    parser.add_argument('--dynamic_shape', action='store_true',
+                        help='Optional. Using dynamic shapes for inputs of model.')
     return parser
 
 
@@ -51,23 +53,26 @@ class Wav2Vec:
     words_delimiter = '|'
     pad_token = '<pad>'
 
-    def __init__(self, core, model_path, input_shape, device, vocab_file):
+    def __init__(self, core, model_path, input_shape, device, vocab_file, dynamic_flag):
         log.info('Reading model {}'.format(model_path))
         model = core.read_model(model_path)
         if len(model.inputs) != 1:
             raise RuntimeError('Wav2Vec must have one input')
         self.input_tensor_name = model.inputs[0].get_any_name()
-        model_input_shape = model.inputs[0].shape
+        model_input_shape = model.inputs[0].partial_shape
         if len(model_input_shape) != 2:
             raise RuntimeError('Wav2Vec input must be 2-dimensional')
         if len(model.outputs) != 1:
             raise RuntimeError('Wav2Vec must have one output')
-        model_output_shape = model.outputs[0].shape
+        model_output_shape = model.outputs[0].partial_shape
         if len(model_output_shape) != 3:
             raise RuntimeError('Wav2Vec output must be 3-dimensional')
         if model_output_shape[2] != len(self.alphabet):
             raise RuntimeError(f'Wav2Vec output third dimension size must be {len(self.alphabet)}')
-        model.reshape({self.input_tensor_name: PartialShape(input_shape)})
+        if not dynamic_flag:
+            model.reshape({self.input_tensor_name: PartialShape(input_shape)})
+        elif not model.is_dynamic():
+            model.reshape({self.input_tensor_name: PartialShape((-1, -1))})
         compiled_model = core.compile_model(model, device)
         self.output_tensor = compiled_model.outputs[0]
         self.infer_request = compiled_model.create_infer_request()
@@ -124,7 +129,7 @@ def main():
     log.info('\tbuild: {}'.format(get_version()))
     core = Core()
 
-    model = Wav2Vec(core, args.model, audio.shape, args.device, args.vocab)
+    model = Wav2Vec(core, args.model, audio.shape, args.device, args.vocab, args.dynamic_shape)
     normalized_audio = model.preprocess(audio)
     character_probs = model.infer(normalized_audio)
     transcription = model.decode(character_probs)
```
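The reshape branch in `__init__` has three outcomes: without `--dynamic_shape` the input is pinned to the audio's static shape, with it a static model is reshaped to fully dynamic `(-1, -1)` dimensions, and an already-dynamic model is left untouched. A minimal sketch of that decision using a stub object (openvino is not imported here; `reshape` and `is_dynamic` only mimic the two model calls the demo relies on):

```python
class StubModel:
    """Stub mimicking the two model calls used by the demo's reshape branch."""
    def __init__(self, dynamic):
        self._dynamic = dynamic
        self.requested_shape = None  # records the shape passed to reshape()

    def is_dynamic(self):
        return self._dynamic

    def reshape(self, shape):
        self.requested_shape = shape

def prepare_shapes(model, input_shape, dynamic_flag):
    # Mirrors the commit's logic: no flag -> pin the input to the audio's
    # static shape; flag set -> reshape a static model to fully dynamic
    # (-1, -1) dims, and leave an already-dynamic model as-is.
    if not dynamic_flag:
        model.reshape(input_shape)
    elif not model.is_dynamic():
        model.reshape((-1, -1))

static_a = StubModel(dynamic=False)
prepare_shapes(static_a, (1, 30480), dynamic_flag=False)
print(static_a.requested_shape)   # (1, 30480)

static_b = StubModel(dynamic=False)
prepare_shapes(static_b, (1, 30480), dynamic_flag=True)
print(static_b.requested_shape)   # (-1, -1)

dynamic_c = StubModel(dynamic=True)
prepare_shapes(dynamic_c, (1, 30480), dynamic_flag=True)
print(dynamic_c.requested_shape)  # None: already dynamic, no reshape issued
```

Switching from `.shape` to `.partial_shape` in the same hunk is what allows the rank and alphabet-size checks to run even when some dimensions are dynamic.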

models/public/wav2vec2-base/README.md

Lines changed: 5 additions & 2 deletions

```diff
@@ -25,11 +25,13 @@ For details please also check [repository](https://github.com/pytorch/fairseq/tr
 
 #### Original model
 
-Normalized audio signal, name - `inputs`, shape - `1, 30480`, format is `B, N`, where:
+Normalized audio signal, name - `inputs`, shape - `B, N`, format is `B, N`, where:
 
 - `B` - batch size
 - `N` - sequence length
 
+The model is dynamic and can work with different input shapes.
+
 **NOTE**: Model expects 16-bit, 16 kHz, mono-channel WAVE audio as input data.
 
 #### Converted model
@@ -40,12 +42,13 @@ The converted model has the same parameters as the original model.
 
 #### Original model
 
-Per-token probabilities (after LogSoftmax) for every symbol in the alphabet, name - `logits`, shape - `1, 95, 32`, output data format is `B, N, C`, where:
+Per-token probabilities (after LogSoftmax) for every symbol in the alphabet, name - `logits`, shape - `B, N, 32`, output data format is `B, N, C`, where:
 
 - `B` - batch size
 - `N` - number of recognized tokens
 - `C` - alphabet size
 
+The `B` and `N` dimensions can take different values because the model is dynamic. The alphabet size `C` is static and equals 32.
 Model alphabet: "[pad]", "[s]", "[/s]", "[unk]", "|", "E", "T", "A", "O", "N", "I", "H", "S", "R", "D", "L", "U", "M", "W", "C", "F", "G", "Y", "P", "B", "V", "K", "'", "X", "J", "Q", "Z", where:
 
 - `[pad]` - padding token used as CTC-blank label
```
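Since `[pad]` doubles as the CTC-blank label, a transcription can be recovered from the `B, N, C` logits by greedy CTC decoding: take the highest-scoring symbol per step, collapse consecutive repeats, and drop blanks. A pure-Python sketch under that assumption (the demo's actual `decode` method may differ in details):

```python
# The model's 32-symbol alphabet; index 0 is the padding/CTC-blank token.
ALPHABET = ['<pad>', '<s>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N',
            'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G',
            'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z']

def greedy_ctc_decode(token_ids, blank_id=0, delimiter='|'):
    # Collapse consecutive repeats, drop CTC blanks, map '|' to a space.
    chars = []
    prev = None
    for t in token_ids:
        if t != prev and t != blank_id:
            symbol = ALPHABET[t]
            chars.append(' ' if symbol == delimiter else symbol)
        prev = t
    return ''.join(chars)

# Per-step argmax ids for 'H' (11) and 'I' (10) with a blank and repeats:
print(greedy_ctc_decode([11, 11, 0, 10, 10]))  # HI
```

The blank between the two runs is what lets CTC distinguish a repeated letter from a long emission of the same letter.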

models/public/wav2vec2-base/model.yml

Lines changed: 0 additions & 1 deletion

```diff
@@ -107,7 +107,6 @@ conversion_to_onnx_args:
   - '--conversion-param=dynamic_axes={"inputs": {0: "batch_size", 1: "sequence_len"},
     "logits": {0: "batch_size", 1: "sequence_len"}}'
 model_optimizer_args:
-  - --input_shape=[1,30480]
   - --input=inputs
   - --layout=inputs(NS)
   - --input_model=$conv_dir/wav2vec2-base.onnx
```
