# Conformer: Convolution-augmented Transformer for Speech Recognition

Reference: [https://arxiv.org/abs/2005.08100](https://arxiv.org/abs/2005.08100)

## Example Model YAML Config

```yaml
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  feature_type: log_mel_spectrogram
  num_feature_bins: 80
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: null
  target_vocab_size: 1024
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 5
  norm_score: True

model_config:
  name: conformer
  subsampling:
    type: conv2
    kernel_size: 3
    strides: 2
    filters: 144
  positional_encoding: sinusoid_concat
  dmodel: 144
  num_blocks: 16
  head_size: 36
  num_heads: 4
  mha_type: relmha
  kernel_size: 32
  fc_factor: 0.5
  dropout: 0.1
  embed_dim: 320
  embed_dropout: 0.0
  num_rnns: 1
  rnn_units: 320
  rnn_type: lstm
  layer_norm: True
  joint_dim: 320

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.2
      freq_masking:
        num_masks: 1
        mask_factor: 27

  dataset_config:
    train_paths: ...
    eval_paths: ...
    test_paths: ...
    tfrecords_dir: ...

  optimizer_config:
    warmup_steps: 10000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 4
    num_epochs: 22
    outdir: ...
    log_interval_steps: 400
    save_interval_steps: 400
    eval_interval_steps: 1000
```
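
A few of the values in the config above imply derived quantities that are easy to sanity-check. The sketch below is illustrative only: it copies a handful of keys into a plain dictionary rather than using the project's own config loader, and the helper names `frame_samples` and `transformer_lr` are hypothetical. The learning-rate function assumes the Transformer-style warmup schedule, which the `warmup_steps`, `beta2` and `epsilon` values suggest but which this README does not state.

```python
# Subset of the example config, copied by hand for illustration.
config = {
    "speech_config": {"sample_rate": 16000, "frame_ms": 25, "stride_ms": 10},
    "model_config": {"dmodel": 144, "head_size": 36, "num_heads": 4},
    "optimizer_config": {"warmup_steps": 10000},
}


def frame_samples(sample_rate, ms):
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate * ms // 1000


def transformer_lr(step, dmodel, warmup_steps):
    """Transformer-style schedule (an assumption, not confirmed by the source):
    linear warmup followed by inverse square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return dmodel ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


speech = config["speech_config"]
model = config["model_config"]
warmup = config["optimizer_config"]["warmup_steps"]

# Window and hop sizes of the feature extractor, in samples.
frame_length = frame_samples(speech["sample_rate"], speech["frame_ms"])
frame_step = frame_samples(speech["sample_rate"], speech["stride_ms"])

# Total attention width; in this config it equals dmodel.
attention_width = model["head_size"] * model["num_heads"]

# Under the assumed schedule, the learning rate peaks at the end of warmup.
peak_lr = transformer_lr(warmup, model["dmodel"], warmup)

print(frame_length, frame_step, attention_width, peak_lr)
```

With these settings, each feature frame covers 400 samples with a hop of 160 samples, and the four heads of size 36 together span the 144-dimensional model width.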

## Usage

Training: see `python examples/conformer/train_conformer.py --help`

Testing: see `python examples/conformer/train_conformer.py --help`

TFLite conversion: see `python examples/conformer/tflite_conformer.py --help`

## Conformer Subwords - Results on LibriSpeech

**Summary**

- Number of subwords: 1031
- Maximum length of a subword: 4
- Subwords corpus: all training sets, dev sets and test-clean
- Number of parameters: 10,341,639
- Positional Encoding Type: sinusoid concatenation

**Pretrained model and config**: available on [Google Drive](https://drive.google.com/drive/folders/1VAihgSB5vGXwIVTl3hkUk95joxY1YbfW?usp=sharing)

**Transducer Loss**

<img src="./figs/subword_conformer_loss.svg" alt="conformer_subword" width="300px" />

**Error Rates**

| Test-clean |  WER (%)  |  CER (%)   |
| :--------: | :-------: | :--------: |
|  _Greedy_  | 6.4476862 | 2.51828337 |
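
For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how word and character error rates are commonly computed: the edit distance between reference and hypothesis, divided by the reference length. It is not the project's evaluation code. The helper names and the sample transcripts are made up for illustration, and it assumes corpus-level aggregation (total errors over total reference tokens), which may differ from how the numbers above were produced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, with unit cost for
    substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(pairs, tokenize):
    """Corpus-level error rate in percent over (reference, hypothesis) pairs."""
    errors = 0
    total = 0
    for ref, hyp in pairs:
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    return 100.0 * errors / total


# Made-up transcripts, for illustration only.
pairs = [("the cat sat", "the cat sit"), ("hello world", "hello world")]

wer = error_rate(pairs, str.split)  # tokens are whitespace-separated words
cer = error_rate(pairs, list)       # tokens are characters, spaces included

print(f"WER: {wer:.2f}%  CER: {cer:.2f}%")
```

Because a single wrong character makes the whole word count as an error, WER is typically higher than CER on the same output, as in the table above.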